Modelling Highly Symmetrical Molecules: Linking Ontologies and Graphs

Oliver Kutz (1), Janna Hastings (2), and Till Mossakowski (1,3)

(1) Research Center on Spatial Cognition, University of Bremen, Germany
(2) European Bioinformatics Institute, Cambridge, UK
(3) DFKI GmbH Bremen
okutz@informatik.uni-bremen.de; hastings@ebi.ac.uk; Till.Mossakowski@dfki.de

Abstract. Methods for automated classification of chemical data depend on identifying interesting parts and properties. However, classes of chemical entities which are highly symmetrical and contain large numbers of homogeneous parts (such as carbon atoms) are not straightforwardly classified in this fashion. One such class of molecules is the fullerene family, which shows potential for many novel applications including in biomedicine. The Web Ontology Language OWL cannot be used to represent the structure of fullerenes, as their structure is not tree-shaped. While individual members of the fullerene class can be modelled in standard FOL, expressing the properties of the class as a whole (independent of the count of atoms of the members) requires second-order quantification. Given the size of chemical ontologies such as ChEBI, using second-order expressivity in the general case is prohibitively expensive for practical applications. To address these conflicting requirements, we introduce a novel framework in which we heterogeneously integrate standard ontological modelling with monadic second-order reasoning over chemical graphs, enabling various kinds of information flow between the distinct representational layers.

1 Introduction and motivation

Organic chemistry has seen a dramatic increase in available data in recent years, tracking progress in the search for novel therapeutics.¹ However, large-scale data that are not appropriately organised can be more of a burden than a benefit. Ontologies and knowledge-based methods for automated classification are increasingly harnessed to address this challenge.
ChEBI – Chemical Entities of Biological Interest – is a chemical ontology that is widely used to organise and classify chemical data [4]. However, ChEBI is manually maintained, reducing its scalability. Methods for automated classification of chemical entities depend on algorithms which reduce complex molecular graphs to lists of interesting parts and properties, such as atomic constituents and groups, charges and overall molecular weight. Knowledge representation and reasoning for chemistry has also largely been dominated by this paradigm [13, 12, 14].

¹ The basic ideas formulated in this paper were previously presented at the Deep Knowledge Representation Workshop DKR-11, Banff, Canada, 2011 (2nd prize in the DKR competition).

In recent years there has been a progression in capacity for the synthesis of highly symmetrical, polycyclic chemical entities, which are made up of a very small number of part sorts (e.g. mainly carbon atoms) with a very large number of actual parts. Polycyclic carbon molecules show incredible topological versatility, not only forming spheres, tubes and sheets, but also molecular Möbius strips [8, 2, 11] and knots [7, 17], as illustrated in Figure 1. These molecules elicit increased interest following advances in synthesis methods towards a nanoscale molecular ‘machinery’ with carefully designed shapes that are able to rival the power and scale of biological machinery [19]. Since their parts are homogeneous, listing part types cannot distinguish between such classes. Rather, such molecules must be characterised by their shape or topology. For example, fullerene molecules form spherical or ellipsoidal cages, Möbius molecules display classical Möbius topologies, while molecular knots and interlinked chains display the topological and shape properties of macromolecular knots and chains. Adequately representing knowledge about these molecules requires the ability to describe and reason over features which apply to the entire molecular graph (i.e. the connection of atoms via bonds).

Fig. 1. Some examples of highly symmetric molecules, constituted almost entirely by carbon atoms. The overall arrangements of atoms in the molecules, rather than the nature of functional groups, characterise their types.

In what follows, we will argue that formalisms with limited expressivity such as OWL are not sufficient to represent deep knowledge about this class of molecules. We will therefore complement OWL with the more expressive formalism of monadic second-order logic (MSOL).

2 Background

OWL representation. Chemical entities can be represented and exchanged in the form of chemical graphs, in which atoms form the vertices and covalent chemical bonds the edges. However, complex graphs that contain cycles cannot be faithfully modelled at the class level in OWL due to the requirement that all axioms in OWL have models shaped like a tree [12]. Can we recognise members of the fullerene class of molecules based on the structure of their chemical graphs? A formulation of C60 fullerene and a graphene of 60 atoms might refer explicitly to their differing shapes in order to allow automated reasoning to distinguish them, using axioms such as hasPart only CarbonAtom and hasShape some Sphere or hasShape some Flat. However, this approach clearly does not allow automated reasoning to deduce the class of the molecule based on the properties of the molecular graph, since a human has to specify the shape of the molecule. Following this pattern, a different shape has to be defined for every differently shaped molecule, with no means of automatically discerning relationships or similarity between the stated shapes. Furthermore, those properties of the molecules that depend on their shapes are not explained by the information contained in the ontology.
Many of the properties of fullerenes stem from the fact that they can enclose other molecules inside their cage structure, a property not shared by graphene. The properties of molecular knots stem from the fact that they are mechanically interlocked. What is required is a framework that is able to define classes of molecules based on properties of the graphs of their members, and then deduce which molecular graphs belong to these classes.

Description graphs, rules and FOL. Cycles can be represented adequately in rules, which are combined with OWL for ontology engineering in the DL-safe rules extension [16]. The DL-safe rules extension, however, is applicable only to explicitly named objects in the ontology (individuals), to ensure decidability of the resulting knowledge base. This means that it is not possible to reason at the class level about highly symmetric molecules using this formalism. This shortcoming motivated the introduction of description graphs [15], an OWL extension for expressing the structure of complex objects at the class level. However, the knowledge base is still constrained in that the OWL axioms and the edge properties in the description graphs must be kept separate. The reasoning capability of the framework is limited to what can be expressed in “graph-safe” rules: rules which do not mix graph edge properties and OWL object properties. Furthermore, there are inherent limitations in the use of rules for reasoning, since there is no ∀ quantification in the rules formalism, which means that properties of all atoms in a given graph cannot be used for reasoning [12].

In an effort to relax these limitations, a radically different semantics has recently been proposed, based on logic programming: description graph logic programs [14]. Since the semantics from logic programming ensures decidability in a different way to the OWL model-theoretic semantics, there is no need for property separation, thus the ontology designer may interchangeably use OWL and description graph properties in creating the knowledge base. This formalism allows representation and reasoning with cyclic chemical structures at the class level. It is possible, for example, to define a particular member of the fullerenes class, such as dodecahedrane, and to use reasoning for detection of cycles of fixed lengths. However, it is not possible to express the properties of fullerenes as a whole.

Using full FOL it is possible to get very close to a definition for the fullerenes, including axioms that every atom must have 3 or 4 bonds, every atom must belong to a cycle, and every cycle (face) must have 5 or 6 members. However, such constraints (“local perspective”) cannot allow correct classification in all cases. For example, fullerenes of different sizes (for example, C540, C240 and C60) can be nested inside one another. The local perspective at each atom and at each face correctly matches the best definition that is possible to specify in FOL. Yet, this should be classified as a complex consisting of multiple fullerenes, rather than as itself a (single) fullerene molecule. To distinguish the complex from a single molecule, the second-order construct of graph connectedness is needed. However, it is well-known that connectedness is not first-order definable [5].

3 Properties of graphs for chemical classes

In order to distinguish between fullerenes, graphenes, strips and Möbius strips, we need to define some properties of graphs based on chemical graph theory [18]. For simplification we will assume that all graphs are finite, which is true of all graphs corresponding to real chemical entities.

Planar polyhedral graphs. A chemical graph is planar if it can be drawn on a flat plane without any edges crossing.
Overwhelmingly, most chemical entities can be described by planar graphs. The only exceptions found in a recent analysis of public compound databases were Möbius-like molecules [8]. A graph is cubic if all vertices have degree three, i.e. are connected to three other vertices. It is connected if any two vertices are connected by a path, and it is 3-connected if it is connected and remains so after removal of any two vertices. A graph is polyhedral iff it is the graph of some convex polyhedron. By Steinitz’ theorem (1922), this is equivalent to being 3-connected and planar (see [20]). Indeed, polyhedral graphs, while being planar (2D), are typically represented as convex polyhedra (3D). A polycyclic cage is any polyhedral graph. Chemical examples include cubane, tetrahedrane, and of course all fullerenes. A fullerene is a cubic polyhedral graph consisting of hexagons and pentagons only. By the Euler formula for polyhedra, one can show that the number of pentagons must always be 12: with p pentagons and h hexagons, a cubic polyhedral graph has V = (5p + 6h)/3 vertices, E = (5p + 6h)/2 edges and F = p + h faces, so V − E + F = 2 reduces to p/6 = 2. A closed nanotube is a fullerene which is extended into a tube shape with a circular extension consisting only of hexagons between the two ends, the latter consisting of two hemispheres of the buckyball structure. An open nanotube is a cubic polyhedral graph consisting of hexagons and two non-hexagons (the two non-hexagons are the outer boundaries).

Planar non-polyhedral graphs. A graphene is a planar graph consisting of hexagons and one face (the outer boundary) not necessarily being a hexagon, where all vertices involved in the outer boundary have degree two or three, while the remaining vertices have degree three.

Non-planar graphs. A Möbius strip is a non-planar graph, consisting of hexagons and one non-hexagon (the outer boundary).

4 Describing molecule graph classes in MSOL

We want to formalise the definitions of graph classes such that membership in a graph class can be machine-checked.
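To give a concrete feel for what machine-checking membership can mean, the following minimal Python sketch tests three of the properties defined above (cubic, connected, 3-connected) by brute force on small graphs. It is only an illustration under stated assumptions: the function names, the adjacency-set encoding and the two example graphs are hypothetical choices made here, and this is not the MSOL model checking used in this paper. Planarity and the face conditions are not covered.

```python
from itertools import combinations

def make_graph(edges):
    """Build an undirected graph as a dict: vertex -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_cubic(adj):
    """Every vertex has exactly three neighbours."""
    return all(len(nbrs) == 3 for nbrs in adj.values())

def is_connected(adj, removed=frozenset()):
    """Any two remaining vertices are joined by a path (depth-first search)."""
    vertices = [v for v in adj if v not in removed]
    if not vertices:
        return True
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

def is_three_connected(adj):
    """Connected, and still connected after deleting any two vertices
    (the definition given in Section 3, applied literally)."""
    return is_connected(adj) and all(
        is_connected(adj, frozenset(pair)) for pair in combinations(adj, 2)
    )

# Cubane: 8 carbon atoms at the corners of a cube, labelled 0..7 in binary;
# two corners are bonded iff their labels differ in exactly one bit.
cubane = make_graph(
    (u, u ^ bit) for u in range(8) for bit in (1, 2, 4) if u < (u ^ bit)
)

# Two disjoint copies of cubane: every atom looks the same locally,
# but the whole is a complex of two molecules, not a single cage.
two_cubanes = make_graph(
    [(u, v) for u in cubane for v in cubane[u]]
    + [(u + 8, v + 8) for u in cubane for v in cubane[u]]
)

print(is_cubic(cubane), is_connected(cubane), is_three_connected(cubane))
print(is_cubic(two_cubanes), is_connected(two_cubanes))
```

The second example mirrors the nested-fullerene problem from Section 2: the purely local test (`is_cubic`) accepts the two-molecule complex, and only the global connectedness test rejects it.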
It has been noted (see [3]) that the role finite automata play for the specification of word languages is played by monadic second-order logic (MSOL) for expressing graph properties and defining graph classes. Although the general problem is NP-complete, monadic second-order logic for graphs can be model-checked quite efficiently; indeed, for graphs with bounded tree-width, model checking can be done in linear time. MSOL for graphs consists of untyped first-order logic, extended with quantification over sets (and membership in such sets). We assume binary predicates edge, edge2 and edge3 for all bonds, double bonds and triple bonds, respectively. We also assume unary predicates like Carbon for the atoms (and suitable atom classes) in the periodic table. When writing MSOL formulas, we use syntactic sugar like unique-existential quantifiers and number quantifiers, which can easily be coded out even in first-order logic. We also will freely use standard set-theoretic notation where it can easily be coded out into MSOL. V denotes the set of all vertices.

While the expressive power of MSOL suffices to axiomatise most graph classes that we are interested in, even with the above syntactic sugar, often the axioms can become cumbersome and large. We therefore additionally use the nested conditions of [9, 10]. The simplest and most prominent formulas here are of form (∃G), where G is a graph with some edges annotated with +. The semantics is that G can be injectively embedded into the given graph, where each edge labelled with + may be mapped to a finite path (this may be used to express that a certain G is a minor of the given graph).

The MSOL formalisation of the above notions is shown in Fig. 2. Note that graph classes are represented as MSOL model classes; this means that e.g. a graph is cubic if and only if it (when seen as a MSOL model) satisfies the formula ∀x.∃!3y.edge(x, y). The correctness of the definition of polyhedral graph follows from Steinitz’ theorem discussed above, and the correctness of the definition of planarity follows from Kuratowski’s characterisation in terms of forbidden minors.

    Cubic ⇔ ∀x.∃!3y.edge(x, y)
    degree_n(x) ⇔ ∃!ny.edge(x, y)
    Planar ⇔ ¬(∃ […]) ∧ ¬(∃ […])
    Connected_Subgraph(C) ⇔ ∀D ⊆ C, E ⊆ C. C = D ∪ E ⇒ ∃u ∈ D, v ∈ E.edge(u, v)
    Connected ⇔ ∃C.∀x.x ∈ C ∧ Connected_Subgraph(C)
    Cycle(C) ⇔ Connected_Subgraph(C) ∧ ∀x ∈ C ∃y ∈ C.edge(x, y)
    Three_Connected ⇔ ∀x, y.Connected_Subgraph(V \ {x, y})
    Polyhedron ⇔ Planar ∧ Three_Connected
    Polycyclic_Cage ⇔ Polyhedron
    Face(C) ⇔ Cycle(C) ∧ Connected_Subgraph(V \ C)
        ∧ ∀u, x, y, z ∈ C.edge(u, x) ∧ edge(u, y) ∧ edge(u, z) → (x = y ∨ x = z ∨ y = z)
    Pent(C) ⇔ Cycle(C) ∧ ∃!5x.x ∈ C
    Hex(C) ⇔ Cycle(C) ∧ ∃!6x.x ∈ C
    Carbon_Allotrope ⇔ ∀x.Carbon(x)
    Fullerene ⇔ Carbon_Allotrope ∧ Polycyclic_Cage ∧ Cubic
        ∧ ∀C.Face(C) → Pent(C) ∨ Hex(C)
    Closed_Nanotube ⇔ Fullerene ∧ […] ∧ (∃ […])
    Open_Nanotube ⇔ Carbon_Allotrope ∧ Polyhedron ∧ Cubic
        ∧ ∃B, C.Face(B) ∧ Face(C) ∧ B ≠ C
        ∧ ∀D.Face(D) → B = D ∨ C = D ∨ Hex(D)
    Graphene ⇔ Carbon_Allotrope ∧ Planar
        ∧ ∃B.Face(B) ∧ (∀x ∈ B.degree_2(x) ∨ degree_3(x))
        ∧ ∀C.Face(C) → B = C ∨ (Hex(C) ∧ ∀x ∈ C.degree_3(x))
    Moebius_Strip ⇔ ¬Planar ∧ ∃B.Face(B)
        ∧ ∀C.Face(C) → B = C ∨ Hex(C)

Fig. 2. MSOL formalisation of molecule classes. ([…] marks a graph drawing of a nested condition that is not reproduced here.)

5 Connecting Ontology and Graph Layers

We have seen that monadic second-order logic combined with nested conditions provides a convenient formalism for adequate formalisation of graph-conditions relevant for the modelling of chemicals. However, how can such specifications of graph classes be related to existing ontologies of molecules such as ChEBI [4] that are formulated in a light-weight ontology language like OWL-EL? Clearly, one cannot expect to be able to formalise deeper graph-theoretical properties in OWL. However, using the MSOL formalisation, we can build what we call a grounded ontology: class names such as fullerene are equipped with a (or several) formal MSOL specification(s), and specific instances, i.e.
object names are equipped with concrete graphs. Such an association, if done systematically, will give rise to a number of automated reasoning problems such as model and subclass checking, deduction in MSOL, and abduction, here put into a new context.

To motivate the following definition, note the following considerations. Two (different) ontology classes may be equipped with the same MSOL theories. This reflects an intensionality in the definition of the ontological classes which, although having different ontological definitions, denote the same structural class of molecules. Conversely, one and the same ontology class may be equipped with different MSOL theories. This corresponds to an intensionality in the realm of graph classes, where different descriptions of a graph class may be found in the literature (e.g. if the molecule has different structural variants). Therefore, we express soundness of the relation between ontology classes and monadic second-order theories as functionality modulo logical equivalence. We consider the ontology as the primary, and the graph-based formalisation as a secondary source of information. This is reflected in the following formal definition for ontologies expressed in ALC:²

Definition 1. Fix an ALC ontology O = ⟨T, A⟩, where T is a TBox, and A is an ABox. Let C be the set of ALC (sub)concept descriptions (atomic or complex) and I the set of object names appearing in O, T a set of finite MSOL theories³, and G a set of MSOL finite undirected graphs. An ontology-graph association (oga for short) is a pair of relations ❀ = ⟨❀T, ❀A⟩, where ❀T ⊆ C × T and ❀A ⊆ I × G.

❀ is total if for any concept C ∈ C and object a ∈ I there exist T ∈ T, G ∈ G such that C ❀T T and a ❀A G.

❀ is sound if for all O |= C ⊑ D and a : C and b : D we have: C ❀T S, D ❀T T implies S |=MSOL T, and a ❀A G implies G |=MSOL S, and b ❀A H implies H |=MSOL T.
❀ is complete if for all C ❀T S, D ❀T T we have: S |=MSOL T implies O |= C ⊑ D, and for all a ❀A G and b ❀A H with G |=MSOL S and H |=MSOL T we have a : C and b : D.

❀ is graph-extensional if C ❀T S, C ❀T T ⟹ |=MSOL S ↔ T.

❀ is class-extensional if C ❀T S, D ❀T S ⟹ O |= C ≡ D.

Proposition 1. Completeness implies class-extensionality (not vice versa), and soundness implies graph-extensionality (not vice versa).

² We focus here on OWL ontologies with at most ALC expressivity. However, all definitions carry over to FOL ontologies mutatis mutandis.
³ We can therefore meaningfully use Boolean combinations of such theories.

The logical structure of ontology-graph associations is illustrated in Figure 3. Note that the new aspect here is that there is a shift of levels, namely graphs are models in MSOL, while they are individuals in OWL. The correspondences (including reasoning) are as follows:

    MSOL term            chemical notion      OWL term
    MSOL theory          molecule class       OWL class
    graph                molecule             OWL individual
    model checking       instance checking    OWL ABox checking
    logical entailment   subclass relation    OWL subclass (TBox)
    consistent theory    nonempty class       satisfiable OWL class

Fig. 4. Grounded ontology: correspondences between MSOL, OWL, and chemical notions.

Fig. 3. Ontology-graph association: OWL and MSOL. (Diagram not reproduced.)

Rather than integrating or combining different logics and running the risk of losing any of the desirable properties of the special purpose formalisms, our approach realises an interlinked formalisation of different aspects of the domain of chemical molecules that relies on a mapping between the different layers. Whilst (lightweight) ontology languages are used to cope with the rather large chemical ontologies, MSOL is used to adequately capture some of the ontologically relevant spatial structure of molecule classes. Obviously, to make this approach worthwhile, we need to establish systematic ways of exchanging information between these two layers of different abstraction and expressiveness.

Deduction. Proven entailment between MSOL theories may be used to assert subsumption between the corresponding classes in the chemical ontology (e.g. the ontology in Fig. 5 has been obtained in this way). This corresponds to ensuring completeness as defined above.

Abduction. Abduction [6] can be used to hypothesise new correspondences. For example, given A ⊓ B ❀T T1, A ❀T T2 and |=MSOL T1 ↔ T1 ∧ T2, this may have the explanation B ❀T T3.

Fig. 5. Class hierarchy computed from MSOL implications.

Other related reasoning problems are induction, i.e. given a set of example molecules (e.g. MOL datafiles), learn a corresponding graph class specified in MSOL, and model checking, i.e. given a Möbius strip, check (using a tool such as [1]) that it is non-planar. Moreover, to show that it is additionally a loop is an interesting and non-trivial subsumption check. Although MSOL entailment is in general undecidable, logical entailment in second-order logic can be approximated with automated theorem provers like LEO-II (http://www.ags.uni-sb.de/~leo/). An initial OWL ontology computed from the logical implications among the MSOL axiomatisations given in Sec. 4 is shown in Fig. 5 (available at ontohub.org). Here, the implications between the graph classes are mostly definitorial and therefore easy to check automatically, and trigger the creation of corresponding subsumptions in the ontology.

6 Conclusions

Representation and reasoning with structured objects such as molecules is still an area of active research and development for ontologists and chemoinformaticians. Chemical ontologies such as ChEBI [4] provide one solution to this problem through careful manual classification. Formal ontology aims to supplement such manual efforts with explicit computable knowledge representation and accompanying automated reasoning.
We here focus on a particularly interesting and challenging class of molecules for such formalisation, and examine an approach which uses the expressive power of monadic second-order logic (MSOL) to formalise properties that cannot be defined in OWL, proposing to systematically link the two layers.

Compared to algorithmic approaches to molecule classification, we can offer a language for a declarative description of molecules and molecule classes, which offers a path to not only instance checking (as in the algorithmic case), but also to subclass checking, through MSOL theorem proving. We propose to combine this with OWL ontologies such as ChEBI, thus obtaining a “grounded ontology”, where OWL subclass relations can be verified or inferred by looking at the corresponding graph properties in MSOL.

In the semi-automatic generation of MSOL theories from chemical graph datasets via inductive reasoning, a problem that has to be considered is that abstracting a graph class from a finite number of sample molecules can sometimes produce ambiguous results. Importantly, classes of molecules conforming to particular graph theories may have characteristic emergent properties in terms of chemical and biochemical reactivity and activity profile that none of the superclasses (with less restrictive accompanying graph theories) display. The activity and reactivity properties of molecules would need to be included in a separate ontological layer within the framework we describe.

Note that the MSOL approach can only classify molecules based on properties of their graphs. However, from a graph-theoretic point of view, the molecular trefoil is equivalent to a simple loop. In order to distinguish it from the loop, one has to consider its embedding into Euclidean space, and use knot theory. Future work should consider invariants from knot theory (such as genus, polynomials and groups) in a similar role as that in which we presently propose to use MSOL.
Also, the results of computational graph theory will be useful, e.g. for optimising parts of the model checking for graphs.

References

1. S. Arnborg, ‘A general purpose MSOL model checker and optimizer based on Boolean function representation’, Technical report, KTH, Stockholm, Sweden, (1994).
2. E. W. S. Caetano, V. N. Freire, S. G. dos Santos, D. S. Galvao, and F. Sato, ‘Möbius and twisted graphene nanoribbons: stability, geometry and electronic properties’, The Journal of Chemical Physics, 128(164719), (2008).
3. B. Courcelle and J. Engelfriet, Graph structure and monadic second-order logic: A language theoretic approach, Cambridge University Press, 2011.
4. P. de Matos, R. Alcántara, A. Dekker, M. Ennis, J. Hastings, K. Haug, I. Spiteri, S. Turner, and C. Steinbeck, ‘Chemical Entities of Biological Interest: an update’, Nucl. Acids Res., 38, D249–D254, (2010).
5. H.-D. Ebbinghaus and J. Flum, Finite Model Theory, Springer, 2005.
6. C. Elsenbroich, O. Kutz, and U. Sattler, ‘A Case for Abductive Reasoning over Ontologies’, in Proc. of OWLED-06, (2006).
7. J. Canceill et al., ‘From classical chirality to topologically chiral catenands and knots’, in Supramolecular Chemistry I: Directed Synthesis and Molecular Recognition, volume 165 of Topics in Current Chemistry, 131–162, Springer Berlin / Heidelberg, (1993).
8. M. J. Wester et al., ‘Scaffold Topologies. 2. Analysis of Chemical Databases’, Journal of Chemical Information and Modeling, 48(7), 1311–1324, (2008).
9. A. Habel and K.-H. Pennemann, ‘Correctness of high-level transformation systems relative to nested conditions’, Mathematical Structures in Computer Science, 19(2), 245–296, (2009).
10. A. Habel and H. Radke, ‘Expressiveness of graph conditions with variables’, ECEASST, 30, (2010).
11. D. Han, S. Pal, Y. Liu, and H. Yan, ‘Folding and cutting DNA into reconfigurable topological nanostructures’, Nature Nanotechnology, 5, 712–717, (2010).
12. J. Hastings, D. Magka, C. Batchelor, L. Duan, R. Stevens, M. Ennis, and C. Steinbeck, ‘Structure-based classification and ontology in chemistry’, Journal of Cheminformatics, 4(1), 8, (2012).
13. M. Konyk, A. De Leon, and M. Dumontier, ‘Chemical knowledge for the semantic web’, in Proceedings of Data Integration in the Life Sciences (DILS2008), volume LNBI 5109, pp. 169–176, Evry, France, (2008). Lecture Notes in Computer Science.
14. D. Magka, B. Motik, and I. Horrocks, ‘Modelling structured domains using description graphs and logic programming’, Technical report, Dept. of Computer Science, U. of Oxford, (2011).
15. B. Motik, B. Cuenca Grau, I. Horrocks, and U. Sattler, ‘Representing Ontologies Using Description Logics, Description Graphs, and Rules’, Artificial Intelligence, 173(14), 1275–1309, (2009).
16. B. Motik, U. Sattler, and R. Studer, ‘Query Answering for OWL-DL with Rules’, Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 3(1), 41–60, (2005).
17. H. Rzepa, Molecular Möbius strips and trefoil knots, 2003. http://www.ch.ic.ac.uk/motm/trefoil/, last accessed December 2011.
18. N. Trinajstic, Chemical graph theory, CRC Press, Florida, USA, 1992.
19. H.-R. Tseng, S. A. Vignon, and J. F. Stoddart, ‘Toward chemically controlled nanoscale molecular machinery’, Angewandte Chemie International Edition, 42, 1491–1495, (2003).
20. E. W. Weisstein, Polyhedral graph, 2011. http://mathworld.wolfram.com/PolyhedralGraph.html.