7. Data Formats¶
MØD utilises several data formats and encoding schemes.
7.1. GML¶
MØD uses the Graph Modelling Language (GML) for general specification of graphs and rules. The parser recognises most of the published specification, with regard to syntax. The specific grammar is as follows.
    GML        ::= (key value)*
    key        ::= identifier
    value      ::= int | float | quoteEscapedString | list
    list       ::= '[' (key value)* ']'
    identifier ::= a word matching the regex "[a-zA-Z][a-zA-Z0-9]*"
A quoteEscapedString is zero or more characters surrounded by double quotation marks. To include a " character it must be escaped as \". Tabs, newlines, and backslashes can be written as \t, \n, and \\, respectively.
GML code may have line comments, starting with #. They are ignored during parsing.
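The grammar above is small enough to sketch as a recursive-descent parser. The following is a minimal illustration in Python, not the parser used by MØD; the names `tokenize`, `unescape`, and `parse_gml` are hypothetical, and the exact numeric syntax accepted for int and float is an assumption.

```python
import re

# One alternative per token kind; '#' starts a line comment.
TOKEN = re.compile(r'''
    (?P<comment>\#[^\n]*)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<float>[-+]?\d+\.\d*(?:[eE][-+]?\d+)?)
  | (?P<int>[-+]?\d+)
  | (?P<key>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<open>\[)
  | (?P<close>\])
  | (?P<space>\s+)
''', re.VERBOSE)

ESCAPES = {"t": "\t", "n": "\n", "\\": "\\", '"': '"'}


def tokenize(text):
    """Yield (kind, text) pairs, skipping comments and white-space."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup not in ("comment", "space"):
            yield m.lastgroup, m.group()


def unescape(quoted):
    """Strip the surrounding quotes and resolve backslash escapes."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)),
                  quoted[1:-1])


def _parse_pairs(tokens, pos):
    """Parse (key value)* starting at tokens[pos]."""
    pairs = []
    while pos < len(tokens) and tokens[pos][0] == "key":
        key = tokens[pos][1]
        if pos + 1 >= len(tokens):
            raise ValueError(f"key {key!r} has no value")
        kind, text = tokens[pos + 1]
        pos += 2
        if kind == "int":
            value = int(text)
        elif kind == "float":
            value = float(text)
        elif kind == "string":
            value = unescape(text)
        elif kind == "open":
            value, pos = _parse_pairs(tokens, pos)
            if pos >= len(tokens) or tokens[pos][0] != "close":
                raise ValueError("missing ']'")
            pos += 1
        else:
            raise ValueError(f"key {key!r} has no value")
        pairs.append((key, value))
    return pairs, pos


def parse_gml(text):
    """Return the document as a list of (key, value) pairs; lists nest."""
    tokens = list(tokenize(text))
    pairs, pos = _parse_pairs(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token at top level")
    return pairs
```

Key-value pairs are kept as an ordered list rather than a dictionary, because a key such as node may occur several times in the same list.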
7.1.1. Graph¶
A graph can be specified as GML by giving a list of vertices and edges with the key graph.
The following grammar exemplifies the required key-value structure.
    graphGML ::= 'graph [' (node | edge)* ']'
    node     ::= 'node [ id' int 'label' quoteEscapedString ']'
    edge     ::= 'edge [ source' int 'target' int 'label' quoteEscapedString ']'
Note though that list elements can appear in any order.
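As an illustration of this key-value structure, the following Python sketch writes a graph in GML from plain lists of vertices and edges. The helper names `quote` and `graph_to_gml` are hypothetical and not part of MØD, and the formaldehyde example is chosen only for illustration.

```python
def quote(label):
    """Format a label as a quoteEscapedString."""
    escaped = (label.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\t", "\\t").replace("\n", "\\n"))
    return f'"{escaped}"'


def graph_to_gml(vertices, edges):
    """vertices: (id, label) pairs; edges: (source, target, label) triples."""
    lines = ["graph ["]
    for vid, label in vertices:
        lines.append(f"  node [ id {vid} label {quote(label)} ]")
    for src, tar, label in edges:
        lines.append(f"  edge [ source {src} target {tar} label {quote(label)} ]")
    lines.append("]")
    return "\n".join(lines)


# Formaldehyde with explicit hydrogens.
formaldehyde = graph_to_gml(
    vertices=[(0, "C"), (1, "O"), (2, "H"), (3, "H")],
    edges=[(0, 1, "="), (0, 2, "-"), (0, 3, "-")],
)
print(formaldehyde)
```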
7.1.2. Rule¶
A rule \((L\leftarrow K\rightarrow R)\) in GML format is specified as three graph fragments: left, context, and right.
From those, \(L\) is constructed as left \(\cup\) context, \(R\) as right \(\cup\) context, and \(K\) as context \(\cup\) (left \(\cap\) right).
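The construction of \(L\), \(K\), and \(R\) can be written down directly with set operations. The sketch below treats each fragment as a set of element identifiers only; it ignores labels, and the function name `rule_sides` is hypothetical.

```python
def rule_sides(left, context, right):
    """Return (L, K, R) built from the three fragments of a rule."""
    left, context, right = set(left), set(context), set(right)
    L = left | context
    K = context | (left & right)
    R = right | context
    return L, K, R


# "b" occurs in both left and right, so it ends up in K as well.
L, K, R = rule_sides(left={"a", "b"}, context={"c"}, right={"b", "d"})
```

With these definitions \(K\) is always contained in both \(L\) and \(R\), as required for the span \((L\leftarrow K\rightarrow R)\).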
Each graph fragment is specified as a list of vertices and edges, similar to a
graph in GML format.
The key-value structure is exemplified by the following grammar.
    ruleGML         ::= 'rule [' [ 'ruleID' quoteEscapedString ] [ 'labelType "' labelType '"' ] [ leftSide ] [ context ] [ rightSide ] matchConstraint* ']'
    labelType       ::= 'string' | 'term'
    leftSide        ::= 'left [' (node | edge)* ']'
    context         ::= 'context [' (node | edge)* ']'
    rightSide       ::= 'right [' (node | edge)* ']'
    matchConstraint ::= adjacency | labelAny | labelNone
    adjacency       ::= 'constrainAdj [' 'id' int 'op "' op '"' 'count' unsignedInt [ 'nodeLabels [' labelList ']' ] [ 'edgeLabels [' labelList ']' ] ']'
    labelAny        ::= 'constrainLabelAny [' 'label' quoteEscapedString 'labels [' labelList ']' ']'
    labelNone       ::= 'constrainLabelNone [' 'label' quoteEscapedString 'labels [' labelList ']' ']'
    labelList       ::= ('label' quoteEscapedString)*
    op              ::= '<' | '<=' | '=' | '>=' | '>'
Note though that list elements can appear in any order.
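To make the structure concrete, here is a small Python sketch that assembles a rule in GML from pre-formatted items. The helper `rule_to_gml` and the example rule are hypothetical; the adjacency constraint is included only to show where constraints are placed, and the rule ID is inserted without escaping.

```python
def rule_to_gml(rule_id, left=(), context=(), right=(), constraints=()):
    """Assemble a rule from lists of GML node/edge/constraint strings."""
    def section(key, items):
        body = "".join(f"    {item}\n" for item in items)
        return f"  {key} [\n{body}  ]\n"

    parts = [f'rule [\n  ruleID "{rule_id}"\n']
    for key, items in (("left", left), ("context", context), ("right", right)):
        parts.append(section(key, items))
    parts.extend(f"  {c}\n" for c in constraints)
    parts.append("]")
    return "".join(parts)


# Hypothetical rule: relabel the edge between vertices 1 and 2 from "-" to "=".
gml = rule_to_gml(
    "Single to double",
    left=['edge [ source 1 target 2 label "-" ]'],
    context=['node [ id 1 label "C" ]', 'node [ id 2 label "C" ]'],
    right=['edge [ source 1 target 2 label "=" ]'],
    constraints=['constrainAdj [ id 1 op ">=" count 1 nodeLabels [ label "H" ] ]'],
)
print(gml)
```

The two vertices appear only in context because they are unchanged, while the edge appears in both left and right with different labels.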
7.2. Tikz (Rule)¶
This format is used for visualising rules similarly to how the Tikz (Graph) format is used for graphs. A rule is depicted as its span \((L\leftarrow K\rightarrow R)\) with the vertex positions in the plane indicating the embedding of \(K\) in \(L\) and \(R\). Additionally, \(L\backslash K\) and \(R\backslash K\) are shown in a different colour in \(L\) and \(R\) respectively.
7.3. DOT (Rule)¶
The DOT format (from Graphviz) is used for generating vertex coordinates for the Tikz format when Open Babel cannot be used.
7.4. Tikz (Graph)¶
Graphs are visualised using generated Tikz code.
The coordinates for the layout are generated using either Open Babel or Graphviz.
The visualisation style is controlled by passing instances of the classes mod::graph::Printer (C++) and mod.GraphPrinter (Python) to the printing functions.
The drawing style is inspired by ChemFig and Open Babel.
See also PostMØD (mod_post).
7.5. DOT (Graph)¶
The DOT format (from Graphviz) is used for generating vertex coordinates for the Tikz format when Open Babel cannot be used.
7.6. SMILES¶
The Simplified molecular-input line-entry system is a line notation for molecules. MØD can load most SMILES strings, and converts them internally to labelled graphs. For graphs that are sufficiently molecule-like, a SMILES string can be generated. The generated strings are canonical in the sense that the same version of MØD will print the same SMILES string for isomorphic molecules.
The reading of SMILES strings is based on the OpenSMILES specification, but with the following notes/changes.
- Only single SMILES strings are accepted, i.e., not multiple strings separated by white-space.
- The special dot “bond” (.) is not allowed.
- Up and down bonds are regarded as implicit bonds, i.e., they might represent either a single bond or an aromatic bond. The stereo information is ignored.
- Atom classes are (mostly) ignored. They can be used to specify unique IDs to atoms.
- Wildcard atoms (specified with *) are converted to vertices with label *. When inside brackets, only the hydrogen count and atom class are then permitted.
- Abstract vertex labels can be specified inside brackets. The bracket must in that case only contain the label and an optional class label. The label must be a non-empty string without : and with balanced square brackets.
- Charges of magnitude 2 and 3 may be specified with repeated - and +.
- The bond type $ is currently not allowed.
- Aromaticity can only be specified using the bond type : or using the special lower case atoms. I.e., c1ccccc1 and C1:C:C:C:C:C:1 represent the same molecule, but C1=CC=CC=C1 is a different molecule.
- Ring-bonds and branches may appear in mixed order. The normal order is to have all ring-bonds first and then all branches, e.g., C123(O)(N). The parser accepts them in mixed order, e.g., C1(O)2(N)3.
- The final graph will conform to the molecule encoding scheme described below.
- Implicit hydrogens are added following a more complicated procedure.
- A bracketed atom can have a radical by writing a dot (.) between the position of the charge and the position of the class.
The written SMILES strings are intended to be canonical and may not conform to any “prettiness” standards.
7.6.1. Implicit Hydrogen Atoms¶
When SMILES strings are written they will use implicit hydrogens whenever they can be inferred when reading the string.
For the purposes of implicit hydrogens we use the following definition of valence for an atom.
The valence of an atom is the weighted sum of its incident edges, where single (-) and aromatic (:) bonds have weight 1, double bonds (=) have weight 2, and triple bonds (#) have weight 3.
If an atom has an incident aromatic bond, its valence is increased by 1.
The atoms that can have implicit hydrogens are B, C, N, O, P, S, F, Cl, Br, and I.
Each has a set of so-called “normal” valences as shown in the following table.
The atoms N and S additionally have certain sets of incident edges that are also considered “normal”; these are also listed in the table.
Atom | Normal Valences and Neighbourhoods |
---|---|
B | 3 |
C | 4 |
N | 3, 5, \(\{-, :, :\}\), \(\{-, -, =\}\), \(\{:, :, :\}\) |
O | 2 |
P | 3, 5 |
S | 2, 4, 6, \(\{:, :\}\) |
F, Cl, Br, I | 1 |
If the set of incident edges is listed in the table, then no hydrogens are added. If the valence is higher than the highest normal valence, then no hydrogens are added. Otherwise, hydrogens are added until the valence is at the next higher normal valence.
When writing SMILES strings the inverse procedure is used.
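The reading procedure can be sketched in a few lines of Python. This follows the valence definition and table above literally; the names `valence` and `implicit_hydrogens` are hypothetical, and charges or radicals are not considered.

```python
from collections import Counter

BOND_WEIGHT = {"-": 1, ":": 1, "=": 2, "#": 3}

NORMAL_VALENCES = {
    "B": [3], "C": [4], "N": [3, 5], "O": [2], "P": [3, 5], "S": [2, 4, 6],
    "F": [1], "Cl": [1], "Br": [1], "I": [1],
}

# Multisets of incident bond types that are "normal" regardless of valence.
NORMAL_NEIGHBOURHOODS = {
    "N": [Counter("-::"), Counter("--="), Counter(":::")],
    "S": [Counter("::")],
}


def valence(bonds):
    """Weighted sum of the incident bonds, plus 1 if any bond is aromatic."""
    total = sum(BOND_WEIGHT[b] for b in bonds)
    if ":" in bonds:
        total += 1
    return total


def implicit_hydrogens(atom, bonds):
    """Number of hydrogens to add to an atom with the given incident bonds."""
    if atom not in NORMAL_VALENCES:
        return 0
    if Counter(bonds) in NORMAL_NEIGHBOURHOODS.get(atom, []):
        return 0
    v = valence(bonds)
    for normal in NORMAL_VALENCES[atom]:  # sorted in increasing order
        if normal >= v:
            return normal - v
    return 0  # valence above the highest normal valence
```

For example, an aromatic carbon with bonds `[":", ":"]` has valence 3 and receives one hydrogen, whereas a nitrogen with bonds `["-", "-", "="]` receives none because that neighbourhood is listed in the table.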
7.7. GraphDFS¶
The GraphDFS format is intended to provide a convenient line notation for general undirected labelled graphs. Thus it is in many aspects similar to SMILES strings, but a string being both a valid SMILES string and GraphDFS string may not represent the same graph. The semantics of ring-closures/back-edges are in particular not the same.
7.7.1. Grammar¶
    graphDFS                     ::= chain
    chain                        ::= vertex evPair*
    vertex                       ::= (labelVertex | ringClosure) branch*
    evPair                       ::= edge vertex
    labelVertex                  ::= '[' bracketEscapedString ']' [defRingId] | implicitHydrogenVertexLabels [defRingId]
    implicitHydrogenVertexLabels ::= 'B' | 'C' | 'N' | 'O' | 'P' | 'S' | 'F' | 'Cl' | 'Br' | 'I'
    defRingId                    ::= unsignedInt
    ringClosure                  ::= unsignedInt
    edge                         ::= '{' braceEscapedString '}' | shorthandEdgeLabels
    shorthandEdgeLabels          ::= '-' | ':' | '=' | '#' | ''
    branch                       ::= '(' evPair+ ')'
A bracketEscapedString and a braceEscapedString are zero or more characters except ] and }, respectively. To include these characters in their respective strings they must be escaped, i.e., written as \] and \}.
The parser additionally enforces that a defRingId may not be a number which has previously been used.
Similarly, a ringClosure may only be a number which has previously occurred in a defRingId.
A vertex specified via the implicitHydrogenVertexLabels rule will potentially have extra neighbours added after parsing. The rules are exactly the same as for implicit hydrogen atoms in SMILES.
7.7.2. Semantics¶
A GraphDFS string is, like a SMILES string, an encoding of a depth-first traversal of the graph it encodes.
Vertex labels are enclosed in square brackets and edge labels are enclosed in curly brackets.
However, a special set of labels can be specified without the enclosing brackets.
An edge label may additionally be completely omitted as a shorthand for a dash (-).
A vertex can have a numeric identifier, defined by the defRingId non-terminal.
At a later stage this identifier can be used as a vertex specification to specify a back-edge in the depth-first traversal.
Example: [v1]1-[v2]-[v3]-[v4]-1 specifies a labelled \(C_4\) (which can equivalently be written more briefly as [v1]1[v2][v3][v4]1).
A vertex that is a ringClosure can never be the first vertex in a string, and is thus always preceded by an edge. As in a depth-first traversal, such a back-edge is a kind of degenerate branch. Example: [v1]1[v2][v3][v4]1[v5][v6]1 specifies a graph consisting of two fused \(C_4\) with a common edge (and not just a common vertex).
Warning
The semantics of back-edges/ring closures are not the same as in SMILES strings. In SMILES, a pair of matching numeric identifiers denote the individual back-edges.
A branch in the depth-first traversal is enclosed in parentheses.
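The semantics described in this section can be illustrated with a small recursive-descent parser. The following Python sketch is not the MØD implementation: the function name `parse_graph_dfs` is hypothetical, no hydrogens are added, and it makes one assumption that the grammar does not state, namely that a number directly following a labelled vertex defines an identifier when the number is unused and is otherwise read as an implicit-edge ring closure.

```python
import re

NUMBER = re.compile(r"\d+")
# Two-letter symbols come first so that "Cl" is not read as "C".
IMPLICIT_H_LABELS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")


def parse_graph_dfs(text):
    """Return (labels, edges) with edges as (source, target, label) triples."""
    labels, edges, ring_ids = [], [], {}
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def escaped_until(closer):
        nonlocal pos
        out = []
        while peek() not in ("", closer):
            if peek() == "\\" and pos + 1 < len(text):
                pos += 1  # keep the escaped character, drop the backslash
            out.append(text[pos])
            pos += 1
        if peek() != closer:
            raise ValueError(f"missing {closer!r}")
        pos += 1
        return "".join(out)

    def edge_label():
        nonlocal pos
        if peek() == "{":
            pos += 1
            return escaped_until("}")
        if peek() and peek() in "-:=#":
            pos += 1
            return text[pos - 1]
        return "-"  # an omitted edge label is shorthand for a dash

    def vertex(prev, elabel):
        nonlocal pos
        m = NUMBER.match(text, pos)
        if m:  # ringClosure: a back-edge to an already defined vertex
            ring = int(m.group())
            if prev is None or ring not in ring_ids:
                raise ValueError(f"invalid ring closure {ring}")
            edges.append((prev, ring_ids[ring], elabel))
            pos = m.end()
            current = prev  # degenerate branch: the traversal stays at prev
        else:
            if peek() == "[":
                pos += 1
                label = escaped_until("]")
            else:
                label = next((s for s in IMPLICIT_H_LABELS
                              if text.startswith(s, pos)), None)
                if label is None:
                    raise ValueError(f"expected a vertex at offset {pos}")
                pos += len(label)
            current = len(labels)
            labels.append(label)
            if prev is not None:
                edges.append((prev, current, elabel))
            m = NUMBER.match(text, pos)
            # Assumption of this sketch: a number naming an already defined
            # identifier is left for the next evPair as a ring closure.
            if m and int(m.group()) not in ring_ids:
                ring_ids[int(m.group())] = current
                pos = m.end()
        while peek() == "(":  # branch ::= '(' evPair+ ')'
            pos += 1
            ev_pairs(current)
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
        return current

    def ev_pairs(current):
        while peek() not in ("", ")"):
            current = vertex(current, edge_label())
        return current

    ev_pairs(vertex(None, None))
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
    return labels, edges
```

Because a ring closure leaves the traversal at the preceding vertex, parsing [v1]1[v2][v3][v4]1[v5][v6]1 with this sketch attaches v5 to v4 rather than to v1, which is what makes the two cycles share an edge.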
7.7.3. Abstracted Molecules¶
The short-hand labels for vertices and edges make it easier to specify partial molecules than using GML files.
As an example, consider modelling Acetyl-CoA in which we wish to abstract most of the CoA part.
The GraphDFS string CC(=O)S[CoA] can be used, and we let the library add missing hydrogen atoms to the vertices which encode atoms. A plain CoA molecule would in this modelling be [CoA]S, or a bit more verbosely [CoA]S[H].
The format can also be used to create completely abstract structures (it can encode any undirected labelled graph), e.g., RNA strings. Note that in this case it may not be appropriate to add “missing” hydrogen atoms. This can be controlled by an optional parameter to the loading function.
7.8. Molecule Encoding¶
There is no strict requirement that graphs encode molecules, however several optimizations are in place when they do. The following describes how to encode molecules as undirected, simple, labelled graphs and thus when the library assumes a graph is a molecule.
7.8.1. Edges / Bonds¶
An edge encodes a chemical bond if and only if its label is listed in the table below.
Label | Interpretation |
---|---|
- | Single bond |
: | “Aromatic” bond |
= | Double bond |
# | Triple bond |
7.8.2. Vertices / Atoms¶
A vertex encodes an atom with a charge if and only if its label conforms to the following grammar.
    vertexLabel ::= [ isotope ] atomSymbol [ charge ] [ radical ]
    isotope     ::= unsignedInt
    charge      ::= singleDigit ('-' | '+')
    radical     ::= '.'
    atomSymbol  ::= an atom symbol with the first letter capitalised
Currently there are no valence requirements for a graph being recognised as a molecule.
7.9. First-Order Terms¶
Vertex/edge labels on graphs/rules can be interpreted either as text strings or as first-order terms.
Additionally, for first-order terms there is a choice of which type of relation between terms should be required in morphisms.
This can be controlled in each algorithm through label settings objects (C++: LabelSettings, Python: LabelSettings).
A constant or function symbol is a word that can be matched by the regex [A-Za-z0-9=#:.+-][A-Za-z0-9=#:.+-_]*.
This means that all strings that are usually considered “molecular” can be reinterpreted as constant symbols.
A variable symbol is a word that can be matched by the regex _[A-Za-z0-9=#:.+-][A-Za-z0-9=#:.+-_]*.
That is, a variable is like a constant/function symbol, but with a _ prepended.
An unnamed variable can be specified by the special wildcard symbol *.
Note
Variable names matched by the regex _[HT][0-9][0-9]* may be generated when printing out graphs/rules. Any original variable names are not saved.
Function terms start with a function symbol followed by a parenthesis with a comma-separated list of terms. They may contain white-space.
If parsing of terms fails, a specific exception is thrown (C++: TermParsingError, Python: TermParsingError).
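The symbol classes above can be checked with Python's `re` module, as in the following sketch. The function name `classify_symbol` is hypothetical. Note one assumption: in Python a `-` between two characters inside a character class denotes a range, so the hyphen is moved to the end of the second class to make it a literal, which is presumably the intended reading of the documented regex.

```python
import re

# '-' is placed last in each class so that it is a literal hyphen.
SYMBOL = r"[A-Za-z0-9=#:.+-][A-Za-z0-9=#:.+_-]*"
CONSTANT = re.compile(SYMBOL)
VARIABLE = re.compile("_" + SYMBOL)


def classify_symbol(word):
    """Classify a single word as used in a first-order term."""
    if word == "*":
        return "unnamed variable"
    if VARIABLE.fullmatch(word):
        return "variable"
    if CONSTANT.fullmatch(word):
        return "constant or function symbol"
    return "invalid"


for word in ["C", "O-", "Fe2+", "_X", "*", "_"]:
    print(repr(word), "->", classify_symbol(word))
```

A lone `_` is rejected here because a variable needs at least one character after the underscore and a constant cannot start with one.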
7.10. Abstract Derivation Graphs¶
Sometimes it is really convenient to quickly write down a few equations to describe a “derivation graph”, without associating actual graphs and rules to it. That is, only specifying the underlying network. The network description is a string adhering to the following grammar:
    description ::= derivation { derivation }
    derivation  ::= side ("->" | "<=>") side
    side        ::= term { "+" term }
    term        ::= [ unsignedInt ] identifier
    identifier  ::= any character sequence without spaces
Note that the identifier definition in particular means that whitespace is important between coefficients and identifiers. E.g., 2 A -> B is different from 2A -> B.
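The role of whitespace becomes clear when a derivation is tokenised by splitting on spaces. The following Python sketch parses a single derivation, not a full description with several derivations; the names `parse_side` and `parse_derivation` are hypothetical, and the sketch assumes that the arrows and plus signs are surrounded by whitespace.

```python
def parse_side(tokens):
    """Parse term { "+" term } into a list of (coefficient, identifier)."""
    if not tokens:
        raise ValueError("a side needs at least one term")
    terms, i = [], 0
    while i < len(tokens):
        # A numeric token followed by an identifier is a coefficient.
        if (tokens[i].isdecimal() and i + 1 < len(tokens)
                and tokens[i + 1] != "+"):
            terms.append((int(tokens[i]), tokens[i + 1]))
            i += 2
        else:
            terms.append((1, tokens[i]))
            i += 1
        if i < len(tokens):
            if tokens[i] != "+":
                raise ValueError(f"expected '+', got {tokens[i]!r}")
            i += 1
            if i == len(tokens):
                raise ValueError("dangling '+'")
    return terms


def parse_derivation(text):
    """Return (left terms, right terms, is_reversible) for one derivation."""
    tokens = text.split()
    arrows = [i for i, t in enumerate(tokens) if t in ("->", "<=>")]
    if len(arrows) != 1:
        raise ValueError("expected exactly one '->' or '<=>'")
    i = arrows[0]
    return parse_side(tokens[:i]), parse_side(tokens[i + 1:]), tokens[i] == "<=>"


print(parse_derivation("2 A -> B"))  # coefficient 2 on identifier "A"
print(parse_derivation("2A -> B"))   # a single identifier "2A"
```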
See also DGBuilder.addAbstract() / dg::Builder::addAbstract().