Cheminformatics

DM840, Autumn 2014

Daniel Merkle

DM840 Required Assignment 2: Option 1 : Graph Isomorphism

In the first option for this assignment your task is to improve code for the canonicalization of a undirected graph. You are provided C++ source code which already implements this based on an DFS search tree exploration (based on individualization applied to equitable partitions), partition refinement, and testing. The code is heavily using concepts from Generic Programming and is therefore heavily templated. Pruning and different tree traversal methods are not implemented.

  • You should start with properly understanding the code that you are provided.
  • The code already includes checks for nodes that could be pruned, however, pruning is not implemented. Clever pruning, as you know from nauty and traces, heavily depends on the computation of automorphisms and generators. We recommend you use PermLib to do this. A mandatory task is to implement pruning based on automorphism checking. You can find the library in /home/jlandersen/shared/include. Furthermore you might consider to improve your pruning strategies by using node invariants, especially the quotient graph (see description of traces).
  • An orthogonal concept to pruning strategies is how to traverse the search tree. If pruning works, you can consider implementing different strategies for search tree traversal, including the method how traces is traversing the search tree.

In order to easily answer questions that you have during implementing your code, and in order to easily modify and extend the code we provided, we ask you to use a version control system. We set up a Mercurial repository. Plese use hg clone /home/jlandersen/shared/canonicalization to get access to the source code, which should compile smoothly in the IMADA terminal room.

DM840 Required Assignment 2: Option 2 : Atom-Atom Mapping

In the second option for this assignment, your task is to find possible chemical reactions between sets of molecules. You are given a set of educts and a set of products in SMILES notation and your task is to infer all possible graph grammar rules. 90% of all chemical reactions have a so-called mono-cyclic transition state. In the following Figure the reaction for the equilibrium system {ethyl acetate, water} {ethanol-acetic, acid}, with hydrogen chloride as a catalyst is illustrated.

The mapping from the atoms in the educts to the atoms in the products is chosen as above. In general however, there might be different mappings that lead to different Graph Grammar rules. A mapping needs to fulfill certain constraints which can easily be explained by a transition state graph as illustrated above. Based on a given mapping, some bonds remain unchanged, some from the set of educts are removed, and some bonds from the set of products are formed. In the transition state graph above these edges are either unlabeled, or labeled by "X" for a removal of a bond, or by "O" for forming an new bond. Note, that changing a double bond to a single bond is also a removal of a bond, and changing a single bond to a double bond also forms a new bond.

The mono-cyclic reaction above leads to a sequence of adjacent edges in the transition state graph that are alternating labeled with "X" and "O".

We say a mapping is chemically valid, if the labeled edges define one simple cycle graph with alternating labels "X" and "O". It is obvious that in the case of cycles the alternation only allows for cycles of even length. Mappings that lead to more than one cycle are not allowed.

Obviously a Graph Grammar rule can be defined based on such a chemically valid mapping. The straight-forward approach is to put all the nodes of the cycle in context of a graph grammar rule, and define the left- and right-hand side of the graph grammar rule based according to the change of the edge labels. This might create very general rules. Therefore we introduce a parameter k, that can force atoms and edges to be in the context of a rule as follows: If a node has a distance smaller or equal to k to one of the nodes of the found cycle, then that node is also put into the context of the graph grammar rule. All unchanged edges incident to two nodes in the context are then also put into the context. In the example above (bottom figure) the rule generated for k=0 is illustrated in black, the additional nodes and edges in the context for the rule generated for k=1 are depicted in green (note the invisible green carbon atom).

Of course there are, in general, several different chemically valid mappings. Note, that a pair of different mappings might either lead to an identical graph grammar rule, or to a different graph grammar rule (cmp. automorphism groups that we discussed in the context of (sub-)graph isomorphism algorithms that were discussed in the lecture).

Requirements

In a first step you have to define mathematically what a chemically valid mapping is. In the second step you have to enumerate all graph grammar rules based on chemically valid mappings for a set of educts and products and for different values k (context size) and c (cycle length).

Attacking the Problem

The overall goal is enumerate all different Graph Grammar rules based on chemically valid mappings (and create a GML file for each of them if it's not too many) based on specific values k, c, and molecules given in SMILES notation. You are most welcome to use any programming language of your choice. We however provide you a template program in C++ that should (without and specific C++ knowledge) be easily extensible. This program takes care of the GML file generation and also performs the check for identical rules. The part that you need to provide is an (most likely backtracking) algorithm that, more or less cleverly, enumerates all chemically valid graph grammar rules.

All files you need are provided in the directory /home/daniel/dm840/assignment2/. The files are more or less self-explanatory. The only file that you need to modify is the file main.cpp, within that file you basically only need to change the part between the lines // WORK AREA: START and // WORK AREA: END. Currently, the source code includes three different hard-coded mappings from the nodes of the educt molecule(s) to the nodes of the product molecule(s). Two of the mappings lead to the same rule. If you execute the compiled program ./test the program prints out the all nodes, all edges, and all labels of nodes and edges from the "educt graph" (which might consist of several components) and for the product graph. Furthermore it prints out all mappings, and indicates that an identical rule was found. Note, that you can execute the program with a file as an argument: ./test -f test.py. The files test.py is also self-explanatory. If you use the -f option, you don't have to recompile your file if you want to change the input compounds. The program will write all created rules in the out/ directory (make sure it exists). If the rules were successfully generated you can execute mod_genSummary in order to create a report with an illustration of all the rules based on the set of rules how you defined it.

In addition, note the following:

  • The context that is created for a rule is based on the mapping that you use for creating the rule: If you put all the atoms of the educt as well as the product graph in the mapping, then all atoms, and all unchanged edges will be in the context of the rule (in the above figure the mapping indicated by the black, green, and blue dotted lines). If you only map the vertices incident to changing edges onto each-other, then no edge will be in the context of the rule (in the above figure the mapping indicated by the green dotted lines only). The parameter k should therefore influence the size of a mapping vertexMap that you put via vertexMaps.push_back(vertexMap) in the set of all mappings.
  • First a rule is created based on a mapping via *r = createRule(vertexMap, ...). When this rule is put into the container for all rules (via uniqueStore.add(r)), it will be automatically checked if this rule was found before.
  • The "chemical" plausibility check for the mappings provided is very simple. It is, for example, checked if you try to change more than two edges incident to the same node, or if the number of edges removed and created is not in balance (see if(doChemistryCheck) ... in main.cpp). However, if you create rules that can create dangling edges or has other non-chemical properties, the program might just crash. It is your responsibility to check if a mapping induces a mono-cyclic transition state, and you should make sure that you do not add mappings that do not fulfill this property.
  • You should be very careful when changing parts of the code that is not meant for modification. Although some of the code is really not meant for modification, we still left it in the example program.

Additional Reading

For nice article that gives a bit more chemical background on the idea of mono-cyclic reaction, see:

James B. Hendrickson: Comprehensive System for Classification and Nomenclature of Organic Reactions. Journal of Chemical Information and Computer Sciences 37(5): 852-860 (1997)

(you can find this article in blackboard, voluntary reading)

Report / Submission

The report must be short and precise (maximum 10 pages, English language). It must include the following:

  • A small introduction.
  • A section on the formal definition of a chemically valid mapping and a partial chemically valid mapping that determines the nodes and edges in the context based on the parameter k and c.
  • A section that describes your implementation to find all (partial) the necessary valid mappings.
  • A result section; it should include a table of the following form. If you couldn't solve the instance, it should be indicated clearly. Also give information on how long it took to solve the instances.

    \begin{table}[h]
      \centering
      \begin{tabular}{ll|cccc}\\
        Educts & Products & $k$ & $c$ & $n$ & $t$\\\hline
        C=C, C=C & C1CCC1 & 0 & 4 & & \\
        C=C, C=C & C1CCC1 & 1 & 4 & & \\
        O, Cl, CC(=O)OCC & Cl, OCC, CC(=O)O & 0 & 6 \\
        O, Cl, CC(=O)OCC & Cl, OCC, CC(=O)O & 1 & 6 \\
        O, Cl, CC(=O)OCC & Cl, OCC, CC(=O)O & 2 & 6 \\
        C1C(O)CC(O)C(O)C1 & C=CO, C=CO, C=CO  & 0 & 6 & & \\
        C1C(O)CC(O)C(O)C1 & C=CO, C=CO, C=CO  & 1 & 6 & & \\
        CC=CC=CC, OC1C=CC=CC=1 & C=CC=CC=C, OC(=C)C=CC=C & 0 & 6 & & \\
        CC=CC=CC, OC1C=CC=CC=1 & C=CC=CC=C, OC(=C)C=CC=C & 1 & 6 & & \\
        CC=CC=CC, OC1C=CC=CC=1 & C=CC=CC=C, OC(=C)C=CC=C & 2 & 6 & & \\
        CC, OC1C=CC=CC=1 & C=C, OC(=C)C=CC=C & 0 & 6 & & \\
        CC, OC1C=CC=CC=1 & C=C, OC(=C)C=CC=C & 1 & 6 & & \\
        CC, OC1C=CC=CC=1 & C=C, OC(=C)C=CC=C & 2 & 6 & & \\
        OP(=O)(O)OP(=O)(O)O, O & O=P(O)(O)O, O=P(O)(O)O  & 0 & 4 & & \\
        OP(=O)(O)OP(=O)(O)O, O & O=P(O)(O)O, O=P(O)(O)O  & 1 & 4 & & \\
        OP(=O)(O)OP(=O)(O)O, O & O=P(O)(O)O, O=P(O)(O)O  & 0 & 6 & & \\
        OP(=O)(O)OP(=O)(O)O, O & O=P(O)(O)O, O=P(O)(O)O  & 1 & 6 & & \\
        C\#N, C\#N & N=CC\#N & 0 & 4 & & \\
        C\#N, C\#N & N=CC\#N & 1 & 4 & & \\
     \end{tabular}
    \end{table}
    
    Note that the backslash before the triple bond sign is only there because of LaTeX, the SMILES string does not contain it.

    In this table the parameters k and c are defined as described above, n is the number of different graph grammar rules you found, and t is the number of different mappings you created in order to find those t different graph grammar rules, i.e., the number of calls to vertexMaps.push_back(vertexMap).
  • An example that illustrates how an increasing k limits the possibility of applying rules: Given a fixed set of educts and products (you can choose them) infer rule sets Rk for different values of k. Expand the chemical space (initially filled with the educts only) by applying one of the rules of Rk once only (for at least two different values of k). Depict the two different hypergraph, in at least one of them the initially given product set should be a proper subset of the products created by the rule application.
  • A section where you discuss how you could speedup your implementation but maybe applying methods similar to the pruning methods in the Ullmann and/or McKay's algorithm.
  • In case you work as a group: a paragraph on who contributed with what.

Submission

For submitting the report and the sources proceed as follows: create the following directory structure

assignment2/
assignment2/code/
assignment2/report/
assignment2/results/

Change to the directory assignment2 and type aflever DM840 your@mail.com.

Additionally, you shall hand in a printout of your report (Department secretaries office).

The strict deadline for submission (electronically) is 08.01.2015, 11am. The electronically submitted and the printout version need to be identical.

Frequently Asked Questions (FAQ)

  • Are we allowed to work in groups?
  • In case you are a PhD student: No. In case you are not a PhD student: You are allowed to works in groups of max 2 people. Write the members of your group on the front page of your report. In the report you have to write who contributed with what. PhD students need to work on their own.

  • Can I install MØD on a non-IMADA terminal room machine
  • No.

Design by 1234.info | Modified by Daniel Merkle | CSS 2.0