DataWarrior User Manual

Molecule or Reaction Similarity and Descriptors


Molecule and reaction similarities play an important role in DataWarrior. They are calculated when a filter is used to focus on a subset of similar molecules, and they are used to customize views, e.g. to color, size or position markers based on compound similarities. Some data analysis algorithms that typically use alphanumerical input, e.g. self-organizing maps, may also consider compound similarities. In addition, many dedicated cheminformatics analysis methods need a similarity criterion to work with, e.g. an activity cliff analysis or compound clustering. DataWarrior supports various kinds of molecule similarities. These range from a simple chemical similarity based on common substructure fragments up to a binding behaviour similarity that considers the 3D-geometry of conformers and the interaction potential to proteins.

When DataWarrior calculates a similarity value between two molecules, the calculation is not performed on the molecular graphs directly. Instead, it involves a two-step process:

  • The molecular graphs of all molecules in a dataset are processed to extract certain molecule features. These are compiled into an abstract molecule description, called a descriptor. In the simplest case the descriptor consists of a binary array in which every bit indicates whether a certain feature is present in the molecule. These binary descriptors are also referred to as fingerprints. More complex descriptors may consist of a vector, a tree or a simplified graph of the original molecule.
  • In a second step the descriptors of two molecules are compared with some kind of logic to determine the actual similarity value, i.e. how much the two compounds have in common. In the case of binary descriptors DataWarrior divides the number of common features by the number of features present in at least one of the two molecules. This is referred to as Tanimoto similarity.

Naturally, the kind of features being collected and the algorithm used to compare these features have a crucial influence on the kind of similarity being calculated. Therefore, the choice of descriptor for similarity calculations depends very much on the purpose. For molecules DataWarrior supports three different binary descriptors and three more advanced similarity methods. These and a reaction descriptor are all explained here in more detail.
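
The Tanimoto calculation for binary descriptors described above can be sketched in a few lines. This is only an illustration, not DataWarrior's implementation: the function name `tanimoto`, the use of Python integers as bit arrays, and the 8-bit toy fingerprints are all assumptions made for the example.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # features present in both molecules
    either = bin(fp_a | fp_b).count("1")  # features present in at least one
    # Assumption of this sketch: two empty fingerprints count as identical.
    return 1.0 if either == 0 else common / either


# Toy 8-bit fingerprints, purely illustrative.
mol_a = 0b10110100
mol_b = 0b10010110
print(tanimoto(mol_a, mol_b))  # 3 common bits / 5 bits set in either -> 0.6
```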


    Which Descriptor Should Be Used For Which Purpose

    If the purpose is to filter a large compound collection by chemical structure similarity, the default descriptor FragFp is a good choice, because it is automatically available, does not require much space, and similarity calculations are practically instantaneous.

    If more fine-grained similarities need to be perceived, e.g. if stereo isomers need to be distinguished, or to achieve the best results from clustering or any kind of similarity analysis, the SkelSpheres descriptor should be used. Especially when creating an evolutionary library in vast virtual compound space, the SkelSpheres descriptor outperforms the binary fingerprints in quality, because it considers multiple fragment occurrences and makes hash collisions unlikely.

    When chemical functionality from a synthetic chemist's point of view is more important than the carbon skeleton that carries this functionality, you should try the OrgFunctions descriptor. Examples are searching a chemicals database for an alternative reactant for a reaction, or arranging a building block collection in space based on synthetically accessible functionality.

    If the similarity of biological binding behaviour is important rather than merely the similarity of the chemical graph, then use the Flexophore descriptor, which requires more space and significantly more time to calculate descriptors as well as similarity values.


    The FragFp Descriptor

    DataWarrior's default descriptor FragFp is a binary fingerprint based on a dictionary of substructure fragments, similar to the MDL keys. It relies on a dictionary of 512 predefined structure fragments. These were selected from the multitude of possible structure fragments by optimizing two criteria: All chosen fragments should occur frequently within typical organic molecule structures. Any two chosen fragments should show little overlap concerning their occurrence in diverse sets of organic compounds. The FragFp descriptor contains 1 bit for every fragment in the dictionary. A bit is set to 1 if the corresponding fragment is present in the molecule at least once. In about half of the fragments all hetero atoms have been replaced by wild cards. This way single atom replacements only cause a moderate drop of similarity, which reflects a chemist's natural similarity perception.

    In addition to calculating molecule similarities, DataWarrior uses the FragFp descriptor for a second purpose: the acceleration of substructure filtering. Since a substructure search is effectively a graph-matching algorithm and therefore computationally rather demanding, DataWarrior employs a pre-screening step that can quickly exclude most compounds from the graph matching. In this step DataWarrior determines a list of all dictionary fragments that are part of the substructure query. Molecules that don't contain all of the query's fragments cannot contain the query itself. Therefore, these are skipped in the graph-matching phase.
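
The pre-screening logic amounts to a subset test on fingerprint bits. The following is a minimal sketch under assumptions: fingerprints are held as Python integers, and the names `may_contain`, `prescreen` and the 4-bit toy library are hypothetical, not part of DataWarrior.

```python
def may_contain(query_fp: int, molecule_fp: int) -> bool:
    """True if every fragment bit set in the query is also set in the molecule."""
    return (query_fp & molecule_fp) == query_fp


def prescreen(query_fp: int, library: dict) -> list:
    """Return the names of molecules that still need the expensive graph matching."""
    return [name for name, fp in library.items() if may_contain(query_fp, fp)]


# Toy 4-bit fingerprints, purely illustrative.
library = {"mol_1": 0b1110, "mol_2": 0b0110, "mol_3": 0b1011}
query = 0b1010
print(prescreen(query, library))  # ['mol_1', 'mol_3']
```

A molecule that passes the pre-screen is only a candidate; whether it really contains the query still has to be decided by graph matching.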


    The PathFp Descriptor

    The PathFp descriptor encodes any linear strand of up to 7 atoms into a hashed binary fingerprint of 512 bits. To this end, every path of 7 or fewer atoms in the molecule is located. From every path an identifying text string is constructed in a normalized way, which encodes atomic numbers and bond orders. From the text string a hash value is created, which is used to set the respective bit of the fingerprint. The PathFp descriptor is conceptually very similar to the 'folded fingerprints' that software of Daylight Inc. uses for calculating chemical similarities.
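
The idea of a hashed path fingerprint can be sketched as follows. This is a simplified illustration, not DataWarrior's algorithm: the molecule representation (a list of atomic numbers plus a bond dictionary), the string format, the normalization rule (taking the smaller of the forward and backward string) and the use of CRC32 as hash function are all assumptions of this sketch.

```python
import zlib

MAX_ATOMS = 7   # longest path, as stated in the text
FP_BITS = 512   # fingerprint size, as stated in the text


def path_keys(atoms, bonds):
    """atoms: list of atomic numbers; bonds: {(i, j): bond_order}."""
    neighbours = {i: {} for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        neighbours[i][j] = order
        neighbours[j][i] = order

    def key_of(path):
        tokens = [str(atoms[path[0]])]
        for a, b in zip(path, path[1:]):
            tokens += [str(neighbours[a][b]), str(atoms[b])]
        # Normalize: a path read in either direction yields the same string.
        return min("-".join(tokens), "-".join(reversed(tokens)))

    keys = set()

    def extend(path):
        keys.add(key_of(path))
        if len(path) < MAX_ATOMS:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # linear strands only, no atom visited twice
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return keys


def path_fingerprint(atoms, bonds):
    fp = 0
    for key in path_keys(atoms, bonds):
        fp |= 1 << (zlib.crc32(key.encode()) % FP_BITS)  # set the hashed bit
    return fp


# Ethanol heavy atoms: C-C-O with single bonds.
ethanol_atoms, ethanol_bonds = [6, 6, 8], {(0, 1): 1, (1, 2): 1}
print(sorted(path_keys(ethanol_atoms, ethanol_bonds)))
```

Because different path strings may hash to the same bit, the number of set bits can be smaller than the number of distinct paths.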


    The SphereFp Descriptor

    The SphereFp descriptor encodes circular spheres of atoms and bonds into a hashed binary fingerprint of 1024 bits. To construct the fingerprint DataWarrior does the following for every atom in the molecule: The atom itself is considered and taken as a first fragment. Then, four times in a row, all direct neighbour atoms are added to the fragment, each time growing the previous fragment. This way five substructure fragments with increasing atom layer count (n=1 to 5) are built.

    Spheres showcased for two atoms when building the SphereFp descriptor.

    These circular fragments are then converted into a canonical representation that retains the aromaticity information even if rings are broken. The canonical representation is used to reproducibly generate a hash code, which is a number between 0 and 1023. Finally, the respective bit in the fingerprint is set to true.

    Fragments built from the two example atoms.
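
The construction steps described above can be sketched as follows. This is a rough illustration under assumptions: the molecule is given as a list of element symbols plus a bond dictionary, CRC32 stands in for the hash function, and `fragment_key` is only a crude stand-in for a real canonical representation (it ignores connectivity details and aromaticity, so distinct fragments may share a key).

```python
import zlib

FP_BITS = 1024  # fingerprint size, as stated in the text
LAYERS = 5      # fragments with 1 to 5 atom layers


def grow_spheres(neighbours, root):
    """Return the atom sets of the fragments grown around `root`, smallest first."""
    fragment = {root}
    spheres = [frozenset(fragment)]
    for _ in range(LAYERS - 1):
        # Add all direct neighbours of the atoms already in the fragment.
        fragment = fragment | {n for a in fragment for n in neighbours[a]}
        spheres.append(frozenset(fragment))
    return spheres


def fragment_key(symbols, bonds, atom_set):
    """Crude stand-in for a canonical fragment representation (assumption)."""
    atom_part = sorted(symbols[a] for a in atom_set)
    bond_part = sorted(
        "".join(sorted((symbols[i], symbols[j]))) + str(order)
        for (i, j), order in bonds.items()
        if i in atom_set and j in atom_set
    )
    return ".".join(atom_part) + "|" + ".".join(bond_part)


def sphere_fingerprint(symbols, bonds):
    neighbours = {i: set() for i in range(len(symbols))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    fp = 0
    for root in neighbours:
        for atom_set in grow_spheres(neighbours, root):
            key = fragment_key(symbols, bonds, atom_set)
            fp |= 1 << (zlib.crc32(key.encode()) % FP_BITS)  # hash code 0..1023
    return fp


# 1-Propanol heavy atoms: C-C-C-O with single bonds.
symbols, bonds = ["C", "C", "C", "O"], {(0, 1): 1, (1, 2): 1, (2, 3): 1}
print(bin(sphere_fingerprint(symbols, bonds)).count("1"), "bits set")
```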

    Circular (i.e. spherical) fingerprints are probably used more often for the calculation of molecular similarities than any other kind of descriptor, because similarities obtained from circular fingerprints feel intuitive for chemists. Therefore, they are often used for structure-activity and activity-cliff analyses. They even serve for virtual screening and machine learning, if structural similarity is an issue and scaffold hopping is not desired. Scaffold hopping would rather need a pharmacophore and/or 3D-shape similarity, or computationally more demanding methods to predict ligand-protein interactions. In the literature circular fingerprints are sometimes referred to as HOSE codes and are in use for spectroscopy prediction. Circular fingerprints cannot be used, however, in the pre-screening step of a substructure search.

    Other well-known examples of spherical descriptors are Morgan fingerprints and ECFP descriptors.


    The SkelSpheres Descriptor

    The SkeletonSpheres descriptor is the big brother of the SphereFp. While it is based on the same principle, it differs in two ways, which make it the preferred descriptor whenever structural similarity is needed. The first difference is a simple one: The fingerprint is not binary; it counts how many times a particular hash value occurs. The second difference is that every created circular fragment is used twice. First, its hash code selects the bin whose count is incremented, just as it selects the bit to be set in the SphereFp. Then, every non-carbon atom of the circular fragment is converted into a carbon atom, creating a carbon-only skeleton fragment. Like the original fragment, the skeleton fragment is converted into a hash code through its canonical representation, and this hash code is also used to increase the count of the respective descriptor bin.

    Together, both measures yield a more robust and intuitive similarity value, because half of the fragments don't change if just a single atom of the molecule is replaced. Other descriptors usually underestimate the similarity of two molecules that differ at one atom only.
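
The effect of counting every fragment twice, once as it is and once as a carbon-only skeleton, can be illustrated with a toy example. Everything here is an assumption made for illustration: fragments are given as hypothetical keys of element symbols joined by '-', CRC32 stands in for the hash function, and counts are capped at 255 on the assumption of one byte per bin.

```python
import zlib

BINS = 1024      # descriptor resolution, as stated in the text
MAX_COUNT = 255  # assumption: one byte per bin, counts saturate


def bin_index(key: str) -> int:
    return zlib.crc32(key.encode()) % BINS


def skeleton_of(key: str) -> str:
    """Replace every atom symbol by carbon (toy keys: symbols joined by '-')."""
    return "-".join("C" for _ in key.split("-"))


def count_descriptor(fragment_keys):
    counts = [0] * BINS
    for key in fragment_keys:
        # Every fragment is used twice: as it is and as carbon-only skeleton.
        for k in (key, skeleton_of(key)):
            i = bin_index(k)
            counts[i] = min(counts[i] + 1, MAX_COUNT)
    return counts


def shared_counts(a, b):
    """Number of counts that two descriptors have in common, bin by bin."""
    return sum(min(x, y) for x, y in zip(a, b))


# Two toy fragment lists that differ only in one hetero atom (O versus N).
alcohol = count_descriptor(["C", "C-C", "C-C-O", "C-O", "O"])
amine = count_descriptor(["C", "C-C", "C-C-N", "C-N", "N"])
print(sum(alcohol), sum(amine), shared_counts(alcohol, amine))
```

Each toy molecule contributes 10 counts. At least 7 of them are shared, because all five skeleton counts and the two carbon-only fragments coincide, although three of the five original fragments differ.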

    The SkeletonSpheres descriptor is the most accurate descriptor for the purpose of calculating similarities of chemical graphs. It considers aromaticity and stereo features. On the flip side, it needs more memory and similarity calculations take slightly longer. With fewer than a million compounds, this is rarely an issue. Technically, it is a byte array or vector with a resolution of 1024 bins.


    The OrgFunctions Descriptor

    The OrgFunctions descriptor perceives molecules with a focus on the available functional groups from a synthetic chemist's point of view. It also recognizes the steric or electronic features of the neighborhood of the functional groups. It perceives molecules as being very similar if they carry the same functional groups in similar environments, independent of the rest of the carbon skeletons.

    The OrgFunctions descriptor is neither a fingerprint nor an integer vector. It rather stores all synthetically accessible functions of the molecule in a finely grained way. DataWarrior distinguishes 1024 core functions, which typically overlap. Butenone for instance is recognized as vinyl-alkyl-ketone as well as a carbonyl-activated terminal alkene. All 1024 functional groups are organized in a tree structure that permits deriving similarities between related functions. These are taken into account, when the similarity between two molecules, i.e. OrgFunctions descriptors, is calculated.


    The Flexophore Descriptor

    The Flexophore descriptor allows predicting 3D-pharmacophore similarities. It provides an easy-to-use and yet powerful way to check, whether any two molecules may have a compatible protein binding behavior. A high Flexophore similarity indicates that a significant fraction of conformers of both molecules are similar concerning shape, size, flexibility and pharmacophore points. Different from common 3D-pharmacophore approaches, this descriptor matches entire conformer sets rather than comparing individual conformers, leading to higher predictability and taking molecular flexibility into account.

    The calculation of the Flexophore descriptor is computationally quite demanding. For a given molecule it starts with the creation of a representative set of up to 250 conformers using a self-organization based algorithm to construct small rigid molecule fragments, which are then connected with likely torsion angles. This conformer generation approach balances high diversity and conformer likelihood. Then, those atoms of the underlying molecule that have the potential to interact with protein atoms in any way are detected and classified. De facto, an enhanced MM2 atom type is used to describe each of these atoms as an interaction point. In some cases multiple atoms contribute to one summarized interaction point, e.g. in aromatic rings.

    Sample molecule with assigned interaction points

    A molecule's Flexophore descriptor now consists of a reduced, but complete graph of the original molecule with the interaction points being considered graph nodes. A graph edge between two nodes is encoded as a distance histogram between these nodes over all conformers. Since the Flexophore descriptor is a complete graph, every combination of any two nodes is encoded and stored as part of the descriptor. Thus, the descriptor creation as well as the similarity calculation from two descriptors depend heavily on the number of interaction points in each of them.

    Complete graph; distance histogram of highlighted edge
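
How the edges of such a complete graph could be encoded as distance histograms over a conformer set is sketched below. This is an illustration only: the input format (one coordinate triple per interaction point and conformer), the bin width of 1.0 and the number of 20 bins are assumptions, not DataWarrior's actual parameters.

```python
from itertools import combinations
from math import dist

BIN_WIDTH = 1.0  # assumption: histogram bin width in distance units
N_BINS = 20      # assumption: number of histogram bins per edge


def edge_histograms(conformers):
    """conformers: list of conformers, each a list of (x, y, z) per interaction point."""
    n_points = len(conformers[0])
    # Complete graph: one histogram for every pair of interaction points.
    histograms = {pair: [0] * N_BINS for pair in combinations(range(n_points), 2)}
    for coords in conformers:
        for i, j in histograms:
            d = dist(coords[i], coords[j])
            # Distances beyond the last bin are collected in the last bin.
            histograms[(i, j)][min(int(d / BIN_WIDTH), N_BINS - 1)] += 1
    return histograms


# Three interaction points in two toy conformers.
conformers = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.2, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 4.5, 0.0)],
]
histograms = edge_histograms(conformers)
print(len(histograms))        # 3 edges for 3 nodes
print(histograms[(0, 2)][:6])  # [0, 0, 0, 1, 1, 0]
```

With n interaction points the graph has n(n-1)/2 edges, which is why descriptor size and comparison effort grow quickly with the number of interaction points.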

    The calculation of the similarity between two Flexophore descriptors involves a graph matching algorithm that not only tries to match the largest possible subgraphs, but also tries to maximize edge and node similarities. Edge similarities are derived from the distance histogram overlaps, and node similarities are taken from an interaction point (extended MM2 atom type) similarity matrix, which was originally derived from a ligand-protein interaction analysis of the PDB database.


    The RxnFp Descriptor for Reactions

    Just as the descriptors above describe certain molecule features, the RxnFp descriptor comprises features that describe a chemical reaction. It is related to the SphereFp descriptor, because it is a binary fingerprint whose bits encode circular fragments of atoms and their environment. This descriptor uses 1024 bits, of which the first half is dedicated to describing the reaction center and the second half encodes atoms that don't take part in the reaction, i.e. atoms that neither lose nor gain a neighbor atom nor change a bond order to any of their neighbor atoms. Thus, this descriptor relies on a proper mapping of product atoms to reactant atoms. It cannot be calculated for unmapped reactions.

    When comparing reactions, reaction center atoms and their change of bonding are more important than structural features of reactants and products that don't take part in the reaction. Therefore, when calculating reaction similarities from this descriptor, DataWarrior independently calculates reaction center similarity and reaction periphery similarity. Then it creates a weighted mean of both values with emphasis on the reaction center similarity. These weighted mean similarity values are used by DataWarrior whenever an analysis method requires an opaque row similarity value. Examples are a two-dimensional scaling to visualize reaction space or the clustering of reactions into groups of similar ones.
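
The combination of both similarity values can be sketched as follows. This is a minimal illustration under assumptions: the fingerprint is held as a Python integer whose lower 512 bits are taken as the reaction center half, Tanimoto similarity is applied to each half, and the weight of 0.8 for the reaction center is a made-up value, since the actual weighting is not stated here.

```python
HALF_BITS = 512
CENTER_MASK = (1 << HALF_BITS) - 1
CENTER_WEIGHT = 0.8  # assumption: the real weighting is not given in the text


def tanimoto(fp_a: int, fp_b: int) -> float:
    either = bin(fp_a | fp_b).count("1")
    return 1.0 if either == 0 else bin(fp_a & fp_b).count("1") / either


def split(fp: int):
    """Split a 1024-bit reaction fingerprint into (center, periphery) halves."""
    return fp & CENTER_MASK, fp >> HALF_BITS


def reaction_similarity(fp_a: int, fp_b: int, center_weight: float = CENTER_WEIGHT):
    center_a, periphery_a = split(fp_a)
    center_b, periphery_b = split(fp_b)
    center_sim = tanimoto(center_a, center_b)
    periphery_sim = tanimoto(periphery_a, periphery_b)
    combined = center_weight * center_sim + (1.0 - center_weight) * periphery_sim
    return center_sim, periphery_sim, combined


# Toy fingerprints: identical reaction center, partly different periphery.
rxn_a = (0b0110 << HALF_BITS) | 0b1011
rxn_b = (0b1100 << HALF_BITS) | 0b1011
center, periphery, combined = reaction_similarity(rxn_a, rxn_b)
print(center, round(periphery, 3), round(combined, 3))  # 1.0 0.333 0.867
```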

    In contrast, the reaction filter allows using both reaction similarity values independently. Therefore, when searching large reaction collections, one may first dig into reactions that contain the same transformation as a drawn query reaction. In a second step one may then increase the periphery similarity threshold to check, whether there are reactions among the already found ones that share structural elements not taking part in the reaction.
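
Such a two-step search can be pictured as two successive threshold filters. The sketch below is purely illustrative: the function name, the threshold values and the dictionary of precomputed (center similarity, periphery similarity) pairs are assumptions and do not reflect how the reaction filter is implemented.

```python
def two_step_filter(hits, center_threshold, periphery_threshold):
    """hits: {reaction_id: (center_similarity, periphery_similarity)}."""
    # Step 1: keep reactions that show the same transformation as the query.
    same_transformation = {
        rid: sims for rid, sims in hits.items() if sims[0] >= center_threshold
    }
    # Step 2: among those, keep reactions with a similar periphery as well.
    refined = {
        rid: sims
        for rid, sims in same_transformation.items()
        if sims[1] >= periphery_threshold
    }
    return same_transformation, refined


hits = {"rxn_1": (1.0, 0.9), "rxn_2": (1.0, 0.2), "rxn_3": (0.4, 0.95)}
step1, step2 = two_step_filter(hits, center_threshold=0.95, periphery_threshold=0.8)
print(sorted(step1), sorted(step2))  # ['rxn_1', 'rxn_2'] ['rxn_1']
```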