openmolecules.org

 
Re: Metrics to use for UMAP analysis with SkelSpheres descriptor [message #1729 is a reply to message #1725] Thu, 08 September 2022 22:53
nbehrnd
Dear Christophe,

curious about the description «Technically, it [the SkelSpheres Descriptor] is a byte vector with a resolution of 1024 bins.» (DW help menu, chapter Similarity & Descriptors, section Molecule or Reaction Similarity and Descriptors), I found an open access publication[1] that goes into a bit more detail. In section 4.2, the authors describe the descriptor as follows:

«This descriptor was developed by Actelion. It is a vector of integers which represents the occurrence of different substructures in a molecule. Five circular layers with increasing bond distance are located for each atom in the molecule. Hydrogen atoms are not considered. This results in five fragments starting with the naked central atom, adding one layer at a time. Every fragment is encoded as a canonical string (id-code), similar to the generation of canonical SMILES. The canonical id-code includes the stereochemistry of the encoded fragment, which is a feature missing in other molecular descriptors. The string is then assigned to one of 1024 fields n in a vector. Therefore, the hash value of the id-code is calculated and the corresponding value in the vector is increased by one. The Hashlittle algorithm from Jenkins is used as a binning function which takes a text string as input and returns an integer value between 0 (inclusive) and 1024 (exclusive). [...] To consider the molecular scaffold without the influence of the hetero atoms, the whole calculation is repeated while replacing the hetero atoms with carbon. The resulting hash values are used to increment the corresponding fields in the vector. By adding this skeleton information to the descriptor vector the similarity calculation between two descriptor vectors becomes a bit insensitive to the exact position of the hetero atoms in two molecules. This directs the similarity value toward the perception of similarity by medicinal chemists. For them the exact position of a hetero atom is not as discriminating as it would be for the spheres descriptor without the skeleton coding part. The additional consideration of the scaffold information and the use of a histogram instead of a binary vector distinguishes the SkeletonSpheres descriptor from circular fingerprints.»
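To make the binning step in the quoted paragraph concrete for myself, here is a minimal sketch in Python. It rests on several assumptions: `zlib.crc32` is only a stand-in for Jenkins' hashlittle, the fragment strings are hypothetical placeholders rather than real id-codes, and the names `N_BINS`, `bin_index` and `histogram` are mine. It is not DataWarrior's implementation; it only shows how hashing fragment strings into 1024 fields yields a histogram rather than a binary fingerprint.

```python
import zlib
from typing import Iterable, List

N_BINS = 1024  # number of fields in the vector, as stated in the quoted text


def bin_index(fragment_code: str) -> int:
    """Map a fragment string to a field index in [0, N_BINS).

    Assumption: crc32 replaces the hashlittle function named in the paper.
    """
    return zlib.crc32(fragment_code.encode("utf-8")) % N_BINS


def histogram(fragment_codes: Iterable[str]) -> List[int]:
    """Count how many fragment strings fall into each of the N_BINS fields."""
    counts = [0] * N_BINS
    for code in fragment_codes:
        counts[bin_index(code)] += 1  # increment, do not merely set a bit
    return counts


# Hypothetical fragment strings (placeholders, not real id-codes).
# First pass: fragments with hetero atoms kept.
hetero_fragments = ["N", "N(C)C", "O", "O=C", "C", "C(N)=O"]
# Second pass: the same fragments with hetero atoms replaced by carbon.
skeleton_fragments = ["C", "C(C)C", "C", "C=C", "C", "C(C)=C"]

# Both passes increment the same vector, as the quoted text describes.
descriptor = histogram(hetero_fragments + skeleton_fragments)
```

Because every fragment increments a field, the sum over all fields equals the number of fragments hashed, and two different fragments may share a field whenever their hash values collide.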

So far, however, I don't understand either what "byte" in "byte vector" refers to when the elements of the vector are integer counts, which could be crucial for the choice of metric.
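One possible reading, and this is purely my assumption and not something the help text or the paper confirms, is that each of the 1024 counts is stored in a single byte, so that large counts have to be capped. The sketch below illustrates that reading only; the function name `to_byte_vector` and the cap of 127 (the largest value of a signed byte) are hypothetical choices.

```python
from typing import Sequence


def to_byte_vector(counts: Sequence[int], cap: int = 127) -> bytes:
    """Store each count in one byte, clamping values above `cap`.

    Assumption: the cap of 127 is illustrative, not taken from DataWarrior.
    """
    return bytes(min(c, cap) for c in counts)


# Three example counts; the last one exceeds the cap and is clamped.
packed = to_byte_vector([0, 3, 200])
```

If this reading holds, the vector would still behave as a histogram of small integers, just with an upper limit per field.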

Norwid

[1] The Screening Compound Collection: A Key Asset for Drug Discovery. C. Boss, J. Hazemann, T. Kimmerlin, M. von Korff, U. Lüthi, O. Peter, T. Sander, R. Siegrist, Chimia 2017, 71, 667-677, DOI: 10.2533/chimia.2017.667 (https://chimia.ch/chimia/article/view/2017_667 open access).
 