openmolecules.org

 
Re: Feature query - molecular fingerprints and library diversity [message #952 is a reply to message #951] Mon, 15 June 2020 21:26
thomas
I am not sure whether I understand your questions correctly. Is it the following?

1) How many distinct descriptors do we have within a given compound library?

The topological descriptors in DataWarrior consist of 512 or even 1024 fragments, or hash codes derived from fragments. Different structures therefore usually contain different combinations of fragments and thus have different descriptors. Except for very rare cases, the number of different compounds equals the number of different descriptors. By creating a list of duplicate structures or by removing duplicate structures, you can easily determine the number of distinct structures, which is usually also the number of distinct descriptors. You can use a trick to make the encoded descriptor visible and to remove rows with duplicate descriptors directly: open the dwar file in a text editor and remove the four column property lines of a descriptor column, which look like this:
<columnName="FragFp">
<columnProperty="parent Structure">
<columnProperty="specialType FragFp">
<columnProperty="version 1.2.1">

After removing these lines, DataWarrior no longer knows that the column contains descriptors associated with the chemical structure. Instead, it shows the text-encoded descriptors in a visible column, which you can use to remove duplicates.
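If you prefer not to edit the file by hand, the same trick can be scripted. The following is only a minimal sketch, assuming the header of the dwar file looks like the four lines shown above (a `columnName` line followed directly by its `columnProperty` lines). The function names and the default column name are made up for illustration and are not part of DataWarrior, and it is safest to run this on a copy of the file.

```python
def strip_descriptor_header(lines, column_name="FragFp"):
    """Return a copy of `lines` without the header block of `column_name`.

    Assumption: the block is one <columnName="..."> line followed
    directly by its <columnProperty=...> lines, as in the post above.
    """
    target = f'<columnName="{column_name}">'
    kept = []
    skipping = False
    for line in lines:
        stripped = line.strip()
        if stripped == target:
            skipping = True   # drop the columnName line itself
            continue
        if skipping and stripped.startswith("<columnProperty="):
            continue          # drop the property lines that belong to it
        skipping = False      # any other line ends the block
        kept.append(line)
    return kept


def write_stripped_copy(src_path, dst_path, column_name="FragFp"):
    """Read a dwar file and write a modified copy (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_descriptor_header(lines, column_name))
```

Calling `write_stripped_copy("library.dwar", "library_visible.dwar")` with your own file names would then produce a copy in which the descriptor column shows up as plain text.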
Whether some very similar structures actually end up with the same descriptor depends on the descriptor. The simple descriptors do not encode stereochemistry, so different stereoisomers indeed have the same descriptor. The SkeletonSpheres descriptor does not have this issue, and you will have a hard time finding two different molecules with identical SkeletonSpheres descriptors.
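Once the descriptor is visible as text, you could also count the distinct descriptors outside of DataWarrior. This is a rough sketch, assuming the table was saved as tab-delimited text with a header row and that the descriptor column is called `FragFp`. Both the column name and the function name are placeholders.

```python
import csv
from collections import Counter


def summarize_descriptors(tsv_lines, column="FragFp"):
    """Count rows, distinct descriptors, and descriptors shared by rows.

    `tsv_lines` is any iterable of tab-delimited text lines whose first
    line is the header row (an assumption about the exported file).
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total_rows = sum(counts.values())
    distinct = len(counts)
    # Descriptors that occur in more than one row, e.g. stereoisomers
    # when the descriptor does not encode stereochemistry.
    shared = {desc: n for desc, n in counts.items() if n > 1}
    return total_rows, distinct, shared
```

With an open file handle as input, comparing `total_rows` and `distinct` shows how many rows share a descriptor with another row, and `shared` lists which encoded descriptors those are.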


Or are you asking this?

1) How many of the 227787 fragments listed in the paper (or any other fragment collection) are a substructure of at least one compound of a given compound collection?

2) How can one create a list of the fragments found in a given molecule, including their counts?

Neither can be done with DataWarrior. But with a little Java programming using the open-source framework OpenChemLib, that wouldn't be a big undertaking.

Please let me know if I misinterpreted your question.

Thomas
 