openmolecules.org

 
Home » DataWarrior » Functionality » Question about R-group occurences estimation
Question about R-group occurences estimation [message #757] Tue, 21 January 2020 16:02 Go to next message
jvolvr is currently offline  jvolvr
Messages: 1
Registered: January 2020
Location: Brazil
Junior Member
Is there a way to calculate the number of occurrences of a given group (e.g. hydroxyls, methyls) in my data set through the SMILES of my compounds?
Re: Question about R-group occurences estimation [message #759 is a reply to message #757] Tue, 21 January 2020 21:14 Go to previous messageGo to next message
thomas is currently offline  thomas
Messages: 273
Registered: June 2014
Senior Member
not yet, but it absolutely makes sense to introduce a custom substructure count feature. The substructure search itself is able to count occurrences, thus this is more matter of where to place the functionality in the user interface. I will think about it... Thomas
Re: Question about R-group occurences estimation [message #773 is a reply to message #759] Tue, 11 February 2020 14:53 Go to previous messageGo to next message
nbehrnd is currently offline  nbehrnd
Messages: 26
Registered: June 2019
Junior Member
There are two possible difficulties to retrieve methyl groups as-such in a list of SMILES. For one, the characteristic of them is the (single) carbon atom and searching just for this is less identifying than the string of c1ccccc1 about benzene, for example. Second, probably there would be a need to add explicit hydrogens on all SMILES before these would be easier to identify (which may be done, e.g., with babel).

If you have access to Python, then the additional module by rdkit (http://rdkit.org/) may be quite helpful to querry your SMILES with SMARTS. With the test file of smiles_list.smi attached below, the identification of methyl groups (in SMART's convention, expressed as [CH3]) works fine both locally -- per SMILES entry -- as well as in counting the globally:

from rdkit import Chem
smiles_source = "smiles_list.smi"
grand_total = 0

# example pattern to identify and count:
functional_group = Chem.MolFromSmarts('[CH3]')  #  methyl group

# alternative examples:
#functional_group = Chem.MolFromSmarts('c1ccccn1')  # for a pyridine
#functional_group = Chem.MolFromSmarts('C1CCCCC1')  # for cyclohexane

with open(smiles_source, mode="r") as source_file:
    for index, line in enumerate(source_file, start=1):
        molecule = Chem.MolFromSmiles(line.strip())
        match = molecule.GetSubstructMatches(functional_group)

        print("{:3} matches in entry {:2}: {}.".format(
            len(match), index, line.strip()))

        grand_total += len(match)

print("\nIn total {} instances were identified.".format(grand_total))


index.php?t=getfile&id=145&private=0
index.php?t=getfile&id=146&private=0

As DataWarrior relies on java, the implementation of the «ErtlFunctionalGroupsFinder» described by Fritsch et al. ( https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0 19-0361-8, open access) equally may be of interest for Thomas.
  • Attachment: example.png
    (Size: 30.77KB, Downloaded 21 times)
  • Attachment: listing.png
    (Size: 51.00KB, Downloaded 18 times)
  • Attachment: smiles_list.smi
    (Size: 0.12KB, Downloaded 2 times)
  • Attachment: example.py
    (Size: 0.76KB, Downloaded 3 times)
Re: Question about R-group occurences estimation [message #775 is a reply to message #773] Wed, 12 February 2020 19:40 Go to previous message
thomas is currently offline  thomas
Messages: 273
Registered: June 2014
Senior Member
This is now built in: "Chemistry->From Chemical Structure->Add Substructure Count..."
This opens a dialog to
-select the structure column
-to draw a fragment with optional query features
-and to define whether counted fragments may overlap on some atoms

It is in the current beta and will be in the next official version that is due in a few days
Previous Topic: table view missing
Next Topic: Hyperlinks
Goto Forum:
  


Current Time: Wed Feb 26 09:08:07 CET 2020

Total time taken to generate the page: 0.00775 seconds