openmolecules.org

 
Assign cluster name based on cluster size [message #1587] Thu, 07 April 2022 13:46
mcmc
Messages: 12
Registered: April 2018
Junior Member
It looks as if the cluster numbers generated by the "Cluster Compounds" algorithm are assigned rather arbitrarily (I guess in chronological order, i.e. in the order the clusters are created).
I feel it could be useful to sort clusters by size. That is, cluster 1 would be the largest one, then 2, etc.
Probably an easy fix, yet very helpful?
Re: Assign cluster name based on cluster size [message #1589 is a reply to message #1587] Sat, 09 April 2022 20:29
nbehrnd
Messages: 131
Registered: June 2019
Senior Member
The following doesn't offer the automatic sort (and perhaps eventual cumulative distribution) you are looking for.

Yet because each molecule is labeled with the integer of its cluster, you can use that column and ask DataWarrior to plot a histogram (bin size 1, starting at 1 on the abscissa; number of molecules per bin on the ordinate) to identify the most populated bin, or to check whether the distribution across bins is bimodal or polymodal. For a small number of bins, it may help to add the number of molecules to the drawing (view options for statistical graphs).

Norwid
  • Attachment: example.png
    (Size: 8.55KB, Downloaded 30 times)
Re: Assign cluster name based on cluster size [message #1590 is a reply to message #1589] Mon, 11 April 2022 17:27
mcmc
Messages: 12
Registered: April 2018
Junior Member
Thanks Norwid. Meanwhile I realized that with the current cluster numbering, similar clusters tend to have adjacent numbers. That is, cluster 404 resembles 405. I guess that has some advantages too.

Also, I observed that a SALI analysis provides a "neighbor count", which seems to be the same as the cluster size (minus 1). That, in turn, gives a filter that can be used to zoom in on the most populated clusters.

[Updated on: Mon, 11 April 2022 17:28]

Re: Assign cluster name based on cluster size [message #1598 is a reply to message #1590] Fri, 22 April 2022 22:46
nbehrnd
Messages: 131
Registered: June 2019
Senior Member
Hello mcmc,

I just completed a small Python script to process DataWarrior's structure-similarity results (Chemistry -> Cluster Compounds) exported as a text file (File -> Save Special -> Textfile). It identifies the clusters, sorts them by the number of molecules in each cluster, updates the molecules' cluster labels (1, 2, 3, ...) accordingly, and writes a new .txt file that DW can read (Ctrl + O). Two sort orders are possible: a) «the more molecules in the cluster, the smaller the integer used as label of the cluster», which probably matches your intent best; b) with the optional flag -r, the sort is reversed to «the more molecules in the cluster, the greater the label».

The .zip archive attached below includes the .py script and describes early results when processing a small set of test data. It assumes the first column labeled «Cluster No» contains the cluster labels assigned by DataWarrior (which is the program's default header).
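
For readers without the attachment, the relabeling described above can be sketched roughly as follows. This is not the attached script, only a minimal illustration under stated assumptions: the export is tab-delimited with a header row, the cluster column is named «Cluster No», and fields contain no quoting that would confuse Python's csv module. The names relabel_by_size and relabel_file are made up for this example; the reverse argument plays the role of the -r flag.

```python
import csv
from collections import Counter


def relabel_by_size(rows, column="Cluster No", reverse=False):
    """Return copies of rows with cluster labels renumbered by cluster size.

    Default: the most populated cluster becomes 1. With reverse=True the
    most populated cluster receives the highest label instead. Clusters of
    equal size keep the order in which they first appear in the table.
    """
    sizes = Counter(row[column] for row in rows)
    # sorted() is stable, so ties stay in first-appearance order.
    ordered = sorted(sizes, key=sizes.get, reverse=not reverse)
    new_label = {old: str(rank) for rank, old in enumerate(ordered, start=1)}
    return [{**row, column: new_label[row[column]]} for row in rows]


def relabel_file(src, dst, column="Cluster No", reverse=False):
    """Read a tab-delimited export (assumed layout), relabel, write a copy."""
    with open(src, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        fields = reader.fieldnames
        rows = list(reader)
    with open(dst, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        writer.writerows(relabel_by_size(rows, column, reverse))
```

A call such as relabel_file("clusters.txt", "clusters_sorted.txt") would then write the renumbered table; both file names are placeholders.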

Norwid

[Updated on: Tue, 26 April 2022 11:31]

Re: Assign cluster name based on cluster size [message #1695 is a reply to message #1598] Fri, 05 August 2022 07:46
DrCJM
Messages: 4
Registered: September 2019
Location: Australia
Junior Member
Late comment!

It's easy to make a new column with the number of compounds in each cluster - all cluster members will obviously have the same number.

After clustering, which creates the "Cluster No" column, I calculate a new column (usually called "Cluster Count") with the function:

frequency(ClusterNo, "Cluster No")

You can then sort on Cluster Count or resize markers on graphs by Cluster Count etc. Very useful if you just want to show the Representative compounds on a graph, but size them by the number of similar compounds they represent.
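
If the table has already been exported and is processed outside DataWarrior, the same per-row count can be approximated in Python. The sketch below only mirrors what the calculated column is described as doing here (each row receives the number of rows sharing its «Cluster No» value); it is not DataWarrior's implementation, and the function name add_cluster_count is invented for the example.

```python
from collections import Counter


def add_cluster_count(rows, column="Cluster No", count_column="Cluster Count"):
    """Return copies of rows with an extra column holding the cluster size.

    Every member of a cluster receives the same count, i.e. the number of
    rows that share its value in `column`.
    """
    counts = Counter(row[column] for row in rows)
    return [{**row, count_column: counts[row[column]]} for row in rows]


# Example: three molecules in cluster 7, one in cluster 2 (made-up data).
table = [
    {"Cluster No": "7", "ID": "m1"},
    {"Cluster No": "2", "ID": "m2"},
    {"Cluster No": "7", "ID": "m3"},
    {"Cluster No": "7", "ID": "m4"},
]
counted = add_cluster_count(table)
# Sorting on the new column puts members of the largest cluster first.
by_size = sorted(counted, key=lambda row: row["Cluster Count"], reverse=True)
```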

So this doesn't rename the clusters, but it lets you do the sorts of things you might want to do after renaming them.

Craig.