openmolecules.org Forum: Functionality » Assign cluster name based on cluster size

Assign cluster name based on cluster size [message #1587]

Thu, 07 April 2022 13:46

mcmc
Messages: 23
Registered: April 2018

Junior Member

It looks as if the cluster numbers generated by the "cluster compounds" algorithm are assigned rather arbitrarily (I guess in chronological order).
I feel it could be useful to number clusters by size instead. That is, cluster 1 would be the largest one, cluster 2 the next largest, etc.
Probably an easy fix, yet very helpful?

Re: Assign cluster name based on cluster size [message #1589 is a reply to message #1587]

Sat, 09 April 2022 20:29

nbehrnd
Messages: 240
Registered: June 2019

Senior Member

The following doesn't offer the automatic sort (and perhaps eventual cumulative distribution) you are looking for.

Yet because each molecule is labeled with the integer of its cluster, one may use this column and ask DataWarrior to plot a histogram (bin size 1, starting at 1 along the abscissa, number of molecules per bin along the ordinate) to identify the most populated bin, or to check whether the distribution of bins is perhaps bimodal or multimodal. For smaller numbers of bins, it might be helpful to add the number of molecules (view options for statistical graphs) to the drawing.
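
Outside of DataWarrior, the same per-cluster count can be sketched in a few lines of Python. This is only an illustration: the function names and the example labels are made up here, and the input is assumed to be one integer cluster label per molecule.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return (cluster label, member count) pairs, largest cluster first."""
    counts = Counter(labels)
    # Sort by descending size; break ties by ascending label.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def text_histogram(labels):
    """Render one line per cluster: label, a bar of '#', and the count."""
    return [f"{label:>4} {'#' * n} {n}" for label, n in cluster_sizes(labels)]


# Hypothetical cluster labels, one entry per molecule.
example_labels = [1, 1, 2, 3, 3, 3, 4]
for line in text_histogram(example_labels):
    print(line)
```

The most populated cluster ends up on the first line, which gives the same information as reading the tallest bar off the histogram.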

Norwid

Attachment: example.png
(Size: 8.55KB, Downloaded 482 times)

Re: Assign cluster name based on cluster size [message #1590 is a reply to message #1589]

Mon, 11 April 2022 17:27

mcmc
Messages: 23
Registered: April 2018

Junior Member

Thanks Norwid. Meanwhile I realized that with the current cluster numbering, similar clusters tend to have adjacent numbers. That is, cluster 404 resembles 405. I guess that has some advantages too.

I also observed that a SALI analysis provides a "neighbor count", which seems to be the same as the cluster size (minus 1). That, in turn, gives a filter that can be used to zoom in on the most populated clusters.

[Updated on: Mon, 11 April 2022 17:28]

Re: Assign cluster name based on cluster size [message #1598 is a reply to message #1590]

Fri, 22 April 2022 22:46

nbehrnd
Messages: 240
Registered: June 2019

Senior Member

Hello mcmc,

I just completed a small Python script to process DataWarrior's results about structure similarity (Chemistry -> Cluster Compounds) exported as a text file (File -> Save Special -> Textfile). It identifies the clusters, sorts them by the number of molecules in each cluster, updates the molecules' cluster labels (1, 2, 3, ...) accordingly, and writes a new .txt file one may open in DW (Ctrl + O). Two sort orders are possible: a) «the more molecules in the cluster, the smaller the integer used as label of the cluster», the pattern that probably matches your intent best; b) with the optional flag -r, the reversed sort, «the more molecules in the cluster, the greater the label».

The .zip archive attached below includes the .py script and describes early results when processing a small set of test data. It assumes the first column labeled «Cluster No» contains the cluster labels assigned by DataWarrior (which is the program's default header).
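
For readers without the attachment, the relabeling idea can be sketched as follows. This is not the attached script, only a minimal illustration under stated assumptions: the export is plain tab-separated text with a header line, the cluster labels in «Cluster No» are integers, and the function name `relabel_by_size` and its `reverse` parameter are hypothetical stand-ins (the latter mirroring the -r flag described above).

```python
from collections import Counter


def relabel_by_size(text, column="Cluster No", reverse=False):
    """Renumber cluster labels in a tab-separated table by cluster size.

    By default the largest cluster becomes 1; with reverse=True the
    largest cluster receives the highest label instead.
    """
    lines = text.splitlines()
    header = lines[0].split("\t")
    col = header.index(column)  # raises ValueError if the column is missing
    rows = [line.split("\t") for line in lines[1:] if line]

    counts = Counter(row[col] for row in rows)
    # Largest cluster first; ties are broken by the original numeric label.
    ordered = sorted(counts, key=lambda label: (-counts[label], int(label)))
    if reverse:
        ordered.reverse()
    new_label = {old: str(i) for i, old in enumerate(ordered, start=1)}

    for row in rows:
        row[col] = new_label[row[col]]
    return "\n".join("\t".join(r) for r in [header] + rows) + "\n"


# Hypothetical miniature export: cluster 3 is the largest, cluster 1 the smallest.
example = "Cluster No\tName\n1\ta\n2\tb\n2\tc\n3\td\n3\te\n3\tf\n"
print(relabel_by_size(example))
```

Rows keep their original order; only the labels change, so the result can still be sorted or filtered on «Cluster No» after reopening it.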

Norwid

Attachment: 2022-04-26_datawarrior_clustersort.zip
(Size: 36.29KB, Downloaded 518 times)

[Updated on: Tue, 26 April 2022 11:31]

Re: Assign cluster name based on cluster size [message #1695 is a reply to message #1598]

Fri, 05 August 2022 07:46

DrCJM
Messages: 5
Registered: September 2019
Location: Australia

Junior Member

Late comment!

It's easy to make a new column with the number of compounds in each cluster - all cluster members will obviously have the same number.

After clustering, which creates the "Cluster No" column, I calculate a new column (usually called "Cluster Count") with the function:

frequency(ClusterNo, "Cluster No")

You can then sort on Cluster Count or resize markers on graphs by Cluster Count etc. Very useful if you just want to show the Representative compounds on a graph, but size them by the number of similar compounds they represent.

So this doesn't rename the clusters, but it allows you to do the sorts of things you might want to do after renaming them.
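
For data that has already been exported, the same per-row count can be sketched in Python. This is only an illustration of the idea, not DataWarrior's implementation: the rows are assumed to be dictionaries keyed by column title, and `add_cluster_count` is a hypothetical helper name.

```python
from collections import Counter


def add_cluster_count(rows, column="Cluster No", new_column="Cluster Count"):
    """Return copies of rows with the size of each row's cluster attached."""
    counts = Counter(row[column] for row in rows)
    return [{**row, new_column: counts[row[column]]} for row in rows]


# Hypothetical rows, one per compound.
rows = [
    {"Cluster No": 1, "Name": "a"},
    {"Cluster No": 2, "Name": "b"},
    {"Cluster No": 2, "Name": "c"},
]
counted = add_cluster_count(rows)
# Sort so that members of the most populated clusters come first.
counted.sort(key=lambda row: -row["Cluster Count"])
for row in counted:
    print(row)
```

Every member of a cluster carries the same count, so the new column works for sorting as well as for sizing markers.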

Craig.
