openmolecules.org

 
Why is clustering limited up to 10000 molecules? [message #75] Wed, 22 July 2015 16:37
Nastasia
Messages: 3
Registered: July 2015
Junior Member
I am trying to cluster my molecules, but I have far more than 10000 of them, about 50000 in fact. Some of them are duplicates that could be merged, but when I do that I lose the individual descriptor values that matter for each case, because they are just merged into one multi-value group. Is it possible to increase the size limit for clustering?
Re: Why is clustering limited up to 10000 molecules? [message #78 is a reply to message #75] Thu, 23 July 2015 21:40
thomas
Messages: 155
Registered: June 2014
Senior Member
Clustering is a very old functionality and not used very often. The algorithm used is reproducible and, unlike other, faster algorithms, does not depend on a random starting point. The flip side, however, is that it is not particularly fast, and its memory consumption and processor usage grow quadratically with the size of the data file. Without a limit, people would experience error messages or extreme processing times, causing lots of frustration. I am aware that DataWarrior needs an alternative clustering algorithm, and it is on the feature list for the future, but until now I have considered clustering low priority, because there are often better methods to reach a certain goal.

Kind regards, Thomas
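
Editor's note: to make the quadratic growth concrete, here is a minimal sketch of how the number of compound pairs scales with the row count. It assumes only that the work is proportional to the number of distinct pairs, n*(n-1)/2; the function name is made up for illustration and nothing here is taken from the DataWarrior source.

```python
def pairwise_cost_ratio(n_small: int, n_large: int) -> float:
    """Factor by which the number of distinct pairs grows
    when the row count goes from n_small to n_large."""
    pairs_small = n_small * (n_small - 1) // 2
    pairs_large = n_large * (n_large - 1) // 2
    return pairs_large / pairs_small


if __name__ == "__main__":
    # Going from the 10000-row limit to the 50000 rows asked about above.
    ratio = pairwise_cost_ratio(10_000, 50_000)
    print(f"5x more rows -> about {ratio:.0f}x more pairs")
```

Under that assumption, five times as many rows means roughly 25 times as much memory and processing.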
Re: Why is clustering limited up to 10000 molecules? [message #154 is a reply to message #78] Sun, 28 February 2016 19:03
avkitex
Messages: 2
Registered: February 2016
Junior Member
Dear authors, what if I would like to cluster ~100000 compounds?
I understand that it could consume a huge amount of time and RAM.

Best,
Nikita
Re: Why is clustering limited up to 10000 molecules? [message #157 is a reply to message #154] Thu, 03 March 2016 21:49
thomas
Messages: 155
Registered: June 2014
Senior Member
Dear Nikita, the current DataWarrior version is limited to 10.000 compounds. For 100.000 compounds the current algorithm would need 20GB of memory and many hours to finish. The only way to do it would be to patch the datawarrior.jar file (OSX or Linux only) or to change the source code and to recompile.

Regards, Thomas
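
Editor's note: the 20GB figure quoted above is about what a back-of-the-envelope estimate gives. The sketch below assumes one 4-byte value per distinct compound pair (half of a symmetric similarity matrix). Both the storage layout and the 4-byte size are assumptions for illustration, not confirmed details of DataWarrior's implementation, and the function name is hypothetical.

```python
def pair_matrix_bytes(n: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold one value per distinct pair of n items.
    bytes_per_value=4 is an assumed single-precision float."""
    return n * (n - 1) // 2 * bytes_per_value


if __name__ == "__main__":
    for n in (10_000, 20_000, 100_000):
        gb = pair_matrix_bytes(n) / 1e9  # decimal gigabytes
        print(f"{n:>7} compounds: ~{gb:.1f} GB")
```

For 100000 compounds this comes to just under 20 GB (decimal), in line with the estimate in the post.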
Re: Why is clustering limited up to 10000 molecules? [message #158 is a reply to message #157] Thu, 03 March 2016 21:54
avkitex
Messages: 2
Registered: February 2016
Junior Member
Dear Thomas,

I have servers with about 30 GB of RAM and a lot of time, and I would like to try doing it.
Can you tell me how to patch and recompile the jar file? And what is the best way to run it in console mode? (I have molecules in SDF or MOL2 format. I would like to get a clustering tree in Newick or a compatible format.)

Looking forward to hearing from you.

Best,
Nikita
Re: Why is clustering limited up to 10000 molecules? [message #159 is a reply to message #158] Fri, 04 March 2016 22:24
thomas
Messages: 155
Registered: June 2014
Senior Member
Dear Nikita,

if you use Linux or OSX, I will send you a customized datawarrior.jar file without the limitation. But don't complain if the clustering process seems to take forever...

If your server runs Linux, you can install the datawarrior directory on the server, log in remotely via 'ssh -X', and then launch DataWarrior the usual way. All DataWarrior windows then open on the client machine, while the program executes on the server and uses server RAM and server cores.

Best wishes, Thomas
Re: Why is clustering limited up to 10000 molecules? [message #252 is a reply to message #159] Wed, 22 March 2017 11:34
dataviz
Messages: 4
Registered: November 2016
Junior Member
Hello,
It would indeed be really great to be able to run the cluster function on datasets larger than 10,000 rows, considering all the new data available today... This can be done externally, but DataWarrior is so convenient that it would really be nice to have this in a future release...
Sincerely
Re: Why is clustering limited up to 10000 molecules? [message #255 is a reply to message #252] Thu, 30 March 2017 20:35
thomas
Messages: 155
Registered: June 2014
Senior Member
Hi Dataviz,

I have lifted the limit from 10.000 to 100.000 structures, but DataWarrior now displays a warning above 20.000 compounds. Nevertheless, clustering significantly more than 20.000 structures will require a Linux or Macintosh computer, where it is easy to increase the maximum amount of memory that DataWarrior is allowed to use. The change will be available with the next update.

Regards, Thomas