openmolecules.org Forum: Functionality » Similarity analysis using "find similar compounds..."


Thu, 26 November 2020 18:16

thomas

I could confirm this with two files containing 16000 random structures each. Two factors together cause the very poor performance: the large number of similarity values per row and an inefficient way of keeping them sorted.

A new column receives all individual similarity values to compounds of the second file that are higher than the given threshold. With a 30% limit these are about 70% of the other file's molecules. Therefore, about 8000 to 12000 similarity values were put into every row of the first file. DataWarrior keeps the individual similarity values sorted. Unfortunately, this was done in a very inefficient way: the cell content was kept as text and converted to numbers each time in order to find the insert position for a new value. With just a few values this is not a problem, but with thousands of values it was very expensive. I have updated the source code. The next development release in early December will contain the fix. With the fix, the analysis takes about 2 minutes.
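To illustrate why the text-based approach scales so badly, here is a minimal Python sketch. It is not DataWarrior's actual code (which is Java and not shown in this post), and it is not necessarily how the fix was implemented. The function names, the `;` separator and the descending sort order are assumptions made only for the illustration.

```python
def insert_sorted_text(cell: str, value: float, sep: str = ";") -> str:
    """Slow variant: the cell is text, so every insert re-parses all values."""
    # Convert the whole cell back to numbers just to find one insert position.
    values = [float(v) for v in cell.split(sep)] if cell else []
    # Assumed descending order: skip all values that are >= the new one.
    pos = 0
    while pos < len(values) and values[pos] >= value:
        pos += 1
    values.insert(pos, value)
    # Convert everything back to text again.
    return sep.join(repr(v) for v in values)


def collect_sorted_numeric(similarities, threshold: float, sep: str = ";") -> str:
    """Faster variant: keep numbers, sort once, build the text once."""
    kept = sorted((s for s in similarities if s > threshold), reverse=True)
    return sep.join(repr(v) for v in kept)


if __name__ == "__main__":
    sims = [0.5, 0.9, 0.2, 0.4, 0.7]
    cell = ""
    for s in sims:
        if s > 0.3:
            cell = insert_sorted_text(cell, s)
    print(cell)
    print(collect_sorted_numeric(sims, 0.3))
```

Both variants produce the same cell text. In the slow variant, inserting k values into one cell costs on the order of k²/2 string-to-number conversions, so with roughly 10000 values per row that is tens of millions of conversions for every single row, while the second variant parses nothing and sorts once.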

If your original idea was to get a distribution of all mutual similarities between any pair of molecules in a file, then there is a much faster way: Launch DataWarrior in development mode (with the Java option '-Ddevelopment=true'). Then you get a few undocumented additional items in the chemistry menu. 'Compare Descriptor Similarity Distribution' counts all similarity values, bins them into 1-percent bins, and shows a graph of the resulting Gaussian-like curve.
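For readers who want to see what such a binned distribution amounts to, here is a rough Python sketch. It is not the code behind the menu item. It assumes hypothetical fingerprints given as sets of feature indices and uses Tanimoto similarity as a stand-in for whatever descriptor similarity is selected in DataWarrior.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two feature sets (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_distribution(fingerprints):
    """Count all pairwise similarities in 100 bins of 1 percent each."""
    bins = [0] * 100
    for a, b in combinations(fingerprints, 2):
        s = tanimoto(a, b)
        # A similarity of exactly 1.0 goes into the last bin (99).
        bins[min(int(s * 100), 99)] += 1
    return bins


if __name__ == "__main__":
    # Three made-up fingerprints, purely for illustration.
    fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 2, 3})]
    hist = similarity_distribution(fps)
    for percent, count in enumerate(hist):
        if count:
            print(f"{percent}%: {count}")
```

Only 100 counters are kept no matter how many pairs are compared, so nothing has to be stored or sorted per row, which is why this kind of analysis avoids the problem described above.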

[Updated on: Thu, 26 November 2020 18:16]


