openmolecules.org

 
Assessing A Machine Learning Method's Predictivity [message #1523] Tue, 01 March 2022 13:56
Christophe
Hello everyone,

In the “Assessing A Machine Learning Method's Predictivity” section of the User Manual, I read:

“The dataset is divided into ten fractions along the time axis. A model is build with the first fraction and used to predict the property of the second fraction's molecules. Then a second model is built from the first two fractions, which is then applied to predict the third fraction. This continues until the nineth model is built from the first nine fractions and used to predict the tenth and last fraction.”

When I apply this to a case, by clicking “Machine Learning” and then “Assess Prediction Quality”, each of the nine “predicted vs observed” linear regressions I get contains only one Time Id set of data!
For example, if I split my data set into ten fractions according to a Time Id column, I get 9 regression models containing, for example, prediction fraction 3 vs 2; 3 vs 3; 3 vs 4 … 3 vs 10.

From the user manual I would expect “predicted vs observed” as 2 vs (3), then 2 vs (3+4), then 2 vs (3+4+5 …). Of course, the number of data points in the (3), (3+4) and (3+4+5 …) sets should equal the number contained in 2.
But in that case the difference from "use random fractions instead of time based ones" would shrink from (3) to (3+...+9).

Did I miss something? For me, dividing the data set into 10 fractions along the time axis serves precisely to take account of batches that are very different from one another.

All the best
Re: Assessing A Machine Learning Method's Predictivity [message #1578 is a reply to message #1523] Tue, 05 April 2022 14:56
thomas
Hello Christophe,

I don't quite understand the question, but I assume there is a misinterpretation of the result. When you run the analysis, the data set is split into 10 fractions:

1: the oldest 10% of the data
2: the second oldest 10%
...
10: newest 10%

Then 9 models are generated using 10%, 20%, 30% ... 90% of the data (always the oldest)

Then every model is used to predict the Y value for the oldest fraction, which was not part of the model creation, e.g. for model 1 it is fraction 2, for model 5 it is fraction 6.

Then for all fractions with predicted data (2-10) a correlation graph is shown with predicted versus known Y-values. Graph 'fraction 8', for instance, answers the question: If I had built a model at the time when I had 70% of the data, and if I had used that model to predict Y-values for the next molecules to synthesize, how good would the prediction have been?
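
To make the procedure above concrete, here is a minimal Python sketch. It is not DataWarrior's actual code: the function names (`time_fractions`, `fit_line`, `assess`) are hypothetical, the one-descriptor least-squares line only stands in for whichever machine learning method is being assessed, and the toy data is made up. It assumes the rows are already sorted from oldest to newest.

```python
from statistics import mean


def time_fractions(rows, n_fractions=10):
    """Split rows (assumed sorted oldest -> newest) into contiguous chunks."""
    n = len(rows)
    bounds = [round(i * n / n_fractions) for i in range(n_fractions + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(n_fractions)]


def fit_line(points):
    """Least-squares fit y = a*x + b (stand-in for the real ML method)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return a, my - a * mx


def assess(rows, n_fractions=10):
    """Return {fraction number: [(observed, predicted), ...]} for fractions 2..n."""
    fractions = time_fractions(rows, n_fractions)
    results = {}
    for k in range(1, n_fractions):
        # Model k is trained on the oldest k fractions (10%, 20%, ... 90%).
        training = [p for f in fractions[:k] for p in f]
        a, b = fit_line(training)
        # It predicts only fraction k+1, the oldest fraction it has not seen.
        target = fractions[k]
        results[k + 1] = [(y, a * x + b) for x, y in target]
    return results


# Toy data: 100 (descriptor, property) pairs, assumed to be in time order.
rows = [(float(i), 2.0 * i + 1.0) for i in range(100)]
results = assess(rows)
```

Each entry of `results` corresponds to one correlation graph: `results[8]` holds the observed and predicted values for fraction 8, predicted by the model trained on fractions 1 to 7.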

Does this explain it or did I misunderstand the question?

Thomas
Re: Assessing A Machine Learning Method's Predictivity [message #1596 is a reply to message #1578] Fri, 22 April 2022 14:20
Christophe
Hello Thomas,

Thanks a lot for your reply. Your explanation is crystal clear.
I really didn't understand anything when I read the manual.

All the best.

Christophe