Predicting characterization factors of chemical substances from a set of molecular descriptors based on machine learning algorithms
Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater
Recommendation: posted 17 January 2022, validated 18 January 2022
Today, thousands of chemical substances are released into the environment because of human activities. It is thus crucial to identify all relevant chemicals that contribute to toxic effects on living organisms and that may also disturb community functioning and the ecosystem services that flow from it. Once identified, chemical substances need to be associated with ecotoxicity factors. However, obtaining such factors usually requires experiments that are costly in time, resources and animals, and that it should be possible to avoid. From this perspective, modelling approaches may be particularly helpful if they rely on easy-to-obtain information as predictive variables.
Within this context, the paper of Servien et al. (2022) illustrates the use of machine learning algorithms to predict toxicity and ecotoxicity factors that were missing for a collection of compounds. Their modelling approach involves a collection of molecular descriptors as input variables. A total of 40 molecular descriptors were extracted from the TyPol database (Servien et al., 2014) as those that best describe how organic compounds behave within the environment. These molecular descriptors also have the advantage of being easily quantifiable for new chemical substances under evaluation. The performance of the proposed models was systematically checked and compared to the classical linear partial least squares method, based on the calculation of the absolute error (namely, the absolute difference between prediction and true value). This finally led to different best models (that is, those associated with the lowest median absolute error) according to the classification of the 526 compounds included in the TyPol database into five clusters. These five clusters of different sizes gather chemical substances with different but specific molecular characteristics, also corresponding to different estimates of the characterization factors, both in their median and in their within-cluster variability.
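The selection rule described above (within each cluster, keep the model with the lowest median absolute error) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names, the model labels `pls` and `ml_model`, and all numerical values are made up for the example, and the predictions are assumed to have been computed beforehand on held-out compounds.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of |prediction - true value| over a set of compounds."""
    return median(abs(p - t) for t, p in zip(y_true, y_pred))


def best_model_per_cluster(y_true, clusters, predictions):
    """Return {cluster: (model_name, error)} for the lowest-error model.

    y_true      -- observed characterization factors, one per compound
    clusters    -- cluster label of each compound
    predictions -- {model_name: list of predictions, one per compound}
    """
    best = {}
    for cluster in sorted(set(clusters)):
        # Indices of the compounds that belong to this cluster
        idx = [i for i, c in enumerate(clusters) if c == cluster]
        truth = [y_true[i] for i in idx]
        # Median absolute error of every candidate model on this cluster
        scores = {
            name: median_absolute_error(truth, [pred[i] for i in idx])
            for name, pred in predictions.items()
        }
        winner = min(scores, key=scores.get)
        best[cluster] = (winner, scores[winner])
    return best


# Hypothetical toy data: six compounds split into two clusters.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
clusters = [1, 1, 1, 2, 2, 2]
predictions = {
    "pls": [1.5, 2.5, 3.5, 4.1, 5.1, 6.1],       # linear baseline (made up)
    "ml_model": [1.1, 2.1, 3.1, 4.5, 5.5, 6.5],  # any ML model (made up)
}

result = best_model_per_cluster(y_true, clusters, predictions)
for cluster, (name, error) in result.items():
    print(f"cluster {cluster}: best model = {name}, median abs. error = {error:.2f}")
```

With these toy values, the hypothetical `ml_model` is retained for cluster 1 and the linear baseline for cluster 2, which mirrors the idea that the best model may differ from one cluster to another.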
In a final step, characterization factors were predicted for 102 values that are missing from the USEtox® database (Rosenbaum et al., 2008) for compounds that are also referenced in TyPol. This paper highlights that the molecular descriptors that best explain the toxicity of the chemical substances differ strongly from one cluster to another. Nevertheless, whatever the cluster, these predictions appear precise enough to be considered relevant.
In conclusion, this paper is a promising proof-of-concept for using machine learning modelling to go beyond some of the constraints surrounding the toxicity evaluation of chemical substances, especially the handling of non-linearities and data-demanding calculations, in an ever-changing world that is gradually depleting its resources without sufficient concern for the short-term risks to the environment and human health.
Rosenbaum RK, Bachmann TM, Gold LS, Huijbregts MAJ, Jolliet O, Juraske R, Koehler A, Larsen HF, MacLeod M, Margni M, McKone TE, Payet J, Schuhmacher M, van de Meent D, Hauschild MZ (2008) USEtox—the UNEP-SETAC toxicity model: recommended characterisation factors for human toxicity and freshwater ecotoxicity in life cycle impact assessment. The International Journal of Life Cycle Assessment, 13, 532. https://doi.org/10.1007/s11367-008-0038-4
Servien R, Latrille E, Patureau D, Hélias A (2022) Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater. bioRxiv, 2021.07.20.453034, ver. 6 peer-reviewed and recommended by Peer Community in Ecotoxicology and Environmental Chemistry. https://doi.org/10.1101/2021.07.20.453034
Servien R, Mamy L, Li Z, Rossard V, Latrille E, Bessac F, Patureau D, Benoit P (2014) TyPol – A new methodology for organic compounds clustering based on their molecular characteristics and environmental behavior. Chemosphere, 111, 613–622. https://doi.org/10.1016/j.chemosphere.2014.05.020
Sandrine CHARLES (2022) Predicting characterization factors of chemical substances from a set of molecular descriptors based on machine learning algorithms. Peer Community In Ecotoxicology and Environmental Chemistry, 100001. https://doi.org/10.24072/pci.ecotoxenvchem.100001
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
Evaluation round #2
DOI or URL of the preprint: https://doi.org/10.1101/2021.07.20.453034
Version of the preprint: 4
Author's Reply, 07 Jan 2022
Decision by Sandrine CHARLES, posted 04 Jan 2022
Thank you very much for the revised version of your manuscript, which accounts for all the suggestions made by the different reviewers. This led to an improved version that I am almost ready to recommend, subject to a last, very minor revision based on the suggestions that I added directly to your point-by-point reply, attached here. Considering them should not take much time, and each can be discussed or ignored. These are only suggestions for your consideration.
Sandrine Charles
Evaluation round #1
DOI or URL of the preprint: 10.1101/2021.07.20.453034
Version of the preprint: 3
Author's Reply, 17 Dec 2021
Decision by Sandrine CHARLES, posted 22 Nov 2021
First, please accept our apologies for the long duration of the review process for your paper. It took us a lot of time to find reviewers, and the first two we found were not specialized enough in modelling for us to reach a decision. We finally obtained three additional reviews that should help you improve your manuscript and provide us with a revised version. Please provide this revision together with a point-by-point answer to the reviewers' comments, referring to the corresponding changes in your manuscript. Changes in your revised manuscript must be highlighted so that they can be clearly identified compared to the previous version. If possible, please provide your revised manuscript and your answers by 21 December 2021 at the latest.