18 Jan 2022

Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater

Predicting characterization factors of chemical substances from a set of molecular descriptors based on machine learning algorithms

Recommended based on reviews by Patrice Couture, Sylvain Bart, Dominique Lamonica and 2 anonymous reviewers

Today, thousands of chemical substances are released into the environment because of human activities. It is thus crucial to identify all relevant chemicals that contribute to toxic effects on living organisms and that may also disturb community functioning and the ecosystem services that flow from it. Once identified, chemical substances need to be associated with ecotoxicity factors. Nevertheless, obtaining such factors usually requires experiments that are costly in time, resources and animals, and that should be avoided where possible. From this perspective, modelling approaches may be particularly helpful if they rely on easy-to-obtain information as predictive variables.

Within this context, the paper of Servien et al. (2022) illustrates the use of machine learning algorithms to predict toxicity and ecotoxicity factors that were missing for a collection of compounds. Their modelling approach involves a collection of molecular descriptors as input variables. A total of 40 molecular descriptors were extracted from the TyPol database (Servien et al., 2014) as those that best describe how organic compounds behave within the environment. These molecular descriptors also have the advantage of being easily quantifiable for new chemical substances under evaluation. The performance of the proposed models was systematically checked and compared to the classical linear partial least squares method, based on the calculation of the absolute error (namely, the absolute difference between prediction and true value). This finally led to different best models (that is, those associated with the lowest median absolute error) according to the classification of the 526 compounds included in the TyPol database into five clusters. These five clusters of different sizes gather chemical substances with different but specific molecular characteristics, also corresponding to different estimates of the characterization factors, both in their median and in their within-cluster variability.
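The model comparison described above can be sketched in a few lines of Python. This is only an illustration of the selection criterion (lowest median absolute error within each cluster), not the authors' implementation: the function names, the record layout, the placeholder label "ml_model" and all numerical values are assumptions made up for the example.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of |prediction - true value| over paired observations."""
    return median(abs(p - t) for t, p in zip(y_true, y_pred))


def best_model_per_cluster(records):
    """Pick, for each cluster, the model with the lowest median absolute error.

    records: iterable of (cluster, model, true_value, predicted_value) tuples.
    Returns a dict {cluster: (best_model, median_absolute_error)}.
    """
    # Group true and predicted values by (cluster, model).
    grouped = {}
    for cluster, model, y_true, y_pred in records:
        truths, preds = grouped.setdefault((cluster, model), ([], []))
        truths.append(y_true)
        preds.append(y_pred)

    # Keep the model with the smallest median absolute error in each cluster.
    best = {}
    for (cluster, model), (truths, preds) in grouped.items():
        err = median_absolute_error(truths, preds)
        if cluster not in best or err < best[cluster][1]:
            best[cluster] = (model, err)
    return best


# Hypothetical data: two clusters, a PLS baseline and one placeholder
# machine learning model ("ml_model"). Values are invented for illustration.
records = [
    (1, "pls", 2.0, 3.5), (1, "pls", 1.0, 1.5), (1, "pls", 4.0, 2.0),
    (1, "ml_model", 2.0, 2.2), (1, "ml_model", 1.0, 1.4), (1, "ml_model", 4.0, 3.0),
    (2, "pls", 0.0, 0.1), (2, "pls", 5.0, 5.3),
    (2, "ml_model", 0.0, 1.0), (2, "ml_model", 5.0, 4.0),
]

best = best_model_per_cluster(records)
for cluster, (model, err) in sorted(best.items()):
    print(f"cluster {cluster}: best model = {model}, median absolute error = {err:.2f}")
```

With these invented values the placeholder model wins in cluster 1 and the PLS baseline wins in cluster 2, which mirrors the point made above that the best model may differ from one cluster to another.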

In a final step, characterization factors were predicted for 102 values that are missing from the USEtox® database (Rosenbaum et al., 2008) for compounds also referenced in TyPol. This paper highlights that the molecular descriptors that best explain the toxicity of the chemical substances differ strongly from one cluster to another. Nevertheless, whatever the cluster, the predictions appear precise enough to be considered relevant.

In conclusion, this paper is a promising proof-of-concept for using machine learning modelling to go beyond some of the constraints surrounding the toxicity evaluation of chemical substances, especially the handling of non-linearities and data-demanding calculations, in an ever-changing world that is gradually depleting its resources without sufficient concern for the short-term risks to the environment and human health.


Rosenbaum RK, Bachmann TM, Gold LS, Huijbregts MAJ, Jolliet O, Juraske R, Koehler A, Larsen HF, MacLeod M, Margni M, McKone TE, Payet J, Schuhmacher M, van de Meent D, Hauschild MZ (2008) USEtox—the UNEP-SETAC toxicity model: recommended characterisation factors for human toxicity and freshwater ecotoxicity in life cycle impact assessment. The International Journal of Life Cycle Assessment, 13, 532.

Servien R, Latrille E, Patureau D, Hélias A (2022) Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater. bioRxiv, 2021.07.20.453034, ver. 6 peer-reviewed and recommended by Peer Community in Ecotoxicology and Environmental Chemistry.

Servien R, Mamy L, Li Z, Rossard V, Latrille E, Bessac F, Patureau D, Benoit P (2014) TyPol – A new methodology for organic compounds clustering based on their molecular characteristics and environmental behavior. Chemosphere, 111, 613–622.