Recommendation

Predicting characterization factors of chemical substances from a set of molecular descriptors based on machine learning algorithms

Sandrine CHARLES based on reviews by Patrice Couture, Sylvain Bart, Dominique Lamonica and 2 anonymous reviewers

A recommendation of:

Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater

Rémi Servien, Eric Latrille, Dominique Patureau, Arnaud Hélias (2022), bioRxiv, 2021.07.20.453034, ver. 6 peer-reviewed and recommended by Peer Community in Ecotoxicology and Environmental Chemistry https://doi.org/10.1101/2021.07.20.453034

Read preprint in preprint server Now published in Peer Community Journal

Abstract

Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater

It is a real challenge for life cycle assessment practitioners to identify all relevant substances contributing to the ecotoxicity. Once this identification has been made, the lack of corresponding ecotoxicity factors can make the results partial and difficult to interpret. So, it is a real and important challenge to provide ecotoxicity factors for a wide range of compounds. Nevertheless, obtaining such factors using experiments is tedious, time-consuming, and made at a high cost. A modeling method that could predict these factors from easy-to-obtain information on each chemical would be of great value. Here, we present such a method, based on machine learning algorithms, that used molecular descriptors to predict two specific endpoints in continental freshwater for ecotoxicological and human impacts. The different tested machine learning algorithms show good performances on a learning database and the non-linear methods tend to outperform the linear ones. The cluster-then-predict approaches usually show the best performances which suggests that these predicted models must be derived for somewhat similar compounds. Finally, predictions were derived from the validated model for compounds with missing toxicity/ecotoxicity factors.

machine learning, Life Cycle Assessment, characterisation factors, toxicity, ecotoxicity, continental freshwater.

Submission: posted 21 July 2021
Recommendation: posted 17 January 2022, validated 18 January 2022

Cite this recommendation as:
CHARLES, S. (2022) Predicting characterization factors of chemical substances from a set of molecular descriptors based on machine learning algorithms. Peer Community in Ecotoxicology and Environmental Chemistry, 100001. https://doi.org/10.24072/pci.ecotoxenvchem.100001

Recommendation

Today, thousands of chemical substances are released into the environment because of human activities. It is thus crucial to identify all relevant chemicals that contribute to toxic effects on living organisms and that may also disturb community functioning and the ecosystem services that flow from it. Once identified, chemical substances need to be associated with ecotoxicity factors. Nevertheless, obtaining such factors usually requires experiments that are costly in time, resources and animals, and that should be avoided where possible. From this perspective, modelling approaches may be particularly helpful if they rely on easy-to-obtain information used as predictive variables.

Within this context, the paper of Servien et al. (2022) illustrates the use of machine learning algorithms to predict toxicity and ecotoxicity factors that were missing for a collection of compounds. Their modelling approach involves a collection of molecular descriptors as input variables. A total of 40 molecular descriptors were extracted from the TyPol database (Servien et al., 2014) as those that best describe how organic compounds behave within the environment. These molecular descriptors also have the advantage of being easily quantifiable for new chemical substances under evaluation. The performances of the proposed models were systematically checked and compared to the classical linear partial least squares method, based on the calculation of the absolute error (namely, the absolute difference between prediction and true value). This finally led to different best models (that is, those associated with the lowest median absolute error) depending on the cluster, the 526 compounds comprised in the TyPol database being classified into five clusters. These five clusters of different sizes gather chemical substances with different but specific molecular characteristics, and they also correspond to different estimates of the characterization factors, both in their median and in their within-cluster variability.
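The model comparison described above can be illustrated with a minimal sketch. It assumes that true values, cluster labels and the predictions of each candidate model are already available as plain lists; the function names and the model labels used in any call are hypothetical and are not taken from the authors' code.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of |prediction - true value| over all compounds."""
    return median(abs(p - t) for t, p in zip(y_true, y_pred))


def best_model_per_cluster(y_true, clusters, predictions):
    """Return, for each cluster, the model with the lowest median absolute error.

    y_true      : list of observed characterization factors
    clusters    : list of cluster labels, one per compound
    predictions : dict mapping a model name to its list of predictions
    (all three inputs are illustrative assumptions)
    """
    best = {}
    for cluster in sorted(set(clusters)):
        # Indices of the compounds that belong to the current cluster.
        idx = [i for i, label in enumerate(clusters) if label == cluster]
        scores = {
            name: median_absolute_error(
                [y_true[i] for i in idx], [pred[i] for i in idx]
            )
            for name, pred in predictions.items()
        }
        best[cluster] = min(scores, key=scores.get)
    return best
```

Selecting the model separately for each cluster is what allows a different "best model" to emerge in each of the five clusters, as reported in the recommendation.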

In a final step, characterization factors were predicted for 102 values that are missing from the USEtox® database (Rosenbaum et al., 2008) for compounds nevertheless referenced in TyPol. This paper highlights that the molecular descriptors that best explain the toxicity of the chemical substances strongly differ from one cluster to another. Nevertheless, these predictions, whatever the cluster, appear precise enough to be considered relevant.

In conclusion, this paper is a promising proof-of-concept of the use of machine learning modelling to go beyond some constraints around the toxicity evaluation of chemical substances, especially the handling of non-linearities and data-demanding calculations, in an ever-changing world that is gradually depleting its resources without sufficient concern for the short-term risks to the environment and human health.

References

Rosenbaum RK, Bachmann TM, Gold LS, Huijbregts MAJ, Jolliet O, Juraske R, Koehler A, Larsen HF, MacLeod M, Margni M, McKone TE, Payet J, Schuhmacher M, van de Meent D, Hauschild MZ (2008) USEtox—the UNEP-SETAC toxicity model: recommended characterisation factors for human toxicity and freshwater ecotoxicity in life cycle impact assessment. The International Journal of Life Cycle Assessment, 13, 532. https://doi.org/10.1007/s11367-008-0038-4

Servien R, Latrille E, Patureau D, Hélias A (2022) Machine learning models based on molecular descriptors to predict human and environmental toxicological factors in continental freshwater. bioRxiv, 2021.07.20.453034, ver. 6 peer-reviewed and recommended by Peer Community in Ecotoxicology and Environmental Chemistry. https://doi.org/10.1101/2021.07.20.453034

Servien R, Mamy L, Li Z, Rossard V, Latrille E, Bessac F, Patureau D, Benoit P (2014) TyPol – A new methodology for organic compounds clustering based on their molecular characteristics and environmental behavior. Chemosphere, 111, 613–622. https://doi.org/10.1016/j.chemosphere.2014.05.020

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2021.07.20.453034

Version of the preprint: 4

Author's Reply, 07 Jan 2022

Download author's reply https://doi.org/10.24072/pci.ecotoxenvchem.100034.ar2

Decision by Sandrine CHARLES, posted 04 Jan 2022

Dear authors,

Thank you very much for the revised version of this manuscript, which accounts for all suggestions given by the different reviewers. This led to an improved version that I am almost ready to recommend, subject to a very minor revision based on suggestions that I added directly to your point-by-point reply, attached here. Addressing them should not take much time; they are only suggestions for your consideration and can be discussed or ignored.

Best regards,

Sandrine Charles

Download recommender's annotations

https://doi.org/10.24072/pci.ecotoxenvchem.100034.d2

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2021.07.20.453034

Version of the preprint: 3

Author's Reply, 17 Dec 2021

Dear recommender,

We thank the reviewers very much for their constructive comments. Major revisions were made as required, and a detailed response that carefully addresses, point by point, the issues raised in the reviewers' comments is attached. We hope that you will find the changes satisfactory and that this revised manuscript will now be considered for recommendation in PCI Ecotoxicology & Environmental Chemistry. Please note that the new version of the paper (and of the supplemental material) has been uploaded to bioRxiv. We are at your disposal if you need any further information. Thank you very much in advance for your attention.

Best regards,

Rémi Servien on behalf of co-authors.

https://doi.org/10.24072/pci.ecotoxenvchem.100034.ar1

Decision by Sandrine CHARLES, posted 22 Nov 2021

Dear authors,

First, please accept our apologies for the long duration of the review process regarding your paper. It took us a lot of time to find reviewers, and the first two we obtained were not specialized enough in modelling to enable us to render a decision. We finally obtained three additional reviews that should help you improve your manuscript and provide us with a revised version. Please provide this revision together with a point-by-point answer to the reviewers' comments, referring to the corresponding changes in your manuscript. Changes in your revised manuscript must be highlighted so that they can be clearly identified compared to the previous version. If possible, please provide your revised manuscript and your answers by 21 December 2021 at the latest.

Best regards,

S. Charles

https://doi.org/10.24072/pci.ecotoxenvchem.100034.d1

Reviewed by Sylvain Bart, 23 Aug 2021

Servien et al. present a new method based on machine learning to predict ecotoxicological metrics for chemicals for which we don’t have these metrics. The approach is promising and complementary to the linear QSAR method, which cannot deal with nonlinearity.

The graphical abstract is very informative and the introduction provides all the necessary information to understand the topic and the scientific gap addressed. All the methods and procedures are described in depth, which is much appreciated by readers for whom machine learning is not the primary expertise, like me.

In conclusion, the manuscript is well written, I don’t see any major issue in the manuscript, and I would recommend it for publication in a peer-reviewed journal.

Minor comment:

- I would suggest carefully checking all figure captions to ensure that all the necessary information is given for the figures to be read on their own. E.g., in Figure 4, provide the full names of RF, PLS, etc. somewhere?

All the best

https://doi.org/10.24072/pci.ecotoxenvchem.100034.rev11

Reviewed by Patrice Couture, 27 Aug 2021

I will not provide an in-depth review of this manuscript, due to my very limited expertise in the area of the paper (I am an ecotoxicologist). This paper needs to be properly reviewed by experts in modeling. I only identified a few points that would need to be addressed to improve the clarity and the relevance of ecotoxicological terms like LC50 (see file attached).

I consider that the topic addressed in this paper is interesting and the approach proposed is promising. Overall, this work has the potential to provide very useful tools for environmental and human risk assessment of new chemicals that will reduce costs, time and use of live organisms.

Download the review https://doi.org/10.24072/pci.ecotoxenvchem.100034.rev12

Reviewed by Dominique Lamonica, 11 Nov 2021

Download the review https://doi.org/10.24072/pci.ecotoxenvchem.100034.rev13

Reviewed by anonymous reviewer 2, 21 Oct 2021

The paper frames itself in a line of research initiated by other researchers and pursued also by the same authors in previous works, i.e. the use of machine learning to predict human and environmental toxicity of chemicals (using the USETox database, but not only).

The application described in this paper is just another confirmation of the potential of this kind of approach.

The paper is rather well written, although it appears too concise in the description of the full path of modelling that was followed. In this sense, to facilitate the understanding of the model chain, I suggest inserting a clear flowchart or a figure like Fig. 1 in Hou et al. 2020 (Estimate ecotoxicity characterization factors for chemicals in life cycle assessment using machine learning models. Environment International, 135, 105393) or Fig. 1 in Marvuglia et al. 2013 (Machine learning for toxicity characterization of organic chemical emissions using USEtox database: learning the structure of the input space. Environment International 83: 72-85).

Besides these two articles, others on similar applications exist in the literature and have not been cited in this manuscript. The authors might want to take a look at them to improve their state of the art:

- Marvuglia et al. 2014. Variables selection for ecotoxicity and human toxicity characterization using Gamma Test. In: B. Murgante et al. (Eds.): ICCSA 2014, Part III, LNCS 8581, pp. 640–652, 2014. Proceedings of the 14th International Conference on Computational Science and Applications (ICCSA 2014), University of Minho, Guimaraes, Portugal.

- Marvuglia et al. 2015. Random Forest for toxicity of chemical emissions: features selection and uncertainty quantification. Journal of Environmental Accounting and Management 3(3): 229-241;

- Song et al. 2017. Rapid Life-Cycle Impact Screening Using Artificial Neural Networks. Environ. Sci. Technol. 2017, 51, 10777−10785.

- Wu and Wang 2018. Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis. Int. J. Mol. Sci. 2018, 19, 2358; doi:10.3390/ijms19082358.

- Lysenko et al 2018. An integrative machine learning approach for prediction of toxicity-related drug safety. https://doi.org/10.26508/lsa.201800098.

- Song et al. 2021. Accelerating the pace of ecotoxicological assessment using artificial intelligence. Ambio. https://doi.org/10.1007/s13280-021-01598-8

At page 11, where the clustering protocol is described, it is not clear to me how the clustering is chosen. The authors mention that the whole algorithm is repeated 200 times. However, this is not a deterministic procedure, and at each iteration a (slightly or not) different partitioning can come up. Therefore, a criterion of cluster quality is needed. For example, in hierarchical clustering, the cut height that determines how many clusters to choose is not always clear. If I understand correctly, the error criterion that the authors use pertains only to the evaluation of the forecasting capacity of the models to determine the two factors CFET and CFHT, but nothing is said on how to choose the best clustering partition. There are many cluster validity measures (see e.g. Vazirgiannis M. (2009) Clustering Validity. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_616).
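As an illustration of what such a criterion could look like, here is a minimal sketch of one common cluster validity measure, the mean silhouette coefficient, computed with Euclidean distances. The choice of this measure is an assumption made for the example; nothing in the review or the manuscript states that it is the measure the authors used. Among several candidate partitions of the same compounds, the one with the highest score would be preferred.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient of a partition (higher is better).

    points : list of coordinate tuples (e.g. scaled molecular descriptors)
    labels : list of cluster labels, one per point
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention, a singleton cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point itself contributes a distance of 0 to the sum).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for key, members in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)
```

With repeated, non-deterministic clustering runs, each run's partition could be scored this way and the scores compared, which gives a selection rule that is independent of the prediction error on CFET and CFHT.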

At page 11, line 18, the term NA appears but it is not explained in the paper. It is only explained in the caption of table S2 in supporting information. I think it should also be explained in the text of the paper.

At page 21, lines 2-3 read as follows: "We could see in this Table that the important molecular descriptors strongly differ from one cluster to another, highlighting the usefulness of the cluster-then-predict approaches". This is true, but the important molecular descriptors (and the ranking of the descriptors overall) differ not only because we change from one cluster to another, but also because the best model changes from one cluster to the other. Therefore, how can we say that the important descriptors change only because of the cluster? To estimate how much of this change in ranking depends on the cluster and how much on the model used, the authors should provide the full ranking in each cluster for each model. Then one could calculate, for example, the change in ranking position for each variable within the same cluster when passing from one model to the other.
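The comparison suggested here can be sketched as follows, assuming that each model provides one importance score per descriptor within a given cluster. The function names and any descriptor names passed to them are hypothetical and serve only to illustrate the calculation.

```python
def rank_descriptors(importances):
    """Map each descriptor to its rank (1 = most important).

    importances : dict mapping a descriptor name to its importance score.
    Ties keep the insertion order of the dict, because sorted() is stable.
    """
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(importances_a, importances_b):
    """Change in rank of each descriptor when passing from model A to model B.

    Both dicts must describe the same descriptors within the same cluster.
    A positive shift means the descriptor is less important under model B.
    """
    if set(importances_a) != set(importances_b):
        raise ValueError("both models must rank the same descriptors")
    ranks_a = rank_descriptors(importances_a)
    ranks_b = rank_descriptors(importances_b)
    return {name: ranks_b[name] - ranks_a[name] for name in ranks_a}
```

Small shifts within a cluster across models, together with large differences between clusters for a fixed model, would support attributing the change in important descriptors to the cluster rather than to the model.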

In table 2, it is not clear how the descriptors are selected. Is it possible to add the % of variance of the output explained by each descriptor?

At page 24, the lines from 6 to 11 of the Conclusions are more fit for the introduction, rather than for the conclusions. I suggest moving this part there.

Suggested changes to the text:

- Page 3, line 11: begin the sentence with “therefore” rather than with “so”.

- Page 3, lines 23-24 from “To best” to “case-by-case basis”: this sounds like a repetition of something already mentioned above.

- Page 5, line 8: change “That’s why” to “That is why”.

- Page 6, line 28: change “that are” to “that is”.

- Page 9, line 26: add a comma after “performs well”.

- Page 10, line 18: correct “cluster-the-SVM” to “cluster-then-SVM”.

- Page 17, line 11: change “in each cluster” to “from one cluster to another”. The meaning changes, and I think my suggestion reflects better what you want to say.

- Page 17, line 13: begin the sentence with “therefore” rather than with “so”.

- Page 20, line 16: change “the more difficult” to “the most difficult”.

- Page 21, line 8: change “lonely” to “single”.

- Page 21, line 10: change “the more important” to “the most important”.

- Page 23, line 4: although the cited paper (Lesnoff et al., 2020) also uses the term “explicative”, I believe a more common term in statistics and machine learning is “explanatory”.

https://doi.org/10.24072/pci.ecotoxenvchem.100034.rev14

Reviewed by anonymous reviewer 1, 03 Nov 2021

The first impression on reading the paper is that it contains some naïve considerations. The authors insist on the novelty of using non-linear methods; those methods have been in use for about 20 years, both in QSAR and in many other modeling tasks. Using a non-linear method is good practice today when simple linear methods fail.

So the novelty of the paper is not in choosing tools that are already accepted in QSAR; it can be in the idea of computing the characterization factors (CFs) using molecular descriptors instead of relying on the traditional LCA methods that depend on data (chemical, toxicological, etc.) not easily available for every chemical.

The authors compute 40 molecular descriptors (including some quantum chemical descriptors), selected since they appear relevant to describe the behavior of organic compounds in the environment. Then they apply both classifiers (using 3 modeling methods) and clustering, defining different local models for the 5 different clusters.

A point that should need more attention is the descriptor selection. In any modeling method (machine learning included) the features are important and a wider exploration of the features and their number is missing in the paper.

The combination of the classifiers with clustering is interesting in that the results can be more readily accepted by the users, who often like to consider also the compounds similar to the one under investigation.

As the authors report, USEtox® is commonly used; it provides in one single CF the chemical fate, the exposure, and the effect for each compound in a set of several thousand chemicals. Then the CF can be extended to other endpoints, both human and environmental (DALY and PDF). The observation that the computation of those final endpoints can be done in one model using directly the chemical information is the advantage of the proposed method over the traditional one.

In conclusion, even though the methods applied are quite common in QSAR, and the machine learning methods should be better applied, the paper proposes something new in the LCA domain.

https://doi.org/10.24072/pci.ecotoxenvchem.100034.rev15