
ECNU launches ChemGPT 1.0, a large language model in the field of Chemistry


On December 2, the Forum on Molecular Science and Health of the 2023 International Conference on the Cooperation and Integration of Industry, Education, Research and Application (Shanghai) was held at ECNU. During the forum, ECNU held a press conference for ChemGPT 1.0. Sun Zhenrong, Deputy Director of the Shanghai Municipal Education Commission; Shi Guoyue, Vice President of ECNU; Prof. He Xiao, head of the R&D team from ECNU's School of Chemistry and Molecular Engineering; and other guests jointly launched ChemGPT 1.0, marking significant progress in the integration of AI and molecular science.

According to He Xiao, ChemGPT 1.0 is a chemical synthesis tool that integrates advanced AI technology. With the support of the Shanghai Municipal Education Commission, ECNU established the Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, whose R&D team was formed by the School of Chemistry and Molecular Engineering and the School of Computer Science and Technology. Committed to AI-driven chemical research, the Frontiers Science Center conducts AI4ChemicalScience exploration, integrating machine learning technology with chemical synthesis to enhance the efficiency and precision of synthesis. Over the past two years, the research team has built ChemGPT 1.0 through an in-depth study of chemical property databases, the introduction of physical descriptors, and the development of a new density functional, CF22D, creating a new tool for molecular intelligent manufacturing in the age of AI.

He Xiao stated that ChemGPT 1.0 has three major highlights.

First, the construction of high-quality chemical dialogue data sets. The team integrated more than 390,000 high-quality dialogue records, including 734 types of chemical property question-and-answer (Q&A) data, 11,679 types of science (including chemistry) Q&A data, 658 types of chemistry Q&A data, and more than 10,000 encyclopedia entries, and converted them into a data set of more than 2.07 million questions. Because it draws on an extensive collection of professional knowledge in the chemical field, the data set provides strong support for comprehensive and accurate chemistry Q&A.

Second, the creation of a compound retrosynthesis database. To address the problem of compound retrosynthesis, the team has built a new retrosynthesis database through data splicing, superposition, weighting, and synthesis screening. The database's large scale improves the robustness and reactivity of the model, and its high-quality annotated data improve the model's accuracy and reliability. Through better data balance, the model’s predictive ability across various reaction types is significantly improved. In the USPTO-50K test task, the large language model ChemGPT trained on the new data set achieved the highest prediction accuracy, 74.4%.

Third, innovative improvements in the language and retrosynthesis models. The language model and the retrosynthesis model of ChemGPT 1.0 are based on the ChatGLM model and the LLaMA model, respectively. Through full fine-tuning and two months of training on an A800 GPU cluster, their performance has been substantially improved. In addition, through multi-model and multi-module integration, ChemGPT 1.0 supports knowledge Q&A in chemical expertise, chemical retrosynthesis, biomedicine, and general fields without installing any plug-ins. ChemGPT 1.0 also supports automatic online data search, which helps the model generate high-quality, real-time answers, and it provides a painting service.

On this basis, the R&D team has completed the overall construction and framework design of an automated chemical synthesis reaction technology system. Combined with the results of ultra-confinement synthesis, chemical synthesis based on microfluidic chips saves 80% of experimental time. The miniature synthesis factory driven by the AI chemist “Xiaohua” has achieved the automated synthesis of compounds, bringing revolutionary changes to the field of chemical synthesis and demonstrating the huge potential of AI in the field of biomedicine.

ChemGPT 1.0 marks another important achievement for ECNU in the field of AI4Science. Next, the R&D team will further optimize and expand its functions to help Shanghai’s key sectors of AI and biomedicine advance more quickly toward the high end of the global innovation, industrial, and value chains.

Source: School of Chemistry and Molecular Engineering

Editor: Philip Nash, Wicky Xu