Since the United States launched the "Materials Genome Initiative" in June 2011, efficient materials computation, high-throughput experimental characterization, and materials data mining have been increasingly applied in materials research. In particular, data-intensive science has become the "Fourth Paradigm" of scientific discovery, attracting growing attention alongside experiment, theory, and computational simulation. The application of machine learning is a hallmark of this data-driven approach: it can handle structurally complex, high-dimensional data, emulate human learning behavior, automatically uncover internal correlations within data, and predict the properties of unknown samples by establishing quantitative models[57]. This has elevated materials research from traditional trial-and-error methods to scientific design based on structure-property relationships, providing new opportunities for property prediction and the discovery of new materials.
As illustrated in Figure 1.3, machine learning is defined as follows: a computer program is said to learn from experience E with respect to a task T and performance measure P if its performance on T, as measured by P, improves with experience E[58]. In materials science research, task T primarily concerns the development of new materials and the measurement of their properties. Machine learning has already been widely applied to predict the performance and composition of many classes of materials, including high-entropy alloys, perovskites, non-ferrous metals, solid-state electrolytes, and composites[59]. The number of publications retrieved from the Web of Science with the keywords "machine learning" and "material" surged from 764 articles in 2011 to 10,351 in 2022.
Experience E typically refers to a dataset describing a set of material properties (features), which directly determines the accuracy and generalization performance of subsequent models. Obtaining a sufficient and reliable set of data is often a necessary condition for applying machine learning in materials science research.
Material databases often serve as convenient sources of sufficient and reliable data for researchers. Databases established under unified standards greatly facilitate communication and collaboration among materials researchers and improve the efficiency of new material design. For example, the Inorganic Crystal Structure Database (ICSD) is a comprehensive inorganic structure database that collects information on over 200,000 compounds, including metals, ceramics, and minerals. Compared with the ICSD, the Cambridge Structural Database (CSD) contains a vast number of organic and metal-organic crystal structures[60]. Both databases compile valuable experimental data accumulated over past decades, providing reliable data sources for establishing accurate structure-property relationships. High-throughput computing has also driven the rapid development of material databases. For instance, the AFLOWlib database released by Curtarolo S[61] contains information on over 1,000,000 materials and their properties, including not only experimentally confirmed materials but also numerous hypothetical materials computed to be potentially stable under specific conditions.
In addition, text mining techniques provide valuable assistance in collecting material data. Using web crawling, researchers can analyze large volumes of literature text to extract material property data, process parameters, and preparation methods. Yingli L and colleagues[62] employed recurrent neural networks to extract five types of data from the materials literature: material names, elements, compositions, preparation methods, and properties. Through manual annotation they constructed the HASE dataset of hypoeutectic Al-Si alloys, comprising 8,845 material samples. In material domains with limited annotated data, researchers have further proposed methods combining active learning: exploiting characteristics of the material samples, they achieved automatic labeling based on dictionaries and rules, collecting 16,677 material samples in total. Rubayyat M and colleagues[63] used natural language processing to analyze the processing conditions of solid-state electrolytes in nearly ten thousand papers, then applied the extracted synthesis parameters to the preparation of sulfide and oxide solid-state electrolytes, guiding the low-temperature synthesis of high-voltage oxide-based lithium garnet electrolytes.
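As a minimal sketch of such dictionary- and rule-based labeling, the entity dictionary and regular expression below are hypothetical illustrations, not the actual resources of refs. [62] and [63]:

```python
import re

# Hypothetical dictionary of material names and a simple property pattern;
# real pipelines use far richer lexicons and learned sequence models.
MATERIAL_DICT = {"Al-Si alloy", "Li7La3Zr2O12", "yttrium barium copper oxide"}
PROPERTY_PATTERN = re.compile(
    r"(?P<name>tensile strength|ionic conductivity)\s*(of|was|is)?\s*"
    r"(?P<value>\d+(\.\d+)?)\s*(?P<unit>MPa|S/cm)"
)

def extract(text: str) -> dict:
    """Label one sentence with dictionary matches and rule-based property hits."""
    materials = [m for m in MATERIAL_DICT if m in text]
    properties = [m.groupdict() for m in PROPERTY_PATTERN.finditer(text)]
    return {"materials": materials, "properties": properties}

sentence = ("The hypoeutectic Al-Si alloy reached a tensile strength of 310 MPa "
            "after T6 heat treatment.")
print(extract(sentence))
```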
Additionally, researchers have explored alternative approaches for failed experiments and limited datasets. The authors of a 2016 Nature cover article optimized inorganic synthesis processes by mining a large set of failed experimental records, reaching a prediction accuracy of up to 89%[64]. The SLAC National Accelerator Laboratory in the United States, starting from limited experimental data on metallic glasses[65], used machine learning to predict the properties of a large pool of candidate compositions and, through validation experiments, rapidly developed a new ternary metallic glass alloy. To accomplish a specific material task T, the data are typically fed to a model that is improved iteratively. Common modeling algorithms are divided into supervised and unsupervised learning algorithms. Supervised learning algorithms rely on a set of known labels (usually the actual material properties of interest) to establish a mathematical model mapping features to labels, which is then used to predict unknown labels in subsequent tasks. Supervised learning can be further divided into regression (generally for continuous targets, seeking the best fit) and classification (generally for discrete targets, seeking the best decision boundary). Supervised learning not only enables prediction for unknown samples but can also reveal deeper relationships within the model. Representative algorithms such as linear regression and its derivatives, support vector machines, tree-based algorithms, and neural networks are widely used in material property prediction and composition development; they are discussed in detail in Section 2.2.1.
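The distinction between regression and classification can be made concrete with a short scikit-learn sketch; the composition features and labels below are synthetic, assumed for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic feature matrix: each row is a hypothetical material described
# by two composition features; not real experimental data.
X = rng.uniform(0, 1, size=(100, 2))

# Regression: a continuous property (e.g., strength) with a linear trend + noise.
y_cont = 200 + 150 * X[:, 0] - 80 * X[:, 1] + rng.normal(0, 5, 100)
reg = LinearRegression().fit(X, y_cont)
print("fitted coefficients:", reg.coef_)            # best-fit feature->property map

# Classification: a discrete label (e.g., single-phase vs multi-phase),
# defined here by an artificial decision boundary.
y_cls = (X[:, 0] + X[:, 1] > 1.0).astype(int)
clf = SVC(kernel="rbf").fit(X, y_cls)
print("predicted class:", clf.predict([[0.8, 0.7]]))  # seeks a decision boundary
```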
Unsupervised learning algorithms are used less frequently. They do not rely on pre-existing material labels but instead focus on the internal correlations and differences among material features, either grouping samples (clustering) or transforming them (dimensionality reduction). Although less common, unsupervised learning performs well in grouping unlabeled samples.
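A minimal sketch of this reduce-then-cluster workflow follows, assuming synthetic "tensile curves" as stand-ins for real grain-level measurements:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic curves: 60 samples x 50 strain points, drawn from four
# hypothetical hardening levels (illustrative stand-ins for real data).
strain = np.linspace(0, 0.1, 50)
levels = rng.integers(0, 4, size=60)
curves = np.array([(200 + 400 * k) * strain**0.5 + rng.normal(0, 2, 50)
                   for k in levels])

# Reduce each curve to a few principal components, then cluster the scores.
scores = PCA(n_components=3).fit_transform(curves)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
print("cluster sizes:", np.bincount(labels))
```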
Pagan D C and colleagues[66] applied PCA for dimensionality reduction and then K-means clustering to magnesium alloy grain tensile curves exhibiting different mechanical behaviors, successfully classifying the grains into four categories according to their degree of hardening.

While individual machine learning algorithms may perform well on particular problems, in practice ensemble learning models are often used to further reduce errors and to make models less sensitive to random fluctuations in data and features. Ensemble learning is not a standalone machine learning algorithm but a technique that constructs and combines multiple learners to accomplish a learning task; it is also known as committee-based learning[67]. Ensemble methods typically train multiple homogeneous or heterogeneous base learners simultaneously, evaluate their outputs, and produce the final result either by iteratively refining the base learners or by aggregating their predictions. Common ensemble algorithms include bagging, boosting, and stacking. Aayesha M and colleagues[68] used random forests and stacking to classify the constituent phases of high-entropy alloys, reaching a final accuracy of 95%, comparable to deep learning results but with lower model complexity and computational cost.
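A minimal stacking sketch along these lines, assuming synthetic descriptors and labels rather than the physical high-entropy-alloy features of ref. [68]:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic stand-in for alloy descriptors (e.g., mixing entropy, atomic-size
# mismatch) and a binary phase label; real studies use physical features.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)

# Heterogeneous base learners whose outputs are combined by a meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```

Here the random forest and support vector machine serve as heterogeneous base learners, and a logistic regression meta-learner aggregates their cross-validated predictions into the final output.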
In the realm of deep learning, the development of neural networks has brought major technological breakthroughs to materials research. Inspired by the structure of biological neural networks, an artificial neural network is a large, interconnected system of neurons defined by weights, biases, and activation functions. Neural networks, in particular convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), perform exceptionally well when trained and fitted on large datasets. While recurrent neural networks (RNNs) and LSTMs are commonly used for sequential data such as text, CNNs, with their remarkable advantages in image processing, play a crucial role in analyzing material structure images. Pin Z et al.[69] combined images of aluminum alloy particles with a bidirectional long short-term memory network (Bi-LSTM) to extract the particle size distribution and morphology, and trained a neural network on samples with the corresponding particle information, effectively reproducing a model of mechanical behavior and texture evolution.
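A minimal PyTorch sketch of a bidirectional LSTM regressor; the layer sizes and input dimensions are illustrative assumptions, not the architecture of ref. [69]:

```python
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    """Minimal bidirectional LSTM mapping a feature sequence to one property."""
    def __init__(self, n_features: int = 8, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # 2x: forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)            # (batch, seq_len, 2*hidden)
        return self.head(out[:, -1, :])  # read off the last timestep

model = BiLSTMRegressor()
x = torch.randn(4, 20, 8)                # 4 samples, 20 steps, 8 features
print(model(x).shape)                    # torch.Size([4, 1])
```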
For the performance measure P of a model, the dataset is typically divided into a training set and a test set, and the model is evaluated via the residuals or accuracy on the test set. Various splitting methods exist, such as the bootstrap method, the holdout method, and, most commonly, cross-validation. Cross-validation divides the dataset into k equal-sized subsets and iteratively uses k-1 subsets for training and the remaining one for testing, ensuring sufficient sampling and uniform use of the data. Zheng X et al.[71] improved on this by introducing forward cross-validation, which is better suited to the development of new materials: samples are sorted by the magnitude of the target property before cross-validation, so each fold tests the model's ability to extrapolate. Tested on properties such as energy, bandgap width, and superconducting transition temperature, this approach predicted outstanding new materials more accurately than traditional cross-validation.
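A minimal sketch of standard k-fold cross-validation, followed by a simple sorted split in the spirit of ref. [71]; the forward split shown is a schematic reading of that method, not its exact procedure:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, 120)

# Standard k-fold: each subset serves once as the test set.
errors = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[tr], y[tr])
    errors.append(mean_absolute_error(y[te], model.predict(X[te])))
print("mean k-fold MAE:", np.mean(errors))

# Forward-style split (schematic): sort by target, train on lower-valued
# samples and test on the highest block, probing extrapolation ability.
order = np.argsort(y)
tr, te = order[:96], order[96:]
model = Ridge().fit(X[tr], y[tr])
print("forward-split MAE:", mean_absolute_error(y[te], model.predict(X[te])))
```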
Bryce M et al.[72], taking the development of superconducting materials as an example, compared the predictive performance of traditional k-fold cross-validation and leave-one-cluster-out cross-validation (LOCO CV) for the critical transition temperature of yttrium barium copper oxide. The results showed that, when predicting the performance of materials from different groups, LOCO CV extrapolated more accurately than traditional k-fold cross-validation. In practice, LOCO CV first uses a clustering algorithm to divide the original data into several clusters; during each validation round, one cluster serves as the test set while the remaining data form the training set. Given the non-uniformity of data in materials development, the variation in cluster sizes, and the need for strong extrapolation capability, LOCO CV offers a new approach to model selection and evaluation.
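A minimal LOCO CV sketch, assuming synthetic data and an arbitrary cluster count; scikit-learn's LeaveOneGroupOut is used here to hold out one cluster per fold:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.2, 150)

# Step 1: cluster the feature space; each cluster plays the role of a
# chemically distinct material group (the cluster count here is arbitrary).
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Step 2: hold out one whole cluster per fold -- the model must extrapolate
# to a group it has never seen, unlike random k-fold splits.
maes = []
for tr, te in LeaveOneGroupOut().split(X, y, groups):
    model = Ridge().fit(X[tr], y[tr])
    maes.append(mean_absolute_error(y[te], model.predict(X[te])))
print("LOCO CV MAE per cluster:", np.round(maes, 3))
```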