Text mining could help in the rational design of new materials

04 Jul 2019 Isabelle Dumé
Text mining. (Credit: Olga Kononova)

Discovering new materials with a particular set of properties can be a slow and inefficient process that involves countless, often “trial-and-error”, experiments by highly trained experts. Materials scientists are therefore increasingly looking to machine learning to help them with this task.

The scientific literature is a real treasure trove in this respect – scientists have been publishing papers for hundreds of years and dozens more come out every week. Papers are published as text, however, so this collective knowledge is difficult to analyse either with traditional statistical methods or with most modern machine-learning applications, which are “supervised” – that is, they must be trained on hand-labelled examples. These programmes require labelled input data (parameters that define a material’s composition, for example) paired with a particular output (the material’s electronic properties, for example), and at least several hundred materials are typically needed to construct such a training set.

Researchers at the Lawrence Berkeley National Laboratory and the University of California, Berkeley have now found that an unsupervised machine-learning algorithm called Word2vec, designed to process text and natural language, can learn important materials science concepts simply by “reading” the abstracts of over three million journal articles. The algorithm can identify unreported properties of materials in scientific papers, and this literature-mining technique could even be used to design new materials in the future, they say.

Information-dense word embeddings

The team, led by Anubhav Jain and Gerbrand Ceder, has found that information about material properties in the published literature can be efficiently encoded as information-dense word embeddings (numerical representations, or mathematical vectors, of words) without any human labelling or subsequent supervision.

Assigning such embeddings to words in the body of a text in a way that preserves their syntactic and semantic relationships is one of the main techniques in natural language processing (NLP), say the researchers. These word embeddings are usually constructed using machine-learning algorithms like Word2vec that exploit information about the co-occurrence of words in a text. When trained on a relevant corpus, these techniques should produce a vector representing the word “iron”, for example, that is closer to the vector for “steel” than to the vector for “organic”.

The researchers collected 3.3 million abstracts from papers published in the fields of materials science, physics and chemistry between 1922 and 2018. They then processed these to remove papers that were unrelated to inorganic materials science (as determined by a separate machine-learning classifier), which left 1.5 million abstracts written using a vocabulary of about half a million words.

Positioning each word

They then analysed the texts using Word2vec, which takes a large text corpus and processes it with an artificial neural network to map each word in the vocabulary to a numerical vector with 200 “dimensions” – meaning that each word is represented by a sequence of 200 numbers.
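As an illustration of what this step can look like in practice, here is a minimal sketch using the open-source gensim implementation of Word2vec. The tokenized example abstracts and the window and frequency settings are assumptions made for the example, not the authors’ exact pipeline.

```python
# Minimal sketch: train 200-dimensional word embeddings on tokenized abstracts
# with gensim's Word2vec. The two example "abstracts" are placeholders.
from gensim.models import Word2Vec

abstracts = [
    ["licoo2", "is", "a", "layered", "cathode", "material", "for", "li-ion", "batteries"],
    ["limn2o4", "spinel", "cathodes", "show", "good", "rate", "capability"],
    # ... roughly 1.5 million abstracts in the real corpus
]

model = Word2Vec(
    sentences=abstracts,
    vector_size=200,  # each word becomes a sequence of 200 numbers
    window=8,         # context window size; an assumption, not the paper's setting
    min_count=1,      # keep all tokens here; a real run would drop rare ones
    sg=1,             # skip-gram variant of Word2vec
    workers=4,
)

vector = model.wv["licoo2"]  # the 200-number vector representing this token
print(vector.shape)          # (200,)
```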

“The key idea here is that words appearing in similar contexts have similar meaning,” explains Jain. These words form clusters within the multidimensional space and Word2vec can then accurately estimate the meaning of words or the functional relationships between them based on the patterns in which the words are employed in the original text.

The researchers found that the algorithm is able to obtain word embeddings that can capture the underlying structure of the periodic table and the crystal structure of metals without being told anything about materials science. It does this by simply analysing the positions of the words in the abstracts and their co-occurrence with other words.

“We found that moving in different directions in ‘word embedding space’ corresponds to adjusting various known atomic properties such as increasing atomic number or increasing electronegativity,” says Ceder. “We are also able to use simple vector addition and subtraction of word embeddings to predict the magnetic properties, crystal structures and symmetries of some materials.”

One example: many words in the corpus represent chemical compositions of materials, and the five materials most similar to LiCoO2 (a well-known lithium-ion cathode compound) can be found by taking the dot product of normalized word embeddings – a quantity, also known as the cosine similarity, that measures how closely two vectors point in the same direction.

“According to our model, the compositions that are closest to LiCoO2 are LiMn2O4, LiNi0.5Mn1.5O4, LiNi0.8Co0.2O2, LiNi0.8Co0.15Al0.05O2 and LiNiO2, all of which are also lithium-ion cathode materials,” says study lead author Vahe Tshitoyan.
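As a rough illustration of that calculation (continuing the hypothetical gensim sketch above, rather than the authors’ own code), the dot product of two unit-normalized embeddings is their cosine similarity, and gensim exposes the ranked nearest-neighbour query directly:

```python
import numpy as np

# For unit-normalized embeddings, the dot product equals the cosine similarity.
def cosine_similarity(model, word_a, word_b):
    a = model.wv[word_a] / np.linalg.norm(model.wv[word_a])
    b = model.wv[word_b] / np.linalg.norm(model.wv[word_b])
    return float(np.dot(a, b))

# gensim's ranked version of the same query; with a model trained on the full
# corpus this is the calculation behind the five cathode materials quoted above.
print(model.wv.most_similar("licoo2", topn=5))
```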

Word associations

The embeddings also produce word associations that correspond to concepts such as “chemical elements”, “oxides” and “crystal structures”, to name but three examples. For example, they can produce solutions such as: “NiFe” is to “ferromagnetic” as “IrMn” is to “?”, where the most fitting response to “?” is “antiferromagnetic”. This result backs up observations made in the first such experiments using Word2vec in 2013.
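Such analogy queries boil down to vector arithmetic on the embeddings, roughly “ferromagnetic minus NiFe plus IrMn”. A minimal illustration with gensim, again assuming the hypothetical full-corpus model sketched above and lower-cased tokens, might look like this:

```python
# Analogy query: "NiFe is to ferromagnetic as IrMn is to ?"
# Implemented as vector arithmetic on the embeddings: ferromagnetic - nife + irmn.
# Assumes the hypothetical model above was trained on the full corpus, so that
# all three tokens (lower-cased here) are in its vocabulary.
result = model.wv.most_similar(
    positive=["ferromagnetic", "irmn"],
    negative=["nife"],
    topn=1,
)
print(result)  # the article reports "antiferromagnetic" as the best-fitting answer
```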

“Although the algorithm does not perform at 100% accuracy, the fact that it learns in an unsupervised manner is exciting,” Jain tells Physics World. “We are able to use the word embeddings of chemical elements to predict the formation energies of Elpasolite minerals, for instance, with high accuracy, which implies that the chemical knowledge of these materials is embedded in the word vectors.”
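As a heavily simplified sketch of how element embeddings could feed such a prediction (this is not the authors’ method; the compositions, stoichiometry weighting and formation-energy values below are placeholders), one could represent each elpasolite-type composition by its stoichiometry-weighted element vectors and fit an ordinary regressor:

```python
import numpy as np
from sklearn.linear_model import Ridge

def composition_features(elements_and_counts, model):
    """Stoichiometry-weighted average of the element word vectors."""
    vecs = [model.wv[el.lower()] * n for el, n in elements_and_counts]
    return np.sum(vecs, axis=0) / sum(n for _, n in elements_and_counts)

# Placeholder training pairs: composition -> formation energy (eV/atom).
# Assumes the hypothetical model above was trained on the full corpus, so the
# element symbols appear in its vocabulary.
training = [
    ((("Cs", 2), ("Na", 1), ("Al", 1), ("F", 6)), -2.9),  # illustrative value
    ((("K", 2), ("Li", 1), ("Ga", 1), ("F", 6)), -2.7),   # illustrative value
    # ... many more labelled compositions in a real fit
]

X = np.array([composition_features(comp, model) for comp, _ in training])
y = np.array([energy for _, energy in training])
regressor = Ridge().fit(X, y)
```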

Discovering new materials by identifying “gaps”

The researchers did not stop there, though: they also showed that their approach can be used to discover new materials by identifying “gaps” in the research literature on functional compounds. They did this by training their machine-learning model to predict the likelihood that a material’s name will occur alongside the word “thermoelectric” in the text. They then searched the corpus for materials that had never been reported as thermoelectric but whose names had a high semantic relationship with the word “thermoelectric”, and which might therefore be thermoelectric themselves.
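A much-simplified sketch of this gap-finding idea (continuing the hypothetical gensim model from earlier, and only an approximation of the procedure described above) would be to rank candidate compositions that never co-occur with “thermoelectric” by the similarity of their embeddings to that word; the candidate list below is purely illustrative.

```python
# Simplified gap-finding sketch: rank compositions with no reported link to
# thermoelectricity by how similar their embeddings are to "thermoelectric".
# Assumes the hypothetical model above was trained on the full corpus, so that
# these tokens exist in its vocabulary; the candidate list is illustrative only.
candidate_formulae = ["cugase2", "casi", "sncl2", "mgb2"]

scores = {
    formula: float(model.wv.similarity(formula, "thermoelectric"))
    for formula in candidate_formulae
    if formula in model.wv and "thermoelectric" in model.wv
}

# The highest-scoring candidates are the most promising "gaps" to investigate.
for formula, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(formula, round(score, 3))
```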

To test their approach, they “went back in time” and retrained their model using only abstracts published before 2008 so that they could compare its predictions with the next 10 years of actual scientific discoveries.

“We found that our model would have predicted some of the best thermoelectric materials discovered in the last decade several years in advance of their actual first report by the materials research community,” says team member John Dagdelen.

“Our findings imply that NLP algorithms can be used not only to extract knowledge that is already in a text, but also to make successful projections about properties that are not yet known. We hope that this will motivate the scientific and NLP communities to collaborate more closely and find even more ways of exploiting all the knowledge stored in the research literature.” 

Towards an analysis of full texts and “out-of-vocabulary” materials

The team, reporting its work in Nature (DOI: 10.1038/s41586-019-1335-8), now plans to train a model on full texts of scientific articles, rather than just abstracts. “We suspect that more complex NLP algorithms, such as those that are context-sensitive, will be required here,” says Tshitoyan.

Another interesting future direction is to find ways to make predictions about out-of-vocabulary materials – that is, materials not mentioned in texts at all. The approach described in this study could thus be used to unearth previously unrecognized properties of existing materials that could then be exploited in specific applications. Who knows, the next important superconductor or topological insulator may well be found using a machine-learning algorithm.
