Data-driven materials discovery | 91TV
Transcript
- Well, thank you very much for that very kind introduction. I'm not sure
- I'll live up to the expectations after that introduction. I'll do my best.
- So today's talk is on data driven materials discovery and I'll begin by explaining the
- challenge of the problem and then show you how we're trying to solve that problem, giving you
- four case studies from the energy sector, and then I'll finish with where we're going next.
- So here is a picture of Alexander Fleming very famously discovering penicillin by chance.
- Now we've come a little bit further than that over the time, but still
- actually we're discovering materials using trial and error in the main,
- and this picture here in the middle is taken from a biomedical website,
- which likened materials discovery to firing scalpels at a dartboard with a moving target.
- Now, wouldn't it be better if there was a systematic way to discover new materials?
- Of course, I feel that there's an opportunity to do so using the recent advances in big data.
- So if you're going to do data driven materials discovery, of course you need data. So let's
- think, what would be the ideal source of data for data driven materials discovery?
- Well, ideally, you would want the entire universe of all possible chemicals, molecules that could
- ever exist and for each of those chemicals, know what their cognate material properties would be,
- and that's because there's an inherent relationship between the structure,
- the molecular structure, and the function of a material. People over the years have used
- empirical means to work out what these structure function relationships are,
- and yet if we could encode those relationships, those patterns about chemical and property space,
- in a way that a computer could understand, then we could link them up to a search
- engine that we could then use to probe this chemical and property space to find patterns
- in the data that then could lead to predictions and ultimately, experimental validation of new
- materials for a given application to suit your needs. So that's the dream, but of course,
- we don't have the whole universe of all chemicals and all of their properties. So what do we do?
- Well, to a first order approximation, we do have that information, albeit in a highly fragmented
- form and that form is the scientific literature. That could be academic papers, could be patents,
- company reports and so on, but in all, there'll be somebody, well, lots of people in this audience,
- who have written a paper on one material and its properties. There'll be another
- person in the audience who's written a paper on another material and its cognate properties,
- but if we could grab all of the information from all of the documents that ever existed
- and put it all together, then the sum is greater than its parts. We could then have,
- to a first order approximation, all of chemical and property space, so that we could predict new
- materials from all of that historical data. So that's the essence of my talk and what
- we've done to do that is written a whole load of software that mines text for the chemical and
- property information and extracts it, and also image information, chemical schematic information
- and chemical reaction information. Here are four software tools, our primary ones,
- and having grabbed all of that information from the input documents that you give it,
- it also automatically compiles it into databases for you. So now you have materials databases
- that you can make for your own needs, i.e. bespoke databases for your given application
- and once you've got that compiled data, we can then essentially drive that into a design to
- device pipeline by then using the data to predict new materials, using the machine learning that
- you see a lot now, where they classify and optimise in data analytics, and that
- will find patterns in the data that lead to your prediction of a new material, and then
- you can use that prediction and go forward with lead materials, with the experimental validation.
- So that's the pipeline. I'm going to focus just for time on ChemDataExtractor, the text mining
- tool, just so you know. So given that, let's just have a look a bit further at how ChemDataExtractor
- works. So that's the text mining tool. So we've got input of scientific literature. So
- it can literally be thousands and thousands, or even millions of documents if you had them,
- and ChemDataExtractor will interrogate the documents; it will always find, as far
- as it can, the chemical molecule, wherever the chemical name or a picture of it appears,
- and it will then also grab the paired quantities of the properties you've asked for. So you ask for
- the properties you need for your particular device, whatever it is you want to make,
- and it will then grab that paired information and put it into a chemical database for you.
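A minimal sketch of that "compile into a database" step, using Python's standard sqlite3 module. The records, names, and DOI here are invented placeholders; the real tool emits structured records (including the source DOI) automatically.

```python
import sqlite3

# Sketch of storing paired chemical/property records in a small relational
# table. All values below are invented for illustration only.
records = [
    ("example dye A", "absorption maximum", 450.0, "nm", "10.0000/example.doi"),
    ("example dye B", "absorption maximum", 610.0, "nm", "10.0000/example.doi"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE properties
                (chemical TEXT, property TEXT, value REAL, unit TEXT, doi TEXT)""")
conn.executemany("INSERT INTO properties VALUES (?, ?, ?, ?, ?)", records)

for row in conn.execute("SELECT chemical, value, unit FROM properties"):
    print(row)
```

Keeping the DOI in each row is what later lets you track a record back to its original paper and author.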
- How does it work under the hood? Well, this is best explained by example.
- It's actually chemistry aware natural language processing for those in the field,
- and what it does is it takes every sentence from every document of all those thousands of
- documents and it does this: "Figure 2 shows the UV-vis absorption spectra of 3A (red)
- and 3B (blue) in acetonitrile." A fairly typical sentence in science,
- and what it does is it takes the sentence and splits it up into its constituent parts. The
- words, the numbers, the punctuation, and so on. It then assigns a grammar to each of those parts. So
- figure is a noun. Two is a cardinal digit or decimal. Spectra is a noun, but it's plural,
- and acetonitrile is what they call a chemical mention. So this is the chemistry aware logic
- coming in. Acetonitrile is probably the solvent in this. You can probably read that as a human,
- but of course the computer has to think harder and use logic to work that out.
- Having assigned a grammar to that sentence, it then turns that sentence into a hierarchical tree.
- So it's a figure and it's figure number two in that paper. It's a figure about a spectrum and the
- type of spectrum is a UV-vis absorption spectrum, not for example, a UV-vis emission spectrum,
- and it's a spectrum of something called 3A and something called 3B and acetonitrile.
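The tokenise-and-tag steps just described can be sketched in a few lines. This is a toy stand-in, not ChemDataExtractor's real machinery: the little lexicon and the tag names are invented for illustration.

```python
import re

# Toy lexicon standing in for a trained, chemistry-aware tagger.
# "CM" marks a chemical mention, as in the talk's acetonitrile example.
TOY_LEXICON = {
    "Figure": "NN", "shows": "VBZ", "the": "DT", "UV-vis": "JJ",
    "absorption": "NN", "spectra": "NNS", "of": "IN", "and": "CC",
    "in": "IN", "acetonitrile": "CM",
}

def tokenize(sentence):
    """Split a sentence into words, numbers, labels and punctuation."""
    return re.findall(r"[A-Za-z][\w-]*|\d+[A-Za-z]?|[^\w\s]", sentence)

def tag(tokens):
    """Assign a part-of-speech (or chemical-mention) tag to each token."""
    tagged = []
    for tok in tokens:
        if tok in TOY_LEXICON:
            tagged.append((tok, TOY_LEXICON[tok]))
        elif re.fullmatch(r"\d+", tok):
            tagged.append((tok, "CD"))       # cardinal digit, e.g. the figure number
        elif re.fullmatch(r"\d+[A-Za-z]", tok):
            tagged.append((tok, "LABEL"))    # compound label like 3A
        else:
            tagged.append((tok, "PUNCT" if not tok[0].isalnum() else "NN"))
    return tagged

sent = "Figure 2 shows the UV-vis absorption spectra of 3A and 3B in acetonitrile"
print(tag(tokenize(sent)))
```

The real tool's grammar then builds the hierarchical tree out of these tagged tokens.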
- Now, very crucially here you have to look at 3A and 3B because we don't know what 3A and
- 3B are yet. So if you put that all out into a database at that point, you would lose all of
- the chemical name information and it would be a totally useless database. So what you have to do
- before you leave the document is resolve what the labels mean, and usually in a scientific document,
- if you track back in the paper which the computer then does, it will usually find
- typically the first instance of 3A, and you'll see just before it a long chemical name followed by
- "(3A)", or maybe 3A will be in bold or something like that. So that if you train
- the computer to search for that, then you can resolve what 3A and 3B are and then put that in,
- and then you can pull out the right chemical information and its cognate property information.
- So that's why it's chemistry aware natural language processing.
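That label-resolution step can be sketched as a simple backwards search for a "long chemical name (3A)" definition pattern. The real logic is more elaborate, and the document text and chemical name below are made up for illustration.

```python
import re

def resolve_labels(document_text, labels):
    """Map compound labels like '3A' to the full chemical name that defines them,
    by finding the first '<name> (3A)'-style definition in the document."""
    resolved = {}
    for label in labels:
        m = re.search(r"([A-Za-z0-9,'\-\(\)\[\] ]+?)\s*\(" + re.escape(label) + r"\)",
                      document_text)
        if m:
            resolved[label] = m.group(1).strip()
    return resolved

doc = ("4-(dicyanomethylene)-2-methyl-6-(4-dimethylaminostyryl)-4H-pyran (3A) "
       "was synthesised first. ... Figure 2 shows the spectra of 3A and 3B.")
print(resolve_labels(doc, ["3A"]))  # maps the label back to the full name
```

Without this resolution, the extracted records would only contain the meaningless labels 3A and 3B, which is why the talk stresses doing it before leaving the document.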
- So there's this technology here and then we're using let's say standard machine learning these
- days for data analytics. The other actually pretty hard thing is once you get to the
- experimental validation, and I'm not going to talk so much about the experimental work today,
- but I do want to give you an overview of it because it often involves complex facilities
- and so I thought I'd give you a glimpse of the sites and facilities that we do experiments at.
- So what we're going to do now is we're going to go on a fly-by of the UK's neutron and muon facility.
- Okay, so there's a glimpse of the experimental world that mostly I won't be able to talk
- about for time reasons,
- but hopefully that gives you some idea. So we've talked now about the challenge. We've talked about
- the technology. So let's see how we apply it. I'm going to show you now four case studies,
- all taken from the energy sector to see if we can discover new materials. I'm going to
- start with solar, the sun. Here's a picture of the sun in Cambodia just for your amusement.
- I want to discuss the idea of how we can apply ChemDataExtractor to discover new
- light absorbing materials for photovoltaics. So irrespective of the type of photovoltaic device,
- all of them need some form of light absorption, right? So I want to think about the underpinning
- problem at the molecular level to do that. So let's look at this graph. So the black jagged
- edge here, that's the solar emission spectrum. So there would be kind of your visible light,
- as a function of wavelength. So that's what you see as a spectrum from the sun,
- and what you want in terms of grabbing all of the photons under the area of that curve,
- that's your goal, is a light absorbing material that ideally would be something like that
- green line that pretty much grabs all of the area under that curve, so all of the photons.
- Now there's no one material that really does that. So that's a problem, but people in the device
- technology often combine say two different types of molecules, say one that absorbs in the blue
- and then one that absorbs in the red, and as a convolution of those two peaks, you will get more
- or less the green, right? That's the concept. So people often do that in the device world.
- So let's set ourselves the problem then, of making a database using ChemDataExtractor that finds the
- underpinning molecular property information, say the wavelength maximum, so see where the
- peaks are of these. We're also going to take, if it's present, the absorption information,
- the intensity here, which is called the extinction coefficient, and we're also going to,
- of course, take the material, the chemical name, as well. So that's the database we're going to build,
- material property one, property two. So we build that with ChemDataExtractor and that at the time
- was just under 10,000 chemical molecules and their corresponding properties. Just all grabbed from
- the scientific literature, and our goal is to get from 10,000 possible light absorbing materials all
- the way down to, say, five lead candidates that we can take forward for experimental validation,
- and the way you do that is to apply this design to device pipeline, like so:
- we have to ask really carefully chosen questions that sequentially filter out from 10,000 all the
- way down to just a few. So the first question we asked, for example, was to remove all things
- containing metals. We wanted organic materials for environmental regulations. That actually made our
- case quite hard. Nonetheless, that's what we did. So that went down to just 3000. That's already a
- big jump and you want quite big jumps early on so you don't end up asking far too many questions.
- Then we ask a question that's relevant to a device, for this particular type of photovoltaic
- technology, we knew that in the device we wanted molecules that contained a carboxylic
- acid group and that's because we knew that the interface between the light absorber and the
- semiconductor that makes the working electrode as a composite was working particularly well when we
- had carboxylic acid groups in our light absorber. So we fixed that filter and then you see an order
- of magnitude reduction from 3000 to 300 more or less, and then we go in: we wrote
- a mathematical algorithm that finds the optimum combination of those possible 300 you've got left,
- to find, if you like, the best combination of blue absorbing and red absorbing, or at least extremes.
- Then you go down to another order of magnitude to about 30 possible light absorbing materials
- and at that point, 30, you can actually go in manually and do things. It's a manageable number,
- and actually what we did then is we performed some electronic structure calculations, so computation,
- on all of those 33 to check what we call the energetic alignment of our light absorbing
- materials within the device. So this is the idea that we can predict all these new light absorbing
- molecules but that's all in isolation. We've got to think about the device. You've got the
- electrolytes. You've got the other electrode. We've got to check that the energetics line up
- so that you get the right driving voltages. So we did that and that brings it down to essentially a
- handful left, in our case, five lead candidates to go forward for experimental validation,
- and here are the five. One good thing about the ChemDataExtractor type of
- technology is because we've mined the data from the scientific literature,
- the people who originally made these materials, we can contact, because we can track back. We always
- keep the DOI as we go through so we can track back and find the author for correspondence,
- and all of these molecules, by the way, were made purely for scientific curiosity, synthetic
- curiosity. So none of the authors had any idea that they might be applied for photovoltaics. I wrote to them,
- or I emailed them actually, and I said, hey, you know, we think your material might be useful for
- photovoltaic application. Do you still have some or could you remake it and send it to us and we'll
- put it into a device and test it for you, we'll do this as a collaboration so everybody wins.
- They all said yes. They all sent materials and we put them into devices, and this is what we got.
- This is a graph of voltage versus current density, for the experts in the room,
- remember this is our lab work, but it's the relative differences that matter.
- The black curve is an important reference. So the black curve actually corresponds to the industry
- standard. Now, the industry standard actually is a metal organic material, right? So remember we
- said we wanted only organic materials, but as an industry standard we'll reference it to the
- metal organic one, and that's one of the best performing ones you can get. So if you can get
- your curve and your testing to get anywhere close to that black line, you're doing well.
- So I'm going to focus on XS6 and 15. They were the original labels in the original publications.
- If you test them just on their own in a device, you'll get the red and the blue curve for those
- two. So voltage is okay but the current density is not so great, because you're not close to this
- black for example, but of course the logic was to put the two together, the blue absorbing and the
- red absorbing, and so when you put them together, you get the mauve line, which actually is pretty
- close to the industrial standard but we've only got organics in our system, right? So we were
- pretty happy about that, actually. It gets 92 per cent of the power efficiency of the industry standard.
- Other people were happy, it got onto the front cover of Advanced Energy Materials Journal,
- and so there's an example of data driven materials discovery in action.
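The sequential filtering funnel from this case study can be sketched as a chain of list filters. The records and the helper predicates below are entirely invented stand-ins; the real pipeline ran over roughly 10,000 mined dye molecules.

```python
# Illustrative sketch of the filtering funnel: organic only, must carry a
# carboxylic acid anchoring group, then pair the bluest and reddest absorbers.
METALS = {"Ru", "Fe", "Zn", "Cu", "Pt"}  # tiny illustrative subset

def contains_metal(record):
    return any(el in METALS for el in record["elements"])

def has_carboxylic_acid(record):
    return "COOH" in record["groups"]

candidates = [
    {"name": "dye-1", "elements": {"C", "H", "N", "O"}, "groups": {"COOH"}, "lambda_max_nm": 450},
    {"name": "dye-2", "elements": {"C", "H", "Ru"}, "groups": {"COOH"}, "lambda_max_nm": 530},
    {"name": "dye-3", "elements": {"C", "H", "N", "O", "S"}, "groups": set(), "lambda_max_nm": 620},
    {"name": "dye-4", "elements": {"C", "H", "N", "O"}, "groups": {"COOH"}, "lambda_max_nm": 650},
]

organic = [r for r in candidates if not contains_metal(r)]       # filter 1
anchored = [r for r in organic if has_carboxylic_acid(r)]        # filter 2
pair = (min(anchored, key=lambda r: r["lambda_max_nm"]),         # bluest
        max(anchored, key=lambda r: r["lambda_max_nm"]))         # reddest
print([r["name"] for r in pair])  # prints ['dye-1', 'dye-4']
```

Each filter is cheap and interpretable, which is why big reductions early on (10,000 to 3000) matter so much.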
- Before I leave this, the solar topic, I just want to touch on one other thing,
- which isn't discovery per se, but I think it's relevant that you can also use ChemDataExtractor,
- the text mining tool, to help with manufacturing, because what you can also do is mine existing
- things that people know about. So why would you do that? For example, you can make nice histograms.
- So here's the open circuit voltage, the current density, the power conversion efficiency of all
- the known materials for two types of photovoltaics, perovskite solar cells and dye
- sensitised solar cells. There's more data on this one because it's a much more mature technology. Nearly
- 200,000 data records on perovskite solar cells. So that's what's known, right? This isn't discovery at this
- point, but it's not just the device properties and the molecular properties you could mine,
- but you could also grab information about the actual manufacturing process. That might be
- quite important to you to know what type of solar simulator was used, the testing device,
- or the active area of the test sample that you used for the solar cell. That type of information
- can be very valuable if you want to optimise the manufacturing process. So there are sort of side
- uses of ChemDataExtractor as well. So that's the first case study. I now want to move to the
- second case study and we're going to look at heat. This is a picture of a volcano erupting
- in Guatemala. I took the photo from a safe point here. You're laughing but wait till you
- see what happens next. When it erupted, I was actually there on the ridge. This was a video
- I took. You don't ever want to be closer than that, by the way. That's just really
- to wake you up and show you some heat. Anyway, I'm not going to do volcanic applications,
- but I am going to do thermoelectrics, which also gives me an excuse to play with Lego.
- I made a little thermoelectric device which I'll pass around in a second, but just to explain then
- how it works for those who are not in the domain area, what you have now here is a case where you
- have a cold surface on top and a hot surface on the bottom, and therefore you have
- a thermal gradient. I'm going to super simplify this - sorry, all physicists in the room.
- So just like in the atmosphere, if you have a thermal gradient, you get the convection currents
- that cause all the weather, and here you have electronics essentially creating a convection
- current, in this case with thermocouples, the P and N semiconductors. So you can create electrical
- currents, and therefore a voltage is driven across, because you've got the thermal gradient
- stimulating it from the top. So then, because the thermocouples are all in series, that
- creates an electrical circuit. So thermoelectric. So I'll just pass these around. I've got two
- because I obviously got particularly happy playing with Lego.
- Please send them back to the front, the people at the back, because otherwise the Rutherford
- Labs will kill me for stealing all their Lego. I promised the visitor centre I'd send it back.
- So that's how a thermoelectric material works, and in equations,
- here is the figure of merit. So that's ZT. That's just to understand it as your proxy,
- high ZT means a high performing thermoelectric material. S, Seebeck coefficient,
- that's just a coefficient, and sigma is your electrical conductivity. You want that high,
- right? It's thermoelectric. The thermal conductivity, kappa, you want low, and that's because
- you want to keep that thermal gradient without interfering with the electronics. You want the
- temperature, T, high.
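Putting those pieces together, the figure of merit described here is ZT = S² σ T / κ. A tiny sketch, with rough textbook-scale numbers for a bismuth-telluride-like material (illustrative, not mined values):

```python
def figure_of_merit(seebeck_V_per_K, sigma_S_per_m, kappa_W_per_mK, T_K):
    """ZT = S^2 * sigma * T / kappa.
    High Seebeck coefficient and electrical conductivity, low thermal
    conductivity, and high temperature all push ZT up."""
    return seebeck_V_per_K**2 * sigma_S_per_m * T_K / kappa_W_per_mK

# e.g. S = 200 uV/K, sigma = 1e5 S/m, kappa = 1.5 W/(m K), at room temperature
print(round(figure_of_merit(200e-6, 1.0e5, 1.5, 300), 2))  # prints 0.8
```

The competing demands (high σ but low κ) are exactly why good thermoelectric materials are hard to find.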
- So if we can then use ChemDataExtractor to mine all of these properties, plus something called
- the power factor, then we get this distribution. We've literally just published this, I think,
- eight days ago. So it's a new database, and just out of interest, you see all these spikes,
- you're probably wondering what they are, right? These are real experimental data.
- It's not an artefact. It's actually really real. What they are is actually rounding of people's
- results. People will quote things to, say, one decimal place. So then you get things spiking at
- 1.0, 1.1, 1.2, 1.3, 1.4, etc., and of course, we present the data faithfully. We could, of course,
- average it and put sort of Gaussians over each of those spikes, because probably they mean it was a
- bit more and a bit less than 1.0, but this is real data, right? So that's why you see spikes,
- and you can do things like look at the highest ZT material, right? People may know that anyway.
- You may say, well, okay, but what else can you do with this? You can forecast with it. So you could
- take the average per year of publication, remember they come from academic papers,
- and you can track the average ZT, the figure of merit, and then you can say, well, roughly,
- if that was a linear trend, by 2052, the average ZT that would be reported in papers would be 1.5,
- for example. If you're thinking, well, should I work on thermoelectrics? Well,
- if this is the year of publication and the number of records that have been occurring
- is going up like this, it's probably quite a good field to go into right now.
- So the point is that it's also good for helping you with forecasting.
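The forecasting idea is just a straight-line fit to the average ZT per publication year, extrapolated forward. The yearly averages below are invented for the sketch, so the extrapolated number is purely illustrative.

```python
# Ordinary least-squares fit of a line y = m*x + b, written out by hand.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

years = [2000, 2005, 2010, 2015, 2020]   # invented yearly averages, for shape only
avg_zt = [0.60, 0.70, 0.78, 0.92, 1.00]

m, b = fit_line(years, avg_zt)
print(round(m * 2052 + b, 2))  # extrapolated average ZT in 2052
```

The same trick applied to the record count per year is what suggests the field is growing.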
- So that's heat. Now let's look at hydrogen. There's been a lot of media interest, of course,
- that people have seen with hydrogen fuel cells for example, as a technology, and
- therefore you need a way of producing hydrogen. People mostly produce it using water
- and they do something called water splitting, which unsurprisingly means you split water into
- its constituent parts, hydrogen and oxygen. It's good because you generate what you need,
- the power, the supply of hydrogen, but you also generate a clean side product, oxygen, which is
- also helpful. So no problem there, and there'll be a catalyst that may help. So this technique
- is therefore called water splitting. This is a picture of me water splitting for no reason.
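The splitting reaction itself is 2 H₂O → 2 H₂ + O₂, and a quick sketch can confirm the atom balance that makes oxygen the clean side product:

```python
# Sanity check of the water-splitting stoichiometry: 2 H2O -> 2 H2 + O2.
def atom_count(species, coeff):
    """Multiply a species' formula (element -> count) by its coefficient."""
    return {el: coeff * n for el, n in species.items()}

def combine(*sides):
    """Sum atom counts over the species on one side of the reaction."""
    total = {}
    for counts in sides:
        for el, n in counts.items():
            total[el] = total.get(el, 0) + n
    return total

lhs = atom_count({"H": 2, "O": 1}, 2)                # 2 H2O
rhs = combine(atom_count({"H": 2}, 2),               # 2 H2
              atom_count({"O": 2}, 1))               # O2
print(lhs == rhs)  # prints True: the equation balances
```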
- Here is now the question, how do we apply ChemDataExtractor to help?
- Well, I mean, there's no point in mining for water, or something like that, but we could mine,
- for example, for all catalysts that had ever been produced for water splitting,
- and unsurprisingly, we are making a database or we've made one, haven't published it yet,
- of all catalysts for water splitting. So that may help. Thinking about optimising things,
- finding the right catalysts for the most efficient process of water splitting,
- but there's another use that we could work with, which is thinking about the hydrogen.
- Once you've made hydrogen, it's a gas, right? So we've got to find a way of storing it.
- Now, what you would really want actually, is to condense it into a liquid form
- so that you could store it in a nice, small, contained little vessel. Now, there's not really
- a good way of storing hydrogen in the liquid form, partly because the liquefaction temperature for
- hydrogen is about 21ish kelvin, that's -253 and a bit degrees centigrade. So that's really cold,
- right? So how do we find a refrigerant that will cool the hydrogen down, put it
- in its liquid phase, and keep it cool, and we're going to say can we find a material like that?
- What we're going to use for this purpose is we're going to try and discover a material
- that's a magnetic refrigerant. So this is the idea that some materials will be heated
- up when you apply a magnetic field to them, and when you take the field away, they will cool down.
- So that can be a way to potentially make a sort of fridge. This is just my little avatar,
- if you like, of magnetic refrigerants. It's not really like a fridge, but to
- give you an idea. So what are the properties that make a magnetocaloric effect material?
- There are these three. Don't worry, it's just equations. We have the relative cooling power,
- RCP, as one of the parameters that govern the properties, along with the change in entropy and the
- change in temperature. So just remember, we have three parameters, if you're not a scientist, and
- what we're going to now do is then do our ChemDataExtractor thing. We're going to mine
- the literature and find all materials that have reports of any or all of those three properties,
- and we'll see what we get. So if we do that, there aren't actually that many, but there's just under
- 3000 materials, and each of them has some or all of the individual properties. TC, by the way,
- is the Curie temperature. That's because we wanted to find permanent magnetism,
- right down at that temperature, and so that's what we got. Actually in that sense,
- well, we got this database, but that didn't give us a brand new material suitable at the
- liquefaction temperature to store hydrogen, but that's okay, because what we can do, having got
- the database, we can at least make a regression model because we've got data now, right,
- with structure and property information, and so we can apply machine learning and make a regression
- model to link them, i.e. make a structure-function relationship for magnetocaloric effects. So that
- means that if we come in with a new material, then we can use this regression model to predict
- the magnetocaloric effect properties. That's okay. So now we need new materials, right? So
- how do we get new materials? Well, we turn to our machine learning world. Super geeky moment coming
- up. We apply something called a conditional deep feature consistent
- variational autoencoder with U-Net and crystal graph convolutional neural nets architecture.
- You'll be glad to know I'm not going to explain what that is. So what you essentially do is, it's
- a generative algorithm that essentially creates new 3D crystal structures, hypothetical ones,
- that could then be fed into this prediction. I won't really go into
- any detail, but if anyone wants to speak to me afterwards, I'll happily explain it to them.
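At its core, though, the sampling step works roughly like this sketch: encode a structure into a latent vector, perturb it with Gaussian noise whose standard deviation controls how different the outputs are, and decode back. The encoder and decoder here are crude placeholders (a real model learns them from thousands of structures), and the seed structure is invented.

```python
import random

random.seed(0)  # reproducible sketch

def encode(structure):
    """Placeholder encoder: flatten fractional coordinates into a latent vector."""
    return [c for atom in structure for c in atom]

def decode(latent, n_atoms):
    """Placeholder decoder: reshape the latent vector back into coordinates."""
    return [tuple(latent[3 * i: 3 * i + 3]) for i in range(n_atoms)]

def generate(structure, std, n_samples):
    """Perturb the latent vector with Gaussian noise; larger std, bigger changes."""
    z = encode(structure)
    samples = []
    for _ in range(n_samples):
        z_new = [v + random.gauss(0.0, std) for v in z]
        samples.append(decode(z_new, len(structure)))
    return samples

seed_structure = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]  # a simple cubic-like motif
candidates = generate(seed_structure, std=0.05, n_samples=3)
print(len(candidates))  # prints 3
```

The energetics screening described in the talk is what then throws away the physically unrealistic outputs.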
- Imagine you have - in fact, the imaging world kind of does this. You've probably seen on the internet,
- you have a picture of a human face and then it morphs into some other human face,
- right? It's a bit freaky sometimes. It's not quite like that, but it's sort of like that. In essence,
- you can - because all a molecular structure is, or at least the way you can
- treat it, is an image, just these three dimensional coordinates, right? Don't worry
- about bond lengths and bond angles and all those things. Just see it as an image, and if you treat
- it like that where you can make essentially a data distribution that's representative of a molecular
- structure or crystal structure in this case, and then you feed in something that's close
- to your target material. So for example, with magnetocaloric effect materials, we knew that say,
- Heusler type, so a type of structure like Heusler or perovskites or cubic materials tend to make
- good magnetocaloric effect materials. So we could say, well, let's take one of those that
- is known and then just sort of send it through this kind of what they call the latent space,
- and then it goes through this sea of sort of representative data distribution, and then it will
- sample it subject to a standard deviation; if it's a big standard deviation,
- it will allow the input structure to be perturbed a lot. If it's a small one, it won't be perturbed
- so much, and then it will output a whole load of new hypothetically generated crystal structures,
- most of which, by the way, might be totally unrealistic, because it doesn't know anything
- about what's realistic as a bond geometry. But that's okay, because a lot of those will then
- get screened out because you check whether the energetics make any sense, and if they don't, you just throw
- them away. So we've now got this way of making hypothetically generated crystal structures
- and once we've checked them for energetics, they could possibly be real and they're different from
- the input you put in by enough that you're happy that that's different and new. So then with these
- new materials, you do some further screening to check a few things. You check it's ferromagnetic,
- for example. You could check its phase stability, that sort of thing, and again, you have this kind of
- inverse pyramid pipeline that goes from, say, 1000 down to 30, and then with those lead materials,
- you can then apply that regression model that I talked about, that we got from the experimental
- data with ChemDataExtractor, because that links structure and function. So we can take now our new
- hypothetical molecules - sorry, crystal structures - and predict the magnetocaloric effect properties,
- and once you've done that you've got a prediction of a new material, and then we have to seek out a
- way to synthesise it, and if we can do that, we can experimentally validate
- that material as a magnetocaloric effect material, i.e. a magnetic refrigerant. So here are the results. The
- blue are the known ones from the literature that we grabbed, and these are two of the properties,
- the relative cooling power on the vertical axis and the Curie temperature on the horizontal.
- So that's the things that were known in blue. The red ones are predictions.
- Brand new materials. One of these is this one, and if you know your periodic table, you'll
- know that Pm happens to be the one lanthanide element that's radioactive. That's annoying,
- isn't it? So ChemDataExtractor doesn't know anything about safety, by the way.
- We can't do that but we could substitute out the lanthanide, maybe, right? We were quite keen
- because that was quite a nice temperature range for the liquefaction temperature of
- hydrogen otherwise, but you could also go down here and pick something slightly different.
- So we're now thinking about making, or trying to make, that, and here I
- am at the neutron and muon facility, waiting with my sample stage ready for it to go on,
- but we haven't quite synthesised it yet. So that's really all, I have to say, about hydrogen,
- where we're up to on our discovery and from this project, but also just so you know, with other
- predictions, we could think about room temperature type magnetic refrigerants, because they could be
- useful for a different type of fridge. We've got predictions there which we're also trying to make.
- So that's number three out of four. Now we turn to batteries,
- and this is a picture of a mountain where you've had an electrode in the
- sky and an electrode on the ground and you've had an electrical discharge. Obviously,
- that's just lightning. This mountain has been actually struck by lightning and it's on fire.
- It's a few years ago now. Does anybody know where it is? Is it obvious to people?
- That's interesting. I'm having lots of shaking heads from scientists as well. Here it is
- bigger. Any further thoughts from the scientists, particularly the neutron and synchrotron people?
- Yes, I think I heard it's Grenoble. That's the mountain here, that's on fire,
- and this is the European neutron facility, and this is the European Synchrotron Radiation
- Facility next door to it. Anyway, that was just an excuse to show you the European facilities.
- So nature's form of electrical discharge, but let's now think about batteries.
- So we're going to do - you're probably getting quite bored of hearing this by now - we're using our
- ChemDataExtractor text mining tool to mine the literature, in case you haven't got the message
- by now, and in this case we're going to mine materials and the device information. So we're
- going to mine the capacity, the conductivity, the voltage, the energy, the coulombic efficiency,
- and we get nice distribution graphs like that. Again, you see this weird rounding effect with
- the unitary things coming out, again because it's real experimental data. Now in this case,
- we grabbed all the material and device property information, but what the computer couldn't do
- or what ChemDataExtractor couldn't do, was distinguish between whether the material was
- an anode, a cathode, or an electrolyte. It got lots of materials and all the device properties,
- but it couldn't classify which ones they were because that was never really
- programmed into ChemDataExtractor. So we then thought, well, how are we going to do that?
- So what we did is, having got that database of nearly 300,000-ish data records from
- ChemDataExtractor, we went back and we sort of reverse engineered it. So we had a scientific
- corpus that we fed in to that ChemDataExtractor process, and we took from it the papers that had a
- successful hit from ChemDataExtractor, i.e. things that contained battery information, to make our
- database, and then we siphoned off that corpus. Now we've got a very battery rich corpus of data
- and we then fed it into a different pipeline, and let me explain what the pipeline does. So this
- is our battery rich data corpus and we then are going to make something called a data model. So
- this is a different way of mining data, because this approach actually comes
- from a technology that's come out of Google AI. In fact, it's often the basis behind
- Google's search engine, and what you do is you essentially take those sentences that
- I talked about. You remember how ChemDataExtractor works, where you have these
- sentences and then you pull them apart and all these things, but instead of doing that,
- what you do is you make vector representations of them, right? So imagine a phrase like the cat
- sat on the mat. I'm going to write down the cat sat on the mat two times identically. Now you'll
- know that the subject and object are the cat and the mat, right? So that has a high correlation
- between the cat and the mat. So you can build a bipartite graph. So you can link those two with
- a high correlation, and so you can start to get contextual information from that sentence
- by essentially relating the information that you know is highly correlated together.
- The computer can do that by making this vector representation and having made the
- vector representation, it will then feed it into a neural net, a very deep learning neural network,
- which then subject to all sorts of weightings, will train all of that with all the sentences that
- you put in. So in some sense, it's different from ChemDataExtractor in all sorts of ways,
- I suppose, but in one particular way, which is that instead of extracting the chemical and
- property information, you keep all of the sentence information, all of the corpus, all of the words,
- everything, and you put it into a model, and I use that word very carefully. So now what we've made,
- we've trained a network to build a model of all of that corpus with all of the context of the
- information from the sentences, and so it's now what they call a language model, right?
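The "words in similar contexts get similar vectors" idea behind these representations can be illustrated with a toy co-occurrence sketch. This is a deliberate simplification of the deep network described in the talk, not the actual Google technology; the tiny corpus and function names are invented for illustration:

```python
def cooccurrence_vectors(sentences, window=2):
    """Build toy word vectors: each word's vector counts how often
    every vocabulary word appears within `window` positions of it."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[words[j]]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors(["the cat sat on the mat",
                             "the dog sat on the rug"])
# "cat" and "dog" occur in near-identical contexts, so their vectors
# come out far more similar than, say, "cat" and "on".
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["on"]))
```

A transformer language model learns far richer, contextual versions of these vectors, but the underlying intuition, that context determines meaning, is the same.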
- I've simplified a bit, but that's basically what you're doing. Now that's a really important
- distinction between that and a database. A database is a very static thing which you look
- at and you can do analytics on, but now we have an interactive model. That's why I'm calling it a
- model and the way to think about it in my head, at least, is to think about it as your computer. So
- you have a motherboard, and that's the core kind of operations of your computer,
- and you have lots of peripherals, like your mouse, your printer, your keyboard and so on.
- So these are peripherals and you can put things together, and the mouse has a different function.
- So I can plug that in and make my motherboard do different things, and it's kind of like that
- as a sort of concept and framework. So we've got this model. This data model, this language model,
- and I'm now going to ask, I want to ask questions of that model because it's interactive,
- I can do that. So I want to say, of course, is it an anode, is it a cathode or is it an electrolyte
- from the material standpoint? So what I do is I make up a whole load of questions with
- designed answers, like: what is the anode? What is the cathode? What is the electrolyte? And so
- on. I can make more complex sentences, of course, and so you make another database just of questions
- and answers and you make it for the domain of interest, the materials domain of interest, the
- battery domain, and then you can also mix that up with lots of generic question-and-answer pairs,
- just about normal English language, and then you put it all together and then you can ask the model
- questions. So in this case, we want to know what the anode, cathode, and electrolytes are as a
- classification. So we asked those questions about that and then we grab the answers, and
- then we can then classify whether the materials that went in the original database are an anode,
- cathode, or electrolyte. Hopefully that makes some sense. This is what you get. You get what
- you'd expect from what is the anode, what is the cathode and what is the electrolyte.
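The question-answering classification step described above can be sketched as follows, with a crude pattern-matcher standing in for the trained language model. The materials, sentence, and function names are all invented for illustration; a real system would query the model fine-tuned on battery question-and-answer pairs:

```python
import re

def toy_qa(question, context):
    """Crude stand-in for a trained question-answering model: it looks
    for '<answer> as the <role>' in the text and returns the answer
    with a made-up confidence score. A real system would query the
    trained language model instead."""
    role = question.lower().rstrip("?").split()[-1]   # e.g. "anode"
    match = re.search(r"(\w[\w\-/()]*) as the " + role, context,
                      re.IGNORECASE)
    if match:
        return {"answer": match.group(1), "score": 0.9}
    return {"answer": None, "score": 0.0}

# Invented abstract-style sentence, purely for illustration.
context = ("The cell used graphite as the anode, LiCoO2 as the cathode, "
           "and LiPF6 as the electrolyte.")

for q in ["What is the anode?", "What is the cathode?",
          "What is the electrolyte?"]:
    r = toy_qa(q, context)
    print(q, "->", r["answer"], "(score", r["score"], ")")
```

Because the real model is probabilistic, each answer comes with a genuine confidence score rather than the hard-coded one here, and that score is what lets you decide how far to trust each record in the resulting database.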
- So that's the classification, and we can go one step further than that with this data model
- concept and this is really a paradigm shift I think between that and databases. We can not
- only have a question and answering module, but we could also make a natural language processing
- module, a text mining module. So what if we actually blended ChemDataExtractor, the chemistry-aware
- natural language processing, or text mining, tool with this new technology from Google AI and make
- BatteryDataExtractor? You can see I haven't got much imagination with my names, by the way.
- So this then, just published like eight days ago or five days ago or something, I get lost, sorry.
- This is the first property specific text mining tool for autogenerating materials databases.
- We've done it on batteries and as I say, it blends two different technologies to hopefully get the
- best out of both worlds. It's also good, I think, because it's got a probabilistic nature about it,
- because it's a model, you can actually get confidence scores. You can actually
- get probabilities for each data record that you obtain, to know how likely the answer to the
- question I ask is to be right. Now that's really important because, you know, how does somebody know whether
- they should trust my database that I've got that was autogenerated with ChemDataExtractor,
- right? I mean, we do experimental validation, obviously, but you know, to get that trust,
- we need to think about real probabilistic confidence scores for all of the data records.
- So I think that's a good thing. The fourth thing there is that we have now
- a new way to interrelate material and property text during data extraction.
- So imagine you've got, this is a really big problem by the way, imagine you've got a
- scientific paper and you want to mine the chemical and property information, but let's say the
- chemical name is on page one of the document and the property information is say on page five, and
- in between on pages two, three and four, we've got a whole lot of other chemical names and a whole
- lot of other property type things. How do you know that the chemical name on page one and the
- property on page five are related, not something in between was related to either of them, right?
- So what you can do because you have a model, not a database anymore, you can ask, if you're careful,
- the right sequence of two questions that, by definition, interrelate them by essentially using
- a sort of analogue version of Bayesian statistics. It's a conditional probability, so you'll see how
- it works. Let me give you an example. So let's imagine I want to find a material with a voltage
- of two volts, just for argument's sake. So I'm going to ask two questions in this sequence,
- what is the value of the property name? Property name is voltage. What's the value of the voltage?
- And you'll get an answer. You're just dealing with the property at the moment. So I said the
- answer was two volts. So it can find that because that's easy. It's just looking for one area in
- the document, but then we say which material - we want that link between material property - which
- material has a property name, the voltage, of the answer to the previous question? So there's your
- conditional. That's your Bayesian bit, and the answer to the previous question was two volts. So
- the question becomes: which material has a voltage of two volts? And that puts the material and the
- property into relationship. So it's a sort of geeky, very geeky way of doing Bayesian statistics.
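The two-question conditional trick can be sketched the same way, again with a toy pattern-matcher standing in for the trained model; the sentence, values, and function name are invented for illustration:

```python
import re

def ask(question, context):
    """Toy stand-in for querying the trained language model."""
    # Question 1: find the property value on its own.
    if question == "What is the value of the voltage?":
        m = re.search(r"(\d+(?:\.\d+)?\s*V)\b", context)
        return m.group(1) if m else None
    # Question 2: find the material tied to that specific value.
    m = re.match(r"Which material has a voltage of (.+)\?", question)
    if m:
        volts = re.escape(m.group(1))
        hit = re.search(r"(\w[\w\-/()]*)\s+(?:shows|has|exhibits)[^.]*?"
                        + volts, context)
        return hit.group(1) if hit else None
    return None

# Made-up sentence for illustration (not a real measurement).
context = "LiFePO4 shows a stable plateau at 2 V in this cell."

# Step 1: ask about the property alone.
value = ask("What is the value of the voltage?", context)

# Step 2: fold that answer into the second question, which is what
# ties the material to the property (the 'conditional' step).
material = ask(f"Which material has a voltage of {value}?", context)
print(value, "->", material)
```

The key point is the chaining: the answer to the first, property-only question is substituted into the second question, so the material returned is conditioned on the property value, which is what interrelates text that may sit pages apart.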
- So that was published recently. It got on the front cover of Chemical Science,
- which is the Royal Society of Chemistry's flagship journal.
- So leading on, this is really then thinking about where we're going with all of this and you can
- see that you can build up all these different modules, having different peripheral functions,
- just like the mouse and the printer and so on, that have different functions with respect to
- your motherboard of your computer. In this case, the analogy is the data model, and you
- can then make it do lots of different things. So that's I think where we really want to be going,
- moving really away from actual databases and moving into this more interactive type of zone.
- So that's the software side of things. I just want to close on thinking forward also about the
- experimental world. This is a picture of the new Ray Dolby Centre in Cambridge University which is
- nearly finished, and this is where the physics department and others will be housed. So we'll
- continue to have all the mod cons on the
- experimental validation side, for example, and this is the Rutherford Appleton Laboratory to which
- I'm 50 per cent seconded and it's already got these facilities and others. So there's Diamond,
- the X-ray source; ISIS, the neutron and muon source; and the central laser facility here,
- and in future it will also, I believe, have all these extra things. So there'll be a new
- development in a laser facility. There are already nine new instruments being planned
- for the existing neutron and muon source, and I believe there's a possible option of a new
- neutron source called ISIS-2 over here, maybe, or some other location possibly. That, however,
- reminds me to say that this site, the Rutherford Appleton Laboratory, is particularly large,
- and like many big organisations, they have a plan to go net zero by a certain year. That's a
- really, really massive challenge. Now, I've talked today about a design-to-device pipeline,
- but that's a linear pipeline, and if you really want to get to sustainability
- then you've got to adopt the circular economy. Actually it's more of a butterfly shape. So
- our design-to-device is kind of here, but we've really got to build an ecosystem,
- talking to the people on the county council dealing with recycling, right through to the geeky
- scientists and engineers who are designing and making new materials. We've got to build
- that ecosystem. That's a really different way of working that we've got to adopt.
- So you can tell I'm getting to the end of my lecture. I'm going to get a bit more light
- hearted now. We can call the user support with our Bat signal. That's us in the new facility,
- the lasers are reaching out, and they will call on the user community, which
- has an international reach and we can make our circular economy, and working even more closely
- together across the facilities, we can continue to do Olympic science for a sustainable future.
- Now for the grand finale, I want to thank people and I want to thank people properly, because
- really, this is a celebration of everybody with whom I've worked. So I wanted to do it properly
- rather than just put a slide. So I made you a short video to thank everybody. So here goes.
- Thank you.
- Time for… I mean, I'm
- so inspired. It's fantastic. Thank you so much. It's really special and different. Anyway,
- questions for Jackie. There are microphones. Remember there's an audience out there as well. So
- I'm going to stand in front here so I can be seen.
- Over here on the left.
- Thanks, Jackie. A number of your databases have 100,000 papers or so. How long does it
- take a computer to read 100,000 papers and is it a laptop or is it a supercomputer?
- So there are two different ways of answering that. The time taken to read all the documents is not so
- long. What takes the time is actually the data cleaning process. Once you've mined the data,
- you don't just get what you want, you actually kind of get lots of edge cases, things that
- don't quite work, and so you have to spend a lot of time then actually cleaning the data.
- So I don't want to pretend that ChemDataExtractor is something you can just press the button and out
- pops the database. To give you a real example of it, in practice, a fresh PhD student from, say,
- my group would take probably about two years to get a database that's in published work,
- and obviously, they get quicker because they've learned at that point, but that would be coming
- at it fresh. So we do use supercomputers for sure, and certainly it's a lot faster,
- and for the really big runs we will use a supercomputer, but you could in theory,
- as long as you've got time, you could do it on a regular desktop computer.
- Thanks, Jackie. It was a great talk. You talked about extracting data from the scientific
- literature. I guess within our facilities nationally and globally, there's huge amounts
- of data, and I just wondered what your thoughts are on extracting that data in terms of challenges
- and opportunities, because clearly a lot of that data doesn't end up in scientific publications.
- So I guess it depends on what condition the data are in, or at least what stage the data are at
- in their production. So you've got raw data, right? Which we couldn't realistically mine, because
- you've got to process it first. So that's where you have to reduce your data and analyse your
- data so that it becomes something meaningful that you can relate to the rest of the world.
- So that's the first thing. Once it becomes processed data, of course, then you would be
- in a position where you could maybe publish it or you could not. If that data were available
- to the likes of us, then we could actually take that data, but we would have to make it into some
- sort of internal framework. I think that we would pipe that into the database format but
- there's no reason you can't put a side arrow into that pipeline. Of course, you'll know
- better ethically than me, how ethical is it to take somebody else's data that they didn't
- really write up and then use it without maybe asking them, but after three years, I believe
- it becomes open access anyway. So we get into sort of policy issues and even possible ethical issues.
- So I think we have to think about that. There is something called DataCite, which is the sort of
- framework that actually they've been very active at the Rutherford Appleton Laboratory in making,
- and that actually allows you, if the data are old enough, more than three years, to
- go in and find even the proposal that people wrote to do the experiment, the metadata
- that was used, the log files, if they're electronic, if the users
- wrote them down, and then the actual raw data, and if you really wanted to, after three years,
- you can go in through this DataCite database and you could actually process it yourself,
- but again, we get back to that ethical question. Somebody else did the experiment. Maybe they were
- doing a PhD. They finished. They left. Should you go in and process it yourself and publish
- it? And who's the publisher and who's the author? I mean, it raises some interesting questions.
- If I could actually ask one myself as a computing person, I could see quite a number of tools that
- a computer person could construct from our end of things, which might help. So for example,
- a programming language designed for this sort of domain, does this happen or
- would this be unusual? I know it would be helpful because I
- can think of quite a number of things which would make your life easier.
- So it depends what you mean by that. If you mean programming
- languages, should we be thinking of other things? We program everything in Python. That's partly
- just because that's what a lot of people know, and because we want the consistency across the board,
- but for example, we could think about probabilistic programming languages.
- So the likes of Julia, for example, and that I think could help make things more efficient,
- and certainly - this actually relates to Russell's earlier point - you know, if we're
- using supercomputers, maybe we can make it more efficient so that we could do it even better on
- a desktop ultimately. Everything, as you know, with Moore's Law, seems to keep increasing exponentially.
- I think there's interesting stuff there. Any more?
- Thanks, Jackie. That was marvellous. I'm thinking about the underlying physical mechanisms that
- make materials exceptional so you can identify or point towards an exceptional material. You
- had some in your plots. Can your system give you any insight into what is different about
- the mechanism in those materials that makes them exceptional, or is that still a job for humans?
- So yes, I guess one thing I'd say is I don't think that humans are going to
- become redundant. I think some of the really novel stuff we will never get, because
- we're predicting based on trends. So we would never have predicted the new quantum technology,
- for example. In fact, there's not enough data even now to make inroads in that, but I think
- there are edge cases and you could look for outliers. You could use ChemDataExtractor to
- find all the regular stuff and then say, what's this thing over here, and is it just a totally
- duff piece of data, or is it actually really, really special? At the moment we might look at
- the outliers, but we're sort of pre-programmed a little bit to think that they're outliers
- and therefore they don't count but we could look at that differently for that purpose.
- Over there. Thank you very much.
- Hello. I was wondering how reliant do you think we are on changing trends in how data is actually
- published into the scientific literature? Partly standards of English and language,
- but also aspects like now a lot of papers have far more material in the SI than is actually in
- the paper and I don't know whether you are mining the SI in the same way.
- So by the way, there are quite a few of the publishers in the room. So it's a very relevant
- question. I think there are definitely ways that regulation could help, if you take the historical
- example of crystallographic data. It's been for decades a situation where it's been mandated that
- you can only publish crystallographic information if you include a CIF, the crystallographic
- information file, as part of your submission. That's not the case for almost anything else,
- and so there's a lot of data that are hidden, that actually can be really useful. That's one
- thing to say. So the journals regulate that with the crystallographic information. So you know,
- of course, it's a burden on the publishers, but then potentially if it was regulated,
- people might do it. The chances of anybody doing it voluntarily are, I think, actually quite slim
- despite best intent, and I can speak from personal experience. When I publish a paper, the last thing
- I want to do when I'm looking at the submission, I was like, oh, I've got all these files and now
- I've got to produce a crystallographic information file. I kind of curse a little bit, but I still do
- it because it's mandated, but I might not if it wasn't. So for all the best intent in the world,
- I think we have to find a way to regulate that sort of process. With regards to
- supporting information, you're right, a lot is increasingly going into supporting information
- and the problem with that, from my standpoint as a data extractor, if you like, is that
- it's all PDFs, and the problem with PDFs is they're really hard to read. We've just made
- a PDF data extractor, I don't know if you saw the clip as it was running through at the end, and we
- made that because we can't access most of the supporting information. So we want to be able
- to do so. So PDFDataExtractor is actually a code that will actually go into the front end
- of ChemDataExtractor, so that that PDF extraction tool will be better. It's still not great because
- ChemDataExtractor is of course optimised for mark-up language extraction because it's easier
- to access by far. So we have to think about that, and I'm always very happy to talk to publishers
- to see if we can work together to find a way through, to improve that process for everybody.
Join us for the Clifford Paterson Lecture 2020 given by Professor Jacqui Cole.
Professor Jacqueline Cole was awarded the Clifford Paterson Medal and Lecture 2020 for the development of photo-crystallography and the discovery of novel high-performance nonlinear optical materials and light-harvesting dyes using molecular design rules. After 2 years of delays due to the global pandemic, Professor Cole now has the opportunity to deliver the Prize Lecture.
Professor Cole will describe how one can combine the predictive power of artificial intelligence with data science and algorithms to discover new materials for the energy sector. A ‘design-to-device’ pipeline for materials discovery will be demonstrated. Thereby, large-scale data-mining workflows are fashioned to predict successfully new chemicals that possess a targeted functionality.
The success of such a data-driven materials discovery approach is nonetheless contingent upon having the right data source to mine. It also requires algorithms that suitably encode structure-function relationships into data-mining workflows that progressively short list data toward the prediction of a lead material for experimental validation. The talk shows how suitable data are sourced, algorithms are designed and fed into predictions, and how these predictions are borne out by experiments.
About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.