Data-driven materials discovery | 91TV
Transcript
- Well, thank you very much for that very kind introduction. I'm not sure
- I'll live up to the expectations after that introduction. I'll do my best.
- So today's talk is on data driven materials discovery and I'll begin by explaining the
- challenge of the problem and then show you how we're trying to solve that problem, giving you
- four case studies from the energy sector, and then I'll finish with where we're going next.
- So here is a picture of Alexander Fleming very famously discovering penicillin by chance.
- Now we've come a little bit further than that over the time, but still
- actually we're discovering materials using trial and error in the main,
- and this picture here in the middle is taken from a biomedical website,
- which likened materials discovery to firing scalpels at a dartboard with a moving target.
- Now, wouldn't it be better if there was a systematic way to discover new materials?
- Of course, I feel that there's an opportunity to do so using the recent advances in big data.
- So if you're going to do data driven materials discovery, of course you need data. So let's
- think, what would be the ideal source of data for data driven materials discovery?
- Well, ideally, you would want the entire universe of all possible chemicals, molecules that could
- ever exist and for each of those chemicals, know what their cognate material properties would be,
- and that's because there's an inherent relationship between the structure,
- the molecular structure, and the function of a material. People over the years have used
- empirical means to work out what these structure function relationships are,
- and yet if we could encode those relationships, those patterns about chemical and property space,
- in a way that a computer could understand, then we could link them up to a search
- engine that we could then use to probe this chemical and property space to find patterns
- in the data that then could lead to predictions and ultimately, experimental validation of new
- materials for a given application to suit your needs. So that's the dream, but of course,
- we don't have the whole universe of all chemicals and all of their properties. So what do we do?
- Well, to a first order approximation, we do have that information, albeit in a highly fragmented
- form and that form is the scientific literature. That could be academic papers, could be patents,
- company reports and so on, but in all, there'll be somebody, well, lots of people in this audience,
- who have written a paper on one material and its properties. There'll be another
- person in the audience who's written a paper on another material and its cognate properties,
- but if we could grab all of the information from all of the documents that ever existed
- and put it all together, then the sum is greater than its parts. We could then have,
- to a first order approximation, all of chemical and property space, so that we could predict new
- materials from all of that historical data. So that's the essence of my talk and what
- we've done to do that is written a whole load of software that mines text for the chemical and
- property information and extracts it, and also image information, chemical schematic information
- and chemical reaction information. Here are four software tools, our primary ones,
- and having grabbed all of that information from the input documents that you give it,
- it also automatically compiles it into databases for you. So now you have materials databases
- that you can make for your own needs, i.e. bespoke databases for your given application
- and once you've got that compiled data, we can then essentially drive that into a design to
- device pipeline by then using the data to predict new materials, using the machine learning that
- you see a lot now, where they classify and optimise in data analytics, and that
- will find patterns in the data that lead to your prediction of a new material, and then
- you can use that prediction and go forward with lead materials, with the experimental validation.
- So that's the pipeline. I'm going to focus just for time on ChemDataExtractor, the text mining
- tool, just so you know. So given that, let's just have a look a bit further at how ChemDataExtractor
- works. So that's the text mining tool. So we've got input of scientific literature. So
- it can literally be thousands and thousands, or even millions of documents if you had them,
- and ChemDataExtractor will interrogate the documents; it will always find, as far
- as it can, the chemical molecule, wherever the chemical name or a picture of it appears,
- and it will then also grab the paired quantities of the properties you've asked for. So you ask for
- the properties you need for your particular device, whatever it is you want to make,
- and it will then grab that paired information and put it into a chemical database for you.
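A minimal sketch of that "compile into a database" step, using Python's standard sqlite3 module. The records, names, and DOI here are invented placeholders; the real tool emits structured records (including the source DOI) automatically.

```python
import sqlite3

# Sketch of storing paired chemical/property records in a small relational
# table. All values below are invented for illustration only.
records = [
    ("example dye A", "absorption maximum", 450.0, "nm", "10.0000/example.doi"),
    ("example dye B", "absorption maximum", 610.0, "nm", "10.0000/example.doi"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE properties
                (chemical TEXT, property TEXT, value REAL, unit TEXT, doi TEXT)""")
conn.executemany("INSERT INTO properties VALUES (?, ?, ?, ?, ?)", records)

for row in conn.execute("SELECT chemical, value, unit FROM properties"):
    print(row)
```

Keeping the DOI in each row is what later lets you track a record back to its original paper and author.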
- How does it work under the hood? Well, this is best explained by example.
- It's actually chemistry aware natural language processing for those in the field,
- and what it does is it takes every sentence from every document of all those thousands of
- documents and it does this: "Figure 2 shows the UV-vis absorption spectra of 3A (red)
- and 3B (blue) in acetonitrile." A fairly typical sentence in science,
- and what it does is it takes the sentence and splits it up into its constituent parts. The
- words, the numbers, the punctuation, and so on. It then assigns a grammar to each of those parts. So
- figure is a noun. Two is a cardinal digit or decimal. Spectra is a noun, but it's plural,
- and acetonitrile is what they call a chemical mention. So this is the chemistry aware logic
- coming in. Acetonitrile is probably the solvent in this. You can probably read that as a human,
- but of course the computer has to think harder and use logic to work that out.
- Having assigned a grammar to that sentence, it then turns that sentence into a hierarchical tree.
- So it's a figure and it's figure number two in that paper. It's a figure about a spectrum and the
- type of spectrum is a UV-vis absorption spectrum, not for example, a UV-vis emission spectrum,
- and it's a spectrum of something called 3A and something called 3B and acetonitrile.
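The tokenise-and-tag steps just described can be sketched in a few lines. This is a toy stand-in, not ChemDataExtractor's real machinery: the little lexicon and the tag names are invented for illustration.

```python
import re

# Toy lexicon standing in for a trained, chemistry-aware tagger.
# "CM" marks a chemical mention, as in the talk's acetonitrile example.
TOY_LEXICON = {
    "Figure": "NN", "shows": "VBZ", "the": "DT", "UV-vis": "JJ",
    "absorption": "NN", "spectra": "NNS", "of": "IN", "and": "CC",
    "in": "IN", "acetonitrile": "CM",
}

def tokenize(sentence):
    """Split a sentence into words, numbers, labels and punctuation."""
    return re.findall(r"[A-Za-z][\w-]*|\d+[A-Za-z]?|[^\w\s]", sentence)

def tag(tokens):
    """Assign a part-of-speech (or chemical-mention) tag to each token."""
    tagged = []
    for tok in tokens:
        if tok in TOY_LEXICON:
            tagged.append((tok, TOY_LEXICON[tok]))
        elif re.fullmatch(r"\d+", tok):
            tagged.append((tok, "CD"))       # cardinal digit, e.g. the figure number
        elif re.fullmatch(r"\d+[A-Za-z]", tok):
            tagged.append((tok, "LABEL"))    # compound label like 3A
        else:
            tagged.append((tok, "PUNCT" if not tok[0].isalnum() else "NN"))
    return tagged

sent = "Figure 2 shows the UV-vis absorption spectra of 3A and 3B in acetonitrile"
print(tag(tokenize(sent)))
```

The real tool's grammar then builds the hierarchical tree out of these tagged tokens.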
- Now, very crucially here you have to look at 3A and 3B because we don't know what 3A and
- 3B are yet. So if you put that all out into a database at that point, you would lose all of
- the chemical name information and it would be a totally useless database. So what you have to do
- before you leave the document is resolve what the labels mean, and usually in a scientific document,
- if you track back in the paper which the computer then does, it will usually find
- typically the first instance of 3A, and you'll see just before it a long chemical name followed by
- "(3A)", or maybe 3A will be in bold or something like that. So that if you train
- the computer to search for that, then you can resolve what 3A and 3B are and then put that in,
- and then you can pull out the right chemical information and its cognate property information.
- So that's why it's chemistry aware natural language processing.
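That label-resolution step can be sketched as a simple backwards search for a "long chemical name (3A)" definition pattern. The real logic is more elaborate, and the document text and chemical name below are made up for illustration.

```python
import re

def resolve_labels(document_text, labels):
    """Map compound labels like '3A' to the full chemical name that defines them,
    by finding the first '<name> (3A)'-style definition in the document."""
    resolved = {}
    for label in labels:
        m = re.search(r"([A-Za-z0-9,'\-\(\)\[\] ]+?)\s*\(" + re.escape(label) + r"\)",
                      document_text)
        if m:
            resolved[label] = m.group(1).strip()
    return resolved

doc = ("4-(dicyanomethylene)-2-methyl-6-(4-dimethylaminostyryl)-4H-pyran (3A) "
       "was synthesised first. ... Figure 2 shows the spectra of 3A and 3B.")
print(resolve_labels(doc, ["3A"]))  # maps the label back to the full name
```

Without this resolution, the extracted records would only contain the meaningless labels 3A and 3B, which is why the talk stresses doing it before leaving the document.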
- So there's this technology here and then we're using let's say standard machine learning these
- days for data analytics. The other actually pretty hard thing is once you get to the
- experimental validation, and I'm not going to talk so much about the experimental work today,
- but I do want to give you an overview of it because it often involves complex facilities
- and so I thought I'd give you a glimpse of the sites and facilities that we do experiments at.
- So what we're going to do now is we're going to go on a fly-by of the UK's neutron and muon facility.
- Okay, so there's a glimpse of the experimental world that mostly I won't be able to talk
- about for time reasons,
- but hopefully that gives you some idea. So we've talked now about the challenge. We've talked about
- the technology. So let's see how we apply it. I'm going to show you now four case studies,
- all taken from the energy sector to see if we can discover new materials. I'm going to
- start with solar, the sun. Here's a picture of the sun in Cambodia just for your amusement.
- I want to discuss the idea of how we can apply ChemDataExtractor to discover new
- light absorbing materials for photovoltaics. So irrespective of the type of photovoltaic device,
- all of them need some form of light absorption, right? So I want to think about the underpinning
- problem at the molecular level to do that. So let's look at this graph. So the black jagged
- edge here, that's the solar emission spectrum. So there would be kind of your visible light,
- as a function of wavelength. So that's what you see as a spectrum from the sun,
- and what you want in terms of grabbing all of the photons under the area of that curve,
- that's your goal, is a light absorbing material that ideally would be something like that
- green line that pretty much grabs all of the area under that curve, so all of the photons.
- Now there's no one material that really does that. So that's a problem, but people in the device
- technology often combine say two different types of molecules, say one that absorbs in the blue
- and then one that absorbs in the red, and as a convolution of those two peaks, you will get more
- or less the green, right? That's the concept. So people often do that in the device world.
- So let's set ourselves the problem then, of making a database using ChemDataExtractor that finds the
- underpinning molecular property information, say the wavelength maximum, so see where the
- peaks are of these. We're also going to take, if it's present, the absorption information,
- the intensity here, which is called the extinction coefficient, and we're also going to,
- of course, take the material, the chemical name, as well. So that's the database we're going to build,
- material property one, property two. So we build that with ChemDataExtractor and that at the time
- was just under 10,000 chemical molecules and their corresponding properties. Just all grabbed from
- the scientific literature, and our goal is to get from 10,000 possible light absorbing materials all
- the way down to, say, five lead candidates that we can take forward for experimental validation,
- and the way you do that is to apply this design to device pipeline, like so:
- we have to ask really carefully chosen questions that sequentially filter out from 10,000 all the
- way down to just a few. So the first question we asked, for example, was to remove all things
- containing metals. We wanted organic materials for environmental regulations. That actually made our
- case quite hard. Nonetheless, that's what we did. So that went down to just 3000. That's already a
- big jump and you want quite big jumps early on so you don't end up asking far too many questions.
- Then we ask a question that's relevant to a device, for this particular type of photovoltaic
- technology, we knew that in the device we wanted molecules that contained a carboxylic
- acid group and that's because we knew that the interface between the light absorber and the
- semiconductor that makes the working electrode as a composite was working particularly well when we
- had carboxylic acid groups in our light absorber. So we fixed that filter and then you see an order
- of magnitude reduction from 3000 to 300 more or less, and then we go in: we wrote
- a mathematical algorithm that finds the optimum combination of those possible 300 you've got left,
- to find, if you like, the best combination of blue absorbing and red absorbing, or at least extremes.
- Then you go down to another order of magnitude to about 30 possible light absorbing materials
- and at that point, 30, you can actually go in manually and do things. It's a manageable number,
- and actually what we did then is we performed some electronic structure calculations, so computation,
- on all of those 33 to check what we call the energetic alignment of our light absorbing
- materials within the device. So this is the idea that we can predict all these new light absorbing
- molecules but that's all in isolation. We've got to think about the device. You've got the
- electrolytes. You've got the other electrode. We've got to check that the energetics line up
- so that you get the right driving voltages. So we did that and that brings it down to essentially a
- handful left, in our case, five lead candidates to go forward for experimental validation,
- and here are the five. One good thing about the ChemDataExtractor type of
- technology is because we've mined the data from the scientific literature,
- the people who originally made these materials, we can contact, because we can track back. We always
- keep the DOI as we go through so we can track back and find the author for correspondence,
- and all of these molecules, by the way, were made purely for scientific curiosity, synthetic
- curiosity. So none of the authors had any idea that they might be applied for photovoltaics. I wrote to them,
- or I emailed them actually, and I said, hey, you know, we think your material might be useful for
- photovoltaic application. Do you still have some or could you remake it and send it to us and we'll
- put it into a device and test it for you, we'll do this as a collaboration so everybody wins.
- They all said yes. They all sent materials and we put them into devices, and this is what we got.
- This is a graph of voltage versus current density, for the experts in the room,
- remember this is our lab work, but it's the relative differences that matter.
- The black curve is an important reference. So the black curve actually corresponds to the industry
- standard. Now, the industry standard actually is a metal organic material, right? So remember we
- said we wanted only organic materials, but as an industry standard we'll reference it to the
- metal organic one, and that's one of the best performing ones you can get. So if you can get
- your curve and your testing to get anywhere close to that black line, you're doing well.
- So I'm going to focus on XS6 and 15. They were the original labels in the original publications.
- If you test them just on their own in a device, you'll get the red and the blue curve for those
- two. So voltage is okay but the current density is not so great, because you're not close to this
- black for example, but of course the logic was to put the two together, the blue absorbing and the
- red absorbing, and so when you put them together, you get the mauve line, which actually is pretty
- close to the industrial standard but we've only got organics in our system, right? So we were
- pretty happy about that, actually. It gets 92 per cent of the power efficiency of the industry standard.
- Other people were happy, it got onto the front cover of Advanced Energy Materials Journal,
- and so there's an example of data driven materials discovery in action.
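The sequential filtering funnel from this case study can be sketched as a chain of list filters. The records and the helper predicates below are entirely invented stand-ins; the real pipeline ran over roughly 10,000 mined dye molecules.

```python
# Illustrative sketch of the filtering funnel: organic only, must carry a
# carboxylic acid anchoring group, then pair the bluest and reddest absorbers.
METALS = {"Ru", "Fe", "Zn", "Cu", "Pt"}  # tiny illustrative subset

def contains_metal(record):
    return any(el in METALS for el in record["elements"])

def has_carboxylic_acid(record):
    return "COOH" in record["groups"]

candidates = [
    {"name": "dye-1", "elements": {"C", "H", "N", "O"}, "groups": {"COOH"}, "lambda_max_nm": 450},
    {"name": "dye-2", "elements": {"C", "H", "Ru"}, "groups": {"COOH"}, "lambda_max_nm": 530},
    {"name": "dye-3", "elements": {"C", "H", "N", "O", "S"}, "groups": set(), "lambda_max_nm": 620},
    {"name": "dye-4", "elements": {"C", "H", "N", "O"}, "groups": {"COOH"}, "lambda_max_nm": 650},
]

organic = [r for r in candidates if not contains_metal(r)]       # filter 1
anchored = [r for r in organic if has_carboxylic_acid(r)]        # filter 2
pair = (min(anchored, key=lambda r: r["lambda_max_nm"]),         # bluest
        max(anchored, key=lambda r: r["lambda_max_nm"]))         # reddest
print([r["name"] for r in pair])  # prints ['dye-1', 'dye-4']
```

Each filter is cheap and interpretable, which is why big reductions early on (10,000 to 3000) matter so much.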
- Before I leave this, the solar topic, I just want to touch on one other thing,
- which isn't discovery per se, but I think it's relevant that you can also use ChemDataExtractor,
- the text mining tool, to help with manufacturing, because what you can also do is mine existing
- things that people know about. So why would you do that? For example, you can make nice histograms.
- So here's the open circuit voltage, the current density, the power conversion efficiency of all
- the known materials for two types of photovoltaics, perovskite solar cells and dye
- sensitised solar cells. There's more data on this one because it's a much more mature technology. Nearly
- 200,000 data records on perovskite solar cells. So that's what's known, right? This isn't discovery at this
- point, but it's not just the device properties and the molecular properties you could mine,
- but you could also grab information about the actual manufacturing process. That might be
- quite important to you to know what type of solar simulator was used, the testing device,
- or the active area of the test sample that you used for the solar cell. That type of information
- can be very valuable if you want to optimise the manufacturing process. So there are sort of side
- uses of ChemDataExtractor as well. So that's the first case study. I now want to move to the
- second case study and we're going to look at heat. This is a picture of a volcano erupting
- in Guatemala. I took the photo from a safe point here. You're laughing but wait till you
- see what happens next. When it erupted, I was actually there on the ridge. This was a video
- I took. You don't ever want to be closer than that, by the way. That's just really
- to wake you up and show you some heat. Anyway, I'm not going to do volcanic applications,
- but I am going to do thermoelectrics, which also gives me an excuse to play with Lego.
- I made a little thermoelectric device which I'll pass around in a second, but just to explain then
- how it works for those who are not in the domain area, what you have now here is a case where you
- have a cold surface on top and a hot surface on the bottom, and therefore you have
- a thermal gradient. I'm going to super simplify this - sorry, all physicists in the room.
- So just like in the atmosphere, if you have a thermal gradient, you get the convection currents
- that cause all the weather, and here you have electronics essentially creating a convection
- current, in this case with thermocouples, the P and N semiconductors. So you can create electrical
- currents, and therefore a voltage is driven across, because you've got the thermal gradient
- stimulating it from the top. So then, because the thermocouples are all in series, that
- creates an electrical circuit. So thermoelectric. So I'll just pass these around. I've got two
- because I obviously got particularly happy playing with Lego.
- Please send them back to the front, the people at the back, because otherwise the Rutherford
- Labs will kill me for stealing all their Lego. I promised the visitor centre I'd send it back.
- So that's how a thermoelectric material works, and in equations,
- here is the figure of merit. So that's ZT. That's just to understand it as your proxy,
- high ZT means a high performing thermoelectric material. S, Seebeck coefficient,
- that's just a coefficient, and sigma is your electrical conductivity. You want that high,
- right? It's thermoelectric. The thermal conductivity, kappa, you want low, and that's because
- you want to keep that thermal gradient without interfering with the electronics. You want the
- temperature, T, high.
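Putting those pieces together, the figure of merit described here is ZT = S² σ T / κ. A tiny sketch, with rough textbook-scale numbers for a bismuth-telluride-like material (illustrative, not mined values):

```python
def figure_of_merit(seebeck_V_per_K, sigma_S_per_m, kappa_W_per_mK, T_K):
    """ZT = S^2 * sigma * T / kappa.
    High Seebeck coefficient and electrical conductivity, low thermal
    conductivity, and high temperature all push ZT up."""
    return seebeck_V_per_K**2 * sigma_S_per_m * T_K / kappa_W_per_mK

# e.g. S = 200 uV/K, sigma = 1e5 S/m, kappa = 1.5 W/(m K), at room temperature
print(round(figure_of_merit(200e-6, 1.0e5, 1.5, 300), 2))  # prints 0.8
```

The competing demands (high σ but low κ) are exactly why good thermoelectric materials are hard to find.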
- So if we can then use ChemDataExtractor to mine all of these properties, plus something called
- the power factor, then we get this distribution. We've literally just published this, I think,
- eight days ago. So it's a new database, and just out of interest, you see all these spikes,
- you're probably wondering what they are, right? These are real experimental data.
- It's not an artefact. It's actually really real. What they are is actually rounding of people's
- results. People will quote things to, say, one decimal place. So then you get things spiking at
- 1.0, 1.1, 1.2, 1.3, 1.4, etc., and of course, we present the data faithfully. We could, of course,
- average it and put sort of Gaussians over each of those spikes, because probably they mean it was a
- bit more and a bit less than 1.0, but this is real data, right? So that's why you see spikes,
- and you can do things like look at the highest ZT material, right? People may know that anyway.
- You may say, well, okay, but what else can you do with this? You can forecast with it. So you could
- take the average per year of publication, remember they come from academic papers,
- and you can track the average ZT, the figure of merit, and then you can say, well, roughly,
- if that was a linear trend, by 2052, the average ZT that would be reported in papers would be 1.5,
- for example. If you're thinking, well, should I work on thermoelectrics? Well,
- if this is the year of publication and the number of records that have been occurring
- is going up like this, it's probably quite a good field to go into right now.
- So the point is that it's also good for helping you with forecasting.
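The forecasting idea is just a straight-line fit to the average ZT per publication year, extrapolated forward. The yearly averages below are invented for the sketch, so the extrapolated number is purely illustrative.

```python
# Ordinary least-squares fit of a line y = m*x + b, written out by hand.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

years = [2000, 2005, 2010, 2015, 2020]   # invented yearly averages, for shape only
avg_zt = [0.60, 0.70, 0.78, 0.92, 1.00]

m, b = fit_line(years, avg_zt)
print(round(m * 2052 + b, 2))  # extrapolated average ZT in 2052
```

The same trick applied to the record count per year is what suggests the field is growing.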
- So that's heat. Now let's look at hydrogen. There's been a lot of media interest, of course,
- that people have seen with hydrogen fuel cells for example, as a technology, and
- therefore you need a way of producing hydrogen. People mostly produce it using water
- and they do something called water splitting, which unsurprisingly means you split water into
- its constituent parts, hydrogen and oxygen. It's good because you generate what you need,
- the power, the supply of hydrogen, but you also generate a clean side product, oxygen, which is
- also helpful. So no problem there, and there'll be a catalyst that may help. So this technique
- is therefore called water splitting. This is a picture of me water splitting for no reason.
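The splitting reaction itself is 2 H₂O → 2 H₂ + O₂, and a quick sketch can confirm the atom balance that makes oxygen the clean side product:

```python
# Sanity check of the water-splitting stoichiometry: 2 H2O -> 2 H2 + O2.
def atom_count(species, coeff):
    """Multiply a species' formula (element -> count) by its coefficient."""
    return {el: coeff * n for el, n in species.items()}

def combine(*sides):
    """Sum atom counts over the species on one side of the reaction."""
    total = {}
    for counts in sides:
        for el, n in counts.items():
            total[el] = total.get(el, 0) + n
    return total

lhs = atom_count({"H": 2, "O": 1}, 2)                # 2 H2O
rhs = combine(atom_count({"H": 2}, 2),               # 2 H2
              atom_count({"O": 2}, 1))               # O2
print(lhs == rhs)  # prints True: the equation balances
```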
- Here is now the question, how do we apply ChemDataExtractor to help?
- Well, I mean, there's no point in mining for water, or something like that, but we could mine,
- for example, for all catalysts that had ever been produced for water splitting,
- and unsurprisingly, we are making a database or we've made one, haven't published it yet,
- of all catalysts for water splitting. So that may help. Thinking about optimising things,
- finding the right catalysts for the most efficient process of water splitting,
- but there's another use that we could work with, which is thinking about the hydrogen.
- Once you've made hydrogen, it's a gas, right? So we've got to find a way of storing it.
- Now, what you would really want actually, is to condense it into a liquid form
- so that you could store it in a nice, small, contained little vessel. Now, there's not really
- a good way of storing hydrogen in the liquid form, partly because the liquefaction temperature for
- hydrogen is about 21ish kelvin, that's -253 and a bit degrees centigrade. So that's really cold,
- right? So how do we find a refrigerant that will cool the hydrogen down, put it
- in its liquid phase, and keep it cool, and we're going to say can we find a material like that?
- What we're going to use for this purpose is we're going to try and discover a material
- that's a magnetic refrigerant. So this is the idea that some materials will be heated
- up when you apply a magnetic field to them, and when you take the field away, they will cool down.
- So that can be a way to potentially make a sort of fridge. This is just my little avatar,
- if you like, of magnetic refrigerants. It's not really like a fridge, but to
- give you an idea. So what are the properties that make a magnetocaloric effect material?
- There are these three. Don't worry, it's just equations. We have the relative cooling power,
- RCP, as one of the parameters that govern the properties, along with the change in entropy and the
- change in temperature. So just remember, we have three parameters, if you're not a scientist, and
- what we're going to now do is then do our ChemDataExtractor thing. We're going to mine
- the literature and find all materials that have reports of any or all of those three properties,
- and we'll see what we get. So if we do that, there aren't actually that many, but there's just under
- 3000 materials, and each of them has some or all of the individual properties. TC, by the way,
- is the Curie temperature. That's because we wanted to find permanent magnetism,
- right down at that temperature, and so that's what we got. Actually in that sense,
- well, we got this database, but that didn't give us a brand new material suitable at the
- liquefaction temperature to store hydrogen, but that's okay, because what we can do, having got
- the database, we can at least make a regression model because we've got data now, right,
- with structure and property information, and so we can apply machine learning and make a regression
- model to link them, i.e. make a structure-function relationship for magnetocaloric effects. So that
- means that if we come in with a new material, then we can use this regression model to predict
- the magnetocaloric effect properties. That's okay. So now we need new materials, right? So
- how do we get new materials? Well, we turn to our machine learning world. Super geeky moment coming
- up. We apply something called a conditional deep feature consistent
- variational autoencoder with U-Net and crystal graph convolutional neural nets architecture.
- You'll be glad to know I'm not going to explain what that is. So what you essentially do is, it's
- a generative algorithm that essentially creates new 3D crystal structures, hypothetical ones,
- that could then be fed into this prediction. I won't really go into
- any detail, but if anyone wants to speak to me afterwards, I'll happily explain it to them.
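At its core, though, the sampling step works roughly like this sketch: encode a structure into a latent vector, perturb it with Gaussian noise whose standard deviation controls how different the outputs are, and decode back. The encoder and decoder here are crude placeholders (a real model learns them from thousands of structures), and the seed structure is invented.

```python
import random

random.seed(0)  # reproducible sketch

def encode(structure):
    """Placeholder encoder: flatten fractional coordinates into a latent vector."""
    return [c for atom in structure for c in atom]

def decode(latent, n_atoms):
    """Placeholder decoder: reshape the latent vector back into coordinates."""
    return [tuple(latent[3 * i: 3 * i + 3]) for i in range(n_atoms)]

def generate(structure, std, n_samples):
    """Perturb the latent vector with Gaussian noise; larger std, bigger changes."""
    z = encode(structure)
    samples = []
    for _ in range(n_samples):
        z_new = [v + random.gauss(0.0, std) for v in z]
        samples.append(decode(z_new, len(structure)))
    return samples

seed_structure = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]  # a simple cubic-like motif
candidates = generate(seed_structure, std=0.05, n_samples=3)
print(len(candidates))  # prints 3
```

The energetics screening described in the talk is what then throws away the physically unrealistic outputs.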
- Imagine you have - in fact, the imaging world kind of does this. You've probably seen on the internet,
- you have a picture of a human face and then it morphs into some other human face,
- right? It's a bit freaky sometimes. It's not quite like that, but it's sort of like that. In essence,
- you can - because all a molecular structure is, or at least the way you can
- treat it, is an image, just these three dimensional coordinates, right? Don't worry
- about bond lengths and bond angles and all those things. Just see it as an image, and if you treat
- it like that where you can make essentially a data distribution that's representative of a molecular
- structure or crystal structure in this case, and then you feed in something that's close
- to your target material. So for example, with magnetocaloric effect materials, we knew that say,
- Heusler type, so a type of structure like Heusler or perovskites or cubic materials tend to make
- good magnetocaloric effect materials. So we could say, well, let's take one of those that
- is known and then just sort of send it through this kind of what they call the latent space,
- and then it goes through this sea of sort of representative data distribution, and then it will
- sample it subject to a standard deviation; if it's a big standard deviation,
- it will allow the input structure to be perturbed a lot. If it's a small one, it won't be perturbed
- so much, and then it will output a whole load of new hypothetically generated crystal structures,
- most of which, by the way, might be totally unrealistic, because it doesn't know anything
- about what's realistic as a bond geometry. But that's okay, because a lot of those will then
- get screened out because you check whether the energetics make any sense, and if they don't, you just throw
- them away. So we've now got this way of making hypothetically generated crystal structures
- and once we've checked them for energetics, they could possibly be real and they're different from
- the input you put in by enough that you're happy that that's different and new. So then with these
- new materials, you do some further screening to check a few things. You check it's ferromagnetic,
- for example. You could check its phase stability, that sort of thing, and again, you have this kind of
- inverse pyramid pipeline that goes from, say, 1000 down to 30, and then with those lead materials,
- you can then apply that regression model that I talked about, that we got from the experimental
- data with ChemDataExtractor, because that links structure and function. So we can take now our new
- hypothetical molecules - sorry, crystal structures - and predict the magnetocaloric effect properties,
- and once you've done that you've got a prediction of a new material, and then we have to seek out a
- way to synthesise it, and if we can do that, we can experimentally validate
- that material as a magnetocaloric effect material, i.e. a magnetic refrigerant. So here are the results. The
- blue are the known ones from the literature that we grabbed, and these are two of the properties,
- the relative cooling power on the vertical axis and the Curie temperature on the horizontal.
- So that's the things that were known in blue. The red ones are predictions.
- Brand new materials. One of these is this one, and if you know your periodic table, you'll
- know that Pm happens to be the one lanthanide element that's radioactive. That's annoying,
- isn't it? So ChemDataExtractor doesn't know anything about safety, by the way.
- We can't do that but we could substitute out the lanthanide, maybe, right? We were quite keen
- because that was quite a nice temperature range for the liquefaction temperature of
- hydrogen otherwise, but you could also go down here and pick something slightly different.
- So we're now thinking about making, or trying to make, that, and here I
- am at the neutron and muon facility, waiting with my sample stage ready for it to go on,
- but we haven't quite synthesised it yet. So that's really all, I have to say, about hydrogen,
- where we're up to on our discovery and from this project, but also just so you know, with other
- predictions, we could think about room temperature type magnetic refrigerants, because they could be
- useful for a different type of fridge. We've got predictions there which we're also trying to make.
- So that's number three out of four. Now we turn to batteries,
- and this is a picture of a mountain where you've had an electrode in the
- sky and an electrode on the ground and you've had an electrical discharge. Obviously,
- that's just lightning. This mountain has been actually struck by lightning and it's on fire.
- It's a few years ago now. Does anybody know where it is? Is it obvious to people?
- That's interesting. I'm having lots of shaking heads from scientists as well. Here it is
- bigger. Any further thoughts from the scientists, particularly the neutron and synchrotron people?
- Yes, I think I heard it's Grenoble. That's the mountain here, that's on fire,
- and this is the European neutron facility, and this is the European Synchrotron Radiation
- Facility next door to it. Anyway, that was just an excuse to show you the European facilities.
- So nature's form of electrical discharge, but let's now think about batteries.
- So we're going to do - you're probably getting quite bored of hearing this by now - we're using our
- ChemDataExtractor text mining tool to mine the literature, in case you haven't got the message
- by now, and in this case we're going to mine materials and the device information. So we're
- going to mine the capacity, the conductivity, the voltage, the energy, the coulombic efficiency,
- and we get nice distribution graphs like that. Again, you see this weird rounding effect with
- the unitary things coming out, again because it's real experimental data. Now in this case,
- we grabbed all the material and device property information, but what the computer couldn't do
- or what ChemDataExtractor couldn't do, was distinguish between whether the material was
- an anode, a cathode, or an electrolyte. It got lots of materials and all the device properties,
- but it couldn't classify which ones they were because that was never really
- programmed into ChemDataExtractor. So we then thought, well, how are we going to do that?
- So what we did is, having got that database of nearly 300,000-ish data records from
- ChemDataExtractor, we went back and we sort of reverse engineered it. So we had a scientific
- corpus that we fed in to that ChemDataExtractor process, and we took from it the papers that had a
- successful hit from ChemDataExtractor, i.e. things that contained battery information, to make our
- database, and then we siphoned off that corpus. Now we've got a very battery rich corpus of data
- and we then fed it into a different pipeline, and let me explain what the pipeline does. So this
- is our battery rich data corpus and we then are going to make something called a data model. So
- this is a different way of mining data, because this approach actually comes
- from a technology that's come out of Google AI. In fact, it's often the basis behind
- Google's search engine, and what you do is you essentially take those sentences that
- I talked about. You remember how ChemDataExtractor works, where you have these
- sentences and then you pull them apart and all these things, but instead of doing that,
- what you do is you make vector representations of them, right? So imagine a phrase like the cat
- sat on the mat. I'm going to write down the cat sat on the mat two times identically. Now you'll
- know that the subject and object are the cat and the mat, right? So that has a high correlation
- between the cat and the mat. So you can build a bipartite graph. So you can link those two with
- a high correlation, and so you can start to get contextual information from that sentence
- by essentially relating the information that you know is highly correlated together.
- The computer can do that by making this vector representation and having made the
- vector representation, it will then feed it into a neural net, a very deep learning neural network,
- which then subject to all sorts of weightings, will train all of that with all the sentences that
- you put in. So in some sense, it's different from ChemDataExtractor in all sorts of ways,
- I suppose, but in one particular way, which is that instead of extracting the chemical and
- property information, you keep all of the sentence information, all of the corpus, all of the words,
- everything, and you put it into a model, and I use that word very carefully. So now what we've made,
- we've trained a network to build a model of all of that corpus with all of the context of the
- information from the sentences, and so it's now what they call a language model, right?
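The "words in similar contexts get similar vectors" idea behind these representations can be illustrated with a toy co-occurrence sketch. This is a deliberate simplification of the deep network described in the talk, not the actual Google technology; the tiny corpus and function names are invented for illustration:

```python
def cooccurrence_vectors(sentences, window=2):
    """Build toy word vectors: each word's vector counts how often
    every vocabulary word appears within `window` positions of it."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[words[j]]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors(["the cat sat on the mat",
                             "the dog sat on the rug"])
# "cat" and "dog" occur in near-identical contexts, so their vectors
# come out far more similar than, say, "cat" and "on".
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["on"]))
```

A transformer language model learns far richer, contextual versions of these vectors, but the underlying intuition, that context determines meaning, is the same.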
- I've simplified a bit, but that's basically what you're doing. Now that's a really important
- distinction between that and a database. A database is a very static thing which you look
- at and you can do analytics on, but now we have an interactive model. That's why I'm calling it a
- model and the way to think about it in my head, at least, is to think about it as your computer. So
- you have a motherboard, and that's the core kind of operations of your computer,
- and you have lots of peripherals, like your mouse, your printer, your keyboard and so on.
- So these are peripherals and you can put things together, and the mouse has a different function.
- So I can plug that in and make my motherboard do different things, and it's kind of like that
- as a sort of concept and framework. So we've got this model. This data model, this language model,
- and I'm now going to ask, I want to ask questions of that model because it's interactive,
- I can do that. So I want to say, of course, is it an anode, is it a cathode or is it an electrolyte
- from the material standpoint? So what I do is I make up a whole load of questions with
- designed answers, like: what is the anode? What is the cathode? What is the electrolyte? And so
- on. I can make more complex sentences, of course, and so you make another database just of questions
- and answers and you make it for the domain of interest, the materials domain of interest, the
- battery domain, and then you can also mix that up with lots of generic question-and-answer pairs,
- just about normal English language, and then you put it all together and then you can ask the model
- questions. So in this case, we want to know what the anode, cathode, and electrolytes are as a
- classification. So we asked those questions about that and then we grab the answers, and
- then we can then classify whether the materials that went in the original database are an anode,
- cathode, or electrolyte. Hopefully that makes some sense. This is what you get. You get what
- you'd expect from what is the anode, what is the cathode and what is the electrolyte.
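The question-answering classification step described above can be sketched as follows, with a crude pattern-matcher standing in for the trained language model. The materials, sentence, and function names are all invented for illustration; a real system would query the model fine-tuned on battery question-and-answer pairs:

```python
import re

def toy_qa(question, context):
    """Crude stand-in for a trained question-answering model: it looks
    for '<answer> as the <role>' in the text and returns the answer
    with a made-up confidence score. A real system would query the
    trained language model instead."""
    role = question.lower().rstrip("?").split()[-1]   # e.g. "anode"
    match = re.search(r"(\w[\w\-/()]*) as the " + role, context,
                      re.IGNORECASE)
    if match:
        return {"answer": match.group(1), "score": 0.9}
    return {"answer": None, "score": 0.0}

# Invented abstract-style sentence, purely for illustration.
context = ("The cell used graphite as the anode, LiCoO2 as the cathode, "
           "and LiPF6 as the electrolyte.")

for q in ["What is the anode?", "What is the cathode?",
          "What is the electrolyte?"]:
    r = toy_qa(q, context)
    print(q, "->", r["answer"], "(score", r["score"], ")")
```

Because the real model is probabilistic, each answer comes with a genuine confidence score rather than the hard-coded one here, and that score is what lets you decide how far to trust each record in the resulting database.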
- So that's the classification, and we can go one step further than that with this data model
- concept and this is really a paradigm shift I think between that and databases. We can not
- only have a question and answering module, but we could also make a natural language processing
- module, a text mining module. So what if we actually blended ChemDataExtractor, the chemistry-aware
- natural language processing, or text mining, tool with this new technology from Google AI and make
- BatteryDataExtractor? You can see I haven't got much imagination with my names, by the way.
- So this then, just published like eight days ago or five days ago or something, I get lost, sorry.
- This is the first property specific text mining tool for autogenerating materials databases.
- We've done it on batteries and as I say, it blends two different technologies to hopefully get the
- best out of both worlds. It's also good, I think, because it's got a probabilistic nature about it,
- because it's a model, you can actually get confidence scores. You can actually
- get probabilities for each data record that you obtain, to know how likely the answer to the
- question I ask is to be right. Now that's really important because, you know, how does somebody know whether
- they should trust my database that I've got that was autogenerated with ChemDataExtractor,
- right? I mean, we do experimental validation, obviously, but you know, to get that trust,
- we need to think about real probabilistic confidence scores for all of the data records.
- So I think that's a good thing. The fourth thing there is that we have now
- a new way to interrelate material and property text during data extraction.
- So imagine you've got, this is a really big problem by the way, imagine you've got a
- scientific paper and you want to mine the chemical and property information, but let's say the
- chemical name is on page one of the document and the property information is say on page five, and
- in between on pages two, three and four, we've got a whole lot of other chemical names and a whole
- lot of other property type things. How do you know that the chemical name on page one and the
- property on page five are related, not something in between was related to either of them, right?
- So what you can do because you have a model, not a database anymore, you can ask, if you're careful,
- the right sequence of two questions that, by definition, interrelate them by essentially using
- a sort of analogue version of Bayesian statistics. It's a conditional probability, so you'll see how
- it works. Let me give you an example. So let's imagine I want to find a material with a voltage
- of two volts, just for argument's sake. So I'm going to ask two questions in this sequence,
- what is the value of the property name? Property name is voltage. What's the value of the voltage?
- And you'll get an answer. You're just dealing with the property at the moment. So I said the
- answer was two volts. So it can find that because that's easy. It's just looking for one area in
- the document, but then we say which material - we want that link between material property - which
- material has a property name, the voltage, of the answer to the previous question? So there's your
- conditional. That's your Bayesian bit, and the answer to the previous question was two volts. So
- the question becomes: which material has a voltage of two volts? And that puts the material and the
- property into relationship. So it's a sort of geeky, very geeky way of doing Bayesian statistics.
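The two-question conditional trick can be sketched the same way, again with a toy pattern-matcher standing in for the trained model; the sentence, values, and function name are invented for illustration:

```python
import re

def ask(question, context):
    """Toy stand-in for querying the trained language model."""
    # Question 1: find the property value on its own.
    if question == "What is the value of the voltage?":
        m = re.search(r"(\d+(?:\.\d+)?\s*V)\b", context)
        return m.group(1) if m else None
    # Question 2: find the material tied to that specific value.
    m = re.match(r"Which material has a voltage of (.+)\?", question)
    if m:
        volts = re.escape(m.group(1))
        hit = re.search(r"(\w[\w\-/()]*)\s+(?:shows|has|exhibits)[^.]*?"
                        + volts, context)
        return hit.group(1) if hit else None
    return None

# Made-up sentence for illustration (not a real measurement).
context = "LiFePO4 shows a stable plateau at 2 V in this cell."

# Step 1: ask about the property alone.
value = ask("What is the value of the voltage?", context)

# Step 2: fold that answer into the second question, which is what
# ties the material to the property (the 'conditional' step).
material = ask(f"Which material has a voltage of {value}?", context)
print(value, "->", material)
```

The key point is the chaining: the answer to the first, property-only question is substituted into the second question, so the material returned is conditioned on the property value, which is what interrelates text that may sit pages apart.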
- So that was published recently. It got on the front cover of Chemical Science,
- which is the Royal Society of Chemistry's flagship journal.
- So leading on, this is really then thinking about where we're going with all of this and you can
- see that you can build up all these different modules, having different peripheral functions,
- just like the mouse and the printer and so on, that have different functions with respect to
- your motherboard of your computer. In this case, the analogy is the data model, and you
- can then make it do lots of different things. So that's I think where we really want to be going,
- moving really away from actual databases and moving into this more interactive type of zone.
- So that's the software side of things. I just want to close on thinking forward also about the
- experimental world. This is a picture of the new Ray Dolby Centre in Cambridge University which is
- nearly finished, and this is where the physics department and others will be housed. So we'll
- continue to have all the mod cons on the
- experimental validation side, for example, and this is the Rutherford Appleton Laboratory to which
- I'm 50 per cent seconded and it's already got these facilities and others. So there's Diamond,
- the X-ray source; ISIS, the neutron and muon source; and the central laser facility here,
- and in future it will also, I believe, have all these extra things. So there'll be a new
- development in a laser facility. There are already nine new instruments being planned
- for the existing neutron and muon source, and I believe there's a possible option of a new
- neutron source called ISIS-2 over here, maybe, or some other location possibly. That, however,
- reminds me to say that this site, the Rutherford Appleton Laboratory, is particularly large,
- and like many big organisations, they have a plan to go net zero by a certain year. That's a
- really, really massive challenge. Now, I've talked today about a design-to-device pipeline,
- but that's a linear pipeline, and if you really want to get to sustainability
- then you've got to adopt the circular economy. Actually it's more of a butterfly shape. So
- our design-to-device is kind of here, but we've really got to build an ecosystem,
- talking to the people on the county council dealing with recycling, right through to the geeky
- scientists and engineers who are designing and making new materials. We've got to build
- that ecosystem. That's a really different way of working that we've got to adopt.
- So you can tell I'm getting to the end of my lecture. I'm going to get a bit more light
- hearted now. We can call the user support with our Bat signal. That's us in the new facility,
- the lasers are reaching out, and they will call on the user community, which
- has an international reach and we can make our circular economy, and working even more closely
- together across the facilities, we can continue to do Olympic science for a sustainable future.
- Now for the grand finale, I want to thank people and I want to thank people properly, because
- really, this is a celebration of everybody with whom I've worked. So I wanted to do it properly
- rather than just put a slide. So I made you a short video to thank everybody. So here goes.
- Thank you.
- Time for… I mean, I'm
- so inspired. It's fantastic. Thank you so much. It's really special and different. Anyway,
- questions for Jackie. There are microphones. Remember there's an audience out there as well. So
- I'm going to stand in front here so I can be seen.
- Over here on the left.
- Thanks, Jackie. A number of your databases have 100,000 papers or so. How long does it
- take a computer to read 100,000 papers and is it a laptop or is it a supercomputer?
- So there are two different ways of answering that. The time taken to read all the documents is not so
- long. What takes the time is actually the data cleaning process. Once you've mined the data,
- you don't just get what you want, you actually kind of get lots of edge cases, things that
- don't quite work, and so you have to spend a lot of time then actually cleaning the data.
- So I don't want to pretend that ChemDataExtractor is something you can just press the button and out
- pops the database. To give you a real example of it, in practice, a fresh PhD student from, say,
- my group would take probably about two years to get a database that's in published work,
- and obviously, they get quicker because they've learned at that point, but that would be coming
- at it fresh. So we do use supercomputers for sure, and certainly it's a lot faster,
- and for the really big runs we will use a supercomputer, but you could in theory,
- as long as you've got time, you could do it on a regular desktop computer.
- Thanks, Jackie. It was a great talk. You talked about extracting data from the scientific
- literature. I guess within our facilities nationally and globally, there's huge amounts
- of data, and I just wondered what your thoughts are on extracting that data in terms of challenges
- and opportunities, because clearly a lot of that data doesn't end up in scientific publications.
- So I guess it depends on what condition the data are in, or at least what stage the data are at
- in their production. So you've got raw data, right? Which we couldn't realistically mine, because
- you've got to process it first. So that's where you have to reduce your data and analyse your
- data so that it becomes something meaningful that you can relate to the rest of the world.
- So that's the first thing. Once it becomes processed data, of course, then you would be
- in a position where you could maybe publish it or you could not. If that data were available
- to the likes of us, then we could actually take that data, but we would have to make it into some
- sort of internal framework. I think that we would pipe that into the database format but
- there's no reason you can't put a side arrow into that pipeline. Of course, you'll know
- better ethically than me, how ethical is it to take somebody else's data that they didn't
- really write up and then use it without maybe asking them, but after three years, I believe
- it becomes open access anyway. So we get into sort of policy issues and even possible ethical issues.
- So I think we have to think about that. There is something called DataCite, which is the sort of
- framework that actually they've been very active at the Rutherford Appleton Laboratory in making,
- and that actually allows you, if the data are old enough, more than three years, to
- go in and find even the proposal that people wrote to do the experiment, the metadata
- that was used, the log files, if they're electronic, if the users
- wrote them down, and then the actual raw data, and if you really wanted to, after three years,
- you can go in through this DataCite database and you could actually process it yourself,
- but again, we get back to that ethical question. Somebody else did the experiment. Maybe they were
- doing a PhD. They finished. They left. Should you go in and process it yourself and publish
- it? And who's the publisher and who's the author? I mean, it raises some interesting questions.
- If I could actually ask one myself as a computing person, I could see quite a number of tools that
- a computer person could construct from our end of things, which might help. So for example,
- a programming language designed for this sort of domain, does this happen or
- would this be unusual? I know it would be helpful because I
- can think of quite a number of things which would make your life easier.
- So it depends what you mean by that. If you mean programming
- languages, should we be thinking of other things? We program everything in Python. That's partly
- just because that's what a lot of people know, and because we want the consistency across the board,
- but for example, we could think about probabilistic programming languages.
- So the likes of Julia, for example, and that I think could help make things more efficient,
- and certainly - this actually relates to Russell's earlier point - you know, if we're
- using supercomputers, maybe we can make it more efficient so that we could do it even better on
- a desktop ultimately. Everything, as you know, with Moore's Law, seems to keep increasing exponentially.
- I think there's interesting stuff there. Any more?
- Thanks, Jackie. That was marvellous. I'm thinking about the underlying physical mechanisms that
- make materials exceptional so you can identify or point towards an exceptional material. You
- had some in your plots. Can your system give you any insight into what is different about
- the mechanism in those materials that makes them exceptional, or is that still a job for humans?
- So yes, I guess one thing I'd say is I don't think that humans are going to
- become redundant. I think some of the really novel stuff we will never get, because
- we're predicting based on trends. So we would never have predicted the new quantum technology,
- for example. In fact, there's not enough data even now to make inroads in that, but I think
- there are edge cases and you could look for outliers. You could use ChemDataExtractor to
- find all the regular stuff and then say, what's this thing over here, and is it just a totally
- duff piece of data, or is it actually really, really special? At the moment we might look at
- the outliers, but we're sort of pre-programmed a little bit to think that they're outliers
- and therefore they don't count but we could look at that differently for that purpose.
- Over there. Thank you very much.
- Hello. I was wondering how reliant do you think we are on changing trends in how data is actually
- published into the scientific literature? Partly standards of English and language,
- but also aspects like now a lot of papers have far more material in the SI than is actually in
- the paper and I don't know whether you are mining the SI in the same way.
- So by the way, there are quite a few of the publishers in the room. So it's a very relevant
- question. I think there are definitely ways that regulation could help, if you take the historical
- example of crystallographic data. It's been for decades a situation where it's been mandated that
- you can only publish crystallographic information if you include a CIF, the crystallographic
- information file, as part of your submission. That's not the case for almost anything else,
- and so there's a lot of data that are hidden, that actually can be really useful. That's one
- thing to say. So the journals regulate that with the crystallographic information. So you know,
- of course, it's a burden on the publishers, but then potentially if it was regulated,
- people might do it. The chances of anybody doing it voluntarily are, I think, actually quite slim
- despite best intent, and I can speak from personal experience. When I publish a paper, the last thing
- I want to do when I'm looking at the submission, I was like, oh, I've got all these files and now
- I've got to produce a crystallographic information file. I kind of curse a little bit, but I still do
- it because it's mandated, but I might not if it wasn't. So for all the best intent in the world,
- I think we have to find a way to regulate that sort of process. With regards to
- supporting information, you're right, a lot is increasingly going into supporting information
- and the problem with that, from my standpoint as a data extractor, if you like, is that
- it's all PDFs, and the problem with PDFs is they're really hard to read. We've just made
- a PDF data extractor, I don't know if you saw the clip as it was running through at the end, and we
- made that because we can't access most of the supporting information. So we want to be able
- to do so. So PDFDataExtractor is actually a code that will actually go into the front end
- of ChemDataExtractor, so that that PDF extraction tool will be better. It's still not great because
- ChemDataExtractor is of course optimised for mark-up language extraction because it's easier
- to access by far. So we have to think about that, and I'm always very happy to talk to publishers
- to see if we can work together to find a way through, to improve that process for everybody.
Join us for the Clifford Paterson Lecture 2020 given by Professor Jacqui Cole.
Professor Jacqueline Cole was awarded the Clifford Paterson Medal and Lecture 2020 for the development of photo-crystallography and the discovery of novel high-performance nonlinear optical materials and light-harvesting dyes using molecular design rules. After 2 years of delays due to the global pandemic, Professor Cole now has the opportunity to deliver the Prize Lecture.
Professor Cole will describe how one can combine the predictive power of artificial intelligence with data science and algorithms to discover new materials for the energy sector. A ‘design-to-device’ pipeline for materials discovery will be demonstrated. Thereby, large-scale data-mining workflows are fashioned to predict successfully new chemicals that possess a targeted functionality.
The success of such a data-driven materials discovery approach is nonetheless contingent upon having the right data source to mine. It also requires algorithms that suitably encode structure-function relationships into data-mining workflows that progressively short list data toward the prediction of a lead material for experimental validation. The talk shows how suitable data are sourced, algorithms are designed and fed into predictions, and how these predictions are borne out by experiments.
About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.