
From deep network mysteries to physics | 91TV

57 mins watch | 28 November 2023

Transcript

  • Professor Stephane Mallat: So let me first thank you, Marta and the Royal Society for this very
  • impressive award. Andrew Blake, Guillermo Sapiro, who are also at the origin of it. To all of you,
  • my family and friends, for being here to hear this lecture. So what I'm going to speak about
  • is indeed AI, which has been, as you've all seen revolutionised by these deep neural networks which
  • are invading our lives. The latest avatar being ChatGPT. Strangely enough, although the algorithms
  • are very well mastered, we don't really understand why they work so well. That will be the topic of
  • the lecture, to try to approach this mystery and to show that there are very deep connections with
  • physics, and that will be a central theme of the lecture. So what kinds of problems are these
  • machines attacking? What is AI? There are essentially two types of problems. On one hand,
  • you have data. So what is data? For example, it can be an audio recording,
  • it can be a text with millions of letters, it can be an image with again millions of pixels, video,
  • physics, molecules. What you would like to do is to try to understand the properties of this data,
  • and defining the properties of this data amounts to defining a probability distribution, which
  • describes the relation between the different components of the data. So learning such a model is
  • one of the key elements of this domain; the applications include data generation. In fact, the
  • images that you see here are not photographs, they have been generated by neural networks. If you look at these videos
  • over there, same thing, they are not natural videos, they've been generated. In fact, they
  • were generated from a few sentences; for example, a teddy bear running in New York led to this video.
  • There are plenty of applications, for example recovering better-quality medical images,
  • suppressing noise in data, and so on. Now the second kind of problems that you have in this
  • domain are classification or regression, which means from a data you'd like to get an answer,
  • that I'll call Y, to a question. For example image classification, you have images here,
  • each image corresponds to an instance of X and the value Y, here would be for example,
  • this is a car or a grid, this is a mushroom. The other image corresponds to cherries,
  • or a dog, or a Madagascar cat. This is a problem of classification. A problem of regression would
  • rather associate to data X a real value Y. For example, compute the molecule quantum energy,
  • sorry, the quantum energy of this molecule. That kind of task can be achieved nowadays with these
  • neural networks. Now the big surprise is that previously these tasks were considered incredibly
  • difficult to solve. So why? If you look at this problem, which consists of associating a value Y to data X,
  • so to compute this function of X, in order to compute it in learning or data science,
  • you have to begin with examples. So you have an example Xi, and the value of your function for
  • this Xi which is Yi. And now the problem is the following, you are given a new image, so a new X,
  • and you would like to know the class, for example the value Y. So the first idea that comes to mind
  • is to take the X which is here and look at all its neighbours. Since you know the value for all
  • the training neighbours, you can compute the value of the function by averaging the values in the
  • neighbourhood. That works in general quite well, but not in this situation, because X has many
  • variables. In other words, it lives in a very high dimensional space, and you have no chance of
  • having a neighbour from the training data close to X. To understand this difficulty, which is raised
  • by the big dimension of the space, consider the interval [0, 1] in dimension D. So for example,
  • in dimension 2, that gives the square over there on the right. Suppose that you want to ensure that
  • you will always have an example at a distance of one over ten. How many examples do you
  • need for your training? The number of examples will be ten to the power of D; in dimension 2
  • it will be 100. Now if D is 80, ten to the power 80 is more than the number of atoms in the
  • universe. That means it's impossible. Now this curse of dimensionality means that you have an
  • explosion of possibilities. In order to learn, you need somewhere to reduce the dimensionality of the
  • problem. In other words, you need to realise that within X there is not so much information which
  • is really crucial to find the classification and learn the task. That's what is very difficult to
  • discover. Now what is a neuron? A neuron is a very simple computational unit. You take the inputs,
  • the different values of your data, and you are going to weight them with different weights W1 to Wk;
  • you are, like that, going to make a vote of the data with the different weights, which is this
  • linear combination. Now if this value is above a threshold B, then your neuron is going to send the
  • value out. If it's below zero, the neuron is going to output zero; that is called a rectifier.
  • So that's this very simple unit. Now this simple unit, you put it within a network. That
  • means that the input data X is going to be fed into a full layer of neurons, which is over there,
  • which themselves are going to be fed into the next layer of neurons. The number of layers like that
  • can grow up to several hundred, and at the output you are going to get an estimation of what you think
  • is the right answer to the question. Now the field began to explode the day people began to have more
  • structured neural networks, and in particular this architecture that was introduced by Yann
  • LeCun, called convolutional networks. It is the basic architecture which is used on data such
  • as images. In fact it has applications to almost all fields. I'm going to show here what it does
  • for an image. In the input that you have here on the left, over there, your image is the data. The
  • weights of each of the neurons are going to look at a small part of the image, which is shown as
  • the small square over there. Now because you don't know where the object is, the weights are going to
  • be identical all over the image. This means that the weights, if you put them within a big matrix,
  • form what is called a convolution operator. This is going to be transmitted, and then all the coefficients which
  • are below zero, are set to zero by the rectifier which is over here. Then again you reapply a set
  • of neurons. That means that you again do a transformation with a convolution operator
  • W2 and again a rectification. So you cascade this transformation up to the output. Now comes the
  • learning phase. So how do you learn? What you want is that at the output, the values obtained
  • are equal to the true values that you would like to have, which are provided in the training data.
  • So to do that you are going to measure the error. The error is going to be defined by what is called
  • a loss, which depends upon the parameters of the network. What are the parameters? The parameters
  • are the weights of the different neurons. Now the number of weights can go to millions, billions,
  • even trillions in the latest neural network. It's absolutely huge. What you are going to do
  • is to optimise these weights so that the network gets you the right answer. That is the learning
  • phase. It tries to minimise the error. So how can you do it? Imagine that the error, the
  • loss which is here, is a simple convex function like that, so you begin with a value which is here,
  • over there. The algorithm is called gradient descent; it's going to move the parameters
  • progressively in the direction of the derivative, which is given by this equation. Now if you do so,
  • it's like a ball which is going to roll; it's going to roll to the minimum loss, and you are
  • going to find the best weights which achieve the minimum error. The problem is that in a network,
  • that's more what the loss landscape looks like: you have plenty of what are called local minima. So a
  • priori, your ball is going to be trapped somewhere here and not go to the right position. If you
  • begin from here, it's going to be trapped here. So the question is how come the neural networks
  • learn so well despite the non-convexity? Even more impressive is the fact that these
  • networks get you impressive results for a very wide range of fields: image recognition, audio,
  • speech recognition, scientific computations. These networks can predict the evolution of differential
  • equations. Medical diagnostics, fault detection, even generation of images, music, physical data,
  • generation of text, programming, doing mathematical proofs more recently with ChatGPT.
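All the ingredients described so far, the weighted votes with a rectifier, the layers, the loss, and gradient descent, can be combined into a minimal sketch. This is purely an illustrative toy (a tiny two-layer network trained on synthetic data), not anything from the lecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: class 1 if x1 + x2 > 0, a task a small network learns easily.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# One hidden layer of rectifier ("ReLU") neurons, then a single output neuron.
W1 = rng.normal(size=(2, 16)) * 0.5
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)) * 0.5
b2 = np.zeros(1)

lr = 0.5
for step in range(1000):
    # Forward pass: weighted votes, rectifier, then a final weighted vote.
    h = np.maximum(X @ W1 + b1, 0.0)       # rectifier: negative values become zero
    out = h @ W2 + b2
    p = 1.0 / (1.0 + np.exp(-out[:, 0]))   # squash to a probability
    loss = np.mean((p - y) ** 2)           # the error to minimise

    # Backward pass: move each weight against the derivative of the loss.
    g_out = (2.0 / len(y)) * (p - y) * p * (1.0 - p)
    gW2 = h.T @ g_out[:, None]
    gb2 = g_out.sum(keepdims=True)
    g_h = g_out[:, None] @ W2.T
    g_h[h <= 0] = 0.0                      # no gradient through inactive rectifiers
    gW1 = X.T @ g_h
    gb1 = g_h.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

acc = np.mean((p > 0.5) == (y > 0.5))
print(f"final loss {loss:.3f}, training accuracy {acc:.2%}")
```

On this easy, linearly separable task the loss drops steadily and the training accuracy ends up high; the non-convexity discussed above only bites on much harder problems.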
  • The first question is how come this is possible? You have this curse of dimensionality. That
  • means that these networks are able to find a way around this curse of dimensionality. The second
  • surprise is that the architectures of all these networks, which solve all these kinds of tasks,
  • are quite similar. That means that all these problems share some very strong properties,
  • so that they can be solved with the same kind of algorithm. These will be two key questions I'll
  • try to address. Now you can begin to look at your network and look at the weights. What is observed
  • if you look at a network for image classification is that the weights, at the beginning, look
  • like the weights that I'm showing at the bottom. This was measured from an actual neural network,
  • and they look like small oscillating waves; that's just on the first layer, in other words the W1.
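For illustration, a small oscillating wave of the kind seen on the first layer, a Gabor-like wavelet, is easy to construct directly. This is a generic textbook construction, not the actual filters measured from a network:

```python
import numpy as np

def gabor_filter(size=16, wavelength=4.0, angle=0.0, sigma=3.0):
    """A small oscillating wave under a Gaussian envelope (Gabor-like wavelet)."""
    half = size // 2
    yy, xx = np.mgrid[-half:half, -half:half]
    # Rotate coordinates so the oscillation runs along `angle`.
    u = xx * np.cos(angle) + yy * np.sin(angle)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    wave = np.cos(2 * np.pi * u / wavelength)
    g = envelope * wave
    return g - g.mean()  # zero mean: the filter responds to variations, not brightness

filt = gabor_filter()
print(filt.shape)
```

Varying the angle and wavelength gives the family of oriented, multi-scale filters seen both in trained first layers and in V1 simple cells.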
  • The other layers are much more complicated to look at because the weights look random; I'll come
  • back to them. So why are these kinds of wavelets observed? Now the interesting thing is that if
  • you look back at neurophysiological models, these date back to 1960. If you look at the visual
  • system within the brain, in the back of the brain, you have this first visual area,
  • which is called V1, and Hubel and Wiesel discovered that within this domain you observe filters which
  • are called simple cells, which have very similar responses, which are shown here. This comes from
  • neurophysiological publications, very similar to what you observe on the first layer of a
  • neural network. Then when you move towards V2, V4, IT, things get much more nonlinear,
  • much more complicated. People have studied that, and in particular there is a much more
  • recent result from the team of DiCarlo in 2018. What they observe is that if you compare a neural
  • network and you compare what is done by the visual brain, there are some very strong correspondence.
  • In both of them, as I mentioned in the first layer, you observe these wavelets which are shown
  • here, and that corresponds to this zone. What they observed also is that the next layers have a type
  • of response which can predict the population response of the neurons in the next areas,
  • V4 and IT. So there seem to be some very strong similarities, and the next question is of
  • course why. How come a neural network, which deeply has not much to do with biology,
  • suddenly gives you the same answer as what seems to appear in the brain, and why are these
  • wavelets coming in? So wavelets, these wavelets have been studied since the 1980s. In particular
  • we understand mathematically quite well their properties, why they're useful. Wavelets are used
  • to separate phenomena across scales. If you have an image like the X which is here on the left,
  • this is the image, you can split it into an image at a larger scale, where you remove details,
  • and a set of details that you see here, which correspond to the wavelet coefficients.
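This split can be sketched with the simplest wavelet, the Haar wavelet: averages of neighbouring pixels give the larger-scale image, and three sets of differences give the details. A toy one-level transform (the lecture's figures use smoother wavelets):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8))  # stand-in for an image

# One-level Haar transform: averages give the coarser image, differences the details.
a = x[0::2, 0::2]; b = x[0::2, 1::2]; c = x[1::2, 0::2]; d = x[1::2, 1::2]
coarse   = (a + b + c + d) / 2.0   # larger-scale image (details removed)
detail_h = (a - b + c - d) / 2.0   # horizontal variations
detail_v = (a + b - c - d) / 2.0   # vertical variations
detail_d = (a - b - c + d) / 2.0   # diagonal variations

# The details are exactly what is needed to go back to the finer scale.
ra = (coarse + detail_h + detail_v + detail_d) / 2.0
rb = (coarse - detail_h + detail_v - detail_d) / 2.0
rc = (coarse + detail_h - detail_v - detail_d) / 2.0
rd = (coarse - detail_h - detail_v + detail_d) / 2.0
recon = np.empty_like(x)
recon[0::2, 0::2] = ra; recon[0::2, 1::2] = rb
recon[1::2, 0::2] = rc; recon[1::2, 1::2] = rd
print("reconstruction error:", np.abs(recon - x).max())
```

The reconstruction step shows that the details are exactly what was erased: the coarse image plus the details recover the fine image perfectly.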
  • Essentially, they give you information about the edges, where the information varies very quickly
  • in the image. Then if you take this image and you split it again, you again see an image at a
  • lower resolution, and the details that have been erased when you go from this small image to the
  • bigger one. So you can, like that, decompose the information by separating the different
  • phenomena that appear at the different scales, up to the last one. Now, as was mentioned,
  • this has been used for image compression, because if you look at the number of non-zero coefficients
  • of the wavelets which are here, there are very few. These essentially correspond to the edges.
  • Here we are not interested in data compression. Why would that kind of wavelet be of any use,
  • if what you are really interested in is analysing the information which is within the data? To get a
  • key to try to understand that, I'm going to move towards physics and look at the kind of
  • thing that is being done in physics. So in the talk I'm going to move from the problem, which
  • is to find models of data; I'm going to specialise that to the case of physics, because that will
  • allow us to try to understand a little bit more the mathematical properties behind it,
  • and we're going to move towards classification of images. As I said, there are two problems we'll
  • need to solve. 1) we'll need to understand why we can reduce the dimensionality of the problem,
  • the complexity of the problem. 2) we will need to understand why we can make the problem more
  • or less convex so that the optimisation converges. There will be two key ideas. The first one: you
  • need to separate information at different scales, which appears in these architectures. The most
  • difficult thing, we'll see, is to compute or define the interactions across scales,
  • and we'll see why this is fundamental in physics and why this transports to many other fields. Okay,
  • so let me begin from the physics point of view. The field of statistical physics is about
  • taking the fundamental properties at the microscopic scale, the interactions between particles, and
  • trying to understand how from these you can infer properties of the world at the macroscopic scale,
  • properties of materials and so on. Now X corresponds to a physical field; you can think of it as an image.
  • The energy of the field in fact defines the probability of observing the field in
  • this state. In other words, the probability distribution P is an exponential of (minus) the energy,
  • which is sometimes also called a Gibbs energy. So in physics, what you want is, given this energy,
  • understand the properties of the field at all scales. Now what is known pretty well in
  • physics is the energy of systems which are not too complicated, like gases, or like the
  • ferromagnetism that you see here, which essentially corresponds to the problem of measuring the
  • magnetism of a material from the properties of the spins, that kind of thing. These are good
  • physical models. Where there are no such good physical models, and in fact no physical models at
  • all, is whenever you have a physical phenomenon with geometrical structures, for example
  • turbulence. In 1942, Kolmogorov raised this problem: can we define a statistical model of
  • turbulence? Up to now this has not been derived from the fundamental equations. Or the cosmic web, the aggregation of mass in the cosmos. How
  • can we describe the energy of such a system? So one way to do it is to try to do it from data. The
  • second question that I'll ask is: by doing that, can I better understand the principles which are
  • governing these machines, in particular neural networks? I'll be showing that indeed, on the way,
  • you get very good hints about what's happening in the problem. Now, why is physics very difficult?
  • Physics is very difficult because it's also a very high dimensional problem. You have phenomena
  • which occur at the scale of ten to the power -20 metres, which is the scale of elementary particles,
  • up to phenomena which happen at the cosmological scale, at the scale of ten to the power 20 metres. Now how does
  • physics deal with this problem? When you deal with a cosmological problem, you don't try to look at
  • the properties of each atom that is involved in the cosmos. What you do is scale separation.
  • In other words, you have one domain of physics which only deals with particles, with atoms, and
  • their nucleus. You have another domain of physics which specialises at the next scale, at properties
  • of materials, which is material physics. Or you have biology that will study properties of DNA.
  • So each time you try to separate phenomena at different scales. Now, of course, scales
  • interact, because the properties of atoms are going to influence the properties of materials, which
  • are much bigger. So the idea, which is what is always done in physics, is to study each phenomenon
  • at each scale and try to understand the kind of interactions that happen across scales. Now why is
  • that a good strategy? Suppose that you have a set of particles which are these points over there on
  • the plane. Think of these particles as pixels in an image, or as agents in a social
  • network; everybody interacts. Now what are the strongest interactions? The strongest interaction
  • between yourself, let's say, who is in the middle in red, and the nearby particles? It will
  • maybe be with your family, the closest people. Then for the ones which are a bit further away,
  • instead of trying to look at the interaction of each of them, you can regroup them and look
  • at the equivalent field influence on the central one. The ones which are even much further away,
  • you may think, for example, that the life of someone in Russia has little influence in your
  • life if you pick him randomly, probably. If now you think of all the Russians, and if there is a
  • tension between Russia and Ukraine, which are two groups, that's going to influence your life. What
  • that means, that means that, yes, you can regroup the phenomena in different scale, but then you
  • need to understand all the interaction between the scale. So why have you simplified the problem? You
  • went from the particle to log d groups. So you are killing the curse of dimensionality. Now you still
  • have a complicated problem, which is understand the interaction between all this scale. How to
  • model the interaction across the scale? That's where we'll see neural network coming in. So what
  • is the strategy in physics to do that? That's an old problem that has been studied, in particular,
  • in the beautiful work of Wilson and Kadanoff; Wilson got the Nobel Prize for that. It is
  • called the renormalisation group. The idea is to try to look at the evolution of phenomena,
  • when you move from one scale to the next. Don't worry, at one point we'll come back to neural
  • network. So how do you do that? Here is an image, cosmos at a fine scale and you approximate it at
  • the next scale over there. What you would like is to understand how to come back, and
  • we've seen how to do it: if you can compute the wavelet coefficients which are shown here, the
  • high frequency variations, you can go back to the original one. Now what do we want to do?
  • What we want is to relate the energies; we want to relate the probabilities, the probability at the
  • fine scale, to the probability at the coarser scale, which is here. You have the Bayes formula,
  • which tells you that this is going to be given by the probability of the wavelet coefficients,
  • given the low frequencies. The discovery of Wilson, and of many physicists around him, was that
  • this conditional probability is much simpler than the probability of the field, and you can see it:
  • if you look at the wavelet coefficients, they look like noise, much less structured,
  • not very much correlated. This is much easier to understand than the structure of
  • these filaments. If you look at the same equation but you take the logarithm, you get an equation
  • on the energies. What that says is that yes, it's very complicated to compute an energy,
  • but if you compute the increments, the interaction terms, this is going to be much simpler. That is
  • the approach that, as we will see, will naturally lead us to a neural network architecture. So that's
  • the work that was done with Tanguy Marchand and Giulio Biroli at the Ecole Normale Superieure:
  • you can take now a very complicated probability distribution with a very complicated energy over
  • there. The idea is to say, I'm first going to look at it at a very coarse scale, very few
  • coefficients, and progressively I'm going to add details. When you add these details, you compute
  • these conditional probability distributions, you compute these interaction energies. We'll see that
  • these appear within the different layers of a neural network on these kinds of problems. Now why is
  • that also important when you do the optimisation? Because the original energy which is here is going
  • to be very complicated. There are going to be a lot of local minima. If you try to get it directly,
  • you are going to be trapped immediately and you'll never converge to the best solution. If you look
  • at the simpler interaction energies, one very important property is that, for a large class of
  • problems, they are convex. So if you try to learn each of these, you have a much easier problem
  • than learning the total. So that's the strategy that is going to be followed. So now we need to
  • learn these, and to learn them, that means you need to build
  • an approximation with parameters. That's where the neural networks are going to come in. So what I'm
  • doing here is instead of showing a neural network, I'm showing how progressively it's going to appear
  • from these first principles coming from physics. So here is an image on the left, these are the
  • wavelet coefficients that are obtained with the different wavelets at the next scale and the next
  • scale. What we want, in order to build a model of this, is to understand the relation between the
  • different scales, the relation between the wavelet coefficients over there and the coarser image,
  • which is itself represented by all the wavelet coefficients at a different scale. Now you can see
  • that they are very related: you can see the shape of the boat appears at all scales. So obviously
  • these are very related, one relative to the other. The big difficulty, and that has been
  • blocking a lot of research in math and physics, is that if you try a naive approach, which is just to
  • try to correlate these coefficients at different scales, you are going to get zero, because these
  • coefficients oscillate with different signs. When you make a correlation, it disappears. Now if you
  • follow a neural network strategy, then you put in an activation function, a nonlinearity which is
  • going to kill the sign. That's what we're going to do here. I'm going to use an absolute value here
  • because mathematically it's easier to analyse. Then of course, the correlation between the
  • different scale is going to be non-zero. Okay, now your correlation may have very long range
  • interactions. You want to build a model with as few parameters as possible. What we are going to
  • do is reapply the same strategy: again apply a wavelet transform. That means that from here
  • I'm reapplying a transformation, and that, with wavelets, begins to look much more like
  • a neural network. Now if I want to understand physics, I want to understand the interactions
  • across all the scales, all the orientations that appear at a given depth, which is here. How to
  • do that? You have to realise that if you can capture this interaction, you can capture the physics,
  • whether it's gravitation, electromagnetism and so on. Everything is within the nature of this
  • interaction. Now there are techniques that have been developed, and if you look, as I
  • mentioned at the beginning, in neural networks, the coefficients have a tendency to look random.
  • So one strange strategy, which is the work that has been done by PhD students at the Ecole Normale
  • Superieure, Edward Lempereur, Gaspar Rochette and Florentin Guth, is the following. You want to
  • measure the interactions between all the coefficients that correspond to the different scales, so
  • you do random combinations. In other words, you take these coefficients and now you introduce
  • neurons with random weights. So this is a random matrix; you multiply by it. That means you are
  • going to do a random linear combination of the coefficients, and then you apply your rectifier.
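The sign-cancellation problem, and how the rectifier or absolute value fixes it, can be illustrated with two oscillating signals at different frequencies that share the same envelope, a toy stand-in for wavelet coefficients at two scales:

```python
import numpy as np

t = np.linspace(0, 10, 4000)
envelope = np.exp(-((t - 5.0) ** 2) / 4.0)   # shared large-scale structure

# Two "wavelet coefficient" signals at different scales: same envelope,
# but oscillations at different frequencies with changing signs.
fine   = envelope * np.cos(16 * t)
coarse = envelope * np.cos(8 * t)

raw_corr = np.corrcoef(fine, coarse)[0, 1]                   # signs cancel: near zero
abs_corr = np.corrcoef(np.abs(fine), np.abs(coarse))[0, 1]   # nonlinearity reveals the link
print(f"raw correlation {raw_corr:+.3f}, after absolute value {abs_corr:+.3f}")
```

The direct correlation is essentially zero even though the two signals obviously share structure; after the absolute value, the shared envelope makes the correlation strongly positive.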
  • These are the kinds of models that appeared in the field showing that neural networks seem to
  • have random coefficients. Then you compute your interaction energy with the parameter vector.
  • So the idea is: once you have all the coefficients that have been computed at the different scales,
  • the interactions that happen across the coefficients are going to be carried by these random
  • weights. Then you compute the energy by learning the parameters of the network. How do you learn
  • them? With your gradient descent, the optimisation that is done in a neural network, together with
  • a standard maximum likelihood approach. Now if you analyse the mathematics behind it, you realise that
  • doing this random projection and non-linearity is like doing a Fourier transform in high dimension.
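This analogy can be made concrete with the classical random Fourier features construction (given here as a generic illustration, not the lecture's exact model): projecting onto random directions and applying a sinusoidal nonlinearity approximates a Gaussian kernel in high dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_features = 5, 20000

# Random projections W and random phases b define features z(x) = sqrt(2/D) cos(Wx + b).
W = rng.normal(size=(n_features, dim))
b = rng.uniform(0, 2 * np.pi, size=n_features)

def features(x):
    return np.sqrt(2.0 / n_features) * np.cos(W @ x + b)

x = rng.normal(size=dim)
y = x + 0.3 * rng.normal(size=dim)   # a nearby point

approx = features(x) @ features(y)               # inner product of random features
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)      # Gaussian kernel exp(-|x-y|^2 / 2)
print(f"kernel {exact:.4f}, random-feature estimate {approx:.4f}")
```

The inner product of the random, nonlinearly transformed projections converges to the kernel value as the number of features grows, which is the sense in which the random layer acts like a high-dimensional Fourier representation.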
  • That means that essentially what you have is interactions between different scales, and you
  • just represent the low frequency properties of these interactions, the regular part of these
  • interactions. So here is an example. These are types of physical phenomena that are not described
  • by any standard energy model. On the left you have the gravitational alignment obtained in the cosmic
  • webs at very large scale. In the middle you have a turbulence coming from an electromagnetic field,
  • and on the right you have a fluid turbulence. So for each of these systems, viewed as an equilibrium
  • system, you would like to compute the energy. To do so we just apply what I just described.
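Sampling such a model means drawing fields from the Gibbs distribution p(x) proportional to exp(-E(x)). A toy one-dimensional sketch, with a double-well energy and a Metropolis sampler rather than the actual high-dimensional scheme used for these fields:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # Toy double-well energy: minima at x = -1 and x = +1, at temperature 0.1.
    return (x**2 - 1.0) ** 2 / 0.1

# Metropolis sampling from p(x) ~ exp(-E(x)):
# propose a random move, accept it with probability min(1, exp(-dE)).
x = 0.0
samples = []
for _ in range(20000):
    prop = x + rng.normal(scale=0.5)
    if rng.random() < np.exp(min(0.0, energy(x) - energy(prop))):
        x = prop
    samples.append(x)

samples = np.array(samples[5000:])  # discard burn-in
print(f"mean |x| of samples: {np.mean(np.abs(samples)):.2f} (wells at +/-1)")
```

The samples concentrate around the energy minima, exactly as the generated cosmological and turbulence fields concentrate around the typical configurations of the learned energy.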
  • So we built this kind of neural network which separates information across scales, and then
  • we computed the parameters of the energy. Then you can sample the models, and what you see at
  • the bottom was generated by the computer, from the energy calculated by the neural network. You can
  • see that indeed you are reproducing fields which are very close to the original ones. You can check
  • that by checking the statistics of the fields, indicating that indeed you can learn physical
  • models of very complex physical energies, including turbulence. Okay, so what does all this say? It says
  • that essentially, when you have a very complicated phenomenon which lives at many scales, one of the
  • important things is to separate the scales and then try to measure the interactions. Now, is this
  • really what is done by a standard neural network that is being trained for doing classification?
  • That's what I'm going to finish on. I'm going to show that you find the same kind of principle
  • within these networks. So if you want to do image classification: as input you have an image, and
  • at the output you would like to get the class of the image. As I said, these networks learn
  • by taking the weights and updating the weights so that at the end you have an error which is
  • as small as possible. How do you initialise the weights when you don't know anything? You just put
  • random weights. That corresponds to what is called Gaussian white noise, totally independent weights.
  • Then you take your algorithm, which is doing a gradient descent, and you progressively update
  • the weight in order to minimise the error that is obtained by the classification. The question is
  • what has been learned, and what kind of function can you approximate? You know from the curse
  • of dimensionality that you cannot approximate just anything. You can only approximate problems which
  • have the appropriate structure. Okay, so from the previous analysis, what is very tempting to
  • do is essentially exactly the same thing as what we did on the physics side. That means you
  • take your image and you apply the wavelet filters, so you're going to get the first scale of the
  • wavelet transform. Then you apply an operator that is going to compute the interactions across the
  • different scales. Then you repeat, and you compute the second interaction operator, and so on. Now
  • the question is: how do you compute the interaction operator? About five or seven years ago,
  • I made a bet with Yann LeCun that someone would, within the next five years, be able to compute
  • these interaction operators without learning. So we tried for five years, and we failed, we failed.
  • So how do you try to compute these operators? You try to use some prior information about physics.
  • Physics is invariant under rotations and so on; you have all kinds of properties that you try to
  • put in, so that you can encode the information across scales. The best results that we got are the
  • state of the art, on the right here. What is shown below is that with the prior, the percentage of
  • error that we obtain is essentially four to five times bigger. So obviously you need to learn; the
  • problem is too complicated. And this was to classify images: at the top is CIFAR-10, which has
  • smaller images; ImageNet has much, much bigger, more complicated images. Okay, so the next stage is for us to say,
  • I lost: I had to pay for a three-star Michelin restaurant for Yann LeCun, which is totally unfair
  • given that he probably doesn't need it. At least we needed to get a research question out of the
  • loss. So the next stage is to say: okay, we lost, so let's try to learn these operators. And that
  • was done by Florentin Guth and John Zarka. So they maintained the architecture, you separate scales,
  • but now you learn the interactions, which are these operators. Once you learn the interactions,
  • you go back to the performance of a standard neural network. So that means that separating
  • scales and just learning the interactions is enough. Then obviously the question is: what
  • has been learned within these operators? Again, if you think in physical terms,
  • these interactions are the interactions between the different scales, so they describe the physics;
  • in this case these interactions describe the different properties of the different objects that
  • you want to separate. So to try to look at what was learned, what Florentin Guth, Brice Menard,
  • and in collaboration with Gaspar Rochette, did is to look at the evolution of the weights. You
  • begin from the weights in your neural network, which are totally random: Gaussian, random. Now
  • you let them progressively evolve and learn, and what do you observe? When you have a matrix of
  • Gaussian white noise, the spectrum, that is, the eigenvalues of the covariance matrix, is flat. As
  • the learning moves on, you see the spectrum of the coefficients change. In other words,
  • the weights are no longer independent; they become very correlated, and you begin to
  • see the evolution of the spectrum. However, the surprise was that the weights remain somewhat
  • random. In other words, they still have, to a first approximation, a Gaussian distribution.
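This observation can be reproduced in a toy setting: for Gaussian white-noise weights the eigenvalues of the covariance are flat, while coloured (correlated but still Gaussian) weights give a strongly spread-out spectrum. An illustrative sketch, not the lecture's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 20000  # weight-vector dimension, number of neurons

# White initialisation: i.i.d. Gaussian weights -> flat covariance spectrum.
white = rng.normal(size=(n, d))
eig_white = np.linalg.eigvalsh(white.T @ white / n)

# "Learned" weights modelled as coloured Gaussians: still Gaussian, but correlated.
colour = np.diag(1.0 / (1.0 + np.arange(d)))
coloured = white @ colour
eig_col = np.linalg.eigvalsh(coloured.T @ coloured / n)

print("white spectrum spread:   ", eig_white.max() / eig_white.min())
print("coloured spectrum spread:", eig_col.max() / eig_col.min())
```

For the white matrix the eigenvalues all sit near one, while colouring the same Gaussian noise produces the decaying spectrum that training is observed to create.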
  • So what does that mean? That means that in such a model you begin with your data X, you separate the
  • different scale with your wavelet transform, and then you compute the interaction between scale.
  • In this model, the interaction between scale is computed by introducing these coefficients,
  • which are correlated, and if you look at the correlation matrix, you realise that, in fact,
  • what it does, it reduces the dimension. It goes from data that lives in high dimension,
  • and it reduces this dimension, and by reducing this dimension, essentially it eliminates all
  • the information which is not useful, and it selects the information which is useful for the
  • classification. Then you go to the next scale, you again separate scales, and then you go back
  • with your… These are the random coefficients, and as I said the random coefficients are essentially
  • equivalent to a Fourier transform, which is over there. Then you again separate the scales and you
  • iterate. So we call that a rainbow network, because you have your noise, which has a different
  • colour at each layer, illustrated here a bit like a rainbow. So within this mathematical
  • world, then you can do the analysis. You can look at the class of output functions. You can
  • see that they belong to a certain space and they are essentially influenced by the covariance. The
  • question is, does it really work? We've made some hypotheses; is that going to work on real data? So
  • that's the test that was implemented by Florentin Guth and Gaspar Rochette with Brice Menard at
  • Ecole Normale Superieure. So you now take your neural network, you separate the scales and you
  • implement all the scale interaction operators with just Gaussian weights with a covariance. So that
  • means the following thing. You are first going to take a task, learn a network, compute at each
  • layer the covariance. Now you want to see whether the model works. How do you test it? You create a
  • new network with totally random weights but with exactly the same covariance, and you see whether
  • they have the same performance. The performance that you need to reach is 7.8 per cent error on
  • CIFAR-10, which is of the order of eight. You check whether, when you apply the model and you
  • create plenty of new networks, they have the same performance. The performance in this case is a bit
  • higher, 11 per cent, for these kinds of images. The problem is that when you go to phenomena which are
  • more complex, the simple model that I gave, which is to have Gaussian weights, begins to be worse,
  • worse, and worse. That's why I said all these are mainly mysteries. Each of us has models, but they
  • are limited. There are conjectures that the reason why you want to go from ten layers here,
  • or 18 layers, to 300 layers, is that it simplifies the learning much more and Gaussianity comes
  • back, but these remain mostly open questions. So I would like to finish on trying to take a
  • bit of distance from all this. As I said, there is this basic mystery. These problems are incredibly
  • complicated. These neural networks are not only able to avoid the curse of dimensionality, but
  • they are able to solve an incredible variety of problems with the same kind of architecture. So one
  • answer to this problem, as I said, if you look at physics, is that the key to avoid this complexity
  • is to separate scales. However, once you've separated the scales, the big, big difficulty
  • is to encode the interaction between scales, and encoding the interaction between scales
  • amounts to discovering quantum physics, it amounts to discovering gravitation. So this is not easy, but
  • by taking that strategy, you reduce the problem from one which is infeasible to
  • one which is feasible. Now, what we've been showing is that in the case of turbulence, for example,
  • these kinds of models, given the priors that we have in physics, are sufficient to learn new
  • kinds of physical models such as the ones I've been describing. As we see the evolution of AI, we see
  • that the impact on physics is getting bigger and bigger. For example, you now have neural networks
  • which are able to predict the weather over the next three days, and which are better than computing the
  • solution by running a partial differential equation, namely the Navier-Stokes equations,
  • because there is a lot of uncertainty in the data, which the network seems to be able to
  • take into account better at short range, just three days. Beyond that, it doesn't work so well. Now what is
  • interesting in this field is that, as you've seen, the examples that I've been showing are quite
  • simple compared to the examples that are nowadays shown, with the creation of incredibly complex images
  • with, as I mentioned, ChatGPT. What's happening is that the field right now is moving incredibly fast
  • because it's in an, let's say, empirical phase. We have algorithms which are being developed by
  • extremely smart scientists and engineers, and the performance of these algorithms is growing very
  • quickly with more data, more computational power. The results are incredibly impressive. As I said,
  • speech, protein folding, large language models, and mathematics moves very, very, very slowly
  • in comparison. We are still trying to understand the properties of turbulence, trying to understand
  • the properties of filaments in the cosmic web, whereas these networks are able to solve proofs and so on.
  • So at one point there is a question you can raise: why try to understand? It's worth
  • asking this question, because after all, you can develop a system and check it statistically. If
  • you think, for example, of autonomous cars, they are now running in Phoenix and in San Francisco
  • because statistically the number of accidents is very small and it works. Although we are very far
  • from understanding any of the mathematics which leads to robust results, or not always so robust,
  • because there was recently an accident in San Francisco. So why try to understand? From a
  • practical point of view, I think there are two main reasons. One is robustness. It's always somewhat
  • dangerous to build an engineering system where you don't understand why it is stable
  • and why it is robust. If you think of building a bridge, you can begin by doing that. The Romans built
  • bridges without knowing the basics of mechanics, but if you want to build much bigger bridges,
  • very stable ones, then you begin to need to understand the maths and the mechanics. The other is efficiency:
  • these systems are incredibly energy costly, data costly, and from that point of view, understanding
  • is important. I think there is another reason, for myself at least as important: it's a
  • very beautiful problem. It's an amazing problem. We now have machines which are able to solve
  • problems that range from physics, to language, to music generation, speech, protein folding, all
  • kinds of problems. That means that there is some kind of structure in the world which is common,
  • because these machines are all able to solve it. Discovering this structure, which is about the structure
  • of information and the structure of knowledge, is a very, very beautiful problem. We are not
  • close to solving it. I have tried to give some elements, at least to approach it, but I think it's
  • a very beautiful problem for anybody interested in doing maths in this domain. Thanks very much.
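The test described in the lecture, in which each learned layer is replaced by fresh Gaussian weights with the same covariance, can be sketched for a single layer as follows. This is a toy illustration with stand-in "trained" weights and a hypothetical `sample_like` helper, not the experiment's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_like(W, rng):
    """Draw a new Gaussian weight matrix whose rows share the empirical
    covariance of the rows of W (cov = W^T W / n_out)."""
    n_out, n_in = W.shape
    cov = W.T @ W / n_out
    # Factor the covariance (eigh handles possibly rank-deficient matrices).
    eigval, eigvec = np.linalg.eigh(cov)
    root = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0.0, None)))
    return rng.standard_normal((n_out, n_in)) @ root.T

# A hypothetical "trained" layer: anisotropic weights standing in for the
# covariance one would actually measure in a trained network.
W_trained = rng.standard_normal((2048, 16)) * np.linspace(0.1, 1.0, 16)

W_resampled = sample_like(W_trained, rng)

# The resampled weights are fresh random Gaussians, yet reproduce the
# covariance of the trained layer up to sampling fluctuations.
cov_old = W_trained.T @ W_trained / 2048
cov_new = W_resampled.T @ W_resampled / 2048
print(np.linalg.norm(cov_new - cov_old) / np.linalg.norm(cov_old))
```

In the experiment described above, layers of an actual network would be resampled this way and the resulting networks compared on classification accuracy.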
  • F1: Thank you. Thank you for the excellent lecture and especially mentioning robustness at the end,
  • my personal favourite! So I would now like to open the floor to questions. Do we have any questions?
  • There is Andy here, and Shafia I think there was a question there at the back, okay. Andy?
  • Andy: Thank you for a fantastic lecture, you've covered many topics. I would love to hear your
  • thoughts a bit about the interpretation of parameters in these models. That came to
  • mind because I was thinking of the images, the way you were showing how the models interpreted
  • those 2D images of a ship and the parameters at various points. It strikes me that the
  • really substantial models you're just talking about at the end there are just much bigger
  • with trillions of parameters. So it seems like that's a problem of scale as well,
  • on small models we can visualise what the parameters mean, on the gigantic ones it's much,
  • much harder. So I wonder if you've got any thoughts on how we might approach that problem.
  • Professor Stephane Mallat: Yes, this is an extremely difficult problem. When you have a
  • physical problem, or a problem with three parameters, then you can associate with each parameter an
  • intuitive meaning. When you begin to have a model with 100 parameters, a thousand, a million
  • parameters, the model by itself is very complicated. Now if you go back to the physics case,
  • at the end all the parameters are used to learn the energy, and then you can relate the property
  • of the energy to the physics. So, for example, that's what we are doing on active matter. Active
  • matter consists of particles which have their own free will and can move by themselves, and by looking at
  • the energy you can begin to interpret some of the physical phenomena. So that means that you
  • go back to an abstraction level, which is still high, because you are looking
  • at the physical potential, you are not looking at patterns. I don't think that interpretability
  • on such complex problems can be boiled down to say, oh, there is this type, or this type,
  • or this type of pattern. The interpretability will probably be at a mathematical level, on constructs
  • such as, in the case of physics, potentials. It's a very good question. How do you
  • do it for problems such as classification? This is in some sense why myself, I moved back to physics
  • because in the case of physics, you understand the problem. You have a machine you don't understand,
  • you can try to relate the two. If you want to classify cats and dogs, deep down we don't know
  • what an image of a cat or an image of a dog is. Then you have a machine that you don't understand,
  • so you have a machine you don't understand solving a problem you don't understand,
  • and it's more difficult. So hopefully we'll move to that problem. It's a beautiful problem of
  • interpretability, but I don't think it's going to be solved by easy things such as patterns,
  • or the ideas we had in computer vision in the 1990s or 2000s.
  • F1: Thank you. Right. Next question.
  • M1: Hi, thanks for the talk. In terms of neural networks, so you've talked about it from a
  • reductive point of view. So you're saying that the network is learning in terms of wavelets, and you're
  • building that up in different hierarchies. When you look at generative neural networks,
  • you're getting some very, very complex behaviour out of them. So you showed us before the idea
  • of a bear running through the streets. There's a lot of high level concepts involved in that. So,
  • for instance, if you're putting into a generative neural network the concept of
  • the Pope in a puffer jacket, can you go backwards through that and split it up and show which part
  • of the network is associated with Pope, and which part is related to the puffer jacket?
  • Professor Stephane Mallat: Okay, I didn't quite understand the end, but so from what I understand…
  • M1: Can you do the opposite and probe the network
  • backwards? So the concepts within it are highlighted?
  • Professor Stephane Mallat: In the case of a generative network, to generate images and so on.
  • In the case that I know well, the case of images, if you look at the architecture,
  • usually people use what is called a U-Net. And what U-Nets are doing is precisely separating scales,
  • there you go. At one scale, the next scale, the next scale, the next scale.
  • Then you have these interaction elements that come in horizontally and they reconstruct,
  • so it's very much present. In fact the first networks that were doing generation of images,
  • without doing these multiscale properties, were only working for small images. So you
  • do see that appear. Now whether these are wavelets inside or not, as I said,
  • you can only see it on the first layer; inside the other layers, it's very complicated if you don't do
  • the factorisation. For the synthesis part there are many techniques. We would need to
  • discuss that, because depending upon the way the synthesis is done, if it's score diffusions,
  • then you use these networks at each layer, and so at each time, at each time you
  • do see [unclear word 0:52:58:8] structures. Yes, that was for the case of images that we studied.
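The scale-separating shape of a U-Net mentioned in this answer can be sketched in a few lines. This is a toy illustration on a 1-D signal, not any particular published architecture: the encoder repeatedly moves to a coarser scale, and the decoder comes back up, re-injecting each finer scale through a horizontal skip connection:

```python
import numpy as np

def downsample(x):
    """Halve the resolution: average adjacent samples (move to a coarser scale)."""
    return 0.5 * (x[0::2] + x[1::2])

def upsample(x):
    """Double the resolution by repetition (move back to the finer scale)."""
    return np.repeat(x, 2)

def toy_unet(x, n_scales=3):
    """U-Net-style pass on a 1-D signal: go down in scale, then come back up,
    re-injecting each finer scale through a horizontal skip connection."""
    skips = []
    for _ in range(n_scales):          # encoder: separate the scales
        skips.append(x)                # keep the fine-scale information
        x = downsample(x)
    for _ in range(n_scales):          # decoder: reconstruct scale by scale
        x = upsample(x) + skips.pop()  # horizontal interaction across scales
    return x

signal = np.sin(np.linspace(0, 2 * np.pi, 64))
out = toy_unet(signal)
print(out.shape)  # (64,): same resolution as the input
```

A real U-Net applies learned convolutions at every scale; the skeleton above only shows why the architecture is, structurally, a scale separation.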
  • F1: Thank you.
  • M2: Yes, thank you for a really stimulating talk. I think it's really interesting to do a
  • comparison between modern systems and the human brain. I think one of the important comparisons is the
  • energy footprint of learning. The human brain is extraordinarily efficient; it runs on 10 to
  • 20 watts of energy. So I guess my question is, with current systems having a much higher energy
  • footprint for learning, do you think we've missed the mark with the algorithms or with the hardware?
  • Professor Stephane Mallat: Both! Certainly both. The hardware is obvious. Silicon, compared
  • to neurons, is energy-wise extremely bad. At the algorithmic level there is something quite
  • obvious, is that in the neurophysiological system, there is DNA, and DNA already encodes
  • part of the solution. [?Probably we are not born 0:54:13:6] but the architecture is already there,
  • but also responses. You can see it, there are a lot of experiments that have been done on babies,
  • they acquire vision incredibly quickly. You can see, for example,
  • these simple cells that are already there, they're already encoded. Now that could be
  • addressed by pre-training, but that means the algorithm needs to evolve. So both,
  • but probably the worst is the hardware; if you compare it to biology, there we are really far behind.
  • F1: Okay, thank you. So just one last question.
  • M3: Thank you for a fascinating talk, Professor Mallat. I would like to comment
  • on your diagram showing the range of scales considered by physics, from the Planck level,
  • to the whole cosmos. I'd point out that one of the problems, which is the reason for physics
  • being different from chemistry, and different from biology, and from psychology and so on,
  • is it's not just a matter of scale, it's a matter of the nature of the systems that
  • are involved and the fact that the system has evolved. It has properties which aren't
  • contained by any of its constituents. There are emergent properties
  • at different scales. The approach that you've taken seems to me to be concentrating too much
  • on the scales and the interactions between the different scales, without taking account of
  • the different emergent properties that occur in these different situations. Do you agree?
  • Professor Stephane Mallat: So that's a very interesting question. You're absolutely right,
  • the nature of the phenomena that appear at, in fact, different scales, which may correspond
  • to chemistry and biology, or to fundamental physics, is different. You see the same kind
  • of thing appearing in neural networks. In fact, people speak of emergent properties in ChatGPT,
  • which suddenly can work with a prompt which was not expected at the beginning. One point is that,
  • I think you would agree, that all these systems, whether they are biological or chemical, are
  • built from the same particles. So they are built from the same material, in the same way that the
  • emergent properties of a system with trillions of neurons are based on the weights. So
  • they are emergent properties. But these emergent properties result from going to different scales,
  • that's the way I view it. You're absolutely right. There are properties which are fundamentally
  • different. This is why statistical physics is both beautiful and incredibly difficult:
  • understanding how these emergent properties come out of the previous layer. That's what we
  • don't understand either in neural networks. So I'm not sure the two points of view are so different.

91TV Milner Award Lecture 2023 given by Professor Stéphane Mallat.

The remarkable performance of deep neural networks remains a mathematical mystery. How can similar network architectures capture properties of data as different as languages, natural images or physical fields? Physics provides a rich framework to understand this mystery. This presentation bridges neural networks and physics by showing in what sense they rely on similar mathematical principles. It introduces models of deep neural networks and complex physical fields by separating phenomena appearing at different scales. Interactions of structures across scales are learned with random weights and wavelets. These models are applied to image classification, as well as to the generation of fluid turbulence and cosmological fields.

