
Probability theory and AI | 91TV

53 mins watch · 21 February 2022

Transcript

  • Hi, I'm delighted and honoured to be here presenting the Royal Society Milner Lecture today.
  • The topic of my lecture is probabilistic machine learning and artificial intelligence.
  • AI has gone through a number of eras of research. In the early days of AI, the methods were
  • based on logic, search and symbolic AI. Then, in the 1980s and early 1990s,
  • machine learning, which used to be a small subfield of AI,
  • grew in prominence, and the emphasis turned to systems that learn from data. One of those
  • systems that was very exciting in the '80s was neural networks, under the name connectionism,
  • and this generated a huge amount of excitement because people thought of it as maybe a new model
  • for understanding human and animal cognition and relating symbolic and subsymbolic processing and
  • neural processing. The promise of sort of a new paradigm in psychology and neuroscience
  • from this computational perspective, as well as a new paradigm in artificial intelligence, was
  • something that was very palpable in those days. It was also a very significant time for me,
  • personally, because that's when I got into the field,
  • and I was swept away by that excitement around learning systems. In the mid-1990s and early 2000s,
  • people started to become disillusioned with neural networks because of the complexity of
  • training them, and the sort of lack of significant performance from them in an engineering sense,
  • and they turned to more mathematically-elegant methods, such as kernel methods, like
  • support vector machines, probabilistic models, graphical models, Bayesian inference, and so on.
  • In the early 2010s neural networks came back in force with the deep learning revolution,
  • and this is a time that saw many, many breakthroughs happening in the field of
  • AI and machine learning. Again, the sort of link to neuroscience became very prominent,
  • and we're still living in the deep learning era. One of the things I'll touch upon is what's next,
  • what comes after the deep learning era. First, let me step back and ask the question,
  • what is artificial intelligence anyway? I'm not going to try to define it, but I'm going
  • to comment a bit on some terminology. The term artificial intelligence I find a little bit
  • of a misnomer, in my view at least, because intelligence in a machine should not be
  • considered artificial. I think the distinction between artificial and natural seems itself kind
  • of artificial. We don't call what aeroplanes do artificial flying - they fly - and so we should
  • apply the same sort of principles to how we define what biological and machine systems do
  • as well. A second distinction that's worth making is around autonomy versus intelligence,
  • and one of the concerns about AI systems that's often stated, really doesn't stem from the
  • intelligence of the system, but rather from the autonomy, combined with a lack of intelligence. So
  • when we think of a system that's autonomous, we are endowing it with the ability to make
  • decisions on its own, and often that's the situation that becomes concerning. The fact
  • that it's intelligent should be generally a positive thing.
  • There's a term, general intelligence, that people are focusing on a lot these days,
  • and I want to remind people that specialist systems are incredibly useful
  • and that we should celebrate those specialist systems. Also, general intelligence is often
  • related to human abilities, and I would argue that human intelligence is flexible,
  • but not all that general. Which brings me to this next term, which is human-level AI.
  • To me, it's not clear that humans should be a model for intelligent systems. Humans are primates
  • with a particular set of cognitive skills that have evolved for our survival. Human reasoning is
  • notoriously flawed. Human memory, calculation and communication abilities are very limited,
  • and really, we want to strive for machines that are complementary to human abilities.
  • Another thing that I think is worth mentioning is that,
  • often when people celebrate human achievements, what they're really looking at is the product of
  • society and organisations over time, and we must not conflate this with the abilities of a single
  • brain. Again, if we try to mimic the abilities of a single brain, we're not going to get to the sort
  • of advances that human society has achieved. So we should be really looking at systems that
  • are intelligent in a way that have a potential to complement human intelligence.
  • So it's a really exciting time for AI and machine learning. There have been incredible breakthroughs
  • in AI and games that you're all aware of. There have also been a tremendous number of
  • really fantastic advances in practical applications of AI and machine learning,
  • speech and language technologies, computer vision, scientific data analysis, recommender systems,
  • and online commerce, self-driving cars, applications of AI to finance, climate, weather,
  • economic systems, etc. Many of these applications have been driven by deep learning technology,
  • so let me touch on deep learning and see where that takes us.
  • So deep learning models are neural networks of the kind that were very popular in the
  • mid-1980s. Neural networks are tuneable nonlinear functions with many parameters.
  • The parameters of a neural network are these weights, and these models are called
  • neural networks because they're composed of simple computational elements organised in a network,
  • inspired by a very, very simplified model of how the brain works. Deep learning models,
  • in particular, look at modelling these nonlinear functions through layers of computation,
  • and these layers of computation, where one computation gets passed on to the next layer,
  • can be thought of as a composition of functions. So the overall function from inputs X to outputs Y
  • is a composition of functions, and if this has many layers we call it deep. Now, the way these
  • models are trained is usually with some version of maximum likelihood or penalised likelihood,
  • using some variant of stochastic gradient descent optimisation. So in fact, a deep neural network is
  • just some nonlinear function with a little bit of basic statistics as a principle for training it,
  • and some basic optimisation used to minimise the error or maximise the likelihood.
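The recipe described above - a tuneable nonlinear function, squared error as a basic statistical principle, and gradient descent as basic optimisation - can be sketched in a few lines of plain Python. This is an illustrative toy, not from the lecture: a one-hidden-unit network fitted to a simple nonlinear target with hand-written backpropagation.

```python
import math
import random

# A tiny "deep" model: pred = w2 * tanh(w1 * x + b1) + b2,
# i.e. a composition of simple parameterised functions.
random.seed(0)
w1, b1, w2, b2 = (random.uniform(-1, 1) for _ in range(4))

# Toy data: a noiseless nonlinear target the model can represent exactly.
data = [(x / 10.0, math.tanh(x / 10.0)) for x in range(-10, 11)]

lr = 0.1
for epoch in range(200):
    for x, y in data:
        h = math.tanh(w1 * x + b1)       # hidden layer
        pred = w2 * h + b2               # output layer
        err = pred - y                   # derivative of 0.5 * squared error
        # Backpropagation: the chain rule through the composition.
        dw2, db2 = err * h, err
        dpre = err * w2 * (1 - h * h)    # tanh'(z) = 1 - tanh(z)^2
        dw1, db1 = dpre * x, dpre
        w1 -= lr * dw1; b1 -= lr * db1
        w2 -= lr * dw2; b2 -= lr * db2

loss = sum((w2 * math.tanh(w1 * x + b1) + b2 - y) ** 2 for x, y in data)
print(f"final squared error: {loss:.4f}")
```

Real deep learning frameworks automate exactly the gradient bookkeeping done by hand here, via automatic differentiation.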
  • Now, deep learning is, essentially, the same concept as has been around for many decades,
  • except for a few innovations that have happened in the last decade or so. So the community has
  • figured out better architectural and algorithmic ways of making the models deeper. We have vastly
  • larger data sets, thanks to the web, and we also have vastly larger compute resources to be able to
  • train these models. We have much better software frameworks that make the training of these models
  • quite automatic and require much less expertise,
  • and we have, along with all of this, vastly increased industry investment as well as
  • media interest. So there's been a flywheel of excitement around deep learning, and it's produced
  • a tremendous amount of impact in a huge number of areas of science and technology and commerce.
  • Now, there are a few key ideas that have made for the success of deep learning. So what's the
  • difference between what we were doing in the 1980s and what we're doing now? Well, first of all, we
  • figured out that very large models can work really well, and this kind of reminds us of the sort of
  • principles behind nonparametric statistics. Very flexible, large models are quite powerful.
  • But to train those flexible large models, we need huge data sets, and if we don't have
  • huge, real data sets, we can often generate simulated data to train these models.
  • Training these models is made easy through the use of automatic differentiation,
  • so software to do this has really accelerated the field. To make the models deep, we've learned as
  • a community to keep each layer of computation somewhat close to the identity function. That
  • way you can stack many layers together, and that idea has been reinvented in many different ways.
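The keep-each-layer-close-to-the-identity idea can be illustrated with a toy sketch (illustrative, not from the lecture): a residual-style layer computing x plus a small perturbation preserves its signal through many layers, while a plain contracting layer destroys it.

```python
import math

# A layer that stays close to the identity function: x + small g(x)
# (the idea behind residual connections).
def residual_layer(x, scale=0.01):
    return x + scale * math.tanh(x)

# A "plain" layer that shrinks its input away from the identity.
def plain_layer(x, scale=0.5):
    return scale * math.tanh(x)

x_res, x_plain = 1.0, 1.0
for _ in range(50):                 # stack 50 layers of each kind
    x_res = residual_layer(x_res)
    x_plain = plain_layer(x_plain)

print(f"residual stack output: {x_res:.3f}")    # signal survives
print(f"plain stack output:    {x_plain:.2e}")  # signal has vanished
```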
  • We've discovered that stochastic optimisation works surprisingly well,
  • as long as you have good initialisation methods. We've rediscovered in many ways the concept of
  • parameter tying or finding ways of encoding symmetries in the data - convolution, recurrent
  • nets, graph neural nets are all instances of that concept - and we found that the internal
  • representations of these deep neural networks are very valuable and reusable. So we can use
  • a deep neural network that's been pre-trained on one task to be able to do amazing things
  • on many other tasks, and to be able to learn new tasks with very few shots of additional learning.
  • Of course, there are also limitations to deep learning, and although they've been incredibly
  • powerful at giving great performance on many, many benchmarks, they're very data hungry,
  • often requiring millions of examples. They're very compute-intensive to train and deploy.
  • They're easily fooled by adversarial examples, so they're not all that robust.
  • They're finicky to optimise because the objective function is non-convex,
  • and there are many, many choices involved in that optimisation. They're uninterpretable black boxes
  • lacking in transparency and difficult to trust. It's non-trivial to incorporate prior knowledge
  • and symbolic representations, and they're very poor at representing uncertainty. All of these
  • areas of limitation are areas that the research community is actively working on overcoming.
  • Now, I want to talk about a personal view of going beyond deep learning.
  • In my research in the last decade or so, I've been really focusing on a probabilistic
  • modelling approach to machine learning. So this approach is based on the concept of a model.
  • A model is something that is present in many different fields of science and engineering.
  • A model, in my view, describes data that one could observe from a system - that's how you can falsify
  • a model and tell whether it's a good model or not - and any time you have a model, you have
  • to think about the uncertainty that you may have about the parameters and structure of that model.
  • So the probabilistic modelling approach is basically going to use the mathematics of
  • probability theory to express all the forms of uncertainty and noise associated with our model,
  • and then we're going to continue to use probability theory in the form of
  • inverse probability, or Bayes' rule, that allows us to infer unknown quantities, adapt our models,
  • make predictions, and learn from data. So Bayes' rule is expressed here at the top of this page,
  • with Reverend Thomas Bayes looking at it at the bottom of the page. Bayes' rule is a way of
  • learning from data, or converting our knowledge from a state of knowledge prior to observing
  • data to a state of knowledge after observing the data. The way we represent our knowledge about
  • all the things that we're uncertain about is through a probability distribution that
  • captures the range of possibilities that we might have over various different hypotheses.
  • That's expressed here. That's called the prior. Then, when we observe the data for any hypothesis,
  • we should be able to compute the probability of that observed data given the hypothesis.
  • That's the likelihood. We multiply the two. We renormalise over the space of hypotheses we are
  • considering, and what we get is a posterior distribution over hypotheses given data. I should
  • emphasise that in this framework there are only two kinds of quantities out there. There's data,
  • which is measured or observed quantities, and there's everything else. Everything else is
  • fair game to be uncertain about, and that's what I'm calling hypotheses. So any uncertain quantity -
  • model structure, model parameters, any assumptions that we may have - those are the hypotheses,
  • and we should place a probability distribution to represent our uncertainty about those hypotheses.
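The prior-times-likelihood-then-renormalise recipe is easy to make concrete. Here is an illustrative sketch (the hypotheses and numbers are invented for the example): three hypotheses about a coin's bias, a prior over them, and the posterior after observing some flips.

```python
# Three hypotheses about a coin's bias, with a prior over them.
hypotheses = {"fair": 0.5, "biased_heads": 0.9, "biased_tails": 0.1}
prior = {"fair": 0.8, "biased_heads": 0.1, "biased_tails": 0.1}

# Observed data: the specific sequence HHHHT (4 heads, 1 tail).
def likelihood(theta, heads=4, tails=1):
    return theta ** heads * (1 - theta) ** tails

# Bayes' rule: posterior is likelihood times prior, renormalised over
# the space of hypotheses being considered.
unnorm = {h: likelihood(theta) * prior[h] for h, theta in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: p / evidence for h, p in unnorm.items()}

for h, p in posterior.items():
    print(f"P({h} | data) = {p:.3f}")
```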
  • So Bayes' rule can be applied to machine learning in the following way. In fact,
  • it's worth noting that Bayes' rule itself is just a corollary of two even simpler rules
  • from probability theory, the sum rule and the product rule, written at the top of this page.
  • So if we apply Bayes' rule to learning the parameters of a machine learning model given data,
  • we get this expression here, where I've just substituted in theta for the parameters of
  • our model, perhaps the weights of a neural network, D for the data set that we've observed,
  • and M for the model class that we're considering, the overall model structure that we have. If we
  • want to make any prediction about any uncertain quantity X given the data, then the sum rule and
  • the product rule tell us that the way you should make predictions is by averaging the predictions
  • for every possible parameter value, weighted by this posterior distribution over parameters
  • that we've computed in the equation above. So prediction is naturally an averaging or
  • integration process, or summation process if it's a discrete space, and if we want to compare
  • between different models, then we apply Bayes' rule at the level of different model structures M.
  • So it's actually really, really straightforward, and it's actually very powerful as well.
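Prediction-as-averaging can be shown in two lines. In this illustrative sketch (the posterior values are invented for the example), each hypothesis about a coin's bias makes its own prediction for the next flip, and the overall prediction weights these by the posterior:

```python
# A posterior over three hypotheses for a coin's bias,
# mapping theta -> P(theta | data); the numbers are illustrative.
posterior = {0.5: 0.79, 0.9: 0.21, 0.1: 0.0}

# Posterior predictive: average each hypothesis's prediction for the
# next flip, weighted by that hypothesis's posterior probability.
p_heads = sum(theta * p for theta, p in posterior.items())
print(f"P(heads | data) = {p_heads:.3f}")
```

With a continuous parameter space the sum becomes an integral, which is exactly the averaging described above.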
  • We can use Bayesian machine learning and Bayesian methods in general to encode an
  • automatic preference for simplicity. This is called Bayesian Occam's razor,
  • and in this example here, what we have is a data set of eight data points, and we're trying to fit
  • a function y = f(x) to these eight data points. Of course, if we fit polynomials we know that if we
  • have a seventh order polynomial, we can exactly fit these eight data points. This is something
  • that's called overfitting. On the other hand, we might not be fitting enough structure in the data,
  • and that's what's called underfitting, and if we apply Bayesian inference and look at
  • a distribution of possible polynomials given the data - these are sort of these curves shown
  • in green - what we can also get is an automatic penalisation of models that are overly complex.
  • So in this particular case, for example, we get that the system says, well, a quadratic or maybe
  • linear or a cubic seem like reasonable hypotheses for this data. Perhaps it's a constant function,
  • but higher order polynomials, there isn't enough evidence in the data to support those.
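The exact-fit-with-a-seventh-order-polynomial claim can be checked directly. This is an illustrative sketch (the eight data points are invented): Lagrange interpolation builds the unique degree-7 polynomial through 8 points, which achieves zero training error - the canonical picture of overfitting.

```python
# Eight data points (roughly linear) and the degree-7 polynomial that
# fits them exactly, via Lagrange interpolation.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]

def lagrange(x, xs, ys):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The interpolant passes through every point exactly (zero training
# error), but between and beyond the points it need not follow the
# simple underlying trend at all.
for xi, yi in zip(xs, ys):
    assert abs(lagrange(xi, xs, ys) - yi) < 1e-9
print(f"prediction at x = 7.5: {lagrange(7.5, xs, ys):.2f}")
```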
  • Bayesian Occam's razor is a workhorse that can be used for all sorts of model selection
  • questions, so we can use it to automatically find the number of clusters in data, the intrinsic
  • dimensionality of data, to find whether some inputs are relevant to predicting some outputs,
  • or to learn the order of a dynamical system, the number of states in a hidden Markov model,
  • the structure of a neural network, or the structure of a probabilistic graphical model,
  • and it's been applied in all of these cases quite successfully.
  • Now, the thing that drives Bayesian Occam's razor is this concept known as the marginal likelihood,
  • which is the likelihood integrated with respect to the prior,
  • or the probability of the data given the model, averaging over parameters. Not optimising
  • parameters, but averaging over parameters. This has a really beautiful information theoretic
  • interpretation as well, which is that log2 of one over the marginal likelihood
  • is the number of bits of surprise at observing data D under model M.
  • And so Bayesian Occam's razor, basically, says that a model is a good model if the data is
  • not very surprising under that model, and that penalises models that are overly complex because
  • in very complex models, a simple data set would actually be surprising to observe.
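A classic small example of this (illustrative, not from the lecture) compares two models for a sequence of coin flips: a fair coin, versus a flexible coin with unknown bias and a uniform prior. For a specific sequence with h heads and t tails, the flexible model's marginal likelihood is the Beta integral h! t! / (h+t+1)!.

```python
import math

# M_simple:  the coin is fair, P(heads) = 1/2 exactly.
# M_complex: unknown bias theta, uniform prior on [0, 1].
def marginal_simple(h, t):
    return 0.5 ** (h + t)

def marginal_complex(h, t):
    # Integral of theta^h (1-theta)^t dtheta = h! t! / (h+t+1)!
    return math.factorial(h) * math.factorial(t) / math.factorial(h + t + 1)

h, t = 5, 5                       # unremarkable, balanced data
p_s, p_c = marginal_simple(h, t), marginal_complex(h, t)
print(f"P(D|M_simple)  = {p_s:.5f}  ({-math.log2(p_s):.2f} bits of surprise)")
print(f"P(D|M_complex) = {p_c:.5f}  ({-math.log2(p_c):.2f} bits of surprise)")
# The simpler model wins on unremarkable data: the flexible model
# spreads its probability over many possible data sets, so this data
# set is more surprising under it. On extreme data (e.g. 10 heads in
# a row), the flexible model wins instead.
```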
  • Now, the field of AI has many people who don't often think about probabilistic inference,
  • and I want to talk about why probabilities matter for AI. There are a few reasons. First of all,
  • I think we really need to have calibrated models and calibrated prediction uncertainties. We need
  • to have systems that know when they don't know, especially in applications where the decisions
  • of the system are critical, like for example, in a medical application or in a self-driving car.
  • We also want to make sure that we have a nice framework for automatically doing complexity
  • control and structure learning, and Bayesian Occam's razor gives you that.
  • Uncertainty is essential for decision-making problems, like active learning,
  • black-box optimisation, reinforcement learning and other exploration-exploitation trade-offs.
  • Generally, we want to build intelligent systems that can make rational decisions,
  • and probability theory is one of the ways of thinking about rational decision-making.
  • We need ways of building in prior knowledge into our learning systems, and making sure that that
  • knowledge is updated in a coherent and robust way as you get more data, and we want to make sure
  • that our learning algorithms don't just work on huge data sets, but also work on small data sets.
  • Now I'm going to turn to some recent areas of research, and actually they're active areas,
  • but they've been around for a very long time. So we can bring the ideas of probabilistic
  • modelling and Bayesian inference together with deep learning in a very elegant way.
  • So the way we do this is we think of a neural network as simply being a probabilistic model
  • with some parameters theta - those are the weights of the neural network - and we apply
  • Bayesian inference to those parameters of the neural network given the data.
  • Now, in the early 1990s, this idea was explored by people like Radford Neal and David MacKay,
  • and Radford Neal actually showed a very beautiful result, where he showed that a neural network with
  • infinitely many hidden units in one hidden layer,
  • treated in a Bayesian manner, converges to a stochastic process known as a Gaussian process,
  • that's been around for over a century. So at that time, many of us in the field said, 'Ah,
  • neural networks are such a pain. Gaussian processes are so elegant and beautiful.
  • Let's just study Gaussian processes and see how far we can take that.'
  • It turns out that this idea has been revisited in the era of deep learning, so even if you have a
  • deep model, as long as the layers are wide, you still converge to a Gaussian process.
  • In fact, you can see that even empirically, in that the function and the error bars
  • in this one-dimensional function approximation problem, are virtually indistinguishable
  • between the Gaussian process and the Bayesian deep and wide network here.
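To make the Gaussian process side of that comparison concrete, here is an illustrative sketch (not from the lecture, with invented data): GP regression with an RBF kernel, written out by hand for just two training points so the linear algebra is an explicit 2x2 inverse. The hallmark behaviour is the error bars: tiny variance at the data, and variance reverting to the prior away from the data.

```python
import math

# RBF (squared-exponential) kernel with lengthscale ell.
def k(a, b, ell=1.0):
    return math.exp(-0.5 * (a - b) ** 2 / ell ** 2)

X, y = [0.0, 2.0], [1.0, -1.0]    # two training points
noise = 1e-4                      # small observation-noise jitter

# K + noise*I and its closed-form 2x2 inverse.
k11 = k(X[0], X[0]) + noise
k22 = k(X[1], X[1]) + noise
k12 = k(X[0], X[1])
det = k11 * k22 - k12 * k12
inv = [[k22 / det, -k12 / det], [-k12 / det, k11 / det]]
alpha = [sum(inv[i][j] * y[j] for j in range(2)) for i in range(2)]

def predict(x):
    ks = [k(x, X[0]), k(x, X[1])]
    mean = sum(ks[i] * alpha[i] for i in range(2))
    var = k(x, x) - sum(ks[i] * inv[i][j] * ks[j]
                        for i in range(2) for j in range(2))
    return mean, var

m_at, v_at = predict(0.0)    # at a training point: mean near y, tiny variance
m_far, v_far = predict(5.0)  # far from data: mean near 0, variance near prior
print(f"x=0: mean={m_at:.3f} var={v_at:.5f}")
print(f"x=5: mean={m_far:.3f} var={v_far:.3f}")
```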
  • Now, I've related deep neural networks with Gaussian processes, and Gaussian processes can be
  • thought of as kernel machines, just the Bayesian version of kernel machines,
  • like support vector machines, and both of these models can be related to
  • the mother of all statistical models, which is linear regression.
  • Here is a cube diagram that shows the relationship between these models through various operations
  • that you can do from linear regression to other models. So in fact, a lot of the ideas that have
  • been explored in the last three decades of machine learning are closely related to each other.
  • In fact, the field of Bayesian deep learning has been around for three decades, and there
  • have been many, many methods that have been used to approximate the integrals or averages
  • that are needed to do Bayesian inference in a neural network, from the Laplace approximation,
  • variational methods, Markov chain Monte Carlo, dropout, and many other methods. It's
  • still an incredibly exciting area, and it's very promising for getting more calibrated
  • predictions out of models and getting out of distribution performance that's good and robust.
  • So what we really care about is not just fitting a function, but away from the data we want to have a
  • good level of uncertainty that captures that the system might not know what the right answer is.
  • Another way of bringing probabilistic models and deep learning together is recent work on
  • deep sum product networks. A sum product network is a compact, deep representation of a potentially
  • exponentially large mixture model. A mixture model is, again, a classical statistical model that
  • captures a complicated probability distribution as a superposition of simpler distributions.
  • A mixture model is a very flexible model, and a deep sum product network
  • is a way of learning these flexible models in an efficient way. We've recently been exploring
  • deep sum product networks, and there are a number of key properties that are very exciting.
  • So these models are trainable using stochastic gradient descent with automatic differentiation,
  • in much the same way as a deep architecture can be trained using GPU support and methods
  • like dropout, so to the user it becomes very similar to a deep neural network.
  • They can be used both as generative and discriminative models, which is quite
  • useful. Their predictive results are comparable to deep neural networks, at least on the small
  • examples that we've tried out. They have better calibrated uncertainties. We can compute the
  • likelihoods exactly. We can do efficient marginalisation and conditioning exactly, and
  • we can deal with missing data and detect outliers. So I find this very promising and exciting
  • and it's worth further study. Here are a couple of recent papers on this topic.
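A miniature sketch can show the two properties listed above - exact likelihoods and exact marginalisation (and hence missing-data handling). This is an illustrative toy, not the architecture from the papers: a sum node mixing two product nodes over independent Bernoulli leaves.

```python
# A miniature sum-product network over two binary variables.
# Leaves are Bernoulli distributions; a product node multiplies the
# leaves of independent variables; a sum node mixes product nodes.
components = [  # (mixture weight, P(x1=1), P(x2=1))
    (0.3, 0.9, 0.8),
    (0.7, 0.2, 0.1),
]

def leaf(p, x):
    # x = None marginalises the variable out: the leaf evaluates to 1.
    return 1.0 if x is None else (p if x == 1 else 1 - p)

def spn(x1, x2):
    return sum(w * leaf(p1, x1) * leaf(p2, x2) for w, p1, p2 in components)

# Exact likelihoods: probabilities of complete assignments sum to 1.
total = sum(spn(a, b) for a in (0, 1) for b in (0, 1))
print(f"total probability: {total:.6f}")

# Exact marginalisation: setting a leaf to 1 (x = None) gives the same
# answer as summing explicitly over that variable's values.
assert abs(spn(1, None) - (spn(1, 0) + spn(1, 1))) < 1e-12
print(f"P(x1=1) = {spn(1, None):.3f}")
```

Deep sum-product networks stack many such sum and product layers, representing an exponentially large mixture compactly while keeping these exact computations.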
  • Another area that is incredibly powerful and exciting, and that you can combine with deep
  • learning, is probabilistic programming, and the concept of probabilistic programming comes back
  • to the concept of modelling that we had before. We want to be able to express models in many
  • fields of science and engineering, and rather than expressing a model as a set of equations,
  • we can think of expressing the model as a computer program that would generate possible data sets.
  • So a probabilistic programming language is a language for expressing such models,
  • computer programs that generate data or simulators. A universal probabilistic
  • programming language is one that can express any computable probability distribution, and that's
  • a very, very powerful concept. Then, a universal inference engine is what's behind the scenes doing
  • automatic inference over the hidden variables and parameters of your model given the data.
  • So we're used to the concept of running a simulator in the forward direction to
  • generate data, but you can take a simulator and you can say, 'What would happen if this is
  • actually the data that I observe? What should the parameters of my simulator have been?'
  • and universal inference does Bayes' rule on computer programs. It can run a computer program,
  • essentially, in a backward direction and infer the hidden parameters of the computer program.
  • This is actually, in some ways, very similar to what automatic differentiation does, which is to
  • basically compute derivatives through computer programs in the backward direction. The idea
  • of probabilistic programming has been around for a couple of decades as well. There are many very
  • successful probabilistic programming languages, and there are many inference algorithms that have
  • been implemented on the back end to be able to capture inference in these probabilistic programs.
  • Here is a probabilistic program for expressing a Bayesian hidden Markov model in the programming
  • language Turing. It's a very simple-to-understand expression of this probabilistic model. Here is a
  • graphical model representing that model. Then, all you have to do is present the data and all
  • the inference happens automatically, and there are some great resources in probabilistic programming.
  • Now, I actually think that probabilistic programming could really revolutionise the way we
  • think about scientific modelling, and it's at the heart of
  • future advances in machine learning and AI, especially when we combine it with deep learning.
  • Although there are many probabilistic programming efforts that are very exciting, two of them that
  • I've been personally involved in are Pyro, which is a deep universal probabilistic programming
  • language developed when I was at Uber and which is now open source, and Turing, which was developed
  • at the University of Cambridge, but is also now an open-source project.
  • So I want to finish off by thinking about how to build an AI system.
  • So the question is, do we have the mathematical principles required to build machine intelligence?
  • We would need principles for perception, learning, reasoning, decision-making,
  • and I would argue that maybe we already have all the mathematical principles that we need.
  • For perception, we know that deep learning is quite good. We have tools for optimisation,
  • Bayes' rule, and we know that more data helps. For learning, again, Bayesian inference is a
  • nice framework. Optimisation is a workhorse of learning, the likelihood principle is often used.
  • For reasoning, we have decades of work on logic and probability theory, and for decision-making
  • we have decision theory, control theory, Bellman's equation, reinforcement learning, and game theory,
  • and so on. So maybe we have all the tools we need to build AI systems, but maybe what we're missing
  • are better ways of approximating the intractable computations involved,
  • better optimisation algorithms, better probabilistic inference methods, Monte
  • Carlo methods, variational methods, etc., and of course, we also need more data and compute.
  • Now, there are many exciting things about AI. So there are many opportunities that come from AI,
  • but there are also many challenges in the field of AI, especially as
  • it relates to society. So I'll start with one of the opportunities, which is
  • that AI offers the opportunity to accelerate other areas of science and technology,
  • and this is hugely, hugely exciting. AI systems are often very data driven,
  • and one must balance the value they provide to users with privacy concerns when using data.
  • Interpretability of AI systems is a real challenge. AI systems can be hard to interpret and
  • explain, but on the other hand, they can also be more reliable and transparent than humans, because
  • you can examine the code. Control of AI systems is something that is on many people's minds, so
  • endowing any system with autonomy carries risks. The power dynamics of AI is something
  • that many people have talked about. AI can result both in the concentration of power,
  • but it can also empower individuals through devices that give special skills to people.
  • The fairness of AI systems is something that we need to
  • really think about and invest in when we deploy them. AI systems must be designed
  • to be fair to diverse users, and one of the best ways of doing that is to ensure that
  • the people designing the AI systems also come from very diverse backgrounds. So to summarise,
  • I think AI is a tool for society, and we want to make sure that it can help with human flourishing.
  • So I'll just wrap up now. I've talked about a probabilistic modelling framework for building
  • AI systems that reason under uncertainty and learn from data,
  • and I've briefly reviewed some of the frontiers of our research in probabilistic AI,
  • including Bayesian deep learning and probabilistic programming. If you're interested in this topic,
  • here is a review paper which is from a few years ago that talks about probabilistic
  • machine learning and AI. Thank you very much for your time and I look forward to some questions.
  • Thank you Zoubin. Thank you. This was a very inspiring lecture
  • and I'm very glad that you put probability at the centre. I have lots of questions myself,
  • but I think before we do that, I would like to remind anyone who is watching that you can enter
  • your own questions into Slido. You should have received instructions together with the joining
  • instructions email. Okay, so let's go to the first question. Is that okay Zoubin, are you ready?
  • Yes.
  • So the first question is from Michael Hopkins, and it is, as a user of Bayesian inference in
  • conjunction with Gaussian process models since the mid-1990s, I am aware of the great power
  • of these methods. Would Professor Ghahramani like to comment on the sensitivity of the
  • evidence value to prior knowledge, both in GPs and more generally? Over to you, Zoubin.
  • Yes, great. Thanks. That's an excellent question, Michael. So, much has been written about
  • the sensitivity of the evidence to the prior. Just as a reminder, the evidence,
  • or marginal likelihood, is what we tend to use to compare different models,
  • and it integrates over the prior, and so it makes sense that it depends on the prior.
  • I think that it's natural for the evidence to depend on the prior. If we take a model such as
  • a Gaussian process, or any other parametric model, even, and we vary the width of the prior over its
  • parameters, we're actually expressing a different distribution over functions, and so the evidence
  • should depend on the prior, but people are uncomfortable about that, because a lot of people
  • want to seek objective measures for comparing models. I'll say two things. So one of them is,
  • in my view, it's okay to embrace the dependence on the prior and to make it focus our attention more
  • on having sensible priors that capture the distribution over functions or whatever we're
  • trying to learn. So it just puts more pressure on people to think about their prior choices
  • rather than pick priors out of a black box. The other thing that I would like to suggest
  • is, while the evidence itself is interesting from the point of view of comparing between
  • models, in machine learning, and actually in a lot of applications, we're often
  • more interested in the predictive performance of a model, how it will behave when you have new data.
  • That sort of predictive performance tends to be quite robust to the choice of prior,
  • as long as the prior captures the sort of flexible space of models. So I would be less worried about
  • choice of prior if, instead of computing the value of the evidence, you're actually trying to
  • use a probabilistic model to make statements about potential new data.
  • So those are my thoughts on the evidence and the prior. A great question.
  • Okay. Thank you. So we have some more questions. Thank you. Yes,
  • please do ask if you have any more questions. So the next question is from Kerensa Jennings,
  • who is very grateful for the talk and would love to understand
  • more about your thinking about the ethics of AI and probabilistic ML. She noted on slides 33
  • and 34 that you spoke about fairness and diversity, but she would like
  • to understand more on the dimensions and considerations that matter for fairness.
  • Yes, again, a great question. Thank you very much.
  • As AI and machine learning systems are being used more and more in ways that interact with people
  • through products or technologies that people use every day, it's absolutely essential for
  • the people designing these systems to consider the impact that the systems have on their users,
  • and that includes the fairness, accountability, interpretability, and many other ethical
  • considerations that come with building any kind of automated system. As I said earlier on in my talk,
  • often the thing that really matters is the autonomy of the system, not the intelligence
  • side. So when we delegate decision-making to an autonomous system, we need to ensure
  • that the system upholds the values we wish it to have. Now, this is where it gets quite
  • tricky, because the values
  • can be culturally dependent. They depend on the values of the designer of the system, they depend
  • on the values of the people who labelled the data, for example, that the system was trained on.
  • Often, that's the source of a tremendous amount of bias that gets entered into our models.
  • In particular, with language models, for example, if you train them on natural human
  • language, a tremendous number of biases can creep into the models.
  • So there are many, many mechanisms for trying to both technically and societally manage
  • those ethical and bias concerns in models. Let me talk about both the technical and societal way.
  • Technically, we need better tools for probing our machine learning models, which have
  • become less and less interpretable as we've moved into deep learning technology. Better
  • interpretability tools are often very helpful - things like model cards and data cards are
  • examples of tools that increase transparency - but the societal side is also very important.
  • The question around who is designing the models, who is participating in the field of AI,
  • do we have a diverse and representative community of AI researchers and software engineers, and
  • associated people that build these systems? Where are they being built? By
  • people from what cultures? That socio-technical side of the work is incredibly important.
  • I would say for any software system really, but in particular for AI and machine learning systems,
  • because they interact with us in somewhat unpredictable ways, and because they are
  • trained on data that are collected, often, from
  • human interactions with those systems. So it's a huge and important area, one that is
  • prominent both in academic research and in all the big tech companies that work in the
  • field of AI. So, great, thanks for that question.
  • Okay. Thank you. So I will now move to the question from Chris Bishop,
  • who is saying that your lecture was superb and has a quick question. The Bayesian framework is
  • very elegant but computationally intensive. Since we are constrained by compute capacity,
  • are we better to use that capacity to do non-Bayesian training of a large network,
  • or a more Bayesian treatment of a smaller one? Please answer.
  • Great. It's always a pleasure to get asked a question by someone who's a world expert in
  • the field of machine learning, so thanks, Chris. Excellent question.
  • I think the answer depends on two things. First of all, one would like to have methods
  • that span a whole Pareto frontier of computation, so that as we provide more computation,
  • we get better estimates of uncertainty, which is maybe one way of thinking about the
  • calibration of Bayesian methods. Now, at a particular amount of computation, we have the
  • choices that you described:
  • train a larger model in a non-Bayesian way or a smaller model in a Bayesian way,
  • and I think if we have a tremendous amount of data, it's often good to train the larger models,
  • maybe fewer larger models. If we have smaller amounts of data, then for that amount of
  • computation it does often pay off to treat uncertainty in a more calibrated way.
  • One thing I will say, though, is that there's been a tremendous
  • amount of research done on exactly this question, and there is a great
  • NeurIPS tutorial from last year from three of my colleagues at Google, looking at uncertainty
  • calibration, and looking at this whole Pareto frontier. I'm a fan of methods that
  • make me have my cake and eat it too. So in this particular case, methods for training
  • very large models, but improving their uncertainty calibration in a cheap way, so as to approximate
  • the solution for a more expensive Bayesian method, that's often quite promising, and there are
  • good methods for doing that by, for example, retraining small subparts of the large model
  • in a way that captures the uncertainty in the model. There's a lot more in that NeurIPS
  • tutorial by Balaji, Jasper and Dustin Tran. Great, thanks.
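As a minimal, self-contained illustration of the "cheap calibration on top of a standard model" idea, here is a sketch of temperature scaling, a standard post-hoc recalibration technique. This is my own example, not a method the lecture names: the toy logits, the grid search, and all function names are assumptions. A single scalar temperature is fitted on held-out data to reduce the negative log-likelihood of an overconfident classifier, without retraining the model itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the labels under temperature-scaled logits
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    # One scalar fitted on held-out data; the classifier's weights are untouched
    return grid[np.argmin([nll(logits, labels, T) for T in grid])]

# Toy demo: a classifier whose logits are the true ones scaled up by 3,
# i.e. systematically overconfident
rng = np.random.default_rng(1)
n, k = 2000, 5
true_logits = rng.normal(0, 1, (n, k))
labels = np.array([rng.choice(k, p=p) for p in softmax(true_logits)])
overconfident = 3.0 * true_logits

T = fit_temperature(overconfident, labels)
print(T)  # expected to land near the injected scale of 3
```

In practice one would fit T by gradient descent on a validation set, but the grid search keeps the sketch dependency-free; genuinely Bayesian treatments of subparts of a large model go considerably further than this single-parameter fix.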
  • Okay. Thank you. So we still have lots of questions. The next question is from Steve Young.
  • How important will causal reasoning be in future intelligent systems,
  • providing, for example, the ability to reason about counterfactuals?
  • Great. Another brilliant question. Thanks, Steve. I think causal inference is incredibly important.
  • Let's go back to first principles. Intelligent systems are only interesting if they can act in
  • some way, if they can interact in the world. When you act in the world, you're trying to
  • get a system to produce some effect. So the thing that you really need to decide what the
  • right actions are is, at some level, a causal model of the world that you're interacting
  • with to see whether you're producing the right effects. That's true for medicine, as well as for
  • economics, and also artificial intelligence. So causal inference is hugely important for
  • the sciences and for statistics, and a great area of research, and actually the subject of
  • Bernhard Schölkopf's Milner Award Lecture a few years ago, which I greatly enjoyed. Now, I think,
  • do we need counterfactuals for causal inference, is actually a question that has been debated.
  • I'm a proponent of causal inference without counterfactuals, which is a position taken
  • by Phil Dawid. Philip Dawid has produced an excellent paper with that title,
  • 'Causal inference without counterfactuals'. You can actually approach
  • causal inference from the point of view of better probabilistic modelling and better
  • Bayesian inference. So essentially, in my view, many answers in causal inference can be
  • given if we produce a model that predicts better under more circumstances - interventions
  • of various kinds are, in this view, just more circumstances.
  • If our model can predict better under many, many different circumstances,
  • then it starts approaching a causal model of the world, and so you can think of that just
  • in terms of decision theory and probabilistic inference, and that's the position that
  • Phil David has taken. I know there are other people prominent in the field of causality that
  • focus on counterfactuals, but I, as a Bayesian, I'm deeply uncomfortable with counterfactuals,
  • because it seems like we can't reason about things that are impossible to
  • measure and that don't match the reality that we've observed.
  • We should always base our inferences on the data that we have, and the possible data
  • that we could observe in the future. In that world, there is no room for counterfactuals.
  • It's a hard-line position. I'm not an expert in causality, but that is my view at least.
  • Okay, thank you. Okay, so the next question is from [?Sazeep Bhuiyan 0:46:38.5].
  • Apologies for mispronouncing, if I have. I hope that was understandable. Now,
  • the question is, how can we make machine intelligence obey human laws
  • if machine intelligence has autonomy? What happens if machine intelligence disobeys human laws?
  • Who is criminally liable, the machine intelligence or the creator of the machine intelligence?
  • This is a fantastic question. Thanks, Sazeep.
  • So first of all, we should think of machine intelligence as a tool. Even when we endow a
  • machine with some level of autonomy, that autonomy can be constrained. So for example,
  • take a thermostat, a very classic example: a thermostat has machine intelligence and it
  • has autonomy over setting the temperature in your room based on
  • what it senses. It's a very primitive form of machine intelligence. We have built that tool,
  • and it's obeying human laws, because we can kind of control the parameters of a thermostat.
  • A self-driving car is a fancier version of a thermostat, one that obviously has a lot of
  • societal impact and a lot of risks as well. If we want a self-driving car to have some autonomy,
  • which it obviously needs to drive itself, we also will program it to obey human traffic laws,
  • and then we will have the traffic regulatory environment that every human driver lives
  • in, which will also govern self-driving cars, and may be even more stringent for them.
  • Now, the question of criminal liability is a really excellent question. I'm not a lawyer,
  • and this is something that I'm sure a lot of people in the legal profession are thinking about.
  • The self-driving car, again, is a great example to think about this.
  • To some extent, we haven't necessarily created the right legal frameworks for this,
  • but there are examples of autonomous systems, and depending on the
  • legal environment and the jurisdiction, there are different notions of who is responsible
  • for what in case something goes wrong, but I think that law and regulation will have to
  • catch up with the emergence of more machine intelligence in many more application areas.
  • Okay, thank you. I'm aware of the time, so I think I will just allow one last question
  • from Andrew Blake before we finish up. The question is, it seems that humans
  • often learn concepts with much less data than deep networks need. Do you see machine learning
  • catching up closer to humans in this respect any time soon, and if so, how will they achieve that?
  • Thanks, Andrew. Again, it's wonderful to get questions from so many experts in the field.
  • I do think that this is one of the weak points of deep learning that people in machine
  • learning don't talk about enough. I am much less impressed by a deep architecture that can
  • learn from tens of millions of examples than by a simple system that learns from ten examples,
  • whether it's a human or a machine. I'm going to give you an example of why I've been so stubbornly
  • excited about Bayesian methods. That example comes from my own experience when I was working on
  • Bayesian model selection, and you could set up very small data sets - data sets of,
  • for example, ten sequences coming from some artificial language, where a particular
  • sequence might be 'ABABAB', or 'AABBAABB', or whatever.
  • If you have a prior on Bayesian hidden Markov models and you do
  • automatic model selection on that, you can get it to learn the right
  • generative model for those sequences with as few as 10 or 20 examples. In fact, in some cases,
  • the models trained using Bayesian model selection would find more compact and better
  • representations of the data than the ones that a human would come up with, just because they're
  • sort of automatically searching over a space of models in a more systematic way than humans.
  • So I do think that Bayesian model selection, in all its forms, can be an incredibly powerful
  • tool for structure and model learning from very small data sets, and I've seen that in practice,
  • and demoed that to my students and so on, but the field of machine learning has moved on and
  • focussed a lot on very large data sets, and that's where the advances in deep learning have come.
  • But we do need systems that can learn from small data sets, and I do think that
  • Bayesian inference is a good normative framework for learning models from data using Occam's razor.
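The kind of small-data model selection experiment described above can be sketched in a few lines. The toy below is my own construction, not the speaker's actual setup (which used hidden Markov models): it compares an i.i.d. model against a first-order Markov model on binary sequences, using the closed-form marginal likelihood of a Dirichlet-multinomial, so the evidence of each model can be computed exactly.

```python
from math import lgamma

def log_dirmult(counts, alpha=1.0):
    # Log marginal likelihood of multinomial counts under a symmetric Dirichlet(alpha) prior
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

def evidence_iid(seqs, symbols="AB"):
    # Model 1: every symbol drawn i.i.d. from one unknown distribution
    counts = [sum(s.count(c) for s in seqs) for c in symbols]
    return log_dirmult(counts)

def evidence_markov(seqs, symbols="AB"):
    # Model 2: first symbol from one distribution, then first-order transitions,
    # one Dirichlet-multinomial per transition row
    idx = {c: i for i, c in enumerate(symbols)}
    first = [0] * len(symbols)
    trans = [[0] * len(symbols) for _ in symbols]
    for s in seqs:
        first[idx[s[0]]] += 1
        for a, b in zip(s, s[1:]):
            trans[idx[a]][idx[b]] += 1
    return log_dirmult(first) + sum(log_dirmult(row) for row in trans)

alternating = ["ABABABAB"] * 10      # ten short sequences from an 'alternate' language
print(evidence_markov(alternating) - evidence_iid(alternating))  # positive: Markov wins
```

On the alternating data the Markov model's evidence is far higher, even with only ten short sequences; on constant sequences like 'AAAAAAAA' the extra transition parameters cost more than they explain, and the simpler i.i.d. model wins - the automatic Occam's razor that Bayesian model selection gives for free.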
  • Okay. Thank you. So we have come to the end of this event, and I think it just
  • remains to say thank you, once again, for the great lecture and the
  • excellent answers to your questions, and of course, to congratulate you
  • for your achievement and winning the Milner Award. I hope that you will have a chance to visit
  • Carlton House Terrace to receive the actual medal sometime soon. Okay, so thank you.
  • Thank you, Marta. I hope so too, and thank you, Marta, for hosting this wonderful evening.

Join Professor Zoubin Ghahramani to explore the foundations of probabilistic AI and how it relates to deep learning.

Modern artificial intelligence (AI) is heavily based on systems that learn from data. Such machine learning systems have led to breakthroughs in the sciences and underlie many modern technologies such as automatic translation, autonomous vehicles, and recommender systems. Professor Ghahramani discusses some topics at the frontier of probabilistic machine learning and some of the societal challenges and opportunities for AI.

This is the Royal Society Milner Prize Lecture 2021.


About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.

