
Probability theory and AI | 91TV

53 mins watch · 21 February 2022

Transcript

  • Hi, I'm delighted and honoured to be here presenting the Royal Society Milner Lecture today.
  • The topic of my lecture is probabilistic machine learning and artificial intelligence.
  • AI has gone through a number of eras of research. In the early days of AI, the methods were
  • based on logic, search and symbolic AI. Then, in the 1980s and early 1990s,
  • machine learning, which used to be a small subfield of AI,
  • grew in prominence, and the emphasis turned to systems that learn from data. One of those
  • systems that was very exciting in the '80s was neural networks, under the name connectionism,
  • and this generated a huge amount of excitement because people thought of it as maybe a new model
  • for understanding human and animal cognition and relating symbolic and subsymbolic processing and
  • neural processing. The promise of sort of a new paradigm in psychology and neuroscience
  • from this computational perspective, as well as a new paradigm in artificial intelligence, was
  • something that was very palpable in those days. It was also a very significant time for me,
  • personally, because that's when I got into the field,
  • and I was swept away by that excitement around learning systems. In the mid-1990s and early 2000s,
  • people started to become disillusioned with neural networks because of the complexity of
  • training them, and the sort of lack of significant performance from them in an engineering sense,
  • and they turned to more mathematically-elegant methods, such as kernel methods, like
  • support vector machines, probabilistic models, graphical models, Bayesian inference, and so on.
  • In the early 2010s neural networks came back in force with the deep learning revolution,
  • and this is a time that saw many, many breakthroughs happening in the field of
  • AI and machine learning. Again, the sort of link to neuroscience became very prominent,
  • and we're still living in the deep learning era. One of the things I'll touch upon is what's next,
  • what comes after the deep learning era. First, let me step back and ask the question,
  • what is artificial intelligence anyway? I'm not going to try to define it, but I'm going
  • to comment a bit on some terminology. The term artificial intelligence I find a little bit
  • of a misnomer, in my view at least, because intelligence in a machine should not be
  • considered artificial. I think the distinction between artificial and natural seems itself kind
  • of artificial. We don't call what aeroplanes do artificial flying - they fly - and so we should
  • apply the same sort of principles to how we define what biological and machine systems do
  • as well. A second distinction that's worth making is around autonomy versus intelligence,
  • and one of the concerns about AI systems that's often stated, really doesn't stem from the
  • intelligence of the system, but rather from the autonomy, combined with a lack of intelligence. So
  • when we think of a system that's autonomous, we are endowing it with the ability to make
  • decisions on its own, and often that's the situation that becomes concerning. The fact
  • that it's intelligent should be generally a positive thing.
  • There's a term, general intelligence, that people are focusing on a lot these days,
  • and I want to remind people that specialist systems are incredibly useful
  • and that we should celebrate those specialist systems. Also, general intelligence is often
  • related to human abilities, and I would argue that human intelligence is flexible,
  • but not all that general. Which brings me to this next term, which is human-level AI.
  • To me, it's not clear that humans should be a model for intelligent systems. Humans are primates
  • with a particular set of cognitive skills that have evolved for our survival. Human reasoning is
  • notoriously flawed. Human memory, calculation and communication abilities are very limited,
  • and really, we want to strive for machines that are complementary to human abilities.
  • Another thing that I think is worth mentioning is that,
  • often when people celebrate human achievements, what they're really looking at is the product of
  • society and organisations over time, and we must not conflate this with the abilities of a single
  • brain. Again, if we try to mimic the abilities of a single brain, we're not going to get to the sort
  • of advances that human society has achieved. So we should be really looking at systems that
  • are intelligent in a way that have a potential to complement human intelligence.
  • So it's a really exciting time for AI and machine learning. There have been incredible breakthroughs
  • in AI and games that you're all aware of. There have also been a tremendous number of
  • really fantastic advances in practical applications of AI and machine learning,
  • speech and language technologies, computer vision, scientific data analysis, recommender systems,
  • and online commerce, self-driving cars, applications of AI to finance, climate, weather,
  • economic systems, etc. Many of these applications have been driven by deep learning technology,
  • so let me touch on deep learning and see where that takes us.
  • So deep learning models are neural networks of the kind that were very popular in the
  • mid-1980s. Neural networks are tuneable nonlinear functions with many parameters.
  • The parameters of a neural network are these weights, and these models are called
  • neural networks because they're composed of simple computational elements organised in a network,
  • inspired by a very, very simplified model of how the brain works. Deep learning models,
  • in particular, look at modelling these nonlinear functions through layers of computation,
  • and these layers of computation, where one computation gets passed on to the next layer,
  • can be thought of as a composition of functions. So the overall function from inputs X to outputs Y
  • is a composition of functions, and if this has many layers we call it deep. Now, the way these
  • models are trained is usually with some version of maximum likelihood or penalised likelihood,
  • using some variant of stochastic gradient descent optimisation. So in fact, a deep neural network is
  • just some nonlinear function with a little bit of basic statistics as a principle for training it,
  • and some basic optimisation used to minimise the error or maximise the likelihood.
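The recipe described above - a tuneable nonlinear function, squared error as a basic statistical principle, and gradient descent as basic optimisation - can be sketched in a few lines of plain Python. This is an illustrative toy, not from the lecture: a one-hidden-unit network fitted to a simple nonlinear target with hand-written backpropagation.

```python
import math
import random

# A tiny "deep" model: pred = w2 * tanh(w1 * x + b1) + b2,
# i.e. a composition of simple parameterised functions.
random.seed(0)
w1, b1, w2, b2 = (random.uniform(-1, 1) for _ in range(4))

# Toy data: a noiseless nonlinear target the model can represent exactly.
data = [(x / 10.0, math.tanh(x / 10.0)) for x in range(-10, 11)]

lr = 0.1
for epoch in range(200):
    for x, y in data:
        h = math.tanh(w1 * x + b1)       # hidden layer
        pred = w2 * h + b2               # output layer
        err = pred - y                   # derivative of 0.5 * squared error
        # Backpropagation: the chain rule through the composition.
        dw2, db2 = err * h, err
        dpre = err * w2 * (1 - h * h)    # tanh'(z) = 1 - tanh(z)^2
        dw1, db1 = dpre * x, dpre
        w1 -= lr * dw1; b1 -= lr * db1
        w2 -= lr * dw2; b2 -= lr * db2

loss = sum((w2 * math.tanh(w1 * x + b1) + b2 - y) ** 2 for x, y in data)
print(f"final squared error: {loss:.4f}")
```

Real deep learning frameworks automate exactly the gradient bookkeeping done by hand here, via automatic differentiation.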
  • Now, deep learning is, essentially, the same concept as has been around for many decades,
  • except for a few innovations that have happened in the last decade or so. So the community has
  • figured out better architectural and algorithmic ways of making the models deeper. We have vastly
  • larger data sets, thanks to the web, and we also have vastly larger compute resources to be able to
  • train these models. We have much better software frameworks that make the training of these models
  • quite automatic and require much less expertise,
  • and we have, along with all of this, vastly increased industry investment as well as
  • media interest. So there's been a flywheel of excitement around deep learning, and it's produced
  • a tremendous amount of impact in a huge number of areas of science and technology and commerce.
  • Now, there are a few key ideas that have made for the success of deep learning. So what's the
  • difference between what we were doing in the 1980s and what we're doing now? Well, first of all, we
  • figured out that very large models can work really well, and this kind of reminds us of the sort of
  • principles behind nonparametric statistics. Very flexible, large models are quite powerful.
  • But to train those flexible large models, we need huge data sets, and if we don't have
  • huge, real data sets, we can often generate simulated data to train these models.
  • Training these models is made easy through the use of automatic differentiation,
  • so software to do this has really accelerated the field. To make the models deep, we've learned as
  • a community to keep each layer of computation somewhat close to the identity function. That
  • way you can stack many layers together, and that idea has been reinvented in many different ways.
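The keep-each-layer-close-to-the-identity idea can be illustrated with a toy sketch (illustrative, not from the lecture): a residual-style layer computing x plus a small perturbation preserves its signal through many layers, while a plain contracting layer destroys it.

```python
import math

# A layer that stays close to the identity function: x + small g(x)
# (the idea behind residual connections).
def residual_layer(x, scale=0.01):
    return x + scale * math.tanh(x)

# A "plain" layer that shrinks its input away from the identity.
def plain_layer(x, scale=0.5):
    return scale * math.tanh(x)

x_res, x_plain = 1.0, 1.0
for _ in range(50):                 # stack 50 layers of each kind
    x_res = residual_layer(x_res)
    x_plain = plain_layer(x_plain)

print(f"residual stack output: {x_res:.3f}")    # signal survives
print(f"plain stack output:    {x_plain:.2e}")  # signal has vanished
```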
  • We've discovered that stochastic optimisation works surprisingly well,
  • as long as you have good initialisation methods. We've rediscovered in many ways the concept of
  • parameter tying or finding ways of encoding symmetries in the data - convolution, recurrent
  • nets, graph neural nets are all instances of that concept - and we found that the internal
  • representations of these deep neural networks are very valuable and reusable. So we can use
  • a deep neural network that's been pre-trained on one task to be able to do amazing things
  • on many other tasks, and to be able to learn new tasks with very few shots of additional learning.
  • Of course, there are also limitations to deep learning, and although they've been incredibly
  • powerful at giving great performance on many, many benchmarks, they're very data hungry,
  • often requiring millions of examples. They're very compute-intensive to train and deploy.
  • They're easily fooled by adversarial examples, so they're not all that robust.
  • They're finicky to optimise because the objective function is non-convex,
  • and there are many, many choices involved in that optimisation. They're uninterpretable black boxes
  • lacking in transparency and difficult to trust. It's non-trivial to incorporate prior knowledge
  • and symbolic representations, and they're very poor at representing uncertainty. All of these
  • areas of limitation are areas that the research community is actively working on overcoming.
  • Now, I want to talk about a personal view of going beyond deep learning.
  • In my research in the last decade or so, I've been really focusing on a probabilistic
  • modelling approach to machine learning. So this approach is based on the concept of a model.
  • A model is something that is present in many different fields of science and engineering.
  • A model, in my view, describes data that one could observe from a system - that's how you can falsify
  • a model and tell whether it's a good model or not - and any time you have a model, you have
  • to think about the uncertainty that you may have about the parameters and structure of that model.
  • So the probabilistic modelling approach is basically going to use the mathematics of
  • probability theory to express all the forms of uncertainty and noise associated with our model,
  • and then we're going to continue to use probability theory in the form of
  • inverse probability, or Bayes' rule, that allows us to infer unknown quantities, adapt our models,
  • make predictions, and learn from data. So Bayes' rule is expressed here at the top of this page,
  • with Reverend Thomas Bayes looking at it at the bottom of the page. Bayes' rule is a way of
  • learning from data, or converting our knowledge from a state of knowledge prior to observing
  • data to a state of knowledge after observing the data. The way we represent our knowledge about
  • all the things that we're uncertain about is through a probability distribution that
  • captures the range of possibilities that we might have over various different hypotheses.
  • That's expressed here. That's called the prior. Then, when we observe the data for any hypothesis,
  • we should be able to compute the probability of that observed data given the hypothesis.
  • That's the likelihood. We multiply the two. We renormalise over the space of hypotheses we are
  • considering, and what we get is a posterior distribution over hypotheses given data. I should
  • emphasise that in this framework there are only two kinds of quantities out there. There's data,
  • which is measured or observed quantities, and there's everything else. Everything else is
  • fair game to be uncertain about, and that's what I'm calling hypotheses. So any uncertain quantity -
  • model structure, model parameters, any assumptions that we may have - those are the hypotheses,
  • and we should place a probability distribution to represent our uncertainty about those hypotheses.
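The prior-times-likelihood-then-renormalise recipe is easy to make concrete. Here is an illustrative sketch (the hypotheses and numbers are invented for the example): three hypotheses about a coin's bias, a prior over them, and the posterior after observing some flips.

```python
# Three hypotheses about a coin's bias, with a prior over them.
hypotheses = {"fair": 0.5, "biased_heads": 0.9, "biased_tails": 0.1}
prior = {"fair": 0.8, "biased_heads": 0.1, "biased_tails": 0.1}

# Observed data: the specific sequence HHHHT (4 heads, 1 tail).
def likelihood(theta, heads=4, tails=1):
    return theta ** heads * (1 - theta) ** tails

# Bayes' rule: posterior is likelihood times prior, renormalised over
# the space of hypotheses being considered.
unnorm = {h: likelihood(theta) * prior[h] for h, theta in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: p / evidence for h, p in unnorm.items()}

for h, p in posterior.items():
    print(f"P({h} | data) = {p:.3f}")
```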
  • So Bayes' rule can be applied to machine learning in the following way. In fact,
  • it's worth noting that Bayes' rule itself is just a corollary of two even simpler rules
  • from probability theory, the sum rule and the product rule, written at the top of this page.
  • So if we apply Bayes' rule to learning the parameters of a machine learning model given data,
  • we get this expression here, where I've just substituted in theta for the parameters of
  • our model, perhaps the weights of a neural network, D for the data set that we've observed,
  • and M for the model class that we're considering, the overall model structure that we have. If we
  • want to make any prediction about any uncertain quantity X given the data, then the sum rule and
  • the product rule tell us that the way you should make predictions is by averaging the predictions
  • for every possible parameter value, weighted by this posterior distribution over parameters
  • that we've computed in the equation above. So prediction is naturally an averaging or
  • integration process, or summation process if it's a discrete space, and if we want to compare
  • between different models, then we apply Bayes' rule at the level of different model structures M.
  • So it's actually really, really straightforward, and it's actually very powerful as well.
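Prediction-as-averaging can be shown in two lines. In this illustrative sketch (the posterior values are invented for the example), each hypothesis about a coin's bias makes its own prediction for the next flip, and the overall prediction weights these by the posterior:

```python
# A posterior over three hypotheses for a coin's bias,
# mapping theta -> P(theta | data); the numbers are illustrative.
posterior = {0.5: 0.79, 0.9: 0.21, 0.1: 0.0}

# Posterior predictive: average each hypothesis's prediction for the
# next flip, weighted by that hypothesis's posterior probability.
p_heads = sum(theta * p for theta, p in posterior.items())
print(f"P(heads | data) = {p_heads:.3f}")
```

With a continuous parameter space the sum becomes an integral, which is exactly the averaging described above.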
  • We can use Bayesian machine learning and Bayesian methods in general to encode an
  • automatic preference for simplicity. This is called Bayesian Occam's razor,
  • and in this example here, what we have is a data set of eight data points, and we're trying to fit
  • a function y = f(x) to these eight data points. Of course, if we fit polynomials we know that if we
  • have a seventh order polynomial, we can exactly fit these eight data points. This is something
  • that's called overfitting. On the other hand, we might not be fitting enough structure in the data,
  • and that's what's called underfitting, and if we apply Bayesian inference and look at
  • a distribution of possible polynomials given the data - these are sort of these curves shown
  • in green - what we can also get is an automatic penalisation of models that are overly complex.
  • So in this particular case, for example, we get that the system says, well, a quadratic or maybe
  • linear or a cubic seem like reasonable hypotheses for this data. Perhaps it's a constant function,
  • but higher order polynomials, there isn't enough evidence in the data to support those.
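The exact-fit-with-a-seventh-order-polynomial claim can be checked directly. This is an illustrative sketch (the eight data points are invented): Lagrange interpolation builds the unique degree-7 polynomial through 8 points, which achieves zero training error - the canonical picture of overfitting.

```python
# Eight data points (roughly linear) and the degree-7 polynomial that
# fits them exactly, via Lagrange interpolation.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]

def lagrange(x, xs, ys):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The interpolant passes through every point exactly (zero training
# error), but between and beyond the points it need not follow the
# simple underlying trend at all.
for xi, yi in zip(xs, ys):
    assert abs(lagrange(xi, xs, ys) - yi) < 1e-9
print(f"prediction at x = 7.5: {lagrange(7.5, xs, ys):.2f}")
```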
  • Bayesian Occam's razor is a workhorse that can be used for all sorts of model selection
  • questions, so we can use it to automatically find the number of clusters in data, the intrinsic
  • dimensionality of data, to find whether some inputs are relevant to predicting some outputs,
  • or to learn the order of a dynamical system, the number of states in a hidden Markov model,
  • the structure of a neural network, or the structure of a probabilistic graphical model,
  • and it's been applied in all of these cases quite successfully.
  • Now, the thing that drives Bayesian Occam's razor is this concept known as the marginal likelihood,
  • which is the likelihood integrated with respect to the prior,
  • or the probability of the data given the model, averaging over parameters. Not optimising
  • parameters, but averaging over parameters. This has a really beautiful information theoretic
  • interpretation as well, which is that log2 of one over the marginal likelihood
  • is the number of bits of surprise at observing data D under model M.
  • And so Bayesian Occam's razor, basically, says that a model is a good model if the data is
  • not very surprising under that model, and that penalises models that are overly complex because
  • in very complex models, a simple data set would actually be surprising to observe.
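A classic small example of this (illustrative, not from the lecture) compares two models for a sequence of coin flips: a fair coin, versus a flexible coin with unknown bias and a uniform prior. For a specific sequence with h heads and t tails, the flexible model's marginal likelihood is the Beta integral h! t! / (h+t+1)!.

```python
import math

# M_simple:  the coin is fair, P(heads) = 1/2 exactly.
# M_complex: unknown bias theta, uniform prior on [0, 1].
def marginal_simple(h, t):
    return 0.5 ** (h + t)

def marginal_complex(h, t):
    # Integral of theta^h (1-theta)^t dtheta = h! t! / (h+t+1)!
    return math.factorial(h) * math.factorial(t) / math.factorial(h + t + 1)

h, t = 5, 5                       # unremarkable, balanced data
p_s, p_c = marginal_simple(h, t), marginal_complex(h, t)
print(f"P(D|M_simple)  = {p_s:.5f}  ({-math.log2(p_s):.2f} bits of surprise)")
print(f"P(D|M_complex) = {p_c:.5f}  ({-math.log2(p_c):.2f} bits of surprise)")
# The simpler model wins on unremarkable data: the flexible model
# spreads its probability over many possible data sets, so this data
# set is more surprising under it. On extreme data (e.g. 10 heads in
# a row), the flexible model wins instead.
```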
  • Now, the field of AI has many people who don't often think about probabilistic inference,
  • and I want to talk about why probabilities matter for AI. There are a few reasons. First of all,
  • I think we really need to have calibrated models and calibrated prediction uncertainties. We need
  • to have systems that know when they don't know, especially in applications where the decisions
  • of the system are critical, like for example, in a medical application or in a self-driving car.
  • We also want to make sure that we have a nice framework for automatically doing complexity
  • control and structure learning, and Bayesian Occam's razor gives you that.
  • Uncertainty is essential for decision-making problems, like active learning,
  • black-box optimisation, reinforcement learning and other exploration-exploitation trade-offs.
  • Generally, we want to build intelligent systems that can make rational decisions,
  • and probability theory is one of the ways of thinking about rational decision-making.
  • We need ways of building in prior knowledge into our learning systems, and making sure that that
  • knowledge is updated in a coherent and robust way as you get more data, and we want to make sure
  • that our learning algorithms don't just work on huge data sets, but also work on small data sets.
  • Now I'm going to turn to some recent areas of research, and actually they're active areas,
  • but they've been around for a very long time. So we can bring the ideas of probabilistic
  • modelling and Bayesian inference together with deep learning in a very elegant way.
  • So the way we do this is we think of a neural network as simply being a probabilistic model
  • with some parameters theta - those are the weights of the neural network - and we apply
  • Bayesian inference to those parameters of the neural network given the data.
  • Now, in the early 1990s, this idea was explored by people like Radford Neal and David MacKay,
  • and Radford Neal actually showed a very beautiful result, where he showed that a neural network with
  • infinitely many hidden units in one hidden layer,
  • treated in a Bayesian manner, converges to a stochastic process known as a Gaussian process,
  • that's been around for over a century. So at that time, many of us in the field said, 'Ah,
  • neural networks are such a pain. Gaussian processes are so elegant and beautiful.
  • Let's just study Gaussian processes and see how far we can take that.'
  • It turns out that this idea has been revisited in the era of deep learning, so even if you have a
  • deep model, as long as the layers are wide, you still converge to a Gaussian process.
  • In fact, you can see that even empirically, in that the function and the error bars
  • in this one-dimensional function approximation problem, are virtually indistinguishable
  • between the Gaussian process and the Bayesian deep and wide network here.
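To make the Gaussian process side of that comparison concrete, here is an illustrative sketch (not from the lecture, with invented data): GP regression with an RBF kernel, written out by hand for just two training points so the linear algebra is an explicit 2x2 inverse. The hallmark behaviour is the error bars: tiny variance at the data, and variance reverting to the prior away from the data.

```python
import math

# RBF (squared-exponential) kernel with lengthscale ell.
def k(a, b, ell=1.0):
    return math.exp(-0.5 * (a - b) ** 2 / ell ** 2)

X, y = [0.0, 2.0], [1.0, -1.0]    # two training points
noise = 1e-4                      # small observation-noise jitter

# K + noise*I and its closed-form 2x2 inverse.
k11 = k(X[0], X[0]) + noise
k22 = k(X[1], X[1]) + noise
k12 = k(X[0], X[1])
det = k11 * k22 - k12 * k12
inv = [[k22 / det, -k12 / det], [-k12 / det, k11 / det]]
alpha = [sum(inv[i][j] * y[j] for j in range(2)) for i in range(2)]

def predict(x):
    ks = [k(x, X[0]), k(x, X[1])]
    mean = sum(ks[i] * alpha[i] for i in range(2))
    var = k(x, x) - sum(ks[i] * inv[i][j] * ks[j]
                        for i in range(2) for j in range(2))
    return mean, var

m_at, v_at = predict(0.0)    # at a training point: mean near y, tiny variance
m_far, v_far = predict(5.0)  # far from data: mean near 0, variance near prior
print(f"x=0: mean={m_at:.3f} var={v_at:.5f}")
print(f"x=5: mean={m_far:.3f} var={v_far:.3f}")
```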
  • Now, I've related deep neural networks with Gaussian processes, and Gaussian processes can be
  • thought of as kernel machines, just the Bayesian version of kernel machines,
  • like support vector machines, and both of these models can be related to
  • the mother of all statistical models, which is linear regression.
  • Here is a cube diagram that shows the relationship between these models through various operations
  • that you can do from linear regression to other models. So in fact, a lot of the ideas that have
  • been explored in the last three decades of machine learning are closely related to each other.
  • In fact, the field of Bayesian deep learning has been around for three decades, and there
  • have been many, many methods that have been used to approximate the integrals or averages
  • that are needed to do Bayesian inference in a neural network, from the Laplace approximation,
  • variational methods, Markov chain Monte Carlo, dropout, and many other methods. It's
  • still an incredibly exciting area, and it's very promising for getting more calibrated
  • predictions out of models and getting out of distribution performance that's good and robust.
  • So what we really care about is not just fitting a function, but away from the data we want to have a
  • good level of uncertainty that captures that the system might not know what the right answer is.
  • Another way of bringing probabilistic models and deep learning together is recent work on
  • deep sum product networks. A sum product network is a compact, deep representation of a potentially
  • exponentially large mixture model. A mixture model is, again, a classical statistical model that
  • captures a complicated probability distribution as a superposition of simpler distributions.
  • A mixture model is a very flexible model, and a deep sum product network
  • is a way of learning these flexible models in an efficient way. We've recently been exploring
  • deep sum product networks, and there are a number of key properties that are very exciting.
  • So these models are trainable using stochastic gradient descent with automatic differentiation,
  • in much the same way as a deep architecture can be trained using GPU support and methods
  • like dropout, so to the user it becomes very similar to a deep neural network.
  • They can be used both as generative and discriminative models, which is quite
  • useful. Their predictive results are comparable to deep neural networks, at least on the small
  • examples that we've tried out. They have better calibrated uncertainties. We can compute the
  • likelihoods exactly. We can do efficient marginalisation and conditioning exactly, and
  • we can deal with missing data and detect outliers. So I find this very promising and exciting
  • and it's worth further study. Here are a couple of recent papers on this topic.
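A miniature sketch can show the two properties listed above - exact likelihoods and exact marginalisation (and hence missing-data handling). This is an illustrative toy, not the architecture from the papers: a sum node mixing two product nodes over independent Bernoulli leaves.

```python
# A miniature sum-product network over two binary variables.
# Leaves are Bernoulli distributions; a product node multiplies the
# leaves of independent variables; a sum node mixes product nodes.
components = [  # (mixture weight, P(x1=1), P(x2=1))
    (0.3, 0.9, 0.8),
    (0.7, 0.2, 0.1),
]

def leaf(p, x):
    # x = None marginalises the variable out: the leaf evaluates to 1.
    return 1.0 if x is None else (p if x == 1 else 1 - p)

def spn(x1, x2):
    return sum(w * leaf(p1, x1) * leaf(p2, x2) for w, p1, p2 in components)

# Exact likelihoods: probabilities of complete assignments sum to 1.
total = sum(spn(a, b) for a in (0, 1) for b in (0, 1))
print(f"total probability: {total:.6f}")

# Exact marginalisation: setting a leaf to 1 (x = None) gives the same
# answer as summing explicitly over that variable's values.
assert abs(spn(1, None) - (spn(1, 0) + spn(1, 1))) < 1e-12
print(f"P(x1=1) = {spn(1, None):.3f}")
```

Deep sum-product networks stack many such sum and product layers, representing an exponentially large mixture compactly while keeping these exact computations.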
  • Another area that is incredibly powerful and exciting, and that you can combine with deep
  • learning, is probabilistic programming, and the concept of probabilistic programming comes back
  • to the concept of modelling that we had before. We want to be able to express models in many
  • fields of science and engineering, and rather than expressing a model as a set of equations,
  • we can think of expressing the model as a computer program that would generate possible data sets.
  • So a probabilistic programming language is a language for expressing such models,
  • computer programs that generate data or simulators. A universal probabilistic
  • programming language is one that can express any computable probability distribution, and that's
  • a very, very powerful concept. Then, a universal inference engine is what's behind the scenes doing
  • automatic inference over the hidden variables and parameters of your model given the data.
  • So we're used to the concept of running a simulator in the forward direction to
  • generate data, but you can take a simulator and you can say, 'What would happen if this is
  • actually the data that I observe? What should the parameters of my simulator have been?'
  • and universal inference does Bayes' rule on computer programs. It can run a computer program,
  • essentially, in a backward direction and infer the hidden parameters of the computer program.
  • This is actually, in some ways, very similar to what automatic differentiation does, which is to
  • basically compute derivatives through computer programs in the backward direction. The idea
  • of probabilistic programming has been around for a couple of decades as well. There are many very
  • successful probabilistic programming languages, and there are many inference algorithms that have
  • been implemented on the back end to be able to capture inference in these probabilistic programs.
  • Here is a probabilistic program for expressing a Bayesian hidden Markov model in the programming
  • language Turing. It's a very simple-to-understand expression of this probabilistic model. Here is a
  • graphical model representing that model. Then, all you have to do is present the data and all
  • the inference happens automatically, and there are some great resources in probabilistic programming.
  • Now, I actually think that probabilistic programming could really revolutionise the way we
  • think about scientific modelling, and it's at the heart of
  • future advances in machine learning and AI, especially when we combine it with deep learning.
  • Although there are many probabilistic programming efforts that are very exciting, two of them that
  • I've been personally involved in are Pyro, which is a deep universal probabilistic programming
  • language developed when I was at Uber and which is now open source, and Turing, which was developed
  • at the University of Cambridge, but is also now an open-source project.
  • So I want to finish off by thinking about how to build an AI system.
  • So the question is, do we have the mathematical principles required to build machine intelligence?
  • We would need principles for perception, learning, reasoning, decision-making,
  • and I would argue that maybe we already have all the mathematical principles that we need.
  • For perception, we know that deep learning is quite good. We have tools for optimisation,
  • Bayes' rule, and we know that more data helps. For learning, again, Bayesian inference is a
  • nice framework. Optimisation is a workhorse of learning, the likelihood principle is often used.
  • For reasoning, we have decades of work on logic and probability theory, and for decision-making
  • we have decision theory, control theory, Bellman's equation, reinforcement learning, and game theory,
  • and so on. So maybe we have all the tools we need to build AI systems, but maybe what we're missing
  • are better ways of approximating the intractable computations involved,
  • better optimisation algorithms, better probabilistic inference methods, Monte
  • Carlo methods, variational methods, etc., and of course, we also need more data and compute.
  • Now, there are many exciting things about AI. So there are many opportunities that come from AI,
  • but there are also many challenges in the field of AI, especially as
  • it relates to society. So I'll start with one of the opportunities, which is
  • that AI offers the opportunity to accelerate other areas of science and technology,
  • and this is hugely, hugely exciting. AI systems are often very data driven,
  • and one must balance the value they provide to users with privacy concerns when using data.
  • Interpretability of AI systems is a real challenge. AI systems can be hard to interpret and
  • explain, but on the other hand, they can also be more reliable and transparent than humans, because
  • you can examine the code. Control of AI systems is something that is on many people's minds, so
  • endowing any system with autonomy carries risks. The power dynamics of AI is something
  • that many people have talked about. AI can result both in the concentration of power,
  • but it can also empower individuals through devices that give special skills to people.
  • The fairness of AI systems is something that we need to
  • really think about and invest in when we deploy them. AI systems must be designed
  • to be fair to diverse users, and one of the best ways of doing that is to ensure that
  • the people designing the AI systems also come from very diverse backgrounds. So to summarise,
  • I think AI is a tool for society, and we want to make sure that it can help with human flourishing.
  • So I'll just wrap up now. I've talked about a probabilistic modelling framework for building
  • AI systems that reason under uncertainty and learn from data,
  • and I've briefly reviewed some of the frontiers of our research in probabilistic AI,
  • including Bayesian deep learning and probabilistic programming. If you're interested in this topic,
  • here is a review paper which is from a few years ago that talks about probabilistic
  • machine learning and AI. Thank you very much for your time and I look forward to some questions.
  • Thank you Zoubin. Thank you. This was a very inspiring lecture
  • and I'm very glad that you put probability at the centre. I have lots of questions myself,
  • but I think before we do that, I would like to remind anyone who is watching that you can enter
  • your own questions into Slido. You should have received instructions together with the joining
  • instructions email. Okay, so let's go to the first question. Is that okay Zoubin, are you ready?
  • Yes.
  • So the first question is from Michael Hopkins, and it is, as a user of Bayesian inference in
  • conjunction with Gaussian process models since the mid-1990s, I am aware of the great power
  • of these methods. Would Professor Ghahramani like to comment on the sensitivity of the
  • evidence value to prior knowledge, both in GPs and more generally? Over to you, Zoubin.
  • Yes, great. Thanks. That's an excellent question, Michael. So, much has been written about
  • the sensitivity of the evidence to the prior. Just as a reminder, the evidence,
  • or marginal likelihood, is what we tend to use to compare different models,
  • and it integrates over the prior, and so it makes sense that it depends on the prior.
  • I think that it's natural for the evidence to depend on the prior. If we take a model such as
  • a Gaussian process, or any other parametric model, even, and we vary the width of the prior over its
  • parameters, we're actually expressing a different distribution over functions, and so the evidence
  • should depend on the prior, but people are uncomfortable about that, because a lot of people
  • want to seek objective measures for comparing models. I'll say two things. So one of them is,
  • in my view, it's okay to embrace the dependence on the prior and to make it focus our attention more
  • on having sensible priors that capture the distribution over functions or whatever we're
  • trying to learn. So it just puts more pressure on people to think about their prior choices
  • rather than pick priors out of a black box. The other thing that I would like to suggest
  • is, while the evidence itself is interesting from the point of view of comparing between
  • models, in machine learning, and actually in a lot of applications, we're often
  • more interested in the predictive performance of a model, how it will behave when you have new data.
  • That sort of predictive performance tends to be quite robust to the choice of prior,
  • as long as the prior captures the sort of flexible space of models. So I would be less worried about
  • choice of prior if, instead of computing the value of the evidence, you're actually trying to
  • use a probabilistic model to make statements about potential new data.
  • So those are my thoughts on the evidence and the prior. A great question.
  • Okay. Thank you. So we have some more questions. Thank you. Yes,
  • please do ask if you have any more questions. So the next question is from Kerensa Jennings,
  • who is very grateful for the talk and would love to understand
  • more about your thinking about the ethics of AI and probabilistic ML. She noted on slides 33
  • and 34 that you spoke about fairness and diversity, but she would like
  • to understand more on the dimensions and considerations that matter for fairness.
  • Yes, again, a great question. Thank you very much.
  • As AI and machine learning systems are being used more and more in ways that interact with people
  • through products or technologies that people use every day, it's absolutely essential for
  • the people designing these systems to consider the impact that the systems have on their users,
  • and that includes the fairness, accountability, interpretability, and many other ethical
  • considerations that come with building any kind of automated system. As I said earlier on in my talk,
  • often the thing that really matters is the autonomy of the system, not the intelligence
  • side. So when we delegate decision-making to an autonomous system, we need to ensure
  • that the system upholds the values we wish it to have. Now, this is where it gets quite
  • tricky, because the values
  • can be culturally dependent. They depend on the values of the designer of the system, they depend
  • on the values of the people who labelled the data, for example, that the system was trained on.
  • Often, that's the source of a tremendous amount of bias that gets entered into our models.
  • In particular, with language models, for example, if you train them on natural human
  • language, a tremendous number of biases can creep into the models.
  • So there are many, many mechanisms for trying to both technically and societally manage
  • those ethical and bias concerns in models. Let me talk about both the technical and societal way.
  • Technically, we need better tools for probing our machine learning models, which have
  • become less and less interpretable as we've moved into deep learning technology. Better
  • interpretability tools are often very helpful - things like model cards and data cards are
  • examples of tools that increase transparency - but the societal side is also very important.
  • The question around who is designing the models, who is participating in the field of AI,
  • do we have a diverse and representative community of AI researchers and software engineers, and
  • associated people that build these systems? Where are they being built? By
  • people from what cultures? That socio-technical side of the work is incredibly important.
  • I would say for any software system really, but in particular for AI and machine learning systems,
  • because they interact with us in somewhat unpredictable ways, and because they are
  • trained on data that are collected, often, from
  • human interactions with those systems. So it's a huge and important area, one that is
  • prominent both in academic research and in all the big tech companies that work in the
  • field of AI. So, great, thanks for that question.
  • Okay. Thank you. So I will now move to the question from Chris Bishop,
  • who is saying that your lecture was superb and has a quick question. The Bayesian framework is
  • very elegant but computationally intensive. Since we are constrained by compute capacity,
  • are we better to use that capacity to do non-Bayesian training of a large network,
  • or a more Bayesian treatment of a smaller one? Please answer.
  • Great. It's always a pleasure to get asked a question by someone who's a world expert in
  • the field of machine learning, so thanks, Chris. Excellent question.
  • I think the answer depends on two things. First of all, one would like to have methods
  • that span a whole Pareto frontier of computation, so that as we provide more computation,
  • we get better estimates of uncertainty, which is maybe one way of thinking about the
  • calibration of Bayesian methods. Now, at a particular amount of computation, we have the
  • choices that you described:
  • train a larger model in a non-Bayesian way or a smaller model in a Bayesian way,
  • and I think if we have a tremendous amount of data, it's often good to train the larger models,
  • maybe fewer larger models. If we have smaller amounts of data, then for that amount of
  • computation it does often pay off to treat uncertainty in a more calibrated way.
  • One thing I will say, though, is that there's been a tremendous
  • amount of research done on exactly this question, and there is a great
  • NeurIPS tutorial from last year from three of my colleagues at Google, looking at uncertainty
  • calibration, and looking at this whole Pareto frontier. I'm a fan of methods that
  • make me have my cake and eat it too. So in this particular case, methods for training
  • very large models, but improving their uncertainty calibration in a cheap way, so as to approximate
  • the solution for a more expensive Bayesian method, that's often quite promising, and there are
  • good methods for doing that by, for example, retraining small subparts of the large model
  • in a way that captures the uncertainty in the model. There's a lot more in that NeurIPS
  • tutorial by Balaji, Jasper and Dustin Tran. Great, thanks.
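As a minimal, self-contained illustration of the "cheap calibration on top of a standard model" idea, here is a sketch of temperature scaling, a standard post-hoc recalibration technique. This is my own example, not a method the lecture names: the toy logits, the grid search, and all function names are assumptions. A single scalar temperature is fitted on held-out data to reduce the negative log-likelihood of an overconfident classifier, without retraining the model itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the labels under temperature-scaled logits
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    # One scalar fitted on held-out data; the classifier's weights are untouched
    return grid[np.argmin([nll(logits, labels, T) for T in grid])]

# Toy demo: a classifier whose logits are the true ones scaled up by 3,
# i.e. systematically overconfident
rng = np.random.default_rng(1)
n, k = 2000, 5
true_logits = rng.normal(0, 1, (n, k))
labels = np.array([rng.choice(k, p=p) for p in softmax(true_logits)])
overconfident = 3.0 * true_logits

T = fit_temperature(overconfident, labels)
print(T)  # expected to land near the injected scale of 3
```

In practice one would fit T by gradient descent on a validation set, but the grid search keeps the sketch dependency-free; genuinely Bayesian treatments of subparts of a large model go considerably further than this single-parameter fix.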
  • Okay. Thank you. So we still have lots of questions. The next question is from Steve Young.
  • How important will causal reasoning be in future intelligent systems,
  • providing, for example, the ability to reason about counterfactuals?
  • Great. Another brilliant question. Thanks, Steve. I think causal inference is incredibly important.
  • Let's go back to first principles. Intelligent systems are only interesting if they can act in
  • some way, if they can interact in the world. When you act in the world, you're trying to
  • get a system to produce some effect. So the thing that you really need to decide what the
  • right actions are is, at some level, a causal model of the world that you're interacting
  • with to see whether you're producing the right effects. That's true for medicine, as well as for
  • economics, and also artificial intelligence. So causal inference is hugely important for
  • the sciences and for statistics, and a great area of research, and actually the subject of
  • Bernhard Schölkopf's Milner Award Lecture a few years ago, which I greatly enjoyed. Now, I think,
  • do we need counterfactuals for causal inference, is actually a question that has been debated.
  • I'm a proponent of causal inference without counterfactuals, which is a position taken
  • by Phil Dawid. Philip Dawid has produced an excellent paper with that title,
  • 'Causal inference without counterfactuals'. You can actually approach
  • causal inference from the point of view of better probabilistic modelling and better
  • Bayesian inference. So essentially, in my view, many answers in causal inference can be
  • given if we produce a model that predicts better under more circumstances - interventions
  • of various kinds are, in this view, just more circumstances.
  • If our model can predict better under many, many different circumstances,
  • then it starts approaching a causal model of the world, and so you can think of that just
  • in terms of decision theory and probabilistic inference, and that's the position that
  • Phil David has taken. I know there are other people prominent in the field of causality that
  • focus on counterfactuals, but I, as a Bayesian, I'm deeply uncomfortable with counterfactuals,
  • because it seems like we can't reason about things that are impossible to
  • measure and that don't match the reality that we've observed.
  • We should always base our inferences on the data that we have, and the possible data
  • that we could observe in the future. In that world, there is no room for counterfactuals.
  • It's a hard-line position. I'm not an expert in causality, but that is my view at least.
  • Okay, thank you. Okay, so the next question is from [?Sazeep Bhuiyan 0:46:38.5].
  • Apologies for mispronouncing, if I have. I hope that was understandable. Now,
  • the question is, how can we make machine intelligence obey human laws
  • if machine intelligence has autonomy? What happens if machine intelligence disobeys human laws?
  • Who is criminally liable, the machine intelligence or the creator of the machine intelligence?
  • This is a fantastic question. Thanks, Sazeep.
  • So first of all, we should think of machine intelligence as a tool. Even when we endow a
  • machine with some level of autonomy, that autonomy can be constrained. So for example,
  • take a thermostat, a very classic example: a thermostat has machine intelligence and it
  • has autonomy over setting the temperature in your room based on
  • what it senses. It's a very primitive form of machine intelligence. We have built that tool,
  • and it's obeying human laws, because we can kind of control the parameters of a thermostat.
  • A self-driving car is a fancier version of a thermostat, one that obviously has a lot of
  • societal impact and a lot of risks as well. If we want a self-driving car to have some autonomy,
  • which it obviously needs to drive itself, we also will program it to obey human traffic laws,
  • and then we will have the traffic regulatory environment that every human driver lives
  • in, which will also govern self-driving cars, and may be even more stringent for them.
  • Now, the question of criminal liability is a really excellent question. I'm not a lawyer,
  • and this is something that I'm sure a lot of people in the legal profession are thinking about.
  • The self-driving car, again, is a great example to think about this.
  • To some extent, we haven't necessarily created the right legal frameworks for this,
  • but there are examples of autonomous systems, and depending on the
  • legal environment and the jurisdiction, there are different notions of who is responsible
  • for what in case something goes wrong, but I think that law and regulation will have to
  • catch up with the emergence of more machine intelligence in many more application areas.
  • Okay, thank you. I'm aware of the time, so I think I will just allow one last question
  • from Andrew Blake before we finish up. The question is, it seems that humans
  • often learn concepts with much less data than deep networks need. Do you see machine learning
  • catching up closer to humans in this respect any time soon, and if so, how will they achieve that?
  • Thanks, Andrew. Again, it's wonderful to get questions from so many experts in the field.
  • I do think that this is one of the weak points of deep learning that people in machine
  • learning don't talk about enough. I am much less impressed by a deep architecture that can
  • learn from tens of millions of examples than by a simple system that learns from ten examples,
  • whether it's a human or a machine. I'm going to give you an example of why I've been so stubbornly
  • excited about Bayesian methods. That example comes from my own experience when I was working on
  • Bayesian model selection, and you could set up very small data sets - data sets of,
  • for example, ten sequences coming from some artificial language, where a particular
  • sequence might be 'ABABAB', or 'AABBAABB', or whatever.
  • If you have a prior on Bayesian hidden Markov models and you do
  • automatic model selection on that, you can get it to learn the right
  • generative model for those sequences with as few as 10 or 20 examples. In fact, in some cases,
  • the models trained using Bayesian model selection would find more compact and better
  • representations of the data than the ones that a human would come up with, just because they're
  • sort of automatically searching over a space of models in a more systematic way than humans.
  • So I do think that Bayesian model selection, in all its forms, can be an incredibly powerful
  • tool for structure and model learning from very small data sets, and I've seen that in practice,
  • and demoed that to my students and so on, but the field of machine learning has moved on and
  • focussed a lot on very large data sets, and that's where the advances in deep learning have come.
  • But we do need systems that can learn from small data sets, and I do think that
  • Bayesian inference is a good normative framework for learning models from data using Occam's razor.
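The kind of small-data model selection experiment described above can be sketched in a few lines. The toy below is my own construction, not the speaker's actual setup (which used hidden Markov models): it compares an i.i.d. model against a first-order Markov model on binary sequences, using the closed-form marginal likelihood of a Dirichlet-multinomial, so the evidence of each model can be computed exactly.

```python
from math import lgamma

def log_dirmult(counts, alpha=1.0):
    # Log marginal likelihood of multinomial counts under a symmetric Dirichlet(alpha) prior
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

def evidence_iid(seqs, symbols="AB"):
    # Model 1: every symbol drawn i.i.d. from one unknown distribution
    counts = [sum(s.count(c) for s in seqs) for c in symbols]
    return log_dirmult(counts)

def evidence_markov(seqs, symbols="AB"):
    # Model 2: first symbol from one distribution, then first-order transitions,
    # one Dirichlet-multinomial per transition row
    idx = {c: i for i, c in enumerate(symbols)}
    first = [0] * len(symbols)
    trans = [[0] * len(symbols) for _ in symbols]
    for s in seqs:
        first[idx[s[0]]] += 1
        for a, b in zip(s, s[1:]):
            trans[idx[a]][idx[b]] += 1
    return log_dirmult(first) + sum(log_dirmult(row) for row in trans)

alternating = ["ABABABAB"] * 10      # ten short sequences from an 'alternate' language
print(evidence_markov(alternating) - evidence_iid(alternating))  # positive: Markov wins
```

On the alternating data the Markov model's evidence is far higher, even with only ten short sequences; on constant sequences like 'AAAAAAAA' the extra transition parameters cost more than they explain, and the simpler i.i.d. model wins - the automatic Occam's razor that Bayesian model selection gives for free.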
  • Okay. Thank you. So we have come to the end of this event, and I think it just
  • remains to say thank you, once again, for the great lecture and the
  • excellent answers to your questions, and of course, to congratulate you
  • for your achievement and winning the Milner Award. I hope that you will have a chance to visit
  • Carlton House Terrace to receive the actual medal sometime soon. Okay, so thank you.
  • Thank you, Marta. I hope so too, and thank you, Marta, for hosting this wonderful evening.

Join Professor Zoubin Ghahramani to explore the foundations of probabilistic AI and how it relates to deep learning.

Modern artificial intelligence (AI) is heavily based on systems that learn from data. Such machine learning systems have led to breakthroughs in the sciences and underlie many modern technologies such as automatic translation, autonomous vehicles, and recommender systems. Professor Ghahramani discusses some topics at the frontier of probabilistic machine learning and some of the societal challenges and opportunities for AI.

This is the Royal Society Milner Prize Lecture 2021.


About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.

