Computer vision: learning to see the world
Transcript
- Okay. I think we should make a start. Good evening, ladies and gentlemen. My name is
- Peter Bruce. I'm the Physical Secretary of the Royal Society and one of the vice presidents, and
- it's a great pleasure for me to welcome you all to Carlton House Terrace, the home of the Royal
- Society, and to this Bakerian award lecture. Now, let me just say a little bit about the
- Bakerian medal and lecture. It's one of our premier lectures. It is the premier lecture
- and medal in the physical sciences. It was established through a bequest of Henry Baker,
- who was a fellow of the Royal Society, who made a bequest of £100, and I shall read you the remit,
- if you like, for the Bakerian as written at that time, for a narration or discourse on such part
- of natural history or experimental philosophy at such time, and in such manner as the President
- and the Council of the Society, for the time being, shall please to order and appoint. Now
- you'll gather that it's not a relatively recently introduced lecture and medal; it originated in 1775,
- so it's been around for quite some time, and as well as the medal, it's accompanied by a gift of
- £10,000. Now, it's a real pleasure for me to introduce the medal winner and our lecturer this
- evening, Andrew Zisserman. Andrew is one of the principal architects of modern computer vision.
- His work in the 1980s on surface reconstruction and discontinuities is widely cited. He is
- best known for his leading work in the 1990s, establishing the computational theory of multiple
- view reconstruction and the development of practical algorithms that are widely used today.
- This culminated in the publication of a book in 2000, the turn of the century in other words,
- with Richard Hartley, which is already regarded as a standard text.
- His laboratory in Oxford is internationally renowned, and its work is currently shedding
- new light on the problems of object detection and recognition. So without any further ado,
- you want to hear from him I'm sure, not me, let me welcome Andrew to the stage to present his
- lecture as you see the title there, Computer Vision, Learning to See the World. Andrew.
- Okay, thank you for the introduction, Peter, and thank you for the prize. It's a great honour
- for me. It's a great honour for the computer vision field as well. So we're very grateful.
- So I'm going to talk on computer vision, learning to see the world, and I'm going
- to start off by saying what computer vision is. The aim of computer vision is to extract
- visual information from images and videos so that the computer can understand images and videos,
- much like a human would understand them. That's what we aim to do. So in this image here,
- we'd like to be able to answer questions like what is in the image? So the objects,
- in this case a person, that's in the image. Where are the things in the image? Meaning
- the spatial layout of the scene. The pose of the person. And then what is happening?
- Nothing much happening at the moment but if I play the video, it's more interesting. This is
- a clip from Singin' in the Rain. Okay, so as I say, the objective is to be able to carry out
- tasks like this, answer these questions, and the field has made quite some progress and we
- can do quite a few things now, especially since the advent of deep learning about a decade ago.
- So I'll show you some examples, and the first one is I'm going to show you a video sequence
- in the top left, and then various views of the 3D layout of the people and what they're doing,
- as they sort of collide with each other.
- So that's sort of the 'where' task. The next one I'm going to show you is the 'what' task. So here,
- you're seeing object detection and tracking. So these boxes, they're detecting the objects;
- following them as they move through the video, that's the tracking; and there's also recognition going on,
- maybe you can read the labels, but it's recognising various animals and objects.
- So there's a tiger. It recognises the tiger, deer. Also some vehicles you'll see in a moment.
- There's a motorbike. So all of these being recognised directly in the image and then tracked,
- and that was sort of what is in the image, what is in the video. And the next
- one I want to show you is action. So we can recognise human actions now and we can also
- recognise actions of humans and other animals, and I'm going to show you an animal example.
- So here, what you're going to see is chimpanzees and this is how computer vision can support other
- fields. So zoologists have hundreds of hours of videos that they want to analyse and annotate, and
- with computer vision, you can annotate the behaviours of chimpanzees automatically
- throughout the video. So here, you're going to see the behaviours of nut cracking and eating.
- Okay, now each of these tasks I've shown you, sort of recognition tasks
- of actions or what's in the image, has been done by a deep learning model,
- and a deep learning model for a visual task is trained in three steps.
- The first step is you construct a very, very large data set of images or videos that you
- label for the task you want, like recognising an image, then you choose or you design a deep
- model. This is where the deep part comes, a neural network. And third step is to train the model's
- parameters on this data set, and you train it by predicting the labels that are on the data set.
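The three training steps can be sketched in code. This is only a toy illustration, not the network from the lecture: a single linear layer stands in for the deep model, and clustered random vectors stand in for labelled images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a labelled data set. Normally a million labelled images;
# here, random feature vectors clustered around per-class means.
num_classes, dim, n = 5, 16, 200
means = 2.0 * rng.normal(size=(num_classes, dim))
y = rng.integers(0, num_classes, size=n)
X = means[y] + rng.normal(size=(n, dim))

# Step 2: choose a model. A single linear layer stands in for the
# deep network (which would have tens of millions of parameters).
W = np.zeros((dim, num_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Step 3: train the parameters to predict the labels
# (softmax cross-entropy, plain gradient descent).
lr = 0.1
for _ in range(200):
    p = softmax(X @ W)
    p[np.arange(n), y] -= 1.0        # gradient of the loss w.r.t. logits
    W -= lr * (X.T @ p) / n

train_acc = ((X @ W).argmax(axis=1) == y).mean()
```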
- I'll show you an example. So for the recognition I was showing you earlier, recognising the animals
- and the vehicles, we were using a network which had been trained to classify 1000 different object
- categories, and this is a picture of the network. So it takes in an image, in this case an elephant,
- and then the output is a choice of one of 1000 categories it's been trained to recognise,
- and this network has 30 million parameters. That's a lot of parameters, and so to train it,
- you need a very large data set, and it was trained on a data set of a million images, 1000 images for
- each of the 1000 categories. What that means is that somebody had to find 1000 images of monkeys
- like this, 1000 images of dogs, 1000 images of elephants, a colossal amount of work and
- that's what the network was trained on. Now I'm coming to the core of this lecture. This is
- not how a baby learns to see. So a baby is not shown a thousand images and told, this is a dog,
- a thousand images, this is a cat. That's not how a baby learns to see, and what I'm going
- to do in this lecture is explore how computers can learn to see more in a way that an infant
- learns to see and that is to learn directly from the data. That's what I'm going to show.
- So what this means is, we have these three steps for learning a visual task, and we're going to
- throw away the first step. We're not going to have a large data set that somebody has to construct.
- Instead, we're going to obtain the supervision to train the network directly from the data. Now
- this is not a new idea. Turing in 1950 wrote a paper and he said, instead of
- trying to produce a programme to simulate the adult mind, why not rather try to produce one
- which simulates the child's? If this were then subjected to an appropriate course of education,
- one would obtain the adult brain. Okay, so we're just following what Turing suggested in 1950.
- It also ties in with what's been found by psychologists who study
- cognitive development in infants, and what they found is the importance of
- data in developing intelligence: data from the physical world is needed.
- And this paper, which is very good to read, gives six lessons for developing intelligence in
- children, and for us too. And lesson number one is, be multimodal. And so I'm going to be multimodal.
- So for the next part, what I'm going to do next is I'm going to show how machines can learn directly
- from the data without having to have a labelled data set, and I'm going to particularly pick out
- the theme of correspondence between modalities, and that's what we're going to learn from.
- I'm going to show three different examples of this method of learning from the data, which is
- called self-supervision. In the first one,
- the modalities will be audio and visual, and I'm going to learn from the correspondence of those.
- Example number one. So here's an example, first of all, of what a synchronised signal is.
- So this is, of course, what we have in the world. We have these synchronised signals coming in, the
- audio and visual synchronised. We can, of course, make them unsynchronised by shifting the audio,
- and then it sounds like this. And just this
- difference: we get the synchronised signal for free, and we can also make it unsynchronised.
- That's going to be the supervision we're going to use to train a network,
- and I'm going to illustrate this with talking heads. The reason I'm doing talking heads, for
- two reasons. One, if we think of the Turing baby, what is the baby going to see first? It's going to
- see its parents speaking. That's one reason. The second is we as humans are very,
- very sensitive to lack of synchronisation in talking heads. I'll show you an example.
- I was in the camps yesterday talking to people. There are 1.3
- million earthquake survivors still living in those crowded camps.
- So I hope you can see, it's out of sync. So now, how are we going to train the network? We're
- going to train it to tell if the lip motion in a video sequence is synchronised with the audio or
- not. So that will be its task. We're going to give it a video sequence, a video clip,
- and the audio and say, are these synchronised? And that's going to be the training signal.
- The network itself is going to have a part which takes in a video clip; this part is going
- to be called a visual encoder, and it produces a vector.
- I'm also going to have an audio encoder that takes an audio clip, produces a vector. These vectors,
- these are lists of numbers, maybe 512 numbers, but you can think of it just as a point in 3D.
- And so the visual encoder predicts a point. The audio encoder predicts a point.
- And then the training is going to be that if the audio and the visual is synchronised,
- we want these points to be close together, and if the audio and visual are not synchronised,
- we want them to be far apart. So very, very simple.
- Where do we get the data from? We get it from just natural signals coming in. So we have some frames
- and the corresponding video - sorry, corresponding audio - that will be synchronised. We'll call that
- a positive sample, and we can get any number of positive samples from videos of talking heads.
- Now if we take the frames and apply a temporal displacement, the audio and visual will
- no longer be synchronised. So that will be what we call a negative sample. So we can generate
- these positive, negative samples effortlessly from any video coming in of a talking head,
- and we can get millions of these. So that will be our training data. So now we take this network,
- which has a visual encoder producing a vector, audio encoder producing a vector,
- and train it on all these samples, which we know are synchronised or not synchronised
- because we're creating them, and we train it so that it can tell if it's synchronised,
- the points it produces are close, and if they're not synchronised, they're not close.
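A sketch of how this training might be set up. Everything here is a toy stand-in: the encoders are fixed random projections rather than trained networks, and the margin loss is one common choice of contrastive loss; the lecture does not specify the exact loss used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two encoders: fixed linear maps into a shared
# embedding space (the real ones are trained deep networks).
dim_in, dim_emb = 32, 8
Wv = rng.normal(size=(dim_in, dim_emb))
Wa = rng.normal(size=(dim_in, dim_emb))

def visual_encoder(x):
    return x @ Wv

def audio_encoder(x):
    return x @ Wa

# One 'talking head' video: per-frame visual features and audio features.
T = 100
frames = rng.normal(size=(T, dim_in))
track = rng.normal(size=(T, dim_in))

def make_sample(t, shift=0):
    """shift == 0 gives a synchronised (positive) sample;
    any other shift gives an unsynchronised (negative) one."""
    return frames[t], track[(t + shift) % T], shift == 0

# Contrastive margin loss: pull synchronised pairs together,
# push unsynchronised pairs at least `margin` apart.
def contrastive_loss(v, a, is_sync, margin=1.0):
    d = np.linalg.norm(visual_encoder(v) - audio_encoder(a))
    return d ** 2 if is_sync else max(0.0, margin - d) ** 2
```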
- Now imagine we've done that and trained this on millions of examples. What we'll end up
- with is points which are close together when it's synchronised. So what I'm showing here
- is called an embedding space. This is where these vectors we've produced live. I'm showing it in 2D.
- So when we have a synchronised signal, then the points that are produced, the vectors that are
- produced by the video encoder and the audio encoder are close like this. If they're not synchronised,
- then they won't be close. That's what we've learned. We've told it to do this. We've trained
- it. It's done that. So now I'm going to show you, once we've done this, what can we use it for?
- The first thing we can use it for is to synchronise audio and visual signals when they're
- not synchronised. So the way this works is we know that when they're not synchronised, the audio and
- visual embeddings are going to be distant. We can then shift the audio and if we shift it and they
- become close, then they'll be synchronised. So we start with something that's unsynchronised,
- shift the audio until these vectors become close, then it will be synchronised. What that means
- we can do is we had this annoying example of out of sync, we can now synchronise that to this.
- Heavy rain and probably four or five hours of heavy rain ahead. I was in the
- camps yesterday talking to people, there are 1.3 million earthquake survivors.
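The offset search just described can be sketched as follows, with random vectors standing in for the per-frame embeddings produced by the trained encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame embedding vectors for video and audio. Toy setup: the
# audio embedding is an exact copy of the visual one, but the track
# has been delayed by 7 frames (the 'out of sync' clip).
T, dim = 200, 8
visual_emb = rng.normal(size=(T, dim))
audio_emb = np.roll(visual_emb, 7, axis=0)

def find_offset(visual_emb, audio_emb, max_shift=15):
    """Shift the audio embeddings and return the shift at which the
    mean embedding distance is smallest -- the synchronisation point."""
    best_shift, best_dist = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(audio_emb, s, axis=0)
        d = np.linalg.norm(visual_emb - shifted, axis=1).mean()
        if d < best_dist:
            best_shift, best_dist = s, d
    return best_shift

# A shift of -7 realigns the delayed audio with the video.
recovered = find_offset(visual_emb, audio_emb)
```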
- Okay, but more importantly, we can also use this network that we've trained to find out where in
- the image or the video the person speaking is, and the way this is going to be shown is we're
- going to have a video and an audio track, and we can produce what's called a heat map,
- which is hottest where the person speaking is. We're going to use this for localisation.
- Now this is a bit more technical, but the way this works is inside the network,
- we have to go into the network a bit, there's a spatial grid of vectors, and this spatial grid
- of vectors corresponds to the spatial grid of the pixels, and we can take the audio encoding vector
- and we can pick out the vector in the spatial grid which is closest and the one which is closest will
- be the one which is most synchronised, and that will be where the speaker is.
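That localisation step can be sketched like this; the grid of vectors and the audio embedding are random stand-ins for what the trained network would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inside the network: an H x W spatial grid of visual vectors (one per
# image region) and a single audio embedding vector. In this toy setup
# the audio vector matches the grid cell where the speaker is.
H, W, dim = 6, 8, 16
grid = rng.normal(size=(H, W, dim))
speaker_pos = (2, 5)
audio_vec = grid[speaker_pos] + 0.05 * rng.normal(size=dim)

def localise(grid, audio_vec):
    """Cosine similarity between the audio vector and every grid cell:
    a heat map that is hottest where the sound source is."""
    g = grid / np.linalg.norm(grid, axis=-1, keepdims=True)
    a = audio_vec / np.linalg.norm(audio_vec)
    heat = g @ a
    return heat, np.unravel_index(heat.argmax(), heat.shape)

heat_map, location = localise(grid, audio_vec)
```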
- Let's go inside the network a bit. So I'll show you some examples. On the left,
- you'll see the heat map, and on the right, you'll see a box around where the speaker is.
- It's the perfect place to come if you want to see old roses looking their absolute best
- and the very latest…
- and here's another clue, if you come…
- So that was one example where we could track somebody using their voice. Now I'm going to show you a case where we've
- got multiple people speaking on and off, and maybe some people are moving their lips, but
- they're not speaking. They're yawning or laughing. We can pick out the active speaker by this signal
- because we can find the pixels, if you like, which are synchronised with the voice we're hearing.
- By private concerns…
- FOIA, freedom of information request…
- Finally, we're not tied to humans. We've trained a network; we have ways of training networks
- which can pick out synchronised signals. So this can equally work for cartoons,
- where the mouth moves with the voice. So again,
- what you're going to see, I didn't say before, but the blue will be the active speaker and
- the red will be the inactive speaker.
- catch with me tonight.
- but give the monitor a kiss.
- So what you've seen is we can take something which arises from the world, the physics of the world, which is
- synchronisation, and then manipulate it slightly, train this network, which has tens or hundreds of
- millions of parameters, and then use this network to track the person who's speaking, for example.
- So I'm going to go on to the second example, which is audio visual correspondence beyond just talking
- heads. So now we're going to consider more general objects, more general scenarios where we have
- various objects that make sounds or actions that make sounds, and in terms of this Turing infant,
- we imagine that it's been watching its parents talking. Now it's sitting up, it's looking around
- at the world and looking and listening to objects around it. This is development. So the idea here
- is if you see an image like this, this is an image of drums, you know what it's going to sound like.
- And if you hear a sound like this… If you hear a sound like this…
- We
- know the answer. So obviously this is a guitar. So you have this semantic correspondence between
- what's in the image and the sound, and this arises just again from the physical world
- that in the physical world, you look at the scene, if something's sounding,
- you can see it and you can hear it. So this again, just arises from the physics of the world, and
- we're going to use this correspondence, semantic correspondence between the vision and the audio
- to learn from training the network. Just to note, this is a weaker requirement than synchronisation.
- We can do it from a single image. We don't actually need temporal information for this.
- We need the audio signal and an image. So the way to do it, I'm going to formulate it
- as a picking game. We're going to task the network to pick which of these images
- this sound corresponds to. So imagine the sound is actually a guitar and it has to pick out which of
- these it corresponds to, and it should pick this one. The way we're going to do this is again,
- distances. We're going to find which of these embeddings has the smallest distance and pick
- that one. So we'll have a similar network we're going to train as before, we take in a video clip,
- go through a visual encoder that produces a vector, a list of numbers, we take an audio clip,
- it goes through an audio encoder that produces a vector, and what we want is if the audio and
- visual correspond, then the distance between these vectors, these points, is small. If they
- don't correspond, then the distance between the points should be large, and that's it. So where
- do we get the data from? Well, we get the data from any videos we have. So here's two videos.
- We don't need to know what's in them, but they differ in this case. What we do know is that
- there is a correspondence between the sound and the frames. So now we can take samples from this,
- for training. So we take positive samples where we take a frame and the audio around it. We can
- take any number of these. Now how do we get negative samples? We simply take the audio
- from one video and a frame from another video, and in general, they won't correspond. And that's it.
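The sampling scheme can be sketched as follows. The 'videos' here are just identifiers; no labels appear anywhere, and the only thing exploited is knowing which video each frame and each audio clip came from:

```python
import random

random.seed(0)

# A handful of videos, each identified by an index. Each has
# clips_per_video frames and the matching stretch of audio.
videos = list(range(4))
clips_per_video = 50

def sample_positive():
    """A frame and the audio around it, taken from the same video."""
    v = random.choice(videos)
    t = random.randrange(clips_per_video)
    return ("frame", v, t), ("audio", v, t), 1

def sample_negative():
    """A frame from one video and audio from a different video --
    in general these will not correspond."""
    v1, v2 = random.sample(videos, 2)   # two distinct videos
    return (("frame", v1, random.randrange(clips_per_video)),
            ("audio", v2, random.randrange(clips_per_video)), 0)

batch = [sample_positive() for _ in range(8)] + [sample_negative() for _ in range(8)]
```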
- And we can do this, we have videos which arise from the world, have this correspondence property
- naturally, and we can sample millions of positive samples like this and train this network. So
- that's it. So again, imagine we've done this and we'll look at this embedding space where these
- vectors live. Then what we'll have learned is when the audio and visual correspond,
- the embeddings will be close together. And now if we have maybe another instrument like a drum,
- then the sound will be distant from the embedding from the guitar, but it will be close to the
- embedding of the image of the drum. So we have an embedding space that we've learned like this. Now
- what can we do with this? One thing we can do is
- what's called cross-modal retrieval. We can start with a sound, and now we can find images
- which correspond to this sound, and the way to do this is to populate the joint embedding space with
- frames from videos. So that's what I'm showing here. All of these points are frames from videos.
- And now we can look at the neighbourhood of where the sound's been embedded and pick
- frames which are close by and they must be corresponding. So I'll show an example. Here's
- a sound I'm going to play you. Now what is producing that sound? We can dive into this
- embedding space, look at nearby frames and find videos, and here are the videos that
- could have made that sound. So it's cross-modal. We start with audio and we find images.
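A minimal sketch of cross-modal retrieval as nearest-neighbour search in the joint embedding space, with random points standing in for frame embeddings and the query placed near one of them so there is something to find:

```python
import numpy as np

rng = np.random.default_rng(0)

# The joint embedding space, populated with frame embeddings from many
# videos (random points standing in for encoder outputs).
n_frames, dim = 1000, 32
frame_embs = rng.normal(size=(n_frames, dim))

# Embedding of a query sound. In the real system this comes from the
# audio encoder; here we place it next to frame 42 by construction.
query = frame_embs[42] + 0.01 * rng.normal(size=dim)

def retrieve(query, frame_embs, k=5):
    """Return the indices of the k frames nearest to the query sound."""
    d = np.linalg.norm(frame_embs - query, axis=1)
    return np.argsort(d)[:k]

nearest = retrieve(query, frame_embs)
```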
- Now I'm going to show another use of this embedding space. So we've
- trained it so that when there's a correspondence between the audio and
- visual, that the embedding vectors are close. So imagine here we have the sound of a guitar.
- Then any other image of a guitar should be embedded close by to this
- because that's what we've trained it to do. So by transitivity, what's happened with this network is
- it's learned to embed all of the objects at the same class together. That's what it has to learn
- to do, and this is, in fact how it solves the problem. How else could it solve the problem of
- determining the correspondence between the audio and visual, unless it was doing this?
- So we've actually learned a visual network
- which embeds objects of the same class close together,
- and now we can use this for visual retrieval. So we can start off with a frame of a video,
- put it into the embedding space, populate this with other videos, and now look at the neighbours
- of this and these will be other videos. So we can start with a video, search for similar videos,
- or start with a frame and search for similar frames. So here's an example. We start with a
- frame of a guitar, search in a few hundred thousand images, and these are the ones
- that are nearby. As you see, we start from acoustic guitar, it found acoustic guitars.
- Another query. We start with a drum. Search inside this embedding space, we can find images of drums.
- And that's all been… The point is, all this has been learned simply from taking samples
- from videos where the audio and visual correspond. So that's all we had to do.
- We've trained this visual network and now we can use this visual network for recognition.
- We can also use it for localisation. So as we saw in the synchronisation case,
- we go inside the network. The network has this spatial grid of vectors
- to find out where the object is that's making the sound. We take the audio embedding. We look at
- the spatial grid of vectors, find the closest vector, and that will be where the object is
- that's making the sound. So I'm going to show you an example now. You're going to see a video and
- frame by frame, you're going to see the heat map in the centre overlaid on the frame, and then on
- the right, you'll see the heat map itself. As I said, this will all be done frame by frame.
- All these different instruments, you see how it's localising them.
- So all of that learned. Now the third example, we're going to change the modality now. So far,
- we have audio and visual. Now we're going to change to language or text and visual.
- So in terms of our infant, by about ten or 12 months, infants can start to understand words
- and speak words, and eventually they'll learn to read like this. And that's
- what we're going to do now. I've done it in this order, starting with audio and
- visual and then moving on to language, because an infant learns to speak after it's learned to see.
- Okay, let's just go back to our cognitive psychologists. They gave six lessons,
- and lesson number six is learn a language. So we're still following their six lessons.
- So we've seen that we can train networks like this. I showed you in the audio
- case where we have a visual encoder and an audio encoder, and then we have what's called a
- contrastive loss. So we minimise the distance when there's a correspondence between the outputs. And
- very simply, to change the modality from audio to text, we can just swap this audio encoder
- for a text encoder. So now we have text which corresponds to the video. So the text here is a
- man is playing an electric guitar. We have a text encoder and we can use exactly the same idea, this
- correspondence idea that if this text description, the sentence corresponds to this image, describes
- this image, then the output, these vectors should be close together, and if it doesn't describe this
- image, it describes some other image, then the output, these vectors should be far apart. And
- that's it. So we've got our network. Now, how do we train it? So to train this, we need to
- have paired data between images and text. So text which describes images. Where do we get that from?
- Fortunately, on the internet there's something called alt text, which is,
- if you hover your mouse over an image, you often see a sentence comes up and it's provided,
- so that you don't have to download the image or for the visually impaired, it can be read out. So
- this alt text is available in massive quantities. There are millions or billions of examples of alt
- text available. Now I've put some examples on this slide. On the left hand side, the far left,
- the alt text is trees in a winter snowstorm. It's describing the image,
- and the one on the far right is façade of an old shop. So this is available easily,
- and we can train this network by getting millions of examples of these paired visual text, and
- as before, just taking positive ones where they correspond, the distance should be small and
- when they don't correspond. So we pick a random image and a random text, they don't correspond,
- the distance should be large and that's it. We train the network. So again, imagine we've
- trained the network and we look at the embedding space where these vectors live, then what we'll
- have is if the text corresponds to the image as it does here, the embeddings will be close.
- And if we have another text, a man playing a guitar sitting down that doesn't describe this
- image on the left, so it will be far away, but it will be close to the actual image where a man is
- playing a guitar sitting down. So we've got this embedding space again. Now how do we use this? So
- we're going to use this for search and retrieval of images and videos using language. This is
- really useful. So once again, we have a joint embedding space. We can populate this with images.
- So these dots now represent images that have been encoded, the vectors from those, and say we want
- to find a particular image and we want to find it using language. So we describe what we want.
- So here's a sentence, car in a river. We embed that in this space and then we again look for
- neighbours of this. So here's a neighbour. This will then correspond to an image of what we were
- looking for because of the way we've trained it. So now I'm going to show you a demonstration of
- this. The really remarkable thing about language in terms of communicating with the machine is
- you can keep on adding words in language. You can make queries even more complex
- so you can keep on adding requirements. I'll show you how that works in the demo.
- So this will be a demo searching 35 million images from Wikimedia Commons,
- and you'll see the text being typed in and the retrieval will come immediately.
- We'll start off with something quite simple. So the first one is a red car, and there we are.
- Now we make it slightly more nuanced. So it's now a sports car. There it is.
- And now more interesting, several requirements. Person riding a bike. There we are. Change bike
- to horse. Fine. Now make it even more demanding. Riding a horse but jumping. And there it is. And
- so on. We can also search for animals, and we can search for animals doing particular things. Here's
- penguins raising their wings. What you're seeing here, through this embedding,
- really feels like communicating with the machine, because you get this instant response
- and you can keep on making the search query you're looking for more and more precise.
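The way adding words refines a query can be illustrated with a toy embedding. Here each 'image' is a hand-made bag of attribute words rather than an encoder output, but the ranking mechanism, cosine similarity in a shared space, is the same idea:

```python
import numpy as np

# Toy joint text-image space: each image is described by a bag of
# attribute words, and queries are embedded the same way, so adding
# words to a query sharpens the ranking.
vocab = ["red", "car", "sports", "horse", "person", "riding", "jumping"]
images = {
    "red_car":        {"red", "car"},
    "sports_car":     {"red", "car", "sports"},
    "person_riding":  {"person", "riding"},
    "horse_jumping":  {"person", "riding", "horse", "jumping"},
}

def embed(words):
    """Normalised bag-of-words vector over the vocabulary."""
    v = np.array([1.0 if w in words else 0.0 for w in vocab])
    n = np.linalg.norm(v)
    return v / n if n else v

def search(query):
    """Return the image whose embedding is most similar to the query."""
    q = embed(set(query.split()))
    scores = {name: float(embed(tags) @ q) for name, tags in images.items()}
    return max(scores, key=scores.get)

# Adding a word makes the query more precise, as in the demo.
first = search("red car")
refined = search("red car sports")
```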
- Okay, so that's the three examples I wanted to show you of learning from data. So that's sort of
- the tutorial part of the talk. Now I'm going to finish by giving you two snapshots of research,
- more recent research that build again on this type of self-supervised learning of correspondence
- between modalities. So we've done all this work. Now let's use it for some applications. So I'm
- going to show two applications. One is going to be recognising British Sign Language,
- and then I'm going to do audio description of videos. Number one, British Sign Language. So this
- is a visual language that the deaf community uses in Britain, and here's an example.
- I don't know how many of you can read British Sign Language. So I'll tell you what she was doing.
- She was interpreting the sentence, 'Every spring, our planet is transformed.' Okay,
- I'm going to play it again, and you can look for the sign for planet.
- In fact, she does seven signs in that short sequence and it's very challenging to spot them
- all. We would like, of course, to have a machine that could understand British Sign Language for
- many reasons. One is so that then deaf people can communicate with machines. At the moment,
- we can speak using Alexa, we can speak to machines and get them to do what we want, but
- if a deaf person wants to do that, they have to type it. It's much better if they could use their
- own language to communicate, and of course, it would be very good if they could communicate with
- non-signers and the machine could help do that, could translate. So that's why we want to do this.
- How do we do it? Where do we get our data from? Where do we get our paired
- data from? And the answer is we get it by watching television, because on television
- you'll have seen signs overlaid in television programmes like this, and you have subtitles which
- correspond to what's being said, and the signer is also interpreting what's being said. So you have a
- pair data, a correspondence between the subtitle and the sign sequence,
- and this correspondence is what we can learn from as we've been doing all the way through this talk.
- Now, the BBC have very generously made available 2000 signed programmes together
- with subtitles to support academic research on recognising British Sign Language. I'm
- going to show you some work we've been doing on this large data set that they released,
- and what I'm actually going to show you is how we can recognise signs using mouthings. What I
- mean by mouthings is that, when signers are signing, sometimes they mouth the words that they're
- signing. Not always, but sometimes they do. That's what we can pick up on. So I'll show you
- some examples. On the left, you're going to see the sign for office, but also he mouths office.
- On the right, he's going to do the sign for tree and he's going to mouth tree.
- So why is this useful? It's useful because we can pick up words that are being mouthed on the lips.
- We know how to do that. So how this is going to work is imagine we have a subtitle like this
- clip here. Are you happy with this application? Now we can look for each word in this subtitle,
- happy, application, and see if it's mouthed. Now in this example, she does mouth happy.
- So we look at the lips and we find where happy is being mouthed,
- and once we've done that, then we know the temporal segment where she
- mouthed happy, and because the sign is made at the same time as the mouthing, we know
- the sign. So we have a way of automatically annotating the data and getting the signs,
- and the way we do the spotting on the lips, it uses this synchronisation network I showed you
- in the first example. That's actually how we do it. So that was one example. Now imagine
- we do this at industrial scale. We scale up. We do it on the BBC 2000 programmes.
- So we take the words that occur in the subtitles, we look at all the subtitles as they occur and we
- see whether the person is mouthing that word, and then we take that segment and that will be
- the sign corresponding to the word. I'll show you some examples. So first of all, for family,
- important, you see the word important. You see we're getting all these different examples,
- before. So actually, if you look at this one, before, it's signed in two different ways in
- these examples. You can pick out some signers doing it one way, some signers doing it another way.
- Perfect. Now we're getting these signs from mouthings, but of course once we have the sign,
- we can learn to recognise it just from the hand movements and hand gestures. So then we'll be
- able to recognise it, whether they mouthed or not, and for each word, we can generate
- hundreds of examples, hundreds of thousands of examples in total. You're seeing these
- examples here. So it really is quite a powerful method, in fact,
- because we can generate signs for thousands of different words and hundreds of examples, say,
- for each one, and now we have a way where we can learn all these signs and recognise them
- by computer. This problem is not solved but this is a way of generating the data.
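The mouthing-based annotation pipeline described above can be sketched in code. Everything here is a toy stand-in: `lip_scores` plays the role of a keyword-spotting model run on lip crops (in the talk, the synchronisation network does this job), and the threshold and padding values are purely illustrative.

```python
import numpy as np

def spot_sign_from_mouthing(subtitle_words, lip_scores, frame_times, threshold=0.5):
    """For each subtitle word, check whether a mouthing detector fired on the
    lip region; if so, return the temporal segment as an automatic sign label."""
    annotations = []
    for word in subtitle_words:
        scores = lip_scores.get(word)          # per-frame mouthing probability
        if scores is None:
            continue
        peak = int(np.argmax(scores))
        if scores[peak] >= threshold:          # the word was mouthed
            start = frame_times[max(peak - 5, 0)]                    # pad ~5 frames
            end = frame_times[min(peak + 5, len(frame_times) - 1)]
            annotations.append({"word": word, "start": start, "end": end})
    return annotations

# Toy example: "happy" is mouthed around frame 12, "application" is not.
times = np.arange(20) / 25.0                   # 20 frames at 25 fps
scores = {"happy": np.exp(-0.5 * ((np.arange(20) - 12) / 2.0) ** 2),
          "application": np.full(20, 0.1)}
print(spot_sign_from_mouthing(["happy", "application"], scores, times))
```

Each returned segment then serves as a training example pairing the word with the co-occurring sign.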
- Now, the second application I want to show you a snapshot of is audio description of video.
- Audio description is a soundtrack that's provided for the visually impaired and it describes the
- visual elements of the television programme or the movie so that they can understand what's going on.
- I'm going to show you a short example of audio description, the type of thing that's available.
- This is for the film Out of Sight.
- She turns her head and finds Jack standing beside her.
- Can I buy you a drink?
- Sit down.
- then places his lighter on the table. She opens her mouth as if to speak, but no words come.
- So you can see how the audio description is complementary to the soundtrack.
- So the things that you couldn't tell from what's being said or the music,
- that's what it's providing, and then someone who's blind can understand what's going on in the film.
- So we'd like to be able to generate these automatically. So we'd like to have a machine
- that takes in the video and then produces the audio description, probably as text,
- and then we have a text to speech that will read it out. So the visually impaired can follow it.
- So how would we do that? So we obviously need to supply the video to a model we're going to
- train, but we have to do more than that, because audio descriptions have the names. You
- heard the names of the characters. So we also have to provide the names of the characters. We have to
- provide a character bank of people who are in the film. So we need this auxiliary information. Now,
- given those two inputs, then we want to train a model to produce the audio description.
- Now where are we going to get the training data? And we need paired data between films
- and audio description. Fortunately, volunteers have provided audio descriptions for thousands
- and thousands of films, so this paired data is readily available, and as you've seen,
- we can learn from these corresponding data. So we have films. We have the audio descriptions. We can
- learn a model which generates audio descriptions. I'm going to show you two examples of audio
- descriptions that we've generated. Again, this is still a work in progress. It's not finished.
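The audio-description pipeline just described (video plus a character bank in, description text out, then text-to-speech) can be sketched as follows; the `captioner` and `tts` components are hypothetical stubs standing in for the trained models.

```python
def generate_audio_description(video_frames, character_bank, captioner, tts):
    """Hypothetical pipeline: a trained captioning model consumes the video
    together with a bank of character names so it can refer to people by name,
    emits the description as text, and a text-to-speech stage reads it out."""
    text = captioner(video_frames, character_bank)   # e.g. "Snape points at Harry."
    audio = tts(text)                                # waveform for the AD track
    return text, audio

# Stub components standing in for the trained models.
captioner = lambda frames, bank: f"{bank[0]} looks at {bank[1]}."
tts = lambda text: [0.0] * len(text)                 # silent placeholder waveform

text, audio = generate_audio_description(["frame0", "frame1"],
                                         ["Snape", "Harry"], captioner, tts)
print(text)
```

The key point from the talk is the auxiliary input: without the character bank, the model has no way to produce names.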
- They're both from Harry Potter, and the first one is a painful example.
- Concentrate, Potter. Focus.
- Okay, so that's the clip, and this was the audio description that we predicted
- for that. So, Snape points at Harry. It got the characters right. Harry closes
- his eyes in horror. There's more pain I'd say, but this is what's produced.
- Second example, a more pleasant example.
- So how are we going to get to London?
- There we go. The audio description that was produced was Hermione, Ron, and Luna's eyes are
- fixed on Harry, who is standing in the doorway. That's correct. That's what happened. Then Harry
- rides on a horse's back as a horse rears up in the air. So this model thinks this is a horse,
- but of course, it clearly is not if you've read Harry Potter. So there's more to do here.
- Okay, so that's the end of my snapshot. So I'm finishing now. This is
- what you've seen. You've seen that it's possible to learn visual encoders directly from data
- in various ways. There's no need for manual supervision,
- which is the traditional way of doing this, and I've gone through a learning curriculum for
- a virtual infant: audio-visual synchronisation, audio-visual correspondence and language-visual
- correspondence. That's what you've seen. I should just mention that the computer vision field
- works on this problem a lot, and even though I've shown cross-modal learning, you can also learn
- visual encoders purely from the visual stream. So deaf people can see as well, of course.
- I'd like to end by thanking people. So of course, this work is not done by me by any
- means. It's done by my students and my postdocs, and I'm always inspired by talking to colleagues,
- at Oxford in the UK and at DeepMind internationally. It wouldn't be possible without all these
- people and a lot of them are here in the audience. So that's great. So thank you.
- Thank you very much, Andrew, for a super lecture. Very stimulating. We have time for some questions.
- So who would like to start us off? Now there are a couple of people with microphones,
- so if you put your hand up, someone will come to you with a microphone. So let's
- start over here on my right, your left, with the first question. We also have people because
- of course this is being live streamed. So we have people who can ask questions on Slido,
- and I think one of my colleagues is hovering around with an iPad, and will wave at me if we
- have some questions on Slido. Please go ahead.
- How far are you away from having real-time, live access to this system?
- So there are lots of systems I showed here. The real time demo I was showing, that's real time. So
- you can type it and it will immediately retrieve images or videos from a large data set.
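A minimal sketch of the kind of embedding-based retrieval behind that demo, assuming a pre-computed, normalised bank of image embeddings; the toy `embed` function here stands in for a trained text encoder. The per-query cost is one embedding plus one matrix product against the index, which is why retrieval can feel instant even over millions of images.

```python
import numpy as np

def retrieve(text_query, embed_text, image_index, image_embeddings, k=3):
    """Embed the query once, then rank a pre-computed bank of image
    embeddings by cosine similarity (embeddings are pre-normalised)."""
    q = embed_text(text_query)
    q = q / np.linalg.norm(q)
    sims = image_embeddings @ q
    top = np.argsort(-sims)[:k]
    return [image_index[i] for i in top]

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 64))                    # 1,000 toy image embeddings
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
names = [f"img_{i:04d}" for i in range(1000)]
embed = lambda text: bank[hash(text) % 1000]          # toy encoder for the demo
print(retrieve("a horse on a beach", embed, names, bank))
```

In practice the index over millions of images would use an approximate nearest-neighbour structure rather than a dense matrix product, but the principle is the same.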
- Right. What I was asking about is more the video. So at the
- moment it's trained on videos online, I take it, and if you
- were to have a device that could see someone talking in real time, like for instance, the
- sign language, could that be then translated?
- To be clear, we can't do continuous sign language yet. At the moment most of these methods run on
- big machines, GPUs and so on, but there's lots of work on taking these big models and distilling them
- down to smaller models in various creative ways. So even though at the moment they run on
- GPUs, etc., some of these models already can run in your browser. You
- can do real-time pose detection of humans in your browser, and ASR,
- automatic speech recognition, can be done in your browser too. So these models start large but then
- once they're ready, they can be made smaller and more portable. Does that answer your question?
- Yes, that's basically it, really, yes. Great, thank you.
- I wouldn't say all the models though. I mean, some of the models are too large at the moment to do
- that, of course, but that's the way it goes.
- Okay, the next question just a few rows back, I think. There we are.
- Congratulations on the prize. Thanks for the lecture. So it feels like for different tasks,
- maybe you have to do very specific data processing, right? Which you explained now. Do
- you see a way of doing self-supervision with a fairly general type of data processing which you
- can apply later to very different tasks, maybe like what ChatGPT does for text? Thank you.
- Thank you for the question. Yes, I've concentrated on visual tasks here,
- but we already know the answer to that. The answer is yes. I've shown some different types of
- self-supervision here, and there are many others, as I said, and once the networks have been trained
- in some way by these self-supervised methods, they then can be used for multiple tasks by
- applying what are called different heads. So they can be used for recognition, object detection,
- tracking. Once you have good features, a good network, it can do multiple tasks,
- and this is really the way that large scale networks are trained nowadays,
- but there's still the issue of having to say what the tasks are. It's still a research question to
- train a network so that it immediately does all the tasks you want it to do, predict depth,
- predict other things about images or videos. There's still work to be done here, but we
- certainly have lots of evidence that a good visual backbone enables lots of tasks after that.
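The backbone-plus-heads pattern just described can be sketched as below: one shared encoder (here a toy linear map, not any real architecture) feeds several small task-specific heads, so new tasks only need a new head, not a new backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(32, 16))       # stands in for a trained encoder

def backbone(image):
    """Stand-in for a self-supervised visual encoder: image -> feature vector."""
    return np.tanh(image @ W_backbone)

# Task-specific "heads": small layers attached to the shared features.
heads = {
    "classify": rng.normal(size=(16, 10)),   # 10 object classes
    "depth":    rng.normal(size=(16, 1)),    # scalar depth estimate
}

def run_task(image, task):
    features = backbone(image)               # shared, trained once via self-supervision
    return features @ heads[task]            # only the head is task-specific

img = rng.normal(size=32)
print(run_task(img, "classify").shape, run_task(img, "depth").shape)
```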
- Okay. So I think we have a question there. If you go one row back first, I think, if you pass
- it back and then I've got the order correct.
- How do you decide how many dimensions the embedding space should be?
- It's usually a power of two. It's a good question, and there's not a good answer. There's
- also storage to consider: larger vectors require more storage. But no, I don't have a good answer.
- It's an empirical question, really. You try different sizes and you determine what works best
- empirically. Thank you for asking. I don't know if anybody else here has got a better answer.
- Any volunteers? No, probably not. Okay. Go ahead please.
- Thank you very much for the talk, Andrew. Knowing what you know now about this research project,
- which parts were the hardest? Was it data collection, data prep? Was it designing
- the encoder, the neural net architecture? For somebody else that's working
- on a similar research project, what advice, knowing what you know, would you give?
- Each stage of these projects, you stumble across something. It's always the way with
- projects. You have an idea or somebody has an idea and you start doing it and then
- unexpected things happen. So it's difficult to answer, really. Sometimes the networks are
- hard to train. Sometimes the data that you think is good is not good and when you're
- putting together so many things like this, unexpected things happen. One of my rules is
- when you have data sets, they're always noisy. I mean, it always happens.
- There's always something wrong and you always have to look at your data and see what's going
- on. So I can't give you a definite answer. It's just that every stage always has problems.
- Okay. So the question at the front here.
- Thank you.
- great lecture. So with the first method you showed with temporal alignment and so on,
- and you argued that that could be how babies might learn to see as a cue, but with the later work,
- with things like text to object, that still seems to require a huge amount of data. Is there any
- argument that might be similar to ways that babies learn to see and correspond with language?
- It's good to ask. I think the huge amount of data in the text case is an Achilles heel.
- That's a problem at the moment that you need so much data to learn from. I think
- it's a research question how you can avoid needing so much data. By the way,
- I wasn't saying this is how babies necessarily learn. In principle,
- they can learn this way because we can see that just from the data, you can learn these tasks,
- these skills. Not saying necessarily that they do this at all. It's just that the information
- is there, and the order in cognitive development is that after they've learnt to see and hear, we
- know that it's later that they acquire language. That's the only point I was making there, but yes,
- how to avoid having to use such vast quantities of data in the text case, I don't have a good
- answer to that. In the audio and visual case, it's readily available. There's no cost to that.
- Okay. Question here.
- Just keep your hand up, please, yes, that's great, so we can see. Thank you.
- Hi. Thank you. I am not in the field, so my question might be stupid, but I'm just wondering
- for the correspondence of synchronisation, what if there is some false correspondence? Let's say I'm
- waving my hand, but this actually corresponds with a sound which is not in the image at all.
- Would your model be able to pick that out?
- So it's the large scale data that helps avoid problems like that. You always have,
- we can call it noise, things which don't correspond to what's making the sound,
- but when you see enough examples, you can pick out the ones that matter and the ones that
- don't. So that's the answer to that question, really. If you think of the talking head,
- there are lots of things going on. The person who is talking, their eyes might flutter,
- their hair might blow in the wind, but in order for it to solve the task of learning
- synchronisation, it has to see that what really matters is the lips, because it's the lips that
- are synchronised with the voice, with the speech. So it gets to ignore all these nuisance factors,
- this sort of noise, and pick out what really matters, otherwise it can't solve the task.
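The synchronisation objective can be sketched as a simple contrastive loss, in the spirit of SyncNet-style training but heavily simplified: aligned audio/video feature pairs are pulled together, temporally shifted pairs pushed apart. The features below are random stand-ins, not outputs of any real network.

```python
import numpy as np

def sync_loss(video_feats, audio_feats, margin=1.0):
    """Contrastive synchronisation objective (simplified): aligned pairs
    should be close, temporally shifted pairs at least `margin` apart.
    To drive this loss down, a network is forced to latch onto whatever
    genuinely co-varies with the sound -- the lips -- and ignore nuisance
    motion like hair or eye blinks."""
    pos = np.linalg.norm(video_feats - audio_feats, axis=1)                 # aligned
    neg = np.linalg.norm(video_feats - np.roll(audio_feats, 1, 0), axis=1)  # shifted
    return np.mean(pos ** 2 + np.maximum(margin - neg, 0.0) ** 2)

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))              # 8 time steps of 16-d features
good = sync_loss(v, v)                    # perfectly synchronised features
bad = sync_loss(v, np.roll(v, 1, 0))      # audio shifted by one step
print(good < bad)
```

With synchronised features the loss is near zero; shift the audio stream and it grows, which is exactly the signal the network trains on.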
- Andrew, how many is enough?
- It varies in different circumstances. It's not the same in all circumstances.
- It's a question we always fight with. We keep on going empirically until things work well,
- and then you see if you can train more efficiently, meaning less data,
- but, waving my hands, I'd say a million samples, because that's typically what we use,
- because it's so easy as well to get this data.
- apply generically to these things?
- There are the scaling laws, what are called scaling laws, where you can say, for the number of parameters,
- how much data do you need given a certain training budget, and this is published work for training
- models. Again, it starts off empirically. It's not like physics or geometry where you can give
- first-principles arguments; this field doesn't have things like that. It's much more empirical
- and then generalising from that.
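As a concrete (and heavily simplified) instance of such a scaling law, the published compute-optimal result of Hoffmann et al. (the "Chinchilla" work) is often summarised as a rule of thumb of roughly 20 training tokens per model parameter. The real result is a fitted curve, not a single constant, so this is only illustrative:

```python
def chinchilla_tokens(n_params):
    """Rule of thumb distilled from compute-optimal scaling-law work
    (Hoffmann et al.): train on roughly 20 tokens per model parameter.
    Purely illustrative -- actual scaling laws are fitted empirically."""
    return 20 * n_params

print(chinchilla_tokens(70_000_000_000))   # a 70B-parameter model -> 1.4T tokens
```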
- two over this side. This one first, I think. Where's our person with the microphone?
- Hi, I have two questions. Quick ones. What do you think computer vision will look like
- in say 20 years, and the second is, is superhuman vision possible?
- Do you mind if I don't answer the first question? Because I think these sorts of predictions,
- I know people always want us to make them, but they're always wrong. So whatever I say,
- it's going to be wrong really, because the field can move so quickly.
- Superhuman, yes. What would be superhuman, though? I think for sure, we can do things that
- are superhuman. Being able to search through 35 million images in a fraction of a second, surely
- that's superhuman, and we're going to be able to, for sure, search enormous satellite images
- spanning the whole world. We'll be able to find something instantly. We already can do superhuman
- things with computer vision, I think, and it will go on in terms of temporal resolution,
- spotting things which exist over long time scales that a human wouldn't notice, or a short time
- scale that a human wouldn't be able to see. All of these things will happen, yes. Once we can do a
- skill on a computer, we can make it superhuman.
- Thank you very much for the talk. You showed us the multimodalities of examples,
- and then in the end, you said that it also can be applied to the single modality, but with a single
- modality obviously then you have to figure out the augmentations, because it's less obvious how
- to compare positive and negative examples. Have you found that multimodal embedding and comparing
- the embeddings from multiple modalities learns better, because you don't have to engineer those
- augmentations, or do they perform on par or the single modality can perform actually better?
- You're right that with images you have to do augmentations. From a single image, the sort
- of thing you might do is crop the image and require the embeddings to match: if the image
- has got a horse in it and you crop out parts of it, each crop will still show a horse.
- So you want the embeddings computed from these various crops to all match, and that will then
- train the network to understand that the contents of the image shouldn't be affected by these crops.
- To answer your question empirically: learning from multiple modalities, certainly in video,
- generally works better than engineering all these augmentations, because multimodal learning
- is naturally providing them. But the other methods work very well as well. You just have to work harder to make
- them work and some of the people in this room have done multimodal and unimodal learning, but
- they all work well in the end. It's just that you have to do more work to make
- the unimodal ones work.
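The crop-based augmentation strategy described in that answer can be sketched as follows. The encoder is a toy linear map, and the point is only structural: several random crops of one image yield embeddings that a contrastive loss would pull together as positives, against crops from other images as negatives.

```python
import numpy as np

def embed(image, W):
    """Toy encoder: global average over the crop, then a linear map."""
    return image.mean(axis=(0, 1)) @ W

def augment(image, rng, size=16):
    """Random crop -- the kind of hand-engineered augmentation needed for
    single-modality self-supervision: each crop still shows the same
    object, so its embedding should match the other crops'."""
    y = rng.integers(0, image.shape[0] - size + 1)
    x = rng.integers(0, image.shape[1] - size + 1)
    return image[y:y + size, x:x + size]

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
W = rng.normal(size=(3, 8))
crops = [augment(image, rng) for _ in range(4)]
embeddings = np.stack([embed(c, W) for c in crops])
# A contrastive loss would pull these four embeddings together (positives)
# while pushing away embeddings of crops from other images (negatives).
print(embeddings.shape)
```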
- Thank you for the lecture. Very interesting. Just because I'm curious, I suppose,
- on the topic of solving for finding a speaker in a crowd.
- What if everyone is turned around? How do you then detect speaker?
- Yes, then it can't answer, of course; it has to see the lips. That's the answer, but there'll be
- other cues. If a person is speaking, I'm not going to turn around, but when I'm speaking,
- I move my hands. There's body language. There are other ways you can do this as well.
- I haven't shown that here but you can imagine if everybody was always speaking from behind,
- and the network had to solve the problem, it would learn something
- like that.
- Okay. I think this gentleman here will have the honour of the last question.
- Thank you. Yes, I was just wondering in terms of, there were several examples where, for example,
- the visual network would be working together with maybe an audio network, or it could be a
- descriptive network. I'm wondering whether the embedded space that you end up for the
- visual network can apply across kind of multiple different problems. I mean, you sort of alluded to
- that earlier, saying sometimes actually, even if you trained it on one problem, then it can
- actually be useful on other problems, and I'm wondering whether these embedded spaces end up
- sort of encapsulating an overall description of the image, which
- can be used in multiple different tasks, and whether those embeddings are a good gestalt
- of the whole thing that's being presented.
- Yes. The networks are trained on a particular task, but in order to solve that task, they have to do something more than you've
- trained them for. I gave the example of all the drums being embedded close together, all
- the guitars being close together, and once it's done that, then anything which is like a guitar,
- like a drum, it will embed to a certain point. So it's sort of learnt the characteristics and that
- applies to thousands of different categories and then you can use it for tracking guitars
- or other properties you might want for guitars. The answer is yes, basically.
- Okay. Well, that's super. Now before you applaud to thank Andrew again,
- I'm going to combine that task with presenting him with his scroll and medal. So Andrew, if you
- want to come out here because it'll make it easier for you. I will hand you this, if you want to hold
- that, and if you can do a trick of holding
- that at the same time and shaking my hand, and smile at the camera. Thank
- you all very much for coming. It was an excellent lecture. Thank you.
The Bakerian Prize Lecture 2023 is given by Professor Andrew Zisserman
Computer vision is a field where the goal is to enable machines to understand and use the visual content of images and videos in a similar manner to humans. In this talk Professor Zisserman will describe how machines are able to learn to recognise objects and actions from a temporal sequence of video frames, together with the audio and speech that accompanies them - an approach that is inspired by how infants may 'learn to see'. He will show applications of computer vision to image search, to recognising sign language (BSL), and to generating video descriptions for the visually impaired.
About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.