Computer vision: learning to see the world
Transcript
- Okay. I think we should make a start. Good evening, ladies and gentlemen. My name is
- Peter Bruce. I'm the Physical Secretary of the Royal Society and one of the vice presidents, and
- it's a great pleasure for me to welcome you all to Carlton House Terrace, the home of the Royal
- Society, and to this Bakerian award lecture. Now, let me just say a little bit about the
- Bakerian medal and lecture. It's one of our premier lectures. It is the premier lecture
- and medal in the physical sciences. It was established through a bequest of Henry Baker,
- who was a fellow of the Royal Society, who made a bequest of £100, and I shall read you the remit,
- if you like, for the Bakerian as written at that time, for a narration or discourse on such part
- of natural history or experimental philosophy at such time, and in such manner as the President
- and the Council of the Society, for the time being, shall please to order and appoint. Now
- you'll gather that it's not a relatively recently introduced lecture and medal; it originated in 1775,
- so it's been around for quite some time, and as well as the medal, it's accompanied by a gift of
- £10,000. Now, it's a real pleasure for me to introduce the medal winner and our lecturer this
- evening, Andrew Zisserman. Andrew is one of the principal architects of modern computer vision.
- His work in the 1980s on surface reconstruction and discontinuities is widely cited. He is
- best known for his leading work in the 1990s, establishing the computational theory of multiple
- view reconstruction and the development of practical algorithms that are widely used today.
- This culminated in the publication of a book in 2000, the turn of the century in other words,
- with Richard Hartley, which is already regarded as a standard text.
- His laboratory in Oxford is internationally renowned, and its work is currently shedding
- new light on the problems of object detection and recognition. So without any further ado,
- you want to hear from him I'm sure, not me, let me welcome Andrew to the stage to present his
- lecture as you see the title there, Computer Vision, Learning to See the World. Andrew.
- Okay, thank you for the introduction, Peter, and thank you for the prize. It's a great honour
- for me. It's a great honour for the computer vision field as well. So we're very grateful.
- So I'm going to talk on computer vision, learning to see the world, and I'm going
- to start off by saying what computer vision is. The aim of computer vision is to extract
- visual information from images and videos so that the computer can understand images and videos,
- much like a human would understand them. That's what we aim to do. So in this image here,
- we'd like to be able to answer questions like what is in the image? So the objects,
- in this case a person, that's in the image. Where are the things in the image? Meaning
- the spatial layout of the scene. The pose of the person. And then what is happening?
- Nothing much happening at the moment but if I play the video, it's more interesting. This is
- a clip from Singin' in the Rain. Okay, so as I say, the objective is to be able to carry out
- tasks like this, answer these questions, and the field has made quite some progress and we
- can do quite a few things now, especially since the advent of deep learning about a decade ago.
- So I'll show you some examples, and the first one is I'm going to show you a video sequence
- in the top left, and then various views of the 3D layout of the people and what they're doing,
- as they sort of collide with each other.
- So that's sort of the 'where' task. The next one I'm going to show you is the 'what' task. So here,
- you're seeing object detection and tracking. So these boxes, they're detecting the objects;
- following them as they move through the video, that's the tracking; and there's also recognition going on,
- maybe you can read the labels, but it's recognising various animals and objects.
- So there's a tiger. It recognises the tiger, deer. Also some vehicles you'll see in a moment.
- There's a motorbike. So all of these being recognised directly in the image and then tracked,
- and that was sort of what is in the image, what is in the video. And the next
- one I want to show you is action. So we can recognise human actions now and we can also
- recognise actions of humans and other animals, and I'm going to show you an animal example.
- So here, what you're going to see is chimpanzees and this is how computer vision can support other
- fields. So zoologists have hundreds of hours of videos that they want to analyse and annotate, and
- with computer vision, you can annotate the behaviours of chimpanzees automatically
- throughout the video. So here, you're going to see the behaviours of nut cracking and eating.
- Okay, now each of these tasks I've shown you, sort of recognition tasks
- of actions or what's in the image, has been done by a deep learning model,
- and a deep learning model for a visual task is trained in three steps.
- The first step is you construct a very, very large data set of images or videos that you
- label for the task you want, like recognising an image, then you choose or you design a deep
- model. This is where the deep part comes, a neural network. And third step is to train the model's
- parameters on this data set, and you train it by predicting the labels that are on the data set.
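The three training steps can be sketched in code. This is only a toy illustration, not the network from the lecture: a single linear layer stands in for the deep model, and clustered random vectors stand in for labelled images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a labelled data set. Normally a million labelled images;
# here, random feature vectors clustered around per-class means.
num_classes, dim, n = 5, 16, 200
means = 2.0 * rng.normal(size=(num_classes, dim))
y = rng.integers(0, num_classes, size=n)
X = means[y] + rng.normal(size=(n, dim))

# Step 2: choose a model. A single linear layer stands in for the
# deep network (which would have tens of millions of parameters).
W = np.zeros((dim, num_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Step 3: train the parameters to predict the labels
# (softmax cross-entropy, plain gradient descent).
lr = 0.1
for _ in range(200):
    p = softmax(X @ W)
    p[np.arange(n), y] -= 1.0        # gradient of the loss w.r.t. logits
    W -= lr * (X.T @ p) / n

train_acc = ((X @ W).argmax(axis=1) == y).mean()
```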
- I'll show you an example. So for the recognition I was showing you earlier, recognising the animals
- and the vehicles, we were using a network which had been trained to classify 1000 different object
- categories, and this is a picture of the network. So it takes in an image, in this case an elephant,
- and then the output is a choice of one of 1000 categories it's been trained to recognise,
- and this network has 30 million parameters. That's a lot of parameters, and so to train it,
- you need a very large data set, and it was trained on a data set of a million images, 1000 images for
- each of the 1000 categories. What that means is that somebody had to find 1000 images of monkeys
- like this, 1000 images of dogs, 1000 images of elephants, a colossal amount of work and
- that's what the network was trained on. Now I'm coming to the core of this lecture. This is
- not how a baby learns to see. So a baby is not shown a thousand images and told, this is a dog,
- a thousand images, this is a cat. That's not how a baby learns to see, and what I'm going
- to do in this lecture is explore how computers can learn to see more in a way that an infant
- learns to see and that is to learn directly from the data. That's what I'm going to show.
- So what this means is, we have these three steps for learning a visual task, and we're going to
- throw away the first step. We're not going to have a large data set that somebody has to construct.
- Instead, we're going to obtain the supervision to train the network directly from the data. Now
- this is not a new idea. Turing in 1950 wrote a paper and he said, instead of
- trying to produce a programme to simulate the adult mind, why not rather try to produce one
- which simulates the child's? If this were then subjected to an appropriate course of education,
- one would obtain the adult brain. Okay, so we're just following what Turing suggested in 1950.
- It also ties in with what's been found by psychologists who study
- cognitive development in infants, and what they found is the importance of
- data in developing intelligence: data from the physical world is needed.
- And this paper, which is very good to read, gives six lessons for developing intelligence in
- children, and for us too. And lesson number one is, be multimodal. And so I'm going to be multimodal.
- So for the next part, what I'm going to do next is I'm going to show how machines can learn directly
- from the data without having to have a labelled data set, and I'm going to particularly pick out
- the theme of correspondence between modalities, and that's what we're going to learn from.
- I'm going to show three different examples of this method of learning from the data, which is
- called self-supervision. In the first one,
- the modalities will be audio and visual, and I'm going to learn from the correspondence of those.
- Example number one. So here's an example, first of all, of what a synchronised signal is.
- So this is, of course, what we have in the world. We have these synchronised signals coming in, the
- audio and visual synchronised. We can, of course, make them unsynchronised by shifting the audio,
- and then it sounds like this. And just this
- difference: we get the synchronised signal for free, and we can also make it unsynchronised.
- That's going to be the supervision we're going to use to train a network,
- and I'm going to illustrate this with talking heads. The reason I'm doing talking heads, for
- two reasons. One, if we think of the Turing baby, what is the baby going to see first? It's going to
- see its parents speaking. That's one reason. The second is we as humans are very,
- very sensitive to lack of synchronisation in talking heads. I'll show you an example.
- I was in the camps yesterday talking to people. There are 1.3
- million earthquake survivors still living in those crowded camps.
- So I hope you can see, it's out of sync. So now, how are we going to train the network? We're
- going to train it to tell if the lip motion in a video sequence is synchronised with the audio or
- not. So that will be its task. We're going to give it a video sequence, a video clip,
- and the audio and say, are these synchronised? And that's going to be the training signal.
- The network itself is going to have a part which takes in a video clip; this part is going
- to be called a visual encoder, and it produces a vector.
- I'm also going to have an audio encoder that takes an audio clip, produces a vector. These vectors,
- these are lists of numbers, maybe 512 numbers, but you can think of it just as a point in 3D.
- And so the visual encoder predicts a point. The audio encoder predicts a point.
- And then the training is going to be that if the audio and the visual is synchronised,
- we want these points to be close together, and if the audio and visual are not synchronised,
- we want them to be far apart. So very, very simple.
- Where do we get the data from? We get it from just natural signals coming in. So we have some frames
- and the corresponding video - sorry, corresponding audio - that will be synchronised. We'll call that
- a positive sample, and we can get any number of positive samples from videos of talking heads.
- Now if we take the frames and apply a temporal displacement, the audio and visual will
- no longer be synchronised. So that will be what we call a negative sample. So we can generate
- these positive, negative samples effortlessly from any video coming in of a talking head,
- and we can get millions of these. So that will be our training data. So now we take this network,
- which has a visual encoder producing a vector, audio encoder producing a vector,
- and train it on all these samples, which we know are synchronised or not synchronised
- because we're creating them, and we train it so that it can tell if it's synchronised,
- the points it produces are close, and if they're not synchronised, they're not close.
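A sketch of how this training might be set up. Everything here is a toy stand-in: the encoders are fixed random projections rather than trained networks, and the margin loss is one common choice of contrastive loss; the lecture does not specify the exact loss used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two encoders: fixed linear maps into a shared
# embedding space (the real ones are trained deep networks).
dim_in, dim_emb = 32, 8
Wv = rng.normal(size=(dim_in, dim_emb))
Wa = rng.normal(size=(dim_in, dim_emb))

def visual_encoder(x):
    return x @ Wv

def audio_encoder(x):
    return x @ Wa

# One 'talking head' video: per-frame visual features and audio features.
T = 100
frames = rng.normal(size=(T, dim_in))
track = rng.normal(size=(T, dim_in))

def make_sample(t, shift=0):
    """shift == 0 gives a synchronised (positive) sample;
    any other shift gives an unsynchronised (negative) one."""
    return frames[t], track[(t + shift) % T], shift == 0

# Contrastive margin loss: pull synchronised pairs together,
# push unsynchronised pairs at least `margin` apart.
def contrastive_loss(v, a, is_sync, margin=1.0):
    d = np.linalg.norm(visual_encoder(v) - audio_encoder(a))
    return d ** 2 if is_sync else max(0.0, margin - d) ** 2
```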
- Now imagine we've done that and trained this on millions of examples. What we'll end up
- with is points which are close together when it's synchronised. So what I'm showing here
- is called an embedding space. This is where these vectors we've produced live. I'm showing it in 2D.
- So when we have a synchronised signal, then the points that are produced, the vectors that are
- produced by the video encoder and the audio encoder are close like this. If they're not synchronised,
- then they won't be close. That's what we've learned. We've told it to do this. We've trained
- it. It's done that. So now I'm going to show you, once we've done this, what can we use it for?
- The first thing we can use it for is to synchronise audio and visual signals when they're
- not synchronised. So the way this works is we know that when they're not synchronised, the audio and
- visual embeddings are going to be distant. We can then shift the audio and if we shift it and they
- become close, then they'll be synchronised. So we start with something that's unsynchronised,
- shift the audio until these vectors become close, then it will be synchronised. What that means
- we can do is we had this annoying example of out of sync, we can now synchronise that to this.
- Heavy rain and probably four or five hours of heavy rain ahead. I was in the
- camps yesterday talking to people, there are 1.3 million earthquake survivors.
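The offset search just described can be sketched as follows, with random vectors standing in for the per-frame embeddings produced by the trained encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame embedding vectors for video and audio. Toy setup: the
# audio embedding is an exact copy of the visual one, but the track
# has been delayed by 7 frames (the 'out of sync' clip).
T, dim = 200, 8
visual_emb = rng.normal(size=(T, dim))
audio_emb = np.roll(visual_emb, 7, axis=0)

def find_offset(visual_emb, audio_emb, max_shift=15):
    """Shift the audio embeddings and return the shift at which the
    mean embedding distance is smallest -- the synchronisation point."""
    best_shift, best_dist = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(audio_emb, s, axis=0)
        d = np.linalg.norm(visual_emb - shifted, axis=1).mean()
        if d < best_dist:
            best_shift, best_dist = s, d
    return best_shift

# A shift of -7 realigns the delayed audio with the video.
recovered = find_offset(visual_emb, audio_emb)
```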
- Okay, but more importantly, we can also use this network that we've trained to find out where in
- the image or the video the person speaking is, and the way this is going to be shown is we're
- going to have a video and an audio track, and we can produce what's called a heat map,
- which is hottest where the person speaking is. We're going to use this for localisation.
- Now this is a bit more technical, but the way this works is inside the network,
- we have to go into the network a bit, there's a spatial grid of vectors, and this spatial grid
- of vectors corresponds to the spatial grid of the pixels, and we can take the audio encoding vector
- and we can pick out the vector in the spatial grid which is closest and the one which is closest will
- be the one which is most synchronised, and that will be where the speaker is.
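That localisation step can be sketched like this; the grid of vectors and the audio embedding are random stand-ins for what the trained network would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inside the network: an H x W spatial grid of visual vectors (one per
# image region) and a single audio embedding vector. In this toy setup
# the audio vector matches the grid cell where the speaker is.
H, W, dim = 6, 8, 16
grid = rng.normal(size=(H, W, dim))
speaker_pos = (2, 5)
audio_vec = grid[speaker_pos] + 0.05 * rng.normal(size=dim)

def localise(grid, audio_vec):
    """Cosine similarity between the audio vector and every grid cell:
    a heat map that is hottest where the sound source is."""
    g = grid / np.linalg.norm(grid, axis=-1, keepdims=True)
    a = audio_vec / np.linalg.norm(audio_vec)
    heat = g @ a
    return heat, np.unravel_index(heat.argmax(), heat.shape)

heat_map, location = localise(grid, audio_vec)
```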
- Let's go inside the network a bit. So I'll show you some examples. On the left,
- you'll see the heat map, and on the right, you'll see a box around where the speaker is.
- It's the perfect place to come if you want to see old roses looking their absolute best
- and the very latest…
- and here's another clue, if you come…
- So that was one example where we could track somebody using their voice. Now I'm going to show you a case where we've
- got multiple people speaking on and off, and maybe some people are moving their lips, but
- they're not speaking. They're yawning or laughing. We can pick out the active speaker by this signal
- because we can find the pixels, if you like, which are synchronised with the voice we're hearing.
- By private concerns…
- FOIA, freedom of information request…
- Finally, we're not tied to humans. We've trained a network; we have ways of training networks
- which can pick out synchronised signals. So this can equally work for cartoons,
- where the mouth moves with the voice. So again,
- what you're going to see, I didn't say before, but the blue will be the active speaker and
- the red will be the inactive speaker.
- catch with me tonight.
- but give the monitor a kiss.
- So what you've seen is we can take something which arises from the world, the physics of the world, which is
- synchronisation, and then manipulate it slightly, train this network, which has tens or hundreds of
- millions of parameters, and then use this network to track the person who's speaking, for example.
- So I'm going to go on to the second example, which is audio visual correspondence beyond just talking
- heads. So now we're going to consider more general objects, more general scenarios where we have
- various objects that make sounds or actions that make sounds, and in terms of this Turing infant,
- we imagine that it's been watching its parents talking. Now it's sitting up, it's looking around
- at the world and looking and listening to objects around it. This is development. So the idea here
- is if you see an image like this, this is an image of drums, you know what it's going to sound like.
- And if you hear a sound like this… If you hear a sound like this…
- We
- know the answer. So obviously this is a guitar. So you have this semantic correspondence between
- what's in the image and the sound, and this arises just again from the physical world
- that in the physical world, you look at the scene, if something's sounding,
- you can see it and you can hear it. So this again, just arises from the physics of the world, and
- we're going to use this correspondence, semantic correspondence between the vision and the audio
- to learn from training the network. Just to note, this is a weaker requirement than synchronisation.
- We can do it from a single image. We don't actually need temporal information for this.
- We need the audio signal and an image. So the way to do it, I'm going to formulate it
- as a picking game. We're going to task the network to pick which of these images
- this sound corresponds to. So imagine the sound is actually a guitar and it has to pick out which of
- these it corresponds to, and it should pick this one. The way we're going to do this is again,
- distances. We're going to find which of these embeddings has the smallest distance and pick
- that one. So we'll have a similar network we're going to train as before, we take in a video clip,
- go through a visual encoder that produces a vector, a list of numbers, we take an audio clip,
- it goes through an audio encoder that produces a vector, and what we want is if the audio and
- visual correspond, then the distance between these vectors, these points, is small. If they
- don't correspond, then the distance between the points should be large, and that's it. So where
- do we get the data from? Well, we get the data from any videos we have. So here's two videos.
- We don't need to know what's in them, but they differ in this case. What we do know is that
- there is a correspondence between the sound and the frames. So now we can take samples from this,
- for training. So we take positive samples where we take a frame and the audio around it. We can
- take any number of these. Now how do we get negative samples? We simply take the audio
- from one video and a frame from another video, and in general, they won't correspond. And that's it.
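The sampling scheme can be sketched as follows. The 'videos' here are just identifiers; no labels appear anywhere, and the only thing exploited is knowing which video each frame and each audio clip came from:

```python
import random

random.seed(0)

# A handful of videos, each identified by an index. Each has
# clips_per_video frames and the matching stretch of audio.
videos = list(range(4))
clips_per_video = 50

def sample_positive():
    """A frame and the audio around it, taken from the same video."""
    v = random.choice(videos)
    t = random.randrange(clips_per_video)
    return ("frame", v, t), ("audio", v, t), 1

def sample_negative():
    """A frame from one video and audio from a different video --
    in general these will not correspond."""
    v1, v2 = random.sample(videos, 2)   # two distinct videos
    return (("frame", v1, random.randrange(clips_per_video)),
            ("audio", v2, random.randrange(clips_per_video)), 0)

batch = [sample_positive() for _ in range(8)] + [sample_negative() for _ in range(8)]
```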
- And we can do this, we have videos which arise from the world, have this correspondence property
- naturally, and we can sample millions of positive samples like this and train this network. So
- that's it. So again, imagine we've done this and we'll look at this embedding space where these
- vectors live. Then what we'll have learned is when the audio and visual correspond,
- the embeddings will be close together. And now if we have maybe another instrument like a drum,
- then the sound will be distant from the embedding from the guitar, but it will be close to the
- embedding of the image of the drum. So we have an embedding space that we've learned like this. Now
- what can we do with this? One thing we can do is
- what's called cross-modal retrieval. We can start with a sound, and now we can find images
- which correspond to this sound, and the way to do this is to populate the joint embedding space with
- frames from videos. So that's what I'm showing here. All of these points are frames from videos.
- And now we can look at the neighbourhood of where the sound's been embedded and pick
- frames which are close by and they must be corresponding. So I'll show an example. Here's
- a sound I'm going to play you. Now what is producing that sound? We can dive into this
- embedding space, look at nearby frames and find videos, and here are the videos that
- could have made that sound. So it's cross-modal. We start with audio and we find images.
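A minimal sketch of cross-modal retrieval as nearest-neighbour search in the joint embedding space, with random points standing in for frame embeddings and the query placed near one of them so there is something to find:

```python
import numpy as np

rng = np.random.default_rng(0)

# The joint embedding space, populated with frame embeddings from many
# videos (random points standing in for encoder outputs).
n_frames, dim = 1000, 32
frame_embs = rng.normal(size=(n_frames, dim))

# Embedding of a query sound. In the real system this comes from the
# audio encoder; here we place it next to frame 42 by construction.
query = frame_embs[42] + 0.01 * rng.normal(size=dim)

def retrieve(query, frame_embs, k=5):
    """Return the indices of the k frames nearest to the query sound."""
    d = np.linalg.norm(frame_embs - query, axis=1)
    return np.argsort(d)[:k]

nearest = retrieve(query, frame_embs)
```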
- Now I'm going to show another use of this embedding space. So we've
- trained it so that when there's a correspondence between the audio and
- visual, that the embedding vectors are close. So imagine here we have the sound of a guitar.
- Then any other image of a guitar should be embedded close by to this
- because that's what we've trained it to do. So by transitivity, what's happened with this network is
- it's learned to embed all of the objects at the same class together. That's what it has to learn
- to do, and this is, in fact how it solves the problem. How else could it solve the problem of
- determining the correspondence between the audio and visual, unless it was doing this?
- So we've actually learned a visual network
- which embeds objects of the same class close together,
- and now we can use this for visual retrieval. So we can start off with a frame of a video,
- put it into the embedding space, populate this with other videos, and now look at the neighbours
- of this and these will be other videos. So we can start with a video, search for similar videos,
- or start with a frame and search for similar frames. So here's an example. We start with a
- frame of a guitar, search in a few hundred thousand images, and these are the ones
- that are nearby. As you see, we start from acoustic guitar, it found acoustic guitars.
- Another query. We start with a drum. Search inside this embedding space, we can find images of drums.
- And that's all been… The point is, all this has been learned simply from taking samples
- from videos where the audio and visual correspond. So that's all we had to do.
- We've trained this visual network and now we can use this visual network for recognition.
- We can also use it for localisation. So as we saw in the synchronisation case,
- we go inside the network. The network has this spatial grid of vectors
- to find out where the object is that's making the sound. We take the audio embedding. We look at
- the spatial grid of vectors, find the closest vector, and that will be where the object is
- that's making the sound. So I'm going to show you an example now. You're going to see a video and
- frame by frame, you're going to see the heat map in the centre overlaid on the frame, and then on
- the right, you'll see the heat map itself. As I said, this will all be done frame by frame.
- All these different instruments, you see how it's localising them.
- So all of that learned. Now the third example, we're going to change the modality now. So far,
- we have audio and visual. Now we're going to change to language or text and visual.
- So in terms of our infant, by about ten or 12 months, infants can start to understand words
- and speak words, and eventually they'll learn to read like this. And that's
- what we're going to do now. I've done it in this order, starting with audio and
- visual and then moving on to language, because an infant learns to speak after it's learned to see.
- Okay, let's just go back to our cognitive psychologists. They gave six lessons,
- and lesson number six is learn a language. So we're still following their six lessons.
- So we've seen that we can train networks like this. I showed you in the audio
- case where we have a visual encoder and an audio encoder, and then we have what's called a
- contrastive loss. So we minimise the distance when there's a correspondence between the outputs. And
- very simply, to change the modality from audio to text, we can just swap this audio encoder
- for a text encoder. So now we have text which corresponds to the video. So the text here is a
- man is playing an electric guitar. We have a text encoder and we can use exactly the same idea, this
- correspondence idea that if this text description, the sentence corresponds to this image, describes
- this image, then the output, these vectors should be close together, and if it doesn't describe this
- image, it describes some other image, then the output, these vectors should be far apart. And
- that's it. So we've got our network. Now, how do we train it? So to train this, we need to
- have paired data between images and text. So text which describes images. Where do we get that from?
- Fortunately, on the internet there's something called alt text, which is,
- if you hover your mouse over an image, you often see a sentence comes up and it's provided,
- so that you don't have to download the image or for the visually impaired, it can be read out. So
- this alt text is available in massive quantities. There are millions or billions of examples of alt
- text available. Now I've put some examples on this slide. On the left hand side, the far left,
- the alt text is trees in a winter snowstorm. It's describing the image,
- and the one on the far right is façade of an old shop. So this is available easily,
- and we can train this network by getting millions of examples of these paired visual text, and
- as before, just taking positive ones where they correspond, the distance should be small and
- when they don't correspond. So we pick a random image and a random text, they don't correspond,
- the distance should be large and that's it. We train the network. So again, imagine we've
- trained the network and we look at the embedding space where these vectors live, then what we'll
- have is if the text corresponds to the image as it does here, the embeddings will be close.
- And if we have another text, a man playing a guitar sitting down that doesn't describe this
- image on the left, so it will be far away, but it will be close to the actual image where a man is
- playing a guitar sitting down. So we've got this embedding space again. Now how do we use this? So
- we're going to use this for search and retrieval of images and videos using language. This is
- really useful. So once again, we have a joint embedding space. We can populate this with images.
- So these dots now represent images that have been encoded, the vectors from those, and say we want
- to find a particular image and we want to find it using language. So we describe what we want.
- So here's a sentence, car in a river. We embed that in this space and then we again look for
- neighbours of this. So here's a neighbour. This will then correspond to an image of what we were
- looking for because of the way we've trained it. So now I'm going to show you a demonstration of
- this. The really remarkable thing about language in terms of communicating with the machine is
- you can keep on adding words in language. You can make queries even more complex
- so you can keep on adding requirements. I'll show you how that works in the demo.
- So this will be a demo searching 35 million images from Wikimedia Commons,
- and you'll see the text being typed in and the retrieval will come immediately.
- We'll start off with something quite simple. So the first one is a red car, and there we are.
- Now we make it slightly more nuanced. So it's now a sports car. There it is.
- And now more interesting, several requirements. Person riding a bike. There we are. Change bike
- to horse. Fine. Now make it even more demanding. Riding a horse but jumping. And there it is. And
- so on. We can also search for animals, and we can search for animals doing particular things. Here's
- penguins raising their wings. What you're seeing here, through this embedding,
- really feels like communicating with the machine, because you get this instant response
- and you can keep on making the search query you're looking for more and more precise.
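The way adding words refines a query can be illustrated with a toy embedding. Here each 'image' is a hand-made bag of attribute words rather than an encoder output, but the ranking mechanism, cosine similarity in a shared space, is the same idea:

```python
import numpy as np

# Toy joint text-image space: each image is described by a bag of
# attribute words, and queries are embedded the same way, so adding
# words to a query sharpens the ranking.
vocab = ["red", "car", "sports", "horse", "person", "riding", "jumping"]
images = {
    "red_car":        {"red", "car"},
    "sports_car":     {"red", "car", "sports"},
    "person_riding":  {"person", "riding"},
    "horse_jumping":  {"person", "riding", "horse", "jumping"},
}

def embed(words):
    """Normalised bag-of-words vector over the vocabulary."""
    v = np.array([1.0 if w in words else 0.0 for w in vocab])
    n = np.linalg.norm(v)
    return v / n if n else v

def search(query):
    """Return the image whose embedding is most similar to the query."""
    q = embed(set(query.split()))
    scores = {name: float(embed(tags) @ q) for name, tags in images.items()}
    return max(scores, key=scores.get)

# Adding a word makes the query more precise, as in the demo.
first = search("red car")
refined = search("red car sports")
```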
- Okay, so that's the three examples I wanted to show you of learning from data. So that's sort of
- the tutorial part of the talk. Now I'm going to finish by giving you two snapshots of research,
- more recent research that build again on this type of self-supervised learning of correspondence
- between modalities. So we've done all this work. Now let's use it for some applications. So I'm
- going to show two applications. One is going to be recognising British Sign Language,
- and then I'm going to do audio description of videos. Number one, British Sign Language. So this
- is a visual language that the deaf community uses in Britain, and here's an example.
- I don't know how many of you can read British Sign Language. So I'll tell you what she was doing.
- She was interpreting the sentence, 'Every spring, our planet is transformed.' Okay,
- I'm going to play it again, and you can look for the sign for planet.
- In fact, she does seven signs in that short sequence and it's very challenging to spot them
- all. We would like, of course, to have a machine that could understand British Sign Language for
- many reasons. One is so that then deaf people can communicate with machines. At the moment,
- we can speak using Alexa, we can speak to machines and get them to do what we want, but
- if a deaf person wants to do that, they have to type it. It's much better if they could use their
- own language to communicate, and of course, it would be very good if they could communicate with
- non-signers and the machine could help do that, could translate. So that's why we want to do this.
- How do we do it? Where do we get our data from? Where do we get our paired
- data from? And the answer is we get it by watching television, because on television
- you'll have seen signs overlaid in television programmes like this, and you have subtitles which
- correspond to what's being said, and the signer is also interpreting what's being said. So you have a
- pair data, a correspondence between the subtitle and the sign sequence,
- and this correspondence is what we can learn from as we've been doing all the way through this talk.
- Now, the BBC have very generously made available 2000 signed programmes together
- with subtitles to support academic research on recognising British Sign Language. I'm
- going to show you some work we've been doing on this large data set that they released,
- and what I'm actually going to show you is how we can recognise signs using mouthings. What I
- mean by mouthings is that, when signers are signing, sometimes they mouth the words that they're
- signing. Not always, but sometimes they do. That's what we can pick up on. So I'll show you
- some examples. On the left, you're going to see the sign for office, but also he mouths office.
- On the right, he's going to do the sign for tree and he's going to mouth tree.
- So why is this useful? It's useful because we can pick up words that are being mouthed on the lips.
- We know how to do that. So how this is going to work is imagine we have a subtitle like this
- clip here. Are you happy with this application? Now we can look for each word in this subtitle,
- happy, application, and see if it's mouthed. Now in this example, she does mouth happy.
- So we look at the lips and we find where happy is being mouthed,
- and once we've done that, then we know the temporal segment where she
- mouthed happy, and because the sign is made at the same time as the mouthing, we know
- the sign. So we have a way of automatically annotating the data and getting the signs,
- and the way we do the spotting on the lips, it uses this synchronisation network I showed you
- in the first example. That's actually how we do it. So that was one example. Now imagine
- we do this at industrial scale. We scale up. We do it on the BBC 2000 programmes.
- So we take the words that occur in the subtitles, we look at all the subtitles as they occur and we
- see whether the person is mouthing that word, and then we take that segment and that will be
- the sign corresponding to the word. I'll show you some examples. So first of all, for family,
- important, you see the word important. You see we're getting all these different examples,
- before. So actually, if you look at this one, before, it's signed in two different ways in
- these examples. You can pick out some signers doing it one way, some signers doing it another way.
- Perfect. Now we're getting these signs from mouthings, but of course once we have the sign,
- we can learn to recognise it just from the hand movements and hand gestures. So then we'll be
- able to recognise it, whether they mouthed or not, and for each word, we can generate
- hundreds of examples, hundreds of thousands of examples in total. You're seeing these
- examples here. So it really is quite a powerful method, in fact,
- because we can generate signs for thousands of different words and hundreds of examples, say,
- for each one, and now we have a way where we can learn all these signs and recognise them
- by computer. This problem is not solved but this is a way of generating the data.
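The mouthing-based annotation pipeline described above can be sketched in code. Everything here is a toy stand-in: `lip_scores` plays the role of a keyword-spotting model run on lip crops (in the talk, the synchronisation network does this job), and the threshold and padding values are purely illustrative.

```python
import numpy as np

def spot_sign_from_mouthing(subtitle_words, lip_scores, frame_times, threshold=0.5):
    """For each subtitle word, check whether a mouthing detector fired on the
    lip region; if so, return the temporal segment as an automatic sign label."""
    annotations = []
    for word in subtitle_words:
        scores = lip_scores.get(word)          # per-frame mouthing probability
        if scores is None:
            continue
        peak = int(np.argmax(scores))
        if scores[peak] >= threshold:          # the word was mouthed
            start = frame_times[max(peak - 5, 0)]                    # pad ~5 frames
            end = frame_times[min(peak + 5, len(frame_times) - 1)]
            annotations.append({"word": word, "start": start, "end": end})
    return annotations

# Toy example: "happy" is mouthed around frame 12, "application" is not.
times = np.arange(20) / 25.0                   # 20 frames at 25 fps
scores = {"happy": np.exp(-0.5 * ((np.arange(20) - 12) / 2.0) ** 2),
          "application": np.full(20, 0.1)}
print(spot_sign_from_mouthing(["happy", "application"], scores, times))
```

Each returned segment then serves as a training example pairing the word with the co-occurring sign.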
- Now, the second application I want to show you a snapshot of is audio description of video.
- Audio description is a soundtrack that's provided for the visually impaired and it describes the
- visual elements of the television programme or the movie so that they can understand what's going on.
- I'm going to show you a short example of audio description, the type of thing that's available.
- This is for the film Out of Sight.
- She turns her head and finds Jack standing beside her.
- Can I buy you a drink?
- Sit down.
- then places his lighter on the table. She opens her mouth as if to speak, but no words come.
- So you can see how the audio description is complementary to the soundtrack.
- So the things that you couldn't tell from what's being said or the music,
- that's what it's providing, and then someone who's blind can understand what's going on in the film.
- So we'd like to be able to generate these automatically. So we'd like to have a machine
- that takes in the video and then produces the audio description, probably as text,
- and then we have a text to speech that will read it out. So the visually impaired can follow it.
- So how would we do that? So we obviously need to supply the video to a model we're going to
- train, but we have to do more than that, because audio descriptions have the names. You
- heard the names of the characters. So we also have to provide the names of the characters. We have to
- provide a character bank of people who are in the film. So we need this auxiliary information. Now,
- given those two inputs, then we want to train a model to produce the audio description.
- Now where are we going to get the training data? And we need paired data between films
- and audio description. Fortunately, volunteers have provided audio descriptions for thousands
- and thousands of films, so this paired data is readily available, and as you've seen,
- we can learn from these corresponding data. So we have films. We have the audio descriptions. We can
- learn a model which generates audio descriptions. I'm going to show you two examples of audio
- descriptions that we've generated. Again, this is still a work in progress. It's not finished.
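The audio-description pipeline just described (video plus a character bank in, description text out, then text-to-speech) can be sketched as follows; the `captioner` and `tts` components are hypothetical stubs standing in for the trained models.

```python
def generate_audio_description(video_frames, character_bank, captioner, tts):
    """Hypothetical pipeline: a trained captioning model consumes the video
    together with a bank of character names so it can refer to people by name,
    emits the description as text, and a text-to-speech stage reads it out."""
    text = captioner(video_frames, character_bank)   # e.g. "Snape points at Harry."
    audio = tts(text)                                # waveform for the AD track
    return text, audio

# Stub components standing in for the trained models.
captioner = lambda frames, bank: f"{bank[0]} looks at {bank[1]}."
tts = lambda text: [0.0] * len(text)                 # silent placeholder waveform

text, audio = generate_audio_description(["frame0", "frame1"],
                                         ["Snape", "Harry"], captioner, tts)
print(text)
```

The key point from the talk is the auxiliary input: without the character bank, the model has no way to produce names.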
- They're both from Harry Potter, and the first one is a painful example.
- Concentrate, Potter. Focus.
- Okay, so that's the clip, and this was the audio description that we predicted
- for that. So, Snape points at Harry. It got the characters right. Harry closes
- his eyes in horror. There's more pain I'd say, but this is what's produced.
- Second example, a more pleasant example.
- So how are we going to get to London?
- There we go. The audio description that was produced was Hermione, Ron, and Luna's eyes are
- fixed on Harry, who is standing in the doorway. That's correct. That's what happened. Then Harry
- rides on a horse's back as a horse rears up in the air. So this model thinks this is a horse,
- but of course, it clearly is not if you've read Harry Potter. So there's more to do here.
- Okay, so that's the end of my snapshot. So I'm finishing now. This is
- what you've seen. You've seen that it's possible to learn visual encoders directly from data
- in various ways. There's no need for manual supervision,
- which is the traditional way of doing this, and I've gone through a learning curriculum for
- a virtual infant: audio-visual synchronisation, audio-visual correspondence and language-visual
- correspondence. That's what you've seen. I should just mention that the computer vision field
- works on this problem a lot, and even though I've shown cross-modal learning, you can also learn
- visual encoders purely from the visual stream. So deaf people can see as well, of course.
- I'd like to end by thanking people. So of course, this work is not done by me by any
- means. It's done by my students and my postdocs, and I'm always inspired by talking to colleagues,
- at Oxford in the UK and at DeepMind internationally. It wouldn't be possible without all these
- people and a lot of them are here in the audience. So that's great. So thank you.
- Thank you very much, Andrew, for a super lecture. Very stimulating. We have time for some questions.
- So who would like to start us off? Now there are a couple of people with microphones,
- so if you put your hand up, someone will come to you with a microphone. So let's
- start over here on my right, your left, with the first question. We also have people because
- of course this is being live streamed. So we have people who can ask questions on Slido,
- and I think one of my colleagues is hovering around with an iPad, and will wave at me if we
- have some questions on Slido. Please go ahead.
- How far are you away from having real-time, live access to this system?
- So there are lots of systems I showed here. The real time demo I was showing, that's real time. So
- you can type it and it will immediately retrieve images or videos from a large data set.
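A minimal sketch of the kind of embedding-based retrieval behind that demo, assuming a pre-computed, normalised bank of image embeddings; the toy `embed` function here stands in for a trained text encoder. The per-query cost is one embedding plus one matrix product against the index, which is why retrieval can feel instant even over millions of images.

```python
import numpy as np

def retrieve(text_query, embed_text, image_index, image_embeddings, k=3):
    """Embed the query once, then rank a pre-computed bank of image
    embeddings by cosine similarity (embeddings are pre-normalised)."""
    q = embed_text(text_query)
    q = q / np.linalg.norm(q)
    sims = image_embeddings @ q
    top = np.argsort(-sims)[:k]
    return [image_index[i] for i in top]

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 64))                    # 1,000 toy image embeddings
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
names = [f"img_{i:04d}" for i in range(1000)]
embed = lambda text: bank[hash(text) % 1000]          # toy encoder for the demo
print(retrieve("a horse on a beach", embed, names, bank))
```

In practice the index over millions of images would use an approximate nearest-neighbour structure rather than a dense matrix product, but the principle is the same.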
- Right. What I was asking about is more the video. So at the
- moment it's trained on videos online, I take it, and if you
- were to have a device that could see someone talking in real time, like for instance, the
- sign language, could that be then translated?
- To be clear, we can't do continuous sign language yet. At the moment most of these methods run on
- big machines, GPUs and so on, but there's lots of work on taking these big models and distilling them
- down to smaller models in various creative ways. So even though at the moment they run on
- GPUs, etc., some of these models already can run in your browser. You
- can do real-time pose detection of humans in your browser, and ASR,
- automatic speech recognition, can be done in your browser too. So these models start large but then
- once they're ready, they can be made smaller and more portable. Does that answer your question?
- Yes, that's basically it, really, yes. Great, thank you.
- I wouldn't say all the models though. I mean, some of the models are too large at the moment to do
- that, of course, but that's the way it goes.
- Okay, the next question just a few rows back, I think. There we are.
- Congratulations on the prize. Thanks for the lecture. So it feels like for different tasks,
- maybe you have to do very specific data processing, right? Which you explained now. Do
- you see a way of doing self-supervision with a fairly general type of data processing which you
- can apply later to very different tasks, maybe like what ChatGPT does for text? Thank you.
- Thank you for the question. Yes, I've concentrated on visual tasks here,
- but we already know the answer to that. The answer is yes. I've shown some different types of
- self-supervision here, and there are many others, as I said, and once the networks have been trained
- in some way by these self-supervised methods, they then can be used for multiple tasks by
- applying what are called different heads. So they can be used for recognition, object detection,
- tracking. Once you have good features, a good network, it can do multiple tasks,
- and this is really the way that large scale networks are trained nowadays,
- but there's still the issue of having to say what the tasks are. It's still a research question to
- train a network so that it immediately does all the tasks you want it to do, predict depth,
- predict other things about images or videos. There's still work to be done here, but we
- certainly have lots of evidence that a good visual backbone enables lots of tasks after that.
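The backbone-plus-heads pattern just described can be sketched as below: one shared encoder (here a toy linear map, not any real architecture) feeds several small task-specific heads, so new tasks only need a new head, not a new backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(32, 16))       # stands in for a trained encoder

def backbone(image):
    """Stand-in for a self-supervised visual encoder: image -> feature vector."""
    return np.tanh(image @ W_backbone)

# Task-specific "heads": small layers attached to the shared features.
heads = {
    "classify": rng.normal(size=(16, 10)),   # 10 object classes
    "depth":    rng.normal(size=(16, 1)),    # scalar depth estimate
}

def run_task(image, task):
    features = backbone(image)               # shared, trained once via self-supervision
    return features @ heads[task]            # only the head is task-specific

img = rng.normal(size=32)
print(run_task(img, "classify").shape, run_task(img, "depth").shape)
```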
- Okay. So I think we have a question there. If you go one row back first, I think, if you pass
- it back and then I've got the order correct.
- How do you decide how many dimensions the embedding space should be?
- It's usually a power of two. It's a good question, and there's not a good answer. There's
- also storage to consider: larger vectors require more storage. But no, I don't have a good answer.
- It's an empirical question, really. You try different sizes and you determine what works best
- empirically. Thank you for asking. I don't know if anybody else here has got a better answer.
- Any volunteers? No, probably not. Okay. Go ahead please.
- Thank you very much for the talk, Andrew. Knowing what you know now about this research project,
- which parts were the hardest? Was it data collection, data prep? Was it designing
- the encoder, the neural net architecture? For somebody else that's working
- on a similar research project, what advice, knowing what you know, would you give?
- Each stage of these projects, you stumble across something. It's always the way with
- projects. You have an idea or somebody has an idea and you start doing it and then
- unexpected things happen. So it's difficult to answer, really. Sometimes the networks are
- hard to train. Sometimes the data that you think is good is not good and when you're
- putting together so many things like this, unexpected things happen. One of my rules is
- when you have data sets, they're always noisy. I mean, it always happens.
- There's always something wrong and you always have to look at your data and see what's going
- on. So I can't give you a definite answer. It's just that every stage always has problems.
- Okay. So the question at the front here.
- Thank you.
- great lecture. So with the first method you showed with temporal alignment and so on,
- and you argued that that could be how babies might learn to see as a cue, but with the later work,
- with things like text to object, that still seems to require a huge amount of data. Is there any
- argument that might be similar to ways that babies learn to see and correspond with language?
- It's good to ask. I think the huge amount of data in the text case is an Achilles heel.
- That's a problem at the moment that you need so much data to learn from. I think
- it's a research question how you can avoid needing so much data. By the way,
- I wasn't saying this is how babies necessarily learn. In principle,
- they can learn this way because we can see that just from the data, you can learn these tasks,
- these skills. Not saying necessarily that they do this at all. It's just that the information
- is there, and the order in cognitive development is that after they've learnt to see and hear, we
- know that it's later that they acquire language. That's the only point I was making there, but yes,
- how to avoid having to use such vast quantities of data in the text case, I don't have a good
- answer to that. In the audio and visual case, it's readily available. There's no cost to that.
- Okay. Question here.
- Just keep your hand up, please, yes, that's great, so we can see. Thank you.
- Hi. Thank you. I am not in the field, so my question might be stupid, but I'm just wondering
- for the correspondence of synchronisation, what if there is some false correspondence? Let's say I'm
- waving my hand, but this actually corresponds with a sound which is not in the image at all.
- Would your model be able to pick that out?
- So it's the large scale data that helps avoid problems like that. You always have,
- we can call it noise, things which don't correspond to what's making the sound,
- but when you see enough examples, you can pick out the ones that matter and the ones that
- don't. So that's the answer to that question, really. If you think of the talking head,
- there are lots of things going on. The person who is talking, their eyes might flutter,
- their hair might blow in the wind, but in order for it to solve the task of learning
- synchronisation, it has to see that what really matters is the lips, because it's the lips that
- are synchronised with the voice, with the speech. So it gets to ignore all these nuisance factors,
- this sort of noise, and pick out what really matters, otherwise it can't solve the task.
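The synchronisation objective can be sketched as a simple contrastive loss, in the spirit of SyncNet-style training but heavily simplified: aligned audio/video feature pairs are pulled together, temporally shifted pairs pushed apart. The features below are random stand-ins, not outputs of any real network.

```python
import numpy as np

def sync_loss(video_feats, audio_feats, margin=1.0):
    """Contrastive synchronisation objective (simplified): aligned pairs
    should be close, temporally shifted pairs at least `margin` apart.
    To drive this loss down, a network is forced to latch onto whatever
    genuinely co-varies with the sound -- the lips -- and ignore nuisance
    motion like hair or eye blinks."""
    pos = np.linalg.norm(video_feats - audio_feats, axis=1)                 # aligned
    neg = np.linalg.norm(video_feats - np.roll(audio_feats, 1, 0), axis=1)  # shifted
    return np.mean(pos ** 2 + np.maximum(margin - neg, 0.0) ** 2)

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))              # 8 time steps of 16-d features
good = sync_loss(v, v)                    # perfectly synchronised features
bad = sync_loss(v, np.roll(v, 1, 0))      # audio shifted by one step
print(good < bad)
```

With synchronised features the loss is near zero; shift the audio stream and it grows, which is exactly the signal the network trains on.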
- Andrew, how many is enough?
- It varies in different circumstances. It's not the same in all circumstances.
- It's a question we always fight with. We keep on going empirically until things work well,
- and then you see if you can train more efficiently, meaning less data,
- but, waving my hands, I'd say a million samples, because that's typically what we use,
- because it's so easy as well to get this data.
- apply generically to these things?
- There are the scaling laws, what are called scaling laws, where you can say, for the number of parameters,
- how much data do you need given a certain training budget, and this is published work for training
- models. Again, it starts off empirically. It's not like physics or geometry where you can give
- first-principles arguments; this field doesn't have things like that. It's much more empirical
- and then generalising from that.
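As a concrete (and heavily simplified) instance of such a scaling law, the published compute-optimal result of Hoffmann et al. (the "Chinchilla" work) is often summarised as a rule of thumb of roughly 20 training tokens per model parameter. The real result is a fitted curve, not a single constant, so this is only illustrative:

```python
def chinchilla_tokens(n_params):
    """Rule of thumb distilled from compute-optimal scaling-law work
    (Hoffmann et al.): train on roughly 20 tokens per model parameter.
    Purely illustrative -- actual scaling laws are fitted empirically."""
    return 20 * n_params

print(chinchilla_tokens(70_000_000_000))   # a 70B-parameter model -> 1.4T tokens
```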
- two over this side. This one first, I think. Where's our person with the microphone?
- Hi, I have two questions. Quick ones. What do you think computer vision will look like
- in say 20 years, and the second is, is superhuman vision possible?
- Do you mind if I don't answer the first question? Because I think these sorts of predictions,
- I know people always want us to make them, but they're always wrong. So whatever I say,
- it's going to be wrong really, because the field can move so quickly.
- Superhuman, yes. What would be superhuman, though? I think for sure, we can do things that
- are superhuman. Being able to search through 35 million images in a fraction of a second, surely
- that's superhuman, and we're going to be able to, for sure, search enormous satellite images
- spanning the whole world. We'll be able to find something instantly. We already can do superhuman
- things with computer vision, I think, and it will go on in terms of temporal resolution,
- spotting things which exist over long time scales that a human wouldn't notice, or a short time
- scale that a human wouldn't be able to see. All of these things will happen, yes. Once we can do a
- skill on a computer, we can make it superhuman.
- Thank you very much for the talk. You showed us the multimodalities of examples,
- and then in the end, you said that it also can be applied to the single modality, but with a single
- modality obviously then you have to figure out the augmentations, because it's less obvious how
- to compare positive and negative examples. Have you found that multimodal embedding and comparing
- the embeddings from multiple modalities learns better, because you don't have to engineer those
- augmentations, or do they perform on par or the single modality can perform actually better?
- You're right that with images you have to do augmentations. From a single image, the sort
- of thing you might do is crop the image and require the embeddings to match: if the image
- has got a horse in it and you crop out parts of it, each crop will still show a horse.
- So you want the embeddings computed from these various crops to all match, and that will then
- train the network to understand that the contents of the image shouldn't be affected by these crops.
- To answer your question empirically: learning from multiple modalities, certainly in video,
- generally works better than engineering all these augmentations, because multimodal learning
- is naturally providing them. But the other methods work very well as well. You just have to work harder to make
- them work and some of the people in this room have done multimodal and unimodal learning, but
- they all work well in the end. It's just that you have to do more work to make
- the unimodal ones work.
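The crop-based augmentation strategy described in that answer can be sketched as follows. The encoder is a toy linear map, and the point is only structural: several random crops of one image yield embeddings that a contrastive loss would pull together as positives, against crops from other images as negatives.

```python
import numpy as np

def embed(image, W):
    """Toy encoder: global average over the crop, then a linear map."""
    return image.mean(axis=(0, 1)) @ W

def augment(image, rng, size=16):
    """Random crop -- the kind of hand-engineered augmentation needed for
    single-modality self-supervision: each crop still shows the same
    object, so its embedding should match the other crops'."""
    y = rng.integers(0, image.shape[0] - size + 1)
    x = rng.integers(0, image.shape[1] - size + 1)
    return image[y:y + size, x:x + size]

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
W = rng.normal(size=(3, 8))
crops = [augment(image, rng) for _ in range(4)]
embeddings = np.stack([embed(c, W) for c in crops])
# A contrastive loss would pull these four embeddings together (positives)
# while pushing away embeddings of crops from other images (negatives).
print(embeddings.shape)
```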
- Thank you for the lecture. Very interesting. Just because I'm curious, I suppose,
- on the topic of solving for finding a speaker in a crowd.
- What if everyone is turned around? How do you then detect speaker?
- Yes, then it can't answer, of course; it has to see the lips. That's the answer, but there'll be
- other cues. If a person is speaking, I'm not going to turn around, but when I'm speaking,
- I move my hands. There's body language. There are other ways you can do this as well.
- I haven't shown that here but you can imagine if everybody was always speaking from behind,
- and the network had to solve the problem, it would learn something
- like that.
- Okay. I think this gentleman here will have the honour of the last question.
- Thank you. Yes, I was just wondering in terms of, there were several examples where, for example,
- the visual network would be working together with maybe an audio network, or it could be a
- descriptive network. I'm wondering whether the embedded space that you end up for the
- visual network can apply across kind of multiple different problems. I mean, you sort of alluded to
- that earlier, saying sometimes actually, even if you trained it on one problem, then it can
- actually be useful on other problems, and I'm wondering whether these embedded spaces end up
- sort of encapsulating an overall description of the image, which
- can be used in multiple different tasks, and whether those embeddings are a good gestalt
- of the whole thing that's being presented.
- Yes. The networks are trained on a particular task, but in order to solve that task, they have to do something more than you've
- trained them for. I gave the example of all the drums being embedded close together, all
- the guitars being close together, and once it's done that, then anything which is like a guitar,
- like a drum, it will embed to a certain point. So it's sort of learnt the characteristics and that
- applies to thousands of different categories and then you can use it for tracking guitars
- or other properties you might want for guitars. The answer is yes, basically.
- Okay. Well, that's super. Now before you applaud to thank Andrew again,
- I'm going to combine that task with presenting him with his scroll and medal. So Andrew, if you
- want to come out here because it'll make it easier for you. I will hand you this, if you want to hold
- that, and if you can do a trick of holding
- that at the same time and shaking my hand, and smile at the camera. Thank
- you all very much for coming. It was an excellent lecture. Thank you.
The Bakerian Prize Lecture 2023 is given by Professor Andrew Zisserman
Computer vision is a field where the goal is to enable machines to understand and use the visual content of images and videos in a similar manner to humans. In this talk Professor Zisserman will describe how machines are able to learn to recognise objects and actions from a temporal sequence of video frames, together with the audio and speech that accompanies them - an approach that is inspired by how infants may 'learn to see'. He will show applications of computer vision to image search, to recognising sign language (BSL), and to generating video descriptions for the visually impaired.
About the Royal Society
The Royal Society is a Fellowship of many of the world's most eminent scientists and is the oldest scientific academy in continuous existence.