Mark Hamilton, an MIT Ph.D. student in electrical engineering and computer science and affiliate of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate. To do that, he set out first to create a system that can learn human language "from scratch."
"Funny enough, the key moment of inspiration came from the movie 'March of the Penguins.' There's a scene where a penguin falls while crossing the ice, and lets out a little belabored groan while getting up. When you watch it, it's almost obvious that this groan is standing in for a four-letter word. This was the moment where we thought, maybe we need to use audio and video to learn language," says Hamilton. "Is there a way we could let an algorithm watch TV all day and from this figure out what we're talking about?"
"Our model, DenseAV, aims to learn language by predicting what it's seeing from what it's hearing, and vice-versa. For example, if you hear the sound of someone saying 'bake the cake at 350' chances are you might be seeing a cake or an oven. To succeed at this audio-video matching game across millions of videos, the model has to learn what people are talking about," says Hamilton.
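For readers curious about the mechanics, the "matching game" described here is a form of contrastive learning between paired audio and video. The sketch below is not the actual DenseAV code; the feature shapes, pooling choices, and hyperparameters are illustrative assumptions, but it shows the basic idea of rewarding a model for pairing each soundtrack with its own video within a batch.

```python
# Minimal sketch of an audio-video "matching game" objective in PyTorch.
# NOT the actual DenseAV implementation; encoders, pooling, and the
# temperature below are illustrative placeholders.
import torch
import torch.nn.functional as F

def clip_similarity(audio_feats, video_feats):
    """Score how well an audio clip matches a video clip.

    audio_feats: (T, D) per-timestep audio embeddings
    video_feats: (P, D) per-patch visual embeddings
    Dense pairwise similarities are max-pooled over patches and averaged
    over time, so a spoken word only needs to match *some* image region.
    """
    sims = audio_feats @ video_feats.t()      # (T, P)
    return sims.max(dim=1).values.mean()      # scalar score

def matching_loss(audio_batch, video_batch, temperature=0.07):
    """InfoNCE-style loss: each audio clip should score highest against
    its own video, and vice versa, within the batch."""
    B = len(audio_batch)
    scores = torch.stack([
        torch.stack([clip_similarity(a, v) for v in video_batch])
        for a in audio_batch
    ])                                        # (B, B) audio-vs-video scores
    targets = torch.arange(B)
    loss_av = F.cross_entropy(scores / temperature, targets)
    loss_va = F.cross_entropy(scores.t() / temperature, targets)
    return (loss_av + loss_va) / 2

# Toy usage with random "features" standing in for real encoder outputs.
audio_batch = [torch.randn(50, 128) for _ in range(4)]   # 4 clips, 50 timesteps
video_batch = [torch.randn(196, 128) for _ in range(4)]  # 4 clips, 14x14 patches
print(matching_loss(audio_batch, video_batch))
```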
A paper describing the work appears on the arXiv preprint server.
Once they trained DenseAV on this matching game, Hamilton and his colleagues looked at which pixels the model looked at when it heard a sound. For example, when someone says "dog," the algorithm immediately starts looking for dogs in the video stream. By seeing which pixels are selected by the algorithm, one can discover what the algorithm thinks a word means.
Interestingly, a similar search process happens when DenseAV listens to a dog barking: It searches for a dog in the video stream.
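This kind of inspection can be pictured as a heatmap over the video frame: each image region is scored by how strongly it matches the audio of the word or sound. The snippet below is a hedged illustration of that idea; the function name, shapes, and patch grid are assumptions, not DenseAV's actual API.

```python
# Illustrative sketch: which image regions match a given sound?
import torch

def sound_to_heatmap(audio_feats, patch_feats, grid_hw=(14, 14)):
    """audio_feats: (T, D) embeddings of a spoken word or sound.
    patch_feats: (P, D) visual embeddings for P = H*W image patches.
    Returns an (H, W) map of how strongly each region matches the audio,
    which could be upsampled and overlaid on the frame for display."""
    sims = audio_feats @ patch_feats.t()      # (T, P) dense similarities
    heat = sims.max(dim=0).values             # best match over time, (P,)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat.reshape(grid_hw)

# e.g. heat = sound_to_heatmap(word_dog_audio_feats, frame_patch_feats)
```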
"This piqued our interest. We wanted to see if the algorithm knew the difference between the word 'dog' and a dog's bark," says Hamilton. The team explored this by giving DenseAV a "two-sided brain." Interestingly, they found one side of DenseAV's brain naturally focused on language, like the word "dog," and the other side focused on sounds like barking. This showed that DenseAV not only learned the meaning of words and the locations of sounds, but also learned to distinguish between these types of cross-modal connections, all without human intervention or any knowledge of written language.
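One way to picture the "two-sided brain" is to split the shared embedding into two heads and let each contribute its own similarity score, so that one head can specialize in speech and the other in natural sounds. The sketch below is an assumption-laden illustration of that decomposition, not the authors' implementation; the dimension split and head labels are hypothetical.

```python
# Illustrative sketch of a "two-sided" audio-visual similarity.
# Dimensions, the 50/50 split, and which head ends up as the
# "language" vs. "sound" side are assumptions for illustration only.
import torch

def two_head_similarity(audio_feats, patch_feats, split=64):
    """audio_feats: (T, 2*split), patch_feats: (P, 2*split).
    Returns the total similarity plus each head's contribution, so one
    can check which head responds to the word 'dog' vs. an actual bark."""
    a1, a2 = audio_feats[:, :split], audio_feats[:, split:]
    v1, v2 = patch_feats[:, :split], patch_feats[:, split:]
    head1 = (a1 @ v1.t()).max(dim=1).values.mean()   # e.g. language head
    head2 = (a2 @ v2.t()).max(dim=1).values.mean()   # e.g. sound head
    return head1 + head2, head1, head2
```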