Computers possess two remarkable capabilities with respect to images: They can both identify them and generate them anew. Historically, these functions have stood separate, akin to the disparate acts of a chef who is good at creating dishes (generation), and a connoisseur who is good at tasting dishes (recognition).
Yet one can’t help but wonder: What would it take to orchestrate a harmonious union between these two distinctive capacities? Both chef and connoisseur share a common understanding of the taste of the food. Similarly, a unified vision system requires a deep understanding of the visual world.
Now, researchers in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have trained a system to infer the missing parts of an image, a task that requires deep comprehension of the image’s content. In successfully filling in the blanks, the system, known as the Masked Generative Encoder (MAGE), achieves two goals at the same time: accurately identifying images and creating new ones with striking resemblance to reality.
This dual-purpose system enables myriad potential applications, like object identification and classification within images, swift learning from minimal examples, the generation of images conditioned on inputs such as text or class labels, and the enhancement of existing images.
Unlike other techniques, MAGE doesn’t work with raw pixels. Instead, it converts images into what are called “semantic tokens,” which are compact yet abstracted versions of an image section. Think of these tokens as mini jigsaw puzzle pieces, each representing a 16×16 patch of the original image.
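To make the jigsaw analogy concrete, here is a minimal sketch of how an image divides into 16×16 patches and how some of the resulting token positions might be hidden for a fill-in-the-blanks task. It is an illustration, not MAGE's actual code: the 256×256 image size, the 0.75 masking ratio, and the helper names `token_grid`, `patch_bounds`, and `random_mask` are all assumptions made for this example, and the learned step that turns each patch into a semantic token is not modeled.

```python
import random

PATCH = 16  # patch side length in pixels, as described in the article


def token_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid covering an image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


def patch_bounds(index, cols, patch=PATCH):
    """Pixel box (top, left, bottom, right) covered by token `index` (row-major)."""
    row, col = divmod(index, cols)
    return row * patch, col * patch, (row + 1) * patch, (col + 1) * patch


def random_mask(num_tokens, ratio, seed=None):
    """Pick round(ratio * num_tokens) distinct token positions to hide."""
    rng = random.Random(seed)
    k = round(ratio * num_tokens)
    return sorted(rng.sample(range(num_tokens), k))


# Assumed example: a 256x256 image with 75% of token positions masked.
rows, cols = token_grid(256, 256)
num_tokens = rows * cols
masked = random_mask(num_tokens, ratio=0.75, seed=0)

print(rows, cols, num_tokens)  # 16 16 256
print(len(masked))             # 192
print(patch_bounds(17, cols))  # (16, 16, 32, 32)
```

Under these assumptions, one image becomes a grid of 256 puzzle pieces, and a system trained on the task would have to infer the 192 hidden pieces from the 64 that remain visible.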