Researchers at the Allen Institute for Artificial Intelligence (AI2) have created a machine learning algorithm that can produce images using only text captions as its guide. The results are somewhat terrifying… but if you can look past the nightmare fuel, this creation represents an important step forward in the study of AI and imaging.
Unlike some of the genuinely mind-blowing machine learning algorithms we’ve shared in the past—see here, here, and here—this creation is more of a proof-of-concept experiment. The idea was to take a well-established computer vision model that can caption photos based on what it “sees” in the image, and reverse it: producing an AI that can generate images from captions, instead of the other way around.
This is a fascinating area of study and, as MIT Technology Review points out, it shows in real terms how limited these computer vision algorithms really are. While even a small child can do both of these things readily—describe an image in words, or conjure a mental picture of an image based on those words—when the Allen Institute researchers tried to generate a photo from a text caption using a model called LXMERT, it generated nonsense in return.
So they set out to modify LXMERT and created X-LXMERT. And while the results that X-LXMERT generates given a text caption aren’t exactly “coherent,” they’re not “nonsense” either—the general idea is usually there. Here are some example images created by the researchers using various captions:
And here are a few examples we generated by plugging various captions into a live demo they created using their model:
The above are all based on captions provided by the researchers, and most of them seem to at least contain the major concepts in each description. However, when we tried to create totally new captions based on more esoteric concepts like “photographer,” “photography studio,” or even the word “camera,” the results fell apart:
While the results from and limitations of X-LXMERT probably don’t inspire either awe or the fear of the impending AI revolution, the groundbreaking masking technique that the researchers developed is an important first step in teaching an AI to “fill in the blanks” that any text description inherently leaves out.
This will eventually lead to better image recognition and computer vision, which can only help improve tasks that actually matter to the readers of this site. In other words: the better a computer is at understanding what you mean when you describe an image or image editing task, the more complex the tasks it will be able to perform on that image.