Microsoft unveiled Kosmos-1, a multimodal model capable of analysing images for content, solving visual puzzles, performing visual text recognition, passing visual IQ tests, and understanding natural language instructions.
The researchers believe that multimodal AI, which integrates different modes of input such as text, audio, images, and video, is a critical step towards developing artificial general intelligence (AGI) that can perform general tasks at the level of a human. In their academic paper, “Language Is Not All You Need: Aligning Perception with Language Models,” the researchers write that multimodal perception “is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world.”