With the recent emergence of multimodal AI, AI systems are becoming increasingly multipurpose: they can process and generate several data modalities at once, including text, images, audio, and video, in an integrated fashion.
One of the more versatile subsets of multimodal AI is the vision-language model (VLM), which combines natural language processing (NLP) and computer vision (CV) capabilities to tackle advanced vision-language tasks, such as image captioning, visual question answering, and…