Meta Releases Llama 3.2 Vision Models That Can Process Text and Images
Sarah Reyes
3 min read
Meta has launched Llama 3.2, its first release of Llama models that can process both images and text. What's new? It lets developers build more sophisticated AI applications, including augmented reality apps that understand video in real time, tools that summarize long documents, and systems that sort images by content.
Meta's vice president of generative AI, Ahmad Al-Dahle, says developers will find it easy to get the model up and running: they need to do little beyond adding the "multimodality," showing Llama images, and communicating with it.
With this release, Meta is catching up to multimodal models already launched by Google and OpenAI. Vision support will also play a significant role in helping Meta build more AI capabilities into hardware such as its Ray-Ban Meta glasses.
Llama 3.2 includes two vision models, one with 90 billion parameters and one with 11 billion. There are also two text-only models, with 1 billion and 3 billion parameters, which can run on MediaTek, Qualcomm, and other Arm hardware.
To build the vision models, the company integrated a pre-trained image encoder into its existing language model through a set of adapter weights, which is what lets the models understand both text and images.
These adapters, a series of cross-attention layers, feed the image representations into the model's text-processing layers, allowing it to handle both input types.
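Meta describes the design only at this level, but the idea can be sketched in a few lines of PyTorch. The class, dimensions, and wiring below are illustrative assumptions rather than Meta's implementation: a cross-attention block lets the text hidden states attend to features produced by the image encoder.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative sketch (not Meta's code): text hidden states
    attend to image-encoder features via cross-attention."""

    def __init__(self, text_dim=4096, image_dim=1280, num_heads=8):
        super().__init__()
        # Project image-encoder outputs into the language model's hidden size.
        self.image_proj = nn.Linear(image_dim, text_dim)
        # Cross-attention: queries come from text, keys/values from the image.
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, image_features):
        # text_hidden:    (batch, text_len, text_dim) from the language model
        # image_features: (batch, num_patches, image_dim) from the image encoder
        img = self.image_proj(image_features)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        # Residual connection keeps the original text pathway intact.
        return self.norm(text_hidden + attended)

# Dummy usage with made-up shapes.
adapter = CrossAttentionAdapter()
text = torch.randn(1, 16, 4096)      # text hidden states
patches = torch.randn(1, 256, 1280)  # image-encoder features
fused = adapter(text, patches)       # -> (1, 16, 4096)
```

According to Meta, these adapters were trained while the language model's own weights stayed frozen, so the text-only behavior of the underlying Llama model is preserved.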
Training started from a pre-trained Llama language model, which was shown large sets of images paired with text descriptions, teaching it to connect the two. The team then refined it on more specific, cleaner data, improving its ability to reason over visuals.
Finally, the team used fine-tuning and synthetic data generation to ensure the model behaves safely and gives helpful answers.
Because the models understand both text and images, they can answer user questions about visual content, for instance identifying an object in a scene or summarizing what an image shows.
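As a rough sketch of that workflow, here is how the 11B instruction-tuned vision model can be queried through the Hugging Face transformers integration. The model ID, image URL, and prompt are placeholders, and the exact API surface may vary by library version.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Assumed model ID for the 11B instruction-tuned vision model.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; any local image works as well.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Ask a question about the image using the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize what this chart shows."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```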
Llama 3.2 models can also extract and summarize information presented in documents as graphs, charts, and images, so businesses could use them, for example, to interpret sales data quickly.
They can also generate image captions, which makes them useful in industries like digital media, where understanding image content is essential.
Are the Llama Vision Models Open and Customizable?
Yes, they are. Developers can enhance or fine-tune both the pre-trained and aligned model versions using the “Torchtune” framework.
These models also reduce reliance on cloud infrastructure: because they can be deployed locally via Torchchat, developers can run AI systems on-premises and in resource-constrained environments. Developers can also try the vision models through Meta AI.