Dragonfly: Enhanced Vision-Language Model with Multi-Resolution Zoom Launched by Together.ai

Together.ai has announced Dragonfly, a vision-language model designed to improve fine-grained visual understanding and reasoning about image regions. According to Together.ai, the architecture uses multi-resolution zoom-and-select to strengthen multi-modal reasoning while remaining context-efficient.

Dragonfly Model Architecture

Dragonfly employs two primary strategies: multi-resolution visual encoding and zoom-in patch selection. These techniques enable the model to focus on fine-grained details of image regions, enhancing its commonsense reasoning capabilities. The architecture processes images at multiple resolutions—low, medium, and high—dividing each image into sub-images that are encoded into visual tokens. These tokens are then projected into a language space, forming a concatenated sequence that feeds into the language model.
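
The tiling step described above can be sketched in a few lines of Python. This is a minimal illustration, not Together.ai's implementation: the 336-pixel crop size, the 1x/2x/4x scale factors, and the 576 tokens per sub-image are assumptions chosen for the example, and the names Crop, tile_boxes, multi_resolution_crops, and sequence_length are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    level: str   # "low", "medium", or "high"
    box: tuple   # (left, top, right, bottom) in resized-image pixels


def tile_boxes(level, side, crop):
    """Split a side x side image into non-overlapping crop x crop sub-images."""
    if side % crop != 0:
        raise ValueError("side must be a multiple of crop")
    n = side // crop
    return [
        Crop(level, (c * crop, r * crop, (c + 1) * crop, (r + 1) * crop))
        for r in range(n)
        for c in range(n)
    ]


def multi_resolution_crops(crop=336,
                           scales=(("low", 1), ("medium", 2), ("high", 4))):
    """List every sub-image box for each resolution level.

    The crop size and scale factors are assumed values for illustration;
    the article does not state the ones Dragonfly uses.
    """
    crops = []
    for level, scale in scales:
        crops.extend(tile_boxes(level, crop * scale, crop))
    return crops


def sequence_length(crops, tokens_per_crop):
    """Visual tokens in the concatenated sequence if every sub-image is kept."""
    return len(crops) * tokens_per_crop


if __name__ == "__main__":
    crops = multi_resolution_crops()
    print(len(crops), sequence_length(crops, tokens_per_crop=576))
```

Under these assumed settings the three levels produce 1 + 4 + 16 = 21 sub-images, or 12,096 visual tokens if each sub-image yields 576 tokens. Most of that budget comes from the high-resolution level.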

Zoom-in Patch Selection: Dragonfly employs a selective approach for high-resolution images, identifying and retaining only the sub-images that provide the most significant visual information. This targeted selection reduces redundancy and improves the overall model efficiency.
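
One way to picture this selection is a top-k filter over high-resolution sub-image embeddings. The sketch below is an assumption-laden illustration rather than Dragonfly's actual criterion: the article does not say how "significant visual information" is scored, so cosine similarity to a reference embedding stands in for it, and the names cosine and select_patches are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_patches(patch_embeddings, reference, k):
    """Return the indices of the k sub-images most similar to the reference.

    The scoring rule is an assumed stand-in for whatever relevance measure
    the model uses. Indices come back in their original order so the kept
    sub-images preserve their spatial sequence.
    """
    scores = [cosine(e, reference) for e in patch_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    high_res = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(select_patches(high_res, reference=[1.0, 0.0], k=2))
```

Keeping only k of the high-resolution sub-images caps the number of visual tokens that reach the language model, which is the efficiency gain the selection step is meant to deliver.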

Performance and Evaluation

Dragonfly performs well on several vision-language benchmarks, including commonsense visual question answering and image captioning. Among the models compared below, it posts the highest scores on AI2D, ScienceQA, MMMU, and POPE, and ranks second to LLaVA-v1.6 on MMVet.

Benchmark Performance:

Model	AI2D	ScienceQA	MMMU	MMVet	POPE
VILA	-	68.2	-	34.9	85.5
LLaVA-v1.5 (Vicuna-7B)	54.8	70.4	35.3	30.5	85.9
LLaVA-v1.6 (Mistral-7B)	60.8	72.8	33.4	44.8	86.7
QWEN-VL-chat	52.3	68.2	35.9	-	-
Dragonfly (LLaMA-8B)	63.6	80.5	37.8	35.9	91.2

Dragonfly-Med

In collaboration with Stanford Medicine, Together.ai has also introduced Dragonfly-Med, a version fine-tuned on 1.4 million biomedical image-instruction samples. This model excels at tasks involving high-resolution medical data, outperforming previous models such as Med-Gemini on multiple medical imaging benchmarks.

Evaluation on Medical Benchmarks

Dragonfly-Med was evaluated on visual question-answering and clinical report generation tasks, achieving state-of-the-art results on several benchmarks:

Dataset	Metric	Med-Gemini	Dragonfly-Med (LLaMA-8B)
VQA-RAD	Acc (closed)	69.7	77.4
SLAKE	Acc (closed)	84.8	90.4
Path-VQA	Acc (closed)	83.3	92.3
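
For readers unfamiliar with the "Acc (closed)" column, closed-ended accuracy is typically the percentage of closed questions (such as yes/no questions) answered correctly. The sketch below assumes exact-match scoring after light text normalization. The article does not describe the scoring script behind these numbers, so the normalization rules and the names normalize and closed_accuracy are hypothetical.

```python
def normalize(answer):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def closed_accuracy(predictions, references):
    """Percentage of closed-ended questions answered with an exact match."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = ["Yes", "no.", " yes "]
    refs = ["yes", "no", "no"]
    print(round(closed_accuracy(preds, refs), 1))
```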

Conclusion and Future Work

Dragonfly's architecture offers a new research direction by focusing on zooming in on image regions to capture more fine-grained visual information. Together.ai plans to continue improving the model's capabilities and exploring new architectures and visual encoding strategies to benefit broader scientific fields.

The collaboration with Stanford Medicine and the use of resources such as Meta LLaMA3 and OpenAI's CLIP were crucial in developing Dragonfly. The model's codebase also builds on Otter and LLaVA-UHD.