TITLE:
Object Detection Meets LLMs: Model Fusion for Safety and Security
AUTHORS:
Zeba Mohsin Wase, Vijay K. Madisetti, Arshdeep Bahga
KEYWORDS:
Computer Vision, Large Language Models, Self Driving Vehicles
JOURNAL NAME:
Journal of Software Engineering and Applications, Vol.16 No.12, December 27, 2023
ABSTRACT: This paper proposes a novel model fusion approach that enhances the predictive capabilities of vision and language models by strategically integrating object detection with large language models. We name this multimodal integration approach VOLTRON (Vision Object Linguistic Translation for Responsive Observation and Narration). VOLTRON aims to improve the responses of self-driving vehicles in detecting small objects crossing roads and in identifying merged or narrowed lanes. The models are fused through a single layer that provides LLaMA2 (Large Language Model Meta AI) with object detection probabilities from YOLOv8-n (You Only Look Once), translated into sentences. Experiments on specialized datasets showed accuracy improvements of up to 88.16%. We present the theoretical principles that inform our model fusion approach and detail the techniques and strategies used to merge these two disparate models.
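The abstract describes translating object detection probabilities into sentences before passing them to the language model. A minimal sketch of that translation step is shown below; this is an illustrative assumption, not the authors' implementation, and the class names, sentence template, and helper function name are hypothetical.

```python
# Hypothetical sketch of the detection-to-sentence step described in the
# abstract: rendering (label, confidence) pairs from an object detector
# such as YOLOv8-n as a natural-language sentence an LLM can consume.
# The template and labels are illustrative, not from the paper.

def detections_to_sentence(detections):
    """Render a list of (label, confidence) pairs as one sentence."""
    if not detections:
        return "No objects detected on the road."
    parts = [
        f"a {label} with {conf:.0%} confidence"
        for label, conf in detections
    ]
    return "The camera detects " + ", ".join(parts) + "."

prompt = detections_to_sentence(
    [("pedestrian", 0.91), ("narrow lane marking", 0.78)]
)
# prompt can then be appended to the LLM's input context
```

In such a design, the detector's numeric outputs become text tokens, so the language model can reason about the scene without any change to its vocabulary or tokenizer.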