1. Introduction
The advent and development of autonomous driving technologies have marked a significant transformation in the automotive industry, reshaping how vehicles interact with their environment and with each other. The concept of autonomous driving has evolved over several decades, beginning with basic cruise control systems in the 1950s [
1,
2] and progressing to the sophisticated connected and automated vehicles (CAVs) of today [
3,
4]. The integration of advanced sensors, machine learning algorithms, and communication technologies has enabled vehicles to perform complex tasks such as navigation, obstacle avoidance, and decision-making with minimal human intervention [
5].
As the capabilities of autonomous vehicles have expanded, so too have the challenges associated with ensuring their safety and reliability. Traffic safety has always been a critical concern in the development of autonomous vehicles, as these systems must be able to respond to a wide range of dynamic and unpredictable situations on the road. This is particularly important in the context of safety-critical events, such as sudden changes in traffic patterns, unexpected obstacles, and potential collisions. Additionally, traffic safety is influenced not only by the type of vehicle—whether conventional or autonomous—but also by the volume of traffic, as higher traffic volumes increase the probability of unsafe conditions [
6]. The ability of autonomous vehicles to detect, analyze, and respond to these events in real time is crucial for preventing accidents and ensuring the safety of all road users [
7].
The recent advancement breakthrough in large language models (LLMs) has revealed the potential usage in the complex challenging environment of analyzing driving videos. Many researchers have investigated the potential of utilizing LLMs in analyzing driving videos through textual representations [
8,
9,
10,
11]. With the advancement of MLLMs, a new merger has been reached with the power reasoning of LLMs in the different modalities of text, image, and audio [
12,
13].
Critical-safety event analysis is considered one of the complex and critical environments that could benefit from the new MLLM breakthrough. While full critical-safety event detection might still be a far reach, MLLM could advance the understanding of the dynamic variety of road transportation through providing textual analysis of the visual representation of the environment and the different agents in it, then using this analysis to provide a direct, concise, and actionable early warning to the ego-driver in the case of any potential hazards.
This capability of MLLMs to synthesize information across multiple modalities—such as visual cues from driving videos, environmental sounds, and contextual data—opens new avenues for enhancing driver assistance systems. By integrating textual analysis with real-time video and audio inputs, MLLMs can facilitate more accurate and context-aware interpretations of driving scenarios, which are crucial for preventing accidents and improving road safety. The ability to generate natural language descriptions or warnings based on complex visual and auditory inputs allows for a more intuitive interface between the technology and the driver, which has the ability to reduce cognitive load and improve reaction times in critical situations.
Moreover, the adaptability of MLLMs to learn from diverse data sources, including different driving environments and conditions, enhances their robustness and reliability. As MLLMs continue to evolve, their role in the domain of autonomous driving and driver assistance is expected to expand, offering more sophisticated solutions for anticipating and mitigating safety risks. The integration of these models into real-world applications could mark a significant step forward in achieving safer and more efficient transportation systems. In this context, the development of MLLMs presents an exciting opportunity to revolutionize the field of critical-safety event detection in driving. As the technology matures, the potential for creating systems that can not only detect but also predict, prevent, and recommend about critical-safety events becomes increasingly feasible, moving us closer to a future where road transportation is not only smarter but also significantly safer.
2. Related Works
Before the era of MLLM, researchers in safety event analysis relied on developing a complex machine learning model from the ground up, utilizing thousands of annotated datasets to achieve high accuracy and reliability. For instance, the authors in [
14] proposed a supervised encoder–decoder model where a pre-trained ResNet-101 was used as encoder to extract the visual and flow features of 17k distinct ego-car dash cam scenarios, and a neural image caption generation structure [
15] as a decoder to predict the caption of the street frames while attaining the extracted features from the encoder part.
The study by Zhenjie et. al. LLM4Drive [
16] reviews the integration of large language models (LLMs) in autonomous driving systems, highlighting their potential to enhance decision-making, perception, and interaction through advanced reasoning and contextual understanding. The survey categorizes current research into planning, perception, question answering, and generation, addressing the challenges of transparency, scalability, and real-world application. It underscores the need for robust datasets and interpretable models to build trust and improve system reliability in autonomous driving.
Cui et al. explores the integration of LLMs and vision foundation models (VFMs) in enhancing autonomous driving systems [
17]. The work covers the historical evolution from early sensor-based approaches to advanced deep learning techniques that improve perception, planning, and decision-making. The paper also reviews existing multimodal tools and datasets like KITTI [
18] and nuScenes [
19]. A study by Chen et al. [
20] proposed a pretraining method that aligns numeric vector modalities with LLM (GPT3.5) representations, improving the system’s ability to interpret driving scenarios, answer questions, and make decisions. Furthermore, the study titled “DriveMLM” [
21] introduces an LLM-based autonomous driving (AD) framework that aligns multi-modal LLMs with behavioral planning states, enabling closed-loop autonomous driving in realistic simulators. It bridges the gap between language decisions and vehicle control commands by standardizing decision states according to the off-the-shelf motion planning module. On another hand, the “Drive As you Speak” paper [
22] presents an approach to enabling human-like interaction with large language models in autonomous vehicles. It leverages LLMs to understand and respond to human commands, demonstrating the potential of LLMs in creating more intuitive and user-friendly autonomous driving experiences. Moreover, AccidentGPT [
23] introduces a multi-modal model for traffic accident analysis, which was capable of reconstructing crash processes and providing comprehensive reports.
Recent advancements also explored the integration of sensor data and real-time processing using LLMs to enhance autonomous driving capabilities. A study by Zhang et al. [
24] examined the integration of LLMs with LiDAR and radar data to improve object detection and tracking. Similarly, a study by Singh et al. [
25] highlighted the use of LLMs in predicting pedestrian behavior by analyzing both visual signals and contextual information, which increased the reliability of autonomous systems in urban settings. Moreover, another study by Lopez et al. [
26] focused on utilizing LLMs to interpret driver motions and voice commands, which might facilitate a more natural interaction between the driver and the vehicle. Furthermore, a study by Kim et al. [
27] explored the analysis of live video feeds from dashboard cameras, which enabled the early detection of potential hazards such as sudden lane changes or road obstacles. This approach allowed for timely warnings and interventions, which has the potential to enhance the safety of autonomous driving systems.
A study by Hussien et al. [
28] also highlighted the potential of integrating LLMs with knowledge graphs and retrieval-augmented generation (RAG)-based explainable frameworks to provide explainable predictions of road user behaviors, which is crucial for developing safe automated driving systems. This study underscored the importance of explainability in the potential deployment of LLMs in critical environments like autonomous driving.
In addition to LLMs, contributions have been made in the domain of cooperative control of CAVs. The study by Liang et al. [
29] explores a multi-agent system (MAS) architecture designed to facilitate the cooperative control of CAVs. This hierarchical architecture enables vehicles to collaborate effectively in complex traffic environments, sharing information and making collective decisions that enhance safety and efficiency. The study emphasizes the importance of cooperation among autonomous vehicles, particularly in scenarios where rapid decision-making and coordination are critical to preventing accidents and ensuring smooth traffic flow.
Despite the promising developments in using MLLMs for autonomous driving and intelligent transportation systems, a significant gap remains in the application of these models for safety-critical event analysis. Existing studies focused on enhancing autonomous driving capabilities through improved perception and decision-making processes without specifically addressing the unique challenges posed by safety-critical situations. This gap shows the need for a specialized approach that leverages the multimodal capabilities of LLMs to directly address the complexity of safety-critical events in driving scenarios and provide more explainable information and recommendations, which is very important for taking the right safety countermeasures. Current methods mostly rely on complex machine and deep learning models and extensive annotated datasets, which are not always feasible or scalable in real-world applications. There is a need for an easy-to-implement, scalable, and explainable framework that can automate the extraction of visual representations from raw video feeds and utilize object-level question–answer (QA) prompts to guide MLLMs in generating actionable insights for hazard detection and response.
This study aims to bridge this gap by introducing an MLLM framework specifically designed for the analysis and interpretation of safety-critical events. By integrating the different modalities of texts and images, our framework seeks to provide a more holistic and scalable view of dynamic driving environments. Furthermore, our approach emphasizes the automation of extracting visual representation from the raw video and feeding it to an MLLM with the creation of object-level QA prompts to guide the MLLM’s analysis, which focuses on generating actionable insights for safety-critical event detection and response. This study introduces a novel application of MLLMs in a domain where precision and reliable decision-making are essential, marking a significant step forward in the development of safer driving.
3. Preliminary
This section introduces the dataset utilized for evaluating the proposed framework and the Gemini model, which form the foundation for the experimental work in this study.
3.1. Dataset
Creating a dataset from driving videos that integrates language for visual understanding is a challenging task. This process is a resource-extensive task that requires trained human annotators for optimal accuracy and reliability. In addition, the variety and complexity of driving scenarios require a dataset rich in visual scenes. The dataset needs to cover a variety ranging from simple driving directions to complex situations involving pedestrians, other vehicles, and road signs.
Many researchers have either enhanced existing datasets with textual information [
30,
31,
32,
33] or developed new ones from scratch [
14,
34]. Notable among these are the DRAMA datasets [
14]. DRAMA focuses on driving hazards and related objects, featuring video and object-level inquiries. This dataset supports visual captioning with free-form language descriptions and accommodates both closed and open-ended questions, making it essential for assessing various visual captioning skills in driving contexts. In addition, the vast variety found in DRAMA scenarios makes it a uniquely comprehensive resource for investigating and evaluating MLLM models on complex driving situations.
Considering these factors, this study selected the DRAMA dataset to utilize the ground truth label to report this paper’s experimental results. DRAMA’s detailed focus on hazard detection and its comprehensive framework for handling natural language queries make it exceptionally suitable for pushing forward research in safety-critical event analysis.
The DRAMA dataset includes multiple levels of human-labeled question–answer (QA) pairs. These include base questions regarding whether a risk exists in the scene, with Yes/No answers; scene classification into urban road, intersection, and narrow lane categories; questions about the direction of the ego-car, with options such as straight, right, and left; questions about potential hazard-causing agents in the scene, such as vehicles, pedestrians, cyclists, and infrastructure; and finally, questions about recommended actions for the ego-car driver based on the scene analysis, with eight possible actions, including stop, slow down, be aware, follow the vehicle ahead, carefully maneuver, start moving, accelerate, and yield.
Figure 1 illustrates the distribution of each QA used in the studies, where 300 distinct videos ranging from 2 to 5 s were employed to examine the effectiveness of MLLMs in detecting traffic safety-critical events.
3.2. Gemini MLLM
The framework for detecting safety-critical events from driving videos in this study utilizes the Gemini-Pro-Vision 1.5 MLLM [
35]. This model was chosen for its advanced capabilities in logical and visual reasoning, particularly in identifying potential hazards across diverse traffic scenarios.
Gemini 1.5 is designed to process and integrate information across multiple modalities—text, images, and video—making it highly effective for tasks that require a deep understanding of both visual and textual data. This is crucial for driving scenarios where the model needs to interpret video frames and respond to natural language prompts simultaneously. One of the most striking features of the Gemini 1.5 model is its ability to handle a context window of up to 1 million tokens, which is significantly larger than most other models. This allows Gemini to process large amounts of data in a single pass.
4. Methodology
We conducted multiple experiments to investigate the capability and logical and visual reasoning power of MLLMs in identifying potential hazards across diverse traffic scenarios. To guide our investigation, we formulated the following research questions (RQs):
RQ1: How effective are MLLMs at identifying traffic hazards using in-context learning (ICL) with zero-shot and few-shot learning approaches?
RQ2: Does the number of frames used impact the accuracy of hazard detection in traffic scenarios?
RQ3: What is the impact of self-ensembling techniques on the reliability and robustness of MLLMs in detecting critical traffic safety events?
The employed methods range from ICL with zero-shot and few-shot learning to varying the number of frames, utilizing textual context alongside visual frames, and implementing self-ensembling techniques. The subsequent sections present the proposed framework and its operational flow for detecting critical traffic safety events, followed by an overview of the different methodologies employed and the implemented prompt design.
4.1. Framework
The framework illustrated in
Figure 2 is designed for detecting safety-critical events from driving video extracted from car dash cams, utilizing a multi-stage QA approach with an MLLM, specifically, Gemini-pro-vision 1.5. The process initiates with frame extraction, where the system automatically collects video frames from the ego-vehicle’s camera at regular intervals (i.e., every second). These frames are subjected to the hazard detection phase, where the model assesses the scene for potential dangers.
Upon identifying a hazard, the framework employs a tripartite categorization strategy to probe the nature of the threat further, using “What,” “Which,” and “Where” queries to reveal the object-level details. In the “What” phase, the MLLM classifies the entities detected by the camera, differentiating among agents like pedestrians, vehicles, or infrastructure elements. The “Which” stage involves the MLLM identifying specific features and attributes of these agents, such as pedestrian appearance, vehicle make and model, or infrastructure type, providing vital contextual insights for decision-making.
The final “Where” phase tasks the MLLM with determining the spatial location and distance of the hazard agents from the ego-car, including their position on the road, proximity to the vehicle, and movement direction. This spatial information is critical for the ego-car system to make a safer navigation decision. We tested the model across different dimensions to evaluate model performance in various tasks for each safety-critical event, including identifying risky scenarios, classifying different scenes, determining car direction, classifying agents, and suggesting correct actions.
The framework addresses traffic safety-critical events through a thorough analysis of interactions and road environments in three folds. First, the framework recognizes and evaluates scenarios where the interaction between the ego-vehicle and other road users (i.e., vehicles, pedestrians, and cyclists) or infrastructure may result in safety-critical events. These events include sudden stops, lane changes, or crossing pedestrians that could lead to hazardous situations if not managed correctly. Second, the model identifies and localizes risks within the driving environment, determining the exact location and potential impact of hazardous objects. It assesses the relative position of these objects, such as vehicles cutting into the lane or pedestrians crossing unexpectedly, which are crucial for proactive hazard mitigation. Third, the framework adapts to various road types, such as wide roads, intersections, and narrow streets, each presenting unique challenges. For example, intersections are flagged as particularly high risk due to the convergence of multiple traffic flows, necessitating precise detection and decision-making capabilities from the system.
4.2. Analysis Methods
We incorporated different methods to enhance the detection of safety-critical events. These methods were experimented with to enable the system to focus on the most relevant information, thereby optimizing processing speed and accuracy in detecting safety-critical events.
One approach employed was the sliding window frame capture technique, which systematically captures subsets of video frames by defining a window that slides over the video timeline. This window captures a specific range of frames from
to
where
represents the initial frame in the window, and
is the number of frames included in each window. The mathematical representation of this method can be expressed in Equation (1). This method allows for the dynamic adjustment of the window size based on the specific requirements of the analysis, which allows the framework to balance data completeness and processing efficiency.
In-context learning (ICL) was also integrated into the framework to enhance the predictive capabilities of MLLMs by providing them with relevant examples during the inference process. This technique is particularly effective in scenarios where annotated data are rare or the model has to adapt to new situations while in progress. In our framework, we explored two primary settings of in-context learning: zero-shot and few-shot learning. Zero-shot learning allows the model to make predictions about safety-critical events without having seen any prior examples specific to the task. The model relies completely on its base knowledge and reasoning capabilities. This method might be beneficial for its ability to generalize across diverse and unforeseen scenarios. The zero-shot learning process can be mathematically described in Equation (2).
where
is a sequence of specific frames and
is a general question or instruction provided to the MLLM to guide the analysis.
On the other hand, few-shot learning allows the model to be exposed to a small number of annotated examples relevant to the task before making predictions. This kind of learning enables the model to quickly adapt and improve its accuracy by leveraging specific patterns and features observed in the examples seen. The few-shot learning process is presented in Equation (3).
where
is the annotated observations used to fine-tune the model’s reasoning.
Label-augmented learning (LAL) was another method employed, providing context for the MLLM about how the data were originally annotated. This method helps the MLLM to understand the labeling scheme and the specific characteristics that were considered during the annotation process. By incorporating this context, the model can align its outputs more closely with the annotated data, thereby improving accuracy and consistency.
Image-augmented learning (IAL) involved applying various image augmentation techniques to the images before they were fed into the MLLM for safety-critical event detection QA. These augmented images aim to direct the MLLM to different areas within the language distribution it relies on for generating responses, as illustrated in
Figure 3.
By introducing augmented images in the prompt, the MLLM can start at various points within the data distribution, which influences the diversity of local sampling results.
Subsequently, the outcomes from different model sampling processes are aggregated using a top-k voting mechanism to determine the outcome response. This approach aims to aid the model in producing textual responses that more accurately represent the scene under query.
Self-ensemble learning is another strategy used to boost the performance of our framework. It involves generating multiple predictions from the MLLM using slightly different contexts or parameters, such as model temperature, and then combining these predictions to obtain a more robust and accurate result. This approach reduces the likelihood of errors and increases the reliability of hazard detection. The self-ensemble process can be described mathematically Equation (4).
where
N is the number of individual predictions. The
voting mechanism selects the
most frequent predictions among the
generated predictions, which enhances the overall performance by focusing on the most consistently identified outcomes.
4.3. Prompt Design
The design of the prompt is pivotal in guiding the MLLM to accurately evaluate and respond to safety-critical events in driving scenarios. This prompt was designed to ensure a structured and systematic analysis of the input frames, thereby minimizing the risk of hallucination and enhancing the reliability of the MLLM’s outputs, as seen in
Figure 4. The structure of the prompt is intended to break down the evaluation process into clear, logical steps, which helps the MLLM to focus on specific aspects of the scene sequentially.
The prompt design benefits the MLLM by ensuring a structured analysis that breaks down the evaluation into discrete steps, allowing the model to focus on one aspect at a time and reducing cognitive overload. By using predefined categories and limited response options, it controls the model’s output, minimizing the risk of hallucinations. Each step builds on the previous one, providing a holistic and context-aware understanding of the scene, which improves decision-making. The final step of suggesting actionable recommendations ensures that the analysis is not just descriptive but also prescriptive, offering clear and practical guidance for safe driving.
In summary, the prompt design is tailored to enhance the MLLM’s ability to accurately and reliably detect safety-critical events from driving video frames. Its structured approach, combined with controlled output options, significantly mitigates the risk of hallucination, ensuring that the model’s responses are both relevant and actionable.
5. Results
The work presented in this paper investigated the potential of leveraging the capabilities of MLLMs in analyzing safety-critical event scenarios using multi-modal data integration and dynamic contextual data reduction for guiding the model’s output. The prediction illustrated in
Figure 5 showcases the proficiency of Gemini-Pro-Vision 1.5 in zero-shot learning scenarios.
To understand the effectiveness of the proposed framework, a series of experiments was carried out utilizing Gemini-Pro-Vision 1.5. The results as shown in
Table 1 are analyzed across various frames and in-context learning settings, including zero-shot and few-shot learning, as well as additional strategies like self-ensemble learning and image-augmented learning. We tested the model across different dimensions to evaluate model performance in various tasks for each safety-critical event, including identifying risky scenarios, classifying different scenes, determining car direction, classifying agents, and suggesting correct actions.
5.1. Zero-Shot Learning Results
Zero-shot learning demonstrated a variable performance profile across different metrics and frame counts, as seen in
Figure 6. Initially, a single frame yielded an overall accuracy of 62.4%, with notable performance in detecting the direction of the car (87%) and scene classification (64%). However, as the number of frames increased, overall performance slightly decreased, reaching 56.8% with four frames. The decrease in performance with additional frames suggests potential trade-offs between the depth of context provided and the model’s ability to generalize without prior task-specific examples. The impact of frame count on metrics like agent classification and suggested actions also reflects the model’s challenge in maintaining accuracy across varying contexts.
5.2. Few-Shot Learning Results
Few-shot learning demonstrated a clear trend of improvement with an increasing number of shots. The performance improved progressively from 1-shot to 10-shot scenarios, with the highest overall percentage (71.6%) achieved with 10 shots. This improvement, as seen in
Figure 7, was evident across all metrics, particularly in scene classification and direction of cars, where the highest values were observed with 10-shot learning. The consistency in performance metrics with five-shot and seven-shot suggests that a moderate number of examples already offers substantial benefits.
When comparing zero-shot methods (including one frame, two frames, three frames, and four frames) to few-shot methods, it is evident that few-shot learning consistently outperforms zero-shot learning, as seen in
Figure 8. The bar plots highlight that, with zero-shot methods achieving lower percentages across all metrics. For instance, the “is_risk %” metric showed a significant improvement from 68% in the one-frame zero-shot method to 79% in the 10-shot method. Similarly, “scene classification %” saw an increase from 64% with 1 frame to 81% with 10 shots. The comparison underscores the robustness of few-shot learning in improving model performance across various metrics, showcasing its superiority over zero-shot learning approaches.
5.3. Self-Ensemble Learning Results
Self-ensemble learning provided a relatively stable performance, with slight improvements as the number of candidates increased. The five-candidate configuration yielded the highest overall percentage (64.2%), as illustrated in
Figure 9, showing that aggregating predictions from multiple candidates helped enhance performance. Although the improvements in metrics such as is_risk and scene classification were not drastic, the approach demonstrated increased reliability in hazard detection.
When comparing zero-shot (1-frame) methods to self-ensemble methods (3, 5, 7, 9 candidates), as in
Figure 10, it is clear that self-ensemble methods generally offered better performance. The bar plots show that self-ensemble methods frequently surpassed the zero-shot (1-frame) approach. For instance, the “scene classification %” and “direction of car %” metrics showed noticeable improvements with self-ensemble methods. The three-candidate and five-candidate configurations consistently performed well across these metrics.
The use of self-ensemble methods enhanced the overall metric, which meant a more balanced and robust model performance. The highest overall in self-ensemble methods (64.2% with 5 candidates) still outperformed the zero-shot (1-frame) approach (62.4%). This trend is consistent across other metrics, such as “agent classification %” and “suggestion action %,” where the self-ensemble methods exhibited a slight edge.
While the improvements in individual metrics like “is_risk %” and “scene classification %” were modest, the aggregated gains across all metrics suggest that self-ensemble learning provides a more reliable and effective approach than the zero-shot (1-frame) method. This highlights the value of leveraging multiple candidate predictions to improve the robustness and accuracy of the model’s performance across diverse evaluation criteria.
5.4. Image-Augmented Learning Results
Image-augmented learning with the top-k method resulted in lower overall performance compared to other methods, as seen in
Figure 11, with an overall percentage of 59.0%. The image augmentation approach appeared to have a mixed impact, providing a moderate enhancement in some metrics but falling short in overall accuracy and suggested action classification. This suggests that while image augmentation introduces variability, its effect on overall performance needs further refinement and evaluation.
When comparing the zero-shot (1-frame) method to the image-augmented method across different metrics, as in
Figure 12, it becomes evident that each approach has its strengths and weaknesses. The zero-shot (1-frame) method achieved a higher “is_risk %” (68%) compared to the image-augmented method (67%). Similarly, in “scene classification %,” the zero-shot method performed better (64%) than the image-augmented method (60%).
Few-shot learning, as shown in
Figure 13, consistently outperformed other methodologies across most metrics, with a notable improvement in overall performance as the number of shots increased. Zero-shot learning, while useful, showed decreased performance with an increasing number of frames, indicating that it may benefit from being combined with other methods for optimal results. Self-ensemble learning provided a modest increase in performance and stability, particularly in is_risk and scene classification metrics. Image-augmented learning, although innovative, showed less effectiveness compared to the other methods, suggesting that further exploration and refinement of augmentation techniques are necessary. These results highlight the potential of MLLMs in automated traffic safety event detection and offer insights into optimizing their use for various safety-critical scenarios.
6. Discussion
Across all methods, few-shot learning stands out as the most effective approach for improving overall accuracy and performance in various metrics. The ability to leverage annotated examples allows for significant enhancements in scene classification, direction of the car, and agent classification. This aligns with the general observation that models benefit from specific, contextually relevant examples to improve their predictive capabilities.
Self-ensemble learning provides a robust alternative by stabilizing performance across different candidate predictions, showcasing its strength in minimizing errors and achieving consistent results. This approach is particularly useful in scenarios where model outputs can be uncertain or variable.
Zero-shot learning, while valuable for its generalization capabilities, shows limitations when handling varying frame contexts and specific hazard scenarios. The decrease in performance with additional frames indicates a need for more sophisticated methods to balance context depth with generalization.
Image-augmented learning, although effective in enhancing specific metrics, does not match the overall accuracy of few-shot or self-ensemble learning methods. This suggests that while image augmentation can improve certain aspects of model performance, it may not provide a comprehensive solution for all types of safety-critical event detection.
The results obtained from the proposed framework compared with other baselines that utilize visual-language QA for driving safety are presented in
Table 2.
The comparative performance analysis in the table highlights the differences in how various visual-language QA frameworks perform in the context of driving safety tasks. Each method was tested on different datasets, and the results reveal significant variations in accuracy, reflecting the strengths and limitations of each approach. LLaVA-1.5 is a model that represents an advanced multimodal approach to integrating a vision encoder with an LLM fine-tuned using the VQA-v2 dataset. The model achieved a moderate accuracy of 38.5% in the driving safety context, suggesting that LLaVA-1.5 is not capable of handling all visual-language tasks well. Similarly, Cube-LLM, which was tested using the Talk2Car dataset, also achieved an accuracy of about 38.5%. The moderate performance of both models indicates that they might struggle with the real-time command interpretation in dynamic and mixed driving environments. In the case of SimpleLLM4AD, when tested on the DriveLM-nuScenes dataset, it achieved a significantly higher accuracy of about 66.5%. This suggests that SimpleLLM4AD is better optimized for driving-related tasks and is more able to understand the challenging scenarios that are closer to real-world driving conditions. However, our proposed model outperformed all the abovementioned methods, with an accuracy of about 79%. Our proposed model appears to be able to understand different driving scenarios and extract the contextual information necessary to excel in visual-language tasks related to driving safety. In addition, our MLLM model can also perform various tasks for each safety-critical event, including identifying risky scenarios, classifying different scenes, determining car direction, classifying agents, and suggesting correct actions, which, to the best of our knowledge, is the first model to do so. This performance shows the importance of domain-specific fine-tuning and training. This allows the model to better understand and respond to the unique challenges presented in autonomous driving.
7. Conclusions
The findings underscore the potential of MLLMs to advance the automated analysis of driving videos for traffic safety. The performance of different learning methods highlights the importance of choosing appropriate techniques based on specific detection requirements and available resources. Few-shot learning offers a promising avenue for improving hazard detection accuracy and adaptability in real-world scenarios. The few-shot model consistently outperformed other learning techniques, achieving the highest overall accuracy (about 67%), “is_risk” accuracy (78%), scene classification accuracy (65%), direction of car accuracy (82%), and agent classification accuracy (68%). This demonstrates its superior effectiveness across various tasks.
Future research should explore the integration of these methodologies to leverage their complementary strengths. Combining few-shot learning with self-ensemble or image-augmented techniques might provide a balanced approach that enhances overall performance while addressing the limitations observed in individual methods. Additionally, fine-tuning MLLMs on task-specific data is a crucial area for future investigation. Fine-tuning could enhance model performance by adapting the pre-trained models to the nuances of safety-critical event detection, thus improving accuracy and reliability. Further exploration into optimizing frame selection and processing strategies could help refine model accuracy and efficiency.
Moreover, we plan to incorporate RAG flow in future work. This approach would enable the model to dynamically retrieve and apply relevant information, such as implicit traffic rules, during inference. Incorporating RAG could further enhance the model’s capability in handling complex traffic safety scenarios, making it more robust in detecting and managing safety-critical events. This addition to the future work underscores our commitment to advancing the effectiveness of MLLMs in autonomous driving systems.
Although the DRAMA dataset was constructed in limited geographical locations, our proposed MLLM model was tested using different scenarios, which included a variety of road scenes, such as urban and rural roads, narrow lanes, and intersections. Our proposed framework has demonstrated promising results across these different road conditions, indicating its potential robustness to be scaled and generalized in other geographical locations. Validating the framework’s ability to detect safety-critical traffic events in diverse geographical contexts is indeed a crucial step, and we plan to incorporate this in our future research.
This study demonstrates the value of MLLMs in traffic safety applications and provides a foundation for further exploration and development of automated hazard detection systems. The insights gained from this research can guide the design of more effective and reliable safety-critical event detection frameworks in autonomous driving systems.