Set of experiments: "Multi-Modal Vision and Language Models for Real-Time Emergency Response" system

  • Published 25 Aug 2024
  • In our video, we present a detailed set of experiments with our system from the work titled "Enhancing Ambient Assisted Living: Multi-Modal Vision and Language Models for Real-Time Emergency Response". The system operates in two phases: continuous monitoring and user-model interaction. In the monitoring phase, a camera continuously captures images and YOLOv8 performs human detection to locate individuals in the frame; the captured images are then preprocessed and analyzed on a DGX server by the Large Language and Visual Assistant (LLaVA) model, which performs Visual Question Answering (VQA) to identify initial anomalies in the images.
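    A minimal sketch of this monitoring phase is shown below, assuming the ultralytics and opencv-python packages; query_llava() is a hypothetical placeholder for the request sent to the DGX-hosted LLaVA server, and the prompt is illustrative rather than the exact one used in the system.

```python
# Minimal sketch of the monitoring phase: detect a person with YOLOv8, then ask
# LLaVA (via a hypothetical query_llava() stand-in) whether the scene looks anomalous.
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # any YOLOv8 weights that include the COCO "person" class

def query_llava(image, prompt: str) -> str:
    """Hypothetical stand-in for the VQA call to the LLaVA server."""
    raise NotImplementedError("replace with the project's LLaVA client")

ANOMALY_PROMPT = ("Does the person in this image appear to be experiencing "
                  "a medical emergency? Answer yes or no.")

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Only run VQA once a person has actually been detected in the frame.
    result = detector(frame, verbose=False)[0]
    if any(int(c) == 0 for c in result.boxes.cls):  # COCO class 0 = person
        if "yes" in query_llava(frame, ANOMALY_PROMPT).lower():
            print("Potential emergency detected; switching to the interaction phase")
            break
cap.release()
```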
    If VQA detects an emergency, the system shifts to the interaction phase, where LLaVA dynamically generates context-specific questions. These are converted to speech with the Piper text-to-speech (TTS) model for user interaction. User responses, transcribed by the Whisper speech-to-text (STT) model, help refine LLaVA's assessment of the situation, leading to actions such as generating suggestions, alerting caregivers, or calling for medical help.
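    The interaction loop could look roughly like the sketch below. It assumes a local Piper CLI installation and the openai-whisper package; the voice model, prompts, file paths, and number of clarification rounds are illustrative, not the authors' exact choices.

```python
# Rough sketch of the interaction phase: LLaVA-generated questions are spoken with Piper,
# the user's reply is transcribed with Whisper, and the accumulated context drives the
# final decision. query_llava() is the same hypothetical LLaVA client as in the
# monitoring sketch.
import subprocess
import whisper

stt = whisper.load_model("base")

def query_llava(image, prompt: str) -> str:
    raise NotImplementedError("replace with the project's LLaVA client")

def speak(text: str, wav_path: str = "question.wav") -> None:
    # Piper reads text from stdin; the voice model name is illustrative.
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode(), check=True)
    # Audio playback (e.g. aplay) is omitted; it depends on the deployment hardware.

def interact(image) -> str:
    context = ""
    for _ in range(3):  # a few clarification rounds; the real number is design-dependent
        question = query_llava(
            image,
            f"Given the scene and the answers so far ({context}), "
            "ask one short follow-up question to assess the emergency.")
        speak(question)
        # In the deployed system the user's spoken reply would be recorded here.
        reply = stt.transcribe("user_reply.wav")["text"]
        context += f" Q: {question} A: {reply}"
    return query_llava(
        image,
        f"Based on: {context} Decide whether to give a suggestion, "
        "alert a caregiver, or call for medical help.")
```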
    The system was tested with 24 Nazarbayev University volunteers simulating various scenarios, including emergencies such as a heart attack, fainting, a head injury, a broken leg, and an open wound, as well as everyday activities such as watching TV, reading a book, and sitting with a laptop. Data was collected through video recordings and real-time interaction logs, capturing the images and the system's and user's responses for each scenario.
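    A per-scenario log entry could be structured along the lines below; the field names are assumptions for illustration, not the study's actual logging schema.

```python
# Illustrative record format for the per-scenario interaction logs.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ScenarioLog:
    participant_id: str
    scenario: str                      # e.g. "heart attack" or "reading a book"
    frames: list = field(default_factory=list)            # paths to captured images
    system_questions: list = field(default_factory=list)  # questions generated by LLaVA
    user_answers: list = field(default_factory=list)      # Whisper transcriptions
    decision: str = ""                 # suggestion / caregiver alert / medical call / none
    started_at: datetime = field(default_factory=datetime.now)
```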
    For the qualitative analysis, participants also completed a questionnaire (derived from the NASA Task Load Index and the System Usability Scale) assessing aspects such as system usability, response efficiency, difficulty, complexity, speed, and consistency. Participants also rated the quality of LLaVA's questions and suggestions on a Likert scale. Statistical analysis of the data, including normality tests and Mann-Whitney U tests, was used to determine whether male and female participants' responses about the experiments and their interaction with the system differed. The qualitative analysis showed that the system was well received by the participants, who found it effective, user-friendly, and reliable.
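    The gender comparison described above can be reproduced in outline with SciPy; the score arrays below are placeholder values, not the study's data.

```python
# Sketch of the statistical comparison: check normality, then apply the non-parametric
# Mann-Whitney U test to male vs. female questionnaire scores (placeholder data).
from scipy.stats import shapiro, mannwhitneyu

male_scores = [4, 5, 3, 4, 5, 4, 3, 5]      # hypothetical Likert-scale responses
female_scores = [5, 4, 4, 5, 3, 4, 4, 5]

# Shapiro-Wilk normality tests motivate the choice of a non-parametric test.
print("Shapiro-Wilk (male):  ", shapiro(male_scores))
print("Shapiro-Wilk (female):", shapiro(female_scores))

u_stat, p_value = mannwhitneyu(male_scores, female_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```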
    Quantitative results highlighted the system's high VQA accuracy and efficient response times, with an average time of 154 seconds from emergency detection to decision. Overall, the experiments confirmed the system's effectiveness, achieving a detection accuracy of 93.44%, which improved to 100% with user interaction.
    For more details and access to our system's source code, visit our GitHub repository: github.com/IS2...
