InfoSafeAI – Multimodal Platform for Urban Public Safety: Proposed Technologies and Algorithms for Multimodal Analysis

Region

National

Phase

Planned

Partners

1 partner

Abstract

The InfoSafeAI project will integrate a coherent set of technologies and information-processing methods from multiple sources (video, image, voice, text) for the purpose of detecting, classifying, and correlating events relevant to public safety, in a non-intrusive, explainable manner compliant with the AI Act and GDPR.

This project is part of the AI4INO initiative, which aims to accelerate the digital transformation of the North-West region by creating an infrastructure equipped with high-performance computing equipment and specialized hardware and software modules for developing AI-based solutions, while facilitating effective collaboration between research organizations and SMEs through the transfer of knowledge, resources, and expertise.

1. Video and image analysis

Object detection & tracking: YOLOv8 and Detectron2 algorithms, together with semantic segmentation using the Segment Anything Model (SAM), for detecting objects, vehicles, abandoned luggage, and unusual crowding (see the sketch below).
Video anomaly detection: LSTM and 3D-CNN architectures (e.g., I3D – Inflated 3D ConvNets) for identifying abnormal behavior.
Drone video analytics: video streams captured by drones, processed in a pipeline optimized with TensorRT and OpenVINO for real-time inference.
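As an illustration of the detection and tracking stage, here is a minimal sketch assuming the ultralytics YOLOv8 API; the stream URL, confidence threshold, and the class list used for flagging are hypothetical placeholders, not project values.

```python
# Minimal sketch of object detection & tracking with YOLOv8 (ultralytics).
# The RTSP URL and the flagged class set are illustrative assumptions.
from ultralytics import YOLO

# Load a pretrained YOLOv8 model (nano variant shown for brevity).
model = YOLO("yolov8n.pt")

# Run detection with built-in tracking on a video stream; stream=True
# yields results frame by frame instead of buffering the whole video.
for result in model.track(source="rtsp://camera.example/stream", stream=True, conf=0.4):
    for box in result.boxes:
        cls_name = model.names[int(box.cls)]
        # Flag classes plausibly relevant to abandoned-luggage alerts.
        if cls_name in {"suitcase", "backpack", "handbag"}:
            print(f"track id={box.id}, class={cls_name}, conf={float(box.conf):.2f}")
```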

2. Audio and voice analysis

Event detection: wav2vec 2.0 and PANNs (Pretrained Audio Neural Networks) models for detecting specific sounds (shouts, explosions, breaking glass).
Speech-to-text: a Whisper model fine-tuned for Romanian and English, with on-premises processing (a transcription sketch follows below).
Speaker diarization: integration of pyannote.audio to separate voices and associate them with events.
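A minimal sketch of the on-premises speech-to-text step, assuming the open-source openai-whisper package; the file name and model size are illustrative, and since the project's fine-tuned Romanian checkpoint is not public, the stock multilingual model stands in for it here.

```python
# Minimal sketch of local Whisper transcription; "clip.wav" is a placeholder.
import whisper

# Load the multilingual model locally (no audio leaves the machine).
model = whisper.load_model("small")

# Transcribe a Romanian audio clip; language="en" works the same way.
result = model.transcribe("clip.wav", language="ro")
print(result["text"])

# Segment-level timestamps help align transcripts with video/audio events.
for seg in result["segments"]:
    print(f'{seg["start"]:.1f}s - {seg["end"]:.1f}s: {seg["text"]}')
```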

3. Text analysis and language processing

NLP for generated content: automatic classification of text messages (alert, report, status) using RoBERTa/BERT models adapted to a local corpus.
Named Entity Recognition (NER): identification of entities in reports, social media posts, or audio transcripts (see the sketch after this list).
Multi-source correlation: a RAG (Retrieval-Augmented Generation) system for linking information from text streams with video/audio metadata.
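The sketch below shows message classification and NER with Hugging Face transformers pipelines; the public checkpoints named are stand-ins for the project's locally adapted RoBERTa/BERT models, and the alert/report/status labels would normally come from fine-tuning rather than zero-shot inference.

```python
# Minimal sketch of message classification and NER; model names are public
# stand-ins, not the project's locally adapted checkpoints.
from transformers import pipeline

# Zero-shot classification as a stand-in for the fine-tuned classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
msg = "Explosion reported near the central station, units dispatched."
print(classifier(msg, candidate_labels=["alert", "report", "status"]))

# NER over the same message; aggregation_strategy merges word pieces
# back into whole entity spans.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner(msg))
```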

4. Multimodal correlation and explanations

Multimodal fusion: CLIP and FLAVA models for linking meaning across image and text (a CLIP matching sketch follows below).
Graph-based event correlation: integration of data into an event graph (Neo4j or TigerGraph) to highlight relationships among entities and actions.
Explainable AI (XAI): SHAP and LIME to explain model decisions, displayed directly in the interface.
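To make the fusion step concrete, here is a minimal sketch of CLIP-based image-text matching using the Hugging Face transformers CLIP implementation; the frame path and candidate captions are illustrative placeholders.

```python
# Minimal sketch of CLIP image-text matching; the image path and captions
# are illustrative assumptions, not project data.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")  # e.g., a frame extracted from a video feed
captions = ["a crowded square", "an abandoned suitcase", "an empty street"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```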

Partners

Collaborating organizations on this project.

ZA CLOUD