InfoSafeAI – Multimodal Platform for Urban Public Safety: Proposed Technologies and Algorithms for Multimodal Analysis

Region

National

Phase

Planned

Partners

1 partner

Abstract

The InfoSafeAI project will integrate a coherent set of technologies and information-processing methods from multiple sources (video, image, voice, text) for the purpose of detecting, classifying, and correlating events relevant to public safety, in a non-intrusive, explainable manner compliant with the AI Act and GDPR.

This project is part of the AI4INO initiative, which aims to accelerate the digital transformation of the North-West region by creating an infrastructure equipped with high-performance computing equipment and specialized hardware and software modules for developing AI-based solutions, while facilitating effective collaboration between research organizations and SMEs through the transfer of knowledge, resources, and expertise.

1. Video and image analysis

Object detection & tracking: YOLOv8 and Detectron2 algorithms, together with semantic segmentation using the Segment Anything Model (SAM), for detecting objects, vehicles, abandoned luggage, and unusual crowding (see the sketch below).
Video anomaly detection: LSTM and 3D-CNN architectures (e.g., I3D – Inflated 3D ConvNets) for identifying abnormal behavior.
Drone video analytics: video streams captured by drones, processed in a pipeline optimized with TensorRT and OpenVINO for real-time inference.
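As an illustration of the detection and tracking stage, here is a minimal sketch assuming the ultralytics YOLOv8 API; the stream URL, confidence threshold, and the class list used for flagging are hypothetical placeholders, not project values.

```python
# Minimal sketch of object detection & tracking with YOLOv8 (ultralytics).
# The RTSP URL and the flagged class set are illustrative assumptions.
from ultralytics import YOLO

# Load a pretrained YOLOv8 model (nano variant shown for brevity).
model = YOLO("yolov8n.pt")

# Run detection with built-in tracking on a video stream; stream=True
# yields results frame by frame instead of buffering the whole video.
for result in model.track(source="rtsp://camera.example/stream", stream=True, conf=0.4):
    for box in result.boxes:
        cls_name = model.names[int(box.cls)]
        # Flag classes plausibly relevant to abandoned-luggage alerts.
        if cls_name in {"suitcase", "backpack", "handbag"}:
            print(f"track id={box.id}, class={cls_name}, conf={float(box.conf):.2f}")
```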

2. Audio and voice analysis

Event detection: wav2vec 2.0 and PANNs (Pretrained Audio Neural Networks) models for detecting specific sounds (shouts, explosions, breaking glass).
Speech-to-text: a Whisper model fine-tuned for Romanian and English, with on-premises processing (a transcription sketch follows below).
Speaker diarization: integration of pyannote.audio to separate voices and associate them with events.
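A minimal sketch of the on-premises speech-to-text step, assuming the open-source openai-whisper package; the file name and model size are illustrative, and since the project's fine-tuned Romanian checkpoint is not public, the stock multilingual model stands in for it here.

```python
# Minimal sketch of local Whisper transcription; "clip.wav" is a placeholder.
import whisper

# Load the multilingual model locally (no audio leaves the machine).
model = whisper.load_model("small")

# Transcribe a Romanian audio clip; language="en" works the same way.
result = model.transcribe("clip.wav", language="ro")
print(result["text"])

# Segment-level timestamps help align transcripts with video/audio events.
for seg in result["segments"]:
    print(f'{seg["start"]:.1f}s - {seg["end"]:.1f}s: {seg["text"]}')
```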

3. Text analysis and language processing

NLP for generated content: automatic classification of text messages (alert, report, status) using RoBERTa/BERT models adapted to a local corpus.
Named Entity Recognition (NER): identification of entities in reports, social media posts, or audio transcripts (see the sketch after this list).
Multi-source correlation: a RAG (Retrieval-Augmented Generation) system for linking information from text streams with video/audio metadata.
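The sketch below shows message classification and NER with Hugging Face transformers pipelines; the public checkpoints named are stand-ins for the project's locally adapted RoBERTa/BERT models, and the alert/report/status labels would normally come from fine-tuning rather than zero-shot inference.

```python
# Minimal sketch of message classification and NER; model names are public
# stand-ins, not the project's locally adapted checkpoints.
from transformers import pipeline

# Zero-shot classification as a stand-in for the fine-tuned classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
msg = "Explosion reported near the central station, units dispatched."
print(classifier(msg, candidate_labels=["alert", "report", "status"]))

# NER over the same message; aggregation_strategy merges word pieces
# back into whole entity spans.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner(msg))
```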

4. Multimodal correlation and explanations

Multimodal fusion: CLIP and FLAVA models for linking meaning across image and text (a CLIP matching sketch follows below).
Graph-based event correlation: integration of data into an event graph (Neo4j or TigerGraph) to highlight relationships among entities and actions.
Explainable AI (XAI): SHAP and LIME to explain model decisions, displayed directly in the interface.
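To make the fusion step concrete, here is a minimal sketch of CLIP-based image-text matching using the Hugging Face transformers CLIP implementation; the frame path and candidate captions are illustrative placeholders.

```python
# Minimal sketch of CLIP image-text matching; the image path and captions
# are illustrative assumptions, not project data.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")  # e.g., a frame extracted from a video feed
captions = ["a crowded square", "an abandoned suitcase", "an empty street"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```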

Partners

Collaborating organizations on this project.

ZA CLOUD