UTCN has designed, within AIRi@UTCN, a computing infrastructure consisting of 32 GPU Node servers (each with 8 GPUs), along with control servers, storage, and networking equipment, arranged in up to 6 racks with cooling based on RDHx (Rear Door Heat Exchanger) and DLC (Direct Liquid Cooling).
This infrastructure will be brought into operation gradually, in step with the development of AI in the local ecosystem and the growing maturity of AI-based systems developed at UTCN.
📊 Overview
Stage 1 is currently being implemented through the acquisition of the first 4 servers under the project “Romanian Artificial Intelligence Hub – HRIA,” part of the Smart Growth, Digitalization, and Financial Instruments Program 2021–2027 (PoCIDIF), funded by the European Regional Development Fund (ERDF), SMIS code: 334906.
Stage 2 aims to equip the computing center with an additional 8 servers.
By 2026, UTCN’s computing center will integrate an advanced 4-server architecture, with plans to expand by an additional 8 GPU servers, creating one of the most powerful AI research infrastructures in the region.
🔑 Key Features of the Initial 4-Server Setup
⚡
High-Performance GPU Nodes with Liquid Cooling
Four GPU nodes, each equipped with cutting-edge accelerators and cooled using Direct Liquid Cooling (DLC) technology, ensure maximum efficiency and stability during the most demanding AI training workloads.
⚡
Secure Access through a Dedicated Login Node
Researchers connect to the cluster via a secure login node. This setup enhances cybersecurity by isolating user access from the compute infrastructure and supports modern security measures such as multi-factor authentication and key-based SSH access.
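In practice, key-based SSH access of this kind typically involves a client-side setup like the one below. This is a minimal sketch only: the host alias, hostname `login.airi.utcn.ro`, and username are illustrative placeholders, not confirmed cluster addresses.

```bash
# Generate a modern key pair on the researcher's workstation
ssh-keygen -t ed25519 -C "researcher@utcn"
```

A matching `~/.ssh/config` entry (again, names are hypothetical) keeps the connection details in one place:

```
Host airi-login
    HostName login.airi.utcn.ro    # placeholder address
    User researcher                # placeholder username
    IdentityFile ~/.ssh/id_ed25519
    # Compute nodes are reached only via the login node, never directly
```

With this in place, `ssh airi-login` opens a session on the login node.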
⚡
Smart Workload Management
A central controller node powered by Slurm scheduling software manages job queues, monitors resources, and optimizes task distribution across the cluster.
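A job submitted to a Slurm-managed cluster of this kind typically looks roughly as follows. This is a sketch under assumed defaults: the partition name `gpu`, the resource limits, and `train.py` are illustrative, not the cluster's actual configuration.

```bash
#!/bin/bash
#SBATCH --job-name=train-model   # name shown in the queue
#SBATCH --partition=gpu          # hypothetical GPU partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8             # request all 8 GPUs on one node
#SBATCH --time=48:00:00          # wall-clock limit for a long training run
#SBATCH --output=%x-%j.out       # log file named from job name + job ID

srun python train.py             # launch the training script under Slurm
```

The script is submitted with `sbatch`, and `squeue` shows its position in the queue; Slurm then places the job on a free GPU node according to the requested resources.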
⚡
Centralized System Administration
A dedicated management node ensures smooth operation of the infrastructure through orchestration services, telemetry sensors, and real-time health monitoring.
⚡
Scalable Data Storage Platform
A high-performance, distributed file system enables seamless access to large datasets and can expand transparently with new storage nodes, which is critical for data-intensive AI projects.
⚡
High-Speed Networking
With 400 Gbps low-latency interconnects, the cluster provides the bandwidth required for advanced parallel computing.
⚡
Efficient Cooling & Sustainable Operation
The system employs Rear Door Heat Exchanger (RDHx) and DLC technologies to remove heat efficiently. A chiller unit with a free-cooling mode reduces energy use by leveraging outside air during colder seasons.
⚡
Reliability through UPS Protection
An Uninterruptible Power Supply (UPS) safeguards sensitive equipment against power fluctuations and outages, ensuring continuity of long AI training sessions that may last days or weeks.
🔮 Looking Ahead
This modular architecture will be expanded with 8 additional GPU servers, boosting computing power and enabling larger, more complex AI models to be developed within the AIRi@UTCN ecosystem.