Voxel-Grid Based Deep Learning for Robust People Counting and Tracking with Event-Based Vision Sensors


Alahmad R., Zhou Z., Albaroudi M., Alraee A., Alraie H., Yasukawa S.

31st International Conference on Artificial Life and Robotics, ICAROB 2026, Oita, Japan, 29 January - 01 February 2026, pp.173-178.

  • Publication Type: Conference Paper / Full Text
  • City: Oita
  • Country: Japan
  • Page Numbers: pp.173-178
  • Keywords: Deep learning, Event-based vision, People counting, Real-time tracking, Voxel-grid
  • Middle East Technical University Northern Cyprus Campus Affiliated: Yes

Abstract

Conventional frame-based vision systems for people counting often fail in environments with high-speed motion, extreme lighting conditions, or strict privacy requirements. Event-based vision sensors (EVS) offer a promising alternative by asynchronously capturing pixel-level brightness changes with microsecond latency and a high dynamic range. However, the sparse and asynchronous nature of event data necessitates specialized processing architectures. This study proposes an end-to-end, fully event-driven pipeline for robust people counting. A voxel-grid representation was utilized to convert raw event streams into structured tensors that preserve temporal dynamics. A lightweight sliding-window convolutional neural network (CNN) was then employed for real-time patch classification, coupled with a ByteTrack-style association method to ensure stable trajectory maintenance. To overcome the high cost of manual annotation in event-based vision, we introduced an automated ground-truth generation method based on center-point expansion and a specialized evaluation metric with temporal tolerance. Experimental results demonstrate that the proposed system achieves stable localization and tracking in real-world scenarios, making it suitable for privacy-preserving monitoring on edge-computing devices.
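The paper does not include implementation details, but the voxel-grid conversion it describes is a standard technique: each event (x, y, t, p) deposits its polarity into a fixed number of temporal bins, with bilinear interpolation in time so that temporal dynamics are preserved in the resulting tensor. A minimal sketch of that idea (the function name, array layout, and bin count are illustrative assumptions, not the authors' code):

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Convert an event array of shape (N, 4) with columns [x, y, t, p]
    into a (num_bins, height, width) voxel grid, spreading each event's
    polarity over the two nearest temporal bins (bilinear in time)."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)  # signed polarity

    # Normalize timestamps into [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    left = np.floor(t_norm).astype(np.int64)
    right = np.minimum(left + 1, num_bins - 1)
    w_right = t_norm - left          # weight for the later bin
    w_left = 1.0 - w_right           # weight for the earlier bin

    # Scatter-add polarities into the two adjacent temporal bins.
    np.add.at(grid, (left, y, x), p * w_left)
    np.add.at(grid, (right, y, x), p * w_right)
    return grid
```

A window of this tensor can then be fed to the sliding-window CNN classifier the abstract describes, with each temporal bin acting as an input channel.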