31st International Conference on Artificial Life and Robotics, ICAROB 2026, Oita, Japan, 29 January - 01 February 2026, pp. 173-178, (Full Text Paper)
Conventional frame-based vision systems for people counting often fail in environments with high-speed motion, extreme lighting conditions, or strict privacy requirements. Event-based vision sensors (EVS) offer a promising alternative by asynchronously capturing pixel-level brightness changes with microsecond latency and high dynamic range. However, the sparse and asynchronous nature of event data necessitates specialized processing architectures. This study proposes an end-to-end, fully event-driven pipeline for robust people counting. A voxel-grid representation was utilized to convert raw event streams into structured tensors that preserve temporal dynamics. A lightweight sliding-window convolutional neural network (CNN) was then employed for real-time patch classification, coupled with a ByteTrack-style association method to ensure stable trajectory maintenance. To overcome the high cost of manual annotation in event-based vision, we introduced an automated ground-truth generation method based on center-point expansion and a specialized evaluation metric with temporal tolerance. Experimental results demonstrate that the proposed system achieves stable localization and tracking in real-world scenarios, making it suitable for privacy-preserving monitoring on edge-computing devices.
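The voxel-grid step mentioned above can be illustrated with a minimal sketch. It assumes events arrive as (x, y, t, p) rows and uses bilinear interpolation over temporal bins, which is the common formulation of this representation; the function name and exact details are illustrative assumptions, not necessarily the authors' implementation.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Convert an event stream of shape (N, 4) with rows [x, y, t, p]
    into a (num_bins, height, width) voxel grid. Each event's polarity
    is split between its two nearest temporal bins (bilinear in time)."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)  # ON -> +1, OFF -> -1

    # Normalize timestamps into [0, num_bins - 1].
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    left = np.floor(t).astype(np.int64)          # lower temporal bin
    right = np.clip(left + 1, 0, num_bins - 1)   # upper temporal bin
    frac = t - left                              # interpolation weight

    # Unbuffered scatter-add so repeated (bin, y, x) indices accumulate.
    np.add.at(voxel, (left, y, x), p * (1.0 - frac))
    np.add.at(voxel, (right, y, x), p * frac)
    return voxel
```

The resulting tensor can be fed directly to a CNN, since it retains both spatial layout and coarse temporal ordering of the event stream.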