CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

Thu, 14 Mar 2024

Yiming Ma

Victor Sanchez

Tanaya Guha



Abstract

The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting, a regression task, into a recognition task. In this paper, we investigate CLIP’s potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. In particular, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on the ShanghaiTech Part A and Part B datasets, respectively. The code will be made publicly available.
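To make the idea of blockwise classification with integer-valued bins more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the block size or the bin edges, so the `BINS` list, the block size of 8, and the helper names `blockwise_counts` and `count_to_bin` are all assumptions chosen for illustration. The sketch turns point annotations (one per head) into per-block counts and then into class labels that a classifier could be trained to predict.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical integer-valued bins (assumed, not taken from the paper).
# Each bin is an inclusive (low, high) range of per-block counts;
# high=None marks an open-ended last bin.
BINS: List[Tuple[int, Optional[int]]] = [(0, 0), (1, 1), (2, 2), (3, 3), (4, None)]


def blockwise_counts(
    points: Sequence[Tuple[float, float]], height: int, width: int, block: int
) -> List[List[int]]:
    """Count annotated points (x, y) falling inside each block x block cell."""
    rows = (height + block - 1) // block  # ceiling division
    cols = (width + block - 1) // block
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        # Clamp so points on the bottom/right border land in the last cell.
        r = min(int(y) // block, rows - 1)
        c = min(int(x) // block, cols - 1)
        counts[r][c] += 1
    return counts


def count_to_bin(count: int, bins: Sequence[Tuple[int, Optional[int]]] = BINS) -> int:
    """Map an integer block count to the index of the bin that contains it."""
    for idx, (low, high) in enumerate(bins):
        if count >= low and (high is None or count <= high):
            return idx
    raise ValueError(f"count {count} is not covered by any bin")


# Toy example: a 16x16 image with 8 annotated heads, split into 8x8 blocks.
points = [(1, 1), (2, 3), (9, 1), (12, 12), (13, 13), (14, 14), (15, 15), (12, 15)]
counts = blockwise_counts(points, height=16, width=16, block=8)
labels = [[count_to_bin(c) for c in row] for row in counts]

print(counts)  # per-block ground-truth counts
print(labels)  # per-block class labels (bin indices)
```

Because every bin boundary here falls on an integer, each possible block count belongs to exactly one class, which is the property the abstract attributes to integer-valued bins. Summing the per-block counts recovers the image-level count used when computing a metric such as mean absolute error.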

Type

Preprint

Last updated on Thu, 14 Mar 2024

Crowd Counting CLIP Multimodality

Authors

Yiming Ma

PhD Candidate
