👋 Hi there! I’m Yiming Ma (马一铭 in Chinese). As a PhD candidate at the MathSys CDT at the University of Warwick, my research focuses on multimodality in computer vision and its applications (e.g., in crowd counting and driver monitoring systems). I am passionate about bridging mathematics and deep learning to solve real-world problems.
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle’s interior scene, and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework, SuMoCo, to learn better representations. Furthermore, we provide fine-grained annotations for the test split of the DAD dataset to enable multi-class recognition of drivers’ activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.
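The core idea of feature-level fusion through MHSA can be sketched as follows: features from each view/modality are treated as tokens, attention is computed across the source axis, and the attended tokens are pooled into a single fused representation. This is a minimal NumPy illustration with random matrices standing in for learned projection weights; function names (`mhsa_fuse`) and all shapes are hypothetical and do not reflect the actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_fuse(features, num_heads=4, seed=0):
    """Fuse S source feature vectors (S, D) into one (D,) vector via
    multi-head self-attention over the source axis, then mean-pooling.
    Random projections stand in for learned weights (illustration only)."""
    S, D = features.shape
    assert D % num_heads == 0
    dh = D // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    # Project and split into heads: (S, D) -> (H, S, dh)
    Q = (features @ Wq).reshape(S, num_heads, dh).transpose(1, 0, 2)
    K = (features @ Wk).reshape(S, num_heads, dh).transpose(1, 0, 2)
    V = (features @ Wv).reshape(S, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention across the S sources: (H, S, S)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    out = (attn @ V).transpose(1, 0, 2).reshape(S, D)  # back to (S, D)
    return out.mean(axis=0)  # pool over sources -> fused vector (D,)

# e.g. 3 views x 2 modalities = 6 source tokens of dimension 64
tokens = np.random.default_rng(1).standard_normal((6, 64))
fused = mhsa_fuse(tokens)
print(fused.shape)  # (64,)
```

Because attention weights are computed jointly over all sources, each view/modality can contribute adaptively; zeroing out a token (analogous to the patch masking mentioned above) leaves the remaining sources to carry the prediction.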
State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components to extract multiscale features, which are the input to the decoder to generate crowd densities. However, in these methods, features extracted at earlier stages during encoding are underutilised, and the multiscale modules can only capture a limited range of receptive fields, albeit with considerable computational cost. This paper proposes a novel crowd counting architecture (FusionCount), which exploits the adaptive fusion of a large majority of encoded features instead of relying on additional extraction components to obtain multiscale features. Thus, it can cover a more extensive scope of receptive field sizes and lower the computational cost. We also introduce a new channel reduction block, which can extract saliency information during decoding and further enhance the model’s performance. Experiments on two benchmark databases demonstrate that our model achieves state-of-the-art results with reduced computational complexity. PyTorch implementation of the model and weights trained on these two datasets are available at https://github.com/YimingMa/FusionCount.
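The fusion principle described above can be illustrated with a toy example: a shallow, high-resolution feature map is combined with an upsampled deeper map, gated by channel-wise weights derived from global average pooling. This is a simplified NumPy sketch only; the names (`adaptive_fuse`, `upsample2x`) are hypothetical, and the real FusionCount blocks use learned convolutional parameters rather than this parameter-free gating.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def adaptive_fuse(shallow, deep):
    """Fuse a shallow (C, H, W) map with a deeper (C, H/2, W/2) map.
    Channel weights from global average pooling gate the upsampled deep
    features before summation -- a parameter-free stand-in for a learned
    fusion block (illustration only)."""
    deep_up = upsample2x(deep)
    gate = sigmoid(deep_up.mean(axis=(1, 2)))        # (C,) channel weights
    return shallow + gate[:, None, None] * deep_up   # (C, H, W)

shallow = np.ones((8, 16, 16))   # earlier-stage encoder features
deep = np.ones((8, 8, 8))        # later-stage, lower-resolution features
fused = adaptive_fuse(shallow, deep)
print(fused.shape)  # (8, 16, 16)
```

Reusing features already computed by the encoder in this way widens the effective range of receptive fields without the cost of dedicated multiscale extraction modules.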
This research project aims to build an intelligent interior sensing system for monitoring and recognising drivers’ activities from heterogeneous multi-view, multimodal data, encompassing video, audio, heart rate, and driving trends. Efficient deep learning models will be developed to combine information from disparate modalities, detect relevant activities, and accurately classify each detected activity. Additionally, the project will address several critical research questions.
Responsibilities include: