Multimodal Vehicle Type Classification Using Convolutional Neural Network and Statistical Representations of MFCC
Abstract
Recognizing vehicle types in real-life traffic scenarios is challenging due to the diversity of vehicles and the uncontrolled environment. Efficient methods and feature representations are needed to cope with these challenges. In this paper, we address the vehicle type classification problem in real-life traffic scenarios and propose a multimodal method that fuses efficient representations of the audio and visual modalities. We first separate the audio-visual modalities of the video data by extracting keyframes and the corresponding audio fragments. We then extract deep convolutional neural network (CNN) features from the visual modality and Mel-Frequency Cepstral Coefficient (MFCC) features from the audio modality. Principal Component Analysis (PCA) is applied to the visual features, and various statistical representations of the MFCC feature vectors are computed to select representative features. These representations are then fused into a robust multimodal feature. Finally, we train Support Vector Machine (SVM) classifiers on the obtained multimodal features for the final classification of vehicle types. We evaluate the effectiveness of the proposed method on the TRECVID 2012 SIN video dataset for both the single- and multimodal cases. Our results show that fusing the proposed MFCC representations with GoogLeNet CNN features improves classification accuracy.
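The fusion pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature matrices are random stand-ins for real GoogLeNet CNN features and MFCC frames, and the statistical representation shown (per-coefficient mean, standard deviation, minimum, and maximum) is one plausible choice among the "various types" the abstract mentions. The dataset sizes and the PCA dimensionality are likewise hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 videos, 3 vehicle classes.
n_videos, n_classes = 100, 3
labels = rng.integers(0, n_classes, size=n_videos)

# Visual modality: one 1024-d CNN feature vector per video keyframe
# (random placeholder for real GoogLeNet activations).
cnn_feats = rng.normal(size=(n_videos, 1024))

def mfcc_stats(mfcc):
    # One possible statistical representation of an MFCC matrix
    # (frames x coefficients): concatenate per-coefficient
    # mean, std, min, and max into a fixed-length vector.
    return np.concatenate([mfcc.mean(0), mfcc.std(0), mfcc.min(0), mfcc.max(0)])

# Audio modality: per-video MFCC matrices of variable frame count,
# 13 coefficients per frame (random placeholders).
audio_feats = np.stack([
    mfcc_stats(rng.normal(size=(int(rng.integers(50, 200)), 13)))
    for _ in range(n_videos)
])

# PCA reduces the visual features before fusion (32 components is an assumption).
visual_reduced = PCA(n_components=32).fit_transform(cnn_feats)

# Feature-level fusion: concatenate the two modalities, then train an SVM.
fused = np.hstack([visual_reduced, audio_feats])
clf = SVC(kernel="rbf").fit(fused, labels)
print(fused.shape)  # (100, 84): 32 visual + 4 * 13 audio dimensions
```

Concatenation before classification (early fusion) lets a single SVM weigh both modalities jointly; in practice the two feature blocks would typically be normalized to comparable scales before fusing.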