Multimodal Video Concept Classification Based on Convolutional Neural Network and Audio Feature Combination
Abstract
Video concept classification is an important task for several applications such as content-based video indexing and search. In this study, we propose a multimodal video classification method based on feature-level fusion of audiovisual signals. In the proposed method, we extract Mel-Frequency Cepstral Coefficient (MFCC) features and convolutional neural network (CNN) features from the audio and visual parts of the video signal, respectively, and compute three statistical representations of the MFCC feature vectors. We perform feature-level fusion of the two modalities using the concatenation operator and train Support Vector Machine (SVM) classifiers on these multimodal features. We evaluate the effectiveness of the proposed method on the TRECVID dataset for both single-modal and multimodal cases. Our results show that fusing the standard-deviation representation of the audio modality with the GoogLeNet CNN features improves classification accuracy.
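The fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the three MFCC statistics, so mean, standard deviation, and median are assumptions, and the toy feature dimensions are placeholders.

```python
import statistics

def mfcc_statistics(frames):
    """Collapse per-frame MFCC vectors (n_frames x n_coeffs) into
    fixed-length statistics over time.

    The choice of mean, standard deviation, and median is an assumption;
    the paper only states that three statistical representations are used.
    """
    cols = list(zip(*frames))  # one tuple of values per MFCC coefficient
    mean = [statistics.fmean(c) for c in cols]
    std = [statistics.pstdev(c) for c in cols]
    median = [statistics.median(c) for c in cols]
    return mean, std, median

def fuse(audio_vec, visual_vec):
    """Feature-level fusion via the concatenation operator."""
    return list(audio_vec) + list(visual_vec)

# Toy example: 3 frames of 2-dim "MFCCs" and a 4-dim "CNN" feature vector.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean, std, median = mfcc_statistics(frames)
fused = fuse(std, [0.1, 0.2, 0.3, 0.4])  # multimodal vector fed to the SVM
```

In practice, the audio statistics would be computed over the MFCC frames of a whole video clip, and the fused vector would be used to train an SVM classifier per concept.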