Learning to Represent and Recognize Multimodal Videos