• Document: Multimodal Deep Learning
  • Size: 509.1 KB
  • Uploaded: 2018-12-07 01:59:26
  • Status: Successfully converted


Some snippets from your converted document:

Multimodal Deep Learning

Jiquan Ngiam (1) jngiam@cs.stanford.edu, Aditya Khosla (1) aditya86@cs.stanford.edu, Mingyu Kim (1) minkyu89@cs.stanford.edu, Juhan Nam (1) juhan@ccrma.stanford.edu, Honglak Lee (2) honglak@eecs.umich.edu, Andrew Y. Ng (1) ang@cs.stanford.edu

(1) Computer Science Department, Stanford University, Stanford, CA 94305, USA
(2) Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109, USA

Abstract

Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating the best published visual speech classification on AVLetters and effective shared representation learning.

1. Introduction

In speech recognition, humans are known to integrate audio-visual information in order to understand speech. This was first exemplified in the McGurk effect (McGurk & MacDonald, 1976), where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects. In particular, the visual modality provides information on the place of articulation and muscle movements (Summerfield, 1992), which can often help to disambiguate between speech with similar acoustics (e.g., the unvoiced consonants /p/ and /k/).

Multimodal learning involves relating information from multiple sources. For example, images and 3-d depth scans are correlated at first order, as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have correlations at a "mid-level", as phonemes and visemes (lip pose and motions); it can be difficult to relate raw pixels to audio waveforms or spectrograms.

In this paper, we are interested in modeling "mid-level" relationships; thus we choose to use audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio which are coupled with videos of the lips.

We will consider the learning settings shown in Figure 1. The overall task can be divided into three phases: feature learning, supervised training, and testing. A simple linear classifier is used for supervised training and testing to examine different feature learning models with multimodal data. In particular, we consider three learning settings: multimodal fusion, cross modality learning, and shared representation learning.

In the multimodal fusion setting, data from all modalities is available at all phases; this represents the typical setting considered in most prior work in audio-visual speech recognition (Potamianos et al., 2004). In cross modality learning, data from multiple modalities is available only during feature learning; during the supervised training and testing phase, only data from a single modality is provided. For this setting, the aim ...
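To make the shared representation setting described in the snippet concrete, here is a minimal Python sketch of the evaluation protocol it outlines: features are learned over both modalities, a simple linear classifier is trained on audio-only features, and the same classifier is tested on video-only features. The SharedEncoder class, the random toy data, and all dimensions are illustrative placeholders I am assuming for the sketch, not the paper's deep network or the CUAVE/AVLetters data.

```python
import numpy as np
from sklearn.svm import LinearSVC


class SharedEncoder:
    """Placeholder for a multimodal feature learner (an assumption, not the
    paper's model): each modality is projected through a fixed random map
    into a common feature space so the script runs end to end."""

    def __init__(self, dim_audio, dim_video, dim_shared, seed=0):
        rng = np.random.default_rng(seed)
        self.W_audio = rng.standard_normal((dim_audio, dim_shared))
        self.W_video = rng.standard_normal((dim_video, dim_shared))

    def encode_audio(self, X):
        return np.tanh(X @ self.W_audio)

    def encode_video(self, X):
        return np.tanh(X @ self.W_video)


# Toy stand-ins for audio/video feature vectors and class labels.
rng = np.random.default_rng(1)
n, d_audio, d_video, n_classes = 200, 100, 60, 10
X_audio = rng.standard_normal((n, d_audio))
X_video = rng.standard_normal((n, d_video))
y = rng.integers(0, n_classes, size=n)

encoder = SharedEncoder(d_audio, d_video, dim_shared=32)

# Shared representation setting: the classifier is trained with
# audio-only data but tested with video-only data.
clf = LinearSVC().fit(encoder.encode_audio(X_audio), y)
print("audio-trained / video-tested accuracy:",
      clf.score(encoder.encode_video(X_video), y))
```

With a learned shared encoder in place of the random projection, the same few lines cover the reverse direction (train on video, test on audio) and the cross modality setting, in which only one modality is used after feature learning.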
