Carnegie Mellon University

11775 - Large-Scale Multi-Media Analysis

Can a robot watch "YouTube" to learn about the world? What makes us laugh? How to bake a cake? Why is Kim Kardashian famous? This 12-unit class covers the fundamentals of computer vision, audio and speech processing, multi-media files and streaming, multi-modal signal processing, video retrieval, semantics, and the generation of text (and possibly speech and music). Instructors will give an overview of relevant recent work and benchmarking efforts (TRECVID, MediaEval, etc.). Students will work on research projects to explore these ideas and learn to perform multi-modal retrieval, summarization, and inference on large amounts of "YouTube"-style data. The experimental environment for the practical part of the course will be provided to students as virtual machines.