Making Video as Accessible as Text

Research Area(s): Multimodal Computing and Interaction
This research seeks to enable machine understanding of video and associated multimedia content, structure, and semantics.

Alex Hauptmann

Video content analysis requires new approaches across a variety of domains: large-scale content-based search and retrieval of video information, including question answering and trend detection; real-time surveillance event analysis; social behavior understanding; multimedia-enabled coaching; and the construction of a multimedia “knowledge graph”.

The effort combines research and systems building at the intersection of audio, speech, image, and video analysis, coupled with language processing, information retrieval, machine learning, and human-machine communication. Ultimately, we want to advance our ability to make use of all types of multimedia data at massive scale, matching its rate of production in the commercial and Internet worlds.

This research informs our understanding of how people perceive, remember, search, and summarize multimedia content, and ultimately provides insight into how they analyze and structure it. The research multiplies its impact through tools and systems that enable practical applications not yet conceived, exploiting multimedia and its temporal dimension.

Case Study: Event Reconstruction

What happened during the Boston Marathon in 2013?

At any major event, many people take videos and share them on social media. When it is necessary to fully understand exactly what happened at such an event, researchers and analysts often have to examine thousands of these videos manually. To reduce this manual effort, we developed an investigative system that automatically synchronizes the videos to a global timeline and localizes them on a map. In addition to alignment in time and space, the system combines several analysis functions, including gunshot detection, crowd size estimation, 3D reconstruction, and person tracking. This is the first unified framework for comprehensive event reconstruction from social media videos.
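
One common cue for placing independently recorded clips on a shared timeline is the audio track: clips of the same moment contain the same sounds, so their relative offset can be estimated by cross-correlation. The sketch below illustrates this idea only; the published system combines multiple cues, and the function and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def estimate_offset(audio_a, audio_b, sample_rate):
    """Estimate how many seconds audio_a lags behind audio_b by
    finding the lag that maximizes their cross-correlation."""
    corr = np.correlate(audio_a, audio_b, mode="full")
    lag = np.argmax(corr) - (len(audio_b) - 1)
    return lag / sample_rate

# Toy example: clip_b is clip_a delayed by 100 samples (0.1 s at 1 kHz).
rate = 1000  # Hz
rng = np.random.default_rng(0)
clip_a = rng.standard_normal(2000)
clip_b = np.concatenate([np.zeros(100), clip_a])[:2000]
offset = estimate_offset(clip_b, clip_a, rate)  # ~0.1 s
```

With every pairwise offset estimated this way, clips can be chained onto one global timeline.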

For More Information:
Junwei Liang, Desai Fan, Han Lu, Poyao Huang, Jia Chen, Lu Jiang, Alexander Hauptmann. An Event Reconstruction Tool for Conflict Monitoring Using Social Media. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 

Video: CMU Event Reconstruction Tool

Case Study: Cognitive Assistance for Medical Device Use

The Asthma Inhaler Coach

This project has developed several systems that observe individuals using a medical device such as an infusion pump or asthma inhaler. In real time, the systems detect usage errors and provide coaching assistance to redirect the patient. The core research insight was to use video and audio information, together with a predefined script of proper use, to learn correct sequences of fine-grained actions and to detect deviations. The key is the use of multiple modalities (audio and video) in a constrained task, where correct actions are well trained, deviations can be recognized in real time, and feedback is provided immediately.

We developed an automated method to observe and monitor patients using metered-dose inhalers and to coach them in proper inhaler use as appropriate. Observation relies on an intelligent camera system that identifies incorrect actions; coaching is performed by an interactive system designed to reinforce good treatment outcomes. The system applies multimedia analysis to multiple modalities, including RGB, depth, and audio data collected with a mobile sensing system (e.g., a Kinect), to discover and bridge the gap between the steps prescribed by doctors and patients’ actual daily usage. A rule-based joint classification method then provides personalized feedback to coach patients, in real time, to improve their performance. Input sources include, but are not limited to, camera-based (e.g., Kinect), audio, and infrared (IR) data.

An automated system such as this saves cost and has the potential to be widely used. It is also more convenient, since patients decide when and where to use it to learn and reinforce proper inhaler technique. We have also explored applications to other home medical devices, such as home infusion pumps and home dialysis machines.

Demo Video

Case Study: Marauder’s Map for Long Term Monitoring of Elderly Patients

Surveillance cameras are widely used, yet there is a lack of automated methods to understand what is happening in the footage. Our group has worked for many years on long-term surveillance in healthcare environments to improve patient care. A recent implementation tracked 15 patients in a nursing home using 15 cameras, around the clock, for 25 days, and produced summaries of each patient’s walking, eating, sleeping, and social interactions. The key insight was to combine highly accurate tracking and facial recognition with a 3D reconstruction of the environment and activity analysis. Project Website
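
Once tracks are labeled with an identity and an activity, the patient summaries reduce to aggregating time spent per activity. The sketch below shows that aggregation step only, with a hypothetical segment format; the hard work in the real system is producing the labeled segments in the first place.

```python
from collections import defaultdict

def summarize_activities(track_segments):
    """Aggregate labeled track segments (patient, activity, start_s, end_s)
    into per-patient totals, in seconds, for each activity."""
    totals = defaultdict(lambda: defaultdict(float))
    for patient, activity, start, end in track_segments:
        totals[patient][activity] += end - start
    return {p: dict(a) for p, a in totals.items()}

segments = [
    ("patient_3", "walking", 0, 120),
    ("patient_3", "eating", 600, 1500),
    ("patient_3", "walking", 2000, 2060),
]
summary = summarize_activities(segments)
# summary["patient_3"] -> {"walking": 180.0, "eating": 900.0}
```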

Demo Video

Case Study: Large-Scale Processing of Internet Videos

The goal of this project is to develop automatic methods to detect events, actions, or objects in Internet videos without any user-generated metadata. Our group was responsible for two subtasks: semantic concept detection (automatically detecting objects and actions in videos) and text-to-video search. We designed E-Lamp Lite, the first system of its kind, operating on 100 million videos. Our multimedia event detection system was consistently recognized as one of the top two scoring submissions in the annual evaluations of the NIST multimedia event detection task, which stressed both accuracy of results and computational efficiency.

We developed an accurate, efficient, and scalable search method for video content. As opposed to text matching, the method relies on automatic video content understanding and allows for intelligent and flexible search paradigms over the video content (text-to-video and text&video-to-video search). It provides new ways to look at content-based video search, from finding a simple concept like “puppy” to searching for a complex incident like “a scene in an urban area where people are running away after an explosion”. To achieve this ambitious goal, we implemented several novel methods focusing on accuracy, efficiency, and scalability in this search paradigm.
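
At its simplest, text-to-video search over pre-computed concept detectors can be viewed as scoring each video by the confidence of the concepts the query mentions. The sketch below assumes such a precomputed concept index; the names and the plain-sum fusion are illustrative simplifications, not the system's actual ranking model.

```python
def rank_videos(query_concepts, video_index):
    """Rank videos by the sum of detector confidences for the
    concepts mentioned in the query (a simple late-fusion scheme)."""
    scores = {}
    for video_id, concept_scores in video_index.items():
        scores[video_id] = sum(concept_scores.get(c, 0.0) for c in query_concepts)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-video concept detector outputs in [0, 1].
index = {
    "v1": {"puppy": 0.9, "indoor": 0.4},
    "v2": {"crowd": 0.8, "explosion": 0.7, "running": 0.6},
    "v3": {"puppy": 0.2, "beach": 0.9},
}
ranking = rank_videos(["crowd", "running", "explosion"], index)  # "v2" first
```

Because the concept scores are computed once at indexing time, a query reduces to cheap lookups, which is what makes this style of search scale to very large collections.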

Demo Video

Case Study: Personal Video Memory QA

People increasingly use their phones to capture and share salient personal moments in their lives, yet these moments are notoriously difficult to search for and temporally localize. This project focuses on retrieving specific memories, recorded in personal video collections and shared via email, based on user recollections. Our Yahoo partners indicated this would be a particularly interesting combination for future services. We are developing mobile tools to index a user’s mail and associated videos/images by content and to answer questions about them. Beyond merely tagging the media, the proposed work seeks to transform personal videos, which contain much emotional and vivid content, into a customized information resource searchable from a mobile phone. Search takes place through spoken or text reminiscences (which function as queries) that are transformed into semantic search expressions appropriate for the collection; the search expressions map the query terms into concepts the system knows how to detect. The system supports both queries through words and queries by example, using one or more videos that resemble what the user seeks. Example queries from a user’s smartphone are: “Which videos did John send me about a year ago with my puppy chasing a cat?” or “Find me videos similar to this one, from anyone in the family, about our visit to amusement parks, but with more people in them”.
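
A reminiscence like the first example query effectively becomes a structured filter over indexed metadata: a sender, an approximate date, and a set of detected concepts. The sketch below shows that final filtering step under an assumed record format; the function and field names are hypothetical, and turning free-form speech into such a query is the actual research problem.

```python
from datetime import date

def search_memories(videos, concepts=None, sender=None, around=None, window_days=90):
    """Filter personal videos by detected concepts, email sender, and an
    approximate date, mimicking a reminiscence turned into a structured query."""
    results = []
    for v in videos:
        if sender and v["sender"] != sender:
            continue
        if concepts and not set(concepts) <= set(v["concepts"]):
            continue
        if around and abs((v["date"] - around).days) > window_days:
            continue
        results.append(v["id"])
    return results

videos = [
    {"id": "vid1", "sender": "john", "concepts": ["puppy", "cat"], "date": date(2016, 3, 1)},
    {"id": "vid2", "sender": "john", "concepts": ["beach"], "date": date(2016, 3, 5)},
]
hits = search_memories(videos, concepts=["puppy", "cat"], sender="john",
                       around=date(2016, 4, 1))
# -> ["vid1"]
```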

For More Information:
Visual Memory QA: Your Personal Photo and Video Search Agent [demo1, demo 2, mp4]. Lu Jiang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alex Hauptmann. In AAAI Conference on Artificial Intelligence (AAAI), 2017. [PDF]


Demo 2

Case study: Smarter Parking

Our research has resulted in an intelligent system for improved parking assistance in large parking lots and other traffic-monitoring scenarios. The system monitors traffic entering and moving inside the parking area and directs drivers to available parking spots through a mobile device application. Its basis is camera monitoring, which keeps track of the current occupancy of the parking spaces and identifies cars as they enter the lot. The identification of cars can be coupled to a license plate or E-ZPass at entry, and the app then guides the driver to an optimal available spot. Advantages of the system include tracking all cars in different weather conditions, keeping an inventory of all currently available spots, and navigational guidance on mobile devices.

The system has been implemented, and its methods are optimized for passenger convenience from home to parking to terminal. It uses camera observations to track cars, identify open parking spaces, and maintain a database of spaces marked as available, reserved, or occupied. Based on this information, the application is activated when a car enters the lot, and the car is then tracked by cameras from entry. The application guides the driver to the best open spot, combining the system’s capabilities with the GPS of the mobile device. The utility of our system is to save users time, avoid the anxiety of searching for available parking spaces, and supply the parking management system with real-time information about available spaces.
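
The occupancy database described above (each space either available, reserved, or occupied, with arriving cars assigned to a good open spot) can be sketched as a small state machine. The class, spot names, and nearest-to-entrance heuristic are illustrative assumptions, not the deployed system's design.

```python
class ParkingLot:
    """Minimal occupancy database: each spot is available, reserved, or
    occupied, and an arriving car is assigned the closest available spot."""

    def __init__(self, spot_distances):
        # spot_id -> distance from the entrance (in the real system this
        # would come from the camera-derived lot layout)
        self.distances = dict(spot_distances)
        self.state = {s: "available" for s in self.distances}

    def assign(self, car_id):
        """Reserve the closest available spot for an arriving car."""
        free = [s for s, st in self.state.items() if st == "available"]
        if not free:
            return None
        best = min(free, key=self.distances.get)
        self.state[best] = "reserved"
        return best

    def park(self, spot_id):
        self.state[spot_id] = "occupied"  # confirmed by camera tracking

    def leave(self, spot_id):
        self.state[spot_id] = "available"

lot = ParkingLot({"A1": 12.0, "A2": 8.5, "B1": 20.0})
spot = lot.assign("car-42")  # -> "A2", the closest open spot
```

Camera events (car parked, car departed) would drive the `park` and `leave` transitions, keeping the database consistent with the physical lot.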
