
IBM ViaVoice Toolkit for Animation

This toolkit adds functions to IBM ViaVoice that can be used to generate "near real-time" lip synchronization ("lip sync") for animation.

Date Posted: September 16, 2005

Update: April 16, 2007 Version 1.1: New starter kit provides easier introduction to the technology and enables lip-sync based on the relative energy of the audio (starter kit does not require ViaVoice).

 

What is IBM ViaVoice Toolkit for Animation?

Lip synchronization ("lip sync") is the process of synchronizing a character's mouth movements with its speech. Currently, lip sync for animation is often done manually; when it is done automatically, it cannot be done in real time and the results are not accurate. Because ViaVoice provides an accurate acoustic model, lip sync based on it is also accurate. In addition, because ViaVoice operates in "near real time," the lip sync can be done in near real time, opening many opportunities for real-time animation.

IBM ViaVoice Toolkit for Animation provides data files and a modified speech recognition engine that allow the ViaVoice dictation product to send phonetic and audio information to a client process, which uses it to synchronize character mouth movement with the corresponding speech. In addition, it includes the header files necessary for building a client, as well as sample code and images for creating real-time or offline lip sync animations.

This technology opens the possibility for live animation of speaking characters and provides simple and accurate lip sync for scripted animation, bringing character dialog animation within reach of nonprofessionals.

The toolkit includes sample code that generates real-time Macromedia Flash animations. Software developers can use this example to help them integrate the technology into their animation systems. With IBM® ViaVoice® for Windows® R10 and this toolkit, developers have everything necessary for adding automatic lip sync capability to their animation systems.

How does it work?

Using the acoustic model of the ViaVoice dictation product, the recognition engine determines the most likely phonetic sequence that corresponds to the input audio. It sends this information, along with the captured audio, to a client one hundred times per second through the ViaVoice application programming interface (SMAPI). For each phone, the client selects the corresponding image from a small set of character images and assembles the images in sequence along with the corresponding audio. The result is an animation of the character speaking in the voice of the user.
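As a rough illustration of the client side, the sketch below collapses a stream of per-frame phone labels (one label every 10 ms, matching the hundred updates per second described above) into timed segments that an animation could be assembled from. It does not use SMAPI; the function name `phone_segments` and the sample labels are assumptions made for this example only.

```python
from itertools import groupby

FRAME_MS = 10  # one phone label every 10 ms (100 updates per second)


def phone_segments(frame_phones):
    """Collapse per-frame phone labels into (phone, start_ms, duration_ms) runs."""
    segments = []
    position = 0  # index of the first frame in the current run
    for phone, run in groupby(frame_phones):
        count = sum(1 for _ in run)
        segments.append((phone, position * FRAME_MS, count * FRAME_MS))
        position += count
    return segments


# Hypothetical labels for the word "through" (TH, R, OO), one per 10 ms frame.
frames = ["TH"] * 8 + ["R"] * 6 + ["OO"] * 15

for phone, start, duration in phone_segments(frames):
    print(f"{phone}: starts at {start} ms, lasts {duration} ms")
```

Each segment tells the animation which mouth image to show and for how long, so the images can be lined up against the captured audio.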

The feature that distinguishes an "aaaaaah" sound from an "eeeeee" sound is its frequency spectrum. We can measure the spectrum of speech one hundred times per second and match each of those spectra to the most likely sound that made it. All of speech can be broken down into a fairly small set of sounds called phones or phonemes. English has about fifty phonemes. In some languages, the sounds correspond fairly well to the written letters that represent the same word. English, which has words such as "through," is a little more complicated. The phonemes for "through" would be TH, R, and OO.

The sounds of the phonemes are determined by the shape of the lips, the position of the tongue, the amount of opening of the jaw, and the vibration of the vocal folds. When working in three dimensions (3D), animators usually try to be as realistic as possible, so they animate the jaw, lips, and tongue. To that they add expressions associated with emotions, such as speaking while smiling, so the number of different images can be large. For cartoons (two-dimensional animation), animators need not be as realistic, so the number of different images is usually smaller. For the demonstration included with the toolkit, we use twelve different mouth shapes. These shapes are called visemes, short for visual phonemes. There are fewer visemes than phonemes because the same mouth shape can produce different sounds, for example B, P, and M. By displaying the twelve visemes in the proper sequence, the animated character can be made to look as if it is speaking. This process is called lip synchronization. Because the phonetic information can be provided within one or two seconds of the original speech, the user can generate nearly real-time lip sync.
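The many-to-one relationship between phonemes and visemes can be sketched as a lookup table. The table below is illustrative only: the viseme names and groupings are assumptions, not the twelve shapes shipped with the toolkit. The only grouping taken from the text is that B, P, and M share a mouth shape.

```python
# Illustrative phone-to-viseme table. The labels and groupings are assumptions
# for this sketch, not the toolkit's actual twelve-viseme set.
PHONE_TO_VISEME = {
    "B": "lips_closed", "P": "lips_closed", "M": "lips_closed",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "TH": "tongue_between_teeth",
    "R": "rounded_narrow",
    "OO": "rounded_small", "W": "rounded_small",
    "AA": "open_wide",
    "IY": "spread",
    "SIL": "rest",
}


def visemes_for(phones, default="rest"):
    """Map phones to visemes and merge consecutive duplicates."""
    result = []
    for phone in phones:
        viseme = PHONE_TO_VISEME.get(phone, default)
        # Consecutive phones that share a mouth shape need only one image.
        if not result or result[-1] != viseme:
            result.append(viseme)
    return result


print(visemes_for(["TH", "R", "OO"]))  # the word "through"
print(visemes_for(["B", "P", "M"]))    # three phones, one mouth shape
```

Falling back to a neutral shape for unknown phones is a design choice made here so that the animation never stalls on an unmapped label.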

What's new in Version 1.1?

Version 1.1 of this toolkit adds the ability to generate lip sync based on the relative energy of the audio rather than its phonetic content. This simpler form of lip sync uses only three visemes to represent the mouth movement. The method still requires decoding the audio stream, but the processing is merely a refinement of automatic gain control. The choice of which form of lip sync to use depends primarily on the expected level of realism, which in turn depends on how realistic the animated characters are.
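One way to picture energy-based lip sync is the sketch below, which measures the energy of each audio frame, compares it with a slowly decaying peak (a crude stand-in for automatic gain control), and picks one of three visemes. This is not the starter kit's code; the function names, viseme labels, decay factor, and thresholds are all assumptions chosen for illustration.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each complete frame of samples."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def energy_visemes(energies, decay=0.95, half_open=0.2, open_=0.6):
    """Classify each frame as closed, half_open, or open relative to a decaying peak."""
    peak = 0.0
    visemes = []
    for energy in energies:
        # Track recent loudness so the thresholds adapt to the recording level.
        peak = max(energy, peak * decay)
        level = energy / peak if peak > 0 else 0.0
        if level >= open_:
            visemes.append("open")
        elif level >= half_open:
            visemes.append("half_open")
        else:
            visemes.append("closed")
    return visemes


# Hypothetical per-frame energies: silence, a loud vowel, then trailing off.
print(energy_visemes([0.0, 1.0, 0.3, 0.05, 0.0]))
```

Because each frame is compared with the recent peak rather than with a fixed level, the same mouth movements result whether the recording is quiet or loud.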

This simpler form of lip sync gives developers an easier entry point into creating lip sync applications. Unlike phonetic lip sync, it does not require the ViaVoice dictation product in order to run. All the code needed for creating a lip-synced animation is included. The energy-based lip sync is packaged as a separate zip file for download.

A few changes were also made to the main toolkit. The code for creating the SWF file was separated into a new C++ class and is used by both phonetic and energy-based lip sync. The installation script was updated so that it works with current versions of Apache.

About the technology author(s)

This toolkit was created by Mike Monkowski of the IBM Systems and Technology Group; Andy Aaron of IBM Research; and Roberto Sicconi of IBM Research.
