Opinion Pieces


FaceCap and Vocap: Frankensteining the Future


Keith Arem has worked as a director, producer, and composer in the game industry for the past 20 years, and has been fortunate to be involved in many incredible projects and titles in Console, VR, Mobile, and Location-Based Entertainment. Here he explains the need for facecap.

 

I’ve been working as a director, producer, and composer in the game industry for the past 20 years, and I’ve been fortunate to be involved in many incredible projects and titles in Console, VR, Mobile, and Location-Based Entertainment. From Call of Duty to Yakuza, these projects have given me the opportunity to work alongside the industry’s most talented artists, animators, designers, programmers, and fellow storytellers. I’ve also had the honor of directing some of the most dynamic actors in the entertainment industry, including Gary Oldman, Idris Elba, Ed Harris, Michael Keaton, and dozens of others.

 

As a talent director, my role is to guide an actor’s performance, regardless of the technology, and ensure the quality and integrity of the scene and the acting. After working on over 650 games, I’ve seen a vast range of tools and technology used to achieve stellar gameplay. But from all of this experience, I believe the most important aspect is developing creative methods to achieve great performances. As I’ve moved into directing for film and television, many of these methods have become the backbone of modern virtual production.

 

As the game industry continues to grow and evolve, realistic acting performances have become more important for telling emotionally engaging stories. In film, we have the benefit of directing actors on set, in character, in costume, utilizing props, and reacting off other actors. In games, however, each element is often animated, triggered, or timed separately, so each performance is often constructed individually. As a result, I have found exceptional success in “frankensteining” performances - with the voice, body, movement, stunts, and face all built, captured, and brought together independently.

 

THE NEED FOR FACECAP

Facial animation has historically been a challenge for games, since dialog often comes late in the game development process, and the time and resources to produce detailed results are limited.

Also, with many AAA games being produced in Asia, Europe, and South America, there is a growing need to re-voice and re-face performances when they are dubbed into English.

 

For “in-game dialog” (during gameplay), many games utilize automated mouth flaps to provide lip movements. This is effective for common gameplay dialog, but lacks the articulation and finer detail needed for realistic performances. For cinematic “cut scenes” (pre-rendered movies), however, many developers can benefit from full performance capture to drive facial animation. This can provide greater detail and phenomenal resolution, creating extremely realistic and emotionally grounded results. With the evolving detail and resolution of motion capture systems, there are now many creative and diverse methods to achieve high-quality facial performances. In my opinion, the method and pipeline for capturing facial and vocal performances is just as important as the technology itself.
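To make the limitation concrete, here is a minimal, hypothetical sketch of the kind of amplitude-driven mouth flaps described above. The function name, frame rate, and loudness ceiling are illustrative assumptions, not any specific engine’s API:

```python
# Hypothetical sketch: amplitude-driven "mouth flaps" for in-game dialog.
# Assumes mono 16-bit PCM WAV dialog; emits one jaw-open value (0.0-1.0)
# per animation frame from the RMS loudness of that frame's audio window.

import math
import struct
import wave

def mouth_flaps(wav_path: str, fps: int = 30) -> list[float]:
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        samples = struct.unpack(
            "<%dh" % wav.getnframes(), wav.readframes(wav.getnframes())
        )
    window = rate // fps  # audio samples per animation frame
    curve = []
    for start in range(0, len(samples) - window, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        curve.append(min(1.0, rms / 8000.0))  # 8000: illustrative ceiling
    return curve  # drives a single jaw-open blendshape - loudness only
```

Because a curve like this tracks only loudness, a plosive and a vowel look identical on the character’s face - which is exactly the articulation gap that full performance capture closes.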

 

ART INFLUENCING ART

With Virtual Production becoming a necessity in today’s turbulent landscape, the film industry is turning to games as a way to address the future of shooting. Ironically, the game industry often looks to Hollywood for inspiration for performance capture methods, resulting in actors performing motion, stunts, voice, and face all together. Most high-fidelity motion capture (MOCAP) systems utilize hundreds of infrared cameras, which capture the actors from 360 degrees, and the performance capture becomes similar to theater “in the round.” Since props and set pieces can obstruct the infrared cameras, we often create simplistic props to represent objects, and view real-time playback on monitors to verify the environment and characters.

 

From a performance standpoint, the advantage of this method is that it brings multiple actors together at once, allows them to work off one another, and, most importantly, lets them react to each other. Actors love the freedom of MOCAP, since it allows them to perform as in a stage play, and it can be a complete representation of their performance.

 

FIX IT IN THE MIX

While this is a natural performance and animation technique, it comes at an extremely high cost and requires a significant amount of labor and digital cleanup. Sometimes stage performances lose continuity with in-game elements, surrounding gameplay scenes, or in-game audio fidelity. Most importantly, it also places a significant amount of dependency on the actors.

 

Due to the typical amount of game content, budget, time, and workflow costs, actors often must perform 10-12 pages of script per day, frequently in 2-3 minute memorized continuous scenes, to provide continuity across hundreds of cameras. As a result, even one stumbled line, one misstep, one wrong eye line, or one mistake means an entire reshoot of the scene...or a correction in post. Additionally, due to the large MOCAP stage sizes, the acoustical capture and recording fidelity is vastly different from in-studio dialog, which compromises consistency with the majority of the game’s audio. Large-diaphragm or shotgun mics are not practical for stage recordings, so strategically placed lav mics are required for each performer. Most stage recordings require a fair amount of cleanup or ADR, especially if dialog needs to be isolated or incorporated with surrounding game elements. Also, since some stunt performers are not trained in VO techniques, VO actors often replace MOCAP temp dialog.

 

As a Director, this means your focus is split between reviewing the minute nuances of the performance and being forced to compromise due to a technical or performance glitch. This “fix it in the mix” philosophy often requires the talent, audio engineers, and directors to revise, ADR, and replace the original stage performance. This is not only a slower process, but often negates the benefits of full performance capture. The end results can still be exceptionally good, but the process, costs, and time are often overlooked due to the fractured nature of production.

 

As a Director and Engineer, I have seen this scenario play out dozens of times on the largest of franchises, and due to the fractured nature of game production, it’s sometimes difficult to supervise the complete pipeline. As a result, the performances are often revised, tweaked, patched, and melded until the resulting product is a blend of many elements...and performers.

 

FRANKENSTEINING THE FUTURE

Which brings me back to the philosophy of developing new methods and pipelines for “frankensteining” performance elements. The goal is to capture each acting element at its highest resolution, highest fidelity, lowest cost, and best performance. In this strategy, a game character is the composite of a VOCAP actor, a MOCAP performer, and a FACECAP performer...with a stellar team of engineers and animators behind them.
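As a rough illustration of that composite, here is a hypothetical sketch of a character assembled from independently captured tracks keyed to a shared timecode. Every class, field, and file name below is invented for illustration:

```python
# Hypothetical sketch of a "frankensteined" character: each performance
# element is captured independently, then merged on a shared timeline.

from dataclasses import dataclass

@dataclass
class PerformanceTrack:
    performer: str   # who supplied this element
    source: str      # e.g. "VOCAP stage", "MOCAP volume", "FACECAP HMC"
    start_tc: str    # SMPTE timecode where the track begins
    data_path: str   # audio file, skeleton animation, or blendshape curves

@dataclass
class GameCharacter:
    name: str
    voice: PerformanceTrack  # VOCAP: isolated studio-quality dialog
    body: PerformanceTrack   # MOCAP: stunts and movement, timed to the voice
    face: PerformanceTrack   # FACECAP: facial ADR synced to final audio

hero = GameCharacter(
    name="Sgt. Reyes",  # illustrative character
    voice=PerformanceTrack("VO actor", "VOCAP stage", "01:00:10:00", "line_042.wav"),
    body=PerformanceTrack("stunt performer", "MOCAP volume", "01:00:10:00", "line_042.fbx"),
    face=PerformanceTrack("facial ADR actor", "FACECAP HMC", "01:00:10:00", "line_042_face.json"),
)
```

The design point is that all three tracks share one start timecode: each element can be recaptured or replaced on its own without disturbing the others.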

 

VOCAP: VOICE FIRST

Before addressing the face, it’s important to first consider the voice. If you watch any Pixar or hand-animated film, most performances originate with the voice first. Animators often talk about using the actor’s voice to inspire their animation, and that’s absolutely the case in games. The voice informs the speed, pitch, projection, distance, motion, intensity, breathing, and attitude of the characters. In VO dialog recording, actors must rely on their voice, rather than their physicality, to convey emotional situations. This presents a challenge for physically intensive action scenes, which require breathing and simulated movement, especially since dialog is often recorded individually to isolate the voice from other sounds or crosstalk.

 

This situation led me to create a new performance method we coined VOCAP. This method combines head-mounted, high-quality condenser lav mics with a sound-treated voice stage. Here we can direct and record 6-8 actors simultaneously, each with their own isolated track. The sound-treated stage is large enough to give the actors space to move, and the lightweight head mics allow the performers to interact with props, move, turn, and, most importantly, react to other performers. This method provides the isolation quality necessary for studio recording, but also simulates the same performance environment as MOCAP, without the need for stunts, cameras, or suits.

 

MOTION ON SET

Once VOCAP is complete, DAW systems are employed to clean and organize the performances into isolated regions. This allows Directors and Engineers to hand-pick the best elements of the recordings and construct audio animatics for stage playback. (This also allows dialog to go into game production sooner, since it can be implemented or set up for facial animation without waiting for MOCAP.) The sound recordings are then brought to the MOCAP stage for the action, without the need to capture audio on set. This method allows stunt and trained physical motion performers to concentrate on the action, and ensures their timing is based on the director’s preselected vocal performance. It also allows celebrities and voice actors (who may not have the time, experience, or physicality to endure long performance capture shoots) to portray physically demanding roles without being on set.

 

DAW systems are then brought to set, and MOCAP actors are given trigger cues for audio. The MOCAP actors block movement and timing based on the pre-recorded dialog, eliminating the need to memorize or record audio on set. The DAW workstation can easily adjust timing, speed, or triggers to help actors hit their cues, or allow more or less time for action and movement. This allows Directors to focus specifically on action and animation, using final audio takes and choices. It also helps the MOCAP talent ground their movement, rather than exaggerate or change timings.
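A minimal sketch of the kind of cue triggering described above, with the cue sheet format and file names invented for illustration; it assumes the third-party `simpleaudio` package for WAV playback:

```python
# Hypothetical stage cue runner: the operator advances through pre-approved
# VOCAP takes, triggering playback so MOCAP performers time their action to
# final audio. Cue IDs, paths, and pre-roll values are illustrative.

import time
import simpleaudio  # third-party playback library (assumed available)

CUE_SHEET = [
    # (cue id, approved VOCAP take, pre-roll seconds before playback)
    ("SC12_A", "takes/sc12_line01_take3.wav", 2.0),
    ("SC12_B", "takes/sc12_line02_take1.wav", 1.0),
]

def run_cues(cues) -> None:
    for cue_id, wav_path, pre_roll in cues:
        input(f"[{cue_id}] pre-roll {pre_roll:.1f}s -- press Enter to trigger: ")
        time.sleep(pre_roll)  # give the performer a beat to set
        play = simpleaudio.WaveObject.from_wave_file(wav_path).play()
        play.wait_done()  # hold until the take finishes before the next cue

if __name__ == "__main__":
    run_cues(CUE_SHEET)
```

Adjusting a cue’s pre-roll, or swapping in an alternate take, is a one-line change to the cue sheet rather than a re-record - which is what makes this workflow fast on stage.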

 

FACECAP

Once VOCAP and MOCAP are complete, FaceCap can be employed to address the face, dialog, and expressions. FaceCap is a method where a “Facial ADR” actor replaces the mouth and facial movements. Similar to the popular app Dubsmash, FaceCap artists lip-sync to the final audio recordings while wearing head-mounted camera systems. This allows engineers to capture detailed facial expressions based on the final approved performances, and lets the performer focus exclusively on facial reactions, timing, and eye lines. FaceCap actors do not perform new dialog, but rather perfectly mouth the original performance, while adding crafted expressions and physicality to the face. Directors can focus on specific movements and lines of dialog, instead of approving dozens of lines during the physicality of MOCAP.
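One hypothetical way engineers might verify that a lip-synced FaceCap take stays locked to the approved audio is to cross-correlate the captured jaw-open curve against the audio loudness curve (reusing the illustrative `mouth_flaps()` sketch from earlier). This is a sketch of the idea, not any vendor’s tool:

```python
# Hypothetical sketch: estimate the frame offset between a FaceCap take and
# the final approved audio by cross-correlating the captured jaw-open curve
# with the audio loudness curve (mouth_flaps() from the earlier sketch).

def sync_offset(face_curve: list[float], audio_curve: list[float],
                max_shift: int = 15) -> int:
    """Return the shift (in frames) that best aligns face to audio."""
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = sum(
            face_curve[i] * audio_curve[i + shift]
            for i in range(len(face_curve))
            if 0 <= i + shift < len(audio_curve)
        )
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift  # e.g. +3 means the face take runs 3 frames ahead

# e.g. flag any take drifting more than 2 frames at 30 fps:
# offset = sync_offset(jaw_curve, mouth_flaps("final_take.wav"))
```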

 

DEVIL IN THE DETAILS

While everyone’s ideal goal is to allow actors to deliver complete physical performances, the diverse world of games requires engineers and directors to embrace the advantages of virtual production and isolated performance elements. This unique marriage between performance, engineering, animation, and direction results in an impressive blend of the best of all worlds. As technology continues to mature and improve, we also need to constantly evolve our production philosophies to deliver the highest quality content we can create.

 

www.pcbproductions.com

 
