Motion capture from a single video | MIT News

From Star Wars to Happy Feet, many of your favorite movies contain scenes made possible by motion capture technology, which records the movement of objects or people on video. This technique, which sits at the intersection of physics, geometry, and perception, extends well beyond Hollywood to the military, sports training, medicine, computer vision, and robotics, allowing engineers to understand and simulate actions happening in real-world environments.

Because motion capture can be a complex and costly process, often requiring markers placed on objects or people and the recording of a staged sequence of actions, researchers are working to shift the burden to neural networks that can extract this data from an ordinary video and reproduce it in a model. Work combining physics simulation and rendering is promising for making the technique more widely usable, since it can characterize realistic, continuous, dynamic motion from images and translate back and forth between a 2D image and a 3D scene. However, to do so, current methods require precise knowledge of the environmental conditions in which the action takes place, as well as the choice of renderer, both of which are often unavailable.

Now, a team of researchers from MIT and IBM has developed a trained neural network pipeline that avoids this issue, with the ability to infer the state of the environment and the actions happening, the physical characteristics of the object or person of interest (the system), and its control parameters. When tested, the technique outperformed other methods in simulations of four physical systems of rigid and deformable bodies, which illustrate different types of dynamics and interactions, under various environmental conditions. Further, the methodology allows for imitation learning, predicting and reproducing the trajectory of a real-world, flying quadrotor from a video.

“The high-level research problem this paper deals with is how to reconstruct a digital twin from a video of a dynamic system,” says Tao Du PhD ’21, a postdoc in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a member of the research team. To do this, Du says, “we need to ignore the rendering variances from the video clips and try to grasp the core information about the dynamic system or the dynamic motion.”

Du’s co-authors include lead author Pingchuan Ma, a graduate student in EECS and a member of CSAIL; Josh Tenenbaum, the Paul E. Newton Career Development Professor of Brain and Cognitive Sciences and a member of CSAIL; Wojciech Matusik, professor of electrical engineering and computer science and a member of CSAIL; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. This work was presented this week at the International Conference on Learning Representations (ICLR).

While capturing videos of characters, robots, or dynamic systems to infer dynamic movement makes this information more accessible, it also brings a new challenge. “Images or videos [and how they are rendered] depend largely on the lighting conditions, on the background information, on the texture information, on the material information of your environment, and these are not necessarily measurable in a real-world scenario,” says Du. Without this rendering configuration information, or knowledge of which renderer is used, it is presently difficult to glean dynamic information and predict the behavior of the subject of the video. Even if the renderer is known, current neural network approaches still require large sets of training data. However, with the new approach, this can become a moot point. “If you capture a leopard running in the morning and in the evening, of course, you’ll get visually different video clips, because the lighting conditions are quite different. But what you really care about is the dynamic motion: the joint angles of the leopard, not whether they look light or dark,” Du says.

To take the problem of rendering domains and image differences out of the picture, the team developed a pipeline containing a neural network they call a rendering-invariant state-prediction (RISP) network. RISP transforms differences in images (pixels) into differences in states of the system, i.e., the environment of the action, making their method generalizable and agnostic to rendering configurations. RISP is trained using random rendering parameters and states, which are fed into a differentiable renderer, a type of renderer that measures the sensitivity of pixels with respect to rendering configurations such as lighting or material colors. This generates a diverse set of images and videos from known ground-truth parameters, which will later allow RISP to reverse the process, predicting the state of the environment from an input video. The team additionally minimized RISP’s rendering gradients, so that its predictions were less sensitive to changes in rendering configurations, allowing it to learn to forget about the visual appearance and focus on learning dynamical states. This is made possible by the differentiable renderer.
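As a toy illustration of the invariance property RISP is trained toward (this is not the paper's network or renderer), consider a two-pixel "renderer" whose output depends on both a state (an angle) and a rendering configuration (a brightness). A predictor that normalizes the brightness away recovers the state under any lighting; the functions `render` and `predict_state` here are hypothetical stand-ins:

```python
import math
import random

def render(theta, brightness):
    """Toy stand-in for a renderer: a 2-pixel 'image' of a state
    (an angle theta) under a rendering configuration (brightness).
    Hypothetical; the actual work uses a full differentiable renderer."""
    return [brightness * math.cos(theta), brightness * math.sin(theta)]

def predict_state(image):
    """A rendering-invariant state predictor for this toy renderer:
    atan2 cancels the brightness factor, so the recovered angle does
    not depend on the lighting at all."""
    x, y = image
    return math.atan2(y, x)

random.seed(0)
for _ in range(5):
    theta = random.uniform(-math.pi, math.pi)   # ground-truth state
    brightness = random.uniform(0.2, 2.0)       # random render config
    theta_hat = predict_state(render(theta, brightness))
    assert abs(theta_hat - theta) < 1e-9        # invariant to brightness
```

Here the invariance is hand-built into `predict_state`; in the paper's pipeline, a neural network has to learn an analogous invariance from many randomized renderings, with the rendering-gradient term discouraging sensitivity to appearance.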

The method then makes use of two similar pipelines, run in parallel. One is for the source domain, with known variables. Here, system parameters and actions are entered into a differentiable simulation. The generated simulation states are combined with different rendering configurations in a differentiable renderer to produce images, which are fed into RISP. RISP then outputs predictions about the environmental states. At the same time, a similar target-domain pipeline is run with unknown variables. RISP in this pipeline is fed these output images, generating a predicted state. When the predicted states from the source and target domains are compared, a new loss is produced; this difference is used to adjust and optimize some of the parameters in the source-domain pipeline. This process can then be iterated, further reducing the loss between the pipelines.
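The optimization loop above can be sketched with a deliberately simplified stand-in: a one-parameter "simulation" whose predicted states are compared against the states inferred from the target video, with the mismatch driving gradient descent on the unknown parameter. Everything here (the falling-object model, the gravity parameter `g`, the hand-derived gradient) is illustrative, not the paper's actual system or loss:

```python
TIMES = (0.1, 0.2, 0.3)

def simulate(g):
    """Toy differentiable simulation: heights of an object dropped from
    10 m, sampled at a few timesteps, under gravity parameter g."""
    return [10.0 - 0.5 * g * t ** 2 for t in TIMES]

# Target-domain states: standing in for what RISP infers from the real
# video. Here we synthesize them with the true (but "unknown") g = 9.8.
target_states = simulate(9.8)

g_hat = 5.0    # initial guess in the source-domain pipeline
lr = 200.0
for _ in range(30):
    pred = simulate(g_hat)
    # Loss: sum of squared state differences between the two pipelines.
    # Each predicted state depends on g via -0.5 * t**2, which gives the
    # analytic gradient used below (a differentiable simulator would
    # supply this automatically).
    grad = sum(2.0 * (p - s) * (-0.5 * t ** 2)
               for p, s, t in zip(pred, target_states, TIMES))
    g_hat -= lr * grad

print(round(g_hat, 6))  # → 9.8
```

The key design point mirrored here is that the loss is computed between *states*, not pixels, so the comparison is unaffected by how either domain happened to be rendered.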

To determine the success of their method, the team tested it in four simulated systems: a quadrotor (a flying rigid body that does not have any physical contact), a cube (a rigid body that interacts with its environment, like a die), an articulated arm, and a rod (a deformable body that can move like a snake). The tasks included estimating the state of a system from an image, identifying the system parameters and action control signals from a video, and discovering control signals from a target image that direct the system to the desired state. Additionally, they created baselines and an oracle, comparing the novel RISP process in these systems to similar methods that, for example, lack the rendering-gradient loss, don’t train a neural network with any loss, or lack the RISP neural network altogether. The team also looked at how the gradient loss impacted the state-prediction model’s performance over time. Finally, the researchers deployed their RISP system to infer the motion of a real-world quadrotor, which has complex dynamics, from video. They compared the performance to other techniques that lacked a loss function and used pixel differences, or one that included manually tuning a renderer’s configuration.

In nearly all of the experiments, the RISP procedure outperformed the similar or state-of-the-art methods available, imitating or reproducing the desired parameters or motion, and proving to be a data-efficient and generalizable competitor to current motion capture approaches.

For this work, the researchers made two important assumptions: that information about the camera is known, such as its position and settings, and that the geometry and physics governing the object or person being observed are known. They plan to address these limitations in future work.

“I think the biggest problem we’re solving here is reconstructing the information in one domain to another, without very expensive equipment,” says Ma. Such an approach should be “useful for [applications such as the] metaverse, which aims to reconstruct the physical world in a virtual environment,” adds Gan. “It is basically an everyday, available solution, that’s neat and simple, for cross-domain reconstruction or the inverse dynamics problem,” says Ma.

This research was supported, in part, by the MIT-IBM Watson AI Lab, Nexplore, the DARPA Machine Common Sense program, the Office of Naval Research (ONR), ONR MURI, and Mitsubishi Electric.
