Classic rendering techniques can generate photorealistic images for various complex, real-world scenarios when given high-quality scene specifications.
They also give us explicit editing control over individual elements of a scene, such as camera viewpoint, lighting, geometry, and materials. However, producing high-quality scene models from images requires significant manual effort, and automating scene modeling from images remains an open research problem.
Deep generative models have improved considerably in recent years and can now produce high-quality, photorealistic images and videos that are visually compelling. These networks can generate images or videos either from random noise or conditioned on user inputs such as segmentation masks or layouts. However, they have limitations: these techniques do not yet support fine-grained control over the details of the generated scene, and they cannot always handle complex interactions between scene objects well.
In contrast, neural rendering methods try to combine the best of both approaches, enabling controllable synthesis of novel, high-quality images or videos. Neural rendering techniques come in several varieties, differing in their inputs, scene representations, and rendering steps.
The inputs to a typical neural rendering approach are scene conditions such as viewpoint, layout, and lighting. A neural scene representation is built from these inputs, and images can then be synthesized from this representation under novel scene properties. This learned scene representation is not constrained by modeling approximations and can be optimized to produce new, high-quality images. Neural rendering techniques also incorporate ideas from classical computer graphics, such as input features, network architectures, and scene representations, which makes the learning task easier and increases the controllability of the output.
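The pipeline above (scene conditions in, latent scene representation, rendered image out) can be sketched in a few lines of numpy. This is a minimal, illustrative sketch only: the random weight matrices stand in for parameters that would be learned from ground-truth images, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scene conditions: a camera viewpoint (3 angles) and a light direction.
viewpoint = np.array([0.1, 0.4, 0.0])
light_dir = np.array([0.0, 1.0, 0.5])

# "Encoder": project the scene conditions into a latent neural scene representation.
# Random weights stand in for ones optimized against ground-truth images.
W_enc = rng.standard_normal((16, 6))
latent = np.tanh(W_enc @ np.concatenate([viewpoint, light_dir]))

# "Decoder": render the latent representation into an H x W x 3 image.
H, W = 8, 8
W_dec = rng.standard_normal((H * W * 3, 16))
image = 1.0 / (1.0 + np.exp(-(W_dec @ latent)))  # sigmoid -> values in (0, 1)
image = image.reshape(H, W, 3)

print(image.shape)  # (8, 8, 3)
```

Changing `viewpoint` or `light_dir` and re-running the decoder is, in spirit, how a trained model synthesizes images under novel scene properties.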
One type of neural rendering enables novel-viewpoint synthesis as well as scene editing in 3D (geometry deformation, removal, copy-move). It is trained for a specific scene or object. Besides ground-truth color images, it requires a coarse, reconstructed, and tracked 3D mesh with a texture parametrization. Instead of a classical texture, the approach learns a neural texture: a texture that stores a neural feature descriptor per surface point. A classical computer-graphics rasterizer samples these neural textures, given the 3D geometry and viewpoint, resulting in a projection of the neural feature descriptors onto the image plane. The final output image is generated from the rendered feature descriptors by a small U-Net, which is trained jointly with the neural texture. The learned neural feature descriptors and decoder network compensate for the coarseness of the underlying geometry and for tracking errors, while the classical rendering step ensures consistent 3D image formation.
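The key data structure here is the neural texture: a texture whose texels hold learned feature vectors rather than RGB values, sampled at continuous UV coordinates supplied by the rasterizer. The sketch below illustrates that sampling step with random features standing in for trained ones; the texture size, channel count, and function name are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A neural texture: each texel stores a learned feature descriptor
# (here 16 channels) instead of 3 RGB channels. Random values stand in
# for weights that would be trained jointly with the decoder U-Net.
tex_h, tex_w, channels = 64, 64, 16
neural_texture = rng.standard_normal((tex_h, tex_w, channels))

def sample_neural_texture(texture, u, v):
    """Bilinearly sample the feature descriptor at continuous UV coords in [0, 1]."""
    h, w, _ = texture.shape
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * texture[y0, x0] + fx * texture[y0, x1]
    bot = (1 - fx) * texture[y1, x0] + fx * texture[y1, x1]
    return (1 - fy) * top + fy * bot

# The rasterizer would supply one UV coordinate per output pixel, given the
# 3D geometry and viewpoint; here we sample a single surface point.
features = sample_neural_texture(neural_texture, 0.37, 0.81)
print(features.shape)  # (16,)
```

Doing this lookup for every pixel yields a feature image on the image plane, which the small U-Net then decodes into the final RGB output.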
These techniques also find application in facial reenactment. The human facial texture can be encoded as a neural texture and then sampled with the help of UV maps. A deep U-Net-style architecture can decode the neural texture, and once it is decoded, another U-Net-like architecture can paint the synthetic face texture onto the background image.
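The reenactment steps above can be sketched end to end: a per-pixel UV map gathers feature descriptors from the neural texture, a decoder turns them into RGB, and the decoded face is composited onto the background. Everything here is a stand-in: the UV map, mask, and weights are synthetic, a nearest-neighbour lookup replaces bilinear sampling, and a fixed linear projection replaces the trained U-Nets.

```python
import numpy as np

rng = np.random.default_rng(2)

H, W, C = 32, 32, 8
neural_texture = rng.standard_normal((16, 16, C))

# A UV map assigns each foreground pixel a texture coordinate; in practice it
# comes from rendering a tracked face model, here it is random.
uv_map = rng.uniform(0.0, 1.0, size=(H, W, 2))
face_mask = np.zeros((H, W), dtype=bool)
face_mask[8:24, 8:24] = True  # hypothetical face region

# Gather a feature descriptor for every pixel via nearest-neighbour lookup
# (bilinear sampling would be used in practice).
ix = (uv_map[..., 0] * (neural_texture.shape[1] - 1)).astype(int)
iy = (uv_map[..., 1] * (neural_texture.shape[0] - 1)).astype(int)
feature_image = neural_texture[iy, ix]  # (H, W, C)

# A trained U-Net would decode the feature image to RGB; a fixed linear
# projection plus sigmoid stands in for it here.
W_dec = rng.standard_normal((C, 3))
decoded_rgb = 1.0 / (1.0 + np.exp(-(feature_image @ W_dec)))

# Composite the decoded face onto the background image using the mask.
background = np.full((H, W, 3), 0.5)
output = np.where(face_mask[..., None], decoded_rgb, background)
print(output.shape)  # (32, 32, 3)
```

The mask-based compositing at the end corresponds to the second U-Net-like network in the text, which blends the synthesized face region with the untouched background.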
Colossyan makes use of such technology for talking-avatar generation. We have developed conditional generative neural networks that produce photorealistic videos from a given audio signal. The full pipeline involves several steps, from processing the speech signal to driving the human face model to performing neural rendering, and all the techniques discussed above play an important role in achieving lifelike results.