
All we want is an empty room: Panorama Inpainting

Imagine the ability to see any space as if it were empty, free from furniture or clutter, without physically modifying anything. With a realistic estimate of the available space, redecoration, renovation, and remodel projects become much easier and less time-consuming. This is the power of panorama inpainting.

We recently introduced defurnishing as one of the primary applications under the umbrella of our Project Genesis efforts. Defurnishing is the ability to automatically remove furnishings and non-structural items from our digital replicas of spaces. This project consists of multiple AI-based technologies, as introduced in Matterport's Three Pillars of AI and recently published by our team, which we are describing in detail in this blog series. Last month we explained how we employ Semantic Segmentation to identify all pixels and mesh faces that contain furniture. Today we will shed light on the core component of our defurnishing pipeline - inpainting.

What is Inpainting?

Furniture inpainting means replacing furniture with empty space consistent with its surroundings. For example, if a couch by the wall in the living room is being digitally removed, it needs to be substituted with the elements that it occludes, namely wall and floor.

A Matterport scan consists of 2D images and a corresponding textured 3D mesh. Furniture needs to be removed from all of these assets: all furniture pixels must be replaced in the 2D images and all furniture faces in the mesh. This article focuses on 2D inpainting in order to fully outline the technique, while the subsequent blog post will explain how we handle 3D assets.

You can imagine that if there is a custom wallpaper or an intricate carpet, the problem becomes far from trivial. On top of that, humans are highly sensitive to aberrations in man-made structures such as warped walls and patterned carpets, as opposed to natural environments like the branches of a tree. Therefore our inpainting has to match the visual quality of our scans.

How is Inpainting Done?

Inpainting - reconstructing missing or masked regions of an image - is a core problem in computer vision. Various classical approaches have been devised to tackle it, but it wasn't until the late 2010s, with the advent of generative methods, that high-quality, photorealistic inpainting became a reality. Generative Adversarial Networks (GANs) have been successfully applied to face generation and inpainting, but at relatively low resolutions. The introduction of latent diffusion models, such as Stable Diffusion (SD), sparked another wave of research that unlocked quicker, hyper-realistic generation and inpainting at much higher resolutions - typically 512x512 to 1024x1024 pixels - in mere seconds. Various commercial products derive from this technology, such as Midjourney and DALL-E.

Stable Diffusion is a text-to-image model that takes a text prompt such as "pink elephant on a beach" as input and generates an image that visually corresponds to the prompt. Similarly, SD inpainting is provided with an input image (a room in a space), an inpainting mask (indicating all pixels containing furniture), and a prompt describing what to replace the masked pixels with (empty space).
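As a concrete illustration (not our production code), this is roughly how an off-the-shelf Stable Diffusion 2 inpainting pipeline can be invoked with the Hugging Face diffusers library; the file names are placeholders:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Load an off-the-shelf Stable Diffusion 2 inpainting pipeline.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# Input image and mask: white mask pixels mark the furniture to replace.
image = Image.open("room.png").convert("RGB")
mask = Image.open("furniture_mask.png").convert("L")

# The prompt describes what the masked pixels should be replaced with.
result = pipe(prompt="empty room", image=image, mask_image=mask).images[0]
result.save("room_inpainted.png")
```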

Let us understand how SD works. The core technique - diffusion - is an iterative process of gradually adding noise to an image. Stable Diffusion learns the reverse process: how to generate an image, from the same image distribution as the training dataset, starting from pure noise. This denoising is carried out by the model's UNet, a neural network of residual and transformer blocks that takes in a noisy representation and learns to predict the noise in it, ultimately allowing us to obtain the noise-free result at inference.

As mentioned, SD is a text-to-image model. The text prompt is an input, which is converted into a numeric representation by a pre-trained text encoder. In the case of inpainting, we also input the image we want to inpaint. This image also needs to be converted into a compact numeric representation that is quicker to handle than all individual pixels of the image. This compression is achieved via an autoencoder - a pre-trained neural network that converts an image into a latent embedding, which is a compact array representing the image content. The two embeddings - that of the text prompt and that of the image, with added noise - are input to the denoising UNet. At the end of the iterative denoising process we obtain a denoised latent embedding. The autoencoder has a decoder counterpart, which in turn is responsible for converting the final output embedding into an image we can view.
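The following sketch traces that flow for plain text-to-image generation using individual diffusers components; the inpainting variant additionally feeds the mask and the masked-image latents to the UNet. The model ID, step count, and omission of classifier-free guidance are simplifications for illustration:

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "stabilityai/stable-diffusion-2-base"  # illustrative model choice
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

# 1) The text encoder converts the prompt into a numeric embedding.
tokens = tokenizer("empty room", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt").to(device)
text_emb = text_encoder(tokens.input_ids)[0]

# 2) Start from pure noise in latent space and iteratively denoise with the UNet.
latents = torch.randn(1, unet.config.in_channels, 64, 64, device=device)
scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 3) The decoder half of the autoencoder turns the final latents into an image.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```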

Our Approach to Panorama Inpainting

Our spaces are represented as collections of high-resolution 8192x4096-pixel 360-degree equirectangular panoramas. This is in stark contrast to the data SD inpainting is typically presented with - lower-resolution, square, perspective images. Thus we need to bridge the domain gap in camera model and aspect ratio, which means fine-tuning Stable Diffusion to adapt it to our modality. Fine-tuning also adapts the model better to our data, which consists mostly of indoor living environments. We also have to find suitable ways to deal with the higher resolution, both in training and at inference time.

The approach we devised consists of pre-processing for context maximization, inpainting customized for our data types and needs, and post-processing for seamless blending of the generated content with the high-frequency details of the original image. We explain each of these components below - for full detail, refer to our published paper.

Panorama Preparation

Our preprocessing converts the original 360-degree panorama into a format suitable for a Stable Diffusion inpainting pipeline. Note that while it is possible to convert this input into a set of perspective projection images, we choose not to for several reasons. First, directly inpainting the panorama gives us maximal angular context, exposing the network to all structures available in the image. Additionally, inpainting several images separately and stitching them together is prone to artifacts due to color-balance discrepancies between the generated regions in each individual image. Thus we work directly with the full-context panoramas, which eases the task of inferring what content lies behind the occluding furnishings.

First, as the top and bottom poles of the equirectangular projection correspond to the small unobserved areas under and above the tripod, they contain very little image context and are often blurry. Therefore, we crop the top and bottom bands, converting the panorama from 2:1 to 3:1 aspect ratio.
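A minimal sketch of this cropping step (the function name and interface are ours, for illustration only):

```python
import numpy as np

def crop_poles(pano: np.ndarray, target_aspect: float = 3.0) -> np.ndarray:
    """Crop equal bands from the top and bottom of an equirectangular
    panorama (H x W x 3) to reach the target width:height aspect ratio."""
    h, w = pano.shape[:2]
    target_h = int(round(w / target_aspect))   # e.g. 8192 / 3 = 2731 rows
    band = (h - target_h) // 2                 # rows removed from each pole
    return pano[band:band + target_h]
```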

Next, we perform semantic segmentation, as outlined in our previous blog, to detect all instances of furniture from a set we have pre-defined, including couches, beds, tables, chairs, and furnishings that go on them, such as books, pillows, house plants, rugs, etc. We split this semantic map into pixels containing furnishings and non-furnishings, thus obtaining a binary mask that we use as our inpainting mask. Note that we do not apply dilation to this mask in order to keep intact as much of the original environment context as possible.
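Conceptually, the mask construction is a simple label lookup; the label IDs below are hypothetical placeholders for our furnishing classes:

```python
import numpy as np

# Hypothetical label IDs for the furnishing classes to be removed.
FURNISHING_IDS = [7, 12, 15, 23]   # e.g. couch, bed, table, rug

def furnishing_mask(semantic_map: np.ndarray) -> np.ndarray:
    """Turn a per-pixel semantic label map into a binary inpainting mask:
    1 = furnishing pixel to be replaced, 0 = keep. No dilation is applied,
    preserving as much of the surrounding context as possible."""
    return np.isin(semantic_map, FURNISHING_IDS).astype(np.uint8)
```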

Then we roll the panorama horizontally so that the largest furniture region is placed in the center. This central placement minimizes equirectangular distortion in that region, making it more suitable for SD, which is pre-trained mostly on perspective images. During training, we noted that SD adapts to the equirectangular projection within a few thousand iterations. Finally, we lightly pad the image horizontally - a 360-degree panorama wraps around continuously from the right edge to the left, so this padding further ensures maximal context.

Stable Diffusion is memory-intensive: an image at our original resolution would not fit on a standard A10G GPU with 24 GB of memory. Therefore we downscale the pre-processed input to 1732x512 pixels, which allows as many as 3 images and their masks to fit into an SD inpainting pipeline at once. We are now ready to inpaint.
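Before moving on, here is a simplified sketch putting the rolling, wrap-around padding, and downscaling together; the padding width and the column-count heuristic for locating the largest furniture region are illustrative simplifications of our actual logic:

```python
import numpy as np
import cv2

def center_pad_resize(pano, mask, pad_px=128, out_size=(1732, 512)):
    """Roll the panorama so the largest furniture region sits at the center,
    add wrap-around padding, and downscale for the SD inpainting pipeline."""
    # Roll horizontally; the column with the most masked pixels approximates
    # the center of the largest furniture region.
    shift = pano.shape[1] // 2 - int(np.argmax(mask.sum(axis=0)))
    pano = np.roll(pano, shift, axis=1)
    mask = np.roll(mask, shift, axis=1)

    # A 360-degree panorama wraps around, so pad each side with content
    # from the opposite edge to preserve continuity.
    pano = np.pad(pano, ((0, 0), (pad_px, pad_px), (0, 0)), mode="wrap")
    mask = np.pad(mask, ((0, 0), (pad_px, pad_px)), mode="wrap")

    # Downscale so the image and mask fit in GPU memory (cv2 uses width, height).
    pano = cv2.resize(pano, out_size, interpolation=cv2.INTER_AREA)
    mask = cv2.resize(mask, out_size, interpolation=cv2.INTER_NEAREST)
    return pano, mask, shift
```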

Custom Inpainting

The biggest issue with Stable Diffusion inpainting, which we have not mentioned above, is its tendency to hallucinate objects. Much like large language models such as ChatGPT, which always have an answer even though it may be incorrect, diffusion models always generate a plausible-looking output. Thus, if there are chairs that cast shadows on a vinyl floor and we attempt to inpaint them, SD may try to explain the existence of the shadows by generating another object that could have cast them - such as the tables hallucinated in the middle image below. Moreover, Stable Diffusion has been extensively trained to creatively generate imagery of various objects, characters and places, but we are asking it to replace furniture with emptiness, which is hard to imagine, let alone generate. To sum up, our goal is to fine-tune Stable Diffusion inpainting against hallucinations due to causal cues such as shadows and light reflections, so that it can replace furniture with empty space in high-quality panorama captures of indoor spaces.

[Figure: Original capture; result with off-the-shelf Stable Diffusion 2 Inpainting (middle); our result]

This is where Matterport's vast dataset comes into play. Our customers scan a variety of spaces, including both furnished and unfurnished ones. In fact, as many as 20% of the captured spaces are without furniture. Our motivation is that if a diffusion model has been exposed to many unfurnished images, it will be less likely to generate furniture and, therefore, less likely to hallucinate. To this end, we use 160 thousand unfurnished panoramas to fine-tune Stable Diffusion inpainting on unfurnished Matterport data.

How do we fine-tune the inpainting component itself? A successful fine-tuning strategy requires pairs of corresponding panoramas with and without furniture. However, we do not have access to such before-and-after furniture-staging image pairs, so we create this data synthetically. In particular, we render synthetic furniture objects into the above-mentioned unfurnished spaces. We also render shadows near the synthetic furniture items - as we do not have an accurate environment map of a space, the shadows are not fully realistic. Finally, we fine-tune with masks that cover only the furniture items, not the shadows. This makes our model robust to shadows and lighting effects, and even teaches it to remove remnant shadows that are not covered by a mask. We additionally perturb the masks during training to mimic semantic segmentation inaccuracies, further increasing the robustness of our model to imprecise masks.
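As an example of such mask perturbation, here is a sketch of random dilation and erosion; the kernel sizes and probabilities are illustrative, not our production values:

```python
import numpy as np
import cv2

def perturb_mask(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly dilate or erode a binary training mask by a few pixels to
    mimic semantic-segmentation inaccuracies."""
    k = int(rng.integers(1, 6))                        # random kernel radius
    kernel = np.ones((2 * k + 1, 2 * k + 1), np.uint8)
    if rng.random() < 0.5:
        return cv2.dilate(mask, kernel, iterations=1)
    return cv2.erode(mask, kernel, iterations=1)
```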

We perform low-rank adaptation (LoRA) - a technique for efficient fine-tuning of large models - on the UNet of a pre-trained Stable Diffusion inpainting pipeline. For each image during training, we randomly select one of 32 variations of the prompt "empty room" in order to condition the text component of the pipeline on this phrase. The training we did for our publication took 96 GPU-hours on 8 Nvidia A10G GPUs with an effective batch size of 96, utilizing gradient accumulation. We are constantly improving this model as we gain more insights.
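For illustration, a low-rank adaptation setup along these lines can be sketched with the diffusers and peft libraries roughly as follows; the rank, learning rate, and prompt list are assumptions, and the training loop itself (noising, UNet forward pass, loss, gradient accumulation) is omitted:

```python
import random
import torch
from diffusers import StableDiffusionInpaintPipeline
from peft import LoraConfig

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting"
)

# Freeze the base UNet and attach low-rank adapters to its attention projections.
pipe.unet.requires_grad_(False)
lora_config = LoraConfig(r=16, lora_alpha=16,
                         target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.unet.add_adapter(lora_config)

# Only the LoRA parameters are optimized during fine-tuning.
trainable = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One of several "empty room" prompt variations is sampled per training image.
PROMPTS = ["empty room", "an empty room", "empty, unfurnished room"]
prompt = random.choice(PROMPTS)
```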

At inference time we only require one GPU, on which we perform 10 diffusion steps with the prompt "empty room". The entire pipeline takes 12 seconds: 8 seconds for pre-processing, 3.5 seconds for inpainting, and 0.5 seconds for post-processing, as explained below. We continue to reduce the runtime of this initial pipeline.

Even though our model is fine-tuned to avoid hallucinations, they may still occur. One way to reduce them is to apply a slight mask dilation, as little as 1-2 pixels, which largely preserves the image context while decreasing the risk of hallucination. Another strategy we found very effective concerns the latent embedding used to initialize the inpainting pipeline. While traditionally pure noise is used as initialization, we found that a weighted combination of noise and latents from the input image, blurred under the inpainting mask, reduces hallucinations. However, if the influence of the input-image latents is too high, the inpainted output becomes blurry. Therefore we use a balanced combination of 97% noise and 3% image latents, obtaining an image that is both sharp and hallucination-free.
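A sketch of this initialization strategy, assuming the inpainting mask has already been resized to the latent resolution; the blur kernel size and sigma are illustrative choices:

```python
import torch
import torchvision.transforms.functional as TF

def init_latents(image_latents, latent_mask, noise_weight=0.97, blur_sigma=3.0):
    """Initialize the diffusion latents as a weighted combination of pure noise
    and the input-image latents, blurred under the (latent-resolution) mask."""
    noise = torch.randn_like(image_latents)
    blurred = TF.gaussian_blur(image_latents, kernel_size=9, sigma=blur_sigma)
    # Blur only where we will inpaint; keep the original latents elsewhere.
    guided = latent_mask * blurred + (1.0 - latent_mask) * image_latents
    # 97% noise / 3% image latents keeps the output sharp yet grounded.
    return noise_weight * noise + (1.0 - noise_weight) * guided
```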

The images below compare our fine-tuned inpainter to LaMa, a GAN-based inpainting technique that targets large masks, and to off-the-shelf, non-fine-tuned Stable Diffusion inpainting. For best results, we run these methods with mask dilation, while ours runs without. The GAN-based LaMa creates a lot of blur near the shadow under the couch. Off-the-shelf Stable Diffusion inpainting is sharper, but leaves a trailing shadow that looks like a stain on the floor and extends the brick wall onto the painted wall to the right. Our fine-tuned model generates a visually pleasing result, where the shadow is fully removed and the generated textures look like natural extensions of the original ones.

[Figure: Original capture; result of LaMa (Resolution-robust Large Mask Inpainting with Fourier Convolutions); result with off-the-shelf Stable Diffusion 2 Inpainting; our result]

For more results and comparisons, refer to our visualizations online.

Superresolution and Blending

After obtaining a satisfactory inpainted image, we need to restore the original panorama size so that it can be displayed just like the originally captured imagery. We first run a super-resolution network to increase the image dimensions by a factor of 4, and then linearly interpolate to match the exact expected dimensions. We also undo the padding and rolling applied during pre-processing.
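To make this concrete, here is a rough sketch that mirrors the pre-processing example above; the super-resolution network is replaced by plain bicubic resizing, and the sizes and parameter names are illustrative assumptions rather than our actual post-processing code:

```python
import numpy as np
import cv2

def restore_panorama(inpainted, shift, pad_px=128, out_w=8192, out_h=2731):
    """Upscale the inpainted band back to panorama resolution and undo the
    pre-processing. In the real pipeline a 4x super-resolution network runs
    first; plain bicubic resizing stands in for it in this sketch."""
    h, w = inpainted.shape[:2]
    # 4x upscale (super-resolution in the real pipeline) ...
    up = cv2.resize(inpainted, (4 * w, 4 * h), interpolation=cv2.INTER_CUBIC)
    # ... then linear interpolation to the exact expected dimensions,
    # including the wrap-around padding added before downscaling.
    up = cv2.resize(up, (out_w + 2 * pad_px, out_h), interpolation=cv2.INTER_LINEAR)
    # Undo the wrap-around padding and the horizontal roll.
    up = up[:, pad_px:pad_px + out_w]
    return np.roll(up, -shift, axis=1)
```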

This gives us a panorama of the required dimensions, but due to the nature of the diffusion process, some details may be washed out. To preserve as much high-frequency detail from the original panorama as possible, we blend the original and inpainted images. We customize the blending technique to make sure any spurious hallucinations are not included in our output. At the end we obtain a panorama that is a seamless combination of high-frequency detail from the original image, inpainted with expected, aesthetically pleasing textures.
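A simple way to blend generated content with original detail is alpha compositing with a feathered mask; our actual blending is customized to exclude spurious hallucinations, but a basic sketch (with an assumed feather width) looks like this:

```python
import numpy as np
import cv2

def blend(original, inpainted, mask, feather_px=8):
    """Composite the inpainted content into the original panorama, keeping
    the original's high-frequency detail outside the mask. A feathered
    (blurred) mask avoids visible seams at the mask boundary."""
    soft = cv2.GaussianBlur(mask.astype(np.float32), (0, 0), feather_px)
    soft = np.clip(soft, 0.0, 1.0)[..., None]          # H x W x 1 alpha
    return (soft * inpainted + (1.0 - soft) * original).astype(original.dtype)
```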

Challenges and Next Steps

We have successfully developed a panorama inpainting technique that addresses our Matterport-specific data needs, obtaining visually pleasing results in many scenarios. There are a few limitations to the technique described above that we are addressing in follow-up work. First, out-of-distribution shadows and lighting effects not encountered in the training images might cause spurious hallucinations, such as small objects. Matterport's ever-growing dataset will allow us to continuously improve our custom inpainting model and adapt to new cases. Next, off-the-shelf semantic segmentation often fails to distinguish free-standing from built-in structures such as wardrobes. Defurnishing a wardrobe that takes up an entire wall can be unpredictable and is not what customers would expect for built-in structures. This is why we are working on our own custom Semantic Segmentation extensions, as described in our previous blog post. Finally, far-away objects appear small in a large panorama, so if they are surrounded by a lot of clutter, the inpainter may create false structures, from incorrectly placed walls to entire rooms. Our next step is to leverage geometric 3D information in order to disambiguate such cases - stay tuned for our next blog post to learn the details.