Michael Firman


I am Michael Firman, a staff research scientist at Niantic, Inc., where I work on machine learning and computer vision research to help people explore the world around them.

Previously, I worked at UCL in the Vision and Graphics group as a postdoc on the Engage project, making machine learning tools accessible to scientists across different disciplines. This work was with Prof. Mike Terry at the University of Waterloo, Dr. Gabriel Brostow at UCL and Prof. Kate Jones at UCL.

My PhD was supervised by Dr. Simon Julier and Dr. Jan Boehm. During my PhD I predominantly worked on the problem of inferring a full volumetric reconstruction of a scene, given only a single depth image as input. This problem has many applications in robotics, computer graphics and augmented reality.

During the summer of 2012 I worked at the National Institute of Informatics, Tokyo, under the supervision of Prof. Akihiro Sugimoto.

I have also served as a reviewer for CVPR, ECCV, ICCV, IROS, BMVC, ICRA, IJCV, CVIU, 3DV and ISMAR. I received CVPR's `outstanding reviewer' award in 2018, 2020 and 2022.


Virtual Occlusions Through Implicit Depth

Jamie Watson, Mohamed Sayed, Zawar Qureshi, Gabriel J. Brostow, Sara Vicente, Oisin Mac Aodha, Sara Vicente and Michael Firman.

Computer Vision and Patern Recognition (CVPR) 2023

For augmented reality (AR), it is important that virtual assets appear to ‘sit among’ real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer’s camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth-regression as an intermediate step.

We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.

Removing Objects From Neural Radiance Fields

Silvan Weder, Guillermo Garcia-Hernando, Áron Monszpart, Marc Pollefeys, Gabriel J. Brostow, Michael Firman and Sara Vicente.

Computer Vision and Patern Recognition (CVPR) 2023

Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner. We validate our approach using a new and still-challenging dataset for the task of NeRF inpainting.

Heightfields for Efficient Scene Reconstruction for AR

Jamie Watson, Sara Vicente, Oisin Mac Aodha, Clément Godard, Gabriel J. Brostow and Michael Firman.

Winter Conference on Applications of Computer Vision (WACV) 2023

3D scene reconstruction from a sequence of posed RGB images is a cornerstone task for computer vision and augmented reality (AR). While depth-based fusion is the foundation of most real-time approaches for 3D reconstruction, recent learning based methods that operate directly on RGB images can achieve higher quality reconstructions, but at the cost of increased runtime and memory requirements, making them unsuitable for AR applications. We propose an efficient learning-based method that refines the 3D reconstruction obtained by a traditional fusion approach. By leveraging a top-down heightfield representation, our method remains real-time while approaching the quality of other learning-based methods. Despite being a simplification, our heightfield is perfectly appropriate for robotic path planning or augmented reality character placement. We outline several innovations that push the performance beyond existing top-down prediction baselines, and we present an evaluation framework on the challenging ScanNetV2 dataset, targeting AR tasks. Ultimately, we show that our method improves over the baselines for AR applications.

SimpleRecon – 3D Reconstruction without 3D Convolutions

Mohamed Sayed, John Gibson, Jamie Watson, Victor Adrian Prisacariu, Michael Firman and Clément Godard.

European Conference on Computer Vision (ECCV) 2022

Traditionally, 3D indoor scene reconstruction from posed images happens in two phases: per image depth estimation, followed by depth merging and surface reconstruction. Recently, a family of methods have emerged that perform reconstruction directly in final 3D volumetric feature space. While these methods have shown impressive reconstruction results, they rely on expensive 3D convolutional layers, limiting their application in resource-constrained environments. In this work, we instead go back to the traditional route, and show how focusing on high quality multi-view depth prediction leads to highly accurate 3D reconstructions using simple off-the-shelf depth fusion. We propose a simple state-of-the-art multi-view depth estimator with two main contributions: 1) a carefully-designed 2D CNN which utilizes strong image priors alongside a plane-sweep feature volume and geometric losses, combined with 2) the integration of keyframe and geometric metadata into the cost volume which allows informed depth plane scoring. Our method achieves a significant lead over the current state-of-the-art for depth estimation and close or better for 3D reconstruction on ScanNet and 7-Scenes, yet still allows for online real-time low-memory reconstruction.

SimpleRecon is fast. Our batch size one performance is 70ms per frame. This makes accurate reconstruction via fast depth fusion possible!

Camera Pose Estimation and Localization with Active Audio Sensing

Karren Yang, Michael Firman, Eric Brachmann and Clément Godard.

European Conference on Computer Vision (ECCV) 2022

In this work, we show how to estimate a device's position and orientation indoors by echolocation, i.e., by interpreting the echoes of an audio signal that the device itself emits. Established visual localization methods rely on the device's camera and yield excellent accuracy if unique visual features are in view and depicted clearly. We argue that audio sensing can offer complementary information to vision for device localization, since audio is invariant to adverse visual conditions and can reveal scene information beyond a camera's field of view. We first propose a strategy for learning an audio representation that captures the scene geometry around a device using supervision transfer from vision. Subsequently, we leverage this audio representation to complement vision in three device localization tasks: relative pose estimation, place recognition, and absolute pose regression. Our proposed methods outperform state-of-the-art vision models on new audio-visual benchmarks for the Replica and Matterport3D datasets.

The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth

Jamie Watson, Oisin Mac Aodha, Victor Adrian Prisacariu, Gabriel J. Brostow and Michael Firman.

Computer Vision and Pattern Recognition (CVPR) 2021

Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do, either use computationally expensive test-time refinement techniques or off-theshelf recurrent networks, which only indirectly make use of the geometric information that is inherently available.

We propose ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available. Taking inspiration from multi-view stereo, we propose a deep end-to-end cost volume based approach that is trained using self-supervision only. We present a novel consistency loss that encourages the network to ignore the cost volume when it is deemed unreliable, e.g. in the case of moving objects, and an augmentation scheme to cope with static cameras. Our detailed experiments on both KITTI and Cityscapes show that we outperform all published self-supervised baselines, including those that use single or multiple frames at test time.

Single Image Depth Prediction with Wavelet Decomposition

Michaël Ramamonjisoa, Michael Firman, Jamie Watson, Vincent Lepetit and Daniyar Turmukhambetov

Computer Vision and Pattern Recognition (CVPR) 2021

We present a novel method for predicting accurate depths from monocular images with high efficiency. This optimal efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients.

In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on coefficients. Instead we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiplyadds in the decoder network.

Panoptic Segmentation Forecasting with Instance-Level Trajectory Modeling

Colin Graber, Grace Tsai, Michael Firman, Gabriel J. Brostow and Alexander Schwing.

Computer Vision and Pattern Recognition (CVPR) 2021

Our goal is to forecast the near future given a set of recent observations. We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents which need not only passively analyze an observation but also must react to it in real-time. Importantly, accurate forecasting hinges upon the chosen scene decomposition. We think that superior forecasting can be achieved by decomposing a dynamic scene into individual 'things' and background 'stuff'. Background 'stuff' largely moves because of camera motion, while foreground 'things' move because of both camera and individual object motion. Following this decomposition, we introduce panoptic segmentation forecasting. Panoptic segmentation forecasting opens up a middle-ground between existing extremes, which either forecast instance trajectories or predict the appearance of future image frames. To address this task we develop a twocomponent model: one component learns the dynamics of the background stuff by anticipating odometry, the other one anticipates the dynamics of detected things. We establish a leaderboard for this novel task, and validate a state-of-theart model that outperforms available baselines.

Learning Stereo from Single Images

Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov Gabriel J. Brostow and Michael Firman.

European Conference on Computer Vision (ECCV) 2020 (Oral)

Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.

Footprints and Free Space from a Single Color Image

Jamie Watson, Michael Firman, Áron Monszpart and Gabriel J. Brostow

Computer Vision and Pattern Recognition (CVPR) 2020 (Oral)

Understanding the shape of a scene from a single color image is a formidable computer vision task. However, most methods aim to predict the geometry of surfaces that are visible to the camera, which is of limited use when planning paths for robots or augmented reality agents. Such agents can only move when grounded on a traversable surface, which we define as the set of classes which humans can also walk over, such as grass, footpaths and pavement. Models which predict beyond the line of sight often parameterize the scene with voxels or meshes, which can be expensive to use in machine learning frameworks.

We introduce a model to predict the geometry of both visible and occluded traversable surfaces, given a single RGB image as input. We learn from stereo video sequences, using camera poses, per-frame depth and semantic segmentation to form training data, which is used to supervise an imageto-image network. We train models from the KITTI driving dataset, the indoor Matterport dataset, and from our own casually captured stereo footage. We find that a surprisingly low bar for spatial coverage of training scenes is required. We validate our algorithm against a range of strong baselines, and include an assessment of our predictions for a path-planning task.

Self-Supervised Monocular Depth Hints

Jamie Watson, Michael Firman, Gabriel J. Brostow and Daniyar Turmukhambetov

International Conference of Computer Vision (ICCV) 2019

Monocular depth estimators can be trained with various forms of self-supervision from binocular-stereo data to circumvent the need for high-quality laser-scans or other ground-truth data. The disadvantage, however, is that the photometric reprojection losses used with self-supervised learning typically have multiple local minima. These plausible-looking alternatives to ground-truth can restrict what a regression network learns, causing it to predict depth maps of limited quality. As one prominent example, depth discontinuities around thin structures are often incorrectly estimated by current state-of-the-art methods.

Here, we study the problem of ambiguous reprojections in depth-prediction from stereo-based self-supervision, and introduce Depth Hints to alleviate their effects. Depth Hints are complementary depth-suggestions obtained from simple off-the-shelf stereo algorithms. These hints enhance an existing photometric loss function, and are used to guide a network to learn better weights. They require no additional data, and are assumed to be right only sometimes. We show that using our Depth Hints gives a substantial boost when training several leading self-supervised-from-stereo models, not just our own. Further, combined with other good practices, we produce state-of-the-art depth predictions on the KITTI benchmark.

Digging Into Self-Supervised Monocular Depth Estimation

Clément Godard, Michael Firman, Oisin Mac Aodha and Gabriel J. Brostow

International Conference of Computer Vision (ICCV) 2019

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.

Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.

DiverseNet: When One Right Answer is not Enough

Michael Firman, Neill D. F. Campbell, Lourdes Agapito and Gabriel J. Brostow

Computer Vision and Pattern Recognition (CVPR) 2018

Many structured prediction tasks in machine vision have a collection of acceptable answers, instead of one definitive ground truth answer. Segmentation of images, for example, is subject to human labeling bias. Similarly, there are multiple possible pixel values that could plausibly complete occluded image regions. State-of-the art supervised learning methods are typically optimized to make a single test-time prediction for each query, failing to find other modes in the output space. Existing methods that allow for sampling often sacrifice speed or accuracy.

We introduce a simple method for training a neural network which enables diverse structured predictions to be made for each test-time query. For a single input, we learn to predict a range of possible answers. We compare favorably to methods that seek diversity through an ensemble of networks. Such stochastic multiple choice learning faces mode collapse, where one or more ensemble members fail to receive any training signal. Our best performing solution can be deployed for various tasks, and just involves small modifications to the existing single-mode architecture, loss function, and training regime. We demonstrate that our method results in quantitative improvements across three challenging tasks: 2D image completion, 3D volume estimation, and flow prediction.

CityNet - Deep Learning Tools for Urban Ecoacoustic Assessment

Alison Fairbrass, Michael Firman, Carol Williams, Gabriel J. Brostow, Helena Titheridge and Kate E. Jones

Methods in Ecology and Evolution 2019

Cities support unique and valuable ecological communities, but understanding urban wildlife is limited due to the difficulties of assessing biodiversity. Ecoacoustic surveying is a useful way of assessing habitats, where biotic sound measured from audio recordings is used as a proxy for biodiversity. However, existing algorithms for measuring biotic sound have been shown to be biased by non-biotic sounds in recordings, typical of urban environments. We develop CityNet, a deep learning system using convolutional neural networks (CNNs), to measure audible biotic (CityBioNet) and anthropogenic (CityAnthroNet) acoustic activity in cities. The CNNs were trained on a large dataset of annotated audio recordings collected across Greater London, UK.

We found that our deep learned model outperformed existing measures of both biotic and anthropogenic sound. Predictions from our trained model can be visualised on this website, showing the distribution of sounds across London, and over different times of day.

Bat detective - Deep learning tools for bat acoustic signal detection

Oisin Mac Aodha, Rory Gibb, Kate E Barlow, Ella Browning, Michael Firman, Robin Freeman, Briana Harder, Libby Kinsey, Gary R Mead, Stuart E Newson, Ivan Pandourski, Stuart Parsons, Jon Russ, Abigel Szodoray-Paradi, Farkas Szodoray-Paradi, Elena Tilova, Mark Girolami, Gabriel J. Brostow and Kate E. Jones

PLoS Computational Biology 2018

There is a critical need for robust and accurate tools to scale up biodiversity monitoring and to manage the impact of anthropogenic change. For example, the monitoring of bat species and their population dynamics can act as an important indicator of ecosystem health as they are particularly sensitive to habitat conversion and climate change. In this work we propose a fully automatic and efficient method for detecting bat echolocation calls in noisy audio recordings. We show that our approach is more accurate compared to existing algorithms and other commercial tools. Our method enables us to automatically estimate bat activity from multi-year, large-scale, audio monitoring programmes.

RecurBot: Learn to Auto-complete GUI Tasks From Human Demonstrations

Thanapong Intharah, Michael Firman and Gabriel J. Brostow

CHI 2018: Late Breaking Work 2018

On the surface, task-completion should be easy in graphical user interface (GUI) settings. In practice however, different actions look alike and applications run in operating-system silos. Our aim within GUI action recognition and prediction is to help the user, at least in completing the tedious tasks that are largely repetitive. We propose a method that learns from a few user-performed demonstrations, and then predicts and finally performs the remaining actions in the task. For example, a user can send customized SMS messages to the first three contacts in a school's spreadsheet of parents; then our system loops the process, iterating through the remaining parents.

RGBD Datasets: Past, Present and Future

Michael Firman

CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis 2016

(Also: arXiv:1604.00999)

Since the launch of the Microsoft Kinect, scores of RGBD datasets have been released. These have propelled advances in areas from reconstruction to gesture recognition. In this paper we explore the field, reviewing datasets across eight categories: semantics, object pose estimation, camera tracking, scene reconstruction, object tracking, human actions, faces and identification. By extracting relevant information in each category we help researchers to find appropriate data for their needs, and we consider which datasets have succeeded in driving computer vision forward and why.

Finally, we examine the future of RGBD datasets. We identify key areas which are currently underexplored, and suggest that future directions may include synthetic data and dense reconstructions of static and dynamic scenes.

Structured Completion of Unobserved Voxels from a Single Depth Image

Michael Firman, Oisin Mac Aodha, Simon Julier and Gabriel J. Brostow

Computer Vision and Pattern Recognition (CVPR) 2016 (Oral)

Building a complete 3D model of a scene, given only a single depth image, is underconstrained. To gain a full volumetric model, one needs either multiple views, or a single view together with a library of unambiguous 3D models that will fit the shape of each individual object in the scene.

We hypothesize that objects of dissimilar semantic classes often share similar 3D shape components, enabling a limited dataset to model the shape of a wide range of objects, and hence estimate their hidden geometry. Exploring this hypothesis, we propose an algorithm that can complete the unobserved geometry of tabletop-sized objects, based on a supervised model trained on already available volumetric elements. Our model maps from a local observation in a single depth image to an estimate of the surface shape in the surrounding neighborhood. We validate our approach both qualitatively and quantitatively on a range of indoor object collections and challenging real scenes.

Learning to Discover Objects in RGB-D Images Using Correlation Clustering

Michael Firman, Diego Thomas, Simon Julier and Akihiro Sugimoto

International Conference on Intelligent Robots and Systems (IROS) 2013

We introduce a method to discover objects from RGB-D image collections which does not require a user to specify the number of objects expected to be found. We propose a probabilistic formulation to find pairwise similarity between image segments, using a classifier trained on labelled pairs from the recently released RGB-D Object Dataset. We then use a correlation clustering solver to both find the optimal clustering of all the segments in the collection and to recover the number of clusters. Unlike traditional supervised learning methods, our training data need not be of the same class or category as the objects we expect to discover. We show that this parameter-free supervised clustering method has superior performance to traditional clustering methods.

Further information: This work was begun during an internship at NII, Tokyo in the summer of 2012, and was partially supported by an NII internship grant.

'Misspelled' Visual Words in Unsupervised Range Data Classification: The Effect of Noise on Classification Performance.

Michael Firman and Simon Julier

International Conference on Intelligent Robots and Systems (IROS) 2011

Recent work in the domain of classification of point clouds has shown that topic models can be suitable tools for inferring class groupings in an unsupervised manner. However, point clouds are frequently subject to non-negligible amounts of sensor noise. In this paper, we analyze the effect on classification accuracy of noise added to both an artificial data set and data collected from a Light Detection and Ranging (LiDAR) scanner, and show that topic models are less robust to 'misspelled' words than the more naive k‑means classifier. Furthermore, standard spin images prove to be a more robust feature under noise than their derivative, 'angular' spin images. We additionally show that only a small subset of local features are required in order to give comparable classification accuracy to a full feature set.

Computer vision resources

View from Willow Garage dataset

A list of RGBD datasets

There are many great lists of computer vision datasets on the web but no dedicated source for datasets captured by Kinect or similar devices. I created this list in an attempt to remedy the situation.

2023 update: Unfortunately I am no longer maintaining this list. Please consider it a snapshot of datasets up to around 2017.


Simon Princes book

COMPM054/COMPGI14 Machine Vision

During my PhD I was a teaching assistant on COMPM054/COMPGI14 Machine Vision. Teaching materials are available on the Moodle page.

A digital copy of Dr. Simon Prince's book 'Computer Vision: Models, Learning, and Inference', which forms a core of the syllabus, can be downloaded from www.computervisionmodels.com

Matlab and Python

UCL Graduate School courses

During my PhD I taught on (and developed material for) the UCL Graduate School's MATLAB and Python courses. These are three and five day courses introducing graduate students from across the university to the basics of the languages.

Other Projects

Popular rhymes

Finding rhymes in popular music

I have an interest in the lyrical and musical content of music, and it seems sensible to try to use a computer to automate some of the process of discovering themes and trends in music.

As an experiment in Python, I wrote a program to analyse the lyrics of 33,000 songs from the US 40, from 1955 to the present day. I used Brian Langenberger's gdbm-based rhyming dictionary to automatically detect rhyme pairs in each song. This allowed the most popular pairs of rhyming words to be discovered, and changing trends in rhymes over time to be analysed.

A wordcloud showing the most popular rhymes in the whole corpus can be downloaded from the sidebar to the left.

As far as I am aware, this is the first time that rhyme analysis has been performed in this way, and on such a large scale. I presented a preliminary version of this work at the Comparative Innovations Workshop at King's College London, in May 2013. When I have the time I will publish a version of this work here, with explanations of the methodology used and more results.