Sitemap
A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.
Pages
Posts
Future Blog Post
Published:
This post will show up by default. To disable scheduling of future posts, edit config.yml and set future: false.
Blog Post number 4
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 3
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 2
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 1
Published:
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
Publications
Optical Flow in the Dark
Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Many successful optical flow estimation methods have been proposed, but they break down when tested in dark scenes because low-light scenarios were not considered in their design and current optical flow benchmark datasets lack low-light samples. Even if the dark images are enhanced by preprocessing, which yields good visual quality, the optical flow results remain poor or even degrade, because information such as motion consistency may be broken during enhancement. We propose an end-to-end data-driven method that avoids this error accumulation and learns optical flow directly from low-light noisy images. Specifically, we develop a method to synthesize large-scale low-light optical flow datasets by simulating the sensor noise model on darkened raw images. We also collect a new optical flow dataset in raw format with a wide range of exposures to serve as a benchmark. Models trained on our synthetic dataset largely maintain optical flow accuracy as image brightness decreases, and they greatly outperform existing methods on low-light images.
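The key step of the data synthesis is simulating sensor noise on darkened raw frames so that the flow labels of an existing bright dataset can be reused. The sketch below is only an illustration of that general idea with a generic Poisson-Gaussian (shot plus read) noise model; the function name, parameters, and values are assumptions, not the paper's exact synthesis procedure.

```python
import numpy as np

def simulate_low_light_raw(raw, exposure_scale=0.1, full_well=4000.0,
                           read_noise_std=2.0, rng=None):
    """Darken a clean linear raw image (normalized to [0, 1]) and add a
    simple shot + read noise model. Illustrative only, not the paper's
    exact low-light synthesis pipeline."""
    rng = rng or np.random.default_rng()
    # Scale the signal down to mimic a shorter exposure / darker scene.
    dark = raw * exposure_scale
    # Shot noise: photon counts follow a Poisson distribution.
    photons = dark * full_well
    noisy_photons = rng.poisson(photons).astype(np.float64)
    # Read noise: additive Gaussian noise from the sensor electronics.
    noisy_photons += rng.normal(0.0, read_noise_std, size=raw.shape)
    # Back to normalized intensities; the bright pair's flow labels are reused.
    return np.clip(noisy_photons / full_well, 0.0, 1.0)
```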
Optical Flow in the Dark
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Optical flow estimation in low-light conditions is a challenging task for existing methods, and current optical flow datasets lack low-light samples. Even if the dark images are enhanced before estimation, which may yield good visual quality, the optical flow results remain suboptimal because information such as motion consistency may be broken during the enhancement. We propose a novel training policy to learn optical flow directly from new synthetic and real low-light images. Specifically, we first design a method to collect a new optical flow dataset at multiple exposures with shared optical flow pseudo labels. We then apply a two-step process to create a synthetic low-light optical flow dataset, based on an existing bright one, by simulating low-light raw characteristics from the multi-exposure raw images we collected. To extend data diversity, we also include published low-light raw videos without optical flow labels. In our training pipeline, we use these three datasets to create two teacher-student pairs that progressively obtain optical flow labels for all data. Finally, we apply a mix-up training policy over the diversified datasets to produce low-light-robust optical flow models for release. Experiments show that our method largely maintains optical flow accuracy as image exposure decreases, and its generalization ability is verified with different cameras in multiple practical scenes.
GazeOnce: Real-Time Multi-Person Gaze Estimation
Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
Appearance-based gaze estimation aims to predict the 3D eye gaze direction from a single image. While recent deep learning-based approaches have demonstrated excellent performance, they usually assume one calibrated face in each input image and cannot output multi-person gaze in real time. However, simultaneous gaze estimation for multiple people in the wild is necessary for real-world applications. In this paper, we propose the first one-stage end-to-end gaze estimation method, GazeOnce, which is capable of simultaneously predicting gaze directions for multiple faces (> 10) in an image. In addition, we design a sophisticated data generation pipeline and propose a new dataset, MPSGaze, which contains full images of multiple people with 3D gaze ground truth. Experimental results demonstrate that our unified framework not only offers a faster speed, but also provides a lower gaze estimation error compared with state-of-the-art methods. This technique can be useful in real-time applications with multiple users.
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits performance, especially for surfaces imaged at oblique angles. We introduce the Structural MPI (S-MPI), in which the plane structure approximates 3D scenes concisely. Conveying RGBA contexts with geometrically faithful structures, the S-MPI directly bridges view synthesis and 3D reconstruction. It not only overcomes the critical limitations of MPI, i.e., discretization artifacts from sloped surfaces and the abuse of redundant layers, but also enables planar 3D reconstruction. Despite the intuitive appeal of and demand for S-MPI, applying it introduces great challenges, e.g., high-fidelity approximation of both RGBA layers and plane poses, multi-view consistency, modeling of non-planar regions, and efficient rendering with intersected planes. Accordingly, we propose a transformer-based network built on a segmentation model. It predicts compact and expressive S-MPI layers with their corresponding masks, poses, and RGBA contexts. Non-planar regions are handled as a special case within our unified framework. Multi-view consistency is ensured by sharing global proxy embeddings, which encode plane-level features covering the complete 3D scene with aligned coordinates. Extensive experiments show that our method outperforms both previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.
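For readers unfamiliar with the MPI representation, rendering reduces to back-to-front "over" compositing of the RGBA layers after each plane is warped into the target view. The snippet below sketches only that standard compositing step in generic PyTorch; it is not the S-MPI renderer, and the per-plane warping by pose is omitted.

```python
import torch

def composite_mpi(rgba_layers):
    """Back-to-front "over" compositing of MPI layers.

    rgba_layers: (L, 4, H, W) RGBA planes ordered from far to near, assumed
    already warped into the target view (that warping step is omitted here).
    """
    out = torch.zeros_like(rgba_layers[0, :3])
    for layer in rgba_layers:                 # far -> near
        rgb, alpha = layer[:3], layer[3:4]
        # Each nearer layer occludes what has been accumulated so far.
        out = rgb * alpha + out * (1.0 - alpha)
    return out
```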
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Being able to map the activities of others into one's own point of view is a fundamental human skill, present even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration-following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing on the potential applications of daily assistance and professional support, EgoExoLearn contains egocentric and demonstration video data spanning 120 hours, captured in daily life scenarios and specialized laboratories. Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, forming a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end, we present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. We expect EgoExoLearn to serve as an important resource for bridging actions across views, paving the way for AI agents capable of seamlessly learning by observing humans in the real world.
Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to potential limitations, e.g., a limited field of view and ambiguity in depth. To address these problems, adding another camera to better capture the shape of hands is a practical direction. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) they require multi-view annotations for training, which are expensive; 2) during testing, the model becomes inapplicable if the camera parameters/layout differ from those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation, and 2) our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on two stereo constraints: pair-wise cross-view consensus and invariance of the transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings, and outperforms existing adaptation methods with leading performance.
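As an illustration of how such stereo constraints could yield pseudo-labels, the sketch below fuses root-relative 3D joint predictions from the two views, assuming an estimated rotation between the cameras. The function name, consistency threshold, and fusion rule are hypothetical stand-ins, not the released S2DHand code.

```python
import numpy as np

def dual_view_pseudo_label(pred_a, pred_b, R_b_to_a, thresh=0.02):
    """Fuse per-view 3D hand joint predictions into a pseudo-label.

    pred_a, pred_b: (J, 3) root-relative joint predictions in each camera's frame.
    R_b_to_a: (3, 3) estimated rotation mapping view-B coordinates into view A.
    Returns a fused pseudo-label in view A, or None if the views disagree.
    """
    # Cross-view consensus: bring B's prediction into A's coordinate frame.
    pred_b_in_a = pred_b @ R_b_to_a.T
    # Both views should describe the same hand once aligned.
    disagreement = np.linalg.norm(pred_a - pred_b_in_a, axis=-1).mean()
    if disagreement > thresh:
        return None  # unreliable pair, skipped during adaptation
    # Average the two consistent estimates as the pseudo-label.
    return 0.5 * (pred_a + pred_b_in_a)
```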
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Published in European Conference on Computer Vision (ECCV), 2024
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multimodal representations by modeling the natural correlation between visual and motion signals. To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics and propose to embed the relative motion features of human joints into a graph structure. Experiments show that our method achieves state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, enabling more flexible use in the real world.
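The MAE-style pretraining objective can be pictured as a generic masked-reconstruction loss over a token sequence (video patches or IMU windows). The sketch below assumes placeholder encoder/decoder modules, simplifies the decoder to emit one output per original token, and omits the graph-based IMU modeling; it is a schematic of the idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(tokens, encoder, decoder, mask_ratio=0.75):
    """Illustrative MAE-style objective on embedded tokens of shape (B, N, D).

    encoder/decoder are placeholder torch modules; in this sketch the decoder
    is assumed to return a (B, N, D) reconstruction of the full sequence.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    # Randomly keep a subset of tokens and mask out the rest.
    perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    keep_idx, masked_idx = perm[:, :n_keep], perm[:, n_keep:]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # Encode visible tokens only, then reconstruct the full sequence.
    recon = decoder(encoder(visible))
    target = torch.gather(tokens, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    pred = torch.gather(recon, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    # As in MAE, the loss is computed on masked positions only.
    return F.mse_loss(pred, target)
```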
SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
Published in International Conference on Learning Representations (ICLR), 2025
We present SiMHand, a framework for pre-training 3D hand pose estimation from in-the-wild hand images that share similar hand characteristics. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully exploited the diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.
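A minimal sketch of a similarity-weighted contrastive objective over mined similar-hand pairs is given below. The positives come from different images with similar poses, as described above, but the specific pose-distance weighting shown here is an assumed stand-in for the paper's adaptive weighting, and all tensor and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_weighted_contrastive_loss(z_anchor, z_similar, pose_dist,
                                          temperature=0.1):
    """Contrastive loss over mined similar-hand pairs (illustrative).

    z_anchor, z_similar: (B, D) embeddings of anchors and their mined
    similar-pose counterparts (positives from different images).
    pose_dist: (B,) distance between the paired hand poses, used to weight
    each pair; this weighting is an assumption, not the SiMHand formulation.
    """
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_similar = F.normalize(z_similar, dim=-1)
    logits = z_anchor @ z_similar.t() / temperature          # (B, B)
    labels = torch.arange(z_anchor.size(0), device=z_anchor.device)
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    # Closer poses -> larger weight, so near-identical hands pull harder.
    weights = torch.softmax(-pose_dist, dim=0) * pose_dist.numel()
    return (weights * per_pair).mean()
```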
An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
Published as an arXiv preprint, 2025
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights.
Egocentric Inertial Localization with Vision-Language Informed Action Cues
Published as an arXiv preprint, 2025
This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging because IMU sensor noise causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing varied motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), and can therefore serve as spatial anchors that compensate for localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. Assuming that a 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used to reason over the IMU data and the point cloud across time and space to perform inertial localization. Interestingly, the encoders can further be used to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.
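The cross-modal alignment can be pictured as a contrastive (InfoNCE-style) objective between short IMU windows and features of the local point cloud around the corresponding location. The sketch below assumes placeholder encoders and a symmetric InfoNCE loss; it illustrates the alignment idea only and is not the EAIL training code.

```python
import torch
import torch.nn.functional as F

def imu_pointcloud_alignment_loss(imu_windows, local_clouds,
                                  imu_encoder, cloud_encoder, temperature=0.07):
    """Cross-modal InfoNCE between IMU action cues and local point clouds.

    imu_windows: (B, T, C) head-mounted IMU segments; local_clouds: (B, N, 3)
    points around the matching locations. Encoders are placeholders that
    produce (B, D) embeddings; this is a sketch, not the EAIL model.
    """
    z_imu = F.normalize(imu_encoder(imu_windows), dim=-1)
    z_pc = F.normalize(cloud_encoder(local_clouds), dim=-1)
    logits = z_imu @ z_pc.t() / temperature                   # (B, B)
    labels = torch.arange(z_imu.size(0), device=logits.device)
    # Symmetric InfoNCE: match each IMU cue to its location and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```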
Prompt-augmented Boundary Attentive Learning for Weakly Supervised Temporal Sentence Grounding
Published in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025
Weakly supervised temporal sentence grounding aims to temporally locate events described by a sentence in a video, relying solely on video-level visual-language correspondences. Because precise boundary information is absent, existing works primarily focus on multiple-instance learning methods to establish segment-level video-language alignment. In this work, we propose Prompt-augmented Boundary Attentive Learning (PBAL) to enable explicit modeling of segment boundaries in a weakly supervised context. To represent the boundaries with sentences, we first generate sentences describing the start and end of an event, leveraging the capabilities of large language models (LLMs). With the augmented sentences, we then model the boundary-level video-language correspondence using a novel boundary-attentive learning module. This module generates probability maps of the starting and ending points, and is learned through boundary type prediction and self-supervised reconstruction. Experiments on two standard datasets, Charades-STA [1] and ActivityNet Captions [2], demonstrate PBAL's state-of-the-art performance. Our ablation study further demonstrates the effectiveness of the boundary-attentive learning and prompt augmentation techniques.
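Conceptually, the boundary-attentive step scores each video segment against the LLM-generated start and end sentences to obtain start/end probability curves over time. The toy sketch below uses a plain dot-product affinity followed by a softmax over time; the actual PBAL module is attention-based and learned with the objectives described above, so treat every name here as illustrative.

```python
import torch

def boundary_probability_maps(video_feats, start_text, end_text):
    """Toy start/end probability curves from boundary sentence embeddings.

    video_feats: (T, D) per-segment video features; start_text/end_text: (D,)
    embeddings of the sentences describing the event's start and end.
    """
    start_scores = video_feats @ start_text   # (T,) affinity to the "start" sentence
    end_scores = video_feats @ end_text       # (T,) affinity to the "end" sentence
    p_start = torch.softmax(start_scores, dim=0)
    p_end = torch.softmax(end_scores, dim=0)
    return p_start, p_end
```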
Talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Conference Proceeding talk 3 on Relevant Topic in Your Field
Published:
This is a description of your conference proceedings talk; note the different value in the type field. You can put anything in this field.
Teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.