HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild

Stony Brook University, USA · VinAI Research, Vietnam

CVPR 2024


Identification, segmentation, and tracking of hand-held objects. Each row displays selected frames from a single video. Within each row, a distinct hand-held object is assigned a unique tracking ID and is consistently represented in the same color.

Abstract

We address the challenging task of identifying, segmenting, and tracking hand-held objects, which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion, rapid motion, and the transitory nature of objects being hand-held, where an object may be held, released, and subsequently picked up again. To tackle these challenges, we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former spatially and temporally segments hands and objects by iteratively pooling features from each other, ensuring that the identification, segmentation, and tracking of hand-held objects depend on the hands’ positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. We also contribute an in-the-wild video dataset called HOIST, which comprises 4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets, we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.
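The exact form of the contact loss is not given on this page; the snippet below is a minimal PyTorch sketch of one plausible version, a per-pixel mask loss up-weighted near hand-object contact regions. The function name, the dilation-based contact band, and the weighting scheme are assumptions for illustration, not the method's actual formulation.

import torch.nn.functional as F

def contact_weighted_mask_loss(pred_obj_logits, gt_obj_mask, gt_hand_mask,
                               contact_radius=5, contact_weight=5.0):
    """Sketch of a contact-focused loss: per-pixel BCE on the object mask,
    up-weighted near regions where the object touches (is close to) a hand.
    Illustrative only; HOIST-Former's actual contact loss may differ."""
    # pred_obj_logits, gt_obj_mask, gt_hand_mask: (B, H, W)
    kernel = 2 * contact_radius + 1
    # Dilate the hand mask so pixels near a hand define the "contact band".
    hand_near = F.max_pool2d(gt_hand_mask.unsqueeze(1).float(),
                             kernel_size=kernel, stride=1,
                             padding=contact_radius).squeeze(1)
    contact_band = hand_near * gt_obj_mask.float()           # object pixels close to a hand
    weights = 1.0 + (contact_weight - 1.0) * contact_band    # emphasize contact pixels
    return F.binary_cross_entropy_with_logits(pred_obj_logits,
                                              gt_obj_mask.float(),
                                              weight=weights)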

HOIST-Former Architecture


HOIST-Former consists of a backbone network, a pixel decoder, and a transformer decoder. The input video is initially processed through the backbone network and the pixel decoder to generate high-resolution spatio-temporal features F. The transformer decoder operates on F, decoding a set of N hand queries and their corresponding object queries, resulting in N spatio-temporal hand masks and corresponding object masks.
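For readers who prefer code, here is a minimal PyTorch sketch of this pipeline. The module names, call signatures, tensor shapes, and the number of queries N are illustrative assumptions; this is a structural skeleton, not the released implementation.

import torch.nn as nn

class HOISTFormerSketch(nn.Module):
    """Structural skeleton of the HOIST-Former pipeline (illustrative only)."""

    def __init__(self, backbone, pixel_decoder, transformer_decoder, num_queries=100):
        super().__init__()
        self.backbone = backbone                         # per-frame feature extractor (assumed module)
        self.pixel_decoder = pixel_decoder               # upsamples backbone features into F (assumed module)
        self.transformer_decoder = transformer_decoder   # hand-object transformer decoder (assumed module)
        self.num_queries = num_queries                   # N hand queries with paired object queries

    def forward(self, video):
        # video: (B, T, 3, H, W) clip of T frames
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)                     # (B*T, 3, H, W): run the backbone per frame
        feats = self.backbone(frames)                    # multi-scale per-frame features
        F = self.pixel_decoder(feats)                    # high-resolution spatio-temporal features F
        # Decode N hand queries and their corresponding object queries into
        # N spatio-temporal hand masks and corresponding object masks.
        hand_masks, obj_masks = self.transformer_decoder(F, num_queries=self.num_queries)
        return hand_masks, obj_masks                     # each: (B, N, T, H, W) mask logits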

Hand-Object Transformer Decoder


The Hand-Object Transformer Decoder is a network with L layers. This figure illustrates the operational flow of a single layer, which comprises two mask attention modules and two cross-attention modules. The inputs to this layer are N sets of four elements each: a hand query, an object query, a spatio-temporal hand mask, and a spatio-temporal object mask. The layer outputs correspondingly updated versions of all four elements.
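A rough PyTorch sketch of one such layer is given below, assuming Mask2Former-style masked attention for the two mask attention modules and standard cross-attention for the exchange between hand and object queries. The wiring, module names, and mask-prediction head are assumptions for illustration, not the official code.

import torch
import torch.nn as nn

class HandObjectDecoderLayerSketch(nn.Module):
    """One decoder layer as read from the description above: two mask attention
    modules (queries attend to features inside their current masks) and two
    cross-attention modules (hand and object queries pool features from each
    other). Illustrative only."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.hand_mask_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.obj_mask_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.obj_from_hand = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hand_from_obj = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hand_mask_head = nn.Linear(dim, dim)   # query -> mask embedding
        self.obj_mask_head = nn.Linear(dim, dim)

    def _masked_attn(self, attn, q, feats, mask):
        # mask: (B, N, S) with True = attend; convert to the "disallowed" convention
        # expected by nn.MultiheadAttention and expand over attention heads.
        attn_mask = (~mask).repeat_interleave(self.num_heads, dim=0)
        # Guard against fully-empty masks (would yield NaNs): fall back to full attention.
        empty = attn_mask.all(dim=-1, keepdim=True)
        attn_mask = attn_mask & ~empty
        out, _ = attn(q, feats, feats, attn_mask=attn_mask)
        return q + out

    def forward(self, hand_q, obj_q, feats, hand_mask, obj_mask):
        # hand_q, obj_q: (B, N, C) queries; feats: (B, S, C) flattened spatio-temporal features F
        # hand_mask, obj_mask: (B, N, S) boolean masks from the previous layer
        hand_q = self._masked_attn(self.hand_mask_attn, hand_q, feats, hand_mask)
        obj_q = self._masked_attn(self.obj_mask_attn, obj_q, feats, obj_mask)
        # Hand and object queries exchange information, so that object identification,
        # segmentation, and tracking are conditioned on the hands.
        obj_q = obj_q + self.obj_from_hand(obj_q, hand_q, hand_q)[0]
        hand_q = hand_q + self.hand_from_obj(hand_q, obj_q, obj_q)[0]
        # Re-predict spatio-temporal masks from the updated queries.
        new_hand_mask = torch.einsum("bnc,bsc->bns", self.hand_mask_head(hand_q), feats) > 0
        new_obj_mask = torch.einsum("bnc,bsc->bns", self.obj_mask_head(obj_q), feats) > 0
        return hand_q, obj_q, new_hand_mask, new_obj_mask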

HOIST Dataset

As a contribution of this work, we have collected and annotated a large-scale in-the-wild video dataset named HOIST. For each hand-held object in a video, we annotate its segmentation mask and assign a tracking instance ID that persists throughout the video. The dataset comprises 4,228 videos with approximately 85,000 frames in total. It includes numerous videos of hand-held objects in challenging and unconstrained environments, which can be used to train robust methods for hand-held object segmentation and tracking.

Sample videos from HOIST dataset. Each row displays selected frames from a single video. Within each row, a distinct hand-held object is assigned a unique tracking ID and is consistently represented in the same color. Rows (1) and (2) demonstrate object transformations, where one object splits into two or more instances, each with a different tracking ID. Row (3) displays significant object occlusion, with hands blocking a large part of objects or another hand. Row (4) illustrates the challenge of distinguishing similar instances in repetitive tasks. Row (5) presents instances with very similar textures, where one hand is in contact with multiple instances.





Diverse data. Sample frames from the HOIST dataset, which contains videos with diverse scenes, camera views, object sizes, and occlusions.

HOIST-Former Qualitative Results


HOIST-Former Qualitative Results. Each row displays selected frames from a single video. Within each row, a distinct hand-held object is assigned a unique tracking ID and is consistently represented in the same color.

BibTeX

@InProceedings{sn_hoist_cvpr_2024,
    author = {Supreeth Narasimhaswamy and Huy Anh Nguyen and Lihan Huang and Minh Hoai},
    title = {HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2024},
}