Selected Publications

* equal contribution, # corresponding author

An up-to-date list is available on Google Scholar.


preprints

  1. A Survey on Deep Learning Technique for Video Segmentation
    Wenguan Wang, Tianfei Zhou, Fatih Porikli, David Crandall, and Luc Van Gool.

    arXiv, 2021.

    Video segmentation, i.e., partitioning video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, e.g., visual effect assistance in movies, scene understanding in autonomous driving, and virtual background creation in video conferencing. Recently, due to the renaissance of connectionism in computer vision, there has been an influx of deep learning based approaches dedicated to video segmentation that deliver compelling performance. In this survey, we comprehensively review two basic lines of research in this area, i.e., generic object segmentation (of unknown categories) in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges. We also provide a detailed overview of representative literature on both methods and datasets. Additionally, we present quantitative performance comparisons of the reviewed methods on benchmark datasets. Finally, we point out a set of unsolved open issues in this field and suggest possible opportunities for further research.


2022

  1. Rethinking Semantic Segmentation: A Prototype View
    Tianfei Zhou, Wenguan Wang, Ender Konukoglu, and Luc Van Gool.

    Computer Vision and Pattern Recognition (CVPR), 2022

    Prevalent semantic segmentation solutions, despite their different network designs (FCN based or attention based) and mask decoding strategies (parametric softmax based or pixel-query based), can be placed in one category, by considering the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, this study uncovers several limitations of such a parametric segmentation regime and proposes a nonparametric alternative based on non-learnable prototypes. Whereas prior methods learn a single weight/query vector for each class in a fully parametric manner, our model represents each class as a set of non-learnable prototypes, relying solely on the mean features of several training pixels within that class. Dense prediction is thus achieved by nonparametric nearest-prototype retrieval. This allows our model to directly shape the pixel embedding space by optimizing the arrangement between embedded pixels and anchored prototypes, and to handle an arbitrary number of classes with a constant number of learnable parameters. We empirically show that, with FCN based and attention based segmentation models (i.e., HRNet, Swin, SegFormer) and backbones (i.e., ResNet, HRNet, Swin, MiT), our nonparametric framework yields compelling results on several datasets (i.e., ADE20K, Cityscapes, COCO-Stuff) and performs well in the large-vocabulary situation. We expect this work will provoke a rethink of the current de facto semantic segmentation model design.
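As a rough illustration of the non-learnable prototype idea, the following numpy sketch builds each class's prototypes from mean pixel features and classifies pixels by nearest-prototype retrieval. The function names, the random sub-sampling of training pixels, and the use of cosine similarity are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def class_prototypes(features, labels, num_classes, per_class=3, seed=0):
    """Build non-learnable prototypes: each class is represented by the
    mean features of several (randomly chosen) training pixels of that
    class. features: (N, D) pixel embeddings; labels: (N,) class ids."""
    rng = np.random.default_rng(seed)
    protos = np.zeros((num_classes, per_class, features.shape[1]))
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)  # pixels belonging to class c
        for k in range(per_class):
            pick = rng.choice(idx, size=max(1, len(idx) // per_class))
            protos[c, k] = features[pick].mean(axis=0)
    return protos

def predict(features, protos):
    """Nonparametric dense prediction: assign each pixel to the class of
    its nearest prototype under cosine similarity."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=2, keepdims=True)
    sim = np.einsum('nd,ckd->nck', f, p)   # (N, C, K) pixel-prototype sims
    return sim.max(axis=2).argmax(axis=1)  # class of the nearest prototype
```

Because the prototypes are plain feature means, adding a class adds no learnable parameters, which is what makes the large-vocabulary setting tractable in this view.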
  2. Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation
    Tianfei Zhou*, Meijie Zhang*, Fang Zhao, and Jianwu Li.

    Computer Vision and Pattern Recognition (CVPR), 2022

    Learning semantic segmentation from weakly-labeled (e.g., image tags only) data is challenging, since it is hard to infer dense object regions from sparse semantic tags. Despite being broadly studied, most current efforts directly learn from the limited semantic annotations carried by individual images or image pairs, and struggle to obtain integral localization maps. Our work alleviates this from a novel perspective, by exploring rich semantic contexts synergistically among abundant weakly-labeled training data for network learning and inference. In particular, we propose regional semantic contrast and aggregation (RCA). RCA is equipped with a regional memory bank to store the massive, diverse object patterns appearing in training data, which acts as strong support for exploration of dataset-level semantic structure. Specifically, we propose i) semantic contrast, which drives network learning by contrasting massive categorical object regions, leading to a more holistic object pattern understanding, and ii) semantic aggregation, which gathers diverse relational contexts in the memory to enrich semantic representations. In this manner, RCA earns a strong capability of fine-grained semantic understanding, and eventually establishes new state-of-the-art results on two popular benchmarks, i.e., PASCAL VOC 2012 and COCO 2014.
  3. Deep Hierarchical Semantic Segmentation
    Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, Yi Yang.

    Computer Vision and Pattern Recognition (CVPR), 2022

    Humans are able to recognize structured relations in observation, allowing us to decompose complex scenes into simpler parts and abstract the visual world at multiple levels. However, such hierarchical reasoning ability of human perception remains largely unexplored in the current semantic segmentation literature. Existing work typically operates on flat labels and predicts target classes exclusively for each pixel. In this paper, we instead address hierarchical semantic segmentation (HSS), which aims at structured, pixel-wise description of visual observation in terms of a class hierarchy. We devise HSSN, a general HSS framework that tackles two critical issues in this task: i) how to efficiently adapt existing hierarchy-agnostic segmentation networks to the HSS setting, and ii) how to leverage the hierarchy information to regularize HSS network learning. To address i), HSSN directly casts HSS as a pixel-wise multi-label classification task, bringing only minimal architecture changes to current segmentation models. To solve ii), HSSN first explores inherent properties of the hierarchy as a training objective, which enforces segmentation predictions to obey the hierarchy structure. Further, with hierarchy-induced margin constraints, HSSN reshapes the pixel embedding space so as to generate well-structured pixel representations and eventually improve segmentation. We conduct experiments on four semantic segmentation datasets (i.e., Mapillary Vistas 2.0, Cityscapes, LIP, and PASCAL-Person-Part), with different class hierarchies, segmentation network architectures and backbones, showing the generalization and superiority of HSSN.
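The hierarchy property in ii) can be illustrated with a small sketch: per-pixel multi-label scores are projected so that no node outscores its parent, which means thresholding the scores always yields root-to-leaf label paths. The toy tree and the post-hoc min-projection below are invented for illustration; HSSN enforces this property through its training objective rather than post-processing:

```python
import numpy as np

# Hypothetical toy hierarchy: node index -> parent index (-1 for roots),
# e.g. body(0) -> {head(1), torso(2)}, plus an unrelated root class (3).
PARENT = {0: -1, 1: 0, 2: 0, 3: -1}

def depth(node, parent):
    """Distance of a node from its root."""
    d = 0
    while parent[node] >= 0:
        node, d = parent[node], d + 1
    return d

def hierarchy_consistent(scores, parent=PARENT):
    """Project multi-label scores onto the hierarchy: a node's score may
    never exceed its parent's, so any threshold selects label sets that
    form root-to-leaf paths. scores: (..., num_nodes)."""
    out = np.asarray(scores, dtype=float).copy()
    for node in sorted(parent, key=lambda n: depth(n, parent)):  # parents first
        p = parent[node]
        if p >= 0:
            out[..., node] = np.minimum(out[..., node], out[..., p])
    return out
```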
  4. Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
    Liulei Li, Tianfei Zhou, Wenguan Wang, Lu Yang, Jianwu Li, Yi Yang.

    Computer Vision and Pattern Recognition (CVPR), 2022

    Our target is to learn visual correspondence from unlabeled videos. We develop LIIR, a locality-aware inter- and intra-video reconstruction framework that fills in three missing pieces of the self-supervised correspondence learning puzzle, i.e., instance discrimination, location awareness, and spatial compactness. First, instead of focusing on intra-video self-supervision only, as most existing efforts do, we exploit cross-video affinities as extra negative samples within a unified, inter- and intra-video reconstruction scheme. This enables instance-discriminative representation learning by contrasting desired intra-video pixel associations against negative inter-video correspondences. Second, we merge position information into correspondence matching, and design a position shifting strategy to remove the side effect of position encoding during inter-video affinity computation, making our LIIR location-sensitive. Third, to make full use of the spatial continuity of video data, we impose a compactness-based constraint on correspondence matching, yielding sparser and more reliable solutions. The learned representation surpasses self-supervised state-of-the-art methods on label propagation tasks involving objects, semantic parts, and keypoints.
  5. Detail-Preserving Transformer for Light Field Image Super-Resolution
    Shunzhou Wang*, Tianfei Zhou*, Yao Lu, and Huijun Di.

    The AAAI Conference on Artificial Intelligence (AAAI), 2022.

    Recently, numerous algorithms have been developed to tackle the problem of light field super-resolution (LFSR), i.e., super-resolving low-resolution light fields to gain high-resolution views. Despite delivering encouraging results, these approaches are all convolution-based, and are naturally weak in the global relation modeling of sub-aperture images necessary to characterize the inherent structure of light fields. In this paper, we put forth a novel formulation built upon Transformers, treating LFSR as a sequence-to-sequence reconstruction task. In particular, our model regards the sub-aperture images of each vertical or horizontal angular view as a sequence, and establishes long-range geometric dependencies within each sequence via a spatial-angular locally-enhanced self-attention layer, which also maintains the locality of each sub-aperture image. Additionally, to better recover image details, we propose a detail-preserving Transformer (termed DPT) that leverages gradient maps of the light field to guide the sequence learning. DPT consists of two branches, each associated with a Transformer for learning from an original or gradient image sequence. The two branches are finally fused to obtain comprehensive feature representations for reconstruction. Evaluations are conducted on a number of light field datasets, including real-world scenes and synthetic data. The proposed method achieves superior performance compared with other state-of-the-art schemes.


2021

  1. Exploring Cross-Image Pixel Contrast for Semantic Segmentation
    Wenguan Wang*, Tianfei Zhou*, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool.

    International Conference on Computer Vision (ICCV), 2021.

    Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, via context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore the "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by recent advances in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HRNet), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff, CamVid). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.
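A minimal numpy sketch of the supervised pixel-wise contrastive idea: same-class pixel embeddings are pulled together while different classes are pushed apart, via an InfoNCE-style objective over sampled pixels. The pixel sampling, temperature value, and function name are illustrative assumptions:

```python
import numpy as np

def pixel_contrast_loss(emb, labels, temperature=0.1):
    """Supervised pixel-wise contrastive loss: for each anchor pixel,
    maximize the likelihood of its same-class pixels among all others.
    emb: (N, D) sampled pixel embeddings; labels: (N,) class ids."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]          # same-class mask
    np.fill_diagonal(pos, False)
    # average log-probability of positives, per anchor that has positives
    per_anchor = np.where(pos, logp, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return -per_anchor[pos.sum(1) > 0].mean()
```

Because the loss only needs labeled pixel embeddings, it can be added on top of an existing segmentation head during training and dropped at test time, consistent with the "no extra overhead during testing" claim above.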
  2. Quality-Aware Memory Network for Interactive Volumetric Image Segmentation
    Tianfei Zhou*, Liulei Li*, Gustav Bredell, Jianwu Li, and Ender Konukoglu.

    International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2021.

    Despite recent progress in automatic medical image segmentation techniques, fully automatic results usually fail to meet clinical requirements and typically require further refinement. In this work, we propose a quality-aware memory network for interactive segmentation of 3D medical images. Given user guidance on an arbitrary slice, an interaction network is first employed to obtain an initial 2D segmentation. The quality-aware memory network subsequently propagates the initial segmentation estimation bidirectionally over the entire volume. Subsequent refinement based on additional user guidance on other slices can be incorporated in the same manner. To further facilitate interactive segmentation, a quality assessment module is introduced to suggest the next slice to segment based on the current segmentation quality of each slice. The proposed network has two appealing characteristics: 1) the memory-augmented network offers the ability to quickly encode past segmentation information, which can be retrieved for the segmentation of other slices; 2) the quality assessment module enables the model to directly estimate the quality of each segmentation prediction, which allows an active learning paradigm where users preferentially label the lowest-quality slice for multi-round refinement. The proposed network leads to a robust interactive segmentation engine, which generalizes well to various types of user annotations (e.g., scribbles, boxes). Experimental results on various medical datasets demonstrate the superiority of our approach in comparison with existing techniques.
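At its core, the quality-guided interaction loop reduces to a simple active-learning selection rule; a sketch (in practice the per-slice scores would come from the paper's quality assessment module, not be given directly):

```python
def next_slice_to_annotate(quality_scores):
    """Active-learning slice selection: ask the user to annotate the
    slice whose current estimated segmentation quality is lowest.
    quality_scores: per-slice quality estimates in [0, 1]."""
    return min(range(len(quality_scores)), key=quality_scores.__getitem__)
```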
  3. Face Forensics in the Wild
    Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen.

    Computer Vision and Pattern Recognition (CVPR), 2021

    On existing public benchmarks, face forgery detection techniques have achieved great success. However, when used on multi-person videos, which often contain many people active in the scene with only a small subset having been manipulated, their performance remains far from satisfactory. To take face forgery detection to a new level, we construct a novel large-scale dataset, called FFIW-10K, which comprises 10,000 high-quality forgery videos, with an average of three human faces in each frame. The manipulation procedure is fully automatic, controlled by a domain-adversarial quality assessment network, making our dataset highly scalable at low human cost. In addition, we propose a novel algorithm to tackle the task of multi-person face forgery detection. Supervised by only video-level labels, the algorithm explores multiple instance learning and learns to automatically attend to tampered faces. Our algorithm outperforms representative approaches for both forgery classification and localization on FFIW-10K, and also shows high generalization ability on existing benchmarks. We hope that our dataset and study will help the community to explore this new field in more depth.
  4. Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing
    Tianfei Zhou, Wenguan Wang, Si Liu, Yi Yang, and Luc Van Gool.

    Computer Vision and Pattern Recognition (CVPR), 2021

    To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. Then, the difficult pixel grouping problem is cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, a differentiable solution is developed to exploit projected gradient descent and Dykstra's cyclic projection algorithm. This makes our method end-to-end trainable and allows back-propagating the grouping error to directly supervise multi-granularity human representation learning. This is distinguished from current bottom-up human parsers or pose estimators which require sophisticated post-processing or heuristic greedy algorithms. Experiments on three instance-aware human parsing datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
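The paper's differentiable matcher combines projected gradient descent with Dykstra's cyclic projections; the simpler, related alternating-normalization (Sinkhorn) sketch below conveys the underlying idea of relaxing maximum-weight bipartite matching into a doubly-stochastic soft assignment. The shapes, temperature, and iteration count are illustrative assumptions:

```python
import numpy as np

def soft_assignment(affinity, n_iter=100, tau=0.05):
    """Relax maximum-weight bipartite matching (e.g., joints to persons)
    into a doubly-stochastic soft assignment by cyclically projecting
    onto the row- and column-sum constraints (Sinkhorn iterations).
    affinity: (K, P) matching scores between K joints and P persons."""
    a = np.exp(affinity / tau)                 # sharpen with temperature
    for _ in range(n_iter):
        a = a / a.sum(axis=1, keepdims=True)   # each joint sums to 1
        a = a / a.sum(axis=0, keepdims=True)   # each person sums to 1
    return a
```

Because every step is differentiable, the grouping error can be back-propagated through the assignment to the feature extractor, which is the property the paper exploits for end-to-end training.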
  5. Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation
    Tianfei Zhou, Jianwu Li, Xueyi Li, and Ling Shao.

    Computer Vision and Pattern Recognition (CVPR), 2021

    This paper addresses the task of unsupervised video multi-object segmentation. Current approaches follow a two-stage paradigm: 1) detect object proposals using a pre-trained Mask R-CNN, and 2) conduct generic feature matching for temporal association using re-identification techniques. However, the generic features, widely used in both stages, are not reliable for characterizing unseen objects, leading to poor generalization. To address this, we introduce a novel approach for more accurate and efficient spatio-temporal segmentation. In particular, to address instance discrimination, we propose to combine foreground region estimation and instance grouping together in one network, and additionally introduce temporal guidance for segmenting each frame, enabling more accurate object discovery. For temporal association, we complement current video object segmentation architectures with a discriminative appearance model, capable of capturing more fine-grained target-specific information. Given object proposals from the instance discrimination network, three essential strategies are adopted to achieve accurate segmentation: 1) target-specific tracking using a memory-augmented appearance model; 2) target-agnostic verification to trace possible tracklets for each proposal; 3) adaptive memory updating using the verified segments. We evaluate the proposed approach on DAVIS17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
  6. Towards A Weakly Supervised Framework for 3D Point Cloud Object Detection and Annotation
    Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool.

    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021

    It is quite laborious and costly to manually label LiDAR point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised framework that allows learning 3D detection from a few weakly annotated examples. This is achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under inaccurate and inexact supervision, obtained by our proposed BEV center-click annotation strategy, where only the horizontal object centers are click-annotated in bird's-eye-view scenes. Stage-2 learns to predict cuboids and confidence scores in a coarse-to-fine, cascaded manner, under incomplete supervision, i.e., only a small portion of object cuboids are precisely annotated. On the KITTI dataset, using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 86-97% of the performance of current top-leading, fully supervised detectors (which require 3,712 exhaustively annotated scenes with 15,654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, supporting both automatic and active (human-in-the-loop) working modes. The annotations generated by our model can be used to train 3D object detectors, achieving over 95% of their original performance (with manually labeled training data).
  7. Hierarchical Human Semantic Parsing with Comprehensive Part-Relation Modeling
    Wenguan Wang, Tianfei Zhou, Siyuan Qi, Jianbing Shen, and Song-Chun Zhu.

    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021

    Human parsing aims at pixel-wise semantic understanding of human bodies. As human bodies are hierarchically structured, how to model human structures is the central theme in this task. We start by analyzing three inference processes over the human hierarchy: direct inference (directly predicting human semantic parts using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). We then formulate the problem as a compositional neural information fusion (CNIF) framework, which assembles the information from the three inference processes in a conditional manner, i.e., considering the confidence of the sources. Based on CNIF, we further present a part-relation-aware human parser (PRHP), which precisely describes three kinds of human part relations, i.e., decomposition, composition, and dependency, with three distinct relation networks. Expressive relation information can be captured by constraining the parameters in the relation networks to satisfy the specific geometric characteristics of different relations. By assimilating generic message-passing networks with their edge-typed, convolutional counterparts, PRHP performs iterative reasoning over the human hierarchy. With these efforts, PRHP provides a more general and powerful form of CNIF, and lays the foundation for more sophisticated and flexible patterns of human relation reasoning. Experiments on five datasets demonstrate that our two human parsers outperform the state of the art in all cases.
  8. Cascaded Parsing of Human-Object Interaction Recognition
    Tianfei Zhou, Siyuan Qi, Wenguan Wang, Jianbing Shen, and Song-Chun Zhu.

    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021

    This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for a multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage. The structured interaction reasoning module is built upon a graph parsing neural network (GPNN). In particular, GPNN infers a parse graph that i) interprets meaningful HOI structures by a learnable adjacency matrix, and ii) predicts action (edge) labels. Within an end-to-end, message-passing framework, GPNN blends learning and inference, iteratively parsing HOI structures and reasoning HOI representations (i.e., instance and relation features). Further beyond relation detection at a bounding-box level, we make our framework flexible to perform fine-grained pixel-wise relation segmentation; this provides a new glimpse into better relation modeling. A preliminary version of our CP-HOI model reached 1st place in the ICCV2019 Person in Context Challenge, on both relation detection and segmentation. Our CP-HOI shows promising results on two popular HOI recognition benchmarks, i.e., V-COCO and HICO-DET.
  9. Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation
    Xueyi Li, Tianfei Zhou#, Jianwu Li, Yi Zhou, and Zhaoxiang Zhang.

    The AAAI Conference on Artificial Intelligence (AAAI), 2021

    Acquiring sufficient ground-truth supervision to train deep visual models has been a bottleneck over the years due to the data-hungry nature of deep learning. This is exacerbated in some structured prediction tasks, such as semantic segmentation, which requires pixel-level annotations. This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation. We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths, which can be used for training more accurate segmentation models. In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes, and the underlying relations between a pair of images are characterized by an efficient co-attention mechanism. Moreover, in order to prevent the model from paying excessive attention to common semantics only, we further propose a graph dropout layer, encouraging the model to learn more accurate and complete object responses. The whole network is end-to-end trainable by iterative message passing, which propagates interaction cues over the images to progressively improve the performance. We conduct experiments on the popular PASCAL VOC 2012 and COCO benchmarks, and our model yields state-of-the-art performance.


2020

  1. MATNet: Motion-Attentive Transition Network for Zero-Shot Video Object Segmentation
    Tianfei Zhou, Jianwu Li, Shunzhou Wang, Ran Tao, and Jianbing Shen.

    IEEE Transactions on Image Processing (TIP), 2020

    In this paper, we present a novel end-to-end learning neural network, i.e., MATNet, for zero-shot video object segmentation (ZVOS). Motivated by human visual attention behavior, MATNet leverages motion cues as a bottom-up signal to guide the perception of object appearance. To achieve this, an asymmetric attention block, named Motion-Attentive Transition (MAT), is proposed within a two-stream encoder network to first identify moving regions and then attend appearance learning to capture the full extent of objects. Placing MATs at different convolutional layers, our encoder becomes deeply interleaved, allowing for close hierarchical interactions between object appearance and motion. Such a biologically-inspired design proves superior to conventional two-stream structures, which treat motion and appearance independently in separate streams and often suffer severe overfitting to object appearance. Moreover, we introduce a bridge network to modulate multi-scale spatiotemporal features into more compact, discriminative and scale-sensitive representations, which are subsequently fed into a boundary-aware decoder network to produce accurate segmentation with crisp boundaries. We perform extensive quantitative and qualitative experiments on four challenging public benchmarks, i.e., DAVIS16, DAVIS17, FBMS and YouTube-Objects. Results show that our method achieves compelling performance against current state-of-the-art ZVOS methods. To further demonstrate the generalization ability of our spatiotemporal learning framework, we extend MATNet to another relevant task: dynamic visual attention prediction (DVAP). Experiments on two popular datasets (i.e., Hollywood-2 and UCF-Sports) further verify the superiority of our model. Our implementations have been made publicly available.
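The bottom-up, motion-guides-appearance intuition behind MAT can be caricatured in a few lines. This invented gate is far simpler than the paper's asymmetric attention block and is meant only to convey the idea of motion cues re-weighting appearance features:

```python
import numpy as np

def motion_attentive_transition(appearance, motion):
    """Toy rendering of the MAT idea: derive a soft spatial saliency map
    from motion features and use it to gate (re-weight) the appearance
    stream, so moving regions dominate subsequent appearance learning.
    Shapes: (H, W, C) feature maps for both streams."""
    saliency = 1.0 / (1.0 + np.exp(-motion.mean(axis=-1, keepdims=True)))
    return appearance * saliency   # moving regions keep stronger responses
```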
  2. Weakly Supervised 3D Object Detection from Lidar Point Cloud
    Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai.

    European Conference on Computer Vision (ECCV), 2020

    It is laborious to manually label point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised approach for 3D object detection, requiring only a small set of weakly annotated scenes, associated with a few precisely labeled object instances. This is achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under weak supervision, i.e., only the horizontal centers of objects are click-annotated in bird's-eye-view scenes. Stage-2 learns to refine the cylindrical proposals into cuboids with confidence scores, using a few well-labeled object instances. Using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 85-95% of the performance of current top-leading, fully supervised detectors (which require 3,712 exhaustively and precisely annotated scenes with 15,654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, allowing both automatic and active working modes. The annotations generated by our model can be used to train 3D object detectors, achieving over 94% of their original performance (under manually labeled data). Our experiments also show our model's potential for boosting performance given more training data. The above designs make our approach highly practical and introduce new opportunities for learning 3D object detection with a reduced annotation burden.
  3. Video Object Segmentation with Episodic Graph Memory Networks
    Xiankai Lu, Wenguan Wang, Martin Danelljan, Tianfei Zhou, Jianbing Shen, and Luc Van Gool.

    European Conference on Computer Vision (ECCV), 2020

    How to make a segmentation model efficiently adapt to a specific video and to online target appearance variations is a fundamentally crucial issue in the field of video object segmentation. In this work, a graph memory network is developed to address the novel idea of "learning to update the segmentation model". Specifically, we exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges. Further, learnable controllers are embedded to ease memory reading and writing, as well as to maintain a fixed memory scale. The structured, external memory design enables our model to comprehensively mine and quickly store new knowledge, even with limited visual information, and the differentiable memory controllers slowly learn, via gradient descent, an abstract method for storing useful representations in the memory and for later using these representations for prediction. In addition, the proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks. Extensive experiments on four challenging benchmark datasets verify that our graph memory network is able to facilitate the adaptation of the segmentation network for case-by-case video object segmentation.
  4. Cascaded Human-Object Interaction Recognition
    Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, and Jianbing Shen.

    Computer Vision and Pattern Recognition (CVPR), 2020

    Rapid progress has been witnessed for human-object interaction (HOI) recognition, but most existing models are confined to single-stage reasoning pipelines. Considering the intrinsic complexity of the task, we introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. Each of the two networks is also connected to its predecessor at the previous stage, enabling cross-stage information propagation. The interaction recognition network has two crucial parts: a relation ranking module for high-quality HOI proposal selection and a triple-stream classifier for relation prediction. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding. Further beyond relation detection on a bounding-box level, we make our framework flexible to perform fine-grained pixel-wise relation segmentation; this provides a new glimpse into better relation modeling. Our approach reached the 1st place in the ICCV2019 Person in Context Challenge, on both relation detection and segmentation tasks. It also shows promising results on V-COCO.
  5. Motion-Attentive Transition for Zero-Shot Video Object Segmentation
    Tianfei Zhou, Shunzhou Wang, Yi Zhou, Yazhou Yao, Jianwu Li, and Ling Shao.

    The AAAI Conference on Artificial Intelligence (AAAI), 2020

    In this paper, we present a novel Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation, which provides a new way of leveraging motion information to reinforce spatio-temporal object representation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder, which transforms appearance features into motion-attentive representations at each convolutional stage. In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance. This is superior to the typical two-stream architecture, which treats motion and appearance separately in each stream and often suffers from overfitting to appearance information. Additionally, a bridge network is proposed to obtain a compact, discriminative and scale-sensitive representation of the multi-level encoder features, which is further fed into a decoder to produce segmentation results. Extensive experiments on three challenging public benchmarks (i.e., DAVIS-16, FBMS and YouTube-Objects) show that our model achieves compelling performance against the state of the art.

Before 2020

  1. Locality-Constrained Collaborative Model for Robust Visual Tracking
    Tianfei Zhou, Yao Lu, and Huijun Di.

    IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), 2017

    This paper presents a novel discriminative, generative, and collaborative appearance model for robust object tracking. In contrast to existing methods, we use different appearance manifolds to represent the target in the discriminative and generative appearance models and propose a novel collaborative scheme to combine these two components. In particular: 1) for the discriminative component, we develop a graph regularized discriminant analysis (GRDA) algorithm that finds a projection to more effectively distinguish the target from the background; 2) for the generative component, we introduce a simple yet effective coding method for object representation that involves no optimization and thus achieves better efficiency; and 3) for the collaborative model, we apply GRDA again to find a subspace for discriminating the likelihood features (generated from the discriminative and generative appearance models) and use the nearest-neighbor criterion to determine the final likelihood. In addition, all components are updated online so that our tracker can handle appearance changes effectively. Experimental results on 23 challenging image sequences demonstrate that the proposed algorithm achieves better performance than other state-of-the-art methods.
  2. Video Object Segmentation Aggregation
    Tianfei Zhou, Yao Lu, Huijun Di, and Jian Zhang.

    International Conference on Multimedia & Expo (ICME), 2016

    We present an approach for unsupervised object segmentation in unconstrained videos. Driven by the latest progress in this field, we argue that segmentation performance can be largely improved by aggregating the results generated by state-of-the-art algorithms. Initially, objects in individual frames are estimated through a per-frame aggregation procedure using majority voting. While this predicts object locations relatively accurately, the initial estimation fails to cover parts that are wrongly labeled by more than half of the algorithms. To address this, we build a holistic appearance model from non-local appearance cues via linear regression. We then integrate the appearance priors and spatio-temporal information into an energy minimization framework to refine the initial estimation. We evaluate our method on challenging benchmark videos and demonstrate that it outperforms state-of-the-art algorithms.
  3. Abrupt Motion Tracking via Nearest Neighbor Field Driven Stochastic Sampling
    Tianfei Zhou, Yao Lu, Feng Lv, Huijun Di, Qingjie Zhao, and Jian Zhang.

    Neurocomputing, 2015

    Stochastic sampling based trackers have shown good performance for abrupt motion tracking and have thus gained popularity in recent years. However, conventional methods tend to use a two-stage sampling paradigm, in which the search space must be uniformly explored in an inefficient preliminary sampling phase. In this paper, we propose a novel sampling-based method within the Bayesian filtering framework to address this problem. Within the framework, nearest neighbor field estimation is utilized to compute the importance proposal probabilities, which guide the Markov chain search towards promising regions and thus enhance sampling efficiency; given the motion priors, a smoothing stochastic sampling Monte Carlo algorithm is proposed to approximate the posterior distribution through a smoothing weight-updating scheme. Moreover, to track both abrupt and smooth motions simultaneously, we develop an abrupt-motion detection scheme that can discover the presence of abrupt motions during online tracking. Extensive experiments on challenging image sequences demonstrate the effectiveness and robustness of our algorithm in handling abrupt motions.