The talk is scheduled for Friday, August 13, 2021 at 10:00, online via Tencent.

Abstract

Self-supervised learning (SSL) in computer vision (CV) aims to learn general-purpose image encoders from raw pixels without relying on manual supervision; the learned networks then serve as the backbone for various downstream tasks. To this end, we develop efficient self-supervised vision transformers (EsViT) with two techniques. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new region-matching pre-training task that allows the model to capture fine-grained region dependencies and, as a result, improves the quality of the learned vision representations. Our results show that EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with roughly an order of magnitude higher throughput. When transferred to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets.
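As a rough illustration of the region-matching idea (a minimal sketch, not the paper's exact objective), the snippet below pairs each teacher region with its most similar student region by cosine similarity and applies a cross-entropy between their sharpened distributions. The function name, temperatures, and use of raw features as stand-in prototype scores are all assumptions for illustration.

```python
import numpy as np

def region_matching_loss(student_regions, teacher_regions,
                         temp_s=0.1, temp_t=0.04):
    """Hypothetical sketch of a region-matching objective:
    each teacher region is matched to its most similar student region
    (cosine similarity), then a cross-entropy is computed between the
    teacher's and the matched student's probability distributions."""
    # L2-normalize region features so dot products are cosine similarities
    s = student_regions / np.linalg.norm(student_regions, axis=-1, keepdims=True)
    t = teacher_regions / np.linalg.norm(teacher_regions, axis=-1, keepdims=True)
    sim = t @ s.T                 # (n_teacher, n_student) similarity matrix
    match = sim.argmax(axis=1)    # best-matching student region per teacher region

    def softmax(x, temp):
        z = x / temp
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Raw features stand in for prototype scores here (an assumption);
    # the teacher uses a lower temperature, i.e. a sharper target.
    p_t = softmax(teacher_regions, temp_t)
    p_s = softmax(student_regions[match], temp_s)
    return float(-(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean())
```

The matching step is what makes the loss region-aware: unlike a view-level objective, each local feature is supervised by its most semantically similar counterpart in the other view.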

Code and paper are available at https://github.com/microsoft/esvit.

Bio

Chunyuan Li is a senior researcher on the deep learning team at Microsoft Research, Redmond. His recent research focuses on large-scale self-supervised learning in computer vision and natural language processing. His results have been published in top artificial-intelligence venues such as NeurIPS/ICLR/ICML/CVPR/ACL/EMNLP, with 4000+ total citations. Chunyuan received his PhD in machine learning from Duke University.

More info: http://chunyuan.li.