Teng Wang 王腾


Senior Researcher at Tencent
Shenzhen, China

Email: ttengwang@gmail.com
GitHub
Google Scholar

Hi there! I'm Teng Wang. I am a researcher at Tencent ARC Lab, focusing on advancing multimodal foundation models and video understanding systems. Prior to this, I earned my Ph.D. in Computer Science from the University of Hong Kong (HKU) in 2024, where I was fortunate to be advised by Prof. Ping Luo and Prof. Feng Zheng. Before my doctoral studies, I completed my B.E. and M.E. degrees at Sun Yat-sen University (SYSU) under the supervision of Prof. Huicheng Zheng.

Prospective Collaborators: We are actively seeking motivated research interns and collaborators to join our Multimodal Foundation Model team at Tencent ARC Lab. If you share an interest in vision-language-audio tasks, video understanding, or multi-modal reasoning, feel free to reach out via email!

News

  • [Apr 2025] Three papers accepted to CVPR 2025, ICLR 2025 (1 Spotlight), and TCSVT.

  • [Dec 2024] Two papers accepted to ECCV 2024 and ACM MM 2024.

  • [Dec 2023] Six papers accepted to ICCV 2023 (1 oral), CVPR 2023, and ICML 2023.

  • [Dec 2022] Four papers accepted to ICML 2022, TMM 2022, ICCV 2021, and TCSVT 2020.

Research

My research interests include:

  • Unified Multimodal Models (vision-language-audio unification, understanding-generation unification)

  • Multi-Modal Reasoning (visual chain-of-thought, R1-like reasoning, etc.)

  • Long-Term Video Understanding (video grounding, dense captioning, long-chain reasoning, etc.)

Selected Publications

* equal contribution

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin*, Teng Wang*, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
arXiv, 2025.

Video understanding with large language models: A survey
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, et al.
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025. (2k stars on GitHub)

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng
arXiv, 2024.

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Luo, and Feng Zheng
European Conference on Computer Vision (ECCV), 2024.

Caption anything: Interactive image description with diverse multimodal controls
Teng Wang*, Jinrui Zhang*, Junjie Fei*, Yixiao Ge, Hao Zheng, et al.
arXiv, 2023. (1.7k stars on GitHub)

Transferable decoding with visual entities for zero-shot image captioning
Junjie Fei*, Teng Wang*, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng
International Conference on Computer Vision (ICCV), 2023.

Knowledge-aware prompt tuning for generalizable vision-language models
Baoshuo Kan*, Teng Wang*, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng
International Conference on Computer Vision (ICCV), 2023.

Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models
Dong Lu*, Zhiqiang Wang*, Teng Wang, Weili Guan, Hongchang Gao, Feng Zheng
International Conference on Computer Vision (ICCV Oral), 2023.

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang*, Jinrui Zhang*, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo
arXiv, 2023. (Rank 1 in PIC Challenge 2022, Tracks 1 & 2)

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo
International Conference on Machine Learning (ICML), 2022.

End-to-end dense video captioning with parallel decoding
Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo
International Conference on Computer Vision (ICCV), 2021.

Event-centric hierarchical representation for dense video captioning
Teng Wang, Huicheng Zheng, Mingjing Yu, Qian Tian, Haifeng Hu
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020.

Academic service

  • Conference reviewer for CVPR, ICCV, ECCV, ICML, ICLR, NeurIPS, AISTATS
  • Journal reviewer for IJCV, IEEE TNNLS, IEEE TIP, IEEE TMM, IEEE TCSVT

Experience

  • Research Intern at TikTok (2023), Tencent ARC Lab (2022), Tencent Data Platform (2021), Tencent AI Lab (2019)

Competitions & Awards