Hi there! I am Teng Wang, a senior research scientist at XPeng Robotics, working on embodied foundation models. I was previously a senior researcher at Tencent ARC Lab, focusing on multimodal foundation models and video understanding.
I obtained my Ph.D. in Computer Science from MMLab at the University of Hong Kong (HKU), supervised by Prof. Ping Luo and Prof. Feng Zheng.
Prior to that, I received my B.E. and M.E. degrees from Sun Yat-sen University (SYSU) under the supervision of Prof. Huicheng Zheng.
Join Our Team:
Our long-term research agenda is to build embodied intelligence that operates seamlessly across both the physical and virtual worlds. We welcome applications from highly motivated research interns and full-time researchers in vision-language-action (VLA) modeling, open-world robotic manipulation, and whole-body control for humanoid robots. Please reach out via email.
News
[Jun 2026] We release Fe₀, a controlled study that demystifies what cross-embodiment data teaches embodied foundation models. Using minimal teleoperation as an anchor, we introduce an L1–L5 evaluation framework on the IRON humanoid and reveal clear capability boundaries between visual-semantic transfer and physical action generalization.
[Jun 2026] Our works on long-video understanding (OmniScript) and reinforced multimodal reasoning (Video-Holmes, Chain-of-Glimpse, Proactive Routing) have been accepted to ECCV 2026 and ICML 2026.
[Feb 2026] Our works on video grounding (TimeLens) and audio generation (AudioStory) were accepted to CVPR 2026.
[Nov 2025] We release the ARC series of video structural understanding models (ARC-Hunyuan-Video and ARC-Chapter).
[Aug 2025] UniAV (omni-modal video localization) accepted to IEEE TPAMI.
[Jun 2025] Five papers accepted to CVPR 2025, ICLR 2025 (Spotlight), ICCV 2025, EMNLP 2025 (Oral) and TCSVT.
[Dec 2024] Two papers accepted to ECCV 2024 and ACM MM 2024.
[Dec 2023] Six papers accepted to ICCV 2023 (1 Oral), CVPR 2023, and ICML 2023.
[Dec 2022] Four papers accepted to ICML 2022, TMM, ICCV 2021, and TCSVT.
Inside Fe₀: What Cross-Embodiment Data Teaches an Embodied Foundation Model Yizhuo Li*, Xiaoyu Zhang*, Yuying Ge*+, Teng Wang, Boyu Chen, Yi Chen, Bo Liu, Feng Qiu, Jiacheng Wei, Yuguo Gan, Hui Zhou, Yixiao Ge XPENG Robotics Blog, 2026.
[Project] (Fe₀ systematically unpacks what heterogeneous cross-embodiment data transfers to humanoid robots—and where it fails.)
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video Junfu Pu*, Yuxin Chen*, Teng Wang*, Ying Shan ECCV, 2026.
[Project]
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? Junhao Cheng, Yuying Ge+, Teng Wang+, Yixiao Ge, Jing Liao, Ying Shan ECCV, 2026.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning Yinan Zhou, Haokun Lin, Yichen Wu, Yuxin Chen, Teng Wang, Caifeng Shan, Zhenan Sun, Chen Ma, Li Zhu, Ying Shan ECCV, 2026.
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs Jun Zhang, Teng Wang+, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang CVPR, 2026.
AudioStory: Generating Long-Form Narrative Audio with Large Language Models Yuxin Guo, Teng Wang+, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan CVPR, 2026.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria ICML, 2026.
[Paper]
R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng AAAI, 2026.
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model Yatai Ji, Teng Wang+, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo arXiv, 2025.
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts Yuying Ge*, Yixiao Ge*, Chen Li*, Teng Wang*, Junfu Pu*, Yizhuo Li*, Lu Qiu* et al. arXiv, 2025.
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries Junfu Pu*, Teng Wang*, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan arXiv, 2025.
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation Haokun Lin*, Teng Wang*, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan arXiv, 2025.
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors Xiangchen Wang*, Jinrui Zhang*, Teng Wang*, Haigang Zhang, Feng Zheng EMNLP, 2025. (Oral Presentation)
Sample Then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng ICLR, 2025. (Spotlight)
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng CVPR, 2025.
Video Understanding with Large Language Models: A Survey Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, et al. IEEE TCSVT, 2025.
(2k GitHub stars)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng IEEE TPAMI, 2025.
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, Feng Zheng ECCV, 2024.
Two in One Go: Single-Stage Emotion Recognition with Decoupled Subject-Context Transformer Xinpeng Li, Teng Wang, Jian Zhao, Shuyi Mao, Jinbao Wang, Feng Zheng, Xiaojiang Peng, Xuelong Li ACM MM, 2024.
MCoCa: Towards Fine-Grained Multimodal Control in Image Captioning Shijie Zhao, Teng Wang, Jinrui Zhang, Xiaowei Wang, Feng Zheng Pattern Recognition (PR), 2024.
Caption Anything: Interactive Image Description with Diverse Multimodal Controls Teng Wang*, Jinrui Zhang*, Junjie Fei*, Yixiao Ge, Hao Zheng, et al. arXiv, 2023.
(1.7k GitHub stars)
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning Junjie Fei*, Teng Wang*, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng ICCV, 2023.
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos Teng Wang*, Jinrui Zhang*, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo arXiv, 2023.
(Rank 1 in PIC Challenge 2022 Track 1 & 2)
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng CVPR, 2023.
Accelerating Vision-Language Pretraining with Free Language Modeling Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo CVPR, 2023.
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-Task Interpolation Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo ICML, 2023.
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo ICML, 2022.
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training Zhu Liu, Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ke Lu IEEE TMM, 2022.
End-to-End Dense Video Captioning with Parallel Decoding Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo ICCV, 2021.
Event-Centric Hierarchical Representation for Dense Video Captioning Teng Wang, Huicheng Zheng, Mingjing Yu, Qian Tian, Haifeng Hu IEEE TCSVT, 2020.
Experience
2026 – present — Senior Research Scientist, XPeng Robotics
2024 – 2026 — Senior Researcher, Tencent ARC Lab
2020 – 2024 — Ph.D. in Computer Science, University of Hong Kong
Internships
2023 — Research Intern, ByteDance Seed
2022 — Research Intern, Tencent ARC Lab
2021 — Research Intern, Tencent Data Platform
2019 — Research Intern, Tencent AI Lab
Academic Service
Conference PC Member/Reviewer: CVPR, ICCV, ECCV, ICML, ICLR, NeurIPS, AISTATS