Teng Wang 王腾
Hi there! I'm Teng Wang, a researcher at Tencent ARC Lab, where I work on advancing multimodal foundation models and video understanding systems. I earned my Ph.D. in Computer Science from the University of Hong Kong (HKU) in 2024, where I was fortunate to be advised by Prof. Ping Luo and Prof. Feng Zheng. Before my doctoral studies, I completed my B.E. and M.E. degrees at Sun Yat-sen University (SYSU) under the supervision of Prof. Huicheng Zheng.

Prospective Collaborators: We are actively seeking motivated research interns and collaborators to join our Multimodal Foundation Model team at Tencent ARC Lab. If you share an interest in vision-language-audio tasks, video understanding, or multimodal reasoning, feel free to reach out via email!
Research

My research interests include multimodal foundation models, vision-language-audio tasks, video understanding, and multimodal reasoning.
Selected Publications

(* denotes equal contribution)
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
- Video Understanding with Large Language Models: A Survey
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
- Caption Anything: Interactive Image Description with Diverse Multimodal Controls
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
- Set-Level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
- Accelerating Vision-Language Pretraining with Free Language Modeling
- VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
- End-to-End Dense Video Captioning with Parallel Decoding
- Event-Centric Hierarchical Representation for Dense Video Captioning

Academic service
Journal reviewer for IJCV, IEEE TNNLS, IEEE TIP, IEEE TMM, and IEEE TCSVT.
 Competitions & Awards
- Rank 1 in the Make-up Temporal Video Grounding Track of the PIC Challenge at ACM MM 2022
- Rank 1 in the Make-up Dense Video Captioning Track of the PIC Challenge at ACM MM 2022
- Rank 2 in the Generic Event Boundary Captioning Track of the LOVEU Challenge at CVPR 2022
- Rank 2 in the Event Dense-Captioning Track of the ActivityNet Challenge at CVPR 2020, CVPR 2021, and CVPR 2022
- Rank 3 in the TinyAction Challenge at CVPR 2021