Publications

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
The Role of Data Curation in Image Captioning
MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting
Systems and Methods for Training Voice Query Models