Vision-language models (VLMs) have emerged as a core component of AI agents, enabling them to connect natural language instructions with the visual world of documents, interfaces, and environments. Yet their ability to navigate and understand complex visual structures remains brittle. In this talk, I will present a research trajectory that combines (1) the construction of targeted benchmarks that reveal the limitations of current VLMs with (2) the development of architectural innovations that demonstrate how inductive biases can be introduced to address these gaps. Taken together, these efforts highlight how benchmarks and architectures co-evolve, and they move us toward a new generation of VLMs capable of both understanding and acting in complex multimodal environments.