0
On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness
arXiv:2508.09814v2 Announce Type: replace
Abstract: Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analysis characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noisy robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.
Abstract: Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analysis characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noisy robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.