Organisation
- Course:
  - 3+2 hours weekly (6 ECTS)
- Lecturer:
  - Prof. Dr. Thomas Seidl
- Assistant:
  - Dr. Udo Schlegel
- Audience:
  - Master's students in the programs of the Institute for Informatics
- Course Material:
  - Moodle
- Prior Knowledge:
  - The course expects participants to have basic skills in machine learning.
- Course Language:
  - English
Content
Clustering in the context of vision+language is about grouping images or image–text pairs based on their semantic meaning, rather than just low-level visual similarity. Using models like CLIP, we first project both images and text into a common embedding space, where distances reflect how similar concepts are in terms of meaning. In this space, standard clustering algorithms (for example k-means, hierarchical clustering, or density-based methods) can be applied to discover structure in large, unlabeled multimodal datasets.
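As a minimal sketch of this pipeline, the following extracts frozen CLIP image embeddings and clusters them with k-means. It assumes the Hugging Face transformers and scikit-learn packages; `image_paths` and the number of clusters are hypothetical placeholders:

```python
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP backbone; embeddings are extracted once and clustered offline.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # hypothetical file paths
images = [Image.open(p).convert("RGB") for p in image_paths]

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)  # unit norm: distances reflect cosine similarity

# Standard k-means on the embeddings; k is a design choice, not learned.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(feats.numpy())
```

Normalizing the embeddings first makes Euclidean k-means behave like clustering by cosine similarity, which matches how CLIP's space is trained.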
There are several general ways to combine clustering with vision–language models. A common approach is feature-based clustering, where we simply extract fixed embeddings from a pretrained model and cluster them; this is popular for dataset exploration, automatic tagging, or building semantic image search. A second direction is joint or deep clustering, where the model and the clusters are updated together, so that the representation is gradually shaped to produce tighter, more meaningful groups. A third line of work uses clustering inside the model pipeline itself, for example to group prompts, images, or tasks, and then adapt the model differently for each cluster (e.g., experts or prompt pools).
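The joint or deep clustering direction can be illustrated with a DEC-style objective (Deep Embedded Clustering), one common instance of updating representations and clusters together: soft assignments to learnable centroids are pulled toward a sharpened target distribution. The sketch below uses random tensors as stand-ins for encoder outputs; in practice `feats` would come from the (fine-tuned) vision-language encoder:

```python
import torch
import torch.nn.functional as F

# Stand-ins: feats would be encoder outputs; centroids are learnable
# cluster centers, typically initialized from a k-means run.
feats = torch.randn(256, 512, requires_grad=True)
centroids = torch.randn(10, 512, requires_grad=True)

def soft_assignments(z, mu, alpha=1.0):
    """DEC-style Student-t soft assignment of points to centroids."""
    dist2 = torch.cdist(z, mu).pow(2)
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened targets that up-weight confident assignments."""
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

# One refinement step: pull assignments toward their sharpened targets,
# so gradients shape both the representation and the centroids.
q = soft_assignments(feats, centroids)
p = target_distribution(q).detach()
loss = F.kl_div(q.log(), p, reduction="batchmean")
loss.backward()
```

Iterating this step gradually tightens the clusters, which is exactly the "representation shaped to produce tighter groups" effect described above.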
On the vision side, clustering can reveal latent visual concepts such as object types, styles, or scenes, while the language side provides interpretable anchors in the form of captions, tags, or label names. Text can be used to name clusters after they have been formed (“cluster-labelling”), or even to steer clustering by providing prototypes like “dog,” “cat,” “car,” which act as semantic centroids. More recent work also considers clustering image–text pairs jointly, discovering typical combinations of what is depicted and how it is described, which is useful for applications like retrieval, recommendation, or content moderation.
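The prototype idea can be sketched in a few lines: encode label prompts with CLIP's text tower and assign each image to its nearest text embedding. The prompt template and class names are free choices, and `image_feats` stands in for the normalized image embeddings from the first sketch:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Text prototypes act as semantic centroids; prompt wording is a design choice.
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# image_feats: unit-normalized CLIP image embeddings, e.g. `feats` from
# the feature-extraction sketch above; random stand-in here.
image_feats = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)

sims = image_feats @ text_feats.T   # cosine similarity to each text prototype
cluster_ids = sims.argmax(dim=-1)   # nearest-prototype assignment per image
```

The same text embeddings can also name clusters after the fact: for each discovered cluster, report the prompt whose embedding lies closest to the cluster mean.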
From a methodological perspective, these approaches differ in at least three dimensions: whether the vision–language model is frozen or fine-tuned, whether clustering is purely unsupervised or weakly supervised, and whether clusters live in a shared multimodal space or in modality-specific subspaces. Frozen-feature methods are simple and robust but may not align well with a specific domain; fine-tuned or jointly trained methods adapt better but require more computation and careful training. Overall, “clustering meets vision+language” has become a flexible toolbox that ranges from quick, exploratory data analysis with off-the-shelf embeddings to sophisticated, end-to-end systems that discover and exploit semantic structure in large-scale multimodal data.