Batik: a Vision Language Model for End-to-End Social Behavior Discovery, Interpretation and Annotation

Citation

Kolhe, Rohan Rajendra (2025) Batik: a Vision Language Model for End-to-End Social Behavior Discovery, Interpretation and Annotation. Senior thesis (Major), California Institute of Technology. doi:10.7907/cmvx-1n97. https://resolver.caltech.edu/CaltechTHESIS:06062025-053817901

Abstract

Quantitative analysis of animal behavior is a burgeoning field. By converting behavior into measurable features, the field replaces anecdotal observations with precise, data-driven insights into how animals interact with their environment and with one another. Such analyses reveal reproducible structure and diversity within behaviors, illuminating complex behavioral patterns. Most current state-of-the-art methods focus on annotating and segmenting behavior using pose estimation: they attach nodes to the body parts of mice, and a feature space is computed from those nodes. This feature space is then used to discover behavior classes or to train supervised behavior classifiers. However, this approach leaves out the time-consuming task of interpreting the resulting behavioral syllables and has multiple failure modes, such as an inability to handle frames that contain other objects of interest or frames in which the nodes all lie on top of one another. Most of these methods are built on convolutional neural networks (CNNs). In recent years, a newer class of feed-forward neural networks called transformers has been shown to surpass CNNs on most vision-related tasks. Batik reduces the time needed to interpret these syllables by using multimodal transformers to extract unsupervised features directly from raw video and to perform end-to-end analysis, bypassing pose estimation. Alongside state-of-the-art supervised annotation, Batik leverages fine-tuned large language models to automate discovery and provide expert human-level interpretation of behavior syllables, offering researchers a transformative UI-based tool for behavioral analysis through vision-language models. With these methods, we achieve 96% accuracy for syllables such as attack and mount, a substantial improvement over previous methods (85%). We also accurately identify differences in behavior across metabolic states and provide an interpretation of the broad investigative behavior in terms of its sub-behaviors. We further apply our method to datasets from other species, correctly classifying distinct aggressive behaviors in flies with no additional fine-tuning of the underlying model, which demonstrates the model's generalizability.

Item Type:	Thesis (Senior thesis (Major))
Subject Keywords:	Social Behavior, Behavior Classification, Behavior Interpretation, Transformer Models, Vision-Language Models, Machine Learning
Degree Grantor:	California Institute of Technology
Division:	Biology and Biological Engineering
Major Option:	Computation and Neural Systems
Thesis Availability:	Not set
Research Advisor(s):	Anderson, David J.
Thesis Committee:	None, None
Defense Date:	5 June 2025
Record Number:	CaltechTHESIS:06062025-053817901
Persistent URL:	https://resolver.caltech.edu/CaltechTHESIS:06062025-053817901
DOI:	10.7907/cmvx-1n97
Default Usage Policy:	No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:	17411
Collection:	CaltechTHESIS
Deposited By:	Rohan Kolhe
Deposited On:	06 Jun 2025 22:04
Last Modified:	17 Jun 2025 17:38

Full text not available from this repository.
