Citation
Kolhe, Rohan Rajendra (2025) Batik: a Vision Language Model for End-to-End Social Behavior Discovery, Interpretation and Annotation. Senior thesis (Major), California Institute of Technology. doi:10.7907/cmvx-1n97. https://resolver.caltech.edu/CaltechTHESIS:06062025-053817901
Abstract
Quantitative analysis of animal behavior is a burgeoning field. By converting behavior into measurable features, the field replaces anecdotal observations with precise, data-driven insights into how animals interact with their environment and with one another. Such analyses reveal reproducible structure and diversity within behaviors, illuminating complex behavioral patterns. Most current state-of-the-art methods focus on annotation and segmentation of behavior using pose estimation: they attach nodes to the body parts of mice, and a feature space is computed from those nodes. This feature space is then used to discover behavior classes or to train supervised behavior classifiers. However, this pipeline leaves out the time-consuming task of interpreting the resulting behavioral syllables, and it has multiple failure modes, such as an inability to handle frames that contain other objects of interest or frames where the nodes all lie on top of one another. The majority of these methods use a convolutional neural network (CNN) architecture. In recent years, a new class of feed-forward neural networks called transformers has been shown to surpass CNNs on most vision-related tasks. Batik addresses the long time commitment of interpreting these syllables by using multimodal transformers to extract unsupervised features directly from raw video and perform end-to-end analysis, bypassing pose estimation. Alongside state-of-the-art supervised annotation, Batik leverages fine-tuned large language models to automate discovery and provide expert human-level interpretation of behavior syllables, offering researchers a transformative UI-based tool for behavioral analysis through vision-language models. With these methods, we show 96% accuracy for syllables such as attack and mount, a large jump from previous methods (85%).
We also accurately identify differences in behavior across metabolic states and provide an interpretation of the broad investigative behavior in terms of sub-behaviors. We further apply our method to datasets from other species, correctly classifying distinct aggressive behaviors in flies with no additional fine-tuning of the underlying model, which demonstrates the model’s generalizability.
Item Type: | Thesis (Senior thesis (Major)) |
---|---|
Subject Keywords: | Social Behavior, Behavior Classification, Behavior Interpretation, Transformer Models, Vision-Language Models, Machine Learning |
Degree Grantor: | California Institute of Technology |
Division: | Biology and Biological Engineering |
Major Option: | Computation and Neural Systems |
Thesis Availability: | Not set |
Research Advisor(s): | |
Thesis Committee: | |
Defense Date: | 5 June 2025 |
Record Number: | CaltechTHESIS:06062025-053817901 |
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:06062025-053817901 |
DOI: | 10.7907/cmvx-1n97 |
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
ID Code: | 17411 |
Collection: | CaltechTHESIS |
Deposited By: | Rohan Kolhe |
Deposited On: | 06 Jun 2025 22:04 |
Last Modified: | 17 Jun 2025 17:38 |
Full text not available from this repository.