
Utilizing Machine Learning Techniques to Rapidly Identify MUC2 Expression in Colon Cancer Tissues


Periyakoil, Preethi Kasthuri (2018) Utilizing Machine Learning Techniques to Rapidly Identify MUC2 Expression in Colon Cancer Tissues. Senior thesis (Major), California Institute of Technology. doi:10.7907/sapn-r691.


Colorectal cancer is the third most common cancer among American men and women. Like most tumors, colon cancer is sustained by a subpopulation of “stem cells” that can self-renew and differentiate into more specialized cell types. Detecting stem cells in images of colon cancer tissue would be useful, but doing so first requires knowing which genes the stem cells express and how to detect their expression patterns from tissue images. Machine learning (ML) is widely used in biological research to facilitate rapid cancer diagnosis. The current study demonstrates the feasibility and effectiveness of using ML techniques to rapidly detect expression of the gene MUC2 (mucin 2) in colon cancer tissue images. We analyzed histological images of colon cancer and segmented the nuclei to look for features (area, perimeter, eccentricity, compactness, etc.) that correlate with high or low levels of MUC2. Grid search was then run on this data set to tune the hyperparameters, and the following models were tested as potential classifiers: random forest, gradient boosting, decision trees with AdaBoost, and support vector machines. Of the tested models, the random forest classifier (F1 score of 0.71) and the gradient boosting classifier (F1 score of 0.72) predicted the output label most accurately. Under certain conditions, we identified four features that have predictive capabilities. Predicting individual gene expression with machine learning is the first step toward detecting genes that are specific to cancer stem cells in the early stages of cancer, while there is still hope for a cure.
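As an illustration of two quantities named in the abstract, the sketch below computes compactness and eccentricity for a single segmented nucleus and an F1 score for a set of binary predictions. The function names, the input values, and the specific formulas for compactness and eccentricity are assumptions made for illustration; the abstract does not state which definitions or libraries the thesis used.

```python
import math


def shape_features(area, perimeter, major_axis, minor_axis):
    """Return shape descriptors for one segmented nucleus.

    Assumed definitions (common conventions, not confirmed by the thesis):
      compactness  = 4*pi*area / perimeter**2    (1.0 for a perfect circle)
      eccentricity = sqrt(1 - (minor/major)**2)  (0.0 for a perfect circle)
    """
    compactness = 4.0 * math.pi * area / perimeter ** 2
    eccentricity = math.sqrt(1.0 - (minor_axis / major_axis) ** 2)
    return {
        "area": area,
        "perimeter": perimeter,
        "compactness": compactness,
        "eccentricity": eccentricity,
    }


def f1_score(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical nucleus: a perfect circle of radius 5 pixels.
circle = shape_features(
    area=math.pi * 25.0, perimeter=2.0 * math.pi * 5.0,
    major_axis=10.0, minor_axis=10.0,
)

# Hypothetical labels: 1 = high MUC2, 0 = low MUC2.
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a pipeline like the one described, per-nucleus descriptors of this kind would form the feature vectors passed to the classifiers, and the F1 score would be the metric used to compare tuned models.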

Item Type: Thesis (Senior thesis (Major))
Subject Keywords: machine learning, biology, medicine, cancer, colon, AI, ML, computer science
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Computer Science
Awards: Library Friends Senior Thesis Prize Finalist, 2018.
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Yue, Yisong
Group: Senior Undergraduate Thesis Prize
Thesis Committee:
  • None, None
Defense Date: 19 March 2018
Record Number: CaltechTHESIS:08232018-131754063
Persistent URL:
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 11159
Deposited By: Preethi Periyakoil
Deposited On: 24 Aug 2018 17:48
Last Modified: 02 Aug 2022 21:39


