

Brain Cell Types & Clustering Analysis

This document summarizes an analysis of unlabeled brain cell data drawn from three main cell types and their sub-types. It covers clustering and feature selection with regularized logistic regression to identify the genes that best distinguish cell types, and it examines the effect of hyper-parameters such as the number of principal components fed to T-SNE and the regularization strength of the models. Reproducibility and multiple testing are also discussed.

Uploaded by

Begad Hosni

6.419x Module 2 Report


BegadE

May 2022

Problem 2: Larger unlabeled subset

Part 1: Visualization
1. (3 points) Provide at least one visualization which clearly shows the existence of three main brain cell types as
described by the scientist, and explain how it shows this. Your visualization should support the idea that cells from a
different group (for example, excitatory vs inhibitory) can differ greatly.
Solution:
2. (4 points) Provide at least one visualization which supports the claim that within each of the three types, there are
numerous possible sub-types for a cell. In your visualization, highlight which of the three main types these sub-types
belong to. Again, explain how your visualization supports the claim.
Solution:
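
As an illustration of the kind of projection both questions ask for (not the report's own solution), the following is a minimal sketch using PCA computed with a NumPy SVD. The data are synthetic and hypothetical: `make_hierarchical_cells` and `pca_project` are names invented here, and the three main types with three sub-types each are generated so that main types lie far apart while sub-types lie close together.

```python
import numpy as np

def make_hierarchical_cells(n_per_subtype=30, n_genes=200, seed=0):
    """Synthetic stand-in (hypothetical): 3 main types, each with 3 sub-types."""
    rng = np.random.default_rng(seed)
    X, main, sub = [], [], []
    for t in range(3):
        type_center = rng.normal(0.0, 5.0, n_genes)  # main types far apart
        for s in range(3):
            # Sub-type centers are small perturbations of the main-type center.
            sub_center = type_center + rng.normal(0.0, 1.0, n_genes)
            X.append(sub_center + rng.normal(0.0, 0.5, (n_per_subtype, n_genes)))
            main += [t] * n_per_subtype
            sub += [3 * t + s] * n_per_subtype
    return np.vstack(X), np.array(main), np.array(sub)

def pca_project(X, n_components=2):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X, main_type, sub_type = make_hierarchical_cells()
Z = pca_project(X, n_components=2)

# Compare the spread within each main type to the distance between main types.
centroids = np.array([Z[main_type == t].mean(axis=0) for t in range(3)])
within = np.mean([
    np.linalg.norm(Z[main_type == t] - centroids[t], axis=1).mean()
    for t in range(3)
])
between = np.mean([
    np.linalg.norm(centroids[i] - centroids[j])
    for i in range(3) for j in range(i + 1, 3)
])
print(f"mean within-type distance:  {within:.2f}")
print(f"mean between-type distance: {between:.2f}")
```

A scatter plot of `Z` colored by `main_type`, with marker shape or shade set by `sub_type`, would give a picture of the sort requested: three well-separated groups, each containing several smaller clumps.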

Part 2: Unsupervised Feature Selection


1. (4 points) Using your clustering method(s) of choice, find a suitable clustering for the cells. Briefly explain how you
chose the number of clusters by appropriate visualizations and/or numerical findings.
Solution:
2. (6 points) We will now treat your cluster assignments as labels for supervised learning. Fit a logistic regression
model to the original data (not principal components), with your clustering as the target labels. Since the data is
high-dimensional, make sure to regularize your model using your choice of ℓ1, ℓ2, or elastic net, and separate the data
into training and validation or use cross-validation to select your model. Report your choice of regularization parameter
and validation performance.
Solution:
3. (9 points) Select the features with the top 100 corresponding coefficient values (since this is a multi-class model,
you can rank the coefficients using the maximum absolute value over classes, or the sum of absolute values). Take the
evaluation training data in p2evaluation and use a subset of the genes consisting of the features you selected. Train
a logistic regression classifier on this training data, and evaluate its performance on the evaluation test data. Report
your score.
Solution:
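
The three steps above can be sketched end to end as follows. This is a hedged illustration on synthetic data, not the report's solution: it assumes scikit-learn, uses K-means with the silhouette score to pick the number of clusters, an ℓ2-regularized logistic regression with the regularization strength `C` chosen on a validation split, and the sum of absolute coefficients over classes to rank genes. The arrays `X_unlabeled`, `X_eval`, and `y_eval` are hypothetical stand-ins for the course data, and the evaluation train/test split is created here only because the real pre-split files are not available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): 600 cells x 500 genes, 4 groups,
# with only the first 50 genes carrying group signal.
rng = np.random.default_rng(0)
n_cells, n_genes, n_informative, n_groups = 600, 500, 50, 4
centers = np.zeros((n_groups, n_genes))
centers[:, :n_informative] = rng.normal(0.0, 4.0, (n_groups, n_informative))
y_all = rng.integers(0, n_groups, n_cells)
X_all = centers[y_all] + rng.normal(0.0, 1.0, (n_cells, n_genes))
X_unlabeled = X_all[:300]                  # plays the role of the unlabeled subset
X_eval, y_eval = X_all[300:], y_all[300:]  # plays the role of the labeled evaluation data

# Step 1: choose the number of clusters by silhouette score.
sil = {}
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_unlabeled)
    sil[k] = silhouette_score(X_unlabeled, labels_k)
best_k = max(sil, key=sil.get)
cluster_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_unlabeled)

# Step 2: treat cluster assignments as labels and select C on a validation split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_unlabeled, cluster_labels, test_size=0.25,
    stratify=cluster_labels, random_state=0,
)
best_C, best_acc, best_model = None, -1.0, None
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)  # default penalty is l2
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc, best_model = C, acc, model

# Step 3: rank genes by the sum of absolute coefficients over classes, keep the top 100.
importance = np.abs(best_model.coef_).sum(axis=0)
top_genes = np.argsort(importance)[::-1][:100]

# Step 4: train on the evaluation training data restricted to the selected genes.
Xe_tr, Xe_te, ye_tr, ye_te = train_test_split(
    X_eval[:, top_genes], y_eval, test_size=0.33, stratify=y_eval, random_state=0
)
eval_score = LogisticRegression(C=1.0, max_iter=1000).fit(Xe_tr, ye_tr).score(Xe_te, ye_te)

print(f"chosen k = {best_k}, chosen C = {best_C}, validation accuracy = {best_acc:.3f}")
print(f"evaluation accuracy with top 100 genes = {eval_score:.3f}")
```

Ranking by the maximum absolute coefficient over classes, the other option named in the problem, would only change the `importance` line to `np.abs(best_model.coef_).max(axis=0)`.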

Problem 3: Influence of Hyper-parameters
1. (3 points) When we created the T-SNE plot in Problem 1, we ran T-SNE on the top 50 PC’s of the data. But
we could have easily chosen a different number of PC’s to represent the data. Run T-SNE using 10, 50, 100, 250,
and 500 PC’s, and plot the resulting visualization for each. What do you observe as you increase the number of PC’s
used?
Solution:

2. (13 points) Pick three hyper-parameters below and analyze how changing the hyper-parameters affects the conclusions
that can be drawn from the data. Please choose at least one hyper-parameter from each of the two categories
(visualization and clustering/feature selection). At minimum, evaluate the hyper-parameters individually, but you may
also evaluate how joint changes in the hyper-parameters affect the results. You may use any of the datasets we have
given you in this project. For visualization hyper-parameters, you may find it productive to augment your analysis with
experiments on synthetic data, though we request that you use real data in at least one demonstration.
Solution:
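
One way to set up the experiment in question 1, and to reuse it for other visualization hyper-parameters such as perplexity, is sketched below. It assumes scikit-learn's `PCA` and `TSNE`; the helper `tsne_on_top_pcs` and the synthetic data are hypothetical. The PC counts named in the problem (10, 50, 100, 250, 500) need at least 500 cells and 500 genes, so the demo uses smaller counts on a small matrix to stay fast.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def tsne_on_top_pcs(X, pc_counts, perplexity=30.0, seed=0):
    """Run T-SNE on the first k principal components for each k in pc_counts."""
    # Compute the PCs once with an exact SVD, then slice the first k columns per run.
    pcs = PCA(n_components=max(pc_counts), svd_solver="full").fit_transform(X)
    embeddings = {}
    for k in pc_counts:
        tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
        embeddings[k] = tsne.fit_transform(pcs[:, :k])
    return embeddings

# Synthetic stand-in (hypothetical): 150 cells x 120 genes in 3 groups.
rng = np.random.default_rng(0)
group = rng.integers(0, 3, 150)
centers = rng.normal(0.0, 3.0, (3, 120))
X = centers[group] + rng.normal(0.0, 1.0, (150, 120))

embeddings = tsne_on_top_pcs(X, pc_counts=(5, 20, 50, 100))
for k, emb in embeddings.items():
    print(f"{k:>3} PCs -> embedding shape {emb.shape}")
```

Fixing `random_state` keeps the runs comparable, so differences between plots reflect the number of PCs rather than the random initialization. Each entry of `embeddings` can be scatter-plotted side by side to compare how cluster separation changes as more PCs are included.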

