1 Description of study and data statistics

The study by Zhao et al. [1] examined the role of the Susd2 gene in suppressing immune response to tumor formation by CD8+ T cells. Functional CD8+ T cells produce antitumor effectors in response to the binding of interleukin-2 (IL-2) to interleukin-2 receptor α (IL-2Rα). The study found that the SUSD2 protein interacts with IL-2Rα, which suppresses the binding of IL-2. This interference with IL-2Rα signaling leads to suppression of antitumor molecules in CD8+ T cells. To investigate the effect of deleting the Susd2 gene on antitumor function, wild-type mice were compared to Susd2-/- mice, with increased suppression of tumor growth observed in the Susd2-/- mice. The study made significant use of scRNA-seq analysis to examine gene expression profiles of CD45+ immune cells (which include CD8+ T cells) in the tumor microenvironment (TME) of tumors from wild-type and Susd2-/- mice. A number of common statistical analyses of the scRNA-seq data were performed, based on a Seurat pipeline for single-cell data. These analysis steps (on post-QC data) included identifying the most variable genes, performing data integration, using PCA to identify the top principal components, and performing unsupervised clustering based on these PCs using uniform manifold approximation and project (UMAP) to identify subpopulations of immune cells. Differential expression analysis was then performed across the clusters, using Seurat’s method based on the Wilcoxon rank test. This analysis was repeated on subclusters on interesting genes, and functional and pathway enrichment analysis was also performed to examine the biological activity of identified DEGs.

For this assignment, we partially recreate some of these same analysis steps, using Seurat for scRNA-seq data analysis.


2 Setting up the Seurat object

During initial creation of the Seurat object, only features detected in at least 3 cells were included, to exclude uninformative genes and reduce dimensionality of the dataset. For our dataset we used a single sample from the original Zhao et al. study, which was supplied as an HDF5 file.

library(dplyr)
library(Seurat)
library(patchwork)

# Load the cd8 dataset
setwd("~/BMI7235/WK3/HW1")

cd8_fromH5 <- Read10X_h5("./data/GSE210704_WT.h5")

# Initialize the Seurat object with the raw (non-normalized) data.
cd8 <- CreateSeuratObject(counts = cd8_fromH5, project = "CD8 TILs", min.cells = 3, min.features = 0)
cd8
## An object of class Seurat 
## 17428 features across 7469 samples within 1 assay 
## Active assay: RNA (17428 features, 0 variable features)
##  1 layer present: counts


3 Data filtering / QC

Using the Seurat object, violin plots were then constructed for number of features, number of counts, and percent mitochondrial content, and further data filtering was performed based on inspection of the plots.

Cells with less than 200 features were filtered, as these may represent dead or dying cells; and cells with greater than 4000 features were also excluded, based on the distribution in the violin plot. Although the distribution of features continues past this cut-off of 4000, and even displays a smaller second peak, these cells are likely doublets, and should therefore should be excluded.

Cells were further filtered based on mitchondrial percent, as a high percentage may indicated nonviable/dying cells. Excluding percentages >5% is standard, and has been shown to perform well in mice tissue [2], but there has been some evidence that tumor cells exhibit naturally higher mitochondrial activity and over-filtering may remove biologically relevant information [3]. I therefore chose to filter at 10%, which also appears to be a natural cutoff based on the violin plot distribution.

cd8[["percent.mt"]] <- PercentageFeatureSet(cd8, pattern = "^mt-")

# Plot QC violin plots
VlnPlot(cd8, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Perform data filtering
cd8 <- subset(cd8, nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 10)


4 Normalization and variable feature selection

We next normalize our gene count data to account for variation introduced by technical factors rather than by biologically significant differences. Then, using our normalized data, we identify the 2000 most variable features – that is, those genes which vary the most across cells – as these will be the most informative genes to look at for the study. The dot plot below shows these 2000 most variable features, with the top 10 most variable genes labeled.

cd8 <- NormalizeData(cd8, normalization.method = "LogNormalize", scale.factor = 10000)
cd8 <- FindVariableFeatures(cd8, selection.method = "vst", nfeatures = 2000)

top10 <- head(VariableFeatures(cd8), 10)
plot1 <- VariableFeaturePlot(cd8)
LabelPoints(plot = plot1, points = top10, repel = TRUE)


5 Scaling and PCA

We next scale our data, while also regressing out cells with the most variable mitochondrial percentage, for improving dimensional reduction and later clustering. We then perform linear dimension reduction using Principal Component Analysis (PCA), and visualize the results: examining the top genes associated with the first two principle components; looking at how cells cluster according to these components; and using heatmaps to examine the top genes contributing to the heterogeneity of a selection of the top principal components (PCs 1 through 15).

To select the number of PCs to include for further analysis, we can examine these heatmaps to identify where the PCs start to become uninformative. In conjunction with this, an elbow plot of the standard deviations of PCs is useful for identifying the range of PCs to include. Visually, the inflection point of this plot appears to occur around PC 7, so we choose to include the first 7 PCs.

cd8 <- ScaleData(object = cd8, vars.to.regress = c("percent.mt"))
cd8 <- RunPCA(cd8)

VizDimLoadings(cd8, dims = 1:2, reduction = "pca")

DimPlot(object = cd8, reduction = "pca")

DimHeatmap(cd8, dims = 1:15, cells = 500, balanced = TRUE)

ElbowPlot(cd8)


6 UMAP and Clustering

We use these 7 PCs to run a non-linear dimensional reduction (UMAP), and construct a KNN graph in this reduction space. We then identify clusters. To choose a resolution for clustering, I referenced the original Zhao et al. study, in which they identified 18 cell types in their UMAP clustering; I therefore tuned my clustering resolution to have a similar result, finding a resolution of 0.8 to produce the same number of clusters.

Below we provide the plot of our UMAP clusters, as well as a table of the number of cells per cluster.

cd8 <- RunUMAP(cd8, dims = 1:7)
cd8 <- FindNeighbors(cd8, dims = 1:7)
cd8 <- FindClusters(cd8, resolution = 0.8)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4804
## Number of edges: 141339
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8459
## Number of communities: 18
## Elapsed time: 0 seconds
DimPlot(cd8, reduction = "umap")

# Show number of cells per cluster
table(Idents(cd8))
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 620 592 530 385 341 339 269 253 223 219 192 189 186 162 106 100  58  40


7 Differential expression and DEG visualization

Next we performed differential expression analysis to compare expression levels across our identified clusters. We filtered DE results to keep only those with L2FC > 0.25 and adjusted p-value < 0.05. The first ten entries of our final DEG table are shown.

DefaultAssay(cd8) <- "RNA"
cd8.markers <- FindAllMarkers(cd8, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Filter results for p_val_adj < 0.05
cd8.markers <- subset(cd8.markers, p_val_adj < 0.05)

cd8.markers[1:10, ]
##                p_val avg_log2FC pct.1 pct.2     p_val_adj cluster   gene
## Arg1   1.009272e-217   1.933432 0.990 0.604 1.758960e-213       0   Arg1
## Hilpda 3.065514e-167   1.804581 0.924 0.534 5.342577e-163       0 Hilpda
## Ctsl   4.927354e-153   1.896761 0.952 0.677 8.587392e-149       0   Ctsl
## Emp1   1.303033e-136   2.086978 0.789 0.394 2.270927e-132       0   Emp1
## Mmp8   2.610907e-131   3.733462 0.368 0.062 4.550288e-127       0   Mmp8
## Cxcl3  5.827156e-119   2.249712 0.463 0.114 1.015557e-114       0  Cxcl3
## Pf4    6.578964e-118   3.192398 0.682 0.297 1.146582e-113       0    Pf4
## F13a1  8.934712e-116   1.868238 0.739 0.337 1.557142e-111       0  F13a1
## Lgals3 4.730797e-115   1.089760 0.995 0.942 8.244832e-111       0 Lgals3
## Ccl9   1.633207e-114   1.924763 0.831 0.477 2.846354e-110       0   Ccl9


8 Selected gene visualization

Next, to examine a small selection of genes, we picked 4 DEGs at random from within a similar range of the results (from about 4 to 6.5 L2FC). These genes were Crabp2, Sox9, Gulp1, and Slurp1. For each gene we plotted violin and feature plots to examine that gene’s distribution among the clusters. Finally, we plotted a heatmap of the top 10 DEGs in each cluster. (Due to the high number of clusters, gene names are compressed and hard to read.)

genes <- c("Crabp2", "Sox9", "Gulp1", "Slurp1")

# Create violin plots and feature plots for 4 selected genes
VlnPlot(cd8, features = genes, slot = "counts", log = TRUE)

FeaturePlot(cd8, features = genes)

# Create heatmap for top 10 genes in each cluster
top10 <- cd8.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC)
DoHeatmap(cd8, features = top10$gene) + NoLegend()


9 References

  1. Zhao B, Gong W, Ma A, et al. SUSD2 suppresses CD8+ T cell antitumor immunity by targeting IL-2 receptor signaling. Nat Immunol. 2022;23(11):1588-1599. doi:10.1038/s41590-022-01326-8
  2. Osorio D, Cai JJ. Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control. Bioinformatics. 2021;37(7):963-967. doi:10.1093/bioinformatics/btaa751
  3. Yates J, Kraft A, Boeva V. Filtering cells with high mitochondrial content depletes viable metabolically altered malignant cell populations in cancer single-cell studies. Genome Biol. 2025;26(1):91. Published 2025 Apr 9. doi:10.1186/s13059-025-03559-w