Setting Up

Load packages and set working directory.

Import counts, samples, and gene info.

star_counts <- read.delim("TCGA-LUSC.filter_counts.txt", row.names=1)

phenotype <- read.delim("TCGA-LUSC.clinical.tsv.gz")

gencode.v22.annotation.gene <- read.delim("gencode.v36.annotation.gtf.gene.probemap", row.names=1)

Sample filtering

To focus our analysis, we want to filter our samples/counts according to some appropriate sample variable. In the lecture example, we filtered according to diagnosis, reducing our phenotypes to only those with a “Stage IB” diagnosis. For this example, I decided to filter according to a different diagnosis system, using the “T” diagnosis value and restricting the samples to only those with a “T2” diagnosis. I have maintained the lecture example’s grouping of tumor/normal for the differential analysis, as this comparison is likely to have the most interesting results.

table(phenotype$sample_type.samples)

## 
##        FFPE Scrolls       Primary Tumor Solid Tissue Normal 
##                   6                 505                 119

# Filter variable to only contain two groups
phenotype<-subset(phenotype,sample_type.samples=="Primary Tumor" | sample_type.samples=="Solid Tissue Normal" )
table(phenotype$sample_type.samples)

## 
##       Primary Tumor Solid Tissue Normal 
##                 505                 119

# Reduce sample size for faster analysis and more focused results
phenotype<-subset(phenotype,phenotype$ajcc_pathologic_t.diagnoses=="T2")

table(phenotype$sample_type.samples)

## 
##       Primary Tumor Solid Tissue Normal 
##                 174                  44

rownames(phenotype)<-gsub("-",".",phenotype$sample)

Matching count and sample phenotype data

We want only the counts for samples present in our phenotype data, and only the phenotype data for samples with counts, so we create new data frames for each of these restricted sets (sample_counts and sample_meta, respectively). Reorder sample_counts to have the same order of sample IDs as sample_meta. We now have matched sets of counts and samples. Finally, for our differential expression analysis, we want to take the phenotype groupings of our analysis (“Solid Tissue Normal” and “Primary Tumor”) and set them as factors.

sample_counts <- star_counts[  , colnames(star_counts) %in% row.names(phenotype)]
sample_meta<-phenotype[rownames(phenotype) %in% colnames(sample_counts),  ]

#rownames(sample_meta)==colnames(sample_counts)

sample_counts<-sample_counts[,rownames(sample_meta)]

#rownames(sample_meta)==colnames(sample_counts)
#ncol(sample_counts)==nrow(sample_meta)

table(sample_meta$sample_type.samples)

## 
##       Primary Tumor Solid Tissue Normal 
##                 172                  15

sample_meta$sample_type.samples<-factor(sample_meta$sample_type.samples, 
                                             levels = c("Solid Tissue Normal","Primary Tumor"))

Plotting PCA

dds <- DESeqDataSetFromMatrix(countData = sample_counts,
                               colData = sample_meta,
                               design = ~ 1)

vsd <- vst(dds, blind = FALSE)
mat<-assay(vsd)

plotPCA(vsd,intgroup = "sample_type.samples")

## using ntop=500 top features by variance

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the DESeq2 package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Differential expression

Use deseq2 to perform differential expression analysis between tumor and normal groups.

dds1 <- DESeqDataSetFromMatrix(countData = sample_counts,
                                       colData = sample_meta,
                                       design = ~ sample_type.samples)

##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]

dds1 <- estimateSizeFactors(dds1)

##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]

dds1 <- estimateDispersions(dds1, fitType = "local")

## gene-wise dispersion estimates

## mean-dispersion relationship

##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]

## final dispersion estimates

dds1 <- nbinomWaldTest(dds1)
res <- results(dds1)

summary(res, alpha = 0.05)

## 
## out of 28267 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 10096, 36%
## LFC < 0 (down)     : 6253, 22%
## outliers [1]       : 2497, 8.8%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

result <- as.data.frame(res[order(res$padj), ])

result.plus.genes<-merge(gencode.v22.annotation.gene,result,by="row.names")
rownames(result.plus.genes) <- result.plus.genes$Row.names
result.plus.genes$Row.names <- NULL

result.plus.genes<-subset(result.plus.genes,!is.na(result.plus.genes$pvalue))

write.table(result.plus.genes,"DEG_results_full.txt",row.names=T,col.names=NA,quote=F,sep="\t")

Volcano plots

EnhancedVolcano(result.plus.genes,
                lab = result.plus.genes$gene,
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 0.05,
                FCcutoff = 0.58)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

EnhancedVolcano(result.plus.genes,
                lab = "",
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 0.000005,
                FCcutoff = 1)

Plotting heatmap

Before plotting our heat map, we want to consider how many genes are in the result of our differential expression analysis, as there is a limit to how many genes we can compare on a heat map. Because we must restrict our results, we want to do so on the basis of significance, filtering out less significant results. Using fold change and adjusted p-value (FDR) as our filtering parameters, we begin with a less restrictive filtering and review the results to determine if we need to raise our filtering thresholds. Let’s begin by taking only those genes with a fold change having absolute value greater than 1 and an FDR-adjusted p-value less than 0.05.

# Select differentially expressed genes with abs(FC) > 1 and adjusted p-value < 0.05
FC<-1
FDR<-0.05
 
significant_results<-subset(result.plus.genes,abs(result.plus.genes$log2FoldChange)>FC & result.plus.genes$padj<FDR)
(filtered_genes<-dim(significant_results)[1])

## [1] 10145

We see that this leaves us with 10145 genes. This is too many to plot on our heat map, so we want to filter our results more restrictively. We will try filtering according to abs(FC) > 5 and FDR < 0.00000001.

FC<-5
FDR<-0.00000001

significant_results<-subset(result.plus.genes,abs(result.plus.genes$log2FoldChange)>FC & result.plus.genes$padj<FDR)
(filtered_genes<-dim(significant_results)[1])

## [1] 648

Our more stringent filtering leaves us with 648 genes. This gives us enough samples for an informative heat map, without cluttering the heat map with too many samples, so we will proceed with this filtering of the results. (Note that this is a more stringent filtering than used in the lecture example, yet it still leaves us with a substantial number of genes.)

n_samples<-dim(mat)[2]

significant_expression<-merge(mat,significant_results,by="row.names")

  head(subset(gencode.v22.annotation.gene,gencode.v22.annotation.gene$gene=="U8"))

##                   gene chrom chromStart  chromEnd strand
## ENSG00000212144.1   U8  chr1  234593275 234593402      +
## ENSG00000202537.1   U8  chr2   86347062  86347195      -
## ENSG00000212581.1   U8  chr2  185661774 185661906      +
## ENSG00000212145.2   U8  chr3   41899111  41899253      -
## ENSG00000201398.1   U8  chr3  154007367 154007502      -
## ENSG00000201810.1   U8  chr3  180542701 180542836      +

  significant_expression<-subset(significant_expression,
                                 significant_expression$gene!="Metazoa_SRP" & 
                                   significant_expression$gene!="U8" & 
                                   significant_expression$gene!="Y_RNA"  & 
                                   significant_expression$gene!="SYT15" &
                                   significant_expression$gene!="U8")
  rownames(significant_expression) <- significant_expression$gene

    # IMPORTANT NOTE:
    # if you get an error about duplicate genes, add them to the subset function 
    # above and rerun the rownames assignment
  
    
significant_expression$Row.names <- NULL
significant_expression<-significant_expression[1:n_samples]

sca<-t(scale(t(significant_expression)))

#build sample annotation to label tumor and normal
ha<-HeatmapAnnotation(" "=sample_meta$sample_type.samples, 
                      col = list(" " = c("Solid Tissue Normal"="lightgray","Primary Tumor"="magenta")),
                      show_legend = T,simple_anno_size = unit(.5, "cm"))


h_map<-Heatmap(sca,top_annotation = ha,
        show_row_names = FALSE,show_column_names = FALSE,
        name="vst\nnorm",column_title=paste("Comparing Tumor vs Normal\nlogFC>",FC," and FDR<",FDR, sep=""))

draw(h_map)

Plotting boxplots

We want to pick out two of the best genes to examine as box plots, so let’s use the gene with the highest positive fold change (the most upregulated gene) and the gene with highest negative fold change (the most downregulated gene). From manually examining our results table, we can see that the most positively expressed gene is UGT1A7, while the most negatively expressed gene is AC008268.1.

significant_expression_sample <- merge(sample_meta,t(significant_expression),by="row.names")
rownames(significant_expression_sample)<-significant_expression_sample$Row.names
significant_expression_sample$Row.names<-NULL

result.plus.genes[which.max(result.plus.genes$log2FoldChange),]$gene

## [1] "UGT1A7"

result.plus.genes[which.min(result.plus.genes$log2FoldChange),]$gene

## [1] "AC008268.1"

boxplot(UGT1A7~sample_type.samples,data=significant_expression_sample,
        main=paste("Expression of an example positive gene"),
        xlab="Sample type", ylab="Normalized expression")

pos_boxplt<-recordPlot()

neg_boxplt<-boxplot(AC008268.1~sample_type.samples,data=significant_expression_sample,
        main=paste("Expression of an example negative gene"),
        xlab="Sample type", ylab="Normalized expression")

neg_boxplt<-recordPlot()

EnhancedVolcano(result.plus.genes,
                lab = result.plus.genes$gene,
                x = 'log2FoldChange',selectLab=c("UGT1A7","AC008268.1"),
                y = 'padj',
                pCutoff = 0.05,
                FCcutoff = 0.58)

sca<-subset(sca, rownames(sca) %in% c("UGT1A7","AC008268.1"))
Heatmap(sca,top_annotation = ha,
        show_row_names = TRUE,show_column_names = FALSE,
        name="vst\nnorm",column_title="Pick two genes")

Importing and filtering differentially expressed genes

DEG_results <- read.delim("DEG_results_full.txt",row.names=1)
FDR_results<-subset(DEG_results,log2FoldChange>FC & padj<FDR)

# Aiming for 200 to 2000 genes for functional enrichment
dim(FDR_results)[1]

## [1] 591

GO enrichment

Finally, we can use clusterProfiler to perform over-enrichment analysis. I have modified this section from the example script to analyse enrichment according to biological process (BP) rather than molecular function (MF). I have also used a different plot type – a dot plot rather than a bar plot – to display results of the analysis enrichment.

#Over-representation test
ego <- enrichGO(gene = FDR_results$gene, 
                OrgDb = org.Hs.eg.db, ont = "BP",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                keyType="SYMBOL")
head(ego)

##                    ID                       Description GeneRatio   BgRatio
## GO:0008544 GO:0008544             epidermis development    39/269 395/18805
## GO:0031424 GO:0031424                    keratinization    18/269  83/18805
## GO:0043588 GO:0043588                  skin development    30/269 324/18805
## GO:0030216 GO:0030216      keratinocyte differentiation    22/269 177/18805
## GO:0030326 GO:0030326      embryonic limb morphogenesis    18/269 124/18805
## GO:0035113 GO:0035113 embryonic appendage morphogenesis    18/269 124/18805
##            RichFactor FoldEnrichment   zScore       pvalue     p.adjust
## GO:0008544 0.09873418       6.902216 14.28170 1.160730e-21 2.768340e-18
## GO:0031424 0.21686747      15.160568 15.57532 1.156897e-16 1.379599e-13
## GO:0043588 0.09259259       6.472876 11.97068 4.410772e-16 3.506563e-13
## GO:0030216 0.12429379       8.689014 12.38133 1.005077e-14 5.992771e-12
## GO:0030326 0.14516129      10.147799 12.31178 1.883796e-13 7.427198e-11
## GO:0035113 0.14516129      10.147799 12.31178 1.883796e-13 7.427198e-11
##                  qvalue
## GO:0008544 2.620806e-18
## GO:0031424 1.306075e-13
## GO:0043588 3.319686e-13
## GO:0030216 5.673395e-12
## GO:0030326 7.031376e-11
## GO:0035113 7.031376e-11
##                                                                                                                                                                                                                                                   geneID
## GO:0008544 PAX6/COL17A1/TP63/PKP1/PTHLH/TGM5/LHX2/KRT32/COL7A1/HOXC13/SOX21/KRT17/KRT34/ABCA12/CERS3/GRHL3/IVL/SPRR3/SPRR2D/PITX2/SPRR1B/SPRR1A/KRT78/KRT74/KRT15/CALML5/FOXE1/KRT6B/KRT5/KRT16/KRT14/GJB5/LCE1C/TMPRSS11F/SPRR2E/LIPK/KRT6A/FOXI3/LCE1F
## GO:0031424                                                                                                                                  KRT17/ABCA12/CERS3/IVL/SPRR3/SPRR2D/SPRR1B/SPRR1A/KRT78/KRT74/KRT6B/KRT5/KRT16/LCE1C/SPRR2E/LIPK/KRT6A/LCE1F
## GO:0043588                                                          PAX6/TP63/PKP1/LHX2/HOXC13/SOX21/KRT17/ABCA12/CERS3/GRHL3/IVL/SPRR3/SPRR2D/SPRR1B/SPRR1A/KRT78/KRT74/FOXE1/KRT6B/KRT5/KRT16/KRT14/GJB3/LCE1C/TMPRSS11F/SPRR2E/LIPK/KRT6A/FOXI3/LCE1F
## GO:0030216                                                                                                             PAX6/TP63/PKP1/KRT17/ABCA12/CERS3/IVL/SPRR3/SPRR2D/SPRR1B/SPRR1A/KRT78/KRT74/KRT6B/KRT5/KRT16/KRT14/LCE1C/SPRR2E/LIPK/KRT6A/LCE1F
## GO:0030326                                                                                                                                   HOXA11/DLX6/PITX1/TP63/SALL1/HOXA13/HOXC11/HOXD10/HOXD13/TFAP2A/FBN2/EN1/SP8/SHOX2/HOXD12/HOXC10/SP9/HOXA10
## GO:0035113                                                                                                                                   HOXA11/DLX6/PITX1/TP63/SALL1/HOXA13/HOXC11/HOXD10/HOXD13/TFAP2A/FBN2/EN1/SP8/SHOX2/HOXD12/HOXC10/SP9/HOXA10
##            Count
## GO:0008544    39
## GO:0031424    18
## GO:0043588    30
## GO:0030216    22
## GO:0030326    18
## GO:0035113    18

#simplify output from enrichGO by removing redundancy of enriched GO terms
simple_ego<-simplify(ego, cutoff = 0.7, by = "p.adjust", select_fun = min,
                     measure = "Wang", semData = NULL )
write.table(head(simple_ego,10),"BP_Enrichment.txt",row.names=T,col.names=NA,quote=F,sep="\t")

#barplot(ego, showCategory=10)
#barplot(simple_ego, showCategory=10)

# For homework, modifying script to use dotplot instead of barplot
dotplot(ego, showCategory=10)

dotplot(simple_ego, showCategory=10)

library(stringr)
pdf("myplot.pdf", width=10)
barplot(simple_ego, showCategory=10) + scale_y_discrete(labels=function(x) str_wrap(x, width=100))

## Warning in fortify(object, showCategory = showCategory, by = x, ...): Arguments in `...` must be used.
## ✖ Problematic argument:
## • by = x
## ℹ Did you misspell an argument name?

## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.

dev.off()

## png 
##   2

Summary Write-Up

See above for brief explanations of the individual analysis/interpretation involved in each step of the script.

Responses to specific prompts follow.

1. Describe the process to prepare the counts/phenotypes for differential expression.

Our script first reads in the data for count totals, sample phenotype information, and gene annotation information. We then perform some filtering to restrict our sample space, which both focuses our analysis and reduces the amount of data for computation. In our case, we want to only look at samples with a type of “Primary Tumor” or “Solid Tissue Normal.” I then filtered these results further to focus only on samples having a diagnosis of “T2.”

We then want to match our count data and sample phenotype data, so that we have all the phenotypic information for a sample associated with its mapped read count results. From here we are ready to perform analysis of differential expression, since we can now compare the differences in counts for each gene (and by implication, the differences in expression) across different conditions of our samples.

2. Explain how deseq2 normalizes the counts and why this is important.

To normalize differences in read counts between samples, deseq2 calculates a scaling factor for every sample, which it applies to the read counts for each gene that sample. It does this by first creating a “pseudo” sample based on the geometric mean of all samples, and then comparing every sample to this pseudo sample to calculate a ratio. The scaling factor for each sample is then calculated as the median ratio; and finally the count for each gene in a sample is normalized by dividing it by this scaling factor.

This normalization is important because among all our samples, there can be differences in the count totals for a gene which are not due to actual differences in expression level, but which arise for other reasons, such as a technical issue occurring during sequencing. Therefore, before doing differential expression analysis, we want to adjust for these other sources of difference among sample counts by normalizing out these differences, so that we are more confident that the read count differences we see are actually due to differences in expression level.

3. Display the plots–PCA, volcano, heatmap, boxplots. What information is displayed by each plot? What commentary can you make about how this plot provides meaning to your analysis?

## using ntop=500 top features by variance

The PCA plot helps us to visualize the intensity of difference between our two groups of tumor and normal samples. The groups are color coded, allowing us to see how strongly they segregate from each other and cluster more closely together within themselves. This gives us preliminary evidence that there is indeed a significant difference between the groups.

The volcano plots provide more detailed information about the differential expression between our groups, illustrating the actual fold changes in expression across genes between our samples. Because of how we factored our samples, a positive fold change represents a gene which is upregulated in our tumor samples, while a negative fold change represents a gene which is downregulated in our tumor samples. The y-axis of the plot represents p-value (as a negative log), so we can see by a gene’s position on the plot both the value of its fold change and the p-value of this. The further from 0 that a gene gets on either axis, the more differentially expressed that gene is and the more statistically significant this difference is.

The heatmap shows the differential expression of all genes across all samples, with rows corresponding to genes and columns corresponding to sample types. Every cell is color-coded to represent the expression of that gene in that sample, with a redder color representing a positive fold change and a bluer color representing a negative fold change. The heatmap is useful for getting a high-level view of differential expression patterns across all genes and all samples of the analysis, as we can see the way that samples of a particular group show an abundance of one color over another for particular genes.

The boxplots examine specific genes identified as having high fold change, with one boxplot showing a gene with positive fold change and the other boxplot showing a gene with negative fold change. We can see from these plots that directions of expression match how we have defined our groups, with the tumor group being compared against the normal group, such that positive fold change means being more highly expressed in the tumor group compared to normal, while negative fold change means being more lowly expressed in the tumor group compared to normal. From the boxplots we also get information about the distribution (such as median and quartile values) of expression levels within each sample group.

4. What filtering parameters for fold change and FDR did you choose and how many genes do you find with that filter set? What is FDR and why is this preferred to use instead of p value?

For fold change, I filtered results to contain only those genes with an absolute value of fold change greater than 5; and for the FDR-adjusted p-value, I filtered results to keep only genes with this value less than 1e-08. These filtering parameters left me with 648 genes.

FDR (which stands for False Discovery Rate) is an adjusted p-value, which is preferred to use because it helps us address the issue of multiple testing. A p-value is the probability of getting our results due to random variation, so we want a low p-value, as this implies a low probability of seeing the result by chance alone. This is good enough when performing only one test, but when performing many tests, such as performing tens of thousands of tests on our large data set of genes, there are many more opportunities for a false positive (or a “false discovery”) to occur, despite being unlikely for the case of a single test. Using an adjusted p-value helps to correct for this phenomenon.

5. Run over-enrichment analysis with clusterProfiler. Feel free to change the ont parameter to BP, CC, or ALL (it is set to MF right now). Display one of the plot options–what does this plot mean? How might that guide your future research? Cite a publication that supports your findings.

The dotplot of our enrichment analysis shows us several things. For one thing, the size of the dot represents the count of our genes enriched for that ontology term, and the dot’s position along the x-axis tells us the ratio of the number of genes enriched for that term to the number of genes in the overall term. The color of the dot gives an adjusted p-value for that finding. All of this information combined gives us an overview of the enrichment analysis, letting us see which terms across an ontological domain are most represented in our samples.

In the plot produced by my analysis, some of the biological process terms which are most enriched are “epidermis development”, “nuclear division”, “skin development”, and “mitotic nuclear division.” This is interesting, as it aligns with things we already know from the data, such as the disease type being “Squamous Cell Neoplasm” – squamous cells are skin cells (in our samples, collected from the lining of airways), and thus it makes sense to find the enrichment in terms relating to skin and epidermis development. Because we are looking at cancer, it also makes sense to find the enrichment in terms relating to cellular division, as cancer is largely a process of out-of-control cellular division.

For future research directions, I might want to investigate squamous cell carcinoma more broadly, since my enrichment analysis shows that the differentially expressed genes are implicated in biological processes related to skin/epidermis. As my analysis also identified a specific gene–UGT1A7–as being highly expressed in tumor tissue, I may want to investigate the role of this gene in particular in other types of squamous cell carcinoma. As our samples came from lung/bronchus tissue, it may be informative to investigate similar nearby tissues.

A publication that would support this research direction is “Regulation and function of family 1 and family 2 UDP-glucuronosyltransferase genes (UGT1A, UGT2B) in human oesophagus” by Strassburg et al [1]. This paper looks at differential expression in squamous cell carcinoma of the esophagus, a biological region adjacent to the lung/bronchus region of our own analysis, and has findings on our own gene of interest, UGT1A7. Therefore comparing analysis results between this research and our own findings could be a promising direction.

Citations: 1. Strassburg CP, Strassburg A, Nguyen N, Li Q, Manns MP, Tukey RH. Regulation and function of family 1 and family 2 UDP-glucuronosyltransferase genes (UGT1A, UGT2B) in human oesophagus. Biochem J. 1999;338 ( Pt 2)(Pt 2):489-498.

BMI-5730, Lab 4: processing counts, DE analysis, functional enrichment

2025-10-10