My classwork from bimm143
Samuel Fisher (A18131929)
Today we will perform an RNASeq analysis on the effects of dexamethasone (hereafter “dex”), a common steroid, on airway smooth muscle (ASM) cell lines.
We need two things for this analysis:
1) countData: a table with genes as rows and samples/experiments as columns
2) colData: metadata about the columns (i.e. samples) in the main countData object
#counts <- read.csv("airway_scaledcounts.csv", row.names=1)
#metadata <- read.csv("airway_metadata.csv")
Let’s have a peak at the objects:
#metadata
#counts
#head(counts)
##Check on metadata counts correspondance
We need to check that the metadata matches the samples in our count data.
#ncol(counts) == nrow(metadata)
#colnames(counts) == metadata$id
Q1. How many genes are in this dataset?
#nrow(counts)
38694 genes in this dataset
Q2. How many “control” samples are in this dataset?
#sum(metadata$dex == "control")
4 control samples in this dataset
We have 4 replicates per condition (“control” and “treated”) (drug and no drug respectively). We want to compare the control vs the treated to see which gene’s expression levels change when we have the drug present.
We are going to row by row (gene by gene) and see if the average value in control columns is different than the treated columns
Step 1: Find which columns in counts correspond to the “control”
Step 2: Filter / Select / Extract these columns
Step 3: Calculate an average value for each gene.
# The indices (i.e. positions) that are "control"
#control.inds <- metadata$dex == "control"
# Extract/select these "control" columns from counts
#control.counts <- counts[, control.inds]
# Calculate the mean for each gene (i.e. row)
#control.mean <- rowMeans(control.counts)
Q. Do the same for “treated” samples - find the mean count value per gene.
#treated.counts <- rowMeans(counts[, metadata$dex == "treated"])
Let’s put these two mean values into a new data.frame meancounts for
easy book-keeping and plotting.
#treated.mean <- rowMeans(counts[, metadata$dex == "treated"])
#meancounts <- data.frame(control.mean,
# treated.mean)
#head(meancounts)
Q. Make a ggplot of average counts of control vs treated
#library(ggplot2)
#ggplot(meancounts) +
# aes(control.mean,
# treated.mean) +
#geom_point(alpha=0.4)
Adding logarithmic axises
#library(ggplot2)
#ggplot(meancounts) +
# aes(control.mean,
# treated.mean) +
#geom_point(alpha=0.4) +
# scale_x_log10() + scale_y_log10()
If we consider “treated”/“control” counts we will get a number that tells us the change.
# No change
#log2(20/20)
# A doubling in the treated vs control
#log2(40/20)
# A halving in the treated vs control
#log2(10/20)
Q. Add a new column called
log2fcfor log2 fold change of treated/ control to ourmeancountsobject.
#meancounts$log2fc <-
# log2(meancounts$treated.mean/
# meancounts$control.mean)
#head(meancounts)
Typically we would not consider zero count genes - as we have no data about them and they should be excluded from further consideration. These lead to “funky” log2 fold change values (e.g. divide by zero errors etc.)
We are missing any measure of significance from the work we have so far. Let’s do this properly with the DESeq2 package.
#library(DESeq2)
The DESeq2 package, like many bioconductor packages, wants it’s input in a very specific way - a data structure setup with all the info it needs for the calculations.
#dds <- DESeqDataSetFromMatrix(countData = counts,
#colData = metadata,
#design = ~dex)
The main function in this package is called DESeq() it will run the
full analysis for us on our dds input object:
#dds <- DESeq(dds)
Extract our results:
#res <- results(dds)
#head(res)
A useful summary figure of our results is often called a volcano plot. It’s basically a plot of log2 fold change values vs adjusted p-values.
Q. Use ggplot to make a first version “volcano plot” of
log2FoldChangevspadj
#ggplot(res, aes(log2FoldChange, padj)) + geom_point()
This is niot very useful because the y-axis (P-value) is not really helpful - we want to focus on low P-values.
#ggplot(res) +
#aes(x = log2FoldChange,
# y = log10(padj)) +
#geom_point()
#ggplot(res) +
#aes(x = log2FoldChange,
#y = -log10(padj)) +
#geom_point()
#ggplot(res) +
#aes(x = log2FoldChange,
# y = -log10(padj)) +
#geom_point() +
#geom_vline(xintercept = c(-2, 2), col="red") +
#geom_hline(yintercept = -log(0.05), col = "red")
Q. Add color to the points (genes) we care about, nice axis labels, a useful title and a nice theme.
#ggplot(res) +
# aes(x = log2FoldChange,
# y = -log10(padj)) +
# geom_point() +
# geom_vline(xintercept = c(-2, 2), col="red") +
# geom_hline(yintercept = -log(0.05), col = "red")
#mycols <- rep("gray", nrow(res))
#mycols[res$log2FoldChange > 2] <- "blue"
#mycols[res$log2FoldChange < -2 ] <- "darkgreen"
#mycols[res$padj >= 0.05] <- "gray"
#ggplot(res, aes(x = log2FoldChange,
# y = -log10(padj))) +
#geom_point(color = mycols, alpha = 0.5) +
#geom_vline(xintercept = c(-2, 2), col = "red") +
#geom_hline(yintercept = -log10(0.05), col = "red") +
#labs(title = "Volcano Plot",
# x = "Log2 Fold Change",
# y = "-Log10 Adjusted P-value") +
#theme_minimal()
#write.csv(res, file="results.csv")