Chapter 9 Multisample analysis
The FLAMES pipeline supports the analysis of multiple samples simultaneously, allowing users to efficiently process complex experimental datasets. This is achieved using the sc_long_multisample_pipeline
function. The output generated by this function is similar to the single-sample analysis outlined in previous chapters but offers additional capabilities when the experimental design includes sample replicates.
To take full advantage of these features we need to perform sample integration to generate a single object that we can use to integrate isoform expression across our conditions. First standard preprocessing as described in Chapters 3 and 4 is required. After integration we can perform the following key analysis:
Sample Integration: Combine multiple sample together and add isoform counts.
Trajectory Analysis: Explore genes that change along pseudotime trajectories.
Differential Expression Analysis: Identify differentially expressed isoforms across time points or sample conditions.
Differential Transcript Usage (DTU) Analysis: Investigate changes in isoform proportions between conditions.
Most preprocessing and QC steps remain the same. However, we will also demonstrate how to proceed directly from FLAMES output to initial QC without performing empty droplet removal or ambient RNA normalization. Although these steps are recommended, they may not always be necessary depending on the data.
9.0.1 Dataset Information
For this analysis, we are using a dataset comprising eight samples spanning four time points from an excitatory neuronal differentiation protocol. The protocol is the same as the one used for the D55 sample in previous chapters but utilizes a different cell line. Two samples were collected from each of the following time points:
Stem cells (STC)
Day 25
Day 55
Day 80
This comprehensive dataset will allow us to explore temporal changes in gene and isoform expression throughout neuronal differentiation.
9.1 Standard pre-processing and quality control
As before we will modify the count matrix for each sample so we use gene_symbol instead of ENSG id. The Data generated from this multi-sample experiment is quite large. This is especially the case with Oarfish isoform quantification as there are many more features to consider. This means much of the workflow bellow is time consuming and computationally intensive. Please keep this in mind when processing your samples. In the tutorial we provide code for the required steps but do not evaluate many of the sections bellow due to these computations constraints. The raw counts are available in the multisample data folder if users wish to run through the analysis.
Code
# convert Gene_id to gene symbol for all counts
#run aloop to run this fucntion on all count files.
# Directory with input files
<- "./data/muti_sample/"
input_dir <- "./data/muti_sample/"
output_dir
# List all CSV files in the input directory
<- list.files(input_dir, pattern = "\\count.csv$", full.names = TRUE)
input_files
# Initialize an empty list to store data frames
<- list()
result_list
# Process each file in the directory
<- length(input_files) # Total number of files
total_files
for (i in seq_along(input_files)) {
<- input_files[i]
file_path
# Extract file name without extension
<- tools::file_path_sans_ext(basename(file_path))
file_name
# Construct the output file name
<- file.path(output_dir, paste0(file_name, "_gene_symbol.csv"))
output_file
# Show progress to the user
message(sprintf("Processing file %d of %d: %s", i, total_files, file_name))
# Call the function with return_df = TRUE to store the result
<- convert_ENSGID_to_geneSymbol(
result_list[[file_name]] gene_count_matrix_path = file_path,
output_file = output_file,
return_df = TRUE
) }
9.1.1 Define QC function
We have developed a multisample QC function @ref(Multisample QC function) for processing count matrices and constructing filtered Seurat objects, following best practices outlined in the Seurat tutorial. This function includes steps to identify and remove doublets, as well as generate key diagnostic plots for data inspection. We encourage users to customize this function to fit their specific needs, such as applying additional filtering criteria.
For the purposes of presenting concise and readable code we will apply the default filtering parameters to all 8 objects. If users wish to be more specific about filtering each sample independent this is possible and can be done by running the perform_qc_filtering
function on each sample with desired parameters.
Code
# Define sample names
<- c("C1_STC", "C4_STC", "C2_Day25", "C5_Day25", "C2_Day55", "C3_Day55", "C3_Day80", "C5_Day80")
sample_names
# Initialize a list to store UMAP objects
<- list()
umap_objects
# Start the PDF for output
pdf(file = "./output_files/mutli_sample/QC/multisample_QC.pdf", width = 10, height = 6)
# Loop through each sample and apply the QC filtering function
for (i in seq_along(sample_names)) {
<- sample_names[i]
sample_name
# Print progress message
message(paste0("Processing sample ", i, " of ", length(sample_names), ": ", sample_name))
# Perform QC filtering
<- perform_qc_filtering(
result paste0(sample_name, "_matched_reads_gene_count")],
result_list[fig_name = sample_name,
project = sample_name
)
# Store the UMAP object from the result
<- result[[1]]
umap_objects[[sample_name]]
}
# Close the PDF device
dev.off()
#save the object
#saveRDS(umap_objects, file = "./output_files/mutli_sample/umap_objects.rds")
9.2 Multi-Sample integration
First we will merge the seurat objects together and then perform integration. There are many options for sample integration and Seurat provides many wrappers for these functions. Here we will Harmony (Korsunsky et al., 2019) for fast and effective integration. As these processes are time consuming we will load our saved objects.
Code
# intergrate samples
#from the fucntion above pull the filtered seurat object
<- merge(
merged_seurat "C1_STC"]],
umap_objects[[y = list(
"C4_STC"]],
umap_objects[["C2_Day25"]],
umap_objects[["C5_Day25"]],
umap_objects[["C2_Day55"]],
umap_objects[["C3_Day55"]],
umap_objects[["C3_Day80"]],
umap_objects[["C5_Day80"]]
umap_objects[[
),add.cell.ids = c(
"C1_STC", "C4_STC", "C2_Day25", "C5_Day25",
"C2_Day55", "C3_Day55", "C3_Day80", "C5_Day80"
), project = "Multi-sample_tutorial"
)
# create a sample column
$sample <- rownames(merged_seurat@meta.data)
merged_seurat
## split sample column to makw a batch col
@meta.data <- separate(merged_seurat@meta.data, col = 'sample', into = c('batch', 'Day', 'Barcode'),
merged_seuratsep = '_')
#check some QC metrics
VlnPlot(merged_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, group.by = "Day") +
plot_annotation(title = "Vln plots grouped by Day")
#can plot based on any metadat col
#VlnPlot(merged_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, group.by = "orig.ident")
# Number of cells per group
table(merged_seurat$orig.ident)
table(merged_seurat$Day)
# Apply normal preprocessing steps to merged file
<- NormalizeData(object = merged_seurat)
merged_seurat <- FindVariableFeatures(object = merged_seurat)
merged_seurat <- ScaleData(object = merged_seurat)
merged_seurat <- RunPCA(object = merged_seurat)
merged_seurat ElbowPlot(merged_seurat)
<- FindNeighbors(object = merged_seurat, dims = 1:16)
merged_seurat <- FindClusters(object = merged_seurat, resolution = 0.6)
merged_seurat <- RunUMAP(object = merged_seurat, dims = 1:30)
merged_seurat
##save merged object
#saveRDS(merged_seurat, file = "./output_files/mutli_sample/merged_seurat.rds")
Code
set.seed(00003)
<- readRDS("./output_files/mutli_sample/merged_seurat.rds")
merged_seurat
#Lets plot the merged file.
<- FeaturePlot(merged_seurat, reduction = 'umap', features = 'nCount_RNA')
p1 <- FeaturePlot(merged_seurat, reduction = "umap", features = 'nFeature_RNA')
p2 <- DimPlot(merged_seurat, reduction = 'umap', group.by = 'Day')
p3 <- DimPlot(merged_seurat, reduction = 'umap', group.by = 'orig.ident')
p4
grid.arrange(p1, p2, p3, p4, ncol = 2)
Code
<- JoinLayers(object = merged_seurat)
merged_seurat "RNA"]] <- split(merged_seurat[["RNA"]], f = merged_seurat$orig.ident)
merged_seurat[[
#perform intergation with Harmony
<- IntegrateLayers(object = merged_seurat,
obj method = HarmonyIntegration,
orig.reduction = "pca",
new.reduction = 'integrated.harm',
verbose = FALSE,
)
## Warning in harmony::HarmonyMatrix(data_mat = Embeddings(object = orig), : HarmonyMatrix is deprecated and will be removed in the future from the API
## in the future
## Warning: Warning: The parameters do_pca and npcs are deprecated. They will be ignored for this function call and please remove parameters do_pca and npcs and pass to harmony cell_embeddings directly.
## This warning is displayed once per session.
## Warning: Warning: The parameter tau is deprecated. It will be ignored for this function call and please remove parameter tau in future function calls. Advanced users can set value of parameter tau by using parameter .options and function harmony_options().
## This warning is displayed once per session.
## Warning: Warning: The parameter block.size is deprecated. It will be ignored for this function call and please remove parameter block.size in future function calls. Advanced users can set value of parameter block.size by using parameter .options and function harmony_options().
## This warning is displayed once per session.
## Warning: Warning: The parameter max.iter.harmony is replaced with parameter max_iter. It will be ignored for this function call and please use parameter max_iter in future function calls.
## This warning is displayed once per session.
## Warning: Warning: The parameter max.iter.cluster is deprecated. It will be ignored for this function call and please remove parameter max.iter.cluster in future function calls. Advanced users can set value of parameter max.iter.cluster by using parameter .options and function harmony_options().
## This warning is displayed once per session.
## Warning: Warning: The parameter epsilon.cluster is deprecated. It will be ignored for this function call and please remove parameter epsilon.cluster in future function calls. Advanced users can set value of parameter epsilon.cluster by using parameter .options and function harmony_options().
## This warning is displayed once per session.
## Warning: Warning: The parameter epsilon.harmony is deprecated. It will be ignored for this function call and please remove parameter epsilon.harmony in future function calls. If users want to control if harmony would stop early or not, use parameter early_stop. Advanced users can set value of parameter epsilon.harmony by using parameter .options and function harmony_options().
## This warning is displayed once per session.
Code
<- FindNeighbors(obj, reduction="integrated.harm")
obj <- FindClusters(obj, resolution=0.4, cluster.name="harm_cluster") obj
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7216
## Number of edges: 233507
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9338
## Number of communities: 10
## Elapsed time: 0 seconds
Code
<- RunUMAP(obj, reduction="integrated.harm", dims=1:20, reduction.name = "umap.harm")
obj
DimPlot(obj, reduction = "umap.harm", group.by = c("Day", "orig.ident", "harm_cluster"), label = T)
Code
### save object
#saveRDS(obj, file = "./output_files/mutli_sample/integrated_harm_seurat.rds")
9.3 Add isoform counts
Now we can add isoform level information to the merged Seurat object. When doing this you may need to convert the oafish files into a count.csv file the row names in the ENSTID_gene_symbol ID form. The function to do so can be found here @ref(Convert Oarfish files to count matrix)
The workflow follows these steps.
- Create a Seurat object per sample
- Merge the objects together
- Filter the isoform merged object based on cell types that are high quality (based on gene level filtering)
- Add the isoform filtered and merged object to the gene level object as a new assay
- Integration on the isoform assay is possible but not shown here.
Code
############ read in count matix and add isoform assay to Seurat object #########
# Define sample names
<- c("C1_STC", "C4_STC", "C2_Day25", "C5_Day25", "C2_Day55", "C3_Day55", "C3_Day80", "C5_Day80")
sample_names
#Make Seurat objects not filtered
# Function to read a CSV file and create a Seurat object
<- function(sample_name, input_dir, project_name) {
create_seurat_object # Construct file path
<- file.path(input_dir, paste0("gene_symbol_oarfish_", sample_name, "_counts.csv"))
file_path
# Read the CSV file
<- fread(file_path)
counts <- as.data.frame(counts)
counts rownames(counts) <- counts[[1]] # Set row names from the first column
1]] <- NULL
counts[[
# Create the Seurat object
<- CreateSeuratObject(counts = counts, project = project_name, min.cells = 5, min.features = 500)
seurat_obj
return(seurat_obj)
}
# Directory where the CSV files are stored
<- "./output_files/mutli_sample/oarfish_counts/"
input_directory
# Create an empty list to store Seurat objects
<- list()
iso_seurat_objects
<- length(sample_names)
total_samples for (i in seq_along(sample_names)) {
<- sample_names[i]
sample
message(sprintf("Processing sample %d of %d: %s", i, total_samples, sample))
<- create_seurat_object(
iso_seurat_objects[[sample]] sample_name = sample,
input_dir = input_directory,
project_name = sample
)
}
####### Merge the samples into one object ######
<- merge(
iso.merged_seurat "C1_STC"]],
iso_seurat_objects[[y = list(
"C4_STC"]],
iso_seurat_objects[["C2_Day25"]],
iso_seurat_objects[["C5_Day25"]],
iso_seurat_objects[["C2_Day55"]],
iso_seurat_objects[["C3_Day55"]],
iso_seurat_objects[["C3_Day80"]],
iso_seurat_objects[["C5_Day80"]]
iso_seurat_objects[[
),add.cell.ids = c(
"C1_STC", "C4_STC", "C2_Day25", "C5_Day25",
"C2_Day55", "C3_Day55", "C3_Day80", "C5_Day80"
), project = "iso-Multi-sample_tutorial"
)
# create a sample column
$sample <- rownames(iso.merged_seurat@meta.data)
iso.merged_seurat
# split sample column to makw a batch col
@meta.data <- separate(iso.merged_seurat@meta.data, col = 'sample', into = c('batch', 'Day', 'Barcode'),
iso.merged_seuratsep = '_')
## filter the data to iso and gene cells match
<- subset(iso.merged_seurat, cells =obj@graphs[["RNA_nn"]]@Dimnames[[1]])
merged_seurat_isoform_filtered
VlnPlot(merged_seurat_isoform_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# perform standard workflow steps
<- NormalizeData(object = merged_seurat_isoform_filtered) # if using SCT dont run this
merged_seurat_isoform_filtered <- FindVariableFeatures(object = merged_seurat_isoform_filtered) # if using SCT dont run this
merged_seurat_isoform_filtered <- ScaleData(object = merged_seurat_isoform_filtered) # if using SCT dont run this
merged_seurat_isoform_filtered <- RunPCA(object = merged_seurat_isoform_filtered)
merged_seurat_isoform_filtered ElbowPlot(merged_seurat_isoform_filtered)
<- FindNeighbors(object = merged_seurat_isoform_filtered, dims = 1:10)
merged_seurat_isoform_filtered <- FindClusters(object = merged_seurat_isoform_filtered, resolution = 0.6)
merged_seurat_isoform_filtered <- RunUMAP(object = merged_seurat_isoform_filtered, dims = 1:10)
merged_seurat_isoform_filtered
#Save file
#saveRDS(merged_seurat_isoform_filtered, file = "./output_files/mutli_sample/merged_seurat_isoform_filtered.rds")
Code
<- readRDS(file = "./output_files/mutli_sample/merged_seurat_isoform_filtered.rds")
merged_seurat_isoform_filtered
#plots
<- DimPlot(merged_seurat_isoform_filtered, reduction = 'umap', group.by = 'orig.ident')
p5 <- DimPlot(merged_seurat_isoform_filtered, reduction = 'umap', group.by = 'Day')
p6 <- FeaturePlot(merged_seurat_isoform_filtered, reduction = 'umap', features = 'nCount_RNA')
p7 <- FeaturePlot(merged_seurat_isoform_filtered, reduction = "umap", features = 'nFeature_RNA')
p8
grid.arrange(p5, p6, p7, p8, ncol = 2)
Code
#####
<- JoinLayers(merged_seurat_isoform_filtered)
merged_seurat_isoform_filtered <- merged_seurat_isoform_filtered[["RNA"]]$counts
counts_table
"iso"]] <- CreateAssay5Object(counts = counts_table)
obj[[
# Step 1: Normalize the new assay data
<- NormalizeData(obj, assay = "iso")
obj <- FindVariableFeatures(obj, assay = "iso")
obj <- ScaleData(obj, assay = "iso")
obj
# Step 4: Perform PCA
<- RunPCA(obj, assay = "iso", reduction.name = "pca_iso") obj
## Warning: Key 'PC_' taken, using 'pcaiso_' instead
Code
# Step 5: Run UMAP
<- RunUMAP(obj, reduction = "pca_iso", dims = 1:15, assay = "iso", reduction.name = "umap_iso")
obj
# Visualize the UMAP
DimPlot(obj, label = TRUE, reduction = "umap_iso") | DimPlot(obj, label = TRUE, reduction = "umap.harm")
Here will annotate the cell types using the sctype and some prior knowledge.
Code
<- c("Tanycytes", "Dopaminergic neurons", "Oligodendrocyte precursor cells", "Non myelinating Schwann cells", "Endothelial cells")
gs_removal_list # list of cell types from the db to remove
<- perform_sctype_analysis(obj, db_, tissue, gs_removal_list,
obj metadat_col_prefix ="sctype_db", figure_prefix = "multi",
output_file = "multi", cluster_res = "harm_cluster", reduction = "umap.harm")
## # A tibble: 10 × 3
## # Groups: cluster [10]
## cluster type scores
## <fct> <chr> <dbl>
## 1 7 Cancer stem cells 713.
## 2 1 Cancer stem cells 819.
## 3 4 Radial glial cells 1136.
## 4 2 Radial glial cells 896.
## 5 9 Immature neurons 486.
## 6 3 Mature neurons 979.
## 7 0 GABAergic neurons 1268.
## 8 5 GABAergic neurons 792.
## 9 8 Radial glial cells 1506.
## 10 6 Microglial cells 408.
Code
#change name of cancer stem cell to stem cell
@meta.data$sctype_db <- gsub("Cancer stem cells", "Stem cells", obj@meta.data$sctype_db)
obj
DimPlot(obj, reduction="umap.harm", group.by = "sctype_db", label = T)
Code
#saveRDS(obj, file = "./output_files/mutli_sample/multisample_seurat.intergrated_harm.isofrom.rds")