Chapter 8 Novel isoforms
With long read sequencing its possible to explore both known and novel isofroms. Novel isoform can be interesting feature to explore and here we will provide some simple analysis that one can use to extract novel isoforms and explore them in some more detail. Although the bar plots above show that the vast majority of isofroms are full-splice-matches i.e they match the reference GTF
there are a few we could extract.
8.1 Find all the genes that express at least 1 novel isoform
First lets get a list of genes with at least one novel isoform
Code
# Find the genes that express at least one novel isoform and save them to a list
# Convert row names to a data frame
<- as.data.frame(row.names(seu_obj@assays$iso$counts))
isoform_ids colnames(isoform_ids) <- "IDs"
#Add in total expression
$Total_Expression <- rowSums(seu_obj@assays$iso$counts)
isoform_ids
# Filter rows where 'IDs' contains the string "Bambu"
<- as.data.frame(isoform_ids[grepl("Bambu", isoform_ids$IDs), ]) # can comment this out if we want to do this for all isofroms
isoform_ids
# Separate the 'IDs' column into two columns
<- isoform_ids %>% separate(IDs, into = c("transcript_id", "gene_id"), sep = "-", extra = "merge")
filtered_df
<- filtered_df %>%
isoform_counts group_by(gene_id) %>%
summarise(
Isoform_Count = n(),
Total_Expression = sum(Total_Expression)
%>%
) arrange(desc(Total_Expression))
# Print or return the filtered data frame
print(isoform_counts)
## # A tibble: 31 × 3
## gene_id Isoform_Count Total_Expression
## <chr> <int> <dbl>
## 1 OAZ2 1 396.
## 2 ENSG00000291214 1 159.
## 3 BambuGene6113 1 141.
## 4 ENSG00000305512 1 140.
## 5 BambuGene19505 1 139.
## 6 ENSG00000305069 1 132
## 7 BambuGene18812 1 95.8
## 8 ZMAT5 1 76.0
## 9 ENSG00000295459 1 68.7
## 10 ENSG00000231748 1 62
## # ℹ 21 more rows
In this table we can see that we have a total of 31 novel isofroms. This number is smaller than we might expect but this is because Bambu (Chen et al., 2023) is quite conservative for single cell data and we haven’t sequenced these data very deeply. Users can change the sensitivity of Bambu when running FLAMES by setting the NDR
parameter or using a different discovery method like stringtie2 (Kovaka et al., 2019).
In the above table we have ordered the the genes that contain a novel isoform by total expression. OAZ2 is our top hit. We can look at the isoform structure using IsoViz as described in the section 6.5.
Code
::include_graphics("images/IsoVis_ENSG00000180304.png") knitr

Figure 8.1: IsoViz visualization of Novel OAZ2 isoform.
8.2 Visualizing OAZ2 novel isoform
Visualizing OAZ2 reveals that the primary distinction between the novel isoform and the canonical isoform, OAZ2-201 seems to be a shorter 5’ UTR. This minor alteration at face value should not affect the canonical open reading frame (ORF). However, according to SQANTI, BambuTx25 is classified as a novel_in_catalog isoform, meaning it is an isoform not present in the reference GTF and SQANTI suggests that BambuTx25 is non-coding, which is quite surprising! Interestingly, SQANTI also indicates that the primary difference between BambuTx25 and the canonical isoforms lies in intron retention (Specified in the subcategory column).
Code
# Load necessary libraries
grep('BambuTx25', SQANTI$isoform), ] SQANTI[
## isoform chrom strand length exons structural_category associated_gene associated_transcript ref_length ref_exons diff_to_TSS diff_to_TTS
## 21835 BambuTx25 chr15 - 1870 5 novel_in_catalog ENSG00000180304.14 novel 1934 6 NA NA
## diff_to_gene_TSS diff_to_gene_TTS subcategory RTS_stage all_canonical min_sample_cov min_cov min_cov_pos sd_cov FL n_indels n_indels_junc
## 21835 -2 -1 intron_retention FALSE canonical NA NA NA NA NA NA NA
## bite iso_exp gene_exp ratio_exp FSM_class coding ORF_length CDS_length CDS_start CDS_end CDS_genomic_start CDS_genomic_end predicted_NMD
## 21835 FALSE NA NA NA C non_coding NA NA NA NA NA NA NA
## perc_A_downstream_TTS seq_A_downstream_TTS dist_to_CAGE_peak within_CAGE_peak dist_to_polyA_site within_polyA_site polyA_motif polyA_dist
## 21835 10 GTGGTTGGTCTATTCTTTAT NA NA NA NA NA NA
## polyA_motif_found ORF_seq ratio_TSS
## 21835 NA <NA> NA
To better understand these SQANTI results lets zoom in on exon 2 of OAZ2. Here, we observe a small break within the exon. The novel BambuTx25 does not contain this break so perhaps this change affects the resulting amino acid sequence.
Code
::include_graphics("images/IsoVis_ENSG00000180304_zoom.png") knitr

Figure 8.2: IsoViz visualization of exon 2 of OAZ2.
8.3 Functional impacts of novel isoforms
There are many ways to investigate the translated sequence and compare the protein generated from conanocial isoform and the novel one. Simple approaches may involve extracting the fasta sequence and looking for the open reading frames (ORFs) in a online tool like ‘expasy translate’ which can be found here https://web.expasy.org/translate/.
Bellow we have written a function that will take a list of isoforms from a gene and plot the ORFs as defined by the package ORFik5 (Tjeldnes et al., 2021). We will visualize these ORFS in Gviz (Hahne & Ivanek, 2016) an R package for visualizing genomic features. This analysis will help us determine if the ORF in the transcript BambuTx25 has been impacted by the change in nucleotide sequence.
Code
# Global cache for TxDb
<- list()
txdb_cache
<- function(reference_gtf) {
get_txdb # Check if the TxDb for this file is already in cache
if (!is.null(txdb_cache[[reference_gtf]])) {
message("Using cached TxDb...")
return(txdb_cache[[reference_gtf]])
}
# Create TxDb and cache it
message("Creating TxDb object from reference GTF. This may take time...")
<- txdbmaker::makeTxDbFromGFF(reference_gtf)
txdb <<- txdb
txdb_cache[[reference_gtf]] return(txdb)
}
<- function(reference_gtf, target_gtf, isoforms_of_interest,
plot_isoform_ORFs
chromosome, plot_start, plot_end) {
# Load genome reference
<- BSgenome.Hsapiens.UCSC.hg38
genomedb
# Retrieve cached TxDb or create if not available
<- get_txdb(reference_gtf)
txdb
# Import and filter GTF for specific isoforms
<- import(target_gtf)
gtf_data <- gtf_data[gtf_data$transcript_id %in% isoforms_of_interest]
gtf_filtered <- txdbmaker::makeTxDbFromGRanges(gtf_filtered)
txdb_filtered
# Extract exons and convert to GRangesList
<- GenomicFeatures::exonsBy(txdb_filtered, by = c("tx", "gene"), use.names = TRUE)
txs <- GRangesList(txs)
txs_grl
# Extract transcript sequences and identify ORFs
<- extractTranscriptSeqs(genomedb, txs_grl)
tx_seqs <- findMapORFs(txs_grl, tx_seqs, groupByTx = FALSE,
ORFs longestORF = FALSE, minimumLength = 30,
startCodon = "ATG", stopCodon = stopDefinition(1))
# Unlist and prepare ORF data for plotting
<- unlist(ORFs)
ORFs_unlisted $type <- "exon"
ORFs_unlisted$transcript <- ORFs_unlisted$names
ORFs_unlisted$transcript <- paste0(ORFs_unlisted$transcript, "_ORF")
ORFs_unlisted
#could add in a filter for ORFs here
# Define start and stop codons
<- startCodons(ORFs, is.sorted = TRUE)
starts <- stopCodons(ORFs, is.sorted = TRUE)
stops
# Visualization Tracks
<- GenomeAxisTrack()
gtrack <- IdeogramTrack(genome = "hg38", chromosome = chromosome)
itrack
<- GeneRegionTrack(
ref_track range = txdb,
name = "gencodev47",
genome = "hg38",
chromosome = chromosome,
col = "darkblue",
fill = "lightblue",
arrowHead = FALSE
)
<- GeneRegionTrack(
input_track range = txdb_filtered,
name = "Isoforms (Exons)",
genome = "hg38",
chromosome = chromosome,
col = "darkgreen",
fill = "lightgreen",
arrowHead = FALSE
)
<- GeneRegionTrack(
orf_track
ORFs_unlisted,genome = "hg38",
chromosome = unique(seqnames(ORFs_unlisted)),
transcriptAnnotation = "transcript",
name = "ORFs",
col = "grey",
fill = "grey",
stacking = "squish",
)
<- AnnotationTrack(
starts_track range = starts,
name = "Starts",
genome = "hg38",
chromosome = seqnames(txs_grl[[1]])[1],
col = "green",
fill = "green",
shape = "box", # Ensure no arrows by using box shape
stacking = "dense"
)
<- AnnotationTrack(
stops_track range = stops,
name = "Stops",
genome = "hg38",
chromosome = seqnames(txs_grl[[1]])[1],
col = "red",
fill = "red",
shape = "box", # Ensure no arrows by using box shape
stacking = "dense"
)
# Plot tracks
plotTracks(list(itrack, gtrack, ref_track, input_track,
orf_track, stops_track, starts_track), from = plot_start, to = plot_end,
transcriptAnnotation = "transcript")
}
# Example usage
plot_isoform_ORFs(
reference_gtf = "data/gencode.v47.annotation.gtf",
target_gtf = "data/FLAMES_out/isoform_annotated_unfiltered.gtf",
isoforms_of_interest = c("ENST00000326005.10", "BambuTx25"),
chromosome = "chr15",
plot_start = 64687015,
plot_end = 64703412
)
Code
plot_isoform_ORFs(
reference_gtf = "data/gencode.v47.annotation.gtf",
target_gtf = "data/FLAMES_out/isoform_annotated_unfiltered.gtf",
isoforms_of_interest = c("ENST00000326005.10", "BambuTx25"),
chromosome = "chr15",
plot_start = 64691416,
plot_end = 64691685
)
Visualizing ORFs using the above code chunck shows that the BambuTx25_1_ORF is incomplete. The additional sequence (retained intron as specified by SQANTI) results in a premature stop codon shown in red which likely triggers nonsense-mediated decay (NMD). The canonical ENST00000326005.10_3_ORF is complete.
We can examine additional genes from our list. For instance, ZMAT5 has a novel isoform that utilizes a different, known splice junction in the 5’ UTR. Using the function above to examine ORFs, we can see that the ORF is unchanged compared to the canonical isoform. This may suggest some import regulatory role in the 5’ UTR yet the protein remains unchanged.
Code
::include_graphics("images/IsoVis_ENSG00000100319.png") knitr

Figure 8.3: IsoViz visualization of Novel ZMAT5 isoform.
Code
plot_isoform_ORFs( reference_gtf = "../../../../resources/gencode.v47.annotation.gtf", target_gtf = "/data/scratch/users/yairp/FLAMES_Day55/outs/isoform_annotated_unfiltered.gtf", isoforms_of_interest = c("BambuTx32", "ENST00000344318.4"), chromosome = "chr22", plot_start = 29728956, plot_end = 29769011 )
## Creating TxDb object from reference GTF. This may take time...
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
ORFs found by the package ORFik may need to be filtered based on length or position when analysising differnt genes or isoforms.↩︎