smallRNA-seq QC Report — Demo-Project

1. Project Overview

Metric	Value	Note
Total Samples	9	biological replicates
Total Raw Reads	~210M	across all samples (R1)
Read Length	151 bp	paired-end
Avg Mapping Rate	87.9%	STAR alignment (unweighted mean of the per-sample mapped % in Section 4)
RNA Types	6	miRNA/piRNA/snoRNA/snRNA/tRNA/circRNA
DE Comparisons	6	per RNA type (36 total)

Study Design: Human Group A differentiation time course across four timepoints (Group A/Day0, Group B, Group C, Group D) with 2–3 biological replicates per group. Small RNA-seq libraries were sequenced on an Illumina platform (151 bp paired-end). Reads were aligned with STAR 2.7.10b, and six ncRNA classes were quantified with featureCounts v2.0.6.

Analysis Pipeline

Step	Tool	Output
Read QC	FastQC 0.12.1 + MultiQC	01.fastqc/
Adapter trimming	cutadapt (pre-processed)	00.TrimmedFastq/
Alignment	STAR 2.7.10b (hg38)	02.bam/
Mapping QC	STAR logs + MultiQC	03.mappingQC/
Feature counting	featureCounts v2.0.6	04.smallRNAcounts/
Differential expression	DESeq2 (R)	05.DEanalysis/

2. Sample Information

#	Sample ID	Library ID	Group	Timepoint	Note
1	SampleA1	SampleA1_S1_L002	Group A	Group A	Replicate I
2	SampleA2	SampleA2_S2_L002	Group A	Group A	Replicate II
3	SampleB1	SampleB1_S6_L002	Group B	Group B	Backup replicate
4	SampleB2	SampleB2_S7_L002	Group B	Group B	Replicate I
5	SampleC1	SampleC1_S8_L002	Group C	Group C	Replicate I
6	SampleC2	SampleC2_S0_L001	Group C	Group C	Replicate II
7	SampleD1	SampleD1_S1_L004	Group D	Group D	Backup replicate
8	SampleD2	SampleD2_S0_L001	Group D	Group D	Replicate I
9	SampleD3	SampleD3_S0_L001	Group D	Group D	Replicate II

3. Read Quality (FastQC)

Metrics below are from R1 reads. The high duplication rates observed (80.8–88.0%) are expected for small RNA-seq libraries because small RNA species have limited sequence diversity. All samples pass basic quality thresholds.

Full interactive report: 01.fastqc/multiqc_report.html

Sample	Group	Total Reads (M)	% Duplicates	% GC	Read Length
SampleA1	Group A	22.38	84.4%	55%	151 bp
SampleA2	Group A	20.10	80.8%	55%	151 bp
SampleB1	Group B	25.49	88.0%	53%	151 bp
SampleB2	Group B	23.15	86.9%	55%	151 bp
SampleC1	Group C	31.98	84.4%	53%	151 bp
SampleC2	Group C	22.46	81.7%	58%	151 bp
SampleD1	Group D	22.37	86.0%	56%	151 bp
SampleD2	Group D	20.55	83.3%	58%	151 bp
SampleD3	Group D	20.92	83.9%	59%	151 bp

4. Alignment Statistics (STAR)

Alignment performed against the human genome (hg38) using STAR 2.7.10b. High multi-mapping rates are characteristic of small RNA-seq because many small RNA sequences share sequence identity across loci.

Full interactive report: 03.mappingQC/multiqc_report.html

Sample	Group	Total Reads (M)	Mapped %	Uniquely Mapped %	Multi-mapped %	Unmapped %	Avg Read Length	Mismatch Rate
SampleA1	Group A	15.83	96.0%	7.3%	88.7%	4.0%	33 bp	0.50%
SampleA2	Group A	11.25	91.3%	8.2%	83.1%	8.8%	30 bp	0.59%
SampleB1	Group B	17.95	82.6%	12.6%	70.0%	17.4%	38 bp	0.57%
SampleB2	Group B	15.95	88.3%	12.1%	76.2%	11.7%	33 bp	0.36%
SampleC1	Group C	10.81	84.1%	27.7%	56.3%	15.9%	34 bp	0.27%
SampleC2	Group C	9.04	88.5%	14.5%	74.0%	11.5%	28 bp	0.51%
SampleD1	Group D	5.06	84.2%	25.0%	59.2%	15.8%	39 bp	0.31%
SampleD2	Group D	6.26	88.1%	14.8%	73.3%	11.9%	29 bp	0.37%
SampleD3	Group D	4.45	87.9%	14.5%	73.4%	12.1%	31 bp	0.38%
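
As a quick consistency check, the per-sample mapped percentages in the table above can be averaged either unweighted or weighted by input reads. The sketch below is illustrative only: the values are copied from the table, and the variable and function names are hypothetical, not part of the delivered pipeline.

```python
# Per-sample STAR input reads (millions) and mapped %, copied from the table above.
star_stats = {
    "SampleA1": (15.83, 96.0),
    "SampleA2": (11.25, 91.3),
    "SampleB1": (17.95, 82.6),
    "SampleB2": (15.95, 88.3),
    "SampleC1": (10.81, 84.1),
    "SampleC2": (9.04, 88.5),
    "SampleD1": (5.06, 84.2),
    "SampleD2": (6.26, 88.1),
    "SampleD3": (4.45, 87.9),
}


def unweighted_mean(stats):
    """Simple mean of the per-sample mapped percentages."""
    rates = [rate for _, rate in stats.values()]
    return sum(rates) / len(rates)


def read_weighted_mean(stats):
    """Mean mapped percentage weighted by each sample's input reads."""
    total_reads = sum(reads for reads, _ in stats.values())
    return sum(reads * rate for reads, rate in stats.values()) / total_reads


print(f"unweighted mean mapped %:    {unweighted_mean(star_stats):.1f}")
print(f"read-weighted mean mapped %: {read_weighted_mean(star_stats):.1f}")
```

The two averages differ because the samples with the most input reads do not have the highest mapping rates; which one to quote depends on whether each sample or each read should count equally.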

5. Feature Counting (featureCounts v2.0.6)

Reads were counted against six ncRNA annotation sets using featureCounts v2.0.6. Percentages shown are the fraction of total alignments assigned to each RNA class. snRNA captures the largest proportion of reads in eight of the nine samples (~34–58%); SampleB1 is the exception at 5.31%, where circRNA (12.60%) and piRNA (9.21%) rank higher. Across samples, circRNA accounts for ~7–13% and piRNA for ~3–9%. miRNA, snoRNA, and tRNA each account for <3%. Aggregated count and CPM matrices for all samples are available in 04.smallRNAcounts/Data_matrixes_with_all_samples/.

Sample	Group	miRNA %	piRNA %	snoRNA %	snRNA %	tRNA %	circRNA %
SampleA1	Group A	0.08%	5.22%	0.04%	43.08%	0.47%	6.97%
SampleA2	Group A	0.15%	4.72%	0.11%	54.78%	0.35%	8.33%
SampleB1	Group B	0.02%	9.21%	0.14%	5.31%	0.16%	12.60%
SampleB2	Group B	0.18%	5.65%	0.05%	43.58%	0.15%	8.55%
SampleC1	Group C	0.17%	2.78%	0.04%	49.90%	0.09%	10.60%
SampleC2	Group C	0.21%	5.48%	0.04%	57.90%	0.60%	8.43%
SampleD1	Group D	0.17%	6.56%	0.09%	34.17%	1.34%	7.76%
SampleD2	Group D	0.20%	5.33%	0.03%	56.34%	2.93%	8.38%
SampleD3	Group D	0.20%	5.10%	0.05%	52.05%	2.97%	7.93%

Reference Feature Counts

RNA Type	Features in Reference	Count Files Location
miRNA	1,878	`04.smallRNAcounts/miRNA/`
piRNA	27,700	`04.smallRNAcounts/piRNA/`
snoRNA	941	`04.smallRNAcounts/snoRNA/`
snRNA	1,909	`04.smallRNAcounts/snRNA/`
tRNA	619	`04.smallRNAcounts/tRNA/`
circRNA	768,986	`04.smallRNAcounts/circRNA/`

6. Differential Expression Analysis (DESeq2)

DE analysis was performed using DESeq2. Significance thresholds: raw p-value < 0.05 AND |log₂FC| > 0 (no minimum fold-change cutoff; see Section 7.6). Six pairwise comparisons were run per RNA type across the differentiation time course. Results, volcano plots, heatmaps, PCA, and sample correlation matrices are available under 05.DEanalysis/.

miRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	31	↑ 8	↓ 23
Group C vs Group B	18	↑ 12	↓ 6
Group C vs Group A	19	↑ 8	↓ 11
Group D vs Group B	30	↑ 18	↓ 12
Group D vs Group C	8	↑ 3	↓ 5
Group D vs Group A	49	↑ 16	↓ 33

piRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	46	↑ 9	↓ 37
Group C vs Group B	35	↑ 15	↓ 20
Group C vs Group A	68	↑ 13	↓ 55
Group D vs Group B	81	↑ 40	↓ 41
Group D vs Group C	34	↑ 20	↓ 14
Group D vs Group A	78	↑ 24	↓ 54

snoRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	49	↑ 9	↓ 40
Group C vs Group B	11	↑ 7	↓ 4
Group C vs Group A	34	↑ 16	↓ 18
Group D vs Group B	21	↑ 15	↓ 6
Group D vs Group C	5	↑ 1	↓ 4
Group D vs Group A	29	↑ 9	↓ 20

snRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	49	↑ 36	↓ 13
Group C vs Group B	137	↑ 2	↓ 135
Group C vs Group A	227	↑ 112	↓ 115
Group D vs Group B	167	↑ 0	↓ 167
Group D vs Group C	12	↑ 6	↓ 6
Group D vs Group A	246	↑ 98	↓ 148

tRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	70	↑ 35	↓ 35
Group C vs Group B	24	↑ 4	↓ 20
Group C vs Group A	50	↑ 17	↓ 33
Group D vs Group B	114	↑ 39	↓ 75
Group D vs Group C	42	↑ 3	↓ 39
Group D vs Group A	125	↑ 37	↓ 88

circRNA — Significant DE Genes

Comparison	Total Sig.	Up-regulated	Down-regulated
Group B vs Group A	1,368	↑ 480	↓ 888
Group C vs Group B	1,761	↑ 334	↓ 1,427
Group C vs Group A	1,862	↑ 154	↓ 1,708
Group D vs Group B	1,462	↑ 607	↓ 855
Group D vs Group C	606	↑ 536	↓ 70
Group D vs Group A	1,381	↑ 350	↓ 1,031

7. Methods

7.1 Library Preparation

Small RNA-seq libraries were prepared using the Takara Small RNA library preparation kit. Paired-end sequencing (2 × 151 bp) was performed on an Illumina platform. Only R1 reads were used for downstream analysis; R2 reads were discarded as they could not be trimmed correctly for small RNA analysis.

7.2 Read Quality Control and Adapter Trimming

Raw reads were assessed for quality using FastQC v0.12.1 on both R1 and R2. Quality reports were aggregated with MultiQC.

Adapter trimming was performed on R1 reads using cutadapt with the following parameters:

Parameter	Value	Description
`-m`	15	Discard reads shorter than 15 bp after trimming
`-u`	3	Remove 3 bases from the 5′ end of each read
`-a`	AAAAAAAAAA	Trim poly-A 3′ adapter sequence
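
The parameters above could be assembled into a single-end cutadapt call along the following lines. This is a minimal sketch, not the command actually run: the file names and the helper function are hypothetical, and `-o` (output file) is added here only because cutadapt needs an output destination.

```python
import shlex


def build_cutadapt_command(fastq_in, fastq_out):
    """Return a cutadapt argument list using the parameters from the table above."""
    return [
        "cutadapt",
        "-m", "15",           # discard reads shorter than 15 nt after trimming
        "-u", "3",            # remove 3 bases from the 5' end of each read
        "-a", "AAAAAAAAAA",   # trim the poly-A 3' adapter
        "-o", fastq_out,      # assumed output path (hypothetical name)
        fastq_in,             # assumed input path (hypothetical name)
    ]


cmd = build_cutadapt_command("SampleA1_R1.fastq.gz", "SampleA1.trimmed.fastq.gz")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value and avoids shell-quoting problems if it is later passed to `subprocess.run`.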

FastQC was re-run on trimmed R1 reads to confirm adapter removal. High duplication rates (75–88%) are expected for small RNA libraries due to the limited sequence diversity of small RNA species.

7.3 Reference Genome and STAR Index

Reads were aligned to the Homo sapiens GRCh38 (hg38) genome using STAR 2.7.10b. The STAR genome index was built without a splice junction database (appropriate for small RNA) using:

Parameter	Value	Description
`--genomeSAindexNbases`	14	Suffix-array pre-index length; 14 is the STAR default and suits large genomes such as human (smaller values are intended for small genomes)
`--sjdbOverhang`	0	No splice junction database (intron-free small RNA mode)

7.4 Alignment (STAR 2.7.10b)

Trimmed R1 reads were aligned to hg38 using STAR with parameters optimised for small RNA multi-mapping:

Parameter	Value	Description
`--outSAMtype`	BAM SortedByCoordinate	Output coordinate-sorted BAM
`--alignEndsType`	Local	Soft-clipping allowed; suited for short small RNA reads
`--outFilterMismatchNmax`	1	Maximum 1 mismatch per alignment
`--outFilterMismatchNoverLmax`	0.05	Mismatch fraction ≤ 5% of read length
`--outFilterMatchNmin`	16	Minimum matched bases = 16
`--outFilterMultimapNmax`	1000000	Allow up to 1,000,000 multi-mapping loci (captures multi-copy small RNAs)
`--outFilterMultimapScoreRange`	1	Report alignments within score range 1 of best
`--outFilterScoreMinOverLread`	0	No minimum score-over-length filter
`--outFilterMatchNminOverLread`	0	No minimum match-over-length filter
`--alignIntronMax`	1	Effectively disables spliced alignments
`--outSAMunmapped`	Within	Write unmapped reads to the output BAM
`--outReadsUnmapped`	None	Do not write separate unmapped FASTQ

BAM files were indexed using samtools. Alignment statistics were summarised with MultiQC. The high multi-mapping rate observed (56–89%) is characteristic of small RNA-seq, as many small RNA species share identical or near-identical sequences across genomic loci.
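
For orientation, the alignment parameters listed in the table above could be turned into a STAR command as sketched below. The paths, the helper name, and the choice of `--readFilesCommand zcat` for gzipped input are assumptions made for illustration; only the filter and output options come from the table.

```python
import shlex

# Options copied from the parameter table in Section 7.4.
STAR_PARAMS = {
    "--outSAMtype": "BAM SortedByCoordinate",
    "--alignEndsType": "Local",
    "--outFilterMismatchNmax": "1",
    "--outFilterMismatchNoverLmax": "0.05",
    "--outFilterMatchNmin": "16",
    "--outFilterMultimapNmax": "1000000",
    "--outFilterMultimapScoreRange": "1",
    "--outFilterScoreMinOverLread": "0",
    "--outFilterMatchNminOverLread": "0",
    "--alignIntronMax": "1",
    "--outSAMunmapped": "Within",
    "--outReadsUnmapped": "None",
}


def build_star_command(genome_dir, fastq, prefix):
    """Return a STAR argument list; genome_dir, fastq and prefix are hypothetical paths."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--readFilesCommand", "zcat",   # assumption: trimmed reads are gzipped
        "--outFileNamePrefix", prefix,
    ]
    for flag, value in STAR_PARAMS.items():
        cmd.append(flag)
        cmd.extend(value.split())       # "BAM SortedByCoordinate" becomes two arguments
    return cmd


star_cmd = build_star_command("star_index_hg38", "SampleA1.trimmed.fastq.gz", "SampleA1_")
print(shlex.join(star_cmd))
```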

7.5 Feature Counting (featureCounts v2.0.6)

Read counts for six ncRNA classes were quantified using featureCounts v2.0.6 (Subread package) against annotation files in SAF format. Multi-mapping reads were included (-M) and strand specificity was set to unstranded (-s 0). Junction reads were also reported (-J).
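
A featureCounts call with the options named above might look like the following sketch. The file names and helper function are hypothetical; `-F SAF` is included because the annotations are described as SAF files, and `-a`/`-o` are the standard annotation and output arguments.

```python
import shlex


def build_featurecounts_command(saf_path, bam_path, out_path):
    """Return a featureCounts argument list for one sample and one RNA class."""
    return [
        "featureCounts",
        "-F", "SAF",     # annotation is in SAF format
        "-M",            # include multi-mapping reads
        "-s", "0",       # unstranded counting
        "-J",            # also report junction counts (.jcounts)
        "-a", saf_path,  # hypothetical annotation path
        "-o", out_path,  # hypothetical output path
        bam_path,        # hypothetical BAM path
    ]


fc_cmd = build_featurecounts_command(
    "miRNA.saf", "SampleA1_Aligned.sortedByCoord.out.bam", "SampleA1.count.txt"
)
print(shlex.join(fc_cmd))
```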

RNA Type	Annotation Database	Reference / Version
miRNA	miRBase	Homo sapiens GRCh38, Ensembl release 113
piRNA	piRNAdb	v1.7.6, hg38
snoRNA	Ensembl GTF	Homo sapiens GRCh38.113
snRNA	Ensembl GTF	Homo sapiens GRCh38.113
tRNA	GtRNAdb	hg38
circRNA	CircAtlas	v3.0, hg38

CPM (counts per million) normalisation was applied to each sample's raw counts using a custom R script: CPM = round(count / sum(all counts) × 10⁶). Per-sample count and CPM files were merged into cross-sample matrices using a custom R script (generate_matrixes_with_all_samples.R).
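
The CPM formula above can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, not the custom R script itself; the feature names and counts are made up for the example.

```python
def counts_to_cpm(counts):
    """Convert raw counts for one sample to rounded counts per million."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty sample.
        return {feature: 0 for feature in counts}
    return {feature: round(count / total * 1_000_000) for feature, count in counts.items()}


# Toy example with hypothetical features.
example_counts = {"feature_1": 10, "feature_2": 30, "feature_3": 60}
example_cpm = counts_to_cpm(example_counts)
print(example_cpm)
```

Because CPM divides by the sample's own total, the values for one sample sum to roughly one million (up to rounding), regardless of sequencing depth.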

7.6 Differential Expression Analysis (DESeq2)

Differential expression (DE) analysis was performed independently for each of the six ncRNA types using DESeq2 (R/Bioconductor, v1.46.0) via a custom pipeline (prepare_for_DGE_analysis_DESeq2.sh + DGE_analysis_DESeq2.R). Six pairwise comparisons were tested across the differentiation time course:

Comparison	Test Group	Control / Reference Group
Group B vs Group A	Group B	Group A
Group C vs Group B	Group C	Group B
Group C vs Group A	Group C	Group A
Group D vs Group B	Group D	Group B
Group D vs Group C	Group D	Group C
Group D vs Group A	Group D	Group A
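
The six comparisons in the table are all pairwise combinations of the four groups, each oriented with the later timepoint as the test group. A short sketch (variable names are illustrative) shows where the 6 per RNA type and 36 in total come from:

```python
from itertools import combinations

groups = ["Group A", "Group B", "Group C", "Group D"]  # in time-course order
rna_types = ["miRNA", "piRNA", "snoRNA", "snRNA", "tRNA", "circRNA"]

# Each pair is stored as (test, reference), with the later timepoint as the test group.
comparisons = [(later, earlier) for earlier, later in combinations(groups, 2)]
total_runs = len(comparisons) * len(rna_types)

for test, reference in comparisons:
    print(f"{test} vs {reference}")
print(f"{len(comparisons)} comparisons x {len(rna_types)} RNA types = {total_runs} DE runs")
```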

DESeq2 model and normalisation: Raw per-sample count files were loaded with DESeqDataSetFromHTSeqCount() using the design formula ~ condition. The control group was set as the reference level using relevel(). Size-factor normalisation and negative-binomial model fitting were performed with DESeq(). Normalised counts were extracted with counts(dds, normalized=TRUE) and saved as output-normalized-count.csv. Per-group normalised mean expression (baseMean_<group>) was calculated using rowMeans(counts(dds, normalized=TRUE)) for each condition level.
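
To make the size-factor step concrete, the sketch below reimplements the median-of-ratios idea in plain Python. It is a simplified illustration under stated assumptions (a small in-memory count matrix, features containing any zero count excluded from the reference), not DESeq2's code; the function names are hypothetical.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: list of feature rows, each a list of per-sample raw counts.
    """
    n_samples = len(count_matrix[0])
    # Only features with no zero counts have a finite log geometric mean.
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no feature has non-zero counts in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalise(count_matrix, factors):
    """Divide each sample's counts by its size factor."""
    return [[c / s for c, s in zip(row, factors)] for row in count_matrix]


# Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
toy = [[10, 20], [100, 200], [5, 10], [0, 7]]
sf = size_factors(toy)
norm = normalise(toy, sf)
print(sf)
```

In the toy matrix the second sample's size factor is twice the first, so after normalisation the first three features have equal values in both samples.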

DE results: Statistics (log₂FoldChange, lfcSE, Wald statistic, p-value, adjusted p-value) were extracted with results(dds) and merged with per-group baseMeans. The full results table was sorted by adjusted p-value (Benjamini–Hochberg) and saved as output-AnalysisResult.csv.
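
The Benjamini–Hochberg adjustment used for the sort order can be sketched as follows. This is the textbook procedure written out for illustration; it ignores the missing values that a DESeq2 results table may contain (for example from independent filtering), and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```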

Significance thresholds: Features were called differentially expressed if they met both:

raw p-value < 0.05
|log₂FoldChange| > 0 (any directional change)

Significant features were further split into up-regulated (log₂FC > 0) and down-regulated (log₂FC < 0) subsets and saved separately.
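
The thresholds and the up/down split described above amount to the following filter. This is a minimal sketch on a list of dictionaries with made-up values; the keys mirror the DESeq2 column names, and missing statistics are represented as `None` (an assumption about how NA values would be loaded).

```python
def call_significant(results, p_cutoff=0.05):
    """Split DE results into significant, up-regulated and down-regulated lists."""
    significant = [
        r for r in results
        if r["pvalue"] is not None
        and r["log2FoldChange"] is not None
        and r["pvalue"] < p_cutoff
        and abs(r["log2FoldChange"]) > 0
    ]
    up = [r for r in significant if r["log2FoldChange"] > 0]
    down = [r for r in significant if r["log2FoldChange"] < 0]
    return significant, up, down


# Hypothetical example rows.
toy_results = [
    {"feature": "f1", "log2FoldChange": 1.2, "pvalue": 0.001},
    {"feature": "f2", "log2FoldChange": -0.4, "pvalue": 0.03},
    {"feature": "f3", "log2FoldChange": 2.5, "pvalue": 0.20},
    {"feature": "f4", "log2FoldChange": None, "pvalue": None},
]
sig, up, down = call_significant(toy_results)
print(len(sig), len(up), len(down))
```

Since almost every tested feature has a non-zero log₂ fold change, the raw p-value cutoff is effectively the only filter; the fold-change condition mainly determines the direction of each call.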

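Variance-stabilising transformation (VST): Raw counts were transformed using varianceStabilizingTransformation() for the PCA, heatmap, and correlation visualisations. Where noted in the table below, plots use the top 2,000 most variable genes (ranked by row variance of the VST matrix). RNA classes with fewer than 2,000 reference features (miRNA, snoRNA, snRNA, tRNA) cannot supply 2,000 genes, so the selection can include at most all of their features.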
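
Ranking features by row variance and keeping the top N can be sketched as below. The input is assumed to be a dictionary of already-transformed values per feature; the names and numbers are hypothetical and the transformation itself is not reproduced here.

```python
from statistics import variance


def top_variable_features(transformed, n_top=2000):
    """Return feature names ordered by decreasing sample variance, truncated to n_top."""
    ranked = sorted(transformed, key=lambda name: variance(transformed[name]), reverse=True)
    return ranked[:n_top]


# Toy transformed matrix: feature -> values across four samples.
toy_vst = {
    "flat": [5.0, 5.1, 5.0, 4.9],
    "moderate": [4.0, 5.0, 6.0, 5.0],
    "variable": [2.0, 8.0, 3.0, 9.0],
}
top2 = top_variable_features(toy_vst, n_top=2)
print(top2)
```

If fewer than `n_top` features are available, the slice simply returns all of them, which is what happens for any RNA class smaller than the cutoff.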

Visualisation outputs per comparison:

Output File	Description
`output-PCA.pdf`	PCA plot on VST-transformed data (`plotPCA()`); PC1 vs PC2 coloured by group
`output-PCA-data.csv`	Underlying PCA coordinates and variance explained per PC
`output-heatmap.pdf`	Sample-to-sample distance heatmap: Pearson correlation on top 2,000 variable genes (VST), distance = √(1 − r²), clustered by `pheatmap`
`output-Pearson-correlation-of-top-2000-genes.pdf`	Pearson correlation matrix heatmap for top 2,000 variable genes (VST)
`output-sample-correlation.csv`	Pearson correlation matrix (numeric, top 2,000 genes)
`output-heatmap-gene.pdf`	Gene × sample heatmap (row-scaled VST, top 2,000 variable genes, samples in original order, rows clustered by correlation distance)
`output-heatmap-gene---clustering-samples.pdf`	Same as above but with hierarchical clustering of samples
`output-BetweenSampleDis.pdf`	Boxplot of log₂(count + 1) per sample showing raw count distributions
`output-VolcanoPlot.pdf`	Volcano plot: log₂FC vs −log₁₀(p-value); significant features highlighted in red
`output-VolcanoPlot-data.csv`	Data table underlying the volcano plot
`Volcano-Plot--*--show-gene-names.html`	Interactive HTML volcano plot with gene labels (EnhancedVolcano / Plotly)
`output-MAplot.pdf`	MA plot: log₂(baseMean) vs log₂FC; significant features in red
`output-pval.pdf`	Histogram of raw p-values (bin width = 0.05)
`heatmap_of_top_100_smallest_pvalue_genes--*.pdf`	Heatmap of top 100 smallest-p-value genes for all-sig, up-regulated, and down-regulated sets
`output-AnalysisResult.csv`	Full DE table (all features, sorted by adjusted p-value)
`output-AnalysisResult-sig.csv`	Significant DE features (both directions)
`output-AnalysisResult-sig-upregulated.csv`	Significant up-regulated features
`output-AnalysisResult-sig-downregulated.csv`	Significant down-regulated features
`output-normalized-count.csv`	DESeq2 size-factor normalised counts for all samples
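
The distance used for the sample-to-sample heatmap, √(1 − r²), can be computed from a Pearson correlation as in this sketch. The helper names and input vectors are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Distance = sqrt(1 - r^2); clamped at 0 to guard against rounding error."""
    r = pearson_r(x, y)
    return math.sqrt(max(0.0, 1.0 - r * r))


identical_trend = correlation_distance([1, 2, 3], [2, 4, 6])
opposite_trend = correlation_distance([1, 2, 3], [3, 2, 1])
uncorrelated = correlation_distance([1, 2, 3, 4], [1, -1, -1, 1])
print(identical_trend, opposite_trend, uncorrelated)
```

Because r is squared, perfectly anti-correlated samples get the same zero distance as perfectly correlated ones; the measure reflects strength of linear association, not its sign.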

7.7 Software Summary

Tool	Version	Purpose
FastQC	0.12.1	Raw and trimmed read quality control
cutadapt	not specified	Adapter trimming (poly-A, 5′ trimming)
MultiQC	not specified	Aggregation of FastQC and STAR QC reports
STAR	2.7.10b	Read alignment to hg38
samtools	not specified	BAM indexing
featureCounts	v2.0.6 (Subread)	Read counting against ncRNA annotations
R / DESeq2	DESeq2 1.46.0	Differential expression analysis

8. Deliverable File Structure

Demo-Project-smallRNA-analysis/
├── 00.TrimmedFastq/ ~1.9 GB
│   └── 9 × *.trimmed.fastq.gz
├── 01.fastqc/ ~34 MB
│   ├── 01.fastqc-multiqc_report.html ← FastQC MultiQC report
│   ├── multiqc_report.html
│   └── multiqc_data/
├── 02.bam/ ~12 GB
│   └── 9 × *_Aligned.sortedByCoord.out.bam (.bai)
├── 03.mappingQC/ ~2.9 MB
│   ├── 03.mappingQC-multiqc_report.html ← STAR alignment MultiQC report
│   ├── multiqc_report.html
│   └── multiqc_data/
├── 04.smallRNAcounts/ ~500 MB
│   ├── miRNA/ ← 9 samples × {.count.txt, .count_CPM.txt, .count_CountOnly.txt, .summary, .jcounts}
│   ├── piRNA/ ← 9 samples × 5 file types
│   ├── snoRNA/ ← 9 samples × 5 file types
│   ├── snRNA/ ← 9 samples × 5 file types
│   ├── tRNA/ ← 9 samples × 5 file types
│   ├── circRNA/ ← 9 samples × 5 file types
│   └── Data_matrixes_with_all_samples/ ← merged count & CPM matrices (12 files)
├── 05.DEanalysis/ ~459 MB
│   ├── Demo-Project-miRNA-analysis/06.DE-GO-KEGG/
│   │   ├── sampleInfo.csv
│   │   └── {GroupB_vs_GroupA, GroupC_vs_GroupB, GroupC_vs_GroupA, GroupD_vs_GroupB, GroupD_vs_GroupC, GroupD_vs_GroupA}_DE/
│   │       ├── output-AnalysisResult.csv / -sig.csv / -sig-upregulated.csv / -sig-downregulated.csv
│   │       ├── output-normalized-count.csv, output-PCA-data.csv, output-VolcanoPlot-data.csv
│   │       └── output-*.pdf (heatmap, PCA, volcano, MA, correlation plots)
│   ├── Demo-Project-piRNA-analysis/ ← same structure
│   ├── Demo-Project-snoRNA-analysis/ ← same structure
│   ├── Demo-Project-snRNA-analysis/ ← same structure
│   ├── Demo-Project-tRNA-analysis/ ← same structure
│   └── Demo-Project-circRNA-analysis/ ← same structure
└── 05.smallRNAreport/ ~1.4 MB
    ├── logs/ ← 54 featureCounts log files (9 samples × 6 RNA types)
    └── summary/ ← 54 featureCounts summary files
