7. Methods
7.1 Library Preparation
Small RNA-seq libraries were prepared using the Takara Small RNA library preparation kit. Paired-end sequencing (2 × 151 bp) was performed on an Illumina platform. Only R1 reads were used for downstream analysis; R2 reads were discarded as they could not be trimmed correctly for small RNA analysis.
7.2 Read Quality Control and Adapter Trimming
Raw reads were assessed for quality using FastQC v0.12.1 on both R1 and R2. Quality reports were aggregated with MultiQC.
Adapter trimming was performed on R1 reads using cutadapt with the following parameters:
| Parameter | Value | Description |
-m | 15 | Discard reads shorter than 15 bp after trimming |
-u | 3 | Remove 3 bases from the 5′ end of each read |
-a | AAAAAAAAAA | Trim poly-A 3′ adapter sequence |
FastQC was re-run on trimmed R1 reads to confirm adapter removal. High duplication rates (75–88%) are expected for small RNA libraries due to the limited sequence diversity of small RNA species.
7.3 Reference Genome and STAR Index
Reads were aligned to the Homo sapiens GRCh38 (hg38) genome using STAR 2.7.10b. The STAR genome index was built without a splice junction database (appropriate for small RNA) using:
| Parameter | Value | Description |
--genomeSAindexNbases | 14 | Optimised suffix array index for small genomes / small RNA |
--sjdbOverhang | 0 | No splice junction database (intron-free small RNA mode) |
7.4 Alignment (STAR 2.7.10b)
Trimmed R1 reads were aligned to hg38 using STAR with parameters optimised for small RNA multi-mapping:
| Parameter | Value | Description |
--outSAMtype | BAM SortedByCoordinate | Output coordinate-sorted BAM |
--alignEndsType | Local | Soft-clipping allowed; suited for short small RNA reads |
--outFilterMismatchNmax | 1 | Maximum 1 mismatch per alignment |
--outFilterMismatchNoverLmax | 0.05 | Mismatch fraction ≤ 5% of read length |
--outFilterMatchNmin | 16 | Minimum matched bases = 16 |
--outFilterMultimapNmax | 1,000,000 | Allow up to 1 M multi-mapping loci (captures multi-copy small RNAs) |
--outFilterMultimapScoreRange | 1 | Report alignments within score range 1 of best |
--outFilterScoreMinOverLread | 0 | No minimum score-over-length filter |
--outFilterMatchNminOverLread | 0 | No minimum match-over-length filter |
--alignIntronMax | 1 | Effectively disables spliced alignments |
--outSAMunmapped | Within | Write unmapped reads to the output BAM |
--outReadsUnmapped | None | Do not write separate unmapped FASTQ |
BAM files were indexed using samtools. Alignment statistics were summarised with MultiQC. The high multi-mapping rate observed (56–89%) is characteristic of small RNA-seq, as many small RNA species share identical or near-identical sequences across genomic loci.
7.5 Feature Counting (featureCounts v2.0.6)
Read counts for six ncRNA classes were quantified using featureCounts v2.0.6 (Subread package) against annotation files in SAF format. Multi-mapping reads were included (-M) and strand specificity was set to unstranded (-s 0). Junction reads were also reported (-J).
| RNA Type | Annotation Database | Reference / Version |
| miRNA | miRBase | Homo sapiens GRCh38, Ensembl release 113 |
| piRNA | piRNAdb | v1.7.6, hg38 |
| snoRNA | Ensembl GTF | Homo sapiens GRCh38.113 |
| snRNA | Ensembl GTF | Homo sapiens GRCh38.113 |
| tRNA | GtRNAdb | hg38 |
| circRNA | CircAtlas | v3.0, hg38 |
CPM (counts per million) normalisation was applied to each sample's raw counts using a custom R script: CPM = round(count / sum(all counts) × 106). Per-sample count and CPM files were merged into cross-sample matrices using a custom R script (generate_matrixes_with_all_samples.R).
7.6 Differential Expression Analysis (DESeq2)
Differential expression (DE) analysis was performed independently for each of the six ncRNA types using DESeq2 (R/Bioconductor, v1.46.0) via a custom pipeline (prepare_for_DGE_analysis_DESeq2.sh + DGE_analysis_DESeq2.R). Six pairwise comparisons were tested across the differentiation time course:
| Comparison | Test Group | Control / Reference Group |
| Group B vs Group A | Group B | Group A |
| Group C vs Group B | Group C | Group B |
| Group C vs Group A | Group C | Group A |
| Group D vs Group B | Group D | Group B |
| Group D vs Group C | Group D | Group C |
| Group D vs Group A | Group D | Group A |
DESeq2 model and normalisation: Raw per-sample count files were loaded with DESeqDataSetFromHTSeqCount() using the design formula ~ condition. The control group was set as the reference level using relevel(). Size-factor normalisation and negative-binomial model fitting were performed with DESeq(). Normalised counts were extracted with counts(dds, normalized=TRUE) and saved as output-normalized-count.csv. Per-group normalised mean expression (baseMean_<group>) was calculated using rowMeans(counts(dds, normalized=TRUE)) for each condition level.
DE results: Statistics (log₂FoldChange, lfcSE, Wald statistic, p-value, adjusted p-value) were extracted with results(dds) and merged with per-group baseMeans. The full results table was sorted by adjusted p-value (Benjamini–Hochberg) and saved as output-AnalysisResult.csv.
Significance thresholds: Features were called differentially expressed if they met both:
- raw p-value < 0.05
- |log₂FoldChange| > 0 (any directional change)
Significant features were further split into up-regulated (log₂FC > 0) and down-regulated (log₂FC < 0) subsets and saved separately.
Variance-stabilising transformation (VST): Raw counts were transformed using varianceStabilizingTransformation() for all visualisations. Plots were generated using the top 2,000 most variable genes (ranked by row variance of the VST matrix).
Visualisation outputs per comparison:
| Output File | Description |
output-PCA.pdf | PCA plot on VST-transformed data (plotPCA()); PC1 vs PC2 coloured by group |
output-PCA-data.csv | Underlying PCA coordinates and variance explained per PC |
output-heatmap.pdf | Sample-to-sample distance heatmap: Pearson correlation on top 2,000 variable genes (VST), distance = √(1 − r²), clustered by pheatmap |
output-Pearson-correlation-of-top-2000-genes.pdf | Pearson correlation matrix heatmap for top 2,000 variable genes (VST) |
output-sample-correlation.csv | Pearson correlation matrix (numeric, top 2,000 genes) |
output-heatmap-gene.pdf | Gene × sample heatmap (row-scaled VST, top 2,000 variable genes, samples in original order, rows clustered by correlation distance) |
output-heatmap-gene---clustering-samples.pdf | Same as above but with hierarchical clustering of samples |
output-BetweenSampleDis.pdf | Boxplot of log₂(count + 1) per sample showing raw count distributions |
output-VolcanoPlot.pdf | Volcano plot: log₂FC vs −log₁₀(p-value); significant features highlighted in red |
output-VolcanoPlot-data.csv | Data table underlying the volcano plot |
Volcano-Plot--*--show-gene-names.html | Interactive HTML volcano plot with gene labels (EnhancedVolcano / Plotly) |
output-MAplot.pdf | MA plot: log₂(baseMean) vs log₂FC; significant features in red |
output-pval.pdf | Histogram of raw p-values (bin width = 0.05) |
heatmap_of_top_100_smallest_pvalue_genes--*.pdf | Heatmap of top 100 smallest-p-value genes for all-sig, up-regulated, and down-regulated sets |
output-AnalysisResult.csv | Full DE table (all features, sorted by adjusted p-value) |
output-AnalysisResult-sig.csv | Significant DE features (both directions) |
output-AnalysisResult-sig-upregulated.csv | Significant up-regulated features |
output-AnalysisResult-sig-downregulated.csv | Significant down-regulated features |
output-normalized-count.csv | DESeq2 size-factor normalised counts for all samples |
7.7 Software Summary
| Tool | Version | Purpose |
| FastQC | 0.12.1 | Raw and trimmed read quality control |
| cutadapt | — | Adapter trimming (poly-A, 5′ trimming) |
| MultiQC | — | Aggregation of FastQC and STAR QC reports |
| STAR | 2.7.10b | Read alignment to hg38 |
| samtools | — | BAM indexing |
| featureCounts | v2.0.6 (Subread) | Read counting against ncRNA annotations |
| R / DESeq2 | DESeq2 1.46.0 | Differential expression analysis |
8. Deliverable File Structure
Demo-Project-smallRNA-analysis/
├── 00.TrimmedFastq/ ~1.9 GB
│ └── 9 × *.trimmed.fastq.gz
├── 01.fastqc/ ~34 MB
│ ├── 01.fastqc-multiqc_report.html ← FastQC MultiQC report
│ ├── multiqc_report.html
│ └── multiqc_data/
├── 02.bam/ ~12 GB
│ └── 9 × *_Aligned.sortedByCoord.out.bam (.bai)
├── 03.mappingQC/ ~2.9 MB
│ ├── 03.mappingQC-multiqc_report.html ← STAR alignment MultiQC report
│ ├── multiqc_report.html
│ └── multiqc_data/
├── 04.smallRNAcounts/ ~500 MB
│ ├── miRNA/ — 9 samples × {.count.txt, .count_CPM.txt, .count_CountOnly.txt, .summary, .jcounts}
│ ├── piRNA/ — 9 samples × 5 file types
│ ├── snoRNA/ — 9 samples × 5 file types
│ ├── snRNA/ — 9 samples × 5 file types
│ ├── tRNA/ — 9 samples × 5 file types
│ ├── circRNA/ — 9 samples × 5 file types
│ └── Data_matrixes_with_all_samples/ ← merged count & CPM matrices (12 files)
├── 05.DEanalysis/ ~459 MB
│ ├── Demo-Project-miRNA-analysis/06.DE-GO-KEGG/
│ │ ├── sampleInfo.csv
│ │ └── {GroupB_vs_GroupA, GroupC_vs_GroupB, GroupC_vs_GroupA, GroupD_vs_GroupB, GroupD_vs_GroupC, GroupD_vs_GroupA}_DE/
│ │ ├── output-AnalysisResult.csv / -sig.csv / -sig-upregulated.csv / -sig-downregulated.csv
│ │ ├── output-normalized-count.csv, output-PCA-data.csv, output-VolcanoPlot-data.csv
│ │ └── output-*.pdf (heatmap, PCA, volcano, MA, correlation plots)
│ ├── Demo-Project-piRNA-analysis/ — same structure
│ ├── Demo-Project-snoRNA-analysis/ — same structure
│ ├── Demo-Project-snRNA-analysis/ — same structure
│ ├── Demo-Project-tRNA-analysis/ — same structure
│ └── Demo-Project-circRNA-analysis/— same structure
└── 05.smallRNAreport/ ~1.4 MB
├── logs/ — 54 featureCounts log files (9 samples × 6 RNA types)
└── summary/ — 54 featureCounts summary files