Computational Reproducibility With Scientific Workflows: Analysing viral genomes with Nextflow

An ACM REP 2025 tutorial

George Marchment, Sarah Cohen-Boulakia and Frédéric Lemoine

Table of contents

I. Abstract
II. Learning Objectives and outline
III. Tutorial Material
- A. Lecture
- B. Practical Session
  - a. Objective
  - b. Input data
  - c. Detailed steps
  - d. List of tools needed
  - e. Analysis Questions
  - f. Correction
- C. Reproducibility consensus
IV. Intended Audience, Format and Special Equipment Needs
V. Authors

I. Abstract

In an era of large datasets and complex scientific analyses, ensuring the reproducibility of data analyses has become paramount. Workflow management systems have emerged as a key solution to this challenge. They manage:

the software environment,
task scheduling,
parallelisation and
communication with the execution machines (HPC, cloud, etc.).

In doing so, they significantly facilitate workflow development compared to historical practices (e.g., simple bash scripts), all while ensuring scalability and a high level of reproducibility. However, while they are becoming more popular, workflow management systems have not yet gained wide adoption within the scientific community, largely due to established practices and the perceived steep learning curve associated with their use.

This tutorial aims at demonstrating the critical role of workflow management systems in implementing reproducible data analyses, with an emphasis on their capacity to encapsulate heterogeneous code, manage software environments, scale with the data size, and leverage heterogeneous computational resources efficiently. To do so, we will use the Nextflow workflow system and a viral genome sequence reconstruction pipeline as a use case. This will demonstrate the fundamentals of Nextflow and illustrate how it can be used to easily implement, execute, and share a simple workflow.

II. Learning Objectives and outline

Key learning outcomes include

acquiring basic workflow concepts,
learning how to implement simple workflows and
understanding the capabilities of workflow management systems in encapsulating heterogeneous code, scalability, software environment management, and computational resource management.

The tutorial will be organized in three phases.

We will start with a short lecture to present the main challenges in implementing reproducible data analyses, specifically using a motivating example for illustration. We will then introduce how workflow management systems are capable of solving these challenges. To achieve this, we will present the Nextflow framework and then demonstrate how it can be used to solve the motivating example. Finally, we will introduce the analysis pipeline to implement as a workflow. It consists of several viral genome sequencing datasets that need to go through multiple analysis steps in order to reconstruct full viral genomes with their annotations.
The second part of the tutorial will consist of a practical session, during which the participants will implement the analysis using the Nextflow management system. The session will be highly interactive, featuring coding demonstrations and structured group discussions, which will allow participants to apply their knowledge, with organizers providing support and answering questions.
The tutorial will conclude with a discussion and survey on the challenges encountered. We will then conduct a reproducibility consensus: as a group, we will compare the results provided by the participants and assess the level of reproducibility achieved.

By the end of the tutorial, participants will have a solid foundation in workflow management systems and be capable of designing and implementing reproducible data analysis workflows, aligning with the broader goals and themes of ACM REP 2025.

III. Tutorial Material

A. Lecture

The lecture slides can be found here.

B. Practical Session

a. Objective

The aim of this practical session is to create a Nextflow workflow to analyse a SARS-CoV-2 sequencing dataset. The objectives are:

Infer the full sequence of the virus
Detect the clade (alpha, beta, etc.).

To do so we will start from a sample that has been sequenced on an Illumina sequencer, and we will run the following steps:

Map the reads onto a reference genome.
Build the consensus sequence from the mapped reads.
Identify the clade of the virus based on the consensus sequence.

These analyses, which reveal the genetic sequence and clade of a virus, are crucial for several reasons.

Firstly, they help in the tracking of the evolution and spread of the virus, providing insights into how it mutates over time. This information is vital for public health officials to make informed decisions and implement effective control measures.
Secondly, identifying specific clades can help in understanding the virulence and transmissibility of different variants, which is essential for developing targeted treatments and vaccines. Additionally, this analysis supports epidemiological studies by enabling the identification of outbreak sources and transmission patterns.

Overall, the detailed genetic analysis of viruses enhances the ability to respond to outbreaks.

coronavirus — SARS-CoV-2 phylogeny from nextstrain.org. Data updated 2025-06-09

b. Input data

The input data consists of:

Two compressed FASTQ files containing paired-end reads from amplicon sequencing of a SARS-CoV-2 sample, which can be downloaded here:
- Reads 1: here
- Reads 2: here
The reference genome to map the reads against (https://www.ncbi.nlm.nih.gov/nuccore/MN908947), which can be downloaded here:
- reference

c. Detailed steps

The resulting workflow should look like this:

Workflow graph

1. Mapping the reads on a reference genome

First, the reference genome needs to be indexed. This step is important because the index speeds up the mapping process. For more information regarding genome indexing see here.

Input of step
```
file ref
```

Output of step
```
tuple val(ref.name), file("${ref.baseName}.*")
```

Command lines
```
bwa index ${ref}
```
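
As a minimal sketch, the input, output and command above could be assembled into a Nextflow process as follows. The process name `indexRef`, the `params.ref` parameter and its default file name are hypothetical, `path` is used here in place of the older `file` qualifier, and the container is the bwa image from the list of tools given later in this document.

```nextflow
// Hypothetical parameter: location of the reference FASTA downloaded above.
params.ref = 'reference.fasta'

process indexRef {
    container 'evolbioinfo/bwa:v0.7.17'

    input:
    path ref

    output:
    // The reference file name plus the index files written by bwa.
    // By default, staged input files are not captured by output globs,
    // so only the newly created index files are emitted.
    tuple val(ref.name), path("${ref.baseName}.*")

    script:
    """
    bwa index ${ref}
    """
}

workflow {
    indexRef(Channel.fromPath(params.ref))
    indexRef.out.view()
}
```

Saved as, for example, `main.nf`, this could be launched with `nextflow run main.nf --ref <path to the reference>`, provided Docker or Apptainer is enabled in the Nextflow configuration.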

The next step is to map the reads to the reference genome. Essentially, this step finds where in the genome each read originates from. Read mapping will be performed using bwa mem.

Inputs of step
```
tuple val(name), file(f1), file(f2)
tuple val(refName), file(ref)
```

Output of step
```
tuple val(name), file("*.bam"), file("*.bai")
```

Command lines
```
bwa mem -t 1 reference.fa reads1.fq reads2.fq > tmp.sam
samtools sort -o sample.bam tmp.sam
samtools index sample.bam
```
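
The mapping step could be sketched as a Nextflow process along the following lines. Several things here are assumptions rather than part of the tutorial material: the process name `mapReads`, the `params.reads` and `params.ref` parameters and their defaults, the substitution of the placeholder file names in the commands above by the input variables, and the standalone `workflow` block, which expects the bwa index files to already sit next to the reference. No container is set because the step calls both bwa and samtools, so whichever container or environment is used must provide both.

```nextflow
// Hypothetical parameters: paired-end reads and reference FASTA.
params.reads = 'reads_{1,2}.fastq.gz'
params.ref   = 'reference.fasta'

process mapReads {
    input:
    tuple val(name), path(f1), path(f2)
    tuple val(refName), path(ref)

    output:
    tuple val(name), path("*.bam"), path("*.bai")

    script:
    // bwa mem only needs the index files, which share the prefix refName.
    """
    bwa mem -t 1 ${refName} ${f1} ${f2} > tmp.sam
    samtools sort -o ${name}.bam tmp.sam
    samtools index ${name}.bam
    """
}

workflow {
    // Emits [name, read1, read2] for each pair of read files.
    reads = Channel.fromFilePairs(params.reads, flat: true)

    // Assumption: the index was already built next to the reference.
    index = Channel.fromPath("${params.ref}.*")
        .collect()
        .map { files -> [file(params.ref).name, files] }

    mapReads(reads, index)
}
```

In a complete workflow, the second input would more naturally come from the output of the indexing step rather than from files on disk.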

2. Building consensus sequence

The consensus sequence will be inferred using iVAR, from the pileup produced by samtools mpileup. The consensus sequence is a theoretical representative of a nucleotide sequence in which each nucleotide is the one that occurs most frequently at that site across the different sequences (here, the mapped reads). Using the most frequent nucleotide at each site has two benefits: conserved regions, which tend to be the most functionally important, are preserved, and sequencing errors are averaged out, which yields a sequence in which we have more confidence. For more information regarding the consensus sequence see here.

Input of step
```
tuple val(name), file(bam), file(bai)
```

Output of step
```
file "${name}.fa"
```

Command lines
```
samtools mpileup -d 600000 -A -Q 0 -F 0 ${bam} | ivar consensus -q 20 -t 0 -m 5 -n N -p ${name}
```
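
A possible Nextflow process for this step is sketched below. The process name `consensus`, the `params.bam` parameter and the standalone `workflow` block (which assumes a sorted BAM file with its `.bai` index alongside) are hypothetical and only serve to make the example self-contained. No container is set, since the command needs both samtools and iVAR to be available in the same environment.

```nextflow
// Hypothetical parameter: a sorted BAM file; its index is assumed
// to be stored next to it as <bam>.bai.
params.bam = 'sample.bam'

process consensus {
    input:
    tuple val(name), path(bam), path(bai)

    output:
    // ivar consensus writes its sequence to <prefix>.fa
    path "${name}.fa"

    script:
    """
    samtools mpileup -d 600000 -A -Q 0 -F 0 ${bam} | ivar consensus -q 20 -t 0 -m 5 -n N -p ${name}
    """
}

workflow {
    bam = file(params.bam)
    // A single [name, bam, bai] item, matching the tuple input above.
    consensus(Channel.of([bam.baseName, bam, file("${params.bam}.bai")]))
}
```

In the `ivar consensus` call, `-m 5` sets the minimum depth needed to call a base and `-n N` is the character written where the depth is lower, so poorly covered positions appear as `N` in the output.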

3. Detecting clade

Viral diversity is often broken down into clades or lineages, which are defined by specific combinations of signature mutations. Clades are groups of related sequences that share a common ancestor. To detect the clade of a sequence, we use two different methods: Pangolin and Nextclade. For more information see here.

Detecting clade (Nextclade)

Nextclade assigns sequences to clades by placing the sequences on a phylogenetic tree annotated with clade definitions. More specifically, Nextclade assigns the clade of the nearest reference node found during the phylogenetic placement step.

First, the nextclade sars-cov-2 reference files have to be downloaded.

Output of step
```
path "ncref"
```

Command lines
```
nextclade dataset get --name 'sars-cov-2' --output-dir 'ncref'
```

Then, use these reference files along with the FASTA file to annotate the sample consensus.

Inputs of step
```
path ncref
path seq
```

Output of step
```
path "annotations.tsv"
```

Command lines
```
nextclade run --in-order --input-dataset ${ncref} --output-tsv 'annotations.tsv' ${seq}
sed -i 's/\x0D\$//' annotations.tsv
```
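
The two Nextclade steps could be chained as in the sketch below. The process names `nextcladeDataset` and `nextcladeRun` and the `params.seq` parameter are hypothetical, and the container is the Nextclade image from the list of tools given later in this document. Because the commands sit inside a double-quoted Groovy string, the backslash of `\x0D` is doubled and the `$` is escaped so that both reach `sed` unchanged; this escaping is an assumption about how the command is meant to be embedded.

```nextflow
// Hypothetical parameter: the consensus FASTA produced in the previous step.
params.seq = 'sample.fa'

process nextcladeDataset {
    container 'nextstrain/nextclade:3.13.3'

    output:
    path "ncref"

    script:
    """
    nextclade dataset get --name 'sars-cov-2' --output-dir 'ncref'
    """
}

process nextcladeRun {
    container 'nextstrain/nextclade:3.13.3'

    input:
    path ncref
    path seq

    output:
    path "annotations.tsv"

    script:
    // The sed call strips a trailing carriage return from each line.
    """
    nextclade run --in-order --input-dataset ${ncref} --output-tsv 'annotations.tsv' ${seq}
    sed -i 's/\\x0D\$//' annotations.tsv
    """
}

workflow {
    nextcladeDataset()
    nextcladeRun(nextcladeDataset.out, Channel.fromPath(params.seq))
}
```

Keeping the dataset download in its own process means it runs once and its output can be reused for every consensus sequence passed to `nextcladeRun`.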

Detecting clade (Pangolin)

Pangolin will assign the most likely lineage out of all currently designated lineages. For more information see here.
- Inputs of step
```
file fa
path seq
```
- Output of step
```
file "*.csv"
```
- Command lines
```
PATH=/opt/conda/envs/pangolin/bin/:\$PATH
pangolin --usher 'sample_consensus.fa' -t 20 --outfile Pangolin_lineage_report.csv
```
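
A simplified sketch of this step as a Nextflow process is given below. It departs from the declarations above in two ways that are assumptions, not the authors' design: it takes a single input (the consensus FASTA) and passes that input to Pangolin instead of the fixed name `sample_consensus.fa`. The process name `pangolin` and the `params.seq` parameter are also hypothetical.

```nextflow
// Hypothetical parameter: the consensus FASTA produced earlier.
params.seq = 'sample.fa'

process pangolin {
    container 'evolbioinfo/pangolin:v4.3.1-v1.33-v0.3.19-v0.1.12'

    input:
    path seq

    output:
    path "*.csv"

    script:
    // \$PATH is escaped so that the shell, not Nextflow, expands it.
    """
    PATH=/opt/conda/envs/pangolin/bin/:\$PATH
    pangolin --usher ${seq} -t 20 --outfile Pangolin_lineage_report.csv
    """
}

workflow {
    pangolin(Channel.fromPath(params.seq))
}
```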

d. List of tools needed

Here is the list of tools you will need in the workflow, each with a corresponding container:

| Tool | Container |
| --- | --- |
| samtools | `evolbioinfo/samtools:v1.11` |
| iVAR | `evolbioinfo/ivar:v1.3.1` |
| Nextclade | `nextstrain/nextclade:3.13.3` |
| Pangolin | `evolbioinfo/pangolin:v4.3.1-v1.33-v0.3.19-v0.1.12` |
| bwa | `evolbioinfo/bwa:v0.7.17` |
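
One way to attach these containers to the workflow, shown here only as a sketch, is through a `nextflow.config` file placed next to the workflow script. The process names used in the selectors are hypothetical and must match the names chosen in your own workflow; the same images can alternatively be set with a `container` directive inside each process.

```nextflow
// nextflow.config (sketch)

// Run each process inside its container using Docker.
docker.enabled = true
// Alternative when Apptainer is installed instead of Docker:
// apptainer.enabled = true

process {
    // Hypothetical process names: adapt them to your workflow.
    withName: 'indexRef'     { container = 'evolbioinfo/bwa:v0.7.17' }
    withName: 'nextcladeRun' { container = 'nextstrain/nextclade:3.13.3' }
    withName: 'pangolin'     { container = 'evolbioinfo/pangolin:v4.3.1-v1.33-v0.3.19-v0.1.12' }
}
```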

e. Analysis Questions

After running the workflow, we can analyse some of its results.

Using annotations.tsv, which clade has been predicted for the sample?
Using the Pangolin lineage report (named Pangolin_lineage_report.csv by the command above), which lineage has been predicted for the sample?
What can you deduce?
What year do you estimate the sequence samples were obtained (using the clade and the Nextclade website)?

f. Correction

A link to the correction workflow will be added at the end of the tutorial.

C. Reproducibility consensus

After implementing and running the workflow described above, please send the results of the workflow (the annotations.tsv file as well as the lineage report CSV file produced by Pangolin) to george.marchment[at]universite-paris-saclay.fr.

The goal of the reproducibility consensus is to establish the extent to which the results from participants are similar, and thus how reproducible the analysis is. Once all participants have submitted their results to the organisers, the consensus results will be shared, followed by a discussion on the level of reproducibility achieved among all participants.

IV. Intended Audience, Format and Special Equipment Needs

Intended Audience
- This introductory-level tutorial is targeted at scientists with an informatics background who analyse data in their projects. Participants should have an intermediate level of proficiency in Bash (navigate a terminal, install software, manage dependencies). No prior bioinformatics or biological knowledge is needed.
Format
- The tutorial will follow a hybrid format: the instructors will be connected remotely.
Length
- The tutorial will last half a day (3 hours).
Special Equipment Needs
- Participants will need access to a terminal (Linux or Mac) and should have Apptainer/Docker and Nextflow installed prior to the tutorial. For information on installing the different software, see here.
