Computational Reproducibility With Scientific Workflows: Analysing viral genomes with Nextflow

An ACM REP 2025 tutorial

George Marchment, Sarah Cohen-Boulakia and Frédéric Lemoine

I. Abstract

In an era of large datasets and complex scientific analyses, ensuring the reproducibility of data analyses has become paramount. Workflow management systems have emerged as a key solution to this challenge. By managing:

They significantly facilitate workflow development compared to historical practices (e.g., simple bash scripts), all while ensuring scalability and a high level of reproducibility. However, while they are becoming more popular, workflow management systems have not yet gained wide adoption within the scientific community, largely due to established practices and the perceived steep learning curve associated with their use.

This tutorial aims to demonstrate the critical role of workflow management systems in implementing reproducible data analyses, with an emphasis on their capacity to encapsulate heterogeneous code, manage software environments, scale with the data size, and leverage heterogeneous computational resources efficiently. To do so, we will use the Nextflow workflow system and a viral genome sequence reconstruction pipeline as a use case. This will demonstrate the fundamentals of Nextflow and illustrate how it can be used to easily implement, execute, and share a simple workflow.

II. Learning Objectives and outline

Key learning outcomes include

The tutorial will be organized in three phases.

  1. We will start with a short lecture presenting the main challenges in implementing reproducible data analyses, illustrated with a motivating example. We will then introduce how workflow management systems address these challenges: we will present the Nextflow framework and demonstrate how it can be used to solve the motivating example. Finally, we will introduce the analysis pipeline to implement as a workflow. It takes several viral genome sequencing datasets through multiple analysis steps in order to reconstruct full viral genomes with their annotations.
  2. The second part of the tutorial will consist of a practical session, during which the participants will implement the analysis using the Nextflow workflow management system. The session will be highly interactive, featuring coding demonstrations and structured group discussions, which will allow participants to apply their knowledge, with organizers providing support and answering questions.
  3. The tutorial will conclude with a discussion and a survey on the challenges encountered. We will then build a reproducibility consensus: as a group, we will compare the results provided by the participants and assess the level of reproducibility achieved.

By the end of the tutorial, participants will have a solid foundation in workflow management systems and be capable of designing and implementing reproducible data analysis workflows, aligning with the broader goals and themes of ACM REP 2025.

III. Tutorial Material

A. Lecture

The lecture slides can be found here.

B. Practical Session

a. Objective

The aim of this practical session is to create a Nextflow workflow to analyse a SARS-CoV-2 sequencing dataset. The objectives are:

  1. Infer the full sequence of the virus.
  2. Detect the clade (alpha, beta, etc.).

To do so, we will start from a sample that has been sequenced on an Illumina sequencer, and we will run the following steps:

  1. Map the reads onto a reference genome.
  2. Build the consensus sequence from the mapped reads.
  3. Identify the clade of the virus based on the consensus sequence.

Determining the genetic sequence and clade of a virus matters for several reasons. Overall, detailed genetic analysis of viruses improves our ability to respond to outbreaks.
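To make the second step more concrete, here is a minimal Python sketch of what "building a consensus from mapped reads" means. It is a toy illustration, not the method used by the tools of this tutorial: the function name `build_consensus`, the `min_depth` parameter and the example sequences are all hypothetical, and real consensus callers also take base quality and frequency thresholds into account.

```python
from collections import Counter


def build_consensus(reference, mapped_reads, min_depth=1):
    """Majority-vote consensus from reads already placed on the reference.

    mapped_reads is a list of (start, sequence) pairs, with a 0-based start.
    Positions covered by fewer than min_depth reads are reported as 'N'.
    """
    # One base counter per reference position.
    columns = [Counter() for _ in reference]
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    consensus = []
    for counts in columns:
        depth = sum(counts.values())
        if depth < min_depth:
            consensus.append("N")  # not enough coverage to call a base
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical toy data: a 10-base reference and three overlapping reads.
reference = "ACGTACGTAC"
reads = [(0, "ACGTT"), (2, "GTTCG"), (3, "TTCGT")]
consensus = build_consensus(reference, reads)
print(consensus)  # ACGTTCGTNN
```

In this toy example, position 4 differs from the reference (A becomes T) because all three reads covering it agree, and the last two positions are masked with N because no read covers them.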

Figure: SARS-CoV-2 phylogeny from nextstrain.org. Data updated 2025-06-09.

b. Input data

The input data consists of:

c. Detailed steps

The resulting workflow should look like this:

Workflow graph
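To illustrate why the dependency structure of a workflow matters, the following minimal Python sketch (not Nextflow code) orders the steps of this practical session from their dependencies. The step names are hypothetical, and the dependency structure is an assumption based on the three steps described in this section, with Pangolin and Nextclade both consuming the consensus sequence. Nextflow derives this kind of ordering itself from how the outputs of one process feed the inputs of another.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each (hypothetical) step maps to the set of steps it depends on.
dependencies = {
    "map_reads": set(),
    "build_consensus": {"map_reads"},
    "pangolin": {"build_consensus"},
    "nextclade": {"build_consensus"},
}

# A valid execution order: every step appears after its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two clade detection steps do not depend on each other, so a workflow system is free to run them in parallel once the consensus sequence is available.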

1. Mapping the reads onto a reference genome
2. Building the consensus sequence
3. Detecting the clade

Viral diversity is often broken down into clades or lineages, which are defined by specific combinations of signature mutations. Clades are groups of related sequences that share a common ancestor. To detect the clade of a sequence, we use two different tools: Pangolin and Nextclade. For more information see here.
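The idea of signature mutations can be sketched in a few lines of Python. This is only a toy illustration of the principle: Pangolin and Nextclade rely on more elaborate methods, and the clade names and mutations below are made up for the example and do not correspond to real SARS-CoV-2 clades. The function `assign_clade` is hypothetical.

```python
def assign_clade(sample_mutations, clade_signatures):
    """Return the clade whose signature mutations are all found in the sample.

    When several clades match, the one with the largest signature
    (the most specific one) is preferred. Returns None if nothing matches.
    """
    best = None
    for clade, signature in clade_signatures.items():
        if signature <= sample_mutations:  # signature is a subset of the sample
            if best is None or len(signature) > len(clade_signatures[best]):
                best = clade
    return best


# Made-up signatures, for illustration only.
signatures = {
    "clade_X": {"A10T", "C25G"},
    "clade_Y": {"A10T", "C25G", "G40A"},
    "clade_Z": {"T5C"},
}
sample = {"A10T", "C25G", "G40A", "C77T"}
print(assign_clade(sample, signatures))  # clade_Y
```

Here the sample carries the signatures of both clade_X and clade_Y, and the more specific clade_Y is reported. The extra mutation C77T does not prevent the assignment.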

d. List of tools needed

Here is the list of tools you will need in the workflow, each with a corresponding container:

Tool       Container
samtools   evolbioinfo/samtools:v1.11
iVAR       evolbioinfo/ivar:v1.3.1
Nextclade  nextstrain/nextclade:3.13.3
Pangolin   evolbioinfo/pangolin:v4.3.1-v1.33-v0.3.19-v0.1.12
bwa        evolbioinfo/bwa:v0.7.17

e. Analysis Questions

After running the workflow, we can analyse some of its results.

f. Correction

A link to the correction workflow will be added at the end of the tutorial.

C. Reproducibility consensus

After implementing and running the workflow described above, please send its results (the annotations.tsv and lineage_report.csv files) to george.marchment[at]universite-paris-saclay.fr.

The goal of the reproducibility consensus is to establish how similar the participants' results are, and thus how reproducible the analysis is. Once all participants have submitted their results to the organisers, the consensus results will be shared, followed by a discussion on the level of reproducibility achieved among all participants.
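As an illustration of how such a comparison could be carried out, here is a minimal Python sketch that computes, for each sample, the most frequent lineage across submissions and the fraction of participants who obtained it. It is not the organisers' actual procedure. It assumes that each lineage_report.csv contains the `taxon` and `lineage` columns of a Pangolin report. The participant names, sample name and lineage values are placeholders, and the functions `read_lineages` and `lineage_consensus` are hypothetical.

```python
import csv
import io
from collections import Counter


def read_lineages(csv_text):
    """Map each taxon to its lineage from the text of a lineage_report.csv."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["taxon"]: row["lineage"] for row in reader}


def lineage_consensus(reports):
    """For each taxon, return (most common lineage, fraction of agreeing reports)."""
    votes = {}
    for lineages in reports.values():
        for taxon, lineage in lineages.items():
            votes.setdefault(taxon, Counter())[lineage] += 1

    result = {}
    for taxon, counts in votes.items():
        lineage, n = counts.most_common(1)[0]
        result[taxon] = (lineage, n / sum(counts.values()))
    return result


# Placeholder submissions; real ones would be read from the submitted files.
submissions = {
    "participant_1": "taxon,lineage\nsample_1,LINEAGE_A\n",
    "participant_2": "taxon,lineage\nsample_1,LINEAGE_A\n",
    "participant_3": "taxon,lineage\nsample_1,LINEAGE_B\n",
}
reports = {name: read_lineages(text) for name, text in submissions.items()}
summary = lineage_consensus(reports)
for taxon, (lineage, agreement) in summary.items():
    print(f"{taxon}: {lineage} ({agreement:.0%} agreement)")
```

An agreement of 100% for every sample would indicate that all participants obtained the same lineage assignments. Lower values point to samples worth discussing as a group.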

IV. Intended Audience, Format and Special Equipment Needs

V. Authors