How to run metaGOflow

Raw data

metaGOflow takes as input shotgun sequences in .fastq format, without any particular dependency on how they were produced.

The sequence files can be provided to metaGOflow directly. Alternatively, the ENA accession ID of the run of interest can be provided, and metaGOflow will fetch the data automatically.
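
When an accession ID is used, a quick format check can catch a study or sample accession passed by mistake. The sketch below is only an illustration: the function name is hypothetical, and it assumes that run accessions follow the usual ERR/SRR/DRR prefix plus at least six digits. It does not confirm that the run actually exists in ENA.

```shell
# Sketch: sanity-check an accession before handing it to metaGOflow.
# Assumption: run accessions are ERR, SRR or DRR followed by six or more digits.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[EDS]RR[0-9]{6,}$'
}

# ERR599171 is the run used in this guide; PRJEB12345 is a made-up
# study-style accession used only to show a rejection.
for acc in ERR599171 PRJEB12345; do
    if is_run_accession "$acc"; then
        echo "$acc: looks like a run accession"
    else
        echo "$acc: not a run accession"
    fi
done
```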


metaGOflow is not suitable for the analysis of long-read samples, e.g. Oxford Nanopore or PacBio reads.

Run metaGOflow

Assuming metaGOflow is to run in an HPC environment where Singularity is available, and that we have built a conda environment as shown in Installation, let’s break down how we would execute a run once the config.yml file is set.

For the config.yml file and how to set its parameters, see the Arguments and parameters section.

#!/bin/bash
#SBATCH --partition=fat
#SBATCH --nodes=1
#SBATCH --nodelist=
#SBATCH --ntasks-per-node=40
#SBATCH --mem=
#SBATCH --mail-type=ALL
#SBATCH --requeue
#SBATCH --job-name="mg_run"
#SBATCH --output=metagoflow_run.output

# Activate the metaGOflow conda environment
conda activate metagoflow

# Load module
module load singularity/3.7.1

# Run the workflow on an ENA run accession
./ -e ERR599171 -d my_analysis -n ERR599171 -s

The first lines, starting with #SBATCH, are SLURM directives. SLURM is one of several widely used cluster management and job scheduling systems. In any case, make sure you follow the instructions of your own HPC.

We activate the conda environment and ensure that the computing node can use Singularity. Then we run metaGOflow by executing the script. In this case, the ERR599171 run will be fetched from ENA and the workflow will be performed using Singularity (-s). An output directory called my_analysis will be created, and the prefix of the data products will be the same as the accession ID, since -n has the same value as -e.
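
Because a job may wait in the queue for a long time, it can help to confirm at the top of the job script that the tools it relies on are reachable. A minimal sketch follows; the helper name have_cmd is hypothetical, and the check only tells you that a command is on the PATH (or defined as a shell function), not that it is the right version.

```shell
# Sketch: report whether the tools used by the job script can be found.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in conda singularity; do
    if have_cmd "$tool"; then
        echo "$tool: available"
    else
        echo "$tool: not found on PATH" >&2
    fi
done
```

On a cluster that uses environment modules, this check would go after the module load line.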


Remember to always keep the config.yml file in the root directory of the folder as downloaded from the GitHub repository.
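
A small guard in the job script can make this requirement explicit. This is a sketch under assumptions: the function name and the repo_root variable are hypothetical, and repo_root defaults to the current directory, so adjust it to wherever the repository was downloaded.

```shell
# Sketch: check that config.yml sits in the root of the downloaded folder.
config_in_root() {
    [ -f "$1/config.yml" ]
}

# Assumption: the job is submitted from the repository root unless
# repo_root is set to another location.
repo_root=${repo_root:-.}

if config_in_root "$repo_root"; then
    echo "config.yml found in $repo_root"
else
    echo "config.yml missing from $repo_root" >&2
fi
```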

If an HPC is not used, the SLURM commands, or any similar ones, are not required.


metaGOflow builds several intermediate files that are, by default, removed once the run is completed. However, depending on the sample’s size, it may require more than 1 TB of storage while running.
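
Given that storage footprint, checking the free space of the working file system before submitting a large sample can avoid a failed run. The following is a minimal sketch: the function names are hypothetical, the threshold simply mirrors the 1 TB figure above, and the directory checked (the current one) is an assumption about where the run will write.

```shell
# Sketch: compare the free space of a directory's file system to a threshold.
free_kb() {
    # With POSIX output (-P) in KiB (-k), the 4th field of line 2 is "Available"
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

has_free_space() {
    # $1: directory to check, $2: required free space in KiB
    [ "$(free_kb "$1")" -ge "$2" ]
}

required_kb=$((1024 * 1024 * 1024))   # 1 TiB expressed in KiB

if has_free_space . "$required_kb"; then
    echo "at least 1 TiB free in the working directory"
else
    echo "less than 1 TiB free: a large sample may run out of space" >&2
fi
```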

Output / data products

Depending on the steps it was asked to perform, metaGOflow returns a series of data products. In all cases, the main output is a .zip file containing the RO-Crate produced.

In the root of the output folder there are 4 data products:

Data product



Folder with the metaGOflow findings


JSON-LD file describing the structure of the RO-Crate


metaGOflow configuration file


Extended configuration file automatically produced

If the -b flag was used to keep the tmp folder, a folder with that name will also be present.

The data products of the qc_and_merge step can be found in the root of the results directory. The output of the assembly step (final.contigs.fa) is found in the same place, if that step was requested.

Data product



Filtered .fastq file of the forward (R1) reads


Filtered .fastq file of the reverse (R2) reads


Summary with statistics of the forward (R1) reads


Summary with statistics of the reverse (R2) reads


Amino acid coding sequences


Nucleotide coding sequences


Sequence hits against covariance model databases


Merged filtered sequences


mOTUs along with their taxonomic assignment and their abundance


Quality control (QC) summary of the merged sequences


Merged sequences with clean headers


FASTP analysis of raw sequence data


FASTA formatted contig sequences


Numbers of RNAs counted

The data products related to the taxonomic inventory can be found in a subfolder of the results folder called taxonomy-summary.

Data product



Folder with data products based on the large ribosomal subunit


LSU rRNA sequences used for taxonomic identification


OTUs and taxonomic assignments for LSU rRNA (HDF5-formatted BIOM)


OTUs and taxonomic assignments for LSU rRNA (JSON-formatted BIOM)


Tab-separated formatted taxon counts for LSU rRNA sequences


Text-based taxon counts for LSU rRNA sequences


Interactive Krona charts for LSU rRNA taxonomic inventory


Folder with data products based on the small ribosomal subunit


SSU rRNA sequences used for taxonomic identification


OTUs and taxonomic assignments for SSU rRNA (HDF5-formatted BIOM)


OTUs and taxonomic assignments for SSU rRNA (JSON-formatted BIOM)


Tab-separated formatted taxon counts for SSU rRNA sequences


Text-based taxon counts for SSU rRNA sequences


Interactive Krona charts for SSU rRNA taxonomic inventory

Likewise, the data products of the functional annotation step can be found in the functional-annotation subfolder including:

Data product





Merged contigs CDS I5 summary


Merged contigs HMM summary


Gene Ontology annotation summary


GO slim annotation summary


InterProScan annotation summary


KO annotation summary


Pfam annotation summary


eggNOG annotation summary


Folder containing files with statistics on each annotation approach


Gene Ontology (GO) annotation summary statistics


InterProScan annotation summary statistics


KEGG Orthology (KO) annotation summary statistics


Open Reading Frame (ORF) annotation summary statistics


Pfam annotation summary statistics

Finally, the results folder also contains a subfolder called sequence-categorisation, with information about specific reads assigned to various categories.


Data product



5.8S ribosomal RNA sequences


Predicted Alphaproteobacteria transfer-messenger RNA (RF01849)


Predicted Bacterial large signal recognition particle RNA (RF01854)


Predicted Bacterial small signal recognition particle RNA (RF00169)


Predicted Cyanobacteria transfer-messenger RNA (RF01851)


Predicted Archaeal large subunit ribosomal RNA (RF02540)


Predicted Bacterial large subunit ribosomal RNA (RF02541)


Predicted Eukaryotic large subunit ribosomal RNA (RF02543)


Predicted Bacterial RNase P class A (RF00010)


Predicted Archaeal small subunit ribosomal RNA (RF01959)


Predicted Bacterial small subunit ribosomal RNA (RF00177)


Predicted Eukaryotic small subunit ribosomal RNA (RF01960)


Predicted transfer-messenger RNA (RF00023)


Predicted transfer RNA (RF00005)


Predicted Selenocysteine transfer RNA (RF01852)
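
To get a quick overview of how many sequences fall into each category, the records of a FASTA file can be counted by their header lines. This is a sketch rather than part of metaGOflow: the function name is hypothetical, the demo file is generated on the fly, and it assumes uncompressed FASTA input (a compressed file would need to be decompressed first).

```shell
# Sketch: count the records of a FASTA file.
count_fasta_records() {
    # Every FASTA record starts with a header line beginning with '>'
    grep -c '^>' "$1"
}

# Self-contained demo on a throwaway two-record file
demo=$(mktemp)
printf '>read_1\nACGU\n>read_2\nGGCA\n' > "$demo"
echo "records: $(count_fasta_records "$demo")"   # prints "records: 2"
rm -f "$demo"
```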