How to run metaGOflow

Raw data

metaGOflow takes as input shotgun sequences in .fastq format, with no particular dependency on how they were produced.

The sequence files can be provided to metaGOflow directly. Alternatively, the ENA accession id of the run of interest can be provided and metaGOflow will fetch the data automatically.
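As a small illustration of the two input modes, the sketch below tells an ENA run accession apart from a local file before anything is passed to the workflow. The helper names (is_run_accession, describe_input) are hypothetical and not part of metaGOflow; the pattern assumes run accessions of the form ERR, SRR or DRR followed by at least six digits.

```shell
# Return 0 if the argument looks like an ENA/INSDC run accession
# (ERR, SRR or DRR followed by at least six digits), 1 otherwise.
# Hypothetical helper, not part of metaGOflow.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[EDS]RR[0-9]{6,}$'
}

# Classify an input as an accession, an existing local file, or neither.
describe_input() {
    if is_run_accession "$1"; then
        echo "accession"
    elif [ -f "$1" ]; then
        echo "file"
    else
        echo "unknown"
    fi
}

describe_input ERR599171
```

A check like this only validates the shape of the identifier; it does not confirm that the run exists in ENA.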

Attention

metaGOflow is not valid for the analysis of long-read samples, e.g. Oxford Nanopore or PacBio reads.

Run metaGOflow

Assume metaGOflow will run in an HPC environment where Singularity is available, and that a conda environment has been built as shown in Installation. With the config.yml already set, a run would be executed as follows.

For the config.yml file and how to set its parameters, see the Arguments and parameters section.

#!/bin/bash
#SBATCH --partition=fat
#SBATCH --nodes=1
#SBATCH --nodelist=
#SBATCH --ntasks-per-node=40
#SBATCH --mem=
#SBATCH --mail-user=my_accountr@email.com
#SBATCH --mail-type=ALL
#SBATCH --requeue
#SBATCH --job-name="mg_run"
#SBATCH --output=metagoflow_run.output

# Activate the metaGOflow conda environment
conda activate metagoflow

# Load module
module load singularity/3.7.1

# To run an ENA run
./run_wf.sh -e ERR599171 -d my_analysis -n ERR599171 -s

The lines starting with #SBATCH are SLURM directives. SLURM is one of several widely used cluster management and job scheduling systems. In any case, make sure you follow the instructions of your own HPC.

We activate the conda environment and ensure that the computing node can use Singularity. Then we run metaGOflow by executing the run_wf.sh script. In this case, the ERR599171 sample will be fetched from ENA and the workflow will be performed using Singularity (-s). An output directory called my_analysis will be built, and the prefix of the data products will be the same as the accession id, since -n has the same value as -e.
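To make the role of each flag explicit, here is a minimal dry-run sketch that only prints the command described above instead of executing it. The function name print_run_command is hypothetical; the flags are the ones used in the example (-e, -d, -n, -s), with the accession reused as the prefix.

```shell
# Print (without executing) the run_wf.sh command for one ENA run.
# $1: ENA run accession, reused as the data product prefix (-n)
# $2: output directory (-d)
# Hypothetical helper, shown for illustration only.
print_run_command() {
    printf './run_wf.sh -e %s -d %s -n %s -s\n' "$1" "$2" "$1"
}

print_run_command ERR599171 my_analysis
```

Printing the command first can be handy when preparing several batch scripts, as each line can be reviewed before it is submitted.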

Attention

Remember to always keep the config.yml file in the root directory of the folder as downloaded from the GitHub repository.
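A simple pre-flight check along these lines can catch a misplaced config.yml before a job is queued. This is only a sketch: check_repo_root is a hypothetical helper, and it assumes that run_wf.sh sits in the same root directory as config.yml.

```shell
# Check that a directory looks like the metaGOflow repository root:
# it must contain both config.yml and the run_wf.sh launcher.
# $1: directory to check (defaults to the current directory)
# Hypothetical helper, not shipped with metaGOflow.
check_repo_root() {
    repo_dir=${1:-.}
    for required in config.yml run_wf.sh; do
        if [ ! -f "$repo_dir/$required" ]; then
            echo "missing $required in $repo_dir" >&2
            return 1
        fi
    done
    return 0
}
```

It could be chained in front of the run, for example check_repo_root . && ./run_wf.sh -e ERR599171 -d my_analysis -n ERR599171 -s.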

In case an HPC is not used, then the SLURM commands or any similar ones are not required.

Attention

metaGOflow builds several intermediate files that are, by default, removed once the run completes. However, depending on the sample’s size, it may require more than 1 TB of storage while running.
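Given that storage requirement, it may be worth checking the free space on the target filesystem before starting. The sketch below uses the POSIX df output format; the helper names and the 1 TiB threshold are illustrative assumptions, not metaGOflow settings.

```shell
# Print the space available (in KiB) on the filesystem holding a path.
# df -P guarantees one line per filesystem; column 4 is "Available".
available_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least $2 KiB are free at path $1.
has_free_space() {
    [ "$(available_kib "$1")" -ge "$2" ]
}

# 1 TiB expressed in KiB (illustrative threshold, adjust to your samples)
ONE_TIB_KIB=$((1024 * 1024 * 1024))

if has_free_space . "$ONE_TIB_KIB"; then
    echo "at least 1 TiB free in the current directory"
else
    echo "less than 1 TiB free in the current directory" >&2
fi
```

On clusters with per-user quotas, df may overstate what is actually usable, so the quota tools of the HPC remain the authoritative source.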

Output / data products

Depending on the steps requested, metaGOflow returns a series of data products. In all cases, the main output is a .zip file containing the RO-Crate produced.

In the root of the output folder there are 4 data products:

| Data product | Description |
| --- | --- |
| results | Folder with the metaGOflow findings |
| ro-crate-metadata.json | JSON-LD file describing the structure of the RO-Crate |
| config.yml | metaGOflow configuration file |
| my_prefix.yml | Extended configuration file automatically produced |
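The four top-level data products listed above can be checked after a run with a short script. This is a sketch under assumptions: check_data_products is a hypothetical helper, the first argument is the output folder, and the second is the prefix that was passed with -n.

```shell
# Report which of the four top-level data products are present.
# $1: output directory
# $2: prefix of the data products (the value given with -n)
# Returns 0 only if all four items exist. Hypothetical helper.
check_data_products() {
    out_dir=$1
    prefix=$2
    missing=0
    for item in results ro-crate-metadata.json config.yml "$prefix.yml"; do
        if [ -e "$out_dir/$item" ]; then
            echo "found   $item"
        else
            echo "missing $item"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For the run shown earlier, the call would be check_data_products my_analysis ERR599171.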

If the -b flag was used to keep the tmp folder, a folder with that name is also present.

The data products of the qc_and_merge step can be found in the root of the results directory. The output of the assembly step (final.contigs.fa) is found in the same place, if that step was requested.

| Data product | Description |
| --- | --- |
| *_1.fastq.trimmed.fasta | Filtered .fastq file of the forward (R1) reads |
| *_2.fastq.trimmed.fasta | Filtered .fastq file of the reverse (R2) reads |
| *_1.fastq.trimmed.qc_summary | Summary with statistics of the forward (R1) reads |
| *_2.fastq.trimmed.qc_summary | Summary with statistics of the reverse (R2) reads |
| *.merged_CDS.faa | Amino acid coding sequences |
| *.merged_CDS.ffn | Nucleotide coding sequences |
| *.merged.cmsearch.all.tblout.deoverlapped | Sequence hits against covariance model databases |
| *.merged.fasta | Merged filtered sequences |
| *.merged.motus.tsv | mOTUs along with their taxonomic assignment and their abundance |
| *.merged.qc_summary | Quality control (QC) summary of the merged sequences |
| *.merged.unfiltered_fasta | Merged sequences with clean headers |
| fastp.html | FASTP analysis of raw sequence data |
| final.contigs.fa | FASTA formatted contig sequences |
| RNA-counts | Numbers of RNAs counted |

The taxonomic inventory related data products can be found in a subfolder inside the results folder called taxonomy-summary.

| Data product | Description |
| --- | --- |
| LSU | Folder with data products based on the large ribosomal subunit |
| *.merged_LSU.fasta.mseq.gz | LSU rRNA sequences used for taxonomic identification |
| *.merged_LSU.fasta.mseq_hdf5.biom | OTUs and taxonomic assignments for LSU rRNA (hdf5 formatted BIOM) |
| *.merged_LSU.fasta.mseq_json.biom | OTUs and taxonomic assignments for LSU rRNA (json formatted BIOM) |
| *.merged_LSU.fasta.mseq.tsv | Tab-separated formatted taxon counts for LSU rRNA sequences |
| *.merged_LSU.fasta.mseq.txt | Text-based taxon counts for LSU rRNA sequences |
| krona.html | Interactive krona charts for LSU rRNA taxonomic inventory |
| SSU | Folder with data products based on the small ribosomal subunit |
| *.merged_SSU.fasta.mseq.gz | SSU rRNA sequences used for taxonomic identification |
| *.merged_SSU.fasta.mseq_hdf5.biom | OTUs and taxonomic assignments for SSU rRNA (hdf5 formatted BIOM) |
| *.merged_SSU.fasta.mseq_json.biom | OTUs and taxonomic assignments for SSU rRNA (json formatted BIOM) |
| *.merged_SSU.fasta.mseq.tsv | Tab-separated formatted taxon counts for SSU rRNA sequences |
| *.merged_SSU.fasta.mseq.txt | Text-based taxon counts for SSU rRNA sequences |
| krona.html | Interactive krona charts for SSU rRNA taxonomic inventory |

Likewise, the data products of the functional annotation step can be found in the functional-annotation subfolder, which includes:

| Data product | Description |
| --- | --- |
| *.merged_CDS.I5.tsv | .chunks |
| *.merged_CDS.I5.tsv.gz | Merged contigs CDS I5 summary |
| *.merged.hmm.tsv.gz | Merged contigs HMM summary |
| *.merged.summary.go | Gene Ontology annotation summary |
| *.merged.summary.go_slim | GO slim annotation summary |
| *.merged.summary.ips | InterProScan annotation summary |
| *.merged.summary.ko | KO annotation summary |
| *.merged.summary.pfam | Pfam annotation summary |
| *.merged.emapper.summary.eggnog | eggNOG annotation summary |
| stats | Folder containing files with statistics on each annotation approach |
| go.stats | Gene Ontology (GO) annotation summary statistics |
| interproscan.stats | InterProScan annotation summary statistics |
| ko.stats | KEGG Orthology (KO) annotation summary statistics |
| orf.stats | Open Reading Frame (ORF) annotation summary statistics |
| pfam.stats | Pfam annotation summary statistics |

Last, the results folder also includes a subfolder called sequence-categorisation, with the reads assigned to the various categories.

sequence-categorisation

| Data product | Description |
| --- | --- |
| 5_8S.fa.gz | 5.8S ribosomal RNA sequences |
| alpha_tmRNA.RF01849.fasta.gz | Predicted Alphaproteobacteria transfer-messenger RNA (RF01849) |
| Bacteria_large_SRP.RF01854.fasta.gz | Predicted Bacterial large signal recognition particle RNA (RF01854) |
| Bacteria_small_SRP.RF00169.fasta.gz | Predicted Bacterial small signal recognition particle RNA (RF00169) |
| cyano_tmRNA.RF01851.fasta.gz | Predicted Cyanobacteria transfer-messenger RNA (RF01851) |
| LSU_rRNA_archaea.RF02540.fa.gz | Predicted Archaeal large subunit ribosomal RNA (RF02540) |
| LSU_rRNA_bacteria.RF02541.fa.gz | Predicted Bacterial large subunit ribosomal RNA (RF02541) |
| LSU_rRNA_eukarya.RF02543.fa.gz | Predicted Eukaryotic large subunit ribosomal RNA (RF02543) |
| RNaseP_bact_a.RF00010.fasta.gz | Predicted Bacterial RNase P class A (RF00010) |
| SSU_rRNA_archaea.RF01959.fa.gz | Predicted Archaeal small subunit ribosomal RNA (RF01959) |
| SSU_rRNA_bacteria.RF00177.fa.gz | Predicted Bacterial small subunit ribosomal RNA (RF00177) |
| SSU_rRNA_eukarya.RF01960.fa.gz | Predicted Eukaryotic small subunit ribosomal RNA (RF01960) |
| tmRNA.RF00023.fasta.gz | Predicted transfer-messenger RNA (RF00023) |
| tRNA.RF00005.fasta.gz | Predicted transfer RNA (RF00005) |
| tRNA-Sec.RF01852.fasta.gz | Predicted Selenocysteine transfer RNA (RF01852) |
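For a quick overview of how many sequences fell into each category, the gzipped files can be scanned without unpacking them to disk. The sketch below assumes these files are gzip-compressed FASTA, where each record starts with a ">" header line; the helper names are hypothetical.

```shell
# Count the FASTA records (lines starting with ">") in one gzipped file.
count_gz_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Print "<file name><TAB><record count>" for every *.fa.gz and
# *.fasta.gz file found in the directory given as $1.
summarise_categories() {
    for f in "$1"/*.fa.gz "$1"/*.fasta.gz; do
        [ -f "$f" ] || continue   # skip patterns that matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(count_gz_records "$f")"
    done
}
```

Assuming the output folder of the earlier example, the call would be summarise_categories my_analysis/results/sequence-categorisation.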