How to run metaGOflow
Raw data
metaGOflow takes as input shotgun sequences in .fastq format, with no particular dependency on how they were produced.
The sequence files can be provided to metaGOflow directly, or an ENA accession ID of the run of interest can be given and metaGOflow will fetch the data automatically.
Attention
metaGOflow is not suited to the analysis of long-read samples, e.g. Oxford Nanopore or PacBio reads.
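Before starting a run on local files, it can help to confirm that an input really is a well-formed .fastq file. The following is a minimal sketch, not part of metaGOflow: `fastq_reads` is a hypothetical helper that assumes an uncompressed file with standard 4-line records (no wrapped sequence lines).

```shell
# Sketch: sanity-check an uncompressed FASTQ file before a run.
# fastq_reads is a hypothetical helper, not part of metaGOflow.
# It prints the number of reads, or fails if the file is empty or is not
# made of 4-line records whose 1st line starts with "@" and 3rd with "+".
fastq_reads() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        END {
            if (bad || NR == 0 || NR % 4 != 0) exit 1
            print NR / 4
        }
    ' "$1"
}
```

For example, `fastq_reads my_sample_R1.fastq` (a hypothetical file name) prints the read count when the file passes and returns a non-zero status otherwise. Running it on both the forward and the reverse file and comparing the two counts is a cheap way to spot truncated uploads.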
Run metaGOflow
Assuming metaGOflow is going to run in an HPC environment where Singularity is available, and that we have built a conda environment as shown in Installation, let’s break down how we would execute a run once the config.yml is set.
For the config.yml file and how to set its parameters, see the Arguments and parameters section.
#!/bin/bash
#SBATCH --partition=fat
#SBATCH --nodes=1
#SBATCH --nodelist=
#SBATCH --ntasks-per-node=40
#SBATCH --mem=
#SBATCH --mail-user=my_accountr@email.com
#SBATCH --mail-type=ALL
#SBATCH --requeue
#SBATCH --job-name="mg_run"
#SBATCH --output=metagoflow_run.output
# Activate the metaGOflow conda environment
conda activate metagoflow
# Load module
module load singularity/3.7.1
# To run an ENA run
./run_wf.sh -e ERR599171 -d my_analysis -n ERR599171 -s
The first lines, starting with #SBATCH, are SLURM directives.
SLURM is one of several widely used cluster management and job scheduling systems.
In any case, make sure you follow the instructions of your own HPC.
We activate the conda environment and ensure that the computing node can use Singularity.
Then we run metaGOflow by executing the run_wf.sh script.
In this case, the ERR599171 run will be fetched from ENA and the workflow will be performed using Singularity (-s).
An output directory called my_analysis will be created, and the prefix of the data products will be the same as the accession ID, since -n has the same value as -e.
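The way the flags fit together can be sketched as a small helper that builds the command line for one ENA run. This is only an illustration: `metagoflow_cmd` is a hypothetical function, and it uses nothing beyond the flags that appear in the job script above (-e, -d, -n, -s).

```shell
# Sketch: build the run_wf.sh command line for one ENA run accession.
# metagoflow_cmd is a hypothetical helper, not part of metaGOflow.
#   $1: ENA run accession (passed to -e and reused as the prefix for -n)
#   $2: name of the output directory (passed to -d)
metagoflow_cmd() {
    # -s requests Singularity, as in the job script above.
    printf './run_wf.sh -e %s -d %s -n %s -s\n' "$1" "$2" "$1"
}

# Print the command for inspection; nothing is executed here.
metagoflow_cmd ERR599171 my_analysis
```

Printing the command rather than running it keeps the sketch safe to try anywhere. Reusing the accession for -n is only one choice; any other prefix could be passed instead.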
Attention
Remember to always keep the config.yml file in the root directory of the folder as downloaded from the GitHub repository.
If an HPC is not used, the SLURM directives (or any similar ones) are not required.
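A quick pre-flight check along these lines can catch a misplaced configuration file before a long run starts. It is a sketch under one assumption: that run_wf.sh sits in the same root directory as config.yml, as the `./run_wf.sh` call above suggests. `check_layout` is a hypothetical helper, not part of metaGOflow.

```shell
# Sketch: verify that the working directory looks like the repository root.
# check_layout is a hypothetical helper, not part of metaGOflow.
#   $1: directory to check (defaults to the current directory)
check_layout() {
    dir="${1:-.}"
    # config.yml must stay in the root of the downloaded folder.
    [ -f "$dir/config.yml" ] || { echo "missing $dir/config.yml" >&2; return 1; }
    # Assumption: the launcher script lives next to config.yml.
    [ -f "$dir/run_wf.sh" ] || { echo "missing $dir/run_wf.sh" >&2; return 1; }
}
```

Outside an HPC, the same check can simply precede the call from the job script, for example `check_layout . && ./run_wf.sh -e ERR599171 -d my_analysis -n ERR599171 -s`.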
Attention
metaGOflow builds several intermediate files that are, by default, removed once the run completes.
However, depending on the sample’s size, it may require more than 1 TB of storage while running.
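Given that storage requirement, it may be worth checking the free space of the target filesystem before submitting a job. The sketch below relies only on the POSIX `df -P` output format; `has_free_space` is a hypothetical helper, and the threshold is left as a parameter because the actual need depends on the sample.

```shell
# Sketch: succeed only if the filesystem holding a directory has enough free space.
# has_free_space is a hypothetical helper, not part of metaGOflow.
#   $1: directory where the run will write its output
#   $2: minimum free space required, in 1024-byte blocks
has_free_space() {
    # With -P, line 2 of the df output describes the filesystem;
    # its 4th field is the available space in the units set by -k.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$2" ]
}
```

For instance, `has_free_space . 1073741824 || echo "less than 1 TiB free" >&2` warns when the current filesystem has under 1 TiB available.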
Output / data products
Depending on the steps requested, metaGOflow returns a series of data products.
In all cases, the main output is a .zip file containing the RO-Crate produced.
In the root of the output folder there are 4 data products:
| Data product | Description |
|---|---|
| | Folder with the metaGOflow findings |
| | JSON-LD file describing the structure of the RO-Crate |
| | metaGOflow configuration file |
| | Extended configuration file automatically produced |
If the -b flag was used to keep the tmp folder, a folder with that name will also be present.
The data products of the qc_and_merge step can be found in the root of the results directory.
If the assembly step was requested, its output (final.contigs.fa) will be found in the same place.
| Data product | Description |
|---|---|
| | Filtered .fastq file of the forward (R1) reads |
| | Filtered .fastq file of the reverse (R2) reads |
| | Summary with statistics of the forward (R1) reads |
| | Summary with statistics of the reverse (R2) reads |
| | Amino acid coding sequences |
| | Nucleotide coding sequences |
| | Sequence hits against covariance model databases |
| | Merged filtered sequences |
| | mOTUs along with their taxonomic assignment and their abundance |
| | Quality control (QC) summary of the merged sequences |
| | Merged sequences with clean headers |
| | FASTP analysis of raw sequence data |
| | FASTA formatted contig sequences |
| | Numbers of RNAs counted |
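As a first look at the assembly output, the number of contigs can be counted directly from the FASTA file, since every record starts with a ">" header line. This is a minimal sketch: `count_fasta_records` is a hypothetical helper, and the example path below assumes the output directory is my_analysis with final.contigs.fa inside its results folder.

```shell
# Sketch: count the records of a FASTA file by counting its header lines.
# count_fasta_records is a hypothetical helper, not part of metaGOflow.
#   $1: path to a FASTA file, e.g. the final.contigs.fa of the assembly step
count_fasta_records() {
    # grep -c prints the number of lines that start with ">".
    grep -c '^>' "$1"
}
```

For example, `count_fasta_records my_analysis/results/final.contigs.fa` prints the contig count. Note that grep returns a non-zero status when the count is 0, which is convenient for detecting an empty assembly in a script.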
The data products related to the taxonomic inventory can be found in a subfolder of the results folder called taxonomy-summary.
| Data product | Description |
|---|---|
| | Folder with data products based on the large ribosomal subunit |
| | LSU rRNA sequences used for taxonomic identification |
| | OTUs and taxonomic assignments for LSU rRNA (hdf5 formatted BIOM) |
| | OTUs and taxonomic assignments for LSU rRNA (json formatted BIOM) |
| | Tab-separated formatted taxon counts for LSU rRNA sequences |
| | Text-based taxon counts for LSU rRNA sequences |
| | Interactive krona charts for LSU rRNA taxonomic inventory |
| | Folder with data products based on the small ribosomal subunit |
| | SSU rRNA sequences used for taxonomic identification |
| | OTUs and taxonomic assignments for SSU rRNA (hdf5 formatted BIOM) |
| | OTUs and taxonomic assignments for SSU rRNA (json formatted BIOM) |
| | Tab-separated formatted taxon counts for SSU rRNA sequences |
| | Text-based taxon counts for SSU rRNA sequences |
| | Interactive krona charts for SSU rRNA taxonomic inventory |
Likewise, the data products of the functional annotation step can be found in the functional-annotation subfolder, including:
| Data product | Description |
|---|---|
| | .chunks |
| | Merged contigs CDS I5 summary |
| | Merged contigs HMM summary |
| | Gene Ontology annotation summary |
| | GO slim annotation summary |
| | InterProScan annotation summary |
| | KO annotation summary |
| | Pfam annotation summary |
| | eggNOG annotation summary |
| | Folder containing files with statistics on each annotation approach |
| | Gene Ontology (GO) annotation summary statistics |
| | InterProScan annotation summary statistics |
| | KEGG Orthology (KO) annotation summary statistics |
| | Open Reading Frame (ORF) annotation summary statistics |
| | Pfam annotation summary statistics |
Last, a subfolder called sequence-categorisation is also part of the results folder, including information about specific reads assigned to various categories.
| Data product | Description |
|---|---|
| | 5.8S ribosomal RNA sequences |
| | Predicted Alphaproteobacteria transfer-messenger RNA (RF01849) |
| | Predicted Bacterial large signal recognition particle RNA (RF01854) |
| | Predicted Bacterial small signal recognition particle RNA (RF00169) |
| | Predicted Cyanobacteria transfer-messenger RNA (RF01851) |
| | Predicted Archaeal large subunit ribosomal RNA (RF02540) |
| | Predicted Bacterial large subunit ribosomal RNA (RF02541) |
| | Predicted Eukaryotic large subunit ribosomal RNA (RF02543) |
| | Predicted Bacterial RNase P class A (RF00010) |
| | Predicted Archaeal small subunit ribosomal RNA (RF01959) |
| | Predicted Bacterial small subunit ribosomal RNA (RF00177) |
| | Predicted Eukaryotic small subunit ribosomal RNA (RF01960) |
| | Predicted transfer-messenger RNA (RF00023) |
| | Predicted transfer RNA (RF00005) |
| | Predicted Selenocysteine transfer RNA (RF01852) |