Bacterial Whole Genome Analysis

Author

Anwesh

Published

February 27, 2026

Dataset details

Organism: Helicobacter pylori
SRR ID: SRR22388518
Sequencing platform: Illumina HiSeq 2500
Library type: Paired End

Raw reads can be downloaded from NCBI’s SRA using sra-tools package.

As of now, download from the following links in Windows. In the ws_2026 directory create a subdirectory called wgs_demo and then raw_reads within it. Then download read files into it.

Sync to the Linux system. In Linux terminal, type…

$ rsync -avrP /mnt/c/Users/Anwesh/Desktop/ws_2026 ./

Navigate to wgs_demo directory

$ cd ws_2026/wgs_demo

QC

Using FastQC

Activate qc env

$ mamba activate qc

Run FastQC GUI

(qc)$ fastqc

Using GUI for 1 or 2 samples is OK, but for large number of samples it is ideal to run using CLI

(qc)$ fastqc --help
(qc)$ mkdir raw_qc
(qc)$ fastqc raw_reads/*gz -o raw_qc -t 2

Filtering

Download adapters file to the files directory and copy it to the Linux system

Run BBDuk.

(qc)$ bbduk.sh in1=raw_reads/hpy_a45_R1.fastq.gz in2=raw_reads/hpy_a45_R2.fastq.gz out1=bb_out/hpy_a45_R1.fq.gz out2=bb_out/hpy_a45_R2.fq.gz ref=../files/adapters.fa k=19 mink=7 ktrim=r hdist=1 hdist2=0 trimq=20 qtrim=r tpe tbo

Re-run FastQC

(qc)$ mkdir bb_qc
(qc)$ fastqc bb_out/*gz -o bb_qc -t 2

Assembly

Using Unicycler

The -t 10 should be based on the threads your CPU supports. Check lscpu.

(qc)$ mamba deactivate
$ mamba activate unicycler
(unicycler)$ unicycler -1 bb_out/hpy_a45_R1.fq.gz -2 bb_out/hpy_a45_R2.fq.gz -o uni_out -t 10
# ~ 20 mins on WSL

Check the graph using Bandage

(unicycler)$ mamba deactivate
$ mamba activate bandage
(bandage)$ Bandage

Classify

GTDB-tk - Genome based classification - requires 140GB of reference DB to be downloaded. Recommended way

Extract 16S rRNA gene from the genome and perform BLASTn

$ mkdir barrnap_out
(prokka)$ barrnap -o barrnap_out/a45.fa < uni_out/assembly.fasta > barrnap_out/a45.gff

Get the genomes of closest hits from NCBI Genome DB and run ANI

$ mkdir -p pyani/all_genomes
$ cp -v refs/* pyani/all_genomes/
$ cp -v uni_out/assembly.fasta pyani/all_genomes/
$ cd pyani
(pyani)$ pyani-plus fastani all_genomes --database all_genomes.db --create-db --name "unicycler_stage"
(pyani)$ pyani-plus list-runs --database all_genomes.db
(pyani)$ pyani-plus export-run --database all_genomes.db --outdir fastani_out --run-id 1
(pyani)$ pyani-plus plot-run --database all_genomes.db --outdir fastani_out --run-id 1

Scaffolding

Use the closest genome for scaffolding

(ragtag)$ ragtag.py scaffold refs/A45.fna uni_out/assembly.fasta -o ragtag_out

Try MeDuSa scaffolder

Get summary of the placed and unplaced contigs

(seqkit)$ seqkit fx2tab ragtag_out/ragtag.scaffold.fasta -lngiH
$ grep 'N' ragtag_out/ragtag.scaffold.fasta -o | wc -l
# 1800 N's

Polishing

There are multiple gap filling and polishing tools, like - Pilon, POLCA, GapFiller, etc. Improvement over the existing assembly may be minimal unless you have another set of high quality reads or long reads.

Using POLCA. This required filtered reads to polish the scaffolds

(polca)$ polca.sh -a ../ragtag_out/ragtag.scaffold.fasta -r '../bb_out/hpy_a45_R1.fq.gz ../bb_out/hp
y_a45_R2.fq.gz' -t 10

Run SeqKit and check for Ns

Plasmid Detection

Using Plasmer. Requires 10GB reference DB

$ mkdir plasmer_out
(plasmer)$ Plasmer -g polca_out/ratag.scaffold.fasta.PolcaCorrected.fa -p hpy -d ~/DBs/plasmer_db -t 10 -m 99 -o plasmer_out

Discard smaller, unclassified scaffolds. Copy the final scaffolds using SeqKit to a new directory and rename the files, ,if required, rename the fasta headers

(seqkit)$ seqkit grep -f polca_out/required_headers.txt polca_out/ragtag.scaffold.fasta.PolcaCorrected.fa -o genome/a45_assembled.fa

Assembly QC

Using CheckM

(checkm)$ checkm lineage_wf -x fa genome/ checkm_out/ -t 10 -f checkm_out/results.txt --tab_table

Analysis

MLST

(mlst)$ mlst genome/a45_assembled.fa --full --outfile mlst_out/a45_mlst.txt

Annotation

(prokka)$ prokka --outdir a45 --prefix a45 genome/a45_assembled.fa

Webserver - RAST

AMR and Virulence genes

With ABRIcate

$ mkdir abricate_out
# NCBI database
(abricate)$ abricate genome/a45_assembled.fa > abricate_out/S3_amr.tab
# MEGAres database
(abricate)$ abricate --db megares genome/a45_assembled.fa > abricate_out/S3_megares.tab
# Virulence genes
(abricate)$ abricate --db vfdb genome/a45_assembled.fa > abricate_out/S3_vfdb.tab
# Trying plasmidfinder
(abricate)$ abricate --db plasmidfinder genome/a45_assembled.fa > abricate_out/S3_pf.tab

SNPs

Using Snippy. Requires filtered reads, not the genome. Reference genome is provided in GBF format.

$ mkdir snippy_out
# With ATCC43504 as reference
(snippy)$ snippy --cpus 10 --outdir snippy_out/ATCC43504 --ref refs/ATCC43504.gbff --R1 bb_out/hpy_a45_R1.fq.gz --R2 bb_out/hpy_a45_R2.fq.gz