Bacterial Whole Genome Analysis
Dataset details
- Organism: Helicobacter pylori
- SRR ID: SRR22388518
- Sequencing platform: Illumina HiSeq 2500
- Library type: Paired End
Raw reads can be downloaded from NCBI’s SRA using sra-tools package.
As of now, download from the following links in Windows. In the ws_2026 directory create a subdirectory called wgs_demo and then raw_reads within it. Then download read files into it.
Sync to the Linux system. In Linux terminal, type…
$ rsync -avrP /mnt/c/Users/Anwesh/Desktop/ws_2026 ./Navigate to wgs_demo directory
$ cd ws_2026/wgs_demoQC
Using FastQC
Activate qc env
$ mamba activate qcRun FastQC GUI
(qc)$ fastqcUsing GUI for 1 or 2 samples is OK, but for large number of samples it is ideal to run using CLI
(qc)$ fastqc --help
(qc)$ mkdir raw_qc
(qc)$ fastqc raw_reads/*gz -o raw_qc -t 2Filtering
Download adapters file to the files directory and copy it to the Linux system
Run BBDuk.
(qc)$ bbduk.sh in1=raw_reads/hpy_a45_R1.fastq.gz in2=raw_reads/hpy_a45_R2.fastq.gz out1=bb_out/hpy_a45_R1.fq.gz out2=bb_out/hpy_a45_R2.fq.gz ref=../files/adapters.fa k=19 mink=7 ktrim=r hdist=1 hdist2=0 trimq=20 qtrim=r tpe tboRe-run FastQC
(qc)$ mkdir bb_qc
(qc)$ fastqc bb_out/*gz -o bb_qc -t 2Assembly
Using Unicycler
The -t 10 should be based on the threads your CPU supports. Check lscpu.
(qc)$ mamba deactivate
$ mamba activate unicycler
(unicycler)$ unicycler -1 bb_out/hpy_a45_R1.fq.gz -2 bb_out/hpy_a45_R2.fq.gz -o uni_out -t 10
# ~ 20 mins on WSLCheck the graph using Bandage
(unicycler)$ mamba deactivate
$ mamba activate bandage
(bandage)$ BandageClassify
GTDB-tk - Genome based classification - requires 140GB of reference DB to be downloaded. Recommended way
OR
Extract 16S rRNA gene from the genome and perform BLASTn
$ mkdir barrnap_out
(prokka)$ barrnap -o barrnap_out/a45.fa < uni_out/assembly.fasta > barrnap_out/a45.gffGet the genomes of closest hits from NCBI Genome DB and run ANI
$ mkdir -p pyani/all_genomes
$ cp -v refs/* pyani/all_genomes/
$ cp -v uni_out/assembly.fasta pyani/all_genomes/
$ cd pyani
(pyani)$ pyani-plus fastani all_genomes --database all_genomes.db --create-db --name "unicycler_stage"
(pyani)$ pyani-plus list-runs --database all_genomes.db
(pyani)$ pyani-plus export-run --database all_genomes.db --outdir fastani_out --run-id 1
(pyani)$ pyani-plus plot-run --database all_genomes.db --outdir fastani_out --run-id 1Scaffolding
Use the closest genome for scaffolding
(ragtag)$ ragtag.py scaffold refs/A45.fna uni_out/assembly.fasta -o ragtag_outTry MeDuSa scaffolder
Get summary of the placed and unplaced contigs
(seqkit)$ seqkit fx2tab ragtag_out/ragtag.scaffold.fasta -lngiH
$ grep 'N' ragtag_out/ragtag.scaffold.fasta -o | wc -l
# 1800 N'sPolishing
There are multiple gap filling and polishing tools, like - Pilon, POLCA, GapFiller, etc. Improvement over the existing assembly may be minimal unless you have another set of high quality reads or long reads.
Using POLCA. This required filtered reads to polish the scaffolds
(polca)$ polca.sh -a ../ragtag_out/ragtag.scaffold.fasta -r '../bb_out/hpy_a45_R1.fq.gz ../bb_out/hp
y_a45_R2.fq.gz' -t 10Run SeqKit and check for Ns
Plasmid Detection
Using Plasmer. Requires 10GB reference DB
$ mkdir plasmer_out
(plasmer)$ Plasmer -g polca_out/ratag.scaffold.fasta.PolcaCorrected.fa -p hpy -d ~/DBs/plasmer_db -t 10 -m 99 -o plasmer_outDiscard smaller, unclassified scaffolds. Copy the final scaffolds using SeqKit to a new directory and rename the files, ,if required, rename the fasta headers
(seqkit)$ seqkit grep -f polca_out/required_headers.txt polca_out/ragtag.scaffold.fasta.PolcaCorrected.fa -o genome/a45_assembled.faAssembly QC
Using CheckM
(checkm)$ checkm lineage_wf -x fa genome/ checkm_out/ -t 10 -f checkm_out/results.txt --tab_tableAnalysis
MLST
(mlst)$ mlst genome/a45_assembled.fa --full --outfile mlst_out/a45_mlst.txtAnnotation
(prokka)$ prokka --outdir a45 --prefix a45 genome/a45_assembled.faWebserver - RAST
AMR and Virulence genes
With ABRIcate
$ mkdir abricate_out
# NCBI database
(abricate)$ abricate genome/a45_assembled.fa > abricate_out/S3_amr.tab
# MEGAres database
(abricate)$ abricate --db megares genome/a45_assembled.fa > abricate_out/S3_megares.tab
# Virulence genes
(abricate)$ abricate --db vfdb genome/a45_assembled.fa > abricate_out/S3_vfdb.tab
# Trying plasmidfinder
(abricate)$ abricate --db plasmidfinder genome/a45_assembled.fa > abricate_out/S3_pf.tabSNPs
Using Snippy. Requires filtered reads, not the genome. Reference genome is provided in GBF format.
$ mkdir snippy_out
# With ATCC43504 as reference
(snippy)$ snippy --cpus 10 --outdir snippy_out/ATCC43504 --ref refs/ATCC43504.gbff --R1 bb_out/hpy_a45_R1.fq.gz --R2 bb_out/hpy_a45_R2.fq.gz Type Strain
Webserver - TYGS - Upload genome