Assembly
Objective: does subsampling to assembly?
- Determine if subsampling reads prior to assembly makes a difference for metagneomic assembly statistics, metrics to look at: length of assembly, N50 metrics etc.
- Learn to use SLURM system on the MERCED cluster
Helful links:
-Inspiration from Hug et al. 2015
- Setting up the work space
Data structure:
data/
data repository for backed up data
scratch/
scratch will be cleaned up every 5 weeks; good for jobs etc
typical slurm script:
#!/bin/bash
#SBATCH --mail-user=esogin@ucmerced.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --partition fast.q
#SBATCH --mem=96G
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --output=test.qlog
#SBATCH --job-name=test
#SBATCH --export=ALL
# This submission file will run a simple set of commands. All stdout will
# be captured in test1.qlog (as specified in the Slurm command --output above).
# This job file uses a shared-memory parallel environment and requests 20
# cores (--ntasks option) on a single node(--nodes option). This job will also run a global #script called
# run. For more info on this script, cat /usr/local/bin/merced_node_print.
#
whoami
pwd
uptime
hostname
date
To submit the job: sbatch script
- Start
This scripts will: (1) download the sequencing reads from ENA (2) check the md5sums (3) preform basic quality trimming (4) subsample reads to 5%, 10%, 20%, 40% and 80% (5) removes raw squecing read files
Tools required: bbtools
#!/bin/bash
#SBATCH --mail-user=esogin@ucmerced.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --partition std.q
#SBATCH --mem=96G
#SBATCH --time=0-24:00:00 #
#SBATCH --job-name=start
#SBATCH --export=ALL
i=3847_I
cd /home/esogin/scratch/prj001-comp-gene-chr/
##----------01 Download data from ENA/Facilitity--------------##
#echo "--------download file--------"
## Study accession nr. PRJEB35096
#sample run accession nr:
#cd /home/esogin/scratch/prj001-comp-gene-chr/01_Data/00_raw/
#wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${i:0:6}/00${i:9:10}/$i/$i*
##----------02 Clean Data --------------##
#echo "--------clean data--------"
#mkdir /home/esogin/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/
#cd /home/esogin/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/;
#bbduk.sh ref=/home/esogin/data/db/adapters.fa ktrim=l minlength=36 mink=11 hdist=1 in=../00_raw/${i}_R1.fastq.gz in2=../00_raw/${i}_R2.fastq.gz out=${i}_ktriml.fq.gz;
#bbduk.sh ref=/home/esogin/data/db/adapters.fa ktrim=r trimq=2 qtrim=rl minlength=36 mink=11 hdist=1 in=${i}_ktriml.fq.gz interleaved=t out=${i}_q2_ktrimmed.fq.gz;
#rm *_R*.fastq.gz ${i}_ktriml.fq.gz
#rm ${i}_ktriml.fq.gz
##----------03 Subsample Data --------------##
echo "--------subsample data--------"
mkdir /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/01_subsampled -p;
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/01_subsampled;
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.1 out=${i}_0.1.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.2 out=${i}_0.2.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.4 out=${i}_0.4.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.8 out=${i}_0.8.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=1 out=${i}_1.fq.gz
##----------04 Copy results back to /home/data --------------##
rsync -r /home/esogin/scratch/prj001-comp-gene-chr/ /home/esogin/data/prj001-comp-gene-chr/
echo "clean up scratch"
uptime
hostname
date
The download of the data failed (probably has to do with my coding above)
instead download files prior to running script:
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR362/ERR3624084/3847_I_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/004/ERR3624084/ERR3624084_1.fastq.gz
echo "4b0e21b15df9a76fe9739f9690c27102 ERR3624084_1.fastq.gz" | md5sum -c
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/004/ERR3624084/ERR3624084_2.fastq.gz
echo "b3e7f37af49846385c5feca0b10c2b73 ERR3624084_2.fastq.gz" | md5sum -c
- Assembly - trouble shooting started Sept 24 2021
Assembly each set of reads with spades
#!/bin/bash
#SBATCH --mail-user=esogin@ucmerced.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH --partition std.q
#SBATCH --mem=96G
#SBATCH --time=0-24:00:00 #
#SBATCH --job-name=start
#SBATCH --export=ALL
i=3847_I
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/
##----------03 Subsample Data --------------##
echo "--------Assemble Data--------"
mkdir /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/02_assembly -p;
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/02_assembly;
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.1 out=${i}_0.1.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.2 out=${i}_0.2.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.4 out=${i}_0.4.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.8 out=${i}_0.8.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=1 out=${i}_1.fq.gz
##----------04 Copy results back to /home/data --------------##
rsync -r /home/esogin/scratch/prj001-comp-gene-chr/ /home/esogin/data/prj001-comp-gene-chr/
echo "clean up scratch"
uptime
hostname
date
Written on September 20, 2021