Assembly

Objective: does subsampling to assembly?

  • Determine if subsampling reads prior to assembly makes a difference for metagneomic assembly statistics, metrics to look at: length of assembly, N50 metrics etc.
  • Learn to use SLURM system on the MERCED cluster

Helful links:

-getting started

-rosetta qsub to slurm

-Inspiration from Hug et al. 2015

  1. Setting up the work space

Data structure:

data/ data repository for backed up data scratch/ scratch will be cleaned up every 5 weeks; good for jobs etc

typical slurm script:

#!/bin/bash  
#SBATCH --mail-user=esogin@ucmerced.edu  
#SBATCH --mail-type=ALL  
#SBATCH --nodes=1  
#SBATCH --ntasks=20
#SBATCH --partition fast.q  
#SBATCH --mem=96G  
#SBATCH --time=0-00:15:00 # 15 minutes  
#SBATCH --output=test.qlog  
#SBATCH --job-name=test
#SBATCH --export=ALL

# This submission file will run a simple set of commands. All stdout will
# be captured in test1.qlog (as specified in the Slurm command --output above).
# This job file uses a shared-memory parallel environment and requests 20
# cores (--ntasks option) on a single node(--nodes option). This job will also run a global #script called
# run. For more info on this script, cat /usr/local/bin/merced_node_print.
#  

whoami

pwd

uptime
hostname
date

To submit the job: sbatch script

  1. Start

This scripts will: (1) download the sequencing reads from ENA (2) check the md5sums (3) preform basic quality trimming (4) subsample reads to 5%, 10%, 20%, 40% and 80% (5) removes raw squecing read files

Tools required: bbtools

#!/bin/bash 
#SBATCH --mail-user=esogin@ucmerced.edu  
#SBATCH --mail-type=ALL  
#SBATCH --nodes=1  
#SBATCH --ntasks=20
#SBATCH --partition std.q  
#SBATCH --mem=96G  
#SBATCH --time=0-24:00:00 #  
#SBATCH --job-name=start
#SBATCH --export=ALL

i=3847_I	
cd /home/esogin/scratch/prj001-comp-gene-chr/

##----------01 Download data from ENA/Facilitity--------------##
#echo "--------download file--------"

## Study accession  nr. PRJEB35096	
#sample run accession nr:

#cd /home/esogin/scratch/prj001-comp-gene-chr/01_Data/00_raw/
#wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${i:0:6}/00${i:9:10}/$i/$i* 

##----------02 Clean Data --------------##
#echo "--------clean data--------"

#mkdir /home/esogin/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/
#cd /home/esogin/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/;

#bbduk.sh ref=/home/esogin/data/db/adapters.fa ktrim=l minlength=36 mink=11 hdist=1 in=../00_raw/${i}_R1.fastq.gz in2=../00_raw/${i}_R2.fastq.gz out=${i}_ktriml.fq.gz;
#bbduk.sh ref=/home/esogin/data/db/adapters.fa ktrim=r trimq=2 qtrim=rl minlength=36 mink=11 hdist=1 in=${i}_ktriml.fq.gz interleaved=t out=${i}_q2_ktrimmed.fq.gz;

#rm *_R*.fastq.gz ${i}_ktriml.fq.gz
#rm ${i}_ktriml.fq.gz

##----------03 Subsample Data --------------##
echo "--------subsample data--------"

mkdir /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/01_subsampled -p;
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/01_subsampled;

reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.1 out=${i}_0.1.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.2 out=${i}_0.2.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.4 out=${i}_0.4.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.8 out=${i}_0.8.fq.gz
reformat.sh in=~/scratch/prj001-comp-gene-chr/01_Data/01_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=1 out=${i}_1.fq.gz

##----------04 Copy results back to /home/data --------------##
rsync -r /home/esogin/scratch/prj001-comp-gene-chr/ /home/esogin/data/prj001-comp-gene-chr/

echo "clean up scratch"

uptime
hostname
date

The download of the data failed (probably has to do with my coding above)

instead download files prior to running script:

#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR362/ERR3624084/3847_I_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/004/ERR3624084/ERR3624084_1.fastq.gz
echo "4b0e21b15df9a76fe9739f9690c27102  ERR3624084_1.fastq.gz" | md5sum -c 

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR362/004/ERR3624084/ERR3624084_2.fastq.gz
echo "b3e7f37af49846385c5feca0b10c2b73  ERR3624084_2.fastq.gz" | md5sum -c 
  1. Assembly - trouble shooting started Sept 24 2021

Assembly each set of reads with spades

#!/bin/bash 
#SBATCH --mail-user=esogin@ucmerced.edu  
#SBATCH --mail-type=ALL  
#SBATCH --nodes=1  
#SBATCH --ntasks=20
#SBATCH --partition std.q  
#SBATCH --mem=96G  
#SBATCH --time=0-24:00:00 #  
#SBATCH --job-name=start
#SBATCH --export=ALL

i=3847_I	
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/


##----------03 Subsample Data --------------##
echo "--------Assemble Data--------"

mkdir /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/02_assembly -p;
cd /home/esogin/scratch/prj001-comp-gene-chr/02_Analysis/02_assembly;


reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.1 out=${i}_0.1.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.2 out=${i}_0.2.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.4 out=${i}_0.4.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=0.8 out=${i}_0.8.fq.gz
reformat.sh in=/home/esogin/prj001-comp-gene-chr/01_Data/02_trimmed/${i}_q2_ktrimmed.fq.gz samplerate=1 out=${i}_1.fq.gz

##----------04 Copy results back to /home/data --------------##
rsync -r /home/esogin/scratch/prj001-comp-gene-chr/ /home/esogin/data/prj001-comp-gene-chr/

echo "clean up scratch"

uptime
hostname
date
Written on September 20, 2021