--^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^--
---------------- Run and understand VELVET ----------------
-v v v v v v v v v v v v v v v v v v v v v v v v v v v v v-

Program URL: https://github.com/dzerbino/velvet

The program

“Velvet is used to manipulate de Bruijn graphs for genomic
sequence assembly. A de Bruijn graph is a compact
representation based on short words (k-mers) that is ideal
for high coverage, very short read (25–50 bp) data sets.
Applying Velvet to very short reads and paired-ends
information only, one can produce contigs of significant
length, up to 50-kb N50 length in simulations of prokaryotic
data and 3-kb N50 on simulated mammalian BACs.

Errors are corrected after graph creation to allow for
simultaneous operations over the whole set of reads. In our
framework, errors can be due to both the sequencing process
or to the biological sample, for example, polymorphisms.
Distinguishing polymorphisms from errors is a post-assembly
task. A naive approach to error removal would be to use the
difference between the expected coverage of genuine
sequences and that of random errors. Therefore removing all
the low coverage nodes (and their corresponding arcs) would
remove the errors. However, this relies on the differences
being due to genuine errors and not to biological variants
present at a reasonable frequency in the sample, and errors
being randomly distributed in the reads.

Instead, Velvet focuses on topological features. Erroneous
data create three types of structures: “tips” due to errors
at the edges of reads, “bulges” due to internal read errors
or to nearby tips connecting, and erroneous connections due
to cloning errors or to distant merging tips. The three
features are removed consecutively.”

ORDER OF THINGS
1. VelvetOptimiser.pl
2. velveth
3. velvetg

--^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^--
-----------------------------------------------------------
-v v v v v v v v v v v v v v v v v v v v v v v v v v v v v-

~~~~~~~~~~~~~~~~~~~~~~
~                    ~
~ VelvetOptimiser.pl ~
~                    ~
~~~~~~~~~~~~~~~~~~~~~~

The Velvet software includes the script VelvetOptimiser.pl
(note the spelling, with an "s"), which uses a heuristic
method to find the optimal k-mer length and coverage cutoff
for Velvet.

########################
# Example command line #
########################

$ VelvetOptimiser.pl -s 25 -e 45 -f '-shortPaired -fastq long_1.fastq long_2.fastq'

-s and -e: the minimum and maximum k-mer size to test
-f: the file section of the velveth command line (file
    format, read type, and input files), passed as a single
    quoted argument

By default, VelvetOptimiser chooses the optimal k-mer size
based on the N50. However, the -k option lets users base the
assembly optimization function on other variables.

~~~~~~~~~~~
~         ~
~ velveth ~
~         ~
~~~~~~~~~~~

velveth stands for "Velvet hash". It reads the sequence
input files and outputs three files: Sequences, Roadmaps,
and Log. velveth requires an output directory, the k-mer
length (must be an odd number), the sequence file format,
the read type, and the input filename(s).

########################
# Example command line #
########################

$ velveth velvet_output/ 31 -fastq -shortPaired long_1.fastq long_2.fastq

(31 is only an example k-mer length; substitute the value
chosen for your data.)

~~~~~~~~~~~
~         ~
~ velvetg ~
~         ~
~~~~~~~~~~~

velvetg stands for "Velvet graph". It uses the velveth
output to build the assembly and writes the files
contigs.fa, UnusedReads.fa, Graph2, LastGraph, PreGraph,
stats.txt, and Log. velvetg requires a coverage cutoff
(-cov_cutoff) to be specified in order to exclude short,
low-coverage nodes from the assembly. In addition, running
velvetg on paired-end reads requires the expected insert
length (-ins_length, the average length of the sequenced
fragment) and the expected k-mer coverage (-exp_cov).
########################
# Example command line #
########################

$ velvetg velvet_output/ -cov_cutoff auto
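
Without the optimizer script, the same k-mer sweep can be
laid out by hand. The sketch below is only an illustration:
print_sweep is a hypothetical helper (not part of Velvet),
and the velvet_k<k>/ output directory names are assumptions.
It prints the velveth and velvetg commands for each odd
k-mer length in a range and does not run them, so the
commands can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper, not part of Velvet: print (do not run) the
# velveth/velvetg commands for every odd k-mer length in a range.
#
# Usage: print_sweep START END READS_1 READS_2
print_sweep() {
    k=$1
    end=$2
    # velveth needs an odd k-mer length, so bump an even start by one.
    if [ $((k % 2)) -eq 0 ]; then
        k=$((k + 1))
    fi
    while [ "$k" -le "$end" ]; do
        # One output directory per k (directory naming is an assumption).
        echo "velveth velvet_k${k}/ ${k} -fastq -shortPaired $3 $4"
        echo "velvetg velvet_k${k}/ -cov_cutoff auto"
        k=$((k + 2))
    done
}

# Dry run for k = 25, 27, 29 with the read files used in these notes.
print_sweep 25 29 long_1.fastq long_2.fastq
```

The dry run prints one velveth/velvetg pair per k-mer length.
Piping the output to sh would execute the commands, after
which the contigs.fa files in each directory can be compared.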