Update (Sept 04, 2010):
Due to the fast-moving nature of the field in question, this project has unfortunately been retired. It was an interesting experience in which I learned a lot, and many of the skills I gained have been applied in other projects involving DNA sequence manipulation and study. I've updated the paper section of this page with the latest draft of the paper. If any of the scripts I have written, or any of the information here, proves useful to someone, the paper may be cited as follows:
Ovchinnikov, Sergey (2010). "Short read assembly pipeline using scripted filters, de novo assembly and manual extension of terminated contigs".
PSU McNair Online Journal. Volume 4.
About Me: I'm an undergraduate at Portland State University studying Molecular Biology and Chemistry. In the summer of 2009, I participated in the Ronald E. McNair Scholars Program, where I worked with Dr. Mark Fishbein on a project of my choosing. My initial goal was to analyze non-coding regions of the chloroplast genome by constructing phylogenetic trees. But the work came to a halt because I didn't have the data that I had hoped to have by summer.
Instead I concentrated on actually assembling that data. And so began my journey, in the course of which I learned everything I could about de novo and reference-guided assembly of short reads. And of course I had to learn computer programming. (It's funny how fast you can learn something if you have a goal.)
“Short read assembly pipeline using scripted filters, de novo assembly and manual extension of terminated contigs”
Multiple methods and algorithms exist for assembling the millions of short DNA sequence reads generated by instruments such as Illumina sequencers. Depending on the algorithm, the method used to obtain the preferred amplicon, the depth of coverage and the region, the assembly may yield multiple contigs. These contigs result from termination caused by gaps in the data and/or failure to pass various safety checks.
For the case study, the partial chloroplast genomes of Asclepias syriaca and Asclepias tuberosa were sequenced. To extract the chloroplast sequence from the entire genome, PCR reactions were performed with overlapping amplicons of ~3 kbp. Various de novo algorithms and methods of pre-filtration and trimming were examined. It was found that terminations and misassemblies arose from primer sequences present in reads, base-call and PCR errors, repeats, poly regions and low coverage.
In this paper a tool is provided for aligning the generated contigs onto a reference sequence, which allows the user to examine potential overlaps and mis-terminations. A tool is also provided to assist in manual extension of the terminated contigs. This tool can be used to examine every possible extension from the available pool of reads or, taken a step further, to perform the actual extension. For extension, the tool searches for the most common pattern among agreeing reads at each extension cycle, an approach designed to cope with issues such as the high error rate in poly regions. The tool can also take a reference sequence into account to extend into areas of low coverage, where only short overlaps exist between reads, and into areas where a repeat is present.
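To make the extension idea concrete, here is a minimal Python sketch of one way a "most common pattern" extension cycle could work. It is not the script from the paper: the function name `extend_contig`, the parameters `seed_len`, `window`, `min_support` and `max_cycles`, and the exact-match seed lookup are all assumptions chosen for illustration. The reference-guided mode described above is left out.

```python
from collections import Counter


def extend_contig(contig, reads, seed_len=6, window=2,
                  min_support=2, max_cycles=100):
    """Greedily extend the 3' end of `contig` (hypothetical helper).

    Each cycle takes the last `seed_len` bases of the contig as a seed,
    collects the `window` bases that follow the seed in every read that
    contains it, and appends the most common pattern. Extension stops
    when fewer than `min_support` reads agree or no read continues.
    """
    for _ in range(max_cycles):  # hard cap so repeats cannot loop forever
        seed = contig[-seed_len:]
        patterns = Counter()
        for read in reads:
            pos = read.find(seed)  # first exact match only (assumption)
            if pos == -1:
                continue
            start = pos + seed_len
            ext = read[start:start + window]
            if len(ext) == window:  # ignore reads that end too early
                patterns[ext] += 1
        if not patterns:
            break  # no read continues past the seed: a termination
        best, support = patterns.most_common(1)[0]
        if support < min_support:
            break  # too few agreeing reads to trust the extension
        contig += best
    return contig


if __name__ == "__main__":
    # Toy data: overlapping 12-base reads tiled along a made-up sequence,
    # plus one read carrying a single base-call error.
    genome = "ACGTTGCAAGGCTTACCGATGCATTCGA"
    reads = [genome[i:i + 12] for i in range(len(genome) - 11)]
    reads.append("GTTGCAAGTCTT")  # erroneous copy of genome[2:14]
    print(extend_contig(genome[:10], reads))
```

Voting on a short window of bases instead of a single base means that one erroneous read is outvoted by the reads that agree with each other. Raising `min_support` makes the sketch stop earlier in areas of low coverage, which is where a reference sequence or manual inspection would take over.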