Add abstract
Want to add your dissertation abstract to this database? It only takes a minute!
Search abstract
Search for abstracts by subject, author or institution
Want to add your dissertation abstract to this database? It only takes a minute!
Search for abstracts by subject, author or institution
Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes
by Rajul D Bhavsar
Institution: | Indian Institute of Science |
---|---|
Department: | |
Degree: | |
Year: | 2011 |
Keywords: | Suffix-Tree; Suffix-tree Indexing; Disk-based Suffix-trees; Suffix-tree Internal Nodes; Suffix Trees - Storage Optimization; Suffix-tree Layouts; Suffix-Tree Nodes; Disk-resident Suffix-trees; Computer Science |
Posted: | |
Record ID: | 1211005 |
Full text PDF: | http://hdl.handle.net/2005/2124 |
Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties – suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings. In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure. To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings. We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of node-to-block assignment condition and optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are speeded up by 50% to 90%. Overall, our…
Want to add your dissertation abstract to this database? It only takes a minute!
Search for abstracts by subject, author or institution
Prediction of Upper Body Power of Cross-Country Sk...
|
|
Bitcoins
Mining, Transaction, Security Challenges and Futur...
|
|
Applying User-Centered Interface Design Methods to...
|
|
Head-Order Techniques and Other Pragmatics of Lamb...
|
|
Visualization of Interface Metaphor for Software
An Engineering Approach
|
|
Indoor Wireless Metering Networks
A Collection of Algorithms Enabling Low Power/Low ...
|
|
Automated Generation of Geometrically-Precise and ...
|
|
A Study on the Tone-Reservation Technique for Peak...
|
|