Occasionally, GBA yields LCRs that appear fragmented. This flaw has been attributed to some lack of optimization in the steps the algorithm performs to find the LCRs in a sequence. Two improvements to two separate steps in the GBA algorithm have been proposed to remedy the situation.
First, note the four step process GBA uses to determine LCRs:
- Construct a graph
- Find the longest path
- Extend longest path intervals
- Post-process extended intervals
More details on each of these steps may be found in Xuehui Li’s paper listed in the Publications section.
The first proposed improvement pertains to the first step in the GBA algorithm. In the current set-up, a vertex is created whenever two letters in a sequence are the same as, or similar to, each other. Edges are inserted between two vertices if the combination of the first letters from each vertex and the combination of the second letters from each vertex form a repeat. The improvement to this process defines each vertex as a k-gram sequence similarity rather than just a 2-letter similarity.
The second proposed improvement takes aim at finding the longest path. Rather than use a purely greedy method for jumping from a source node to a destination (by traveling along the longest edge), the improvement suggests an N-lookahead before picking each edge on the path. So, for each node the improved algorithm will calculate which route yields the longest path N nodes down the line. It chooses the first edge on this path to follow, then looks ahead again. Note that while this method will add a certain amount of foresight when approximating the longest path, it will also add complexity that increases exponentially with the value of N.
Categories: Uncategorized
As I delve deeper and deeper into the development, it’s natural that I’ll uncover new questions and important considerations. Here they are, presented in list format:
- Standalone vs. web-based: While a web-based system offers convenience (no install, etc.), it limits the processing power and memory capacity to that of the server which houses the system.
- Outputs: Apart from the normal GBA outputs– either a LCR-masked sequence or a list of the locations of LCRs– what data should the UI collect/present? The ratio of LCRs to the rest of the sequence? Suspected LCRs?
- Representation: What representation, graphical or otherwise, best sums the information output by the algorithm into one, palatable whole?
- Method of representation: Which tools will be used to build the mechanism for representation? Java applets? Java Web Start? HTML/XML?
Categories: Uncategorized
The main goal of the project will be to implement a user interface to GBA, enabling biologists and other semi-technical folk to easily utilize the algorithm. Two methods, web-based and standalone, will be investigated. The choice of implementation language will depend on factors such as ease of implementation, speed, and compatibility with original implementation language. The author will perform experiments, verifying that the results of the original algorithm implementation and the results exhibited by the user interface match, using a selection of data from the Swiss-Prot database.
As a secondary goal of the project, the author will investigate shortcomings of GBA in an attempt to improve upon the algorithms. Particularly, resultant LCRs that should appear as one continuous LCR occasionally appear as a scattered, broken lengths of smaller LCRs. Alternate methods of LCR sequence extension (a.k.a. following vertex edges in GBA) may provide insight into the fragmentation problem, leading to a novel solution. Experiments comparing the resultant LCRs of the original and altered algorithms will determine the ability of the new algorithm to solve the older algorithm’s LCR segmentation inconsistencies. Again, the Swiss-Prot database will provide the data used to test the altered algorithm.
Categories: Uncategorized
This is the main page for the “Graph Based Algorithm” project. Graph Based Algorithm (GBA) is an algorithm formulated by Xuehui Li for identifying low-complexity regions (LCRs) in protein sequences. The paper describing this algorithm, “A novel algorithm for identifying low-complexity regions in a protein sequence,” can be found here. Here, on the front page, you’ll find important information, and updates on the progress of the project, as well as any discoveries helpful in improving the algorithm.
The “Project” section presents an abstract describing the goals and purpose of the project. The “Publications” section lists all publications referenced by the author in order to accomplish the project goals. The “Author” section contains a summary of the author’s interests and current research.
Categories: Uncategorized