As such, many things work differently. This cookbook provides examples of how to work with the new codebase. If you find the code you want and copy and paste it into your program, you should be up and running quickly. I have endeavoured to over-document the code to make it more obvious what I am doing, so some of the code might look a bit bloated. If you have any suggestions, questions or comments, contact the biojava mailing list. To subscribe to this list go here.
|Published (Last):||27 March 2014|
When analysing the code from BioJava 1 and deciding what should be done in BioJava3, emphasis was placed on breaking the code into modules. Thus core represents a collection of classes that are common to the other modules.
The common elements for all modules are the reading, writing and representation of sequence data. We also thought it was important to use Java to model the biological relationships between sequences as accurately as possible.
The BioJava3 API should establish concrete relationships that help the computer scientist understand the biology through code and feel familiar to the biologist when writing code. In the genomic view of sequence data we now have very large data sets, which makes it a challenge to load everything into memory; the alternative is to retreat to a database and let it handle that complexity. We want to allow easy integration of sequence databases such as BioSQL but at the same time support large sequence datasets loaded from disk or accessed via web services.
This is why the Sequence Interface reigns supreme! Once you have the gene sequence you should be able to easily extract intron sequences or sequence data flanking the gene sequence for analysis.
By leveraging the REST or web services of public data sources like Uniprot or NCBI, we want the API to hide these implementation details but offer enough flexibility that other public or private data sources can be easily integrated into BioJava3. An additional design goal is to keep the biojava3-core module as small as possible: it should not become a convenient place to add new classes that do not directly relate to protein or DNA sequences, nor should it become dependent on external jar files.
It is tempting to make Dom4J a standard library in BioJava3 because of its speed and API, but it is no longer being actively developed.
Before you realize it, core has a large number of external dependencies, which creates potential problems for developers who use the BioJava3 API in their application if a different version of an external API is required.
For now, core is all about sequences and keeping it as small as possible. Currently, the biojava3-core module is being developed as part of the day job for two developers with tight deadlines and never enough time to do extensive documentation or even minimal documentation. Now that the biojava3-core module is settling down, we will be working on finishing the JavaDoc, adding test cases and providing examples in the wiki. We really want to make it easy to create a sequence, and what could be easier than using a String?
The storage of the sequence data is defined by the Sequence interface, which allows for some interesting and, we hope, useful abstraction. The simplest implementation, which represents a sequence built from a String, is the ArrayListSequenceReader; it is the default data store when creating a sequence from a string. By using the Sequence interface we can easily extend the concept of local sequence storage in a fasta file to loading the sequence from Uniprot or NCBI based on an accession ID.
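To make the storage idea concrete, here is a minimal plain-Java sketch. It deliberately does not use the BioJava classes: `SeqStore`, `InMemoryStore`, `LazyStore` and `SimpleSequence` are hypothetical names that only mimic the roles described above, and the supplier passed to `LazyStore` stands in for a file or web-service lookup.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a storage back end; not a BioJava type.
interface SeqStore {
    int length();
    char charAt(int position); // 1-based position
}

// Keeps the whole sequence in memory, like a store built from a String.
final class InMemoryStore implements SeqStore {
    private final String data;
    InMemoryStore(String data) { this.data = data; }
    public int length() { return data.length(); }
    public char charAt(int position) { return data.charAt(position - 1); }
}

// Defers loading until first access; the supplier could read a file or call a web service.
final class LazyStore implements SeqStore {
    private final Supplier<String> loader;
    private String data; // null until first use
    LazyStore(Supplier<String> loader) { this.loader = loader; }
    private String load() {
        if (data == null) data = loader.get();
        return data;
    }
    boolean isLoaded() { return data != null; }
    public int length() { return load().length(); }
    public char charAt(int position) { return load().charAt(position - 1); }
}

// The sequence delegates to whichever store it was given.
final class SimpleSequence {
    private final SeqStore store;
    SimpleSequence(SeqStore store) { this.store = store; }
    SimpleSequence(String s) { this(new InMemoryStore(s)); }
    int getLength() { return store.length(); }
    // Returns residues start..end, both inclusive and 1-based.
    String subSequence(int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= end; i++) sb.append(store.charAt(i));
        return sb.toString();
    }
}
```

Code that uses `SimpleSequence` does not need to know which store sits behind it, which is the point of the abstraction.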
The UniprotProxySequenceReader can implement other feature interfaces, and using the XML data that describes the protein sequence we can give a list of known mutations or mutagenesis studies with references to papers. We are still in the early stages of extending these sequence relationships and expect some API changes.
The abstraction of the sequence storage to an interface allows for a great deal of flexibility but has also added some challenges in how to handle situations when something goes wrong and you need to throw an exception. Once sequences can be loaded from remote URLs, which fails when the internet is not working, or loaded through a lazy instantiation approach, it becomes difficult to handle error conditions without making every method throw an exception.
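One possible way around this, shown here only as a sketch and not as the approach BioJava settled on, is to wrap load failures in unchecked exceptions so that the accessors need no `throws` clause. `RemoteBackedSequence` and its loader are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;

// Hypothetical lazy sequence; the loader may fail with a checked exception.
final class RemoteBackedSequence {
    private final Callable<String> loader;
    private String data; // null until the first successful load

    RemoteBackedSequence(Callable<String> loader) { this.loader = loader; }

    // Accessors declare no checked exceptions; failures surface as unchecked ones.
    int getLength() { return load().length(); }
    String getSequenceAsString() { return load(); }

    private String load() {
        if (data == null) {
            try {
                data = loader.call();
            } catch (IOException e) {
                throw new UncheckedIOException("could not load sequence", e);
            } catch (Exception e) {
                throw new IllegalStateException("could not load sequence", e);
            }
        }
        return data;
    }
}
```

The trade-off is that callers are no longer forced by the compiler to think about a failed download, so the failure mode has to be documented instead.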
This is a design work in progress; as we get feedback from developers we expect some level of API changes as we improve the overall design. The use of the SequenceCreator interface also allows us to address large genomic data sets where the sequence data comes from a fasta file but is loaded lazily, only when a method that needs the sequence data or sub-sequence data is called.
The FileProxyProteinSequenceCreator implements the Sequence interface but is very specific to learning the location of the sequence data in the file. A FastaReader class is created where we abstract out the code that is used to parse the fasta header, and FileProxyProteinSequenceCreator is used to learn the beginning and ending offset of each protein sequence.
A SequenceFileProxyLoader is created for each sequence and stores the beginning and ending index of that sequence in the fasta file. The current implementation of SequenceFileProxyLoader loads the protein sequence data when needed and retains it in memory, which works great if you are only interested in a subset of sequences.
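The offset-based loading described above can be sketched in plain Java as follows. This is not the SequenceFileProxyLoader code: `FastaOffsetIndex` is a hypothetical class that records the byte range of each record while scanning the file once, then reads that range only when the sequence is requested. It assumes an ASCII fasta file and uses the unbuffered `RandomAccessFile.readLine` for brevity.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: header -> {start, end} byte offsets of the residue lines.
final class FastaOffsetIndex {
    private final String path;
    private final Map<String, long[]> offsets = new LinkedHashMap<>();
    private final Map<String, String> loaded = new HashMap<>();

    FastaOffsetIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            String header = null;
            long start = 0;      // first byte after the current header line
            long lineStart = 0;  // first byte of the line about to be read
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    // The previous record ends where this header line begins.
                    if (header != null) offsets.put(header, new long[] {start, lineStart});
                    header = line.substring(1).trim();
                    start = in.getFilePointer();
                }
                lineStart = in.getFilePointer();
            }
            if (header != null) offsets.put(header, new long[] {start, in.length()});
        }
    }

    boolean isLoaded(String header) { return loaded.containsKey(header); }

    // Reads the residues on first request and keeps them in memory afterwards.
    String getSequence(String header) throws IOException {
        String cached = loaded.get(header);
        if (cached != null) return cached;
        long[] range = offsets.get(header);
        if (range == null) throw new IllegalArgumentException("unknown header: " + header);
        byte[] buf = new byte[(int) (range[1] - range[0])];
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(range[0]);
            in.readFully(buf);
        }
        // Drop line breaks so wrapped fasta lines join into one sequence.
        String seq = new String(buf, StandardCharsets.US_ASCII).replaceAll("\\s+", "");
        loaded.put(header, seq);
        return seq;
    }
}
```

Only the records that are actually requested end up in memory, which matches the "subset of sequences" use case.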
If the application using the API is going to iterate through all sequences in a large fasta file then in the end all sequence data would be loaded into memory. The SequenceFileProxyLoader could be easily extended to maintain a max number of sequences loaded or memory used and free up sequence data that is loaded into memory.
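A bounded variant could look like the sketch below, which keeps at most a fixed number of sequences and evicts the least recently used one. `SequenceCache` is a hypothetical class built on the standard `LinkedHashMap`; the loader function stands in for whatever reads the sequence from disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded cache: at most maxEntries sequences are held in memory.
final class SequenceCache {
    private final Function<String, String> loader;
    private final Map<String, String> cache;
    private int loads = 0;

    SequenceCache(int maxEntries, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder = true orders entries from least to most recently accessed.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached sequence, loading it (and possibly evicting another) on a miss.
    String get(String id) {
        String seq = cache.get(id);
        if (seq == null) {
            seq = loader.apply(id);
            loads++;
            cache.put(id, seq);
        }
        return seq;
    }

    boolean isCached(String id) { return cache.containsKey(id); }
    int loadCount() { return loads; }
}
```

A limit on memory used rather than on entry count would need a size estimate per sequence, which is left out here.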
This way you can implement the appropriate caching algorithm based on the usage of the sequence data. In an effort to provide a flexible and modular API, the abstraction can often make it difficult for someone getting started with the API to know what to use. We are implementing a set of classes that have the word Helper in them to hide the abstraction and at the same time provide examples of how to use the underlying API.
Typically the helper methods will be static methods and generally should be a small block of glue code. Sometimes it is useful to index a set of sequences by their length. Avoid using any kind of String method to do this since String operations are costly in BioJava due to the String conversion that must be applied.
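A length index of this kind can be sketched as a small static helper. The sketch is plain Java: `HasLength` is a hypothetical one-method interface standing in for the length accessor of a sequence, and `LengthIndex.byLength` is a made-up helper name. It asks each object for its length directly and never converts anything to a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal sequence type; only the length accessor matters here.
interface HasLength {
    int getLength();
}

final class LengthIndex {
    // Generic so that any sequence type can be indexed, whatever its compound type.
    static <S extends HasLength> Map<Integer, List<S>> byLength(Iterable<S> sequences) {
        Map<Integer, List<S>> index = new TreeMap<>(); // keys sorted by length
        for (S s : sequences) {
            index.computeIfAbsent(s.getLength(), k -> new ArrayList<>()).add(s);
        }
        return index;
    }
}
```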
This can be done for any Sequence object. Note the use of a generic type to capture Sequence objects of any type, since length is something that can be assessed for any Sequence, not just AminoAcidCompound sequences.
A DNA sequence is transcribed to RNA, which is then translated to a protein sequence using codons. All parts of the translation process are configurable. The default translates the given DNASequence to a peptide using the non-ambiguity CompoundSets with codon table 1 in frame 1 in the forward orientation.
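The mechanics of that default can be sketched without any library code. The class below is a plain-Java illustration, not the BioJava implementation: `SimpleTranslator` is a hypothetical name, sequences are handled as plain Strings for brevity, and the table is the standard genetic code (NCBI translation table 1) read in frame 1 of the forward strand.

```java
// Plain-Java sketch of transcription and translation with the standard genetic code.
final class SimpleTranslator {
    private static final String BASES = "TCAG";
    // Amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    private static final String TABLE_1 =
        "FFLLSSSSYY**CC*W" + "LLLLPPPPHHQQRRRR" + "IIIMTTTTNNKKSSRR" + "VVVVAAAADDEEGGGG";

    // Transcription of the coding strand: every T becomes U.
    static String transcribe(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    // Translates RNA from its first base; '*' marks a stop codon, 'X' an unrecognised codon.
    // A trailing partial codon is ignored.
    static String translate(String rna) {
        String s = rna.toUpperCase().replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= s.length(); i += 3) {
            int a = BASES.indexOf(s.charAt(i));
            int b = BASES.indexOf(s.charAt(i + 1));
            int c = BASES.indexOf(s.charAt(i + 2));
            protein.append(a < 0 || b < 0 || c < 0 ? 'X' : TABLE_1.charAt(16 * a + 4 * b + c));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        String rna = transcribe("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG");
        System.out.println(translate(rna)); // prints MAIVMGR*KGAR*
    }
}
```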
A common feature of transcription is the ability to specify the base at which we start translating from DNA to RNA, which in turn affects how the resulting RNA is converted into a protein. This can be the difference between a working translation and one full of gibberish. Multiple frames of translation are also possible and are covered later. Choosing a frame requires a TranscriptionEngine, but we will work with the default one for the moment.
If you are unsure of the frame in which a portion of DNA is to be translated, you can specify a number of frames to perform the translation in. For example, a sequence can be translated in all three forward frames, with the code returning a map of the results keyed by their frame.
Transcription engines are the workhorse of the translation process. A singleton version is available and is what the methods involved in the translation process use when not given an instance of TranscriptionEngine.
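The multi-frame translation described above, with results keyed by frame, can be sketched in plain Java. The `Frame` enum and `FrameTranslator` class are hypothetical and only mimic the idea; the sketch covers the three forward frames with the standard genetic code and does not handle the reverse strand.

```java
import java.util.EnumMap;
import java.util.Map;

// Plain-Java sketch; Frame and FrameTranslator are hypothetical names, not BioJava types.
final class FrameTranslator {
    enum Frame {
        ONE(0), TWO(1), THREE(2);
        final int offset; // number of leading bases skipped before the first codon
        Frame(int offset) { this.offset = offset; }
    }

    private static final String BASES = "TCAG";
    // Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    private static final String TABLE_1 =
        "FFLLSSSSYY**CC*W" + "LLLLPPPPHHQQRRRR" + "IIIMTTTTNNKKSSRR" + "VVVVAAAADDEEGGGG";

    // Translates the coding strand starting at the frame's offset; 'X' marks unknown codons.
    static String translate(String dna, Frame frame) {
        String s = dna.toUpperCase().replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = frame.offset; i + 3 <= s.length(); i += 3) {
            int a = BASES.indexOf(s.charAt(i));
            int b = BASES.indexOf(s.charAt(i + 1));
            int c = BASES.indexOf(s.charAt(i + 2));
            protein.append(a < 0 || b < 0 || c < 0 ? 'X' : TABLE_1.charAt(16 * a + 4 * b + c));
        }
        return protein.toString();
    }

    // Returns one translation per forward frame, keyed by frame.
    static Map<Frame, String> translateForwardFrames(String dna) {
        Map<Frame, String> results = new EnumMap<>(Frame.class);
        for (Frame f : Frame.values()) {
            results.put(f, translate(dna, f));
        }
        return results;
    }
}
```

For the input `ATGGCCATT` the three frames read `ATG GCC ATT`, `TGG CCA` and `GGC CAT`, so only one of them starts with a methionine.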
If you are building a custom engine, you do this using the Builder object. The translation can also be started from the TranscriptionEngine directly, except that this returns more general objects: you get back objects which implement the Sequence interface rather than the true object type.
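The builder idea can be illustrated with a small plain-Java sketch. `MiniEngine` and its two options, `stopAtFirstStop` and `trimStop`, are hypothetical and chosen only for illustration; they are not a listing of the options the real Builder offers.

```java
// Plain-Java sketch of an engine configured through a builder; all names are hypothetical.
final class MiniEngine {
    private static final String BASES = "TCAG";
    // Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    private static final String TABLE_1 =
        "FFLLSSSSYY**CC*W" + "LLLLPPPPHHQQRRRR" + "IIIMTTTTNNKKSSRR" + "VVVVAAAADDEEGGGG";

    private final boolean stopAtFirstStop;
    private final boolean trimStop;

    private MiniEngine(Builder b) {
        this.stopAtFirstStop = b.stopAtFirstStop;
        this.trimStop = b.trimStop;
    }

    static final class Builder {
        private boolean stopAtFirstStop = false;
        private boolean trimStop = false;

        // End the peptide at the first stop codon instead of reading through it.
        Builder stopAtFirstStop(boolean value) { this.stopAtFirstStop = value; return this; }
        // Remove a stop symbol if it is the last character of the peptide.
        Builder trimStop(boolean value) { this.trimStop = value; return this; }
        MiniEngine build() { return new MiniEngine(this); }
    }

    String translate(String dna) {
        String s = dna.toUpperCase().replace('U', 'T');
        StringBuilder p = new StringBuilder();
        for (int i = 0; i + 3 <= s.length(); i += 3) {
            int a = BASES.indexOf(s.charAt(i));
            int b = BASES.indexOf(s.charAt(i + 1));
            int c = BASES.indexOf(s.charAt(i + 2));
            char aa = a < 0 || b < 0 || c < 0 ? 'X' : TABLE_1.charAt(16 * a + 4 * b + c);
            p.append(aa);
            if (aa == '*' && stopAtFirstStop) break;
        }
        if (trimStop && p.length() > 0 && p.charAt(p.length() - 1) == '*') {
            p.setLength(p.length() - 1);
        }
        return p.toString();
    }
}
```

An engine is then configured once, for example `new MiniEngine.Builder().stopAtFirstStop(true).trimStop(true).build()`, and reused for every translation.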
It is possible to define your own codon table should BioJava not support the one you need. If this does not suffice, you can implement your own instance of Table to return the required codons.