The Structural step of GenSAS is where the gene prediction and other structural feature tools are run. For eukaryotes, the tools are described below. For prokaryotes, GeneMarkS, Glimmer3, RNAmmer, Getorf, and tRNAscan-SE are available at this step. The examples below show how to set up jobs; the interfaces for the prokaryote-only tools are very similar. For detailed information on each tool, please use the links in the "Available Tools" table on the GenSAS homepage. There are two main types of tools available in this step: "Gene Prediction" and "Other Features" tools (Fig. 30A). Under the "Gene Prediction" section (Fig. 30B), there are six options, and the settings for each tool are viewed by clicking on the tool name. As with the other GenSAS tools, you can run each tool multiple times with different settings as long as each job has a unique name.
Figure 30. Structural step of GenSAS.
The first tool is Augustus, which can be run with pre-trained datasets or trained for your organism using evidence you provide. If you would like to run Augustus with one of the provided organism settings, choose the species from the drop-down menu (Fig. 31A), use or adjust the other settings, and then click on "Add Augustus Job" (Fig. 31B). If you would like to train Augustus, open the "Options for training Augustus" section (Fig. 31C). Under the training parameters section for Augustus (Fig. 31D), there are four data types. The selections available for each type depend on the evidence that has been uploaded to GenSAS and/or the tools that have already been run. BRAKER requires a BAM file to run (Fig. 31E).
Figure 31. Augustus and BRAKER settings in GenSAS.
To set up a training job for Augustus, one of the data selections/combinations from Table 2 is required. The data should be specific to the organism being annotated and can be uploaded under the "Evidence" step of GenSAS. The BAM file section will show the results from aligning uploaded RNA-seq reads to the genome with TopHat (from the "Align" step). Once the appropriate data selections are made, click on the "Add Augustus Job" button to add a training job to the job queue.
Training Option | Required Data Files to Select |
---|---|
Genes and Transcripts | "Genes Structures" (Genbank file) and "cDNA sequences" (FASTA file) |
Proteins only | "Protein sequences" (FASTA file) |
Proteins and Transcripts | "Protein sequences" (FASTA file) and "cDNA sequences" (FASTA file) |
RNA-seq reads | Select the TopHat or HISAT job results or uploaded file under "BAM File" |
Table 2. Data type combinations needed to train Augustus.
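As a planning aid, the combinations in Table 2 can be written down as a small lookup. The following is only a sketch and not part of GenSAS: the key names (`gene_structures`, `cdna_sequences`, `protein_sequences`, `bam_file`) and the function name are hypothetical labels chosen for illustration, and the check assumes that a training option is usable whenever all of its required data types are on hand.

```python
# Hypothetical labels for the data types listed in Table 2.
TRAINING_OPTIONS = {
    "Genes and Transcripts": {"gene_structures", "cdna_sequences"},
    "Proteins only": {"protein_sequences"},
    "Proteins and Transcripts": {"protein_sequences", "cdna_sequences"},
    "RNA-seq reads": {"bam_file"},
}


def available_training_options(evidence):
    """Return the Table 2 training options whose required data are all present."""
    evidence = set(evidence)
    return [
        name
        for name, required in TRAINING_OPTIONS.items()
        if required <= evidence  # every required data type is available
    ]


# Example: only protein and cDNA FASTA files have been uploaded.
print(available_training_options({"protein_sequences", "cdna_sequences"}))
# → ['Proteins only', 'Proteins and Transcripts']
```

Running this kind of check before opening the training section shows which rows of Table 2 are reachable with the evidence you currently have.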
If you are working on a non-model organism and do not have species-specific evidence, GeneMark is a self-training gene prediction tool that might work well for your organism (Fig. 32A). In GenSAS, there is also an interface for FGENESH (Fig. 32B). Due to licensing restrictions, GenSAS cannot run this tool, but you can import the results from the tool and GenSAS will parse the data and make it available for use in downstream steps. GenSAS also has Genscan (Fig. 33A), GlimmerM (Fig. 33B), and SNAP (Fig. 33C). Genscan can only process sequences shorter than 8 Mbp. If your assembly has scaffolds or contigs longer than 8 Mbp, Genscan will not return any results.
Figure 32. GeneMark and FGENESH interfaces.
Figure 33. Genscan, GlimmerM, and SNAP interfaces.
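Because of the 8 Mbp Genscan limit described above, it can help to check sequence lengths in your assembly before adding a Genscan job. The sketch below is a standalone helper, not a GenSAS feature. It assumes a plain FASTA input, takes 1 Mbp to mean 1,000,000 bp, and conservatively flags sequences of exactly 8 Mbp as well, since the text only guarantees that shorter sequences are processed. The function names are illustrative.

```python
import io

GENSCAN_MAX_BP = 8_000_000  # assumption: 8 Mbp = 8,000,000 bp


def fasta_lengths(handle):
    """Yield (sequence_id, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""  # ID is the first word of the header
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def too_long_for_genscan(handle, limit=GENSCAN_MAX_BP):
    """Return (id, length) pairs for sequences at or above the limit."""
    return [(n, size) for n, size in fasta_lengths(handle) if size >= limit]


# Tiny in-memory demo with an artificially small limit of 10 bp.
demo = io.StringIO(">scaffold_1\nACGTACGT\nACGT\n>scaffold_2\nACGT\n")
print(too_long_for_genscan(demo, limit=10))
# → [('scaffold_1', 12)]
```

For a real assembly, pass an open file instead of the demo stream, for example `with open("assembly.fasta") as fh: print(too_long_for_genscan(fh))`, where `assembly.fasta` is a placeholder file name.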
Under the "Other Features" section (Fig. 34), there are four tools that identify other structural features. To identify tRNA and rRNA sequences, use tRNAscan-SE (Fig. 35A) and RNAmmer (Fig. 35B), respectively. Getorf is a tool to identify open reading frames (Fig. 35C). The SSRs tool (Fig. 35D) locates simple sequence repeats (SSRs). Please note that many SSRs are masked by RepeatMasker and RepeatModeler, so if you want a complete list of SSRs from your sequences, it is best to run this tool on unmasked sequences.
Figure 34. The Other Features section of the Structural step.
Figure 35. Interfaces for tRNAscan-SE, RNAmmer, Getorf, and the SSRs tool.
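To illustrate why masking affects SSR detection, the sketch below finds perfect tandem repeats with a regular expression. This is not the algorithm used by the GenSAS SSRs tool, and the function name and default thresholds (motifs of 2 to 6 bp, at least 4 copies) are assumptions chosen for the example. It assumes hard-masked sequence, where masked bases are replaced with `N`.

```python
import re


def find_ssrs(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, end, motif, copies) for perfect tandem repeats in seq.

    Only A, C, G, and T are allowed in a motif, so hard-masked (N) regions
    never produce a hit.
    """
    # Shortest motif first (lazy quantifier), then at least
    # min_repeats - 1 further copies of that same motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), m.end(), motif, len(m.group(0)) // len(motif)))
    return hits


unmasked = "GGTACACACACACTT"     # contains (AC) x 5
hard_masked = "GGTNNNNNNNNNNTT"  # same region after hard masking

print(find_ssrs(unmasked))     # → [(3, 13, 'AC', 5)]
print(find_ssrs(hard_masked))  # → []
```

The repeat found in the unmasked sequence disappears once the region is replaced with `N`, which is the effect described above for masked input.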
You can track the progress of all structural tool jobs via the Job Queue on the right of the GenSAS interface. As the jobs complete, the data can be viewed in JBrowse/Apollo. If you have JBrowse open when a job completes, you will have to reload JBrowse to view the new data (please see the Apollo and JBrowse section for more information). It is very important to look at the results from each tool and determine whether the data make sense for your organism before using the results in downstream steps. Once you have finished running jobs under the "Structural" step and have looked at the results, click on the "Proceed to next step" button to move on to the "OGS" step.