AGOUTI: Annotated Genome Optimization Using Transcriptome Information¶
Table of Contents¶
- Overview
- Features
- Cite AGOUTI
- Obtain AGOUTI
- Get Started
- Prepare Inputs
- Understand Outputs
- Example
- Example Data
- Scaffoldding on Shredded Assembly
- Break-and-Continue
- Graph Visualization
- Support
Overview¶
AGOUTI uses paired-end RNA-seq reads to guide genome scaffolding and improve gene annotation. It works in the following steps:
- Extracting uniquely mapped joining-read pairs
- Denoise the set of joining-pairs using gene models
- Traverse the graph built from the noise-free data to identify scaffolding paths
- Reconcile each scaffolding path by applying rules defined in denoise step
- Updating assembly and gene annotation given the scaffolds
Features¶
- Scaffold hundreds to thousands of contigs, yielding more contiguous assemblies;
- Reduce the number of gene models and update them simultaneously;
- Shred genome assembly and corresponding annotation simultaneously;
- Record any inconsistencies with the original (input) scaffolding results;
- Recover original scaffolds from contigs not scaffolded by AGOUTI;
- Support break-and-continue feature such that some time-consuming steps can be skipped if the previous run is successful;
- Output graph in dot format for visualization
Cite AGOUTI¶
AGOUTI is published here:
http://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0136-3
Please cite AGOUTI using the following format:
Zhang, S. V, Zhuo, L., & Hahn, M. W. (2016). AGOUTI: improving genome assembly and annotation using transcriptome data. GigaScience, 5(1), 1.12. doi:10.1186/s13742-016-0136-3
Important: please submit your problem through GitHub issue tracker.
Obtain AGOUTI¶
AGOUTI is designed to be light-weight by only requiring SAMtools, python 2.7 or above, and git 1.8.5 or later.
To download AGOUTI, please use:
git clone https://github.com/svm-zhang/AGOUTI.git
To get the current version of the download, simply go to AGOUTI folder and do
python agouti.py --version
You should see, for example, something like this:
AGOUTI v0.2.4-dirty
Since v0.2.4, you can check if there is a new version by running:
python agouti.py update
It is RECOMMENDED to run update before you run AGOUTI.
If you wish to use a specific version of AGOUTI, you can first fetch all the versions available:
git fetch --all
Then show all available versions using:
git tag
Next you simply need to specify a version. For example, if you’d like to use v0.2.3:
git checkout v0.2.3 -b v0.2.3
You can also click on releases from AGOUTI github page to see all available versions.
Get Started¶
In its simplest usage, AGOUTI takes three inputs: an initial genome assembly in FASTA format, paired-end RNA-seq reads mapped against the assembly in BAM format, and gene predictions from the initial assembly in GFF3 format. For instance:
python agouti.py scaffold \
-assembly example.fasta \
-bam example.bam \
-gff example.gff \
-outdir ./example
This will produce a scaffoled assembly in FASTA format, and a updated
gene models in GFF3 format. All files (including the intermediate files)
will be stored under a directory specified by -outdir
, “example” in
this case.
Prepare Inputs¶
Genome Assembly¶
AGOUTI accepts assemblies as both contigs and scaffolds. In its scaffold form, AGOUTI breaks assemblies at gaps of a minimum lengths, essentially producing a shredded/split assembly (see Shred Assembly). AGOUTI scaffolds on the split assembly, and report any inconsistencies between the RNA-based scaffolding and the original scaffolding.
To shred a given assembly at gaps of at least 25 bp:
python agouti.py shred \
-assembly example.fasta \
-p example \
-mlg 25
This produces a shredded assembly: example.ctg.fasta
, and a file of
a format similar to Fasta: example.shred.info.txt
. Each header gives
IDs of sequences in the original assembly. Under each header is a list
of pairs of shredded contigs and the length of gaps between them. A
sequence without any gaps will be by itself, and NA are used for such
cases.
It is very important to use this split assembly in the following reads-mapping and gene prediction.
SAM/BAM File¶
Assuming you have a dataset of paired-end RNA-seq reads,
example.1.fq
and example.2.fq
, and an assembly generated from
either an assembler of your favorite or shredded by AGOUTI,
example.fasta
or example.ctg.fasta
. You will first need to map
the RNA-seq data against the assembly using a short-reads mapper, such
as Bowtie2 or BWA. For example,
bwa index example.fasta
bwa mem -M example.fasta example.1.fq example.2.fq | samtools view -Sb - > example.bam
This produces a mapping results in BAM format. AGOUTI uses this BAM file for scaffolding. More specifically, it reads the file and extracts joining-pairs. A joining-pair is defined as one with both ends mapped to different contigs. AGOUTI uses only uniquely mapped ones by checking mapping quality. Short-reads mappers such as BWA, Bowtie2 uses a non-zero mapping quality to define unique mapping. If the mapper you are using does not use quality to mark ambiguous mapping, then you must first process your SAM/BAM file before running AGOUTI.
Several more things worth of noting:
- Please run samtools flagstat to get stats of the mapping, and looks particular for number of pairs mapped to different chromosomes. If none, then AGOUTI will not be able to do any scaffolding.
- Please make sure the BAM is sorted by reads name, not coordiantes.
Gene Models¶
To run AGOUTI, you will also need a set of gene models predicted from the assembly. For instance,
augustus --AUGUSTUS_CONFIG_PATH=[path to augustus config file] -gff3=on --species=[your sepcies] example.fasta > example.gff
At the end of gene prediction, you will now have a set of gene models predicted from the assembly. You can choose any * ab inito * gene predictor as long as it spits out the models in GFF format. More specifically, AGOUTI looks for the following information:
- lines annotated as
gene
- contig ID
- gene ID, e.g.
ID=gene1
from the attribute column (i.e. last column) - start and stop positions of the gene
- strand
- lines annotated as
CDS
- start and stop positions
Important Notes
- AGOUTI is yet to support the GTF format. It will be in the near future. I will also try to provide a converter script from GTF to GFF.
- If your GFF file has FASTA sequences at the end (e.g. generated from MAKER pipeline), please make sure to use verions v0.2.5 or above.
- If AGOUTI fails to find any gene models, it will stop.
Understand Outputs¶
AGOUTI outputs its results to a base directory specified by -outdir
.
Under the base director, there are several sub-folders created, each
corresponding to a step built in AGOUTI. A run of AGOUTI using the
command-line setting demonstrated in Getting Started will generated
a structured output as shown in the following screenshot:
Each subfolder includes three types of file:
- general progress meter info
- debug info
- intermediate outputs
To get a file with debug info you will need to specify -debug
. An
intermediate file can have all the joining-pairs, the denoised set of
joining-pairs, the graph in DOT format, etc. Some intermediate files are
important to support the break-and-continue feature, e.g. the file with
the noise-free set of joining-pairs (see below for more details).
The agouti.main.log
is prefixed with the string specified by -p
,
so do all the other files generated by AGOUTI. The sequence ID in the
final assembly will also be as this prefix. By default, agouti
will
be used.
The final assembly and the updated gene models can be found
under the base directory, example
, along with plain text files of
useful information, such as scaffolding paths, gene paths, differences
between scaffolds generated by AGOUTI and original scaffolding.
Example¶
Scaffoldding using joining-pairs with a minimum mapping quality of 20, a maximum of 5% mismtaches:
python agouti.py scaffold \
-assembly example.fasta \
-bam example.bam \
-gff example.gff \
-outdir ./example \
-minMQ 20 -maxFracMM 0.05
Scaffolding without updating gene model (v0.3.0 or above):
python agouti.py scaffold \
-assembly example.fasta \
-bam example.bam \
-gff example.gff \
-outdir ./example -no_update_gff
Scaffolding a shredded assembly and report any inconsistencies between RNA-seq based scaffolding and orignial scaffolding:
python agouti.py scaffold \
-assembly example.shred.fasta \
-bam example.bam \
-gff example.gff \
-outdir ./example \
-shredpath example.shred.info.txt
Shredding an assembly and annotation simultaneously (v0.3.0 or above):
python agouti.py shred -assembly example.fasta -gff example.gff -p example
This will generate example.shred.info.txt
and
example.shred.ctg.gff
Example Data¶
Here gives one example data set that we used in our paper.
Scaffoldding on Shredded Assembly¶
Why shredding original assemblies¶
There are two benefits you can get from shredding the original assembly (you can optionally skip this entire section if your assembly is in the contig form, and no previous scaffolding is attempted). First, in the case of a gene spanning across a gap, the prediction tends to report two gene models, one for each side of the gap. This is because, to our knowledge, many programs cannot predict across gaps, especially those longer ones. Breaking at the gap and using RNA-seq data, AGOUTI therefore can correct for it by merging the two gene models, given there were connections between the two shredded contigs.
Second, scaffolding using RNA-seq reads can produce alternative paths that are based on evidences of gene models. Any inconsistencies with ones given by DNA-based scaffolding can provide useful information for further improving genome assembly.
The downside of scaffolding this way is that sequences, especially those from regions of low gene density, lose their context with others. This makes all efforts of doing DNA-based scaffolding, if any, become futile. To avoid such loss, AGOUTI (v.0.3.0 or above) tries to recover the original connections between contigs as much as possible (see Recovering Original Paths section below).
When shredding original assemblies¶
It’s always recommended that you run AGOUTI directly on your scaffolds, before trying to tear it up. AGOUTI will simply try to find additionally connections between scaffolds that were missed by original scaffolding programs. This should be the firs best practice to do, regardless of how many pieces your assembly is composed of.
If you’d like to fix gene models flanking gaps and/or identify any inconsistencies from your original DNA-based scaffolding, shredding the assembly can be helpful. We are currently working on a new module that can correct for split gene models interrupted by gaps, without shredding the assembly. This way AGOUTI can preserve as much as possible the contiguity, and further improve genome annotation at the same time. This module will be available soon.
Shredding Practices¶
First: If you shred the assembly and predict gene model on the shredded assembly using programs like AUGUSTUS, the following command line is suggested:
python agouti.py shred -assembly scaffold.fasta -p scaffold
This will generate scaffold.ctg.fasta
, scaffold.shred.info.txt
,
and two files for debugging purpose. You then run, for instance AUGUSTUS
and BWA, on the shredded assembly to get scaffold.ctg.gff
and
scaffold.ctg.bam`, respectively. To scaffold, run
python agouti.py scaffold \
-assembly scaffold.ctg.fasta \
-bam scaffold.ctg.bam \
-gff scaffold.ctg.gff \
-outdir ./example \
-shredpath scaffold.shred.info.txt
With the scaffold.shred.info.txt
, AGOUTI will try to recover the
original scaffolding path. To disable this feature, you can simply not
specify -shredpath
option.
Second: Many people found laborious to repeat gene prediction on the
shredded assembly, especially in cases the genome is huges. AGOUTI
handle such cases by simultaneously shredding the annotation company the
sequence. The only difference is to specify -gff
option in the shred
command line.
python agouti.py shred -assembly scaffold.fasta -gff scaffold.gff -p scaffold
In addition to the files described above, this also generates
scaffold.shred.ctg.gff
. This gene annotation is then used for
scaffolding.
python agouti.py scaffold \
-assembly scaffold.ctg.fasta \
-bam scaffold.ctg.bam \
-gff scaffold.shred.ctg.gff \
-outdir ./example \
-shredpath scaffold.shred.info.txt
In this scenario, when AGOUTI tries to recover the original path, it will also connect the shredded gene models accordingly (see below).
Shred Assembly¶
Given an assembly in its scaffold form, AGOUTI can shred scaffolds into
contigs at gaps of a minimum length (5 by default, user-tunable). The
following figure gives an example of how it works. Let’s say a scaffold
called scaffold 1
in the assembly. This scaffold consists of three
stretches of gaps of various lengths, 5, 3, and 9, respectively. By
default, AGOUTI shreds it into three contigs, Scaffold_1_0
,
Scaffold_1_1
, and Scaffold_1_2
. AGOUTI does not cut at the
second gap because it has a length of 3. Notably, AGOUTI uses
SEQID_INDEX
to tell the order of contigs in the given original
scaffold. For scaffolds without gaps, AGOUTI does not split them.)
Shred Annotation¶
Since v0.3.0, AGOUTI is also able to shred gene annotation matching the
give assembly. It compares start and end positions of features with
coordinates of cut sites, and updates annotation accordingly. There are
five types of features AGOUTI cares: gene, exon, CDS, firve_prime_UTR,
three_prime_UTR. The following figure gives an example of how it works.
Let’s use the same scaffold (i.e. Scaffold_1
) shredded in the
picture above. Assume that there is a gene span across the second cut
site, and it consists of three exons (green box). AGOUTI splits the
assembly such that one gene becomes two (boxes in different colors)
sitting on two different contigs. AGOUTI assigns them with different ID,
in a similar fashion as names of shredded contigs, GENEID_INDEX
.
This naming tells 1) whether two shredded genes belong to a single one;
and 2) the order.
Recover Original Paths¶
AGOUTI will try to recover original connections for shredded contigs to preserve contiguity as much as possible. To do so, contigs that are not scaffolded by AGOUTI are first identified. For a pair of such contigs, AGOUTI then re-connects them as long as they are next to each other in the original scaffolding path. Consider an example in which a scaffold is shredded into 5 contigs: A, B, C, D, and E, and AGOUTI is able to scaffold C and D. This leaves A, B and E untouched. Given our rules, AGOUTI will re-connect A and B without appending E to B, because B and E are not consecutive in the original path. If annotation is shredded at the same time with the assembly, AGOUTI will also merge them during the process.
Report Inconsistencies¶
Any inconsistencies with the original path can provide useful
information to further improve genome assembly. AGOUTI provides
alternative scaffolding paths in the form of network, named
[prefix].consistency.nw
. This network consists of 4 columns,
from
, interaction
, to
, and type
. from
and to
are
source and target nodes/contig. If two contigs connected from the same
original scaffold, a type of agouti_same
will be assigned;
agouti_diff
given otherwise. In the former case if the two contigs
are not consecutive, AGOUTI reports all the skipped
ones in between.
In the latter scenario on the other hand, AGOUTI additionally gives
immediate neighbors around each of the two contigs according to original
paths. In the network file, these connections are typed original
.
Consider an example illustrated in the figure below. AGOUTI connects
three pairs of shredded contigs. For contig scaf669029_3
and
scaf669029_7
, they come from the same original scaffold (blue solid
line), which can be tell by the string before the underscore. Because
they are not consecutive (index 3 and 7), scaf669029_4
,
scaf669029_5
, and scaf669029_6
are reported to tell the contigs
being skipped (pink dotted line).
In the same example, AGOUTI also connects two contigs from different
original scaffolds (red zigzag line), scaf669029_3
and
scaf668522_35
. AGOUTI additionally reports immediate neighbors of
each of the two contigs (connected by green arrowed line). Both contigs
come from the ends of their corresponding original scaffolds, and a path
between the two can suggest a connection between the same two original
scaffolds. Connections between two contigs from the middle of their
original scaffolds, on the other hand, flag inconsistencies, e.g.
scaf668522_30
and scaf669547_5
.
The network file is Cytoscape-ready, and we also provide a style file
consistency.xml
under the cytoscape
folder. The example
demonstrated here is only a tiny part of the network. You can get the
full network by simply load the example.consistency.nw
file into
Cytoscape.
Break-and-Continue¶
AGOUTI is built with a couple of modules. The output of current module
will be taken as the input as the next module. Given the same input,
modules such as extracting joining-pairs from BAM file, spits out the
same intermediate results. AGOUTI therefore tries to save some running
time by skipping such steps if they were finished successfully from
previous runs. To use this feature, simply run AGOUTI the second time
with the same output directory and output prefix as the previous run. If
you desire a fresh start, simply use -overwrite
to overwrite all
results generated previously, or gives a new prefix.
Graph Visualization¶
AGOUTI makes the scaffolding graph accessible to users. Under
scaffolding
folder, you can find a file named after
[prefix].agouti_scaffolding.graph.dot
. The dot file can be directly
loaded in packages like Graphviz. In the graph, contigs/vertices are in
black circle, while there are two color codings for edges. Ones in red
are the scaffolding path in the final assembly, and others in black are
simply edges that were not traversed. Edges in dotted style represent
connections with a number of supporting joining-pairs lower than the
minimum specified.
Support¶
Please feel free to submit any issues through GitHub issue tracker.
Any comments are welcome as well!