Refine Annotation

Depending on the number and location of the differences found between the reference CDS and the target sequences, the Splign-Compart (NCBI) and the ProSplign-Compart (NCBI) options do not always provide a complete CDS annotation. Nevertheless, if all exon-intron splice junctions are covered in the partial CDS annotation it may be possible to obtain a complete CDS annotation by combining the results produced by the Splign-Compart (NCBI) and the ProSplign-Compart (NCBI) options and information on putative open reading frames generated by the getorf application. When using the Refine Annotation option the user must provide the name of the FASTA file generated by the Splign-Compart (NCBI) or the ProSplign-Compart (NCBI) options, as well as a FASTA file with approximate genome region where the gene is located (for instance the file with the Fasta extension that is saved by the ProSplign-Compart (NCBI) option in the /Export Files/nucleotides folder, located in the specified repository folder. Moreover, it must specify the size of the region used to determine if there is an overlap, as well as the minimum and maximum size of the open reading frames to be reported by the getorf application.

The Refine Annotation option will automatically perform the following steps:

  1. Get all open reading frames (between STOP codons) for the provided genome region (only the plus strand is considered thus it is important to give the genome sequence in the proper orientation) and sort them by sequence size (from the longest to the shortest).
  2. For each sequence in the partial CDS annotation file: extract the last n positions of the sequence and try to find a match in the open reading frames obtained in (1); only the first match is considered; if a match is found add the piece of sequence found after the hit to the sequence from where the motif used for the search originates from.
  3. For all sequences obtained in (2): compare each possible pair of sequences in order to extract the first n positions of the first sequence and try to find a match in the second sequence; if a match is found erase the sequence region from the hit to the end of the target sequence and concatenate the two sequences.
  4. Repeat step (3) until there is no possibility of merging two sequences.
  5. Get all open reading frames (between a START and a STOP codon) for the provided genome region and sort them by size (from the longest to the shortest).
  6. For each sequence obtained in (4): extract the first n positions of the sequence and try to find a match in the open reading frames obtained in (5); only the first match is considered; if a hit is found, erase the sequence of the open reading frame showing the hit from the place where the motif is found until the end and prefix the first sequence being processed with the sequence just obtained.
  7. For all sequences obtained in (6): compare each possible pair of sequences in order to extract the first n positions of the first sequence and try to find a match in the second sequence; if a match is found erase the sequence region from the hit to the end of the target sequence and concatenate the two sequences.
  8. Repeat step (7) until there is no possibility of merging two sequences.

Refine annotation