FOLDALIGN

Department of Genetics
Washington University in St. Louis
School of Medicine


NAME

FOLDALIGN - Local structural alignments of unaligned RNA sequences.

SYNOPSIS

FOLDALIGN [-spec file] [-round number] [-maxround number] [-fasta file] [-sort] [-nopre] [-nomain] [-help|-]  

VERSION

This version is: 0.02.  

DESCRIPTION

FOLDALIGN performs local structural alignment of a set of unaligned RNA sequences. The current version is restricted to stem-loop type structures.

The algorithm is based on a dynamic programming approach that takes sequence and structure constraints into consideration simultaneously. It computes a score from a score matrix that lists the cost of replacing any pair of bases in one sequence with any other pair of bases in another sequence.

The multiple alignments are constructed by a greedy algorithm in which a sequence is aligned to r-1 already aligned sequences (through all pairwise comparisons of sequences), r>1. The s best alignments of r such sequences, round r, are saved and used to construct multiple alignments consisting of r+1 sequences, that is, round r+1.
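The round structure described above can be sketched in a few lines of Python. This is only an illustration of the greedy bookkeeping, not the FOLDALIGN implementation: the names `greedy_rounds`, `toy_score`, and `TOY` are hypothetical, and the toy score simply sums made-up pairwise weights instead of running a structural alignment.

```python
from typing import Callable, Dict, List, Tuple

# (sequence numbers in the order they were added, score)
Alignment = Tuple[Tuple[int, ...], float]


def greedy_rounds(num_seqs: int,
                  score_fn: Callable[[Tuple[int, ...]], float],
                  keep: int,
                  max_round: int) -> Dict[int, List[Alignment]]:
    """Return {r: kept alignments of r sequences} for r = 2..max_round."""
    # Round 2: all pairwise comparisons.
    candidates = [((i, j), score_fn((i, j)))
                  for i in range(num_seqs) for j in range(i + 1, num_seqs)]
    rounds: Dict[int, List[Alignment]] = {}
    r = 2
    while candidates:
        # Keep the `keep` highest scoring alignments of this round
        # (the sort is stable, so earlier candidates win ties).
        candidates.sort(key=lambda a: a[1], reverse=True)
        kept = candidates[:keep]
        rounds[r] = kept
        if r >= max_round:
            break
        # Round r+1: add one new sequence to each kept alignment.
        candidates = [(members + (j,), score_fn(members + (j,)))
                      for members, _ in kept
                      for j in range(num_seqs) if j not in members]
        r += 1
    return rounds


# Hypothetical pairwise weights standing in for real alignment scores.
TOY = {(0, 1): 10, (0, 2): 3, (0, 3): 1, (1, 2): 8, (1, 3): 2, (2, 3): 5}


def toy_score(members: Tuple[int, ...]) -> float:
    """Toy score: sum of the pairwise weights of all members."""
    return sum(TOY[tuple(sorted((a, b)))]
               for i, a in enumerate(members) for b in members[i + 1:])


rounds = greedy_rounds(4, toy_score, keep=2, max_round=3)
for r, kept in rounds.items():
    print(r, kept)
```

In this toy run, round 3 keeps the same three sequences twice, added in different orders. Duplicates of that kind are what a filter on sequence content has to deal with.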

The score growth of the best (highest scoring) alignment with increasing round number is shown for both the pure score and the round-normalized score. The best overall alignments from these two plots are also written to a separate data file with extension .best. We recommend that the user consult the references listed below for further details on the method.

The scoring matrix file lists the minimum loop size, the base-pair elongation factor, and the base-pair substitution costs. The base-pair elongation factor favors nested, stacked base-pairs and can be considered a pseudo-energy term. Four matrices are listed, corresponding to initiation and elongation costs, with and without a score for actual base-pairs, which by default are AU, CG, and GU.

The program output for each round is stored in files with extension .align.'r'. The files with extensions .align.'r'.pre.orig.gz and .align.'r'.main.orig.gz contain the alignments at round 'r' that were discarded by the filter schemes. These files are usually relatively large, but their creation can be turned off (see the -nopre and -nomain options).

PostScript files of the score growth are made as well. In addition, the data are placed in .num files.

The output format is illustrated by the following alignment of three sequences with names 1, 2, and 3:


  3   170          
    2     1     3    32 
  GGAUUUGAGAUACACGGAAGUGGACUCUCC
  (((...(((...((......))..))))))
    3     2     4    32 
  GCA-UUGAGAAACACGUUUGUGGACUCUGU
  (((-..(((...((......))..))))))
    1     2     4    31 
  GAA-UUGAGAAACAC-UAACUGGCCUCUUU
  (((-..(((...((.-....))..))))))

The first line indicates that this round 3 alignment of the three sequences obtained a score of 170. The second line lists sequence, entity, startpos, and endpos. 'sequence' is either a number or the name given in a fasta file. 'entity' is either 1 or 2, depending on whether the sequence has just been aligned (1) or belongs to an entity of already aligned sequences (2). The third line contains the region 'startpos' to 'endpos' of the sequence. The fourth line contains the structural assignment made by FOLDALIGN, in which corresponding left and right parentheses indicate a base-pair. The remaining lines show the aligned regions of the remaining sequences in the alignment, in the same three-line layout.
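As an illustration of the layout just described, the following Python sketch reads one alignment record and extracts the base-paired columns from the parenthesis line. It is not part of the FOLDALIGN package; the names `AlignedSeq`, `parse_alignment`, and `base_pairs` are hypothetical, and the sketch assumes a record consists of one header line followed by three lines per sequence, as in the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple

EXAMPLE = """\
  3   170
    2     1     3    32
  GGAUUUGAGAUACACGGAAGUGGACUCUCC
  (((...(((...((......))..))))))
    3     2     4    32
  GCA-UUGAGAAACACGUUUGUGGACUCUGU
  (((-..(((...((......))..))))))
    1     2     4    31
  GAA-UUGAGAAACAC-UAACUGGCCUCUUU
  (((-..(((...((.-....))..))))))
"""


@dataclass
class AlignedSeq:
    name: str        # number, or fasta name in the .txt files
    entity: int      # 1 = just aligned, 2 = already aligned entity
    start: int
    end: int
    sequence: str    # aligned region, '-' marks a gap
    structure: str   # '(' and ')' mark base-paired columns


def parse_alignment(text: str) -> Tuple[int, float, List[AlignedSeq]]:
    """Parse a single alignment record into (round, score, rows)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    round_no, score = lines[0].split()
    if (len(lines) - 1) % 3 != 0:
        raise ValueError("expected three lines per sequence")
    rows = []
    for i in range(1, len(lines), 3):
        # Split from the right so a name containing spaces stays intact.
        name, entity, start, end = lines[i].rsplit(None, 3)
        rows.append(AlignedSeq(name, int(entity), int(start), int(end),
                               lines[i + 1], lines[i + 2]))
    return int(round_no), float(score), rows


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (open, close) column pairs of matching parentheses."""
    stack, pairs = [], []
    for col, ch in enumerate(structure):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at column %d" % col)
            pairs.append((stack.pop(), col))
    if stack:
        raise ValueError("unbalanced '('")
    return sorted(pairs)


round_no, score, rows = parse_alignment(EXAMPLE)
for row in rows:
    print(row.name, row.entity, row.start, row.end, base_pairs(row.structure))
```

The column indices returned by `base_pairs` refer to positions in the aligned (gapped) line, not to positions in the original sequence.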

The .align.'r' files always contain the (corresponding) numeric sequence names, whereas the .align.'r'.txt files (if created) always contain the names provided. The .orig.gz files contain the corresponding fasta names if a fasta file is used as input data (which also causes the .txt files to be created). For details, see the pointer to the example directory below.

The FOLDALIGN package contains a program, fa2col, which converts the FOLDALIGN output format into a standard column format; see http://www.bioinf.au.dk/colformat for details.

The filter schemes work in two ways. The pre-filter scheme selects the best alignment among alignments that consist of the same sequences aligned in a different order, which can sometimes result in slightly different alignments. (If two different combinations have the same score, the one appearing first in the data file is selected.) The main-filter then selects the s highest scoring alignments to be used in the next round of multiple alignments.
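The two filters can be pictured with a short Python sketch. It is a simplified model, not the code of the prefilter and mainfilter executables: an alignment is reduced to a tuple of sequence numbers plus a score, and the function names `pre_filter` and `main_filter` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# (sequence numbers in the order they were aligned, score)
Alignment = Tuple[Tuple[int, ...], float]


def pre_filter(alignments: List[Alignment]) -> List[Alignment]:
    """Keep the best alignment for each set of sequences.

    Alignments made of the same sequences in a different order compete
    with each other; on equal scores the one seen first is kept.
    """
    best: Dict[FrozenSet[int], Alignment] = {}
    for members, score in alignments:
        key = frozenset(members)
        # Strict '>' so that the first of two tied alignments survives.
        if key not in best or score > best[key][1]:
            best[key] = (members, score)
    return list(best.values())


def main_filter(alignments: List[Alignment], s: int) -> List[Alignment]:
    """Keep the s highest scoring alignments for the next round."""
    return sorted(alignments, key=lambda a: a[1], reverse=True)[:s]


candidates = [
    ((0, 1, 2), 21.0),
    ((1, 2, 0), 21.0),   # same sequences, same score: the first one wins
    ((1, 2, 3), 15.0),
    ((0, 1, 3), 13.0),
    ((3, 1, 0), 14.0),   # same sequences as the line above, higher score
]
unique = pre_filter(candidates)
survivors = main_filter(unique, s=2)
print(unique)
print(survivors)
```

Running the pre-filter before the main-filter keeps the s retained alignments from being several orderings of the same sequence set.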

The input data is either in the FOLDALIGN format (see below) or in standard fasta format. Using fasta format generates files with extension .align.'r'.txt in which the sequences are identified by their fasta names rather than by their order of appearance in the data file, as in the FOLDALIGN format. For details, see the pointer to the example directory below. An example of the FOLDALIGN format for the three sequences mentioned above is:


  3
  >1
  ACGGGAAUUGAGAAACACUAACUGGCCUCUUUCCCCC
  >2
  UUGGAUUUGAGAUACACGGAAGUGGACUCUCCGGU
  >3
  ACGUGCAUUGAGAAACACGUUUGUGGACUCUGUAAA

where the first line lists the number of sequences in the data set. Sequence length is currently limited to 400 nucleotides.
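For readers who want to prepare such a file by hand, here is a minimal Python sketch that turns fasta text into the layout shown above. It is not the conversion performed by the -fasta option, whose details are not documented here. The function names `read_fasta` and `to_foldalign` are hypothetical, and the sketch assumes that sequences are numbered by order of appearance and that the leading indentation in the example is not significant.

```python
from typing import Dict, List, Tuple

MAX_LEN = 400  # sequence length limit stated in this man page


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs from fasta text."""
    records: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines
    return [(name, seq) for name, seq in records]


def to_foldalign(records: List[Tuple[str, str]]) -> Tuple[str, Dict[int, str]]:
    """Return (FOLDALIGN-format text, {number: original name})."""
    lines = [str(len(records))]          # first line: number of sequences
    names: Dict[int, str] = {}
    for number, (name, seq) in enumerate(records, start=1):
        if len(seq) > MAX_LEN:
            raise ValueError(f"{name}: {len(seq)} nt exceeds {MAX_LEN}")
        names[number] = name
        lines.append(f">{number}")       # numeric name by order of appearance
        lines.append(seq)
    return "\n".join(lines) + "\n", names


fasta = ">seqA\nACGGGAAUUGAG\nAAACAC\n>seqB\nUUGGAUUUGAG\n"
text, names = to_foldalign(read_fasta(fasta))
print(text, end="")
print(names)
```

The returned mapping makes it possible to translate the numeric names in the .align.'r' files back to the original fasta names.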

For details, see the references listed below and the FOLDALIGN webpages:

http://www.bioinf.au.dk/FOLDALIGN
http://bifrost.wustl.edu/FOLDALIGN

where online access is available.

Examples of data files can be found in $FOLDALIGN_DIR/examples/data, and examples of score matrices in $FOLDALIGN_DIR/lib.

REQUIRED EXECUTABLES

FOLDALIGN requires the existence of $FOLDALIGN_DIR/bin/ with the executables

* foldalign, prefilter, mainfilter, sortfalign

as well as the existence of $FOLDALIGN_DIR/lib/*.xmgr from the FOLDALIGN package.

FOLDALIGN also requires the following system commands to be installed

* gawk, cat, touch, head, tail, mv, rm, gzip, gunzip, and grbatch (the non-interactive version of xmgr, http://plasma-gate.weizmann.ac.il/Xmgr).

 

OPTIONS

FOLDALIGN accepts the following options.

-spec file
Required. The specification file has the following form:

    # spec file
    Runname:                results/r17
    Scoring file:           ../lib/score_a0.mat
    Sequence filename:      data/r17.fa
    Align range:            #  4  40
    Filter parameters:      2  2  20 30 1

The fields in the specification file are as follows:
Runname: the path/filename prefix of the output files, which receive the extensions mentioned above.
Scoring file: the path to the scoring matrix.
Sequence filename: the sequence data in the FOLDALIGN format. If the '-fasta' option is used, this file is generated from the fasta file.
Align range: the type of comparison. '# 4' means a 4 nt difference in window length in the comparison, and '40' is the maximum motif size. '# <number>' can be replaced by 'E <number>', which equalizes the length between the sequences compared, i.e., adds the difference in sequence length to the align range.
Filter parameters: #1: completely exhaustive up to and including round #1; #2: pre-filter scheme up to and including round #2; #3: stop when done with round #3; #4: the number of best alignments to keep between rounds '#2' and '#3'; #5: max minimum size of any of the two entities compared.


Lines whose first character is '#' are treated as comment lines and are ignored when the spec file is read.
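To check a specification file before a run, the fields can be read with a few lines of Python. This sketch is not how FOLDALIGN itself reads the file. The function names `parse_spec` and `interpret` are hypothetical, and the sketch assumes that each field is written as 'Name: value' on one line. It also tolerates leading whitespace before a comment '#', which goes slightly beyond the rule stated above.

```python
from typing import Dict, List, Tuple

SPEC = """\
# spec file
Runname:                results/r17
Scoring file:           ../lib/score_a0.mat
Sequence filename:      data/r17.fa
Align range:            #  4  40
Filter parameters:      2  2  20 30 1
"""


def parse_spec(text: str) -> Dict[str, str]:
    """Return {field name: raw value}, skipping blank and comment lines."""
    spec: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Split at the first ':' only, so a '#' inside the value
        # (as in 'Align range') is kept as part of the value.
        key, _, value = stripped.partition(":")
        spec[key.strip()] = value.strip()
    return spec


def interpret(spec: Dict[str, str]) -> Tuple[Tuple[str, int, int], List[int]]:
    """Return ((mode, difference, max motif size), five filter parameters)."""
    mode, diff, max_motif = spec["Align range"].split()
    if mode not in ("#", "E"):
        raise ValueError(f"unknown align range mode: {mode!r}")
    filters = [int(x) for x in spec["Filter parameters"].split()]
    if len(filters) != 5:
        raise ValueError("expected five filter parameters")
    return (mode, int(diff), int(max_motif)), filters


spec = parse_spec(SPEC)
align_range, filters = interpret(spec)
print(spec["Runname"], align_range, filters)
```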

-round number
Start multiple alignment by performing alignments that have 'round' number of sequences in them.

-maxround number
Stop after performing multiple alignments that have 'maxround' number of sequences in them regardless of what is written in the specification file.

-fasta file
Use a fasta file instead of the FOLDALIGN format file. When this option is used, alignment files 'runname'.align.'round'.txt are created. These files contain the sequence names listed in the fasta file. If the file is given as '-', data is assumed to come from standard input, and a fasta file 'runname'.fasta will be created.

-sort
Sort results for each round by obtained score.

-nopre
Do not make the .pre.orig.gz files.

-nomain
Do not make the .main.orig.gz files.

-help, -
Print options.

 

EXAMPLES

A standard run using the fasta file foo.fasta and the specification file specfoo:

FOLDALIGN -spec specfoo -fasta foo.fasta

A standard run using the default FOLDALIGN input data format:

FOLDALIGN -spec specfoo
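When FOLDALIGN is driven from a script, the command line can be assembled from the options listed in the SYNOPSIS. The Python sketch below only builds the argument list. The helper name `foldalign_command` and its keyword names are hypothetical, and the sketch assumes the program is invoked as FOLDALIGN, as in the examples above.

```python
from typing import List, Optional


def foldalign_command(spec: str,
                      fasta: Optional[str] = None,
                      start_round: Optional[int] = None,
                      max_round: Optional[int] = None,
                      sort: bool = False,
                      nopre: bool = False,
                      nomain: bool = False) -> List[str]:
    """Build an argument list using only the options from the SYNOPSIS."""
    cmd = ["FOLDALIGN", "-spec", spec]
    if start_round is not None:
        cmd += ["-round", str(start_round)]
    if max_round is not None:
        cmd += ["-maxround", str(max_round)]
    if fasta is not None:
        cmd += ["-fasta", fasta]
    if sort:
        cmd.append("-sort")
    if nopre:
        cmd.append("-nopre")      # skip the .pre.orig.gz files
    if nomain:
        cmd.append("-nomain")     # skip the .main.orig.gz files
    return cmd


standard = foldalign_command("specfoo", fasta="foo.fasta")
compact = foldalign_command("specfoo", max_round=5, sort=True,
                            nopre=True, nomain=True)
print(" ".join(standard))
print(" ".join(compact))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where FOLDALIGN and the required executables are installed.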
 

AUTHORS

Laurie J. Heyer and Jan Gorodkin
(laheyer@davidson.edu and gorodkin@bioinf.au.dk)  

REFERENCES

For any use of FOLDALIGN programs please cite:

J. Gorodkin, L. J. Heyer and G. D. Stormo. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research, 25:18, 3724-3732, 1997.

J. Gorodkin, L. J. Heyer and G. D. Stormo. Finding Common Sequence and Structure Motifs in a set of RNA sequences. Proceedings ISMB-97, pp. 120-123, 1997.  

CONTACT

If you have questions, bug reports, complaints, comments, or requests, email foldalign@bioinf.au.dk.

SEE ALSO

fa2col, sortfalign, prefilter, mainfilter, makefamatrix, genmatrix.

 

Comments, questions, etc., email foldalign@bioinf.au.dk

Last updated July 18th, 2001 by Jan Gorodkin