FOLDALIGN
Department of Genetics, Washington University in St. Louis School of Medicine
NAME
FOLDALIGN - Local structural alignments of unaligned RNA sequences.

SYNOPSIS
FOLDALIGN [-spec file] [-round number] [-maxround number] [-fasta file] [-sort] [-nopre] [-nomain] [-help|-]

VERSION

DESCRIPTION
FOLDALIGN performs local structural alignment of a set of unaligned RNA sequences. The current version is constrained to stem-loop type structures only. The algorithm is based on a dynamic programming approach that simultaneously takes sequence and structure constraints into consideration and computes a score based on a score matrix that lists the cost of replacing any pair of bases in one sequence with any other pair of bases in another sequence.

The multiple alignments are constructed by a greedy algorithm in which a sequence is aligned to r-1 already aligned sequences (through all pairwise comparisons of sequences), r>1. The s best alignments of r such sequences (round r) are saved and used to construct multiple alignments consisting of r+1 sequences (round r+1). The score growth of the best (highest-scoring) alignment for increasing round is shown for the pure score and for the round-normalized score. The best overall alignments from these two plots are also written to a separate data file with extension .best. We recommend that the user consult the references listed below for further details on the method.

The scoring matrix lists the minimum loop size, the base-pair elongation factor, and the base-pair substitution costs. The base-pair elongation factor favors nested, stacked base-pairs and can be considered a pseudo-energy term. Four matrices are listed, corresponding to initiation and elongation costs, and to matrices with and without scores for actual base-pairs, which by default are AU, CG, and GU.

The program output for each round is stored in files with extension .align.'r'. The files with extension .align.'r'.pre.orig.gz and .align.'r'.main.orig.gz contain data on alignments at round 'r' that were discarded by the filter schemes.
These are usually relatively large files, but their creation can be turned off. PostScript files of the score growth are produced as well, and the underlying data are placed in .num files.

The output format. An example of an alignment of three sequences with names 1, 2, and 3 is:
The first line indicates that the round 3 alignment of these three sequences obtained a score of 170. The second line lists sequence, entity, startpos, endpos. 'sequence' is either a number or the name given in a fasta file. 'entity' is 1 if the sequence has just been aligned, or 2 if it belongs to an entity of already aligned sequences. The third line contains the region 'startpos' to 'endpos' of the sequence. The fourth line contains the structural assignment made by FOLDALIGN; corresponding left and right parentheses indicate a base-pair. The remaining lines show the aligned regions of the remaining sequences in the alignment.

The .align.'r' files always contain the (corresponding) numeric sequence names, whereas the .align.'r'.txt files (if created) always contain the names provided. The .orig.gz files contain the corresponding fasta names if a fasta file is used as input data (which also causes the .txt files to be created). For details, see the pointer to the example directory below. The FOLDALIGN package contains a program, fa2col, which converts the FOLDALIGN output format into a standard column format; see http://www.bioinf.au.dk/colformat for details.
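To illustrate how the structure line can be read, the following minimal Python sketch matches corresponding left and right parentheses and reports the paired positions. It is not part of the FOLDALIGN package. The function name base_pairs and the example string are hypothetical, and the sketch assumes that every character other than '(' and ')' marks an unpaired position or a gap. Positions are 0-based offsets within the structure line, not sequence coordinates.

```python
def base_pairs(structure):
    """Return (left, right) index pairs for matching parentheses.

    Assumption: '(' and ')' mark paired positions; any other character
    is treated as unpaired (or as an alignment gap).
    """
    open_positions = []  # indices of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Made-up stem-loop structure line, not actual FOLDALIGN output.
example = "(((....)))"
print(base_pairs(example))
```

To map the reported offsets back to sequence coordinates, one would add 'startpos' from the second line of the alignment entry, after accounting for any gap characters in the aligned region.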
The filter schemes work in two ways. The pre-filter selects the best alignment among alignments that consist of the same sequences but in which the sequences were aligned in a different order, which can sometimes result in slightly different alignments. (If two different combinations have the same score, the one appearing first in the data file is selected.) The main filter then selects the s highest-scoring alignments to be used in the next round of multiple alignments.

The input data is either in the FOLDALIGN format (see below) or in standard fasta format. Using fasta format generates files with extension .align.'r'.txt in which the sequences are identified by their fasta name rather than by their order of appearance in the data file, as in the FOLDALIGN format. For details, see the pointer to the example directory below.

An example of the FOLDALIGN format for the three sequences mentioned above
where the first line lists the number of sequences in the data set. The current sequence length is limited to 400 nucleotides. For details, see the references listed below and the FOLDALIGN webpages, where online access is available. Examples of data files can be found in $FOLDALIGN_DIR/examples/data, and examples of score matrices in $FOLDALIGN_DIR/lib.

REQUIRED EXECUTABLES
FOLDALIGN requires the existence of $FOLDALIGN_DIR/bin/ with the executables
as well as the existence of $FOLDALIGN_DIR/lib/*.xmgr from the FOLDALIGN package.
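The layout requirement above can be checked before a run. The following is a minimal Python sketch, not a tool shipped with FOLDALIGN. The function name check_foldalign_layout is hypothetical, and the sketch only tests that the bin directory exists and that at least one .xmgr file is present in lib; it does not check for individual executables.

```python
import glob
import os


def check_foldalign_layout(root):
    """Return a list of problems found under a FOLDALIGN installation root."""
    problems = []
    bin_dir = os.path.join(root, "bin")
    if not os.path.isdir(bin_dir):
        problems.append(f"missing directory: {bin_dir}")
    xmgr_pattern = os.path.join(root, "lib", "*.xmgr")
    if not glob.glob(xmgr_pattern):
        problems.append(f"no files match: {xmgr_pattern}")
    return problems


if __name__ == "__main__":
    root = os.environ.get("FOLDALIGN_DIR")
    if root is None:
        print("FOLDALIGN_DIR is not set")
    else:
        # Print each problem, or a single confirmation line if none were found.
        for message in check_foldalign_layout(root) or ["layout looks complete"]:
            print(message)
```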
FOLDALIGN also requires the following system commands to be installed
OPTIONS
FOLDALIGN accepts the following options.
EXAMPLES
A standard run using the fasta file foo.fasta and the specification file specfoo:
A standard run using the default FOLDALIGN input data format:
AUTHORS
Laurie J. Heyer and Jan Gorodkin
REFERENCES
For any use of FOLDALIGN programs please cite:

J. Gorodkin, L. J. Heyer and G. D. Stormo. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research, 25:18, 3724-3732, 1997.

J. Gorodkin, L. J. Heyer and G. D. Stormo. Finding Common Sequence and Structure Motifs in a set of RNA sequences. Proceedings ISMB-97, pp. 120-123, 1997.

CONTACT
If you have questions, bug reports, complaints, comments, or requests, email foldalign@bioinf.au.dk.

SEE ALSO
fa2col, sortfalign, prefilter, mainfilter, makefamatrix, genmatrix.