Copyright 2001-2003 Jay H. Choi
May be copied
for noncommercial purposes.
Author:
Jay H. Choi
Dept. of Genetics
Washington University
St. Louis, MO 63112
jhc1@genetics.wustl.edu
CONSENSUS-VIO (consensus
- version of iteration-optimization method)
This program
determines consensus patterns in unaligned sequences. The
algorithm is based on a matrix representation of a
consensus pattern. Each
row
corresponds to one of the letters of the relevant alphabet---e.g., 4
rows
in the case of DNA. Each column
corresponds to one of the positions
within the pattern. The elements of the matrix are
determined by the
number of times that the indicated letter occurs at the
indicated position.
This version of consensus program uses heuristic
approach of randimized
iteration-optimization method combined with
original consensus method
developed
by Gerald Z. Hertz (Hertz and Stormo, 1999) This version speeds
up the computation time such that
it finds the solution in linear time;
however, it does not guarantee to
find the optimum solution.
Randomized iteration-optimization method
are divided into two major parts.
First part of the method is to obtain an initial consensus pattern
matrix.
In this program, the
initial pattern matrix is obtained by execution of
original version
consensus (version 6c) or wconsensus (version 5c) with
linear option,
which seeds with the first sequence and proceed linearly
through the
list. The algorithm is dependent
on the order of the sequence
in the sequence set. (Stormo and Hartzell,
1989, PNAS, 86:1183-1187; Hertz
et al., 1990, CABIOS, 6:81-92). In consensus-vio, user specified number
of
execution or the default number of execution is determined by 1/4 of
total
number of input sequence provided by the user. The resulting initial matrix
will
be the highest scoring matrix among all top scoring matrices result
from
executions of consensus-v6c or wconsensus-v5c.
The matrix obtained
from linear method of consensus-v6c is used as an
initial matrix to build
an optimal matrix through the iteration of sequence
scan and matrix
enhancement. At each round of
optimization, the pattern matrix
built from the previous round is used to
scan the set of sequences to build
a new enhanced matrix. Every round, information content and
p-value of
matrix is calculated to determine the statistical significance
of matrix.
P-value is
calculated by determining the probability of observing a particular
score
or higher at a particular sequence position. P-values are estimated by
the method described in
Staden, 1989, CABIOS, p. 89--96.
Based on p-value of
each L-mer (subsequences of length L, where L
is the width of the pattern
being sought), the expected frequency is
calculated for the matrices, which
are constructed by sequentially adding additional L-mers. The optimization
step is called
iteratively until it converges to the optimum solution. The
program terminates either
when it surpasses the maximum number of cycles or
when the expected
frequency of matrix does not improve.
The lastest feature added in
consensus-vio is width-variation search.
At each
round of iteration-optimization step, it searches for
different width of
pattern matrix by extending and shrinking the width of
pattern, and returns
the highest scoring matrix of different pattern
width, along with the highest
scoring matrix of user-given width.
When the width
of pattern to be sought is not provided by the user, the linear
version of
wconsensus is used to get the initial matrix, and width-variation
search is
used to build and enhance the matrix.
The difference from the width
given search is that at each round,
the matrix of different width can be
selected as a new optimum matrix for
the next round. Therefore, the
optimum
width of pattern is also dynamically searched thru the
optimization procedure.
The significance of a matrix is initially
measured by its information
content. A higher information content indicates a rarer pattern and
a
more desirable matrix. The
program also estimates for each matrix a
p-value, which is the
probability of observing the particular information
content or higher in
an arbitrary alignment of random L-mers.
The
ultimate statistical significance of a matrix is determined by
multiplying
the p-value by the approximate number of possible alignments,
containing
the designated number of sequences and having the observed
width.
We refer to this
product as the expected frequency of the matrix alignment.
The expected
frequency allows the comparison of matrices summarizing
differing numbers
of sequences and having differing widths.
The program can print two
different lists of matrices. The
first list
contains the matrices having the highest information content
from each round,
including the list of top scoring matrices of different
width, ordered by
decreasing statistical significance (i.e., increasing
expected frequency).
In
general, this first list will contain the most interesting alignment.
The second list contains all the
matrices from each round of the program.
In the program's output,
the location of a pattern found is indicated by
"integer/integer". The first integer indicates where in
that sequence the
pattern is located and second integer indicates the
total length of each
sequence.
If the first integer is preceded by a minus sign, then the
complementary
pattern is the one included in the matrix.
Unlike the original
version of consensus, consensus_vio does not take
alphabet file, or
sequence file that does not include the name of sequence.
Alphabet and
normalization information is included on the command line, and
the
sequence file should contain the sequence name and contents in consensus
format. (see "Format of the sequence
files" section.)
COMMAND LINE
OPTIONS:
0) -h: print
these directions.
1)
General information
-f
filename(required): this file (default: read from the standard
input)
contains the names of the
sequences. The names of the
sequences must be
less than 512 characters. The corresponding sequence may follow
its
name if the sequence is enclosed between backslashes (\)
In the
sequences, whitespace, slashes (/), periods, dashes, and
comments
beginning with ';', '%', or '#' are ignored.
-l
integer: width of the pattern being sought. If the width is not provided,
width-variation search is performed.
-m matrix:
name of file that contains matrix.
If this option is chosen, then
initial
matrix is obtained from contents of matrix file, and -l
option
should be chosen.
-e width-variation: a value for width variation search. For
example,
-e 1 will search with a width of +/- 1 of user-given
width. If the width of pattern being sought is
not given,
then
the program automatically uses this option with a
default
value of 3 or user-specified value.
2)
Alphabet options
-a alphabet and normalization
information (required): the alphabet and
normalization information appears on the command line
(e.g., -a
"a:t 3 c:g 2"). The
entire command line argument must be
inside of quotation mark.
Each line
contains a letter (a symbol in the alphabet) followed by an
optional
normalization number. The
normalization is based on the
relative prior probabilities
of the letters. For nucleic
acids,
this might be be the genomic frequency of the bases. If the
normalization information is
not included, then the frequencies
observed in your own sequence
data are used. In nucleic acid
alphabets, a letter and its complement appear on the same line,
separated
by a colon. Complementary letters
may use the same
normalization number. Only the standard 26 letters are
permissible.
3) Complement Option:
handling the complement of nucleic acid sequences---
the four options in this
section are mutually exclusive
-c 0:
ignore the complement (the default option)
-c 1:
include both strands as separate sequences
-c 2: include
both strands as a single sequence (i.e., orientation unknown)
4) Initialization Option: number of cycles for
the initialization steps.
-i integer: the
number of executions of linear consensus to obatin an initial
matrix. (by default, 1/4 of total number of input sequences.)
5) Optimization Option:
number of cycles for optimization steps.
-n integer:
the number of cycles to terminate the optimization search process.
6) Output options
-t
integer: the number of top scoring matrices to print. (default:
1).
-r
: print all matrices from each cycle of iteration.
FORMAT OF THE SEQUENCE FILES:
Do not
explicitly give the complements of nucleic acid sequences. If needed, the
complementary
sequence is determined by the program.
Whitespace, periods, dashes
(unless part of an integer when the
"-i" option is used), and comments beginning
with ';', '%', or
'#' are ignored.
Sequences
surrounded by slashes (/) do not contribute to the generation of the
patterns;
thus, a portion of a sequence can be ignored without disrupting the
overall
numbering of the sequence. A double slash (//) would indicate a
discontinuity
in the sequence.