Usage:

mast <motif file> <sequence file> [options]

Description

Input

Motif File

A query file containing MEME formatted motifs. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. MAST previously required a log-odds matrix in the motif format but, while the log-odds section is used preferentially, it is no longer required.

Sequence File

A file containing FASTA formatted sequences which are suspected to contain motif sites. See the -dblist option if you need to specify multiple sequence databases.

Output

MAST works by calculating match scores for each sequence in the database compared with each of the motifs in the group of motifs you provide. For each sequence, the match scores are converted into various types of p-values and these are used to determine the overall match of the sequence to the group of motifs and the probable order and spacing of occurrences of the motifs in the sequence.

MAST outputs its results as HTML, XML and plain text. The HTML file (mast.html) is designed for human viewing and the XML file (mast.xml) is designed for machine processing. The plain text format file (mast.txt) is available for backwards compatibility with earlier versions of MAST.

The MAST HTML output contains:

  1. A summary of the MOTIFS in the input, including their sequence logos, and a correlation matrix indicating how similar each pair of motifs is. The matrix helps you avoid including redundant motifs that would bias the query.
  2. The SEARCH RESULTS showing the sequences with significant overall matches to the motifs in the query. The sequences are sorted by their match E-values, and each sequence is accompanied by a motif block diagram showing the order and spacing of the matches to the query motifs. Further details about each of the motif matches are available by clicking near the desired sequence.
  3. The INPUTS & SETTINGS used in the query to MAST. These include a description of the sequence alphabet, a description of the sequence file, a description of the query motif file and other settings that affect how MAST runs.

To avoid biased scores when multiple motif scores are combined, MAST computes the correlation between each pair of motifs and displays the results in the MOTIFS section of its HTML output. The correlation between two motifs is the maximum sum of Pearson's correlation coefficients for aligned columns, divided by the width of the shorter motif. The maximum is found by trying all alignments of the two motifs. Motifs with correlations below 0.60 have little effect on the accuracy of the combined scores. Pairs of motifs with higher correlations should be removed from the query. This is done by default on the MAST web server, and can be requested with the -remcorr option when MAST is run from the command line.
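The correlation definition above can be sketched in a few lines of Python. This is one reading of the text, not MAST's own code: each motif is assumed to be a list of columns, each column a list of per-letter values, and "all alignments" is taken to mean every relative offset with at least one overlapping column. Treating a constant column as having zero correlation is also an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column contributes nothing
    return cov / math.sqrt(var_x * var_y)


def motif_correlation(m1, m2):
    """Maximum over offsets of the summed column correlations,
    divided by the width of the shorter motif."""
    w1, w2 = len(m1), len(m2)
    best = -math.inf
    # Column j of m2 is aligned with column j + offset of m1.
    for offset in range(-(w2 - 1), w1):
        total = sum(
            pearson(m1[j + offset], m2[j])
            for j in range(w2)
            if 0 <= j + offset < w1
        )
        best = max(best, total)
    return best / min(w1, w2)


# Hypothetical three-column DNA motif (values in A, C, G, T order).
motif_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
motif_b = motif_a[1:]  # the last two columns of motif_a
print(motif_correlation(motif_a, motif_b) > 0.60)
```

A pair whose value exceeds 0.60, as here, would be a candidate for removal from the query.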

Options

Each entry gives the option with its parameter (if any), a description, and the default behaviour when the option is not given.

Input Options
  -dblist
      The sequence file contains a list of file names of FASTA-formatted databases.
      Default: The sequence file contains FASTA-formatted sequences.

Output Options
  -hit_list
      Write a machine-readable (plain text) list of all non-overlapping motif matches (or just the single best hit for each motif, see -best, below) to standard output. No other output is created. See the section titled Hit List below for details of the output format.
      Default: MAST outputs HTML, XML and plain text, and does not output a "hit list".

Which Motifs To Use
  -remcorr
      Remove highly correlated motifs from query.
      Default: No motifs are removed from the query.
  -m n
      Use only motifs appearing at the nth position in the file. This option may be repeated.
      Default: Use all the motifs.
  -c count
      Only use the first count motifs.
      Default: Use all the motifs.
  -mev evalue
      Use only motifs with E-values ≤ evalue.
      Default: Use all the motifs.
  -diag diagram
      The nominal order and spacing of motifs is specified by diagram, which is a block diagram. MAST uses the preferred order and spacing to compute the "spacing p-value" for any observed motif spacing in a sequence. The spacing p-value is treated as though it were an additional motif in computing the sequence E-value. In the diagram, motifs should be referred to by their position in the motif file, not by their name. For example, if motifs number 1 and 2 typically occur separated by a gap of 5, with the motif 2 site preceding the motif 1 site, the diagram would be [2]-5-[1]. Each input motif may be specified at most once in the diagram. Any leading and trailing gaps are ignored.
      Default: Sequence E-values ignore motif order and spacing.

Options for Alphabets with Complements (e.g., DNA)
  -norc
      Do not score the reverse complement strand. This option is not compatible with the -sep or -dna options.
      Default: The p-value of a motif site is the minimum of its p-values on the two strands.
  -sep
      Score the reverse complement strand as a separate sequence. This option is not compatible with the -norc or -dna options.
      Default: The p-value of a motif site is the minimum of its p-values on the two strands.
  -dna
      (DNA sequences only) Translate the DNA sequences to protein so protein motifs may be scanned. The motifs must be protein and the sequences must be DNA. This option is not compatible with -norc or -sep.
      Default: DNA sequences are not translated to protein and only DNA motifs may be used to scan them.
  -comp
      Adjust the p-values and E-values for sequence composition.
      Default: P-values are based on the overall background model (see -bfile in the p-values section).

Which Results To Print
  -ev evalue
      Output results for sequences with E-values < evalue.
      Default: Output results for sequences with E-values < 10.

Appearance of Block Diagrams
  -mt mt
      Show motif matches with p-value < mt.
      Default: Show motif matches with a p-value < 0.0001.
  -w
      Show weak matches (mt < p-value < mt * 10) in angle brackets in the hit list or when the XML is converted to plain text.
      Default: Only strong matches (see -mt) are indicated in the plain text output.
  -seqp
      Use SEQUENCE p-values for motif thresholds.
      Default: Use POSITION p-values for motif thresholds.

Miscellaneous
  -best
      Include only the best motif hits in the list of motif sites generated by -hit_list. This option has no effect unless -hit_list is specified.
      Default: All non-overlapping motif sites are listed in the -hit_list text output.
  -mf mf
      In results use mf as motif file name.
      Default: The actual name of the motif file is used.
  -df df
      In results use df as database name. This option is ignored when -dblist is specified.
      Default: The actual name of the sequence file is used.
  -dl dl
      If there is on-line annotation for the sequences in your sequence file that can be accessed via a link of the form "http://anything?anything=anythingSEQUENCEIDanything", you can have MAST link each sequence ID in its results to its annotation. The actual FASTA sequence ID of the sequence will be used to replace the token SEQUENCEID in the pattern dl that you specify. This option is ignored when -dblist is specified.
      Default: Sequence IDs in the results are not linked to anything.
  -minseqs ms
      The lower bound on the number of sequences in the database. This will reduce the amount of memory required by MAST.
      Default: MAST uses more memory.
  -nostatus
      Do not print progress updates to standard error.
      Default: Progress updates are printed to standard error.
  -notext
      Do not create plain text output.
      Default: MAST creates HTML, XML and plain text output.
  -nohtml
      Do not create HTML output.
      Default: MAST creates HTML, XML and plain text output.
  -version
      Display the version and exit.
      Default: Run as normal.

Match Scores

The match score of a motif to a position in a sequence is the sum of the score from each column of the position-dependent scoring matrix corresponding to the letter at that position in the sequence. For example, if the sequence is

       TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
          ========
      

and the motif is represented by the position-dependent scoring matrix (where each row of the matrix corresponds to a position in the motif)

       Position        A        C        G        T
          1        1.447    0.188   -4.025   -4.095
          2        0.739    1.339   -3.945   -2.325
          3        1.764   -3.562   -4.197   -3.895
          4        1.574   -3.784   -1.594   -1.994
          5        1.602   -3.935   -4.054   -1.370
          6        0.797   -3.647   -0.814    0.215
          7       -1.280    1.873   -0.607   -1.993
          8       -3.076    1.035    1.414   -3.913

then the match score of the fourth position in the sequence (underlined) would be found by summing the score for T in position 1, G in position 2 and so on until G in position 8. So the match score would be

         score = -4.095 + -3.945 + -3.895 + -1.994
                 + -4.054 + -0.814 + -1.993 + 1.414
               = -19.376
      

The match scores for other positions in the sequence are calculated in the same way. Match scores are only calculated if the match completely fits within the sequence. Match scores are not calculated if the motif would overhang either end of the sequence.
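The calculation described in this section can be reproduced with a short Python sketch. The matrix and sequence are copied from the example above; the function names are illustrative and are not part of MAST. The total is computed directly from the matrix entries, so it reflects whatever values the matrix holds.

```python
ALPHABET = "ACGT"

# Position-dependent scoring matrix from the example above:
# one row per motif position, one column per letter in ALPHABET order.
MATRIX = [
    [1.447, 0.188, -4.025, -4.095],
    [0.739, 1.339, -3.945, -2.325],
    [1.764, -3.562, -4.197, -3.895],
    [1.574, -3.784, -1.594, -1.994],
    [1.602, -3.935, -4.054, -1.370],
    [0.797, -3.647, -0.814, 0.215],
    [-1.280, 1.873, -0.607, -1.993],
    [-3.076, 1.035, 1.414, -3.913],
]

SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, start, matrix=MATRIX, alphabet=ALPHABET):
    """Match score of the motif placed at 1-based position `start`.

    Returns None when the motif would overhang either end of the sequence.
    """
    begin = start - 1
    end = begin + len(matrix)
    if begin < 0 or end > len(sequence):
        return None
    window = sequence[begin:end]
    return sum(row[alphabet.index(letter)] for row, letter in zip(matrix, window))


def all_match_scores(sequence, matrix=MATRIX):
    """Scores for every start position at which the motif fits completely."""
    last_start = len(sequence) - len(matrix) + 1
    return {s: match_score(sequence, s, matrix) for s in range(1, last_start + 1)}


# Score of the underlined site, which starts at the fourth position.
print(round(match_score(SEQUENCE, 4), 3))
```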

p-values

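MAST reports all matches of a sequence to a motif or group of motifs in terms of the p-value of the match. MAST considers the p-values of four types of events, each defined in its own section below: position p-values, sequence p-values, combined p-values and E-values.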

All p-values are based on a random sequence model that assumes each position in a random sequence is generated according to the average letter frequencies of all sequences in the appropriate (peptide or nucleotide) non-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) on September 22, 1996. This can be overridden by specifying the -bfile or -comp options (see below). For DNA sequences, unless -norc is given, the positive and reverse complement strand frequencies are averaged together.

  1. -bfile bfile The random model uses the letter frequencies given in bfile instead of the non-redundant database frequencies. The bfile is in Markov Background Model format. You can create files in the appropriate format based on the base/residue composition of your own FASTA sequence files using the command fasta-get-markov included in the MEME distribution.
  2. -comp The random model uses the letter frequencies in the current target sequence instead of the non-redundant database frequencies. This causes p-values and E-values to be compensated individually for the actual composition of each sequence in the database. This option can increase search time substantially due to the need to compute a different score distribution for each high-scoring sequence. With this option and sequences on an alphabet with complements (e.g., DNA), the positive and reverse complement strand frequencies are not averaged together.
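As a small illustration of what averaging the positive and reverse complement strand frequencies means for DNA, the sketch below replaces each letter's frequency with the mean of its own frequency and that of its complement. The input frequencies are made up for the example, and the function is not part of MAST.

```python
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def average_strands(freqs, complement=DNA_COMPLEMENT):
    """Average each letter's frequency with that of its complement."""
    return {x: (freqs[x] + freqs[complement[x]]) / 2 for x in freqs}


# Hypothetical single-strand letter frequencies.
single_strand = {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4}
both_strands = average_strands(single_strand)
print(both_strands)
```

After averaging, A and T share one frequency and C and G share another, so the model is the same whichever strand is read.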

Position p-value

The p-value of a match of a given position within a sequence to a motif is defined as the probability of a randomly selected position in a randomly generated sequence having a match score at least as large as that of the given position. Note: If MAST is combining reverse complement strands, the position p-value is not corrected for multiple tests.
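For a short motif, the definition above can be checked by brute force: enumerate every possible word of the motif's width, and add up the probabilities of those whose match score is at least the observed score. The sketch below does this for a toy two-position matrix with a uniform background; both are assumptions chosen for illustration. Enumeration grows exponentially with motif width, so this is a way to understand the definition rather than a practical method, and it does not handle reverse complement strands.

```python
from itertools import product


def position_pvalue(matrix, score, background, alphabet="ACGT"):
    """Probability that a random word of the motif's width scores >= `score`.

    `matrix` has one row per motif position and one column per letter;
    `background` maps each letter to its probability in the random model.
    """
    width = len(matrix)
    total = 0.0
    for word in product(range(len(alphabet)), repeat=width):
        word_score = sum(matrix[i][letter] for i, letter in enumerate(word))
        if word_score >= score:
            prob = 1.0
            for letter in word:
                prob *= background[alphabet[letter]]
            total += prob
    return total


# Toy matrix and uniform background (hypothetical values).
toy_matrix = [[1, 0, 0, -1], [1, 0, 0, -1]]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(position_pvalue(toy_matrix, 2, uniform))
```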

Sequence p-value

The p-value of a match of a sequence to a motif is defined as the probability of a randomly generated sequence of the same length having a match score at least as large as the largest match score of any position in the sequence.
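One common way to relate the two quantities is to treat the positions in the sequence as independent tests. Under that assumption, which this page does not state and which may differ from what MAST computes, the sequence p-value follows from the position p-value of the best-scoring position as 1 - (1 - p)^n, where n is the number of positions at which the motif fits. A minimal sketch for a single strand:

```python
import math


def sequence_pvalue(best_position_pvalue, sequence_length, motif_width):
    """Approximate sequence p-value, assuming independent positions.

    Computes 1 - (1 - p)**n in a numerically stable way, where n is the
    number of positions at which the motif fits within the sequence.
    """
    positions = sequence_length - motif_width + 1
    if positions <= 0:
        raise ValueError("sequence is shorter than the motif")
    if best_position_pvalue >= 1.0:
        return 1.0
    return -math.expm1(positions * math.log1p(-best_position_pvalue))


# Hypothetical values: best position p-value 1e-6 in a sequence of length 1007.
print(sequence_pvalue(1e-6, 1007, 8))
```

The use of log1p and expm1 keeps precision when the position p-value is very small.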

Combined p-value

The p-value of a match of a sequence to a group of motifs is defined as the probability of a randomly generated sequence of the same length having sequence p-values whose product is at least as small as the product of the sequence p-values of the matches of the motifs to the given sequence.
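The probability that a product of n independent, uniformly distributed p-values is at most p has a closed form, p * sum_{i=0}^{n-1} (-ln p)^i / i!, which is the kind of calculation discussed in the paper cited at the end of this page. The sketch below evaluates that formula; it assumes the sequence p-values are independent, and it is an illustration rather than MAST's implementation.

```python
import math


def combined_pvalue(pvalues):
    """P-value of the product of n independent uniform p-values."""
    n = len(pvalues)
    product = math.prod(pvalues)
    if product <= 0.0:
        return 0.0
    if product >= 1.0:
        return 1.0
    log_term = -math.log(product)
    term = 1.0   # (-ln p)**i / i! for i = 0
    total = 1.0
    for i in range(1, n):
        term *= log_term / i
        total += term
    return product * total


# Hypothetical sequence p-values for two motifs.
print(combined_pvalue([0.1, 0.1]))
```

With a single motif the combined p-value equals the sequence p-value, as expected.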

E-value

The E-value of the match of a sequence in a database to a group of motifs is defined as the expected number of sequences in a random database of the same size that would match the motifs as well as the sequence does. It is equal to the combined p-value of the sequence times the number of sequences in the database.

High-scoring Sequences

MAST lists the names and part of the descriptive text of all sequences whose E-value is less than E. Sequences shorter than one or more of the motifs are skipped. The sequences are sorted by increasing E-value. The value of E is set to 10 for the web server but can be chosen with the -ev option in the downloadable version of MAST.
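The E-value definition and the listing rule can be combined in a short sketch: multiply each combined p-value by the number of sequences in the database, keep the sequences below the threshold, and sort by increasing E-value. The sequence names and p-values are invented for the example, and the function is illustrative only.

```python
def high_scoring_sequences(combined_pvalues, database_size, max_evalue=10.0):
    """Return (name, E-value) pairs with E-value < max_evalue, best first.

    `combined_pvalues` maps a sequence name to its combined p-value.
    """
    evalues = {name: p * database_size for name, p in combined_pvalues.items()}
    kept = [(name, e) for name, e in evalues.items() if e < max_evalue]
    return sorted(kept, key=lambda item: item[1])


# Hypothetical combined p-values from a database of 1000 sequences.
pvalues = {"seqB": 0.5, "seqC": 0.009, "seqA": 1e-5}
for name, evalue in high_scoring_sequences(pvalues, database_size=1000):
    print(name, evalue)
```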

Motif Diagrams (Plain text output and -diag input option only)

Motif diagrams show the order and spacing of non-overlapping matches to the motifs in each high-scoring sequence. Motif occurrences are determined based on the position p-value of matches to the motif. Strong matches (p-value < mt) are shown in square brackets (`[ ]'), weak matches (mt < p-value < mt * 10) are shown in angle brackets (`< >') and the length of non-motif sequence ("spacer") is shown between hyphens (`-'). The value of mt is 0.0001 for the web server but can be chosen with the -mt option in the downloadable version of the MEME Suite. For example,

             27-[+3]-44-<+4>-99-[-1]-7
      

shows an initial spacer of length 27, followed by a strong match to motif 3 on the positive strand, a spacer of length 44, a weak match to motif 4 on the positive strand, a spacer of length 99, a strong match to motif 1 on the negative strand and a final non-motif sequence of length 7.

Note that when scanning DNA sequences with protein motifs (-dna command line option), the frame of the match is indicated by one of the letters "a", "b" or "c" following the motif number. For example

             27-[+3a]-44-<+4c>-99-[-1b]-7
      
indicates that the matches are in frames "a", "c" and "b", respectively.
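A diagram in this notation is easy to take apart with a regular expression. The sketch below is an illustrative parser, not part of MAST: it accepts spacers, strong matches in square brackets and weak matches in angle brackets, with an optional strand sign and an optional frame letter. The diagram string is made up for the example, and the parser does not check that opening and closing brackets are of the same kind.

```python
import re

# A bracketed motif (optional strand, motif number, optional frame) or a spacer.
TOKEN = re.compile(
    r"(?P<open>[\[<])(?P<strand>[+-]?)(?P<motif>\d+)(?P<frame>[abc]?)[\]>]"
    r"|(?P<spacer>\d+)"
)


def parse_diagram(diagram):
    """Split a motif diagram into spacer and match items, in order."""
    items = []
    for m in TOKEN.finditer(diagram):
        if m.group("spacer") is not None:
            items.append(("spacer", int(m.group("spacer"))))
        else:
            kind = "strong" if m.group("open") == "[" else "weak"
            items.append((kind, int(m.group("motif")), m.group("strand"), m.group("frame")))
    return items


# Hypothetical diagram: strong motif 2 (+), weak motif 1 (-), frame letters included.
for item in parse_diagram("5-[+2a]-12-<-1c>-30"):
    print(item)
```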

Annotated Sequences

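MAST annotates each high-scoring sequence by printing the sequence along with the position and strength of all the non-overlapping motif occurrences. The four lines above each motif occurrence contain, respectively,

  1. the motif number of the occurrence,
  2. the position p-value of the occurrence,
  3. the best possible match to the motif, and
  4. columns whose match to the motif has a positive score (indicated by a plus sign).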

The best possible match to a motif is the sequence of letters which would achieve the highest match score.

Hit List

If you specify the -hit_list switch to MAST, MAST outputs ONLY a list of "hits" in easily machine-readable format. Each line corresponds to one motif occurrence in one sequence. The format of the hit lines is

[sequence_name [strand]motif id alt_id start end score p-value]+

where

sequence_name is the name of the sequence containing the hit,
strand is the strand (+ or - for DNA, blank for protein),
motif is the motif number,
id is the motif ID,
alt_id is the motif alternate ID,
start is the starting position of the hit,
end is the ending position of the hit,
score is the score of the hit, and
p-value is the position p-value of the hit.
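A hit list in this format can be read with a few lines of Python. The sketch below assumes that the fields are separated by whitespace, that every field is present and free of internal spaces, and that lines beginning with "#" are comments to be skipped. The sample lines are invented to match the field list and are not real MAST output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str      # "+", "-" or "" (protein)
    motif: int
    motif_id: str
    alt_id: str
    start: int
    end: int
    score: float
    pvalue: float


def parse_hit_list(lines):
    """Parse hit lines, skipping blank lines and '#' comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif, motif_id, alt_id, start, end, score, pvalue = line.split()
        strand = motif[0] if motif[0] in "+-" else ""
        number = int(motif[1:] if strand else motif)
        hits.append(Hit(name, strand, number, motif_id, alt_id,
                        int(start), int(end), float(score), float(pvalue)))
    return hits


# Invented sample lines, for illustration only.
sample = """\
# a comment line
seq1 +1 MOTIF-1 alt1 10 17 12.5 1.2e-05
prot1 3 MOTIF-3 alt3 40 51 15.0 2.0e-07
"""
for hit in parse_hit_list(sample.splitlines()):
    print(hit)
```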

Two comment lines (starting with "#") are written above the list of hits, and the MAST command line is printed as a comment line after the list. An example of the output using the -hit_list switch to MAST is:

Loading Multiple Sequence Databases

MAST can load multiple sequence databases. Put the database file names into a list file, then give that list file in place of the sequence file and specify the -dblist option.

The list file has one file name per line, optionally followed by a name and a link:

      <file> [<name> <link>]
      ...
      ...
      

If the name is specified, it is used instead of the file name in the output. If the link is specified, every sequence from that database is given a hyperlink in the HTML output; the URL is the link with the text SEQUENCEID replaced by the FASTA sequence ID.
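The list file layout and the SEQUENCEID substitution can be illustrated with a small Python sketch. It assumes that the three fields are separated by whitespace and contain no spaces themselves; the file names, database name and URL are hypothetical, and the functions are not part of MAST.

```python
def parse_dblist(text):
    """Return (file, name, link) tuples; name and link are None when absent."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        file_name = parts[0]
        name = parts[1] if len(parts) > 1 else None
        link = parts[2] if len(parts) > 2 else None
        entries.append((file_name, name, link))
    return entries


def sequence_url(link, sequence_id):
    """Replace the SEQUENCEID token in a link pattern with a sequence ID."""
    return link.replace("SEQUENCEID", sequence_id)


# Hypothetical list file contents.
dblist_text = """\
proteins.fasta MyProteins http://example.org/entry?id=SEQUENCEID
extra.fasta
"""
entries = parse_dblist(dblist_text)
print(entries)
print(sequence_url(entries[0][2], "P12345"))
```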

Examples

The following examples assume that the file "meme.results" is the output of a MEME run on the training set "training.fasta" and contains at least 3 motifs, that the file SwissProt is a copy of the Swiss-Prot database on your local disk, and that DNA_DB is a copy of a DNA database on your local disk.

  1. Annotate the training set:
            mast meme.results training.fasta
          
  2. Find sequences in the SwissProt database that match the motifs and annotate them:
     
            mast meme.results SwissProt
          
  3. Show sequences with weaker combined matches to the motifs by raising the E-value threshold:
            mast meme.results SwissProt -ev 200
          
  4. Include a nominal order and spacing of the first three motifs in the calculation of the sequence p-values to increase the sensitivity of the search for matching sequences. Note that the leading and trailing gaps are ignored ("9-" and "-91" in the example):
            mast meme.results SwissProt -diag "9-[2]-61-[1]-62-[3]-91"
          
  5. Use only the first and third motifs in the search:
     
            mast meme.results SwissProt -m 1 -m 3
          
  6. Use only the first two motifs in the search:
            mast meme.results SwissProt -c 2
          
  7. Search DNA sequences using protein motifs, adjusting p-values and E-values for each sequence by that sequence's composition:
            mast meme.results DNA_DB -dna -comp
          

Citing

If you use MAST in your research, please cite the following paper:
Timothy L. Bailey and Michael Gribskov, "Combining evidence using p-values: application to sequence homology searches", Bioinformatics, 14(1):48-54, 1998.