mcast [options] <motifs> <sequence database>
In order for MCAST to compute statistical confidence estimates, at least 200 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth option. When this option is set, synthetic sequences will be generated using a background model generated by choosing a random GC frequency within the range of observed GC minimum and maximum. The synthetic sequences will be used to estimate significance statistics.
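The idea behind --synth can be illustrated with a minimal Python sketch. It is not MCAST's implementation. It assumes that the GC content is drawn uniformly between the observed minimum and maximum, that it is drawn once per synthetic sequence, and that A/T and G/C each split their share equally. The function names observed_gc_range and synthetic_sequence are hypothetical.

```python
import random

def observed_gc_range(sequences):
    """Return the (min, max) GC fraction observed over the input sequences."""
    fractions = []
    for seq in sequences:
        s = seq.upper()
        acgt = sum(s.count(base) for base in "ACGT")
        if acgt:  # skip sequences with no unambiguous bases
            fractions.append((s.count("G") + s.count("C")) / acgt)
    return min(fractions), max(fractions)

def synthetic_sequence(length, gc_min, gc_max, rng):
    """Draw one sequence from a 0th-order model with a randomly chosen GC content.

    Assumption: GC content is uniform in [gc_min, gc_max] and is split
    evenly between G and C (and the remainder evenly between A and T).
    """
    gc = rng.uniform(gc_min, gc_max)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
gc_min, gc_max = observed_gc_range(["ACGTACGT", "GGGCGCAT", "ATATATGC"])
synth = [synthetic_sequence(200, gc_min, gc_max, rng) for _ in range(5)]
```

Scanning such sequences yields extra match scores from a background with no real motif clusters, which is what allows a score distribution to be estimated when the real database yields too few matches.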
When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. As a result, some matches may be missing from the MCAST output even though they would have satisfied the user-specified p-value or q-value threshold.
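The effect of the stored-match limit can be seen in a small Python sketch. This is an illustration of the "drop the least significant half" step only, not MCAST's code, and it does not model the reservoir sampling of scores. It assumes that significance is ranked by p-value; the helper store_match and the tiny limit of 4 are hypothetical.

```python
def store_match(stored, pvalue, match, max_stored):
    """Append a match; once the limit is reached, keep only the most significant half."""
    stored.append((pvalue, match))
    if len(stored) >= max_stored:
        stored.sort(key=lambda item: item[0])  # smallest p-value first
        del stored[len(stored) // 2:]          # drop the least significant half

stored = []
pvalues = [0.5, 0.01, 0.2, 0.001, 0.3, 0.04, 0.9, 0.002]
for i, p in enumerate(pvalues):
    store_match(stored, p, f"match{i}", max_stored=4)

kept = [name for _, name in stored]
```

In this run the match with p-value 0.04 is discarded when the list is halved, although it would pass an output threshold of 0.05. Raising the limit makes such losses less likely, at the cost of memory.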
MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option is used to set the name of a file containing the PSP, and the --prior-dist option is used to set the name of a file containing the binned distribution of the PSP.
The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.
The PSP and PSP distribution files can be generated from raw scores using the create-priors utility available when you download and install the MEME Suite on your own computer.
A full description of the algorithm may be found in:
A file containing DNA motifs in MEME format.
Outputs from MEME and DREME are supported, as well as Minimal MEME
Format. You can also input DNA motifs in TRANSFAC format if you specify
--transfac. You can convert many other motif formats to MEME format
using conversion scripts
available with the MEME Suite. Input motifs that are likely to appear in the
A collection of DNA sequences in FASTA format.
MCAST will create a directory named
mcast_out (the name of this directory can be overridden via the
--o or --oc options).
The directory will contain:
mcast.html reporting the matches in HTML format (see details here)
cisml.xml reporting the matches in XML format using the CisML schema
mcast.xml describing the inputs to MCAST in XML format and referencing
mcast.txt reporting the matches in tab-delimited format (see details
mcast.gff reporting the matches in GFF3 format
The score reported in the GFF3 output is
|--alpha||alpha||The fraction of all TF binding sites that are binding sites for the TF of interest.||1.0|
|--hardmask||Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score.||Nucleotides in lower case are converted to upper case.|
|--max-gap||max gap||The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values.||The maximum gap is set to 50.|
|--max-stored-scores||max||Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.||The maximum number of stored matches is 100,000.|
|--motif-pthresh||pthresh||Sets the scale for calculating p-scores for motif hits. The
p-score for a hit with p-value p is
S = -log2(p/pthresh),
|The motif scaling p-value defaults to 0.0005.|
|--output-ethresh||out E-value||The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value threshold is 10.0.|
|--output-pthresh||out p-value||The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--output-qthresh||out q-value||The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--parse-genomic-coord||When this option is specified each sequence header will be
checked for UCSC style genomic coordinates. These are of the form:
>sequence name:starting position-ending positionWhere
|The first position in the sequence will be assumed to be 1.|
|--psp||file||File containing position-specific priors (PSP) in MEME PSP format or wiggle format. This file can be generated using the create-priors utility.||A uniform position-specific prior is used.|
|--prior-dist||file||File containing binned distribution of priors. This file can be generated using the create-priors utility.||A uniform position-specific prior is used.|
|--synth||Use synthetic scores for distribution. A 0th-order Markov model of nucleotide frequencies will be created by choosing a GC content at random between the observed minimum and maximum values. This model will be used to generate synthetic sequences, and the synthetic sequences will be used to estimate the distribution of p-values.||No synthetic sequences will be generated.|
|--text||Limits output to plain text sent to standard out.|
|--transfac||MCAST will assume that the motif file is in TRANSFAC matrix format.||MCAST assumes the motif file is in MEME format.|
|--version||Display the version and exit.||Run as normal.|
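The p-score scaling described for --motif-pthresh, S = -log2(p/pthresh), can be worked through in a few lines of Python. This is only a numerical illustration of the formula as given in the table; the function name p_score is hypothetical.

```python
import math

def p_score(p, pthresh=0.0005):
    """Scaled score S = -log2(p / pthresh) for a motif hit with p-value p."""
    return -math.log2(p / pthresh)

# A hit exactly at the threshold scores 0, a hit four times more
# significant scores about 2, and a hit four times less significant
# scores about -2.
scores = {p: p_score(p) for p in (0.0005, 0.000125, 0.002)}
```

Each halving of the hit p-value relative to pthresh adds one to the score, so raising pthresh raises the score of every hit by the same amount.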
The HTML output contains
The plain text output contains a line for each match. Each line contains the following fields:
The lines are sorted by score in descending order.
If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble, "Searching for statistically significant regulatory modules", Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 2003.