Synopsis

perl fastaFileCheck.pl [-simvh] dnafastafile

perl fastaFileCheck.pl -p[simvh] proteinfastafile

Checks the structure of a DNA fasta file (or of a protein fasta file when the -p option is used) and prints out various information about it.

Description

The program checks

(1)

    whether letters other than A, T, G, C and N occur in lines that do not start with > (or, when the protein option -p is used, letters outside the one-letter amino acid code plus X and *). A line with less than 80% A, T, G, C and N is considered a severe problem; the usual cause is protein data or a stray return character inside the description line, both of which will break most programs. 
(2)

    whether the file contains empty lines. 
(3)

    whether duplicate identifiers are present in the file. 
(4)

    whether identifiers exceed a length of 50 characters. 
(5)

    whether any line is longer than 100000 characters. Line length here means the run of characters up to the next carriage return. The total sequence length of an entry is not checked and can be arbitrarily long. Fasta files often use lines of 50-100 letters, but this is not required. 
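
Checks (2) to (5) could be sketched roughly as follows. This is an illustration only, not the code of fastaFileCheck.pl: the helper name check_structure, the returned counter names, and the rule that an identifier is the first whitespace-delimited token after > are all assumptions.

```perl
use strict;
use warnings;

# Limits taken from the description above.
my $MAX_ID_LENGTH   = 50;
my $MAX_LINE_LENGTH = 100_000;

# Hypothetical helper: takes a reference to an array of lines and
# returns a hash reference with one counter per kind of problem.
sub check_structure {
    my ($lines) = @_;
    my %seen;
    my %count = (empty => 0, duplicate_id => 0, long_id => 0, long_line => 0);

    for my $line (@$lines) {
        (my $text = $line) =~ s/[\r\n]+\z//;    # strip the line terminator

        if ($text eq '') {                      # check (2): empty lines
            $count{empty}++;
            next;
        }
        $count{long_line}++ if length($text) > $MAX_LINE_LENGTH;   # check (5)

        next unless $text =~ /^>/;              # only header lines carry ids

        # Assumption: the identifier is the first token after '>'.
        my ($id) = $text =~ /^>\s*(\S*)/;
        $count{duplicate_id}++ if $seen{$id}++;                    # check (3)
        $count{long_id}++      if length($id) > $MAX_ID_LENGTH;    # check (4)
    }
    return \%count;
}
```

Counting problems rather than stopping at the first one matches the summary-style reporting described here, but the real program may report them differently.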

Lines with more than 20% non-DNA characters are counted as severe errors; lines with a smaller share of such characters are counted as minor errors.
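
The 20% rule can be illustrated with a short sketch. It is not taken from fastaFileCheck.pl: the helper name classify_line, the case-insensitive matching, and the treatment of a line with exactly 20% unexpected characters as minor are assumptions.

```perl
use strict;
use warnings;

# Characters that are NOT allowed in a sequence line.
# Assumption: upper and lower case are both accepted.
my $DNA_BAD     = qr/[^ATGCN]/i;
my $PROTEIN_BAD = qr/[^ACDEFGHIKLMNPQRSTVWYX*]/i;

# Hypothetical helper: returns 'ok', 'minor' or 'severe' for one
# non-empty sequence line (a line that does not start with '>').
sub classify_line {
    my ($line, $is_protein) = @_;
    $line =~ s/[\r\n]+\z//;                  # strip the line terminator

    my $bad_re = $is_protein ? $PROTEIN_BAD : $DNA_BAD;
    my $bad    = () = $line =~ /$bad_re/g;   # count unexpected characters

    return 'ok' if $bad == 0;
    # More than 20% unexpected characters is severe, anything less is minor.
    return $bad / length($line) > 0.20 ? 'severe' : 'minor';
}

print classify_line("ATGCNATGCN"), "\n";     # only allowed letters
print classify_line("ATGCXATGCA"), "\n";     # 1 of 10 letters unexpected
print classify_line("MKVLAAGIVG"), "\n";     # protein data in DNA mode
```

The third call shows the typical severe case mentioned above: protein data checked without the -p option.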

Options

-p

    File contains protein sequences. 
-s

    Print summary only. 
-i

    Print identifiers only, and summary. 
-m

    Ignore minor problems. 
-v

    Print version of the program. 
-h

    Print help message. 
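
Single-letter flags that can be bundled, as in -p[simvh], are commonly parsed in Perl with the core Getopt::Std module. The sketch below shows one way this might be done; whether fastaFileCheck.pl actually uses Getopt::Std is an assumption, and parse_args is a hypothetical helper.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parses the flags documented above and returns
# a hash reference of options plus an array reference of file names.
# Returns an empty list if an unknown flag is given.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;            # getopts works on @ARGV
    my %opt;
    getopts('psimvh', \%opt) or return;
    return (\%opt, [@ARGV]);        # what is left are the file names
}

my ($opt, $files) = parse_args('-ps', 'proteinfastafile');
print "protein mode\n" if $opt->{p};
print "summary only\n" if $opt->{s};
print "input file: $files->[0]\n";
```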

Version

Version 2.2 of September 17, 2002

Version History

 2.0 2002/02/08 Added command line options
 2.1 2002/02/16 Added protein option
                duplicate id checking
                line length checking
                help information
 2.2 2002/09/17 Fixed a bug that reported too many empty lines

Author

Lukas Mueller

 The Arabidopsis Information Resource. Copyright (c) 2001, 2002. 
 The Carnegie Institution of Washington, Department of Plant Biology. Copyright (c) 2001, 2002.
 All Rights Reserved.