Synopsis

    perl fastaFileCheck.pl [-simvh] dnafastafile
    perl fastaFileCheck.pl -p[simvh] proteinfastafile

    Checks the structure of a DNA fasta file (or of a protein fasta file,
    using the -p option) and prints out various information.

Description

    The program checks

    (1) whether letters other than A, T, G, C and N occur in lines that do
        not start with > (or, when the protein option (-p) is used, letters
        other than the one-letter amino acid codes plus X and *). A line
        with less than 80% A, T, G, C and N is considered a severe problem
        (usually it is protein data or a return character in the
        description line, errors which will break most programs).
    (2) that there are no empty lines in the file.
    (3) whether duplicate identifiers are present in the file.
    (4) whether identifiers exceed a length of 50 characters.
    (5) whether there are lines that are over 100000 characters long. The
        total sequence length of an entry can be arbitrarily long - that is
        not checked. Line length here means a stream of characters until a
        carriage return is encountered. Fasta files often have lines that
        are 50-100 letters long, but this is not required.

    Lines with more than 20% non-DNA characters are counted as severe
    errors; lines with a smaller share of non-DNA characters are counted as
    minor errors.

Options

    -p    File contains protein sequences.
    -s    Print summary only.
    -i    Print identifiers only, and summary.
    -m    Ignore minor problems.
    -v    Print version of the program.
    -h    Print help message.

Version

    Version 2.2 of September 17, 2002

Version History

    2.0   2002/02/08   Added command line options
    2.1   2002/02/16   Added protein option
                       duplicate id checking
                       line length checking
                       help information
    2.2   2002/09/17   Fixed a bug that reported too many empty lines

Author

    Lukas Mueller
    The Arabidopsis Information Resource.
    Copyright (c) 2001, 2002.
    The Carnegie Institution of Washington, Department of Plant Biology.
    Copyright (c) 2001, 2002.
    All Rights Reserved.
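
The five checks listed under Description can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl source of fastaFileCheck.pl, and it rests on several assumptions the text above does not settle: the identifier is taken to be the first whitespace-delimited token after >, letters are compared case-insensitively, whitespace-only lines count as empty, and the function name check_fasta and the problem labels are hypothetical. Only the character check is given a severity, because that is the only check for which the description states one.

```python
from collections import Counter

# Alphabets and limits taken from the description above.
DNA_LETTERS = set("ATGCN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY") | {"X", "*"}
MAX_ID_LENGTH = 50          # check (4)
MAX_LINE_LENGTH = 100000    # check (5)
SEVERE_FRACTION = 0.20      # more than 20% unexpected characters is severe


def check_fasta(lines, protein=False):
    """Return a list of (line_number, kind, detail) tuples (hypothetical format)."""
    allowed = PROTEIN_LETTERS if protein else DNA_LETTERS
    seen_ids = Counter()
    problems = []

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")

        # (2) Empty lines. Assumption: whitespace-only lines count as empty.
        if not line.strip():
            problems.append((number, "empty_line", ""))
            continue

        # (5) Over-long physical lines (applied to every line).
        if len(line) > MAX_LINE_LENGTH:
            problems.append((number, "long_line", f"{len(line)} characters"))

        if line.startswith(">"):
            # Assumption: the identifier is the first token after ">".
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            seen_ids[identifier] += 1
            # (3) Duplicate identifiers.
            if seen_ids[identifier] > 1:
                problems.append((number, "duplicate_id", identifier))
            # (4) Identifiers longer than 50 characters.
            if len(identifier) > MAX_ID_LENGTH:
                problems.append((number, "long_id", identifier))
            continue

        # (1) Unexpected letters in sequence lines.
        # Assumption: comparison is case-insensitive.
        bad = sum(1 for ch in line.upper() if ch not in allowed)
        if bad:
            severe = bad / len(line) > SEVERE_FRACTION
            kind = "severe_chars" if severe else "minor_chars"
            problems.append((number, kind, f"{bad} unexpected characters"))

    return problems
```

Any iterable of lines works as input, so an open file handle can be passed directly; passing protein=True corresponds to the -p option, and dropping the "minor_chars" entries from the result would approximate -m.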