Statistics

Basic genome statistics
The basic statistics are calculated from the raw DNA sequence, not from predicted genes. The results include:
 * Total length (base pairs)
 * Percentage AT
 * Standard deviation AT (in the case of multiple replicons/contigs)
 * Number of replicons/contigs
 * Percentage of unknown bases (not A, T, C or G)
 * Fraction of the genome made up by the largest contig/replicon, as a percentage of total genome length. This measure is mostly useful for evaluating whether most of the genome is in one piece or whether it is completely fragmented.

    genomeStatistics .fna

    Filename  TotalBases:  Per.AT:  StDevAT:  ContigCount:  Per.Unknowns:  Per.LargestSeq
    .fna      2132142      61.3707  0.0000    1             0.0000         100.0000
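The internals of genomeStatistics are not shown on this page, so the following is only a minimal sketch of how most of the measures listed above could be computed with awk. The function name genome_stats_sketch is hypothetical, and the sketch assumes that all percentages are taken relative to the total length. StDevAT is left out because the page does not say how it is defined.

```shell
# Hypothetical helper, NOT the wiki's genomeStatistics program.
# Assumption: percentages are relative to the total number of bases.
genome_stats_sketch() {
    awk '
        /^>/ { n++; next }                    # each header line starts a new contig/replicon
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)           # drop whitespace and carriage returns
            len[n] += length(seq)
            total  += length(seq)
            at     += gsub(/[AT]/, "", seq)   # count A and T, removing them as we go
            gsub(/[CG]/, "", seq)             # remove C and G
            unk    += length(seq)             # whatever is left is not A, T, C or G
        }
        END {
            if (total == 0) exit 1            # no sequence data found
            for (i in len) if (len[i] > max) max = len[i]
            printf "TotalBases: %d\tPer.AT: %.4f\tContigCount: %d\tPer.Unknowns: %.4f\tPer.LargestSeq: %.4f\n",
                   total, 100 * at / total, n, 100 * unk / total, 100 * max / total
        }
    ' "$1"
}

# Usage (file name is a placeholder):
#   genome_stats_sketch genome.fna
```

A value of 100 for Per.LargestSeq together with a ContigCount of 1, as in the example output above, means the whole genome is in one piece.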

Unknown bases analysis
Some DNA sequences contain bases other than A, T, C or G. This can result from assembly programs when the distance between two sequences is known but the sequence between them is not. The analysis of these stretches produces the following measures (shown here with example values):


 * Total number of bases: 2209947
 * Total number of unknown stretches: 99
 * Total number of unknowns: 79605
 * Percentage of unknowns: 3.60212258484027
 * Average length of unknown stretch: 804.090909090909
 * Max/min length of unknown stretch: 1780 / 141

The program is called as follows:

    countUnknowns.pl Megamonas_hypermegale_ART12_1.fna
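The source of countUnknowns.pl is not shown here. As a rough illustration of what the measures mean, the sketch below counts runs of characters other than A, T, C or G with awk. The function name unknown_stretches_sketch is hypothetical, and the sketch assumes that a stretch may continue across wrapped lines but never across two FASTA records.

```shell
# Hypothetical helper, NOT the wiki's countUnknowns.pl.
# Assumption: a stretch can span line breaks but ends at a FASTA header.
unknown_stretches_sketch() {
    awk '
        function close_run() {                 # finish the current unknown stretch, if any
            if (run == 0) return
            stretches++; unknowns += run
            if (run > max) max = run
            if (min == 0 || run < min) min = run
            run = 0
        }
        /^>/ { close_run(); next }
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)
            total += length(seq)
            while (length(seq) > 0) {
                if (match(seq, /^[^ACGT]+/)) { # unknown bases at the start of what is left
                    run += RLENGTH
                    seq = substr(seq, RLENGTH + 1)
                } else {                       # a known base ends any open stretch
                    close_run()
                    if (match(seq, /[^ACGT]/)) seq = substr(seq, RSTART)
                    else seq = ""
                }
            }
        }
        END {
            close_run()
            printf "Total number of bases\t%d\n", total
            printf "Total number of unknown stretches\t%d\n", stretches
            printf "Total number of unknowns\t%d\n", unknowns
            printf "Percentage of unknowns\t%.4f\n", (total ? 100 * unknowns / total : 0)
            printf "Average length of unknown stretch\t%.4f\n", (stretches ? unknowns / stretches : 0)
            printf "Max/min length of unknown stretch\t%d\t%d\n", max, min
        }
    ' "$1"
}

# Usage:
#   unknown_stretches_sketch Megamonas_hypermegale_ART12_1.fna
```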

Amino acid and codon usage
This system offers several ways of analyzing third-position base use, amino acid usage and codon usage. The first is a visual presentation, which should be used to present the patterns of a few genomes. This approach is not useful for comparing many genomes. The analysis uses the open reading frames predicted by a gene finder (DNA sequences in FASTA format):

    >NZ_ADFU01000001__CDS_1275-526
    ATGAAAAAATCCACTTTGCTTGCTTTCACAGCGGCAGTATTATTCGGCAGTGGCGT
    CACGTTAATGCGGCATCTGCTACATATGATGATCCATTGCTTTTACCAAATCCTGC
    GCGCCTACAACAGGTTCTGTTGTATTGGTTCCTGTGGCTAGCCCTCAGGCGGTGCA
    ............

The output of the analysis is a PDF file along with a raw data file, in the format shown below:

    Veillonella_parvula_ATCC_17745_prodigal.orf.fna TotalBases: 1900137 PerAT: 60.38 StDevAT: 0.04
    codon	AAA	4.39974	27867
    codon	CAA	2.79548	17706
    .........
    aa	C	0.9828
    aa	P	3.6291
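In the example output above, the codon percentage matches the codon count divided by the total number of codons (27867 out of 1900137 / 3 codons is about 4.39974%). The sketch below reproduces that calculation for the codon lines only. It is an illustration under assumptions, not the actual CodonAaUsage.pl: the function name codon_usage_sketch is hypothetical, every sequence is assumed to start in frame, and trailing bases that do not fill a codon are ignored. Amino acid usage is left out because it would also need a translation table.

```shell
# Hypothetical helper, NOT the wiki's CodonAaUsage.pl.
# Assumptions: each ORF starts in frame; incomplete trailing codons are skipped.
codon_usage_sketch() {
    awk '
        function count_codons(   i, n, c) {   # tally codons of the record collected so far
            n = length(seq)
            for (i = 1; i + 2 <= n; i += 3) {
                c = substr(seq, i, 3)
                count[c]++; total++
            }
            seq = ""
        }
        /^>/ { count_codons(); next }         # a new header closes the previous ORF
        {
            s = toupper($0)
            gsub(/[^A-Z]/, "", s)
            seq = seq s                       # join wrapped sequence lines
        }
        END {
            count_codons()
            for (c in count)
                printf "codon\t%s\t%.5f\t%d\n", c, 100 * count[c] / total, count[c]
        }
    ' "$1" | LC_ALL=C sort
}

# Usage:
#   codon_usage_sketch Veillonella_parvula_ATCC_17745_prodigal.orf.fna
```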

The analysis should be performed in a directory that contains a file called .orf.fna, and is run as follows:

    basicGenomeAnalysis organismName /usr/bin/gnuplot

It is also possible to just run the calculations, without the visual presentation. This is more useful for comparing many genomes.

    for i in *fna
    do
        perl /usr/biotools/indirect/atStats.pl "$i" > "$i.atStats.tab"
        cat "$i.atStats.tab" > "$i.CodonAaUsage"
        perl /usr/biotools/indirect/CodonAaUsage.pl "$i" >> "$i.CodonAaUsage"
        rm "$i.atStats.tab"
    done

To collect the data for all genomes, construct one file per type of data (amino acid usage, codon usage and statistics):

    grep aa *AaUsage > aaUsage.all
    sed -i s/_prodigal.orf.fna.CodonAaUsage:aa//g aaUsage.all

    grep Total *AaUsage > statistics.all
    sed -i 's/_prodigal.orf.fna.CodonAaUsage:/\t/g' statistics.all
    cut -f2,3,4,5,6,7,8 statistics.all > tmp.all
    mv tmp.all statistics.all
    sed -i 's/_prodigal.orf.fna//g' statistics.all

    grep codon *AaUsage > codonUsage.all
    sed -i s/_prodigal.orf.fna.CodonAaUsage:codon//g codonUsage.all

Questions

 * Use head and tail to have a look at these files. What do they contain?

These files can be used to plot several different patterns using Excel, R or other plotting programs. What the data can be used for depends on the aim of the study and cannot be standardized. The approach shown here compares XX of all genomes and graphically represents the results as a XX plot. To see how these plots were made, go to the R page: http://biotoolscmg.wikia.com/wiki/R#Reshape_table_to_matrix_.28heatmap.29
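The reshaping itself is described on the R page. For readers who want to inspect a matrix before moving to R, the following is a rough shell-only sketch that turns a long table into a genome-by-codon matrix. It assumes that each line of codonUsage.all holds four tab-separated fields (genome name, codon, percentage, count), which is what the sed commands above should leave behind. The function name usage_matrix_sketch is hypothetical.

```shell
# Hypothetical helper; assumes tab-separated input: genome, codon, percentage, count.
usage_matrix_sketch() {
    awk -F'\t' '
        !($1 in seen_g) { seen_g[$1] = 1; genomes[++ng] = $1 }   # rows in first-seen order
        !($2 in seen_k) { seen_k[$2] = 1; keys[++nk] = $2 }      # columns in first-seen order
        { value[$1, $2] = $3 }
        END {
            printf "genome"
            for (j = 1; j <= nk; j++) printf "\t%s", keys[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genomes[i]
                for (j = 1; j <= nk; j++)
                    printf "\t%s", (((genomes[i], keys[j]) in value) ? value[genomes[i], keys[j]] : "NA")
                printf "\n"
            }
        }
    ' "$1"
}

# Usage (output file name is a placeholder):
#   usage_matrix_sketch codonUsage.all > codonUsage.matrix
```

Missing genome/codon combinations are written as NA, which R reads as a missing value.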