FASTA (.fasta) File Format
Description
Details on the FASTA format
Notes
Examples
References
FASTA is a plaintext format for storing protein or nucleic acid (DNA or RNA) data as character sequences. It is a popular interchange format for molecular biology software.
The commands Import and Export support this format.
The FASTA format employs the following standard IUB/IUPAC conventions for encoding protein or nucleic acid sequences as alphabetic characters.
In addition to codes for specific nucleotides or amino acids, the conventions include ambiguity codes for positions that may be occupied by more than one nucleotide or amino acid. For example, the code R matches either adenine (A) or guanine (G).
Table 1: Nucleic Acid Codes
Code    Meaning
A       Adenine
B       {C,G,T,U} (not A)
C       Cytosine
D       {A,G,T,U} (not C)
G       Guanine
H       {A,C,T,U} (not G)
T       Thymine
V       {A,C,G} (not T or U)
U       Uracil
N       {A,C,G,T,U} (any nucleic acid)
R       {A,G} (purine)
Y       {C,T,U} (pyrimidine)
K       {G,T,U} (ketone)
M       {A,C} (amino)
S       {C,G} (strong interaction)
W       {A,T,U} (weak interaction)
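As an illustration outside Maple, the ambiguity codes in Table 1 can be treated as sets of concrete bases. The following Python sketch transcribes the table into a dictionary and checks whether a sequence is compatible with a pattern written in ambiguity codes. The names IUPAC_NUCLEOTIDES, code_matches, and pattern_matches are hypothetical helpers introduced here, not part of any library.

```python
# Illustrative transcription of Table 1: each code maps to the bases it covers.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU",
}


def code_matches(code: str, base: str) -> bool:
    """Return True if the concrete base is one of those covered by the code."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Return True if every position of sequence is allowed by pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(code_matches(c, b) for c, b in zip(pattern, sequence))
```

For instance, code_matches("R", "G") is true because R covers {A,G}, while code_matches("R", "C") is false.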
Table 2: Amino Acid Codes
Code    Meaning
A       Alanine
B       D or N
C       Cysteine
D       Aspartic acid
E       Glutamic acid
F       Phenylalanine
G       Glycine
H       Histidine
I       Isoleucine
J       I or L
K       Lysine
L       Leucine
M       Methionine
N       Asparagine
O       Pyrrolysine
P       Proline
Q       Glutamine
R       Arginine
S       Serine
T       Threonine
U       Selenocysteine
V       Valine
W       Tryptophan
X       any amino acid
Y       Tyrosine
Z       E or Q
*       translation stop
-       gap of indeterminate length
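Table 2 mixes codes for specific amino acids with codes that stand for more than one residue (B for D or N, J for I or L, Z for E or Q, and X for any amino acid). As a rough sketch outside Maple, the Python below locates such positions in a protein sequence. AMBIGUOUS_RESIDUES, possible_residues, and ambiguous_positions are hypothetical names chosen for this illustration.

```python
# Two-way ambiguity codes from Table 2; X (any amino acid) is handled separately.
AMBIGUOUS_RESIDUES = {
    "B": "DN",  # aspartic acid or asparagine
    "J": "IL",  # isoleucine or leucine
    "Z": "EQ",  # glutamic acid or glutamine
}


def possible_residues(code: str) -> str:
    """Return the residues a two-way ambiguity code stands for, else the code itself."""
    code = code.upper()
    return AMBIGUOUS_RESIDUES.get(code, code)


def ambiguous_positions(sequence: str) -> list:
    """Return the 1-based positions holding B, J, Z, or X."""
    return [
        i
        for i, ch in enumerate(sequence.upper(), start=1)
        if ch in AMBIGUOUS_RESIDUES or ch == "X"
    ]
```

Positions are reported 1-based to match the indexing used in the Maple examples on this page.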
Content-Type: chemical/seq-aa-fasta, chemical/seq-na-fasta
Import a DNA sequence from a FASTA file.
DNASequence := Import("example/humanmtDNA.fasta", base = datadir):
Read the descriptor for the first sequence in the file.
DNASequence[1, 1]
Human mitochondrial genome,HVR2,CR,HVR1
Examine positions 100 through 150 in this sequence.
DNASequence[1, 2][100 .. 150]
GGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATC
Count the frequency of each nucleotide base in the sequence.
frequencies := StringTools:-CharacterFrequencies(DNASequence[1, 2], dna)
frequencies := A = 5118, C = 5185, G = 2175, T = 4092
Statistics:-ColumnGraph(frequencies)
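For readers working outside Maple, the same workflow (read the records, inspect a descriptor, take a subsequence, count bases) can be sketched in Python. This is a minimal illustration, not the method used by Import. It assumes the common FASTA convention, not spelled out on this page, that each record begins with a descriptor line starting with ">" and is followed by one or more sequence lines. The sample text and the function name parse_fasta are made up for the example.

```python
from collections import Counter


def parse_fasta(text: str) -> list:
    """Split FASTA text into (descriptor, sequence) pairs.

    Assumes descriptor lines start with '>' and sequence lines follow them.
    """
    records = []
    descriptor = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if descriptor is not None:
                records.append((descriptor, "".join(chunks)))
            descriptor = line[1:].strip()
            chunks = []
        elif descriptor is not None:
            chunks.append(line)  # sequence data may span several lines
    if descriptor is not None:
        records.append((descriptor, "".join(chunks)))
    return records


# Made-up sample data, not taken from humanmtDNA.fasta.
SAMPLE = """>toy sequence 1
GGAGCCGGAG
CACCCTATGT
>toy sequence 2
ACGT
"""

records = parse_fasta(SAMPLE)
descriptor, sequence = records[0]
print(descriptor)         # descriptor of the first record
print(sequence[2:6])      # positions 3 through 6 (Python slices are 0-based)
print(Counter(sequence))  # frequency of each base in the first record
```

Note that Python slices are 0-based and exclude the end index, whereas the Maple range 100 .. 150 is 1-based and inclusive.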
IUPAC code for incomplete nucleic acid specification, National Center for Biotechnology Information.
A One-Letter Notation for Amino Acid Sequences, International Union of Pure and Applied Chemistry.
See Also
Formats
Formats,FASTQ
Formats,GenBank