Biopython – Sequence Operations
Biopython provides built-in methods for both basic and advanced operations on sequences. The basic operations closely resemble Python string methods: slicing, concatenation, find, count, strip, split, and so on. Some of the advanced operations are described below.
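As a quick illustration of the basic operations, the sketch below runs them on a plain Python str rather than a Seq, so it needs no Biopython install. The same slicing, concatenation, find, count, and split calls are expected to work on a Seq object. The sequence values are made up for illustration.

```python
# Plain-str stand-in for a Seq; the values are illustrative only
seq = "CTGACTGAAGCT"
other = "AAGG"

first_codon = seq[0:3]        # slicing: the first three bases
joined = seq + other          # concatenation
position = seq.find("GAA")    # index of the first match, or -1 if absent
g_count = seq.count("G")      # number of non-overlapping occurrences
parts = seq.split("GA")       # split the sequence on a motif

print(first_codon, joined, position, g_count, parts)
```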
Complement and Reverse Complement: Biopython provides the complement() and reverse_complement() methods on Seq objects. complement() returns a new sequence in which every nucleotide is replaced by its pairing base, and reverse_complement() returns that complemented sequence read in the opposite direction. Applying reverse_complement() twice returns the original sequence. Below is a simple example of both methods:
Syntax: complement(self)
Return Type: <class 'Bio.Seq.Seq'>
Python3
# Import libraries
from Bio.Seq import Seq

# Creating sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
seq = Seq('CTGACTGAAGCT')

# Creating the complement of the sequence and printing it
comp = seq.complement()
print(repr(comp))

# Creating the reverse complement of the original sequence and printing it
rev_comp = seq.reverse_complement()
print(repr(rev_comp))
Output:
Seq('GACTGACTTCGA')
Seq('AGCTTCAGTCAG')
In the above example, the complement() method creates the complement of the sequence, while the reverse_complement() method creates the complement of the sequence and reverses the result from left to right.
The Bio.Data.IUPACData module of Biopython provides the ambiguous_dna_complement dictionary, which maps each IUPAC nucleotide code to its complement and underlies the complement operations.
Python3
# Import libraries
from Bio.Data import IUPACData
import pprint

# Printing the dataset
pprint.pprint(IUPACData.ambiguous_dna_complement)
Output:
{'A': 'T',
 'B': 'V',
 'C': 'G',
 'D': 'H',
 'G': 'C',
 'H': 'D',
 'K': 'M',
 'M': 'K',
 'N': 'N',
 'R': 'Y',
 'S': 'S',
 'T': 'A',
 'V': 'B',
 'W': 'W',
 'X': 'X',
 'Y': 'R'}
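To make the role of this mapping concrete, here is a minimal plain-Python sketch that applies it by hand. It copies the dictionary printed above rather than importing it, and the helper names complement and reverse_complement are illustrative stand-ins, not Biopython's implementation.

```python
# Copy of the mapping printed above (IUPAC code -> complementary code)
AMBIGUOUS_DNA_COMPLEMENT = {
    'A': 'T', 'B': 'V', 'C': 'G', 'D': 'H',
    'G': 'C', 'H': 'D', 'K': 'M', 'M': 'K',
    'N': 'N', 'R': 'Y', 'S': 'S', 'T': 'A',
    'V': 'B', 'W': 'W', 'X': 'X', 'Y': 'R',
}

# Build the translation table once so every base is mapped in a single pass
_TABLE = str.maketrans(AMBIGUOUS_DNA_COMPLEMENT)


def complement(dna: str) -> str:
    """Replace each base with its complementary base."""
    return dna.upper().translate(_TABLE)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then read it right to left."""
    return complement(dna)[::-1]


print(complement("CTGACTGAAGCT"))
print(reverse_complement("CTGACTGAAGCT"))
```

For CTGACTGAAGCT, the complement is GACTGACTTCGA and the reverse complement is AGCTTCAGTCAG. Applying reverse_complement twice gives back the input.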
GC Content (guanine-cytosine content): GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). It is computed by dividing the number of G and C nucleotides by the total number of nucleotides. Below is a basic example of calculating GC content:
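Syntax: Bio.SeqUtils.gc_fraction(seq)
Return Type: <class 'float'>
Python3
# Import libraries
from Bio.Seq import Seq
# gc_fraction is available from Biopython 1.80; older releases
# provided GC(seq), which returned the percentage directly
from Bio.SeqUtils import gc_fraction

# Creating sequence
seq = Seq("CTGACTGAAGCT")

# Getting GC content as a percentage (gc_fraction returns a value from 0 to 1)
print(gc_fraction(seq) * 100)
Output:
50.0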
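The calculation itself is simple enough to sketch in plain Python. The helper name gc_percent is hypothetical, and this version counts only the literal letters G and C, so ambiguity codes such as S are ignored here even though a library routine may handle them differently.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


# 6 of the 12 bases are G or C
print(gc_percent("CTGACTGAAGCT"))
```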
Transcription: Transcription is the process of converting a DNA sequence into an RNA sequence. Biologically, the mRNA is built from the template strand, which amounts to taking its reverse complement (GACT -> AGUC). In Biopython, transcribe() instead works on the coding strand and converts it directly to mRNA by replacing each T with U. A simple example is given below:
Syntax: transcribe(self)
Return Type: <class 'Bio.Seq.Seq'>
Python3
# Import libraries
from Bio.Seq import Seq
from Bio.Seq import transcribe

# Creating sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
dna_seq = Seq("CTGACTGAAGCT")

# Transcription to RNA
print(transcribe(dna_seq))

# Reverse transcription to DNA
rna_seq = transcribe(dna_seq)
print(rna_seq.back_transcribe())
Output:
CUGACUGAAGCU
CTGACTGAAGCT
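The difference between the two views of transcription can be sketched in plain Python. The function names below are illustrative, not Biopython's API, and the sketch assumes unambiguous DNA made up only of A, C, G, and T.

```python
# Pairing rules for unambiguous DNA
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe_coding(dna: str) -> str:
    """Coding strand -> mRNA: swap every T for U."""
    return dna.upper().replace("T", "U")


def transcribe_template(dna: str) -> str:
    """Template strand -> mRNA: reverse complement, then swap T for U."""
    coding = dna.upper().translate(_COMPLEMENT)[::-1]
    return transcribe_coding(coding)


def back_transcribe(rna: str) -> str:
    """mRNA -> coding-strand DNA: swap every U for T."""
    return rna.upper().replace("U", "T")


print(transcribe_coding("CTGACTGAAGCT"))
print(transcribe_template("GACT"))
print(back_transcribe("CUGACUGAAGCU"))
```

The first call reproduces the T-to-U substitution described above, and the second reproduces the GACT -> AGUC template-strand example.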
Translation: Translation is the process of converting an RNA sequence into a protein sequence. The Seq object has a built-in translate() method for this purpose. To stop translation at the first stop codon, pass to_stop=True to the translate() method.
Biopython uses the translation tables provided by The Genetic Codes page of NCBI. The Standard table (NCBI table 1) is printed below:
Syntax: translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap='-')
Return Type: <class 'Bio.Seq.Seq'>
Python3
# Import libraries
from Bio.Data import CodonTable

# Creating table
table = CodonTable.unambiguous_dna_by_name["Standard"]

# Print table
print(table)
Output:
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
A simple example of translation is given below :
Python3
# Import libraries
from Bio.Seq import Seq

# Creating sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
rna = Seq('UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA')
print(rna)

# Translating RNA
print(rna.translate())

# Stop translation at the first stop codon (the asterisk '*' marks a stop codon)
print(rna.translate(to_stop=True))
Output:
UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA
YRIVFPG*SCAR
YRIVFPG
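To show what a codon-table lookup involves, the following plain-Python sketch rebuilds the Standard table shown earlier and translates codon by codon. It is an illustration under simplifying assumptions (unambiguous bases only, trailing partial codons dropped), not Biopython's implementation, and the names CODON_TABLE and translate are local to this example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), listed in TCAG order with the
# third base varying fastest; '*' marks a stop codon
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, to_stop: bool = False) -> str:
    """Translate an RNA or DNA string three bases at a time."""
    dna = seq.upper().replace("U", "T")   # look codons up in DNA letters
    usable = len(dna) - len(dna) % 3      # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break                         # stop before emitting the '*'
        protein.append(aa)
    return "".join(protein)


rna = "UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA"
print(translate(rna))
print(translate(rna, to_stop=True))
```

With the example sequence, the eighth codon UGA is a stop codon, so the full translation is YRIVFPG*SCAR and the to_stop=True result is YRIVFPG.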