13.1 Essential ideas

13.1.5 Bioinformatics (HL)

Bioinformatics uses computers to analyse sequence data.


  • Scientists have easy access to information stored in databases.  
  • The amount of data is increasing exponentially due to advances in sequencing technology, modelling and imaging software, and computing power.

Database name | Data type and tools
--- | ---
European Bioinformatics Institute (EBI) | Gene expression data (from microarray analysis); genomes for 75 species (genome sequencing data)
DNA Data Bank of Japan (DDBJ) | Genomic data; expressed sequence tags (ESTs)
Protein Data Bank (PDB) | Interactive molecular structures in 3D
National Center for Biotechnology Information (NCBI) | Genes, genomes, proteins, ESTs, homologous sequences

Sequence alignment software

  • Sequence alignment software allows comparison of sequences from different species.
  • Similarities between nucleotide sequences, or between protein sequences, suggest similar function and indicate evolutionary relationships. In general, the more closely two sequences align, the more recently the species shared a common ancestor.
  • The Basic Local Alignment Search Tool (BLAST) is a computer algorithm that searches sequences for regions of similarity.
  • The two (or more) sequences compared in a BLAST alignment may come from databases, or a newly acquired sequence may be compared to known sequences from a database.
  • BLASTn allows nucleotide sequence alignment, while BLASTp allows protein alignment.
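
To make the idea of "regions of similarity" concrete, the following is a minimal sketch of local alignment using the Smith-Waterman dynamic programming method. This is not how BLAST itself works (BLAST uses a faster heuristic search), and the function name and scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Smith-Waterman sketch: scoring values are illustrative assumptions,
    not the parameters used by BLAST.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below 0, which is what
    # makes the alignment local rather than end-to-end.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `local_alignment("TTTACGTGG", "CCACGTAA")` picks out the shared stretch `ACGT` even though the flanking bases differ, which is the sense in which the alignment is local.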


  • Phylogeny refers to a species’ evolutionary line of descent. Multiple sequence alignments are used in the study of phylogenetics and to construct phylogenetic trees.
  • A phylogenetic tree differs from a cladogram (see 5.1.4) because the length of each branch represents the amount of change over time in that species.

Figure 13.1.5a – Comparison of cladogram and phylogenetic trees

  • Similarities in biochemical sequences that are a product of evolution are called homologies. Homologous sequences are shared in species that have a common ancestry.
  • Some sequences are similar without having been inherited from a common ancestor, for example because of random mutation. These similarities are not homologies, and the sequences are called analogous sequences.
  • Computer-based algorithms take into account the rate of mutation when multiple sequences are aligned, so that evolutionary relationships can be determined more accurately.
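
One way to see what "taking the rate of mutation into account" can mean is to compare the raw proportion of differing sites between two aligned sequences with a corrected distance. The sketch below assumes the Jukes-Cantor model, a simple model in which all substitutions are equally likely. The notes do not name a specific model, so this choice and the function names are assumptions for illustration only.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must already be aligned to the same length")
    # Ignore any column in which either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    Corrects the observed proportion p for repeated mutations at the
    same site, which a simple count of differences cannot see.
    """
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the observed proportion, and the gap between the two widens as sequences diverge, because more of the changes are hidden by later mutations at the same site.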

Figure 13.1.5b – Multiple sequence alignments are used in phylogenetics

Model organisms

  • There are ethical considerations that prevent in vivo experimentation on humans and other higher order animals.
  • Gene function can be studied using model organisms with similar sequences, or having similar biochemical mechanisms and pathways.
  • Some of the most extensively studied model organisms are Drosophila melanogaster (fruit fly), Escherichia coli (a bacterium), and Mus musculus (house mouse, the common laboratory mouse). The mouse has more than 80% genetic similarity with humans.

Application: Knockout technology in mice

  • Researchers can determine the function of a gene in vivo, by observing the effects of its removal. The functional sequence of a gene is 'knocked out' and replaced with a non-functioning sequence.

Figure 13.1.5ci and cii – Knockout technology

Figure 13.1.5c – General strategy for gene targeting

  • Embryonic stem (ES) cells are transfected with the manipulated gene and cultured in vitro. The modified stem cells are then injected into an early embryo, resulting in mosaic (chimeric) mice that carry cells from two mouse strains (wild type and knockout).
  • Subsequent generations are bred to produce a population of knockout mice. The effect of the altered gene can be compared experimentally with the wild type mouse.

Identifying potential genes using ESTs

  • Complete genes contain both coding and non-coding sequences. While a gene is being expressed, mature mRNA, from which the non-coding introns have been removed, is present in cells.
  • This mRNA can be captured from tissues and used to produce complementary DNA (cDNA) using reverse transcriptase. A short stretch of the cDNA, typically between 200 and 500 nucleotides, is sequenced and is called an expressed sequence tag (EST).
  • Scientists can compare newly generated ESTs to databases containing ESTs of known gene function. Similarly, they can use a BLASTn search to align unknown ESTs to sequences in another species’ genome.
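
The comparison of a new EST against a database of known ESTs can be sketched as a search for shared short subsequences (k-mers). This is only a toy illustration of the matching idea, not the BLASTn algorithm. The function names, the value of k, and the database entries (sequences and function labels) are all hypothetical.

```python
# Hypothetical database: sequences and function labels are invented for illustration.
KNOWN_ESTS = {
    "EST_A (hypothetical kinase)": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    "EST_B (hypothetical transporter)": "ATGAAACGCATTAGCACCACCATTACCACCACCATCACC",
}


def kmers(seq, k=8):
    """Return the set of all subsequences of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_est_match(query, database, k=8):
    """Return (name, shared_kmer_count) for the entry sharing most k-mers.

    Returns (None, 0) when the query shares no k-mer with any entry.
    """
    query_kmers = kmers(query, k)
    best_name, best_shared = None, 0
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name, best_shared
```

A query such as `best_est_match("TTGTAATGGGCC", KNOWN_ESTS)` returns the entry that contains that stretch, suggesting a candidate function for the new EST that would then need to be confirmed experimentally.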

Application: Discovery of genes by EST data mining

Data mining is the practice of using existing data to generate new information.  Between 1990 and 1995, hundreds of thousands of ESTs were generated using automated sequencing technologies. Today, researchers can use the existing EST databases to match a novel EST sequence with an identified function in another species.

Figure 13.1.5d – Drosophila
Gene function can be studied using model organisms with similar sequences

Nature of Science

Cooperation and collaboration between scientists: databases on the internet allow scientists free access to information

Figure 13.1.5e – Molecular clock

Food for thought

  • What is meant by the term ‘molecular clock’? How do scientists determine when species diverged from each other on evolutionary timelines?

Figure 13.1.5f – Dopamine transporter
Knockout mice that lack dopamine transporters behave like drug addicts.

Concept help

  • Knockout technology is targeted gene replacement, meaning that the exact locus is targeted. By contrast, when genes are introduced by genetic engineering with plasmid vectors, or by genetic transformation using physical methods, the inserted genes may end up at different loci in the host genome.

Figure 13.1.5g – Knockout mice

Figure 13.1.5h – Fat mice