There is a lot of raw genetic data out there. The cost of sequencing genomes has dropped drastically thanks to the advent of next-generation sequencing technologies. The result is a plethora of genomes available from the National Center for Biotechnology Information, which boasts 3.12 million total assembled genomes at the time of writing, 2.52 million of them annotated. That isn’t even counting non-assembled genomes, which poses a problem in itself!
Today’s problem is making sense of all of that data. All of life is coded in sequences of four letters – A, C, G, and T – and from just those four letters, we can learn almost anything about a species. Over time, the biological problem of determining species function became a computational one.
But where do we even begin? How do we use computational data to get meaningful biological insights?
For my final group project in my undergraduate Genomics class, my group decided to tackle comparative genomics by comparing mosquito genomes. We wanted to see if there were significant genetic differences that determined their abilities as human disease carriers.

OrthoFinder
For my part of the project, I decided to do a gene set comparison between three species using a command-line tool written in Python called OrthoFinder. OrthoFinder determines orthologous groups (orthogroups): sets of homologous (similar) genes in different species that arose from a speciation event. (Contrast these with paralogs, which are homologous genes arising from a gene duplication event.) Essentially, for my purposes, comparing homologous genes was the key to highlighting the genetic differences between the mosquitos.
OrthoFinder analyzes given FASTA files of each species’ protein sequences and produces numerous output files describing all analyzed orthogroups, phylogenetic groupings, statistics, gene duplication events, and gene and species trees. There is a lot of data, but how do we make sense of it?

Mapping Genes: The Biology Side of Bioinformatics
This is where the Central Dogma comes in to play. For those of you unfamiliar, it goes like this:
DNA → RNA → Protein
This means that DNA nucleotides (A, C, G, and T) are transcribed into RNA (similar to DNA, except that T is replaced with U), and each set of three RNA nucleotides – a codon – codes for one amino acid; chains of amino acids form a protein.
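The central dogma’s mapping can be sketched in a few lines of Python. The codon table below is a tiny illustrative subset of the real 64-codon standard table:

```python
# Minimal sketch of the Central Dogma: DNA -> RNA -> protein.
# Only a handful of codons from the standard table are included, for illustration.
CODON_TABLE = {
    "AUG": "M",  # methionine (start codon)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "UAA": "*",  # stop codon
}

def transcribe(dna: str) -> str:
    """DNA -> RNA: thymine (T) is replaced with uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """RNA -> protein: each codon (three nucleotides) maps to one amino acid."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "?")
        if amino_acid == "*":  # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("ATGTTTGGCTAA")))  # MFG
```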
Proteins are the key here: they are the foundation of all cell functions, and they determine function in a species. It’s why something like AlphaFold was so revolutionary – being able to predict how a protein was structured, a notoriously difficult task, unlocked the key to what it actually did.
I needed a data pipeline to map the genes to proteins. There are numerous (free!) databases for any budding bioinformatician to do their own research. The beauty of biologists’ meticulous need to gather data is that they tend to make it available for others to use.
In this case, I used the UniProt database. UniProt provides several APIs to automate this process programmatically, which is highly useful when you’re querying tens of thousands of different genes. OrthoFinder helpfully provided protein IDs, each of which maps to a protein entry in the UniProt database.
With the help of a hastily-drawn note describing my data pipeline, my group member was able to write a Python script to call the ID mapping API.
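As a sketch, submitting one ID-mapping job looks roughly like this. The endpoint URL and the `from`/`to`/`ids` parameter names reflect my reading of UniProt’s REST ID-mapping API, not our original script, so verify them against the current UniProt documentation:

```python
# Sketch of preparing a UniProt ID-mapping job. The endpoint URL and the
# "from"/"to"/"ids" parameter names are assumptions based on UniProt's REST
# ID-mapping API -- check the current UniProt docs before relying on them.
UNIPROT_RUN_URL = "https://rest.uniprot.org/idmapping/run"

def build_idmapping_payload(protein_ids,
                            from_db="UniProtKB_AC-ID",
                            to_db="UniProtKB"):
    """Build the form payload for one ID-mapping job (one orthogroup's proteins)."""
    return {"from": from_db, "to": to_db, "ids": ",".join(protein_ids)}

# A real script would POST this payload (e.g. with the requests library),
# poll the returned job ID until the job finishes, then fetch the results.
payload = build_idmapping_payload(["P12345", "P67890"])  # placeholder accessions
```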

The pipeline went something like this:
- From `Orthogroups.tsv`, an output file from OrthoFinder, remove any orthogroups that aren’t shared between at least two species.
- Make each remaining orthogroup a single job that queries the database. For each job, pull out the GO IDs and GO terms.
- Group the orthogroups shared by all species or between two species (for a total of four different files).
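The first filtering step can be sketched in Python. This assumes the standard `Orthogroups.tsv` layout – one `Orthogroup` column, then one column per species listing that species’ genes, blank when the species has none:

```python
import csv

def shared_orthogroups(tsv_path, min_species=2):
    """Return IDs of orthogroups with genes in at least `min_species` species.

    Assumes an Orthogroups.tsv layout of one "Orthogroup" column followed by
    one column per species, blank when that species has no genes in the group.
    """
    kept = []
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            species_columns = [col for col in row if col != "Orthogroup"]
            # Count how many species have at least one gene in this orthogroup
            n_present = sum(1 for col in species_columns if (row[col] or "").strip())
            if n_present >= min_species:
                kept.append(row["Orthogroup"])
    return kept
```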
The result was a table of gene functions for each identified orthogroup:

GO Terms
What are GO terms? GO stands for Gene Ontology, the most comprehensive, unifying source of gene function. This is the best way to connect genes and proteins to actual function. Now we’re getting somewhere!
Each orthogroup was identified under one of four groupings:
- All three mosquito species
- Aedes aegypti and Anopheles moucheti
- Anopheles moucheti and Culex pipiens
- Aedes aegypti and Culex pipiens
From there, we could determine which groups shared which functions.
Using R to Process & Visualize Data: The Computational Side of Bioinformatics
The best way to understand these data was to visualize them. R is a fantastic language for data visualization, although this could just as easily have been done in Python. I like to call R the “spreadsheet language” because it’s highly amenable to working with spreadsheet-style data: in R, tables are “data frames” whose columns are vectors.
I used a collection of packages for R called the Tidyverse, which includes data science packages like dplyr and ggplot2, to analyze the data output from our data pipeline.
(This section will get pretty code-heavy!)
Frequency Table Data Pipeline
To create a frequency table of GO terms, I created the following pipeline:
frequency_table <- . %>% strsplit("; ") %>% unlist() %>% table %>%
sort(decreasing=T) %>% as.data.frame(stringsAsFactors = F) %>%
set_names(c("GO.term", "Frequency"))
Breaking it down:
- `frequency_table` is the name of the function
- `<-` is the assignment operator in R
- `%>%` is the pipe operator from the magrittr package. It means “use the result on the left as input to the function on the right”, and it forms the basis of data piping in the Tidyverse
- `strsplit("; ")` splits each string of GO terms, which were separated with `; `
- `unlist() %>% table` flattens the result and tallies each GO term
- `sort(decreasing=T)` sorts the tally in descending order
- `as.data.frame(stringsAsFactors = F) %>% set_names(c("GO.term", "Frequency"))` converts the tally into a data frame with named columns
This function was called for each of the previously-described four groups of orthogroups.
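For comparison, the same frequency table can be sketched in Python with `collections.Counter`; the input here is assumed to be one `; `-separated string of GO terms per orthogroup, mirroring the R pipeline above (the GO IDs in the example are ATP binding and DNA binding):

```python
from collections import Counter

def go_frequency_table(go_strings):
    """Build a (GO term, frequency) table, sorted by descending frequency.

    `go_strings` holds one "; "-separated string of GO terms per orthogroup,
    mirroring the input to the R pipeline above.
    """
    terms = [term for s in go_strings for term in s.split("; ") if term]
    counts = Counter(terms)
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(go_frequency_table(["GO:0005524; GO:0003677", "GO:0005524"]))
# [('GO:0005524', 2), ('GO:0003677', 1)]
```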
Often in bioinformatics and data science, you will be cleaning up and analyzing data. The data pipeline is an excellent way of doing this without altering the original data and visualizing exactly how the data is transformed at each step. The script can be run one step at a time to trace how the data is being transformed and catch any errors.
Mapping GO Terms by Orthogroup
This code snippet looks just as complicated, if not more so, but if we trace how the data is transformed, it becomes easier to understand.
mapped_orthogroups_m_terms <- mapped_orthogroups %>%
select(Orthogroup, go_m) %>%   # positive selection already drops job_id, go_b, and go_id
separate_rows(go_m, sep = "', |\\\", ") %>%
mutate_at("go_m", str_replace, "\\[\\]", "none") %>%
mutate_at("go_m", str_remove, "^\\['|^'|^\\\"") %>%
mutate_at("go_m", str_remove, "'\\]$") %>%
group_by(go_m) %>%
mutate(orthogroup_list = map_chr(Orthogroup,
~toString(union(Orthogroup, .x)))) %>%
select(go_m, orthogroup_list) %>%
unique() %>%
mutate(length = str_count(orthogroup_list, ",")+1) %>%
arrange(desc(length))
- `mapped_orthogroups` is the raw data that was originally loaded in
- `select()` selects the columns to be analyzed
- `separate_rows()` splits each string of GO terms into one row per term
- `mutate_at()` is a special function that applies a function to certain columns. Here, the `str_replace` and `str_remove` calls replace blank IDs with “none” and strip extraneous quoting characters
- `group_by(go_m) %>% mutate(orthogroup_list = map_chr(Orthogroup, ~toString(union(Orthogroup, .x))))` collects, for each molecular GO term, a comma-separated list of the orthogroups it appears in
  - `group_by()` allows the data to be transformed by group
  - `mutate()` creates a new column
  - `map_chr()` applies a function to each element of a vector or list and returns a character vector
- `unique()` removes any duplicate rows
- `mutate(length = str_count(orthogroup_list, ",")+1)` counts the orthogroups attached to each GO term
- `arrange(desc(length))` sorts the table in descending order
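The grouping step can be sketched in plain Python: given (orthogroup, GO term) pairs, collect each term’s orthogroups and sort by how many there are. The function name and input shape are illustrative, not taken from the original scripts:

```python
from collections import defaultdict

def orthogroups_per_term(pairs):
    """pairs: (orthogroup ID, GO term) tuples, one per term occurrence.

    Returns (GO term, orthogroup list, count) rows with the most frequent
    terms first -- the same shape as the R pipeline's output.
    """
    groups = defaultdict(list)
    for orthogroup, term in pairs:
        if orthogroup not in groups[term]:  # mirror unique(): drop duplicates
            groups[term].append(orthogroup)
    rows = [(term, members, len(members)) for term, members in groups.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```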
Creating Graphs with ggplot2
Lastly, to neatly visualize the data, we can create bar graphs. ggplot2 provides an excellent way to create beautiful graphs.
GO Term Frequency Bar Graph
mapped_orthogroups_m_frequency_bargraph <- head(mapped_orthogroups_m_terms, 30) %>%
ggplot(aes(x=fct_reorder(go_m, length), y=length, fill=go_m)) +
geom_bar(stat="identity", position = position_dodge(), show.legend=F) +
labs(x = "GO Term", y="Number of Orthogroups") +
scale_fill_viridis(discrete=T) +
theme_minimal() +
coord_flip()
- The `head()` function took only the top 30 molecular GO terms.
- The `ggplot()` function takes the `aes()` function, which describes which data corresponds to which part of the plot:
  - The x-axis (`x`): the GO terms, reordered by frequency with `fct_reorder()`
  - The y-axis (`y`): the number of orthogroups
  - The colour (`fill`): each GO term gets a different colour
- Each function related to `ggplot` is chained with a plus sign `+` to describe the data and the look of the actual plot:
  - `geom_bar()` draws a bar graph, and takes arguments for its look
  - `labs()` sets the labels for the x- and y-axes
  - `scale_fill_viridis(discrete=T)` applies the viridis colour scheme (from the viridis package)
  - `theme_minimal()` applies the minimal ggplot theme
  - `coord_flip()` swaps the x- and y-axes for easier reading
The result was this graph:

GO Terms by Species Bar Graph
To make a GO terms by species bar graph, I first made a frequency table:
species_molecular_frequency <- merge(anopheles_aedes_go_molecular_frequency,
aedes_culex_go_molecular_frequency, by="GO.term", suffixes=c(".Anopheles.Aedes", ".Aedes.Culex"), all.x=T) %>%
replace(is.na(.), 0) %>%
gather(key="Species",value="Frequency", 2:3)
- These were grouped by species pair: Aedes & Culex, and Anopheles & Aedes. Previous data analysis found no (annotated) ortholog groups between Culex and Anopheles. The two frequency tables were merged by GO term with the `merge()` function.
- Then `replace(is.na(.), 0)` filled in missing frequencies with zeroes (a GO term present in one pair but not the other would otherwise be `NA`)
- Finally, `gather()` reshaped the table into long format, with one row per GO term per species pair
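The merge-and-gather step amounts to a left join on GO term followed by reshaping to long format. A pure-Python sketch (the function name and dict inputs are illustrative):

```python
def merge_frequencies(left, right, left_name, right_name):
    """Left-join two {GO term: frequency} dicts and reshape to long format.

    Mirrors merge(..., all.x=T) %>% replace(is.na(.), 0) %>% gather(...):
    terms come from the left table, missing frequencies become 0, and each
    (term, group) pair gets its own row.
    """
    rows = []
    for term in sorted(left):
        rows.append((term, left_name, left[term]))
        rows.append((term, right_name, right.get(term, 0)))  # 0 replaces NA
    return rows
```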
Similar to the graph above, I used geom_bar() to create a bar graph with ggplot:
species_molecular_frequency_bargraph <- species_molecular_frequency %>%
ggplot(aes(x=GO.term, y=Frequency, fill=Species)) +
geom_bar(stat="identity", position = position_dodge()) +
theme_minimal() +
coord_flip() +
labs(x = "GO Term", y = "Frequency") +
scale_fill_manual(labels=c("Aedes & Culex", "Anopheles & Aedes"),
values=c('#742017', '#7F7F7F'))
The result was this graph:

Back to Biology - Interpreting the Data
Now that the data is easier to visualize, it needs to be interpreted to extract biological meaning.
An important thing to note is that a lack of database information does not constitute a lack of relationship. There are many, many protein sequences that have yet to be analyzed and annotated in the UniProt database. While we may know what we know, we don’t know what we don’t know, and gaps in the data should be treated with caution when drawing conclusions.
In this dataset, there were 13,000 orthogroups, and about 80% of them were mapped in UniProt. Of the mapped proteins, 23% were considered “uncharacterized”. There is a lot of room for research into particular proteins.
Most of the shared GO terms related to ATP binding, DNA binding, and RNA binding, functions common to most species. There were also shared iron transport orthologs and olfactory orthologs, which relate to mosquito disease transmission. For my complete analysis, see my report on my portfolio [PDF link].
Although I’ve shown a single data pipeline, this is by no means the only pipeline that can be used. I was interested in shared GO terms; however, one could scale this up to more than three mosquito genomes, and look at other data such as phylogeny construction and gene duplications. There are numerous ways of comparing genomes aside from gene sets - the entire genome structure itself can be compared, too. This was only one method of many in a larger project to compare mosquito genomes. Wrangling and making sense of such large amounts of data is a Herculean task, but we can take advantage of databases and their APIs to make it easier.
Conclusions & Further Steps
The world of comparative genomics can have as wide or as narrow a scope as you wish. My only limitations were the time I could dedicate to a school project and the storage on the Unix server I was given access to, but these data pipelines can get very complicated, very quickly.
The Python and R scripts were written specifically for the data we already had and for the specific use case of OrthoFinder, but the broader concepts of transforming data and mapping between databases using keys apply to many use cases. Bridging analysis scripts and database APIs is a powerful way to analyze data.
A more streamlined approach might be to create another, more general Python script that can be run on the command line after OrthoFinder to map the data given the number of columns and the column names, creating another standard output. Python also has its own share of data libraries and can be used to visualize data. OrthoFinder produces other output files that could be visualized with a general R or Python script.
A data pipeline is a bit like the central dogma of biology. Just as gene information flows from DNA to RNA and lastly to protein, raw DNA data can be analyzed and transformed into graphs and meaningful analysis. And like the central dogma, a data pipeline can take a bunch of code and create something amazing.