I’m starting a series where I’ll post links to and short descriptions of articles I’ve read, my progress in the books I’m reading, and other interesting media that I’d like to share but wouldn’t otherwise dedicate a whole blog post to. This is partly to keep myself accountable for retaining what I’ve read, rather than passively consuming things. I’ll see how the weekly format goes for me and adjust accordingly.
There is a lot of raw genetic data out there. The cost of sequencing genomes has dropped drastically thanks to the advent of next-generation sequencing technologies. The result is a plethora of genomes available from the National Center for Biotechnology Information (NCBI), which boasts 3.12 million total assembled genomes at the time of writing, 2.52 million of them annotated. That isn’t even counting non-assembled genomes, which pose a problem in themselves!
Today’s problem is making sense of all that data. All of life is coded in sequences of four letters (A, C, G, and T), and from just those four letters we can learn almost anything about a species. Over time, the biological problem of working out how a species functions became a computational one.
But where do we even begin? How do we use computational data to get meaningful biological insights?