Forensic Bioinformatics Hacks

by Kevin R. Coombes (kevin.r.coombes@gmail.com)

As a result of the Human Genome Project, scientists have now assembled a complete "parts list" of the genes encoded in the human DNA sequence.  But DNA is only part of the story.  Every one of your cells contains exactly the same DNA.  What makes your skin cells different from your brain cells (at least for people who read this magazine) is which genes they "express" by transcribing the DNA into RNA molecules.  Cancer cells differ from their normal counterparts because their DNA is mutated.  In the same way that skin cells and brain cells express different genes, the DNA differences between normal and cancer cells are reflected in changes in the expression levels of RNA molecules.

In the mid-1990s, scientists invented a tool known as "gene expression microarrays" that allowed them to simultaneously measure the expression levels of thousands of different RNA molecules from the same sample of cells.  With this development, biology started to become a computational science.  The data collected from a typical microarray experiment can be viewed as a single (spreadsheet) table containing the expression values.  The columns (numbering in the tens up to maybe a few hundred) represent the patient samples used for the experiment.  The rows (numbering in the tens of thousands) represent the probes that were placed on the microarray.  Each probe is carefully designed, using the sequencing data from the Human Genome Project, to target a specific gene of interest.  Managing and analyzing these kinds of datasets is the purview of a new discipline known as "bioinformatics."
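If you want to see what that table looks like in practice, here is a minimal Python sketch; the file name is a placeholder of my own, and the only part that matters is the layout (probes as rows, samples as columns):

  import pandas as pd

  # One row per probe, one column per patient sample.
  expr = pd.read_csv("expression_matrix.txt", sep="\t", index_col=0)

  print(expr.shape)        # e.g., (22283, 120): tens of thousands of probes, ~120 samples
  print(expr.index[:3])    # probe IDs (strings such as '5316_at')
  print(expr.columns[:3])  # patient sample identifiers
  print(expr.iloc[0].describe())  # one probe's expression values across all samples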

Not surprisingly, computers are needed to analyze microarray datasets.  So, bioinformaticians spend a lot of their time writing computer programs or scripts to perform these analyses.  What is surprising is how rarely those scripts are shared with others.  Now, there are collections of open-source tools that can be reused as part of an analysis; BioPerl, BioPython, CRAN, and Bioconductor are some of the largest and best known.  But the specific scripts that tie these or other tools together to analyze a particular dataset almost never see the light of day.

The scientific journals that publish the biological and clinical findings that arise from analyzing microarray datasets generally require the authors to make the datasets publicly available.  The largest collection of microarray datasets, the Gene Expression Omnibus (GEO), is run by the National Center for Biotechnology Information (NCBI), which is one component of the U.S. National Institutes of Health (NIH).  A smaller repository, ArrayExpress, is run by the European Bioinformatics Institute (EBI).

However, those same journals do not require the authors to provide the computer scripts that they used to perform the analysis.  If you are a bioinformatician or statistician who would like to reproduce the results from a publication, you find yourself in an interesting situation.  You can usually track down the data, but you have no access to the computer scripts.  Moreover, the actual algorithm is rarely described in any formal or technical way; at best, you get a few sentences (devoid of formulas) in the methods section of the journal article.  You find yourself forced to reverse-engineer the missing computer code from the data, the hints in the paper, and the claimed results.  The subdiscipline devoted to this task has come to be called "forensic bioinformatics."

The skills required to be a good forensic bioinformatician are the same skills that make a good hacker.  You have to be curious about how things work; you have to be willing to take things apart to see what makes them tick.  And, if you really want to know how the data was analyzed, you have to be willing to persevere for a long time before you actually get to the core issues.

The rest of this article is a brief tale of one of my own adventures in forensic bioinformatics.  It all started in November 2006, when researchers at Duke University published an article that claimed that they had a method to (accurately) predict which cancer patients would respond to which drug treatments.  If they were correct, their results would have revolutionized the treatment of cancer.  As usual, all the data for their analysis was available online, but their complete computer code was not.  Keith Baggerly, my colleague at the M.D. Anderson Cancer Center, and I collected the data and tried to reproduce their results, without success.

We looked carefully at the microarray data (from cell lines) that they had used to develop "gene expression signatures" to predict sensitivity or resistance to a particular drug.  Each signature was a list of a few genes (about 50 to 100) that should be expressed at high levels in sensitive cell lines and low levels in resistant cell lines (or vice versa).  Surprisingly, when we plotted a "heatmap" of the signature genes, they showed no difference between the sensitive and resistant cell lines.  So, we did our own analysis to select genes that we thought were different.  In these datasets, each gene is identified by its "probe ID," which typically consists of a numeric prefix and an alphabetic suffix; for example, 5316_at.  When we compared their list of 50 genes to our list of 50 genes, we realized that the numeric part often appeared to be off by one.  For example, where our list contained 5316_at, their list contained 5315_s_at.
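You can automate that kind of comparison.  Here is a minimal Python sketch of the check, assuming you have the probe IDs in the order they appear in the data file; the function and variable names are my own illustration, not the code we actually used:

  def count_row_shift(their_probes, our_probes, probe_order, shift=1):
      # Count reported probes that sit exactly 'shift' rows away, in the
      # data file's probe ordering, from a probe in our own gene list.
      row = {p: i for i, p in enumerate(probe_order)}
      ours = {row[p] for p in our_probes if p in row}
      return sum(1 for p in their_probes if p in row and row[p] + shift in ours)

  # Toy example matching the case above: their 5315_s_at sits one row above our 5316_at.
  order = ["5314_at", "5315_s_at", "5316_at", "5317_at"]
  print(count_row_shift(["5315_s_at"], ["5316_at"], order))  # prints 1

If most of a 50-gene signature lines up this way, a simple off-by-one mistake becomes the obvious suspect.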

In the best hacker spirit, we weren't content to stop at the conjecture that they had somehow made an off-by-one error.  We wanted to understand how they could possibly have done that.  It turned out that the software tool they were (mis)using was written in MATLAB by a different researcher at Duke, and we could get a copy of the tool.  An important fact about MATLAB is that (probably because it arose out of FORTRAN and was developed for engineers) it is hard to mix character strings and numbers in the same data structure.  So, their MATLAB function required two inputs:

1.)  A numeric matrix containing the gene expression values, along with a header line containing 0 for sensitive cell lines, 1 for resistant cell lines, and 2 for patient samples whose responses were to be predicted.

2.)  A vector of character strings containing the probe IDs, which should not have a header line.

Now, you can easily imagine someone adding the numeric classification header to a spreadsheet, later splitting the numeric values off from the first column of probe IDs, and forgetting to remove the header cell from the probe ID column.  Result: an off-by-one error.
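Here is a tiny Python simulation of that mix-up; the numbers and probe IDs are made up, and it only illustrates the indexing, not the original MATLAB code:

  # The spreadsheet as it might have looked: a classification row, then probe rows.
  spreadsheet = [
      ["ProbeID",    0,   0,   1,   1  ],  # 0 = sensitive, 1 = resistant
      ["5314_at",    7.1, 6.9, 3.2, 3.0],
      ["5315_s_at",  5.0, 5.1, 5.2, 4.9],
      ["5316_at",    2.1, 2.0, 8.3, 8.5],
  ]

  # Input 1: the numeric matrix, classification header included, as required.
  numeric = [row[1:] for row in spreadsheet]

  # Input 2 should be the probe IDs *without* the header cell...
  ids_correct = [row[0] for row in spreadsheet[1:]]
  # ...but forgetting to strip the header leaves a vector shifted by one.
  ids_shifted = [row[0] for row in spreadsheet]

  best_gene = 3  # suppose the tool reports gene number 3 as most discriminating
  print(ids_correct[best_gene - 1])  # 5316_at   -- the gene it actually found
  print(ids_shifted[best_gene - 1])  # 5315_s_at -- the gene that gets reported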

Even after correcting for the off-by-one error, however, there were still genes in their reported signatures that we couldn't explain.  By using the same MATLAB tool that they used, we could prove that the mysterious genes did not come out of the software.  This finding suggested that there might be something more than simple "operator error" at work.

Many of the tools of forensic bioinformatics are fairly simple; they largely consist of finding different ways to look at the data.  For example, one of the datasets that they used to try to validate their predictions was supposed to contain microarray data from 122 different patient samples.  We computed a simple correlation matrix that looked at how similar the data was from one microarray to another.  We plotted an image of the correlation matrix, highlighting values that were larger than 0.9999; correlations that large can only happen if the data is identical.  We could see that there were actually only about 90 distinct samples.  Moreover, the samples that were included more than once showed that there were inconsistencies in the labels that said which patients were sensitive and which were resistant.  For example, one sample was included four times; three times it was called sensitive and one time it was called resistant to the same drug.  In another dataset, we could show that all 59 samples were wrong in some way.
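That check is easy to reproduce.  Here is a minimal Python sketch, with a placeholder file name; it flags every pair of samples whose expression values are essentially identical:

  import numpy as np
  import pandas as pd

  # Placeholder file: probes as rows, the patient samples as columns.
  expr = pd.read_csv("validation_set.txt", sep="\t", index_col=0)

  # Correlate every sample (column) against every other sample.
  corr = np.corrcoef(expr.values.T)

  # Correlations above 0.9999 mean the two columns carry effectively the same data.
  dup = corr > 0.9999
  np.fill_diagonal(dup, False)

  n = expr.shape[1]
  pairs = [(expr.columns[i], expr.columns[j])
           for i in range(n) for j in range(i + 1, n) if dup[i, j]]
  print(len(pairs), "pairs of effectively identical samples")
  for a, b in pairs:
      print(a, "duplicates", b)

Once the duplicated samples are listed, comparing the sensitive/resistant labels attached to each copy is just a matter of looking them up.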

To make a long story short, it appears that the data was being manipulated to make the results look significantly better than they actually were.  As a result of the forensic bioinformatics hacking that we did to understand what was going on, ten error-filled scientific publications have been retracted.  Four clinical trials where patients were being treated based on those invalid scientific claims were halted.  (And Keith and I got to appear on 60 Minutes.)

If you'd like to get more details on the story, here are some URLs to get started:

And here are some URLs that point you to sources of data and software tools that might allow you to start doing some bioinformatics hacking of your own:

Gene Expression Omnibus (GEO):  https://www.ncbi.nlm.nih.gov/geo/
ArrayExpress:  https://www.ebi.ac.uk/arrayexpress/
Bioconductor:  https://www.bioconductor.org/
CRAN:  https://cran.r-project.org/
BioPerl:  https://bioperl.org/
Biopython:  https://biopython.org/
