Data Management of Study Data

Every day, new medical studies are published, covering everything from the merits of potential wonder drugs of the future to the impact of specific changes to health and diet. While the true differentiators are the details of how the studies are actually performed, the conclusions are reached through a numbers game borne out by data painstakingly collected over time, often tracking the medical details of hundreds of people or lab animals. The scientific basis behind these studies is a giant pile of data, laboriously, perhaps lovingly analyzed, until the study’s results fall out into the researchers’ hands, helping to paint a clear picture of the outcome to be demonstrated. But how does one put all of this data onto an Excel spreadsheet on one’s computer for such a study? There’s just too much of it, and too much risk of accidental edits or file corruption. The answer is simple: use a data warehouse instead of a spreadsheet for tracking the records. Let me tell you why this is the preferred solution.

First of all, data warehouses are designed to store data. Let me say that again in a slightly different way to emphasize my point. Data warehouses are designed to make adding data easy, and to make editing data prohibitively difficult once it has been added. This reduces any temptation to alter data points towards the end of the study to ensure the desired results. Unpromising results remain archived in the repository, untouchable to the casual user who might be tempted to fudge the data to paint a prettier picture.
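To make the "add but don't edit" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration: SQLite is standing in for a real warehouse, which would more likely enforce this through user permissions and controlled load jobs. The table name, column names, and the try_edit helper are all made up for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurements (
    subject_id INTEGER NOT NULL,
    taken_on   TEXT    NOT NULL,
    weight_kg  REAL    NOT NULL
);

-- Reject any attempt to change a row after it has been added.
CREATE TRIGGER measurements_no_update
BEFORE UPDATE ON measurements
BEGIN
    SELECT RAISE(ABORT, 'measurements is append-only');
END;

-- Reject any attempt to remove a row.
CREATE TRIGGER measurements_no_delete
BEFORE DELETE ON measurements
BEGIN
    SELECT RAISE(ABORT, 'measurements is append-only');
END;
""")

# Adding data works as usual.
conn.execute("INSERT INTO measurements VALUES (1, '2024-01-05', 82.4)")


def try_edit(sql):
    """Run a statement; return True if accepted, False if the database rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


update_allowed = try_edit("UPDATE measurements SET weight_kg = 75.0")
delete_allowed = try_edit("DELETE FROM measurements")
```

In this sketch both update_allowed and delete_allowed come back False, and the original reading is still sitting in the table. Anyone with admin rights could drop the triggers, of course, so this only shows the principle of keeping casual users away from edits.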

Next, data warehouses are less constrained by size. Have you ever tried to edit a giant spreadsheet and waited forever for it to open, or hit a point of accidental corruption? I have, more than once, and it is not fun at all! Despite my tech background, I get nervous when documents or spreadsheets start getting big like that. I don’t like having to track down “that version where it used to work” as the point I need to go back to, wondering what I have lost in the process. By comparison, the process of adding new data to data warehouses is profoundly easy. The data seamlessly gets extracted from the source, effortlessly (on your part) transformed to the form it needs to be in (this may include unit conversion, and stripping and reporting of problem records), and then everything loads into the final repository: the data warehouse. All without the painstaking and error-prone human process of keying in data by hand, and hoping that you hit Tab or Enter to change fields as you go.
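The extract, transform, and load steps described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: the CSV layout, the column names, the weights table, and the function names are all invented for illustration, and SQLite again plays the part of the warehouse. Real pipelines use dedicated tooling, but the shape is the same.

```python
import csv
import io
import sqlite3

LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms

# Assumption: the source system hands us a CSV export shaped like this.
RAW_CSV = """subject_id,weight,unit
1,180,lb
2,70.5,kg
3,,kg
4,abc,lb
"""


def extract(text):
    """Extract: read the raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert every weight to kg and set aside problem records."""
    clean, rejected = [], []
    for row in rows:
        try:
            weight = float(row["weight"])
        except ValueError:
            rejected.append(row)  # missing or non-numeric weight
            continue
        if row["unit"] == "lb":
            weight *= LB_TO_KG
        elif row["unit"] != "kg":
            rejected.append(row)  # unit we do not know how to convert
            continue
        clean.append((int(row["subject_id"]), round(weight, 2)))
    return clean, rejected


def load(conn, rows):
    """Load: append the cleaned rows to the repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weights (subject_id INTEGER, weight_kg REAL)"
    )
    conn.executemany("INSERT INTO weights VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(RAW_CSV))
load(conn, clean)
```

Here the two good rows end up in the weights table in kilograms, while the two problem rows land in rejected so they can be reported rather than silently dropped.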

Alternatively, the size problem may concern storage. Suppose your study involves comparing the genetic mapping of different life forms. Within each type, you want to determine which genetic markers are constant and which ones vary. Once you know the constant ones, you want to compare this creature to another that has undergone the same study, to determine how much of the genetic code is identical. Imagine how quickly you would blow through the 2GB limit of your personal computer! Or instead of a medical study, imagine you are a researcher at the Large Hadron Collider, where physicists sift through petabytes of collision-event data at the CERN Data Centre.

Lastly, data warehouses are designed for business intelligence. If you know in advance how the data will need to be analyzed (such as tracking weight over time, blood pressure over time, death rates, reports of complications, etc.), the data warehouse can be organized, through its transformation step and business intelligence tooling, to concentrate speed on the specific kind of analysis you will need for generating your reports. And if you need to add a new factor to your calculations, you can make that change and then rapidly generate a newer report (“Add .4 to everyone who has an iron deficiency!” “Filter out this set of people from the data pool who now are known to have a disease that skews the results!” “Take the dosage change into account when calculating!”).
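As a rough sketch of what "change a factor, rerun the report" might look like, here is a small Python example, again with SQLite standing in for the warehouse. Everything specific is an assumption made for illustration: the subjects and readings tables, the score column, the sample values, and the weekly_report function. The point is only that the stored data never changes; the adjustment and the filter are applied at report time.

```python
import sqlite3

# Assumption: a tiny made-up schema and data set, just to have something to report on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (
    subject_id     INTEGER PRIMARY KEY,
    iron_deficient INTEGER NOT NULL,  -- 1 = yes, 0 = no
    excluded       INTEGER NOT NULL   -- 1 = known to skew the results
);
CREATE TABLE readings (
    subject_id INTEGER NOT NULL,
    week       INTEGER NOT NULL,
    score      REAL    NOT NULL
);
INSERT INTO subjects VALUES (1, 0, 0), (2, 1, 0), (3, 0, 1);
INSERT INTO readings VALUES
    (1, 1, 5.0), (2, 1, 6.0), (3, 1, 9.0),
    (1, 2, 5.5), (2, 2, 6.5), (3, 2, 9.5);
""")


def weekly_report(conn, iron_adjustment=0.0, drop_excluded=False):
    """Average score per week, with optional report-time adjustments.

    iron_adjustment is added to every reading from an iron-deficient subject.
    drop_excluded leaves flagged subjects out of the data pool.
    """
    sql = """
        SELECT r.week,
               AVG(r.score + CASE WHEN s.iron_deficient = 1 THEN ? ELSE 0 END)
        FROM readings AS r
        JOIN subjects AS s ON s.subject_id = r.subject_id
        WHERE (? = 0 OR s.excluded = 0)
        GROUP BY r.week
        ORDER BY r.week
    """
    return conn.execute(sql, (iron_adjustment, int(drop_excluded))).fetchall()


baseline = weekly_report(conn)
adjusted = weekly_report(conn, iron_adjustment=0.4, drop_excluded=True)
```

Calling weekly_report with different arguments produces a new report each time without touching the underlying readings, which is the behaviour the quoted requests above are asking for.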

So in case you ever wondered how researchers can possibly manage all of that data they must have, now you know.