The SETI project generates 36 Gigabytes of data per 15.5 hours of observation whilst scanning the Universe for Extraterrestrial Intelligence.
Making it possible
So what makes this kind of data analysis endeavour possible?
- A multi-million dollar budget? That's nice but in no way does spending massive amounts of money guarantee success.
- A massive array of computers? It comes in pretty handy being able to outsource your analysis to 100,000 computers, but this isn't the key.
- Brilliant minds? Obviously, being able to consult with world leaders in the field is nice, but this isn't the key either.
The fundamental thing that these projects have in common, and something that we should take away for our own projects, is that they know what they are looking for!
This may seem counterintuitive at first, but it's something that our brains do constantly, subconsciously. Our brains see patterns in pictures that would be too hopelessly scrambled for a computer to make out. This is possible because the brain is testing against a set of hard-coded hypotheses that allow us to live our lives.
The LHC will be testing a number of hypotheses that will allow the physicists to focus in on the really interesting physics. In these collisions a vast amount of energy will be released, creating myriad particles along the way. Most of these come from well-understood processes, and the data for them can immediately be disregarded.
The SETI project isn't scanning the heavens to see if we suddenly receive alien MTV. The data is tested for certain characteristics which allow signals to be distinguished from the immense noise spectrum from space. The criteria are based on well-understood physics, which allows certain assumptions to be made. Without assumptions it would be impossible to analyse this type of data.
So how does this apply to me?
OK, so you probably aren't going to analyse quite as much data as either of the above projects, but certain parallels can be drawn. In the two examples above, the analysis is done to find a concrete result. Now you may think,
"I'm doing research because I want to find an answer to something I don't know, not proof of something that I already suspect!"
This is fine. But don't kid yourself; there are certain things that you know and already expect. There are certain answers that would (or at least should) surprise you. In 2001, the UK census results revealed that nearly 1% of the adult population of England and Wales were in fact Jedi Knights. This data tells us something about this 1%, and it doesn't have anything to do with their proficiency with a lightsaber. Establishing some control tests allows a first-pass evaluation of how good the collected data is. (Although lying is against the Jedi code, perhaps these people are slightly confused.)
Before you immerse yourself in analysing a big piece of data, make sure that you always have a set of minimum criteria. ALWAYS do a sanity test. Use a simple tool to do some high-level analysis (shameless plug). What is the average income for my demographic? What do you mean it's minus $200/month? This could indicate a problem with the data that you need to address before you start. What is the maximum age of my demographic? What do you mean it's 410 years? An innocent typing error could skew your results. It's much better to find these errors before you spend hours analysing the results. I know from experience:
I learnt some painful lessons about data analysis during my PhD. Experimental runs could go on for days, and I learnt the importance of doing sanity checks on your data whilst the experiment is running. Lots of things could go wrong. When you've spent a week, including weekends and late nights, collecting data, then 4 long days cranking through the datasets, only to realise that your sample fell off before you even got it into the cryostat A WEEK EARLIER, you feel like a fool. By the time I got to this stage a few years later, the hard lessons had been learnt.
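As a rough illustration of the sanity test described above, here is a minimal Python sketch. The field names (age, monthly_income), the plausibility limits and the tiny survey dataset are all made up for this example; the point is only that a few lines of range checking and a min/max/mean summary will flag a 410-year-old or a negative income before any real analysis starts.

```python
from statistics import mean

# Hypothetical plausibility limits; adjust these to suit your own data.
LIMITS = {
    "age": (0, 120),                   # years
    "monthly_income": (0, 1_000_000),  # dollars per month
}


def sanity_check(records, limits=LIMITS):
    """Return (row_index, field, value) for every missing or out-of-range value."""
    problems = []
    for i, row in enumerate(records):
        for field, (low, high) in limits.items():
            value = row.get(field)
            if value is None or not (low <= value <= high):
                problems.append((i, field, value))
    return problems


def summarise(records, field):
    """Return the min, max and mean of one field, ignoring missing values."""
    values = [row[field] for row in records if row.get(field) is not None]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


# Made-up survey rows containing the two kinds of error mentioned in the text.
survey = [
    {"age": 34, "monthly_income": 2500},
    {"age": 410, "monthly_income": 3100},  # typing error in the age
    {"age": 29, "monthly_income": -200},   # negative income
]

for row_index, field, value in sanity_check(survey):
    print(f"Row {row_index}: suspicious {field} = {value}")

print(summarise(survey, "age"))
```

Running a check like this while the data is still being collected, rather than after it, is what would have caught the problem a week earlier.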
Conclusions
- Have an idea of what you want to learn before you start.
- Have a sanity test mechanism to know if you've found something useful.
- Do waste your time measuring the magnetic penetration depth of a blob of grease; it'll eventually make you a better scientist.