Sunday, June 21, 2009

Petabytes, Jedi-liars and blobs of grease

An estimated 15 petabytes of data per year will be generated by the Large Hadron Collider at CERN when fully operational. A grid of some 100,000 processors (the Worldwide LHC Computing Grid) will analyse this vast amount of data. The aim is to discover interesting new physics and possibly observe the Higgs boson, the so-called God Particle. Fears of the collider creating Earth-swallowing black holes have been raised, although a quick back-of-the-envelope calculation says the LHC is no more than 18 times more lethal than death.

The SETI project generates 36 gigabytes of data for every 15.5 hours of observation whilst scanning the universe for extraterrestrial intelligence.

Making it possible

So what makes this kind of data analysis endeavour possible?
  • A multi-million dollar budget? That's nice, but in no way does spending massive amounts of money guarantee success.
  • A massive array of computers? It comes in pretty handy being able to outsource your analysis to 100,000 computers, but this isn't the key.
  • Brilliant minds? Obviously, being able to consult with world leaders in the field is nice, but this isn't the key either.

The fundamental thing these projects have in common, and the lesson we should take away for our own projects, is that they know what they are looking for!

This may seem counter-intuitive at first, but it's something our brains do constantly, subconsciously. Our brains spot patterns in pictures that would be hopelessly scrambled to a computer. This is possible because the brain is testing against a set of hard-coded hypotheses that allow us to live our lives.

The LHC will be testing a number of hypotheses that will allow physicists to focus on the really interesting physics. Each collision releases a vast amount of energy, creating myriad particles along the way. Most of these particles come from processes that are already well understood, so the data for those events can be disregarded immediately, leaving only the rare candidates for new physics.

The SETI project isn't scanning the heavens to see if we suddenly receive alien MTV. The data is tested for specific characteristics that allow a genuine signal to be distinguished from the immense background noise of space. The criteria are based on well-understood physics, which allows certain assumptions to be made; without those assumptions it would be impossible to analyse this kind of data.
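To make the idea concrete, here is a toy sketch of that kind of hypothesis-driven filtering (my own illustration, not SETI's or CERN's actual pipeline). The hypothesis encoded here is that an interesting signal is narrowband, so we keep only the frequency bins that stand well above the noise floor and throw everything else away:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 4096                                        # number of time samples
    noise = rng.normal(0.0, 1.0, n)                 # broadband background noise
    t = np.arange(n)
    tone = 0.2 * np.sin(2 * np.pi * 300 * t / n)    # weak narrowband "signal" at bin 300
    data = noise + tone

    spectrum = np.abs(np.fft.rfft(data))
    # Hypothesis: a real signal sits far above the noise floor.
    # The 5-sigma threshold is an assumption for illustration only.
    threshold = spectrum.mean() + 5.0 * spectrum.std()
    candidates = np.flatnonzero(spectrum > threshold)
    print("Candidate frequency bins:", candidates)  # everything else is discarded

The point isn't the particular numbers; it's that the filter encodes what we expect a signal to look like before we ever inspect the data.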

So how does this apply to me?

OK, so you probably aren't going to analyse quite as much data as either of the projects above, but certain parallels can be drawn. In both cases, the analysis is done to find a concrete result. Now you may think,

"I'm doing research because I want to find an answer to something I don't know, not proof of something that I already suspect!"

This is fine. But don't kid yourself; there are certain things that you already know and expect. There are certain answers that would (or at least should) surprise you. In 2001, the UK census results revealed that nearly 1% of the adult population of England and Wales were in fact Jedi Knights. This data tells us something about that 1%, and it doesn't have anything to do with their proficiency with a lightsaber. Establishing some control tests allows a first-pass evaluation of how good the collected data is. (Although lying is against the Jedi code, perhaps these people are slightly confused.)

Before you immerse yourself in analysing a big dataset, make sure you have a certain set of minimum criteria. ALWAYS do a sanity test. Use a simple tool to do some high-level analysis (shameless plug). What is the average income for my demographic? What do you mean it's minus $200/month? That could indicate a problem with the data that you need to address before you start. What is the maximum age of my demographic? What do you mean it's 410 years? An innocent typing error could skew your results. It's much better to find these errors before you spend hours analysing the results; something like the sketch below is enough for a first pass.
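As a minimal example, assuming a hypothetical survey.csv with income and age columns (the file name, column names and thresholds are mine, purely for illustration):

    import pandas as pd

    df = pd.read_csv("survey.csv")

    # Quick look: min, max, mean and quartiles for the columns we care about.
    print(df[["income", "age"]].describe())

    # Flag rows that fail our minimum criteria before any real analysis.
    negative_income = df[df["income"] < 0]    # minus $200/month is a data problem
    impossible_age = df[df["age"] > 120]      # a 410-year-old is a typo, not a datum
    print(f"{len(negative_income)} negative incomes, {len(impossible_age)} impossible ages")

    # Decide how to handle the bad rows *now*, not after hours of analysis.
    clean = df[(df["income"] >= 0) & (df["age"] <= 120)]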

I know this from experience. I learnt some painful lessons about data analysis during my PhD. Experimental runs could go on for days, and I learnt the importance of sanity-checking your data whilst the experiment is still running: lots of things can go wrong. When you've spent a week, including weekends and late nights, collecting data, then four long days cranking through the datasets, only to realise that your sample fell off before you even got it into the cryostat A WEEK EARLIER, you feel like a fool. By the time I reached that stage a few years later, the hard lessons had been learnt.

Conclusions
  • Have an idea of what you want to learn before you start.
  • Have a sanity test mechanism to know if you've found something useful.
  • Do waste your time measuring the magnetic penetration depth of a blob of grease; it'll eventually make you a better scientist.

Thursday, June 11, 2009

Welcome to the DatalightProject blog!

My name is Fran Manzano and I'm one of the data geeks who run www.datalightproject.com. As with any company blog, we'd like to spread the word about our products and services, but we'll also talk about how we get things done and what motivates us. You can also follow us on Twitter and Facebook.

Let's start with a mini FAQ, which will likely become part of the real FAQ once we get around to it.

What is Datalight?

Datalight is a web-based application for analysing statistical data. Our aim is to provide statistical analysis that can be used by anyone who has data they want to know more about. We chose the Silverlight platform to provide flexibility for our users: we do not insist on a specific operating system, or even tie your Datalight subscription to a specific computer. We firmly believe that software should adapt to the needs of the user and not the other way around.

What isn't Datalight?

Datalight is not meant to be a replacement for the existing data analysis applications on the market. At the same time, we feel that people are often put off doing useful analysis of their data by the high barrier to entry, in both cost and difficulty of use. Our pricing model is such that we expect customers to pay for Datalight only when they need it.

It’s online, am I sending my data over the internet?

Not at all! All of the computation is performed locally on your machine (which is why it is fast).

OK, how much?

Datalight currently costs $15 per month. There is no sign-up fee or tie-in. We don't try to get you to pay for longer than you need, and you won't be penalised if you want to extend your subscription at a later date. Our 7-day cooling-off period lets you try out Datalight with your own data and then decide whether it's useful to you. If you choose to cancel, then that is fine; it's your data, you know best.

Mini Manifesto

We strongly believe in not being evil, and we'd like to think we'd pass the Starbucks test. We want Datalight to be something remarkable, and that doesn't happen passively or by accident.

Have it for free!

If you work for a not-for-profit organisation or charity and you think you could use Datalight, then let us know and you can have it for free. Educational establishments also qualify for a discount, so please get in touch!