
Wednesday, July 22, 2009

Pandemic Perspective

The Swine Flu outbreak/pandemic/scare (pick depending on hysteria level) is all the rage at the moment. The current state of infections and deaths due to this strain of flu is available here. I thought that it would be enlightening to have a quick look at the data and see what is going on.

The Data
Let's have a look at the data for 6 countries:- Australia, Canada, Mexico, Spain, UK and US. The data shows the situation as of the 17th of July, 2009.


[Diagram: confirmed cases by country]

We can clearly see that the US has a very large number of confirmed cases. This is perhaps not the best metric, since the US has a large, well-connected (road, air-travel and rail) population. More rural countries with less inter-city connectivity (e.g. Spain) would probably see smaller case numbers.

A normalised infection rate will give us a better handle on the real extent of the spread of the virus.
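For the terminally curious, "normalising" here just means scaling to cases per million of population. A quick sketch in Python; the case counts below are placeholders, not the real 17th of July figures, and the populations are the ones quoted further down:

    # Normalise raw case counts to cases per million population.
    # NOTE: the case counts are made-up placeholders, NOT the real
    # 17th of July 2009 figures; populations are those quoted below.

    populations = {"Australia": 21_855_000, "UK": 61_612_300}
    confirmed_cases = {"Australia": 3_000, "UK": 10_000}  # hypothetical

    def cases_per_million(cases, population):
        return cases / population * 1_000_000

    for country, pop in populations.items():
        rate = cases_per_million(confirmed_cases[country], pop)
        print(f"{country}: {rate:.1f} confirmed cases per million")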

[Diagram: confirmed cases per million population]

Of course, the infection rate will change over time. Smaller populations are likely to achieve their maximum infection rate sooner than larger ones. Perhaps this is a factor in explaining the differences between, say, Australia (21,855,000) and the UK (61,612,300).

As unpleasant as it is to suffer from flu symptoms, what we're really afraid of is dying. So let's have a look at the deaths per million population in these countries.

[Diagram: deaths per million population]

Unfortunately, what we see above still has some time-dependent factor to it. As the number of cases increases, so too will the number of deaths. The above method doesn't help us compare the situation across countries.

What we should be measuring is the number of infections that lead to death. Although this will still be somewhat dependent on the sample size, it will at least give us a first-order summary of the infection-to-death picture, as well as some comparison between the countries. We are assuming that each country has the same strain of the virus, so that an equivalent person in any of the countries will have the same mortality rate due to the virus. Extrinsic factors, such as access to health care and the general health of the population, will also affect the results, but this is as good a high-level overview as any.
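In code, this metric is nothing more than deaths divided by confirmed cases. A minimal sketch, again with placeholder numbers rather than the real figures:

    # Infection-to-death ratio: the fraction of confirmed cases that
    # end in death. The example numbers are placeholders, not real data.

    def infection_to_death_ratio(deaths, confirmed_cases):
        if confirmed_cases == 0:
            return float("nan")  # no cases yet: the ratio is undefined
        return deaths / confirmed_cases

    # A hypothetical country with 5,000 confirmed cases and 25 deaths:
    print(f"{infection_to_death_ratio(25, 5_000):.2%}")  # prints 0.50%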

[Diagram: deaths per confirmed case]

So should I panic!?

As ever, this isn't going to be much of a conclusion. Panic is a very personal reaction. In the US, seasonal flu accounts for over 100 deaths per million of population. As you can see above (3rd diagram down), swine flu is currently at 0.84 deaths per million of population. I couldn't find any data on the infection-to-death rate for seasonal flu, because every web search serves up swine flu data. I'll just add it to the other causes of death that I'm not worried about.

Some more perspective:- Here in Spain, suicide accounts for 160 deaths per million population... and I'm not planning on catching that either!

Shameless plug:- All diagrams were created using the DatalightProject's online statistics tool.

Tuesday, July 14, 2009

Sugar Daddy

My wife is in the process of building our first manzanoling. I was tempted to start with "My wife and I are in the process..." but my input to the creation process was over quite a while ago...

Before my wife was diagnosed with Gestational Diabetes I didn't even know that it existed (my bad!). Depending on the severity it can be more or less of a problem. The hospital provided us with a small machine to measure my wife's blood sugar before and after meals. My wife*, being the iron-willed obsessive that she is, took it upon herself to record the values and try different things to see if she could keep the sugar under control.

98 meals in and I decided to load up the data in the DataLightProject app. I had a pretty good qualitative handle on how things were going but I hadn't really been following the numbers that closely.

[Diagram: blood sugar before meals]

The diagram represents data before all meals (breakfast, lunch, dinner). We already knew that before eating her blood sugar values were generally within the normal range. The thing to keep an eye on is after eating.

[Diagram: blood sugar after meals]

The average value doesn't look too bad, but the Max value of 202 mg/100ml points to the fact that all is not as it should be.

A little filtering by food type gives us some idea of which foods are to blame. Some things were no surprise:- Orange juice (average 202 mg/100ml) and All-Bran (average 163 mg/100ml) were leaders in causing higher blood sugar. But toast (average 152 mg/100ml) for breakfast was a surprisingly bad thing too!
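If you'd rather see the number crunching spelt out than click through it, the filtering boils down to grouping readings by food and averaging. A rough Python sketch; the readings and field names here are invented for illustration:

    # Group post-meal blood sugar readings (mg/100ml) by food type and
    # report the average for each. All readings here are invented.
    from statistics import mean

    readings = [
        {"food": "orange juice", "sugar": 205, "walked": False},
        {"food": "orange juice", "sugar": 199, "walked": False},
        {"food": "toast", "sugar": 150, "walked": False},
        {"food": "toast", "sugar": 138, "walked": True},
    ]

    by_food = {}
    for r in readings:
        by_food.setdefault(r["food"], []).append(r["sugar"])

    for food, values in sorted(by_food.items()):
        print(f"{food}: average {mean(values):.0f} mg/100ml")

    # The exercise comparison below works the same way: filter on the
    # "walked" flag instead of the food type and compare the averages.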

The most surprising effect on post-meal blood sugar came when considering exercise. Now, given that my wife is pregnant, the kind of exercise she can do is pretty light. Even so, it has a dramatic effect:

[Diagram: post-meal blood sugar with and without exercise]

A 20-minute walk after breakfast makes a real difference. The blood sugar measurements after breakfast are consistently higher, so being able to get that under control was a real bonus. I'm not sure why exercise has a greater effect following dinner, but it's definitely real.

With any luck the diabetes will stop once the baby is born. This run-in with diabetes has really made us think about what it must be like to live with this condition throughout your whole life. Of course there are treatments, but diabetes has a lot of day-to-day overhead. Most of us don't have to think about what food types we can eat or how long we can or can't wait between meals. Diabetes isn't as solved a problem as most of us like to think.

Shameless plug:-

Simplicity is what the DataLight Project is all about. We want to empower people with the ability to look at their real data and gain some real value. It took me a total of 12 mouse clicks in DataLight to prepare all of the diagrams for this post (yes, I counted). This gives the user the ability to think about the underlying data and its meaning, rather than the number crunching. So why not give it a go!

*I'm referring to my wife as my "Dream Girl" - not as someone I met from a dying world; we don't talk about that.

Sunday, June 21, 2009

Petabytes, Jedi-liars and blobs of grease

An estimated 15 Petabytes per year will be generated by the Large Hadron Collider at CERN when fully operational. The LHC will have a grid of 100,000 computers (LHC@home) analysing this vast amount of data. The aim is to discover interesting new physics and possibly observe the Higgs Boson (God Particle). Fears of the collider creating Earth-swallowing black holes have been raised, although a quick back-of-the-envelope calculation says the LHC is no more than 18 times more lethal than death.

The SETI project generates 36 Gigabytes of data per 15.5 hours of observation whilst scanning the Universe for Extraterrestrial Intelligence.

Making it possible

So what makes this kind of data analysis endeavour possible?
  • A multi-million dollar budget? That's nice, but in no way does spending massive amounts of money guarantee success.
  • A massive array of computers? It comes in pretty handy being able to outsource your analysis to 100,000 computers, but this isn't the key.
  • Brilliant minds? Obviously, being able to consult with world leaders in the field is nice, but this isn't the key either.

The fundamental thing that these projects have in common, and the thing we should take away for our own projects, is that they know what they are looking for!

This may seem counterintuitive at first, but it's something our brains do constantly, subconsciously. Our brains see patterns in pictures that would be hopelessly scrambled to a computer. This is possible because the brain is testing against a set of hard-coded hypotheses that allow us to live our lives.

The LHC will be testing a number of hypotheses that will allow physicists to focus on the really interesting physics. In these collisions a vast amount of energy will be released, creating myriad particles along the way. Data for these well-understood processes can immediately be disregarded.

The SETI project isn't scanning the heavens to see if we suddenly receive alien MTV. The data is tested for certain characteristics which allow us to distinguish signals from the immense noise spectrum of space. The criteria are determined based on well-understood physics, which allows certain assumptions to be made. Without assumptions it would be impossible to analyse this type of data.
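The real SETI pipeline is vastly more sophisticated, but the core idea — a narrow-band signal standing out above broadband noise — fits in a few lines. This toy sketch is purely illustrative (it is not SETI's actual algorithm, and it needs numpy):

    # Toy narrow-band signal detection: bury a weak tone in noise, then
    # look for spectral bins that stand well above the noise floor.
    # Illustrative only; not SETI's actual processing. Requires numpy.
    import numpy as np

    rng = np.random.default_rng(42)
    n, fs = 4096, 1024.0              # samples and sample rate (Hz)
    t = np.arange(n) / fs

    noise = rng.normal(0.0, 1.0, n)                  # broadband noise
    tone = 0.2 * np.sin(2 * np.pi * 123.0 * t)       # weak "signal"

    power = np.abs(np.fft.rfft(noise + tone)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Our "assumption": a real signal is narrow-band and far louder
    # than the typical (median) noise power in a single bin.
    candidates = freqs[power > 20 * np.median(power)]
    print("candidate frequencies (Hz):", candidates)  # expect ~123.0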

So how does this apply to me?

OK, so you probably aren't going to analyse quite as much data as either of the above projects, but certain parallels can be drawn. In the two examples above, the analysis is done to find a concrete result. Now you may think,

"I'm doing research because I want to find an answer to something I don't know, not proof of something that I already suspect!"

This is fine. But don't kid yourself; there are certain things that you know and already expect. There are certain answers that would (or at least should) surprise you. In 2001, the UK census results revealed that nearly 1% of the adult population of England and Wales were in fact Jedi Knights. This data tells us something about this 1%, and it doesn't have anything to do with their proficiency with a Light Saber. Establishing some control tests allows a first-pass evaluation of how good the collected data is. (Although lying is against the Jedi code, perhaps these people are slightly confused.)

Before you immerse yourself in analysing a big piece of data, make sure that you have a certain set of minimum criteria. ALWAYS do a sanity test. Use a simple tool to do some high-level analysis (shameless plug). What is the average income for my demographic? What do you mean it's minus $200/month? This could indicate a problem with the data that you need to address before you start. What is the maximum age of my demographic? What do you mean it's 410 years? An innocent typing error could skew your results. It's much better to find these errors before you spend hours analysing the results (there's a little sketch of this kind of check below). I know from experience:-

I learnt some painful lessons about data analysis during my PhD. Experimental runs could go on for days, and I learnt the importance of doing sanity checks on the data while the experiment is running. Lots of things can go wrong. When you've spent a week, including weekends and late nights, collecting data, then 4 long days cranking through the datasets, only to realise that your sample fell off before you even got it into the cryostat A WEEK EARLIER, you feel like a fool. By the time I got to this stage a few years later, the hard lessons had been learnt.
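A sanity test doesn't need to be fancy. A few range checks like the ones below (field names and plausibility limits invented for illustration) would have caught the minus-$200 income and the 410-year-old:

    # Dead-simple sanity checks to run over survey rows before any real
    # analysis. Field names and plausibility limits are invented.

    def sanity_check(rows):
        problems = []
        for i, row in enumerate(rows):
            if row["income"] < 0:
                problems.append(f"row {i}: negative income {row['income']}")
            if not 0 <= row["age"] <= 120:
                problems.append(f"row {i}: implausible age {row['age']}")
        return problems

    rows = [
        {"income": 2500, "age": 34},
        {"income": -200, "age": 29},   # the minus $200/month surprise
        {"income": 1800, "age": 410},  # the innocent typing error
    ]

    for problem in sanity_check(rows):
        print(problem)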

Conclusions
  • Have an idea of what you want to learn before you start.
  • Have a sanity test mechanism to know if you've found something useful.
  • Do waste your time measuring the magnetic penetration depth of a blob of grease; it'll eventually make you a better scientist.