Wednesday, July 22, 2009

Pandemic Perspective

The Swine Flu outbreak/pandemic/scare (pick depending on hysteria level) is all the rage at the moment. The current state of infections and deaths due to this strain of flu is available here. I thought that it would be enlightening to have a quick look at the data and see what is going on.

The Data
Let's have a look at the data for 6 countries:- Australia, Canada, Mexico, Spain, UK and US. The data shows the situation as of the 17th of July, 2009.




We can clearly see that the US has a very large number of confirmed cases. This is perhaps not the best metric since the US has a large, well-connected (roads, air travel and rail) population. More rural countries with less inter-city connectivity (e.g. Spain) would probably see fewer cases.

A normalised infection rate will give us a better handle on the real extent of the spread of the virus.
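
(If you want to reproduce this at home without Datalight, the normalisation is just cases divided by population, scaled to a per-million figure. Here's a rough Python sketch; the case counts are placeholders for illustration rather than the real numbers from the 17th of July, and the populations are the ones quoted below.)

```python
# Normalise confirmed cases to a per-million-population rate.
# Case counts are placeholders, NOT the real figures from July 2009.
populations = {
    "Australia": 21_855_000,   # population figures quoted later in the post
    "UK": 61_612_300,
}
confirmed_cases = {
    "Australia": 10_000,       # placeholder
    "UK": 10_000,              # placeholder
}

for country, cases in confirmed_cases.items():
    per_million = cases / populations[country] * 1_000_000
    print(f"{country}: {per_million:.1f} confirmed cases per million population")
```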


Of course, the infection rate will change over time. Smaller populations are likely to achieve their maximum infection rate sooner than larger ones. Perhaps this is a factor in explaining the differences between, say, Australia (21,855,000) and the UK (61,612,300).

As unpleasant as it is to suffer from flu symptoms, what we're really afraid of is dying. So let's have a look at the deaths per million population in these countries.

Unfortunately, what we see above still has a time-dependent factor to it. As the number of cases increases, so too will the number of deaths. The above method doesn't help us to compare the situation across countries.

What we should be measuring is the number of infections that lead to death. Although this will still be somewhat dependent on the sample size, it will at least give us a first-order summary of the infection-to-death picture as well as some comparison between the countries. We are assuming that each country has the same strain of the virus, so that an equivalent person in any of the countries will have the same mortality rate due to the virus. Extrinsic factors such as access to health care and the general health of the population will also affect the results, but this is as good a high-level overview as any.
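
In code terms, that just means dividing deaths by confirmed cases rather than by population. A rough sketch, again with placeholder counts rather than the real July figures:

```python
# Crude infection-to-death (case fatality) rate per country.
# Counts are placeholders, not the real July 2009 data.
data = {
    # country: (confirmed_cases, deaths)
    "Mexico": (12_000, 120),
    "US": (40_000, 260),
}

for country, (cases, deaths) in data.items():
    fatality_rate = deaths / cases if cases else float("nan")
    print(f"{country}: {fatality_rate:.2%} of confirmed cases led to death")
```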

So should I panic!?

As ever, this isn't going to be much of a conclusion. Panic is a very personal reaction. In the US, seasonal flu accounts for over 100 deaths per million of population. As you can see above (3rd diagram down), swine flu is currently at 0.84 deaths per million of population. I couldn't find any data on the infection-to-death rate for seasonal flu because every web search serves up swine flu data. I'll just add it to the other causes of death that I'm not worried about.

Some more perspective:- Here in Spain, suicide accounts for 160 deaths-per-million population... and I'm not planning on catching that either!

Shameless plug:- All diagrams were created using the DatalightProject's online statistics tool.

Tuesday, July 14, 2009

Sugar Daddy

My wife is in the process of building our first manzanoling. I was tempted to start with "My wife and I are in the process..." but my input to the creation process was over quite a while ago...

Before my wife was diagnosed with Gestational Diabetes I didn't even know that it existed (my bad!). Depending on the severity it can be more or less of a problem. The hospital provided us with a small machine to measure my wife's blood sugar before and after meals. My wife*, being the iron-willed obsessive that she is, took it upon herself to record the values and try different things to see if she could keep the sugar under control.

98 meals in and I decided to load up the data in the DataLightProject app. I had a pretty good qualitative handle on how things were going but I hadn't really been following the numbers that closely.

The diagram represents data before all meals (breakfast, lunch, dinner). We already knew that, before eating, her blood sugar values were generally within the normal range. The thing to keep an eye on is after eating.

The average value doesn't look too bad, but the Max value of 202 mg/100ml points to the fact that all is not as it should be.

A quick application of filtering by food type gives us some idea of which foods are to blame. Some things were no surprise:- Orange juice (average 202 mg/100ml) and All-Bran (average 163 mg/100ml) were leaders in causing higher blood sugar. But toast (average 152 mg/100ml) for breakfast was a surprisingly bad thing too!
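
Datalight did this filtering in a couple of clicks, but if you wanted to do the same thing by hand it boils down to grouping the post-meal readings by food type and averaging. A rough pandas sketch, with made-up column names and values standing in for the real log:

```python
import pandas as pd

# Group post-meal blood sugar readings by food type and average them.
# Column names and values are invented; the real log lives in Datalight.
readings = pd.DataFrame({
    "food":  ["orange juice", "all-bran", "toast", "toast"],
    "sugar": [202, 163, 150, 154],   # mg/100ml, placeholder values
})

averages = readings.groupby("food")["sugar"].mean().sort_values(ascending=False)
print(averages)   # worst offenders (highest average post-meal sugar) first
```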

The most surprising effect on post-meal blood sugar came when considering exercise. Now, given that my wife is pregnant, the kind of exercise that she can do is pretty light. Even so, it has a dramatic effect:

A 20-minute walk after breakfast makes a real difference. The blood sugar measurements after breakfast are consistently higher, so being able to get that under control was a real bonus. I'm not sure why exercise has a greater effect following dinner, but it's definitely real.
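
For the curious, the exercise comparison is the same idea as the food filtering: pivot the post-meal readings by meal and by whether a walk followed. Another rough sketch with invented numbers, not the actual measurements:

```python
import pandas as pd

# Compare post-meal blood sugar with and without a walk afterwards.
# All values are invented placeholders, not the actual measurements.
readings = pd.DataFrame({
    "meal":  ["breakfast", "breakfast", "dinner", "dinner"],
    "walk":  [True, False, True, False],
    "sugar": [118, 152, 110, 135],   # mg/100ml, placeholder values
})

summary = readings.pivot_table(index="meal", columns="walk", values="sugar", aggfunc="mean")
print(summary)   # side-by-side averages: walked vs. didn't walk
```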

With any luck, the diabetes will stop once the baby is born. This run-in with diabetes has really made us think about what it must be like to live with this condition throughout your whole life. Of course there are treatments, but diabetes has a lot of day-to-day overhead. Most of us don't have to think about what food types we can eat or how long we can or can't wait between meals. Diabetes isn't as solved a problem as most of us like to think.

Shameless plug:-

Simplicity is what the DataLight Project is all about. We want to empower people with the ability to look at their real data and gain some real value. It took me a total of 12 mouse clicks in DataLight to prepare all of the diagrams for this post (yes, I counted). This lets the user think about the underlying data and its meaning, rather than the number crunching. So why not give it a go!

*I'm referring to my wife as my "Dream Girl" - not as someone I met from a dying world; we don't talk about that.

Thursday, July 2, 2009

The Rich, the Internet and the Application

Whilst researching this post I made a frightening discovery:- I didn't know how to define a Rich Internet Application (RIA). Now I'm pretty used to feeling stupid and confused. Confusion is pretty much the default state for scientists, programmers and anyone whose job it is to do, or work with, new things. Politicians often seem to be confused too, although I suspect that "I thought I could claim for that" type of confusion is a little different to "Why doesn't this work?". Just a hunch...

Now I'm guessing that a bunch of you are thinking, "Fran, you idiot! It's obvious what an RIA is! You are selling one for pity's sake!" And you'd be correct. Sometimes you just know. But let's examine some other cases and have a think.

The Rich

Let's look into the Rich experience. A natural place to start is with Microsoft's flagship technology for creating applications: WPF. Believe it or not, this isn't taken from CSI: Miami, but is a showcase of what can be done with WPF (although you'll be bombarded by cool animations and music just like in CSI: Miami). Although these are all desktop applications, you'll notice that several of them source their content from the internet. That's fine; Outlook gets most of its content from the Internet but nobody would call it an RIA.

Twitter clients, which are obviously sourced from the Internet, present some great Rich experiences. TweetDeck is one of the more popular clients and runs on Adobe Air. Adobe Air and Silverlight often get lumped together as providing similar, competing features. Whereas Silverlight is considered 100% internet, Adobe Air appears to be taken seriously on the Desktop. I haven't heard TweetDeck referred to as an RIA at all. Another interesting Twitter client is Blu. Blu has impressive visuals and is built using WPF. Blu is installed via ClickOnce. Although Blu obviously isn't an RIA, the install mechanism makes for the painless experience we have come to associate with Internet apps, i.e. no install.

Defining a Rich experience is easy. Defining a Rich Internet one is slightly blurred.

The Internet

The Internet bit of the discussion should be easy. Surely if it's in a browser then it's Internet. Maybe. YouTube is definitely an Internet application. But my phone has a YouTube client on it since it doesn't run Flash. Now everybody knows that YouTube is an Internet application, and just because some people choose to run it in a client doesn't change that. So maybe the definition goes a little beyond where the client is installed.

Balsamiq Mockups is an excellent UI mockup application. I think that everyone's gut reaction would be to call this a bona fide, as sure as Bing Is Not Google, 100%, Rich Internet Application. It's built using Adobe Air and can be deployed to a web server, run from the website or bought as a desktop app. So are we back to square one?

Silverlight 3 will offer developers the opportunity to run their applications outside of the browser. So does this make them desktop applications? I guess so. Rich Internet Application - Internet = Rich Application. Right?

The Application

I'm pretty sure we can all agree on what an application is. Word is an application. Google Docs is an application. They are both completely different and yet have an incredibly large overlap. The delivery mechanism is completely different. The licensing couldn't be more disparate. But they both create documents. Easy. Google Docs is an RIA. You can save output to the desktop but it's definitely an RIA. No arguments.

Conclusions

So what have we learnt? I'm not exactly sure. I think that what we have seen is that there is a significant blurring of the differences between an Internet application and a desktop application. They can share delivery mechanisms (Internet installs via ClickOnce), technology (Adobe Air) or even frameworks (Silverlight is a subset of WPF).

The most important factor in deciding on an application is its usefulness to you. If it suits the customer's requirements and budget then it will be used.

Let's stop worrying about where the application is running. In 2009 it's just an implementation detail.

So to conclude: I still don't know what an RIA is. I told you I was an idiot.

Sunday, June 21, 2009

Petabytes, Jedi-liars and blobs of grease

An estimated 15 Petabytes per year will be generated by the Large Hadron Collider at CERN when fully operational. The LHC will have a grid of 100,000 computers (LHC@home) analysing this vast amount of data. The aim is to discover interesting new physics and possibly observe the Higgs Boson (God Particle). Fears of the collider creating Earth swallowing blackholes have been raised, although a quick back-of-the-envelope calculation says the LHC is no more than 18 times more lethal than death.

The SETI project generates 36 Gigabytes of data per 15.5 hours of observation whilst scanning the Universe for Extraterrestrial Intelligence.

Making it possible

So what makes this kind of data analysis endeavour possible?
  • A multi-million dollar budget? That's nice, but in no way does spending massive amounts of money guarantee success.
  • A massive array of computers? It comes in pretty handy being able to outsource your analysis to 100,000 computers, but this isn't the key.
  • Brilliant minds? Obviously, being able to consult with world leaders in the field is nice, but this isn't the key either.

The fundamental thing that these projects have in common, and something that we should take away for our own projects, is that they know what they are looking for!

This may seem counterintuitive at first, but it's something our brains do constantly, subconsciously. Our brains see patterns in pictures that would be hopelessly scrambled to a computer. This is possible because the brain is testing against a set of hard-coded hypotheses that allow us to live our lives.

The LHC will be testing a number of hypotheses that will allow them to focus in on the really interesting physics. In these collisions a vast amount of energy will be released, creating myriad particles along the way. Data for the well-understood processes can immediately be disregarded.

The SETI project isn't scanning the heavens to see if we suddenly receive alien MTV. The data is tested for certain characteristics which allow us to distinguish signals from the immense background noise of space. The criteria are determined based on well-understood physics which allow certain assumptions to be made. Without assumptions it would be impossible to analyse this type of data.

So how does this apply to me?

OK, so you probably aren't going to analyse quite as much data as either of the above projects, but certain parallels can be drawn. In the two examples above, the analysis is done to find a concrete result. Now you may think,

"I'm doing research because I want to find an answer to something I don't know, not proof of something that I already suspect!"

This is fine. But don't kid yourself; there are certain things that you know and already expect. There are certain answers that would (or at least should) surprise you. In 2001, the UK census results revealed that nearly 1% of the adult population of England and Wales were in fact Jedi Knights. This data tells us something about that 1%, and it doesn't have anything to do with their proficiency with a Light Saber. Establishing some control tests allows a first-pass evaluation of how good the collected data is. (Although lying is against the Jedi code, perhaps these people are slightly confused.)

Before you immerse yourself in analysing a big piece of data, make sure that you always have a certain set of minimum criteria. ALWAYS do a sanity test. Use a simple tool to do some high-level analysis (shameless plug). What is the average income for my demographic? What do you mean it's minus $200/month? This could indicate a problem with the data that you need to address before you start. What is the maximum age of my demographic? What do you mean it's 410 years? An innocent typing error could skew your results. It's much better to find these errors before you spend hours analysing the results. I know from experience:-

I learnt some painful lessons about data analysis during my PhD. Experimental runs could go on for days and I learnt the importance of doing sanity checks on your data whilst the experiment is running. Lots of things could go wrong. When you've spent a week, including weekends and late nights, collecting data, then 4 long days cranking through the datasets, only to realise that your sample fell off before you even got it into the cryostat A WEEK EARLIER, you feel like a fool. By the time I got to this stage a few years later, the hard lessons had been learnt.
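
To make the sanity test above concrete: a quick pass over the minimum, maximum and mean of each column catches most of the silly errors before you burn a week on them. Here's a minimal Python sketch; the file name, column names and limits are made up for illustration:

```python
import pandas as pd

# Flag columns whose values fall outside plausible limits before analysing anything.
# The file name, column names and limits are illustrative; adjust to your own data.
data = pd.read_csv("survey.csv")    # hypothetical input file

limits = {"age": (0, 120), "income": (0, 1_000_000)}

for column, (low, high) in limits.items():
    col_min, col_max, col_mean = data[column].min(), data[column].max(), data[column].mean()
    if col_min < low or col_max > high:
        print(f"Suspicious values in '{column}': min={col_min}, max={col_max}")
    else:
        print(f"'{column}' looks sane: min={col_min}, max={col_max}, mean={col_mean:.1f}")
```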

Conclusions
  • Have an idea of what you want to learn before you start.
  • Have a sanity test mechanism to know if you've found something useful.
  • Do waste your time measuring the magnetic penetration depth of a blob of grease; it'll eventually make you a better scientist.

Thursday, June 11, 2009

Welcome to the DatalightProject blog!

My name is Fran Manzano and I'm one of the data geeks who run www.datalightproject.com. Like any company blog, we'd like to spread the word about our products and services, but we'll also talk about how we get things done and what motivates us. You can also follow us on Twitter and Facebook.

Let's start with a mini FAQ, which will likely become part of the real FAQ once we get around to it.

What is Datalight?

Datalight is a web-based application for analysing statistical data. Our aim is to provide statistical analysis that can be used by anyone who has data that they want to know more about. We chose the Silverlight platform to provide flexibility for our users. We do not insist on a specific operating system, or even tie the Datalight subscription to a specific computer. We firmly believe that software should adapt to the needs of the user and not the other way around.

What isn't Datalight?

Datalight is not meant to be a replacement for the existing data analysis applications on the market. At the same time, we feel that people are often put off doing useful analysis of their data by the high barrier to entry, in both cost and difficulty of use. Our pricing model is such that we expect customers to pay for Datalight only when they need it.

It’s online, am I sending my data over the internet?

Not at all! All of the computation is performed locally on your machine (which is why it is fast).

OK, how much?

Datalight currently costs $15 per month. There is no sign-up fee or tie-in. We don't try to get you to pay for longer than you need, and you won't be penalised if you want to extend your subscription at a later date. Our 7-day cooling-off period is there so that you can try out Datalight for 7 days with your own data and then decide if it's useful to you. If you choose to cancel then that is fine; it's your data, you know best.

Mini Manifesto

We strongly believe in not being evil and we'd like to think that we'd pass the Starbucks test. We want Datalight to be something remarkable, and that doesn't happen passively or by accident.

Have it for free!

If you work for a not-for-profit organisation or charity and you think you could use Datalight, then let us know and you can get it for free. Educational establishments also qualify for a discount so please get in touch!