Collective Intelligence and the Guardian Data-Store

I’ve been interested in collective intelligence and machine learning for a while now. These two related fields centre on using statistical tools on large sets of data to make measurements and predictions. So when the UK’s Guardian newspaper announced their “Data-store”, a collection of data sets open to the public, I felt it was time to apply some of what I’ve learned to the data they were offering.

I chose to apply hierarchical clustering to the data on world health. The idea of hierarchical clustering is to measure how similar data sets are, then pair off the most similar data sets to build a binary tree that will reveal groups of similar data. I used the Pearson correlation to compare the data sets, and the resulting tree is drawn as a dendrogram, a way of showing the distances between the various clusters that emerge from the clustering algorithm.
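To make the idea concrete, here is a minimal sketch in Python. It is not the F# code from the project. The distance is taken as 1 minus the Pearson correlation. Representing a merged cluster by the average of its two children is an assumption of this sketch, and the four toy vectors are invented for illustration.

```python
from math import sqrt

def pearson_distance(a, b):
    """Return 1 - Pearson r, so strongly correlated vectors score close to 0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation is undefined for a constant vector
    return 1.0 - cov / sqrt(var_a * var_b)

def hierarchical_cluster(rows):
    """Build a binary tree from {label: vector}.

    A leaf is a label; an inner node is (left, right, distance).
    """
    clusters = [(label, list(vec)) for label, vec in rows.items()]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pearson_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Assumption: the merged cluster is represented by the average vector.
        merged_vec = [(x + y) / 2 for x, y in zip(vec_i, vec_j)]
        del clusters[j]  # j > i, so delete j first to keep i valid
        del clusters[i]
        clusters.append(((tree_i, tree_j, d), merged_vec))
    return clusters[0][0]

# Invented toy data: A and B rise together, C and D fall together.
toy = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8.5],
    "C": [4, 3, 2, 1],
    "D": [8, 6, 4, 2.5],
}
tree = hierarchical_cluster(toy)
```

In the resulting tree, A is paired with B and C with D, and the two pairs only join at the root, at a much larger distance. A dendrogram shows the same structure, drawing each join at a length proportional to that distance.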

The code I’ve used is available on github.com; it’s packaged in an F# project called gdata.fsproj. For a direct link to the project click here. (There’s also a demonstration of hierarchical clustering with word counts from blogs, from the TechDays Paris 2009 talk.)

Anyway, I’m not going to dig too deeply into the code, at least for this post, so let’s have a look at the results. First I clustered by country using the following statistics to form my vectors:

Hospital beds per 1000
Nursing and Midwifery Personnel per 1000
One-year-olds Immunised with diphtheria tetanus toxoid and pertussis (DTP)
One-year-olds Immunised with hepatitis B
One-year-olds Immunised with Hib (Hib3) vaccine
Adolescent fertility rate (%)
Births attended by skilled health personnel (%)
Infant mortality rate (per 1 000 live births) both sexes
Maternal mortality ratio (per 100 000 live births)
Neonatal mortality rate (per 1 000 live births)
Life expectancy at birth (years) both sexes
Life expectancy at birth (years) female
Life expectancy at birth (years) male
Deaths among children under five years of age due to HIV/AIDS (%)
Per capita recorded alcohol consumption (litres of pure alcohol) among adults
Population with sustainable access to improved drinking water sources (%) total
Population with sustainable access to improved sanitation (%) total.
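Turning a table of statistics like this into one vector per country can be sketched as follows. This is a Python illustration, not the project’s F# code. The function name, the dictionary layout and the sample values are all made up for the example. A country is kept only when it has a value for every statistic, because a vector with gaps cannot be compared with the others.

```python
def complete_vectors(table, stat_names):
    """Keep only the rows that have a value for every requested statistic."""
    vectors = {}
    for country, stats in table.items():
        values = [stats.get(name) for name in stat_names]
        if all(v is not None for v in values):
            vectors[country] = [float(v) for v in values]
    return vectors

# Invented sample rows; None or a missing key means "not reported".
raw = {
    "Country A": {"beds_per_1000": 3.9, "life_expectancy": 79},
    "Country B": {"beds_per_1000": None, "life_expectancy": 61},
    "Country C": {"life_expectancy": 70},
    "Country D": {"beds_per_1000": 0.8, "life_expectancy": 52},
}
vectors = complete_vectors(raw, ["beds_per_1000", "life_expectancy"])
```

Here only Country A and Country D survive, since the other two are missing the hospital beds figure.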

The statistics were chosen mainly because they were the most complete; it is only possible to compare countries using this technique if all statistics are available. The resulting dendrogram can be seen below:

 

There are no great surprises in the stats: there appear to be two distinct clusters, one of poor countries towards the bottom of the diagram and one of richer countries towards the top, with the 1st world countries located towards the top of this cluster (absolute position doesn’t matter much in the diagram; it’s more about who you’re close to). There are perhaps a few surprises: maybe we wouldn’t have expected to find Canada quite so close to Ukraine, or the Czech Republic so close to Germany. It may be worth going back to the underlying statistics to find out why this is.

Perhaps a more interesting analysis is to reverse (transpose) the matrix so we are now comparing which conditions are related to each other:

Again, the diagram does show some obvious relations. Male and female life expectancies were always going to be statistically similar to overall life expectancy, but it does appear that this is closely related to infant mortality rates. This in turn is closely correlated with births attended by medical professionals and access to clean water and sanitation. While this is fairly logical, I think it’s good that we can show, statistically speaking at least, that access to clean water and sanitation goes hand in hand with lower infant mortality and longer life expectancy.
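The “reversing” of the matrix described above amounts to a transpose: instead of one vector of statistics per country, we build one vector of country values per statistic and cluster those. A minimal Python sketch (again, not the project’s F# code), with invented names and numbers:

```python
def transpose(names, table):
    """Turn {row_label: [value per column]} into {column_name: [value per row]}."""
    columns = zip(*table.values())
    return {name: list(col) for name, col in zip(names, columns)}

# Invented sample data: one vector of statistics per country.
stat_names = ["life_expectancy", "infant_mortality", "clean_water_pct"]
by_country = {
    "Country A": [80.0, 4.0, 99.0],
    "Country B": [55.0, 90.0, 45.0],
    "Country C": [70.0, 25.0, 85.0],
}

# After transposing, each statistic has one value per country.
by_statistic = transpose(stat_names, by_country)
```

Because the Pearson correlation ignores scale and offset, statistics measured in different units (years, percentages, rates per 1 000) can still be compared directly once they are laid out this way.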

While these first steps in analysing the Guardian data didn’t perhaps turn up anything we didn’t already know, I feel they have shown that if you spend a bit of time working with publicly available data you can start to find interesting patterns. I shall definitely be looking at how I can further these experiments.

Featured Speaking Engagement – F# Tutorial at the Progressive .NET Tutorials, May 11-13th, London

I will be giving a half-day F# tutorial at the “Progressive .NET Tutorials” organised by Skills Matter. This will be an excellent three-day event with two tracks featuring half-day and full-day tutorials by Gojko Adzic, David Laribee, Hammett, Ian Cooper, Mike Hadlow, Scott Bellware and Sebastien Lambla.

I will be giving my half-day tutorial on Wednesday May 13th (the last day of the event). I will be presenting 'F# Tutorial', which aims to give delegates the building blocks for using F# productively and to start having fun with it.

For the full programme and description of my tutorial, and all other Progressive .NET tutorials, check out: http://progressive-dotnet.com

Special Community Discount: Book on or before March 31st and pay just £500!

Skills Matter has given me a promotion code that will entitle you to a substantial discount off the tutorial fees. Simply book on or before March 31st, quote SM1368-622459-33L (in the Promo Code field) and pay just £500 (normal price £1000). The offer is valid until March 31st only, and tickets are going fast, so if you would like to secure a place and claim your discount – you’d better get a wriggle on.

The code to use is: *SM1368-622459-33L* and must be entered in the box provided when booking online at https://skillsmatter.com/register-online/conf/280

Full details of the event can be found at http://progressive-dotnet.com

ALTi

(Sorry I’ve been a bit quiet recently; this is the first of several posts I’ll be making this morning)

I decided a little while ago that I’d like to change direction in my career and go back to consulting. After interviewing around a bit, I decided to join ALTi. It was my first day on Monday, and so far I’m enjoying my first week, although obviously I’m just getting settled in. The thing I like most about the company so far is that they seem quite open to suggestions and seem willing to let you develop your career in the direction you want. I’ll be predominantly working on .NET projects in their .NET practice, so I’m interested in finding any projects with an F# slant out there (although F# won’t be my exclusive focus). I’m also hoping to develop the training and speaking side of my career. So if you have some F# work or are interested in having an F# presentation or tutorial, do not hesitate to drop me a line: Printf.sprintf "%s@%s.%s" "robert" "strangelights" "com"


Disclaimer

The views expressed on this weblog are mine and do not necessarily reflect the views of my employer.

All postings are provided "AS IS" with no warranties, and confer no rights.
