Analysis of the track hit count

Andrew Booker 2007-10-03 17:45:51

Last night I made another geeky addition to the track hits list. If you haven't noticed the hits list recently, go and notice it now, because it's fu....ascinating. OK, so you might need to read the rest of this to appreciate the near bottomless depths of wonder to be enjoyed from these numbers. Hopefully that last sentence tells you that I am not taking this very seriously.

Originally the hits list page just had a simple hit count and a latest hit timestamp for each track. It only counted clicks on the track download page. So if someone copied the file URL and mailed it round for downloading, we wouldn't be able to count the hits from people clicking on the URL in their email. For the podcast, I got the hit count working better and could split it into hits via the site and hits via external links, eg the podcast. Also I added a column called per day, showing the total hits divided by the number of days the track had been available. I immediately noticed the per-day average was not very clever or interesting. I might get rid of it and save the space.

Here's what's wrong with it. By default, the lists are sorted in descending date order, so you get the latest stuff at the top. In that order, you see that the per-day average tends to decrease down the list, from several per day around the latest tracks at the top, to a small fraction per day towards the bottom. This is to be expected. As more tracks become available, people inevitably pay less attention to the older stuff in favour of the new, so the per-day average doesn't tell you much, because hits are naturally decreasing over time for all tracks.

I reckoned it would be more interesting to ask, does the track lose popularity slower or faster than the general downward trend? This has been a right bugger to get working, but I think I have something vaguely realistic now. Suppose a track has been on the site for 25 days and I want to see how it's doing. What I do is look at all the tracks that have been on the site for more than 25 days, and count all the hits they clocked in their first 25 days. Suppose I find a total of 851 hits for 52 tracks, or 16.37 hits per track. When I see my 25-day old track has had 18 hits, that gives me a rating of 18 * 52 / 851 = 1.1 times the 25-day average. Now that I've got my head around it, that part of the calculation is not hard.
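For anyone who prefers code to arithmetic, here is a rough sketch of that rating in Python. It is not the site's actual code. The track dictionaries, the "uploaded" and "hits" keys and all the function names are made up for illustration, and I'm assuming each hit is stored as a plain date.

```python
from datetime import date, timedelta


def rating_from_totals(own_hits, total_hits, track_count):
    """Rating = own hits divided by the average hits per comparison track."""
    if track_count == 0 or total_hits == 0:
        return None  # nothing to compare against (a choice made for this sketch)
    return own_hits * track_count / total_hits


def hits_in_first_days(track, days):
    """Count the hits a track received in its first `days` days on the site."""
    cutoff = track["uploaded"] + timedelta(days=days)
    return sum(1 for d in track["hits"] if track["uploaded"] <= d < cutoff)


def popularity_rating(track, all_tracks, today):
    """Compare a track with every older track over the same number of days."""
    age = (today - track["uploaded"]).days
    # Only tracks that have been around longer can supply a full `age` days of hits.
    older = [t for t in all_tracks if (today - t["uploaded"]).days > age]
    total = sum(hits_in_first_days(t, age) for t in older)
    own = hits_in_first_days(track, age)
    return rating_from_totals(own, total, len(older))


# The worked example from the text: 18 hits against 851 hits over 52 tracks.
print(round(rating_from_totals(18, 851, 52), 2))
```

In this sketch, a track with no older tracks to compare against simply gets no rating (None) rather than a number.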

What took a lot of effort was making sure the page loaded in less time than it takes to boil the kettle. Because all the tracks are different ages, I need to repeat the hits per track calculation for all possible track ages between one day and the oldest, currently 408. Caching the results in the database speeds it all up, but the cache has to take into account that as each day passes, the whole average hitscape changes. Tomorrow's averages will be different from today's, for some tracks. So each time the page is loaded, it checks that the cache is up to date. If it finds any results more than one day old, it will recalculate them.
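The caching idea can be sketched along these lines. This is only an illustration under some assumptions: the real cache lives in the database, whereas here a plain dictionary stands in for it, and the AverageCache class and the compute function it wraps are hypothetical names.

```python
from datetime import datetime, timedelta


class AverageCache:
    """Cache the average hits per track for each track age, refreshed when stale."""

    def __init__(self, compute, max_age=timedelta(days=1)):
        self._compute = compute   # hypothetical function: age in days -> average hits
        self._max_age = max_age   # results older than this get recalculated
        self._entries = {}        # age in days -> (average, time it was computed)

    def get(self, age_days, now):
        entry = self._entries.get(age_days)
        if entry is None or now - entry[1] > self._max_age:
            # Missing or more than one day old: recalculate and store afresh.
            value = self._compute(age_days)
            self._entries[age_days] = (value, now)
            return value
        return entry[0]


# Usage: wrap the slow calculation once, then ask the cache on every page load.
cache = AverageCache(lambda age_days: 851 / 52)  # stand-in for the slow query
average = cache.get(25, datetime(2007, 10, 3, 17, 45))
```

Passing the current time in as an argument, rather than reading the clock inside get, is just a choice that makes the staleness check easy to try out.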

So, what I get out of all this is a popularity rating. For example, a rating of 2.0 says that the track has been downloaded twice as many times as the average for a track of its age. A rating of 0.5 would mean the track has had half as many hits as the average. And so on. And this is where it doesn't quite look right. Click on the rating heading so that it sorts highest first, and look where the 1.0 figure is. Way down the page. Surely it should be somewhere in the middle? I will have to look into this a bit more, but given that over 25% of the tracks on the list were uploaded within the first six weeks, ie by mid October 2006, and that the tracks with the highest hit count are typically not the most recent uploads (as at October 2007), I'm sure there's a gripping explanation in there somewhere. And I have not worked out why the two oldest tracks are missing a rating. Some part of my algorithm is underachieving.

Is there any other unconvincing analysis I could wedge into the track hits page? Ooh, probably all sorts. One that comes to mind is something similar to the duration analysis commonly used on financial instruments, where I could do a time-weighted average of the number of hits on each track to give a kind of effective lifetime. If a track gets loads of hits initially but almost none after a few weeks, it will have a short duration. If it gets fewer hits up front but attracts more and more later on, it will have a longer duration. In both cases the track could have the same popularity rating, but if I were looking to trim down the list and save disk space, I'd keep the one that had the longer duration.
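One way the duration idea might look in code, as a loose sketch only. I'm assuming the hits are available as a list of daily counts starting from the upload day, and the effective_lifetime function is a made-up name. It is a plain time-weighted average with none of the discounting the financial version uses.

```python
def effective_lifetime(daily_hits):
    """Hit-weighted average day on which a track was downloaded.

    daily_hits[i] is the number of hits on day i after upload (day 0 = upload day).
    """
    total = sum(daily_hits)
    if total == 0:
        return None  # no hits, so no meaningful lifetime
    return sum(day * hits for day, hits in enumerate(daily_hits)) / total


# Two tracks with the same total hits but very different shapes.
front_loaded = [10, 5, 0, 0, 0]   # busy at first, then forgotten
slow_burner = [0, 0, 0, 5, 10]    # quiet at first, then picking up
print(effective_lifetime(front_loaded), effective_lifetime(slow_burner))
```

Both example tracks have 15 hits in total, but the slow burner comes out with the longer lifetime, which is the one I'd keep.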

Why am I bothering cramming this junk into an already busy page when I should be mixing more uploads and going out to find a Central London venue?

I have exactly one reason for mucking about with this stuff. It's fun. Unlike the blog entries, and mixing the tracks, and soldering leads and electronics, and submitting listings to websites, and trying to get people to turn up to the gigs, and trying even harder to get people to play at the gigs, this harmless collection of data sits on the downloads page and gathers itself for my amusement with no intervention from me at all. Low maintenance interest. That's what we're all about here at Improvizone.
