Monday, October 18, 2010

Data wrangling the Current

Arguably, one of the better things about Minnesota is Minnesota Public Radio. MPR is a giant on the national public radio scene (producing shows like "The Splendid Table", "A Prairie Home Companion", and "Marketplace", although "Marketplace" is recorded out in CA).

One of their interesting recent endeavors has been a pop music format station, "The Current". It's at 89.3 in the metro Twin Cities region, and it is (for better or worse, and let's not argue the point) pretty much the only radio station I listen to.

I've noticed over the last couple of years (I've been listening since January of 2006; they came online in January of 2005) that the diversity of music seems to be declining somewhat. Not in the sense that they play more mainstream stuff, but in the sense that they are playing fewer songs more often.

So, I decided I'd do a little data wrangling. See, one of the outstanding features of the Current is their religious devotion to citing the songs they play. The artist and title are ALWAYS given on the air, AND they have a complete web-accessible playlist of (pretty much) every song they've played since 5:00 AM, Dec 22, 2005. I say "pretty much" because they occasionally have a program or "guest DJ" which is listed on the playlist in place of the songs played during that period.

I wrote a Python script which fetches every day's playlist data since it was first available, does some formatting (stripping leading and trailing whitespace, replacing a few meta-characters with their intended characters, removing entries with an empty song or artist field), and concatenates it all into one long file.
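The formatting step might look roughly like the sketch below. This is not the original script: it is a minimal Python 3 illustration that assumes each playlist row has already been fetched as a (timestamp, artist, title) tuple, and that the "meta-characters" are HTML entities such as &amp;. The function names clean_entry and clean_playlist are made up for the example.

```python
import html


def clean_entry(artist, title):
    """Return a cleaned (artist, title) pair, or None if either field is empty."""
    # Assumption: the meta-characters are HTML entities (e.g. "&amp;"),
    # so decode them back into the characters they stand for.
    artist = html.unescape(artist).strip()
    title = html.unescape(title).strip()
    if not artist or not title:
        return None
    return artist, title


def clean_playlist(rows):
    """Clean (timestamp, artist, title) rows, dropping rows with an empty field."""
    cleaned = []
    for timestamp, artist, title in rows:
        entry = clean_entry(artist, title)
        if entry is not None:
            cleaned.append((timestamp, *entry))
    return cleaned
```

Rows that list a program or guest DJ with no artist would be dropped by the empty-field check.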

This post is all about the first pass I've made over the data- it's really quite basic stuff. I'll try and post some more interesting things later; many friends on Twitter seem to be interested in this little project and have suggested some excellent metrics to investigate. So, without further ado, here's what I can tell you so far:

Total number of unique artists: 13438
Total number of unique songs: 50369
Top 10 songs by number of plays:
1. [323] My Girls- Animal Collective (2009-01-10 22:12:00)
2. [317] Dominos- The Big Pink (2009-08-03 18:36:00)
3. [309] Two Weeks- Grizzly Bear (2009-05-09 16:13:00)
4. [307] Kids- MGMT (2008-04-01 21:05:00)
5. [304] 2080- Yeasayer (2008-01-08 03:50:00)
6. [299] French Navy- Camera Obscura (2009-04-11 16:36:00)
7. [299] Ambling Alp- Yeasayer (2009-11-06 19:26:00)
8. [298] Home- Edward Sharpe and the Magnetic Zeros (2009-09-21 20:07:00)
9. [298] Time To Pretend- MGMT (2007-12-04 11:07:00)
10. [296] How You Like Me Now?- The Heavy (2009-10-26 23:35:00)

Top 10 artists by number of plays:
1. Wilco (played 2413 times)
2. Spoon (played 2370 times)
3. The Hold Steady (played 2153 times)
4. Beck (played 2074 times)
5. Atmosphere (played 2006 times)
6. Radiohead (played 1943 times)
7. The Flaming Lips (played 1915 times)
8. Prince (played 1866 times)
9. Cloud Cult (played 1791 times)
10. Arcade Fire (played 1790 times)
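Counts like the ones above can be produced with a few lines of Python. The sketch below is one way to do it, not necessarily how the original script works. It assumes the cleaned data is a sequence of (timestamp, artist, title) tuples with timestamps written as "YYYY-MM-DD HH:MM:SS" strings, and the function name summarize is hypothetical.

```python
from collections import Counter


def summarize(plays, top_n=10):
    """Summarize (timestamp, artist, title) rows: unique counts and top-N lists."""
    song_counts = Counter()
    artist_counts = Counter()
    first_played = {}
    for timestamp, artist, title in plays:
        song = (artist, title)
        song_counts[song] += 1
        artist_counts[artist] += 1
        # "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain strings.
        if song not in first_played or timestamp < first_played[song]:
            first_played[song] = timestamp
    return {
        "unique_artists": len(artist_counts),
        "unique_songs": len(song_counts),
        # Each entry: (play count, title, artist, timestamp of first play).
        "top_songs": [
            (count, title, artist, first_played[(artist, title)])
            for (artist, title), count in song_counts.most_common(top_n)
        ],
        "top_artists": artist_counts.most_common(top_n),
    }
```

Keying songs on the (artist, title) pair rather than the title alone keeps two different songs that share a name from being counted together.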

Commentary- first, this data is PRETTY good but not VERY good. Ferinstance, the 22nd most played band is "R.E.M.". But, if their database entry for, say, "Stand", has the band as "REM" or "R. E. M." or any of a hundred other possible variations, my program is not (as yet) sophisticated enough to notice. So R.E.M. might well deserve to be higher if one of their songs is reasonably popular.
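One simple way to catch variants like these is to compare artists by a normalized key instead of the raw string. The sketch below is an assumption about how that could be done, not something the current program does. The names artist_key and merge_artist_counts are hypothetical.

```python
import re
from collections import Counter


def artist_key(name):
    """Collapse case, punctuation, and spacing so spelling variants share one key."""
    key = re.sub(r"[\W_]+", "", name.casefold())
    # Names made only of punctuation (e.g. "!!!") would collapse to an empty
    # string, so fall back to the trimmed, case-folded name for those.
    return key or name.strip().casefold()


def merge_artist_counts(artist_counts):
    """Combine play counts for artists whose names normalize to the same key."""
    merged = Counter()
    for name, count in artist_counts.items():
        merged[artist_key(name)] += count
    return merged
```

This is deliberately crude. It merges "REM" with "R. E. M.", but it would also merge any two genuinely different artists whose names differ only in punctuation or spacing, and it does nothing about misspellings or "The"/no-"The" variants.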

Second, I think it's interesting (and it meshes well with why I started this project) that all of the top 10 songs by number of plays were first played in late 2007 or later (the oldest in the top 10 has a "birthday" of 4 Dec, 2007). In fact, 22 of the top 25 were first played since the beginning of 2009. I think I need a better way of characterizing this metric, though- perhaps by number of times played in the first month after the song premieres.
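The "plays in the first month" idea could be sketched as follows. This is an illustration only, with assumptions worth stating: rows are (timestamp, artist, title) tuples, timestamps look like "2009-01-10 22:12:00", a "month" is taken to be 30 days, and the function name plays_in_first_window is made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def plays_in_first_window(plays, days=30):
    """Count each song's plays within `days` of its first appearance."""
    times = defaultdict(list)
    for timestamp, artist, title in plays:
        played_at = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        times[(artist, title)].append(played_at)
    result = {}
    for song, stamps in times.items():
        premiere = min(stamps)  # earliest play stands in for the premiere
        cutoff = premiere + timedelta(days=days)
        result[song] = sum(1 for t in stamps if t < cutoff)
    return result
```

Two caveats: a song whose first logged play falls at the very start of the playlist archive may have premiered earlier, and songs that premiered less than 30 days before the end of the data have an incomplete window.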

Third, and perhaps most importantly, I don't want to sound like I'm down on these guys. I still love the station (for the music they play and for the commercials they DON'T play). I hope this doesn't tweak anyone's nose or get anyone mad at me- more than anything, I like parsing data to assuage my own curiosity, and this whole thing is also an excuse for me to play around with using Python to collect and parse a lot of data.

I'll add a few other items to this list over the coming days- half-life of particular songs, time-of-day correlation, playlist diversity for one day/week/month versus a given other day/week/month, and maybe a few more. If you have ideas, let me know!
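As a starting point for the playlist diversity item, one possible (and admittedly simplistic) definition is the ratio of distinct songs to total plays in a period. The sketch below computes that per calendar month. It assumes (timestamp, artist, title) tuples with "YYYY-MM-DD HH:MM:SS" timestamps, and the name monthly_diversity is hypothetical.

```python
from collections import defaultdict


def monthly_diversity(plays):
    """Map 'YYYY-MM' to (distinct songs) / (total plays) for that month."""
    songs = defaultdict(set)
    totals = defaultdict(int)
    for timestamp, artist, title in plays:
        month = timestamp[:7]  # the "YYYY-MM" prefix of the timestamp
        songs[month].add((artist, title))
        totals[month] += 1
    return {month: len(songs[month]) / totals[month] for month in totals}
```

A value of 1.0 would mean no song repeated during the month, and lower values mean more repetition. Comparing the ratio across months only makes sense if the months have a similar number of logged plays.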

3 comments:

  1. So are you parsing w/ Python, or have you gone to SQL to get the metrics?

  2. RE: Idris_Arslanian- I did the parsing with Python, since a large part of this is an exercise to learn Python and how Python deals with large data sets. I will probably try to do some SQL for more advanced queries; it would be a nice skill to add to my resume, but far less important for me as an EE than it would be for other disciplines (I think).

  3. It's eerie how similar this is to a project I undertook a few years back: 89.3 The Current and the Mysterious Non-Expanding Playlist. Same motivation, same love of the Current, even the same language to do the screen scraping... convergence in action.

    I've also been scraping playlists from other TC stations (Radio Sucks). If you'd like those SQL datasets to play around with, I could send them to you.
