{"id":1565,"date":"2015-03-08T22:02:05","date_gmt":"2015-03-09T03:02:05","guid":{"rendered":"http:\/\/unitstep.net\/?p=1565"},"modified":"2015-03-09T08:57:17","modified_gmt":"2015-03-09T13:57:17","slug":"python-script-to-search-cbc-radio-2-broadcastplay-logs","status":"publish","type":"post","link":"https:\/\/unitstep.net\/blog\/2015\/03\/08\/python-script-to-search-cbc-radio-2-broadcastplay-logs\/","title":{"rendered":"Python script to search CBC Radio 2 broadcast\/play log history"},"content":{"rendered":"

I’m a fan of CBC Radio 2<\/a>. Okay, that’s not exactly true, but I do have my $10 radio alarm clock tuned to 94.1 FM to wake me on weekdays. I’m usually in a stupor or only semi-awake when the tunes start blasting away before dawn, so I often have trouble remembering exactly what was on the radio that morning. One day, however, I remembered that a certain Ben Folds Five<\/em> song had received airtime on CBC Radio 2 during my morning wake-up, but I could not recall the exact day. It bothered me.<\/p>\n

Thankfully, the station did have broadcast\/play logs<\/a> of all the tracks it had aired, along with the dates and times, which made for a succinct history. Unfortunately, there didn’t seem to be a way to search them, and I didn’t feel like combing through each day’s play log for one particular title. What to do?<\/p>\n

Scripting to the rescue!<\/p>\n


TL;DR<\/h2>\n

If you just want the script, it’s available here as a gist<\/a>.<\/p>\n

The broadcast log web page<\/h2>\n

The first thing to note on the broadcast log page is that the top-level page appears to use an SPA<\/a> design, with the hash storing the state. However, it’s simpler than that. No Ajax\/XHR is used to move between different dates on the broadcast log; instead, there’s just an iframe<\/code> whose src<\/code> URL<\/acronym> is updated when a new date is selected.<\/p>\n

This is a bit of a weird design: because the iframe<\/code> occupies the entire page, the net effect is that the entire page is refreshed whenever a new date is selected. The result is an SPA-like URL<\/acronym> with a hash, but without an SPA-like experience. (I’m sure there’s an obscure and reasonable explanation for this.)<\/p>\n

This means that the URL<\/acronym> you see in your browser when you visit the broadcast log page<\/a> is not the URL<\/acronym> where the broadcast logs are loaded from. Instead, looking at the iframe<\/code> source, you’ll see they come from URLs with the following format: http:\/\/music.cbc.ca\/broadcastlogs\/broadcastlogs.aspx?broadcastdate=YYYY-MM-DD<\/code><\/p>\n
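Since the date is the only thing that varies in that URL, generating one URL per day is straightforward. The following is a minimal sketch, not necessarily how the gist does it; the names `BROADCAST_LOG_URL` and `broadcast_log_urls` are made up for illustration, and only the URL format itself comes from the page.

```python
from datetime import date, timedelta

# URL format taken from the iframe source described above.
BROADCAST_LOG_URL = (
    "http://music.cbc.ca/broadcastlogs/broadcastlogs.aspx?broadcastdate={day}"
)


def broadcast_log_urls(start, end):
    """Yield (day, url) pairs for every day from start to end, inclusive.

    Hypothetical helper, written only to illustrate the URL template.
    """
    day = start
    while day <= end:
        # date.isoformat() produces the YYYY-MM-DD form used in the URL.
        yield day, BROADCAST_LOG_URL.format(day=day.isoformat())
        day += timedelta(days=1)


if __name__ == "__main__":
    for day, url in broadcast_log_urls(date(2015, 3, 2), date(2015, 3, 4)):
        print(url)
```

Each URL could then be fetched with whatever HTTP client you prefer; the fetching step is left out here so the sketch stays independent of the live site.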

Scraping<\/h2>\n

Now that we have the URL<\/acronym> template to use when requesting a specific day of broadcast\/play logs, the next step is to learn how to scrape and extract the relevant data. Scraping is far less preferable than a structured API, but thankfully the HTML<\/acronym> is relatively well-structured:<\/p>\n

<div class=\"logShowEntry\">\r\n  <div class=\"logEntryTime fB s11\">\r\n      5:07 AM<\/div>\r\n  <div class=\"logTrack\">\r\n    <h3 class=\"fCm s21\">\r\n      DRAW A CROWD<\/h3>\r\n    <dl class=\"s12\">\r\n      <dt>artist<\/dt><dd>Ben Folds Five<\/dd><dt>composer<\/dt><dd>Folds- Ben<\/dd>\r\n      <dt>album<\/dt><dd class=\"fB\">Draw A Crowd (Clean Edit)(Single)<\/dd>\r\n      <dt>label<\/dt><dd>Legacy<\/dd>\r\n      <dt>duration<\/dt><dd>03:56<\/dd>\r\n    <\/dl>\r\n  <\/div>\r\n<\/div><\/code><\/pre>\n
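To give a rough idea of how markup like the sample above could be turned into searchable records, here is a sketch that uses only the standard library's `html.parser`. It is an assumption-laden illustration rather than the approach taken in the gist: `PlayLogParser` and `find_tracks` are hypothetical names, and the code assumes every entry follows the same structure as the sample.

```python
from html.parser import HTMLParser

# The sample entry from above, unescaped.
SAMPLE = """
<div class="logShowEntry">
  <div class="logEntryTime fB s11">
      5:07 AM</div>
  <div class="logTrack">
    <h3 class="fCm s21">
      DRAW A CROWD</h3>
    <dl class="s12">
      <dt>artist</dt><dd>Ben Folds Five</dd><dt>composer</dt><dd>Folds- Ben</dd>
      <dt>album</dt><dd class="fB">Draw A Crowd (Clean Edit)(Single)</dd>
      <dt>label</dt><dd>Legacy</dd>
      <dt>duration</dt><dd>03:56</dd>
    </dl>
  </div>
</div>
"""


class PlayLogParser(HTMLParser):
    """Collect one dict per div.logShowEntry (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._field = None  # key under which the next text chunk is stored
        self._label = None  # most recent <dt> text, e.g. "artist"

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "logShowEntry" in classes:
            self.entries.append({})  # start a new record
        elif not self.entries:
            return  # ignore anything before the first entry
        elif tag == "div" and "logEntryTime" in classes:
            self._field = "time"
        elif tag == "h3":
            self._field = "title"
        elif tag == "dt":
            self._field = "dt"
        elif tag == "dd":
            self._field = self._label  # store the value under its <dt> label

    def handle_endtag(self, tag):
        if tag in ("div", "h3", "dt", "dd"):
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "dt":
            self._label = text
        else:
            self.entries[-1][self._field] = text


def find_tracks(html, query):
    """Return entries where any field contains query (case-insensitive)."""
    parser = PlayLogParser()
    parser.feed(html)
    query = query.lower()
    return [
        entry for entry in parser.entries
        if any(query in value.lower() for value in entry.values())
    ]


if __name__ == "__main__":
    for entry in find_tracks(SAMPLE, "ben folds"):
        print(entry["time"], entry["title"], entry["artist"])
```

A dedicated HTML library would likely be more forgiving of malformed markup; the standard-library parser is used here only to keep the example free of third-party dependencies.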

You can see that all the information is available within each div.logShowEntry<\/code>:<\/p>\n