Python script to search CBC Radio 2 broadcast/play log history

I’m a fan of CBC Radio 2. Okay, that’s not exactly true, but I do have my $10 radio alarm clock tuned to 94.1 FM to wake me on weekdays. I’m often in a stupor or only semi-awake when the tunes start blasting away before dawn, so I often have trouble remembering exactly what was on the radio that morning. One day, though, I remembered that a certain Ben Folds Five song had received airtime on CBC Radio 2 during my morning wake-up, but I could not recall the exact day. It bothered me.

Thankfully, they did have broadcast/play logs of all tracks they had aired, along with the date/times, providing for a succinct history. Unfortunately, it didn’t seem possible to search them, and I didn’t feel like searching through each day’s play log for the particular title. What to do?

Scripting to the rescue!

TL;DR

If you just want the script, it’s available here as a gist.

The broadcast log web page

The first thing to note on the broadcast log page is that the top-level page appears to use a SPA design, with the hash storing the state. However, it’s actually simpler than that. No Ajax/XHR is actually used to move between different dates on the broadcast log; instead, there’s just an iframe whose src URL is updated when a new date is selected.

This is a bit of a weird design: because the iframe occupies the entire page, the net effect is that the entire page is refreshed. The result is a SPA-like URL with a hash, but without a SPA-like experience. (I’m sure there’s an obscure and reasonable explanation for this.)

This means that the URL you see in your browser when you visit the broadcast log page is not the URL where the broadcast logs are loaded from. Instead, looking at the iframe source, you’ll see they come from URLs with the following format: http://music.cbc.ca/broadcastlogs/broadcastlogs.aspx?broadcastdate=YYYY-MM-DD
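
As a minimal sketch (not necessarily how the actual script does it), the URL format above can be turned into one URL per day for a date range. The names URL_TEMPLATE and daily_log_urls are made up for this illustration.

```python
from datetime import date, timedelta

# URL format taken from the iframe source described above.
URL_TEMPLATE = (
    "http://music.cbc.ca/broadcastlogs/broadcastlogs.aspx?broadcastdate={}"
)


def daily_log_urls(start, end):
    """Yield one broadcast log URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        # date.isoformat() produces the YYYY-MM-DD form the URL expects.
        yield URL_TEMPLATE.format(day.isoformat())
        day += timedelta(days=1)


urls = list(daily_log_urls(date(2015, 1, 1), date(2015, 1, 3)))
for url in urls:
    print(url)
```

Each URL can then be fetched and parsed on its own, which keeps the date handling separate from the scraping.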

Scraping

Now that we have the URL template to use when requesting a specific day of broadcast/play logs, the next step is to learn how to scrape and extract the relevant data. Scraping is far less preferable than a structured API, but thankfully the HTML is relatively well-structured:

<div class="logShowEntry">
  <div class="logEntryTime fB s11">
      5:07 AM</div>
  <div class="logTrack">
    <h3 class="fCm s21">
      DRAW A CROWD</h3>
    <dl class="s12">
      <dt>artist</dt><dd>Ben Folds Five</dd><dt>composer</dt><dd>Folds- Ben</dd>
      <dt>album</dt><dd class="fB">Draw A Crowd (Clean Edit)(Single)</dd>
      <dt>label</dt><dd>Legacy</dd>
      <dt>duration</dt><dd>03:56</dd>
    </dl>
  </div>
</div>

You can see that all the information is available within each div.logShowEntry:

  • The time the song was played at is in a div.logEntryTime
  • The track/song name is in a .logTrack h3
  • The remainder of the attributes are in a dl (a description/definition list)
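
To make the three bullets above concrete, here is a rough sketch using only Python’s standard-library html.parser. It is an illustration of the idea, not the code from the gist; the class name LogEntryParser is made up, and I’ve assumed that a repeated dt label simply overwrites the earlier value.

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="logShowEntry">
  <div class="logEntryTime fB s11">
      5:07 AM</div>
  <div class="logTrack">
    <h3 class="fCm s21">
      DRAW A CROWD</h3>
    <dl class="s12">
      <dt>artist</dt><dd>Ben Folds Five</dd><dt>composer</dt><dd>Folds- Ben</dd>
      <dt>album</dt><dd class="fB">Draw A Crowd (Clean Edit)(Single)</dd>
      <dt>label</dt><dd>Legacy</dd>
      <dt>duration</dt><dd>03:56</dd>
    </dl>
  </div>
</div>
"""


class LogEntryParser(HTMLParser):
    """Collect one dict per div.logShowEntry (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._field = None  # field the current text belongs to
        self._key = None    # most recent <dt> label
        self._text = []

    def _start(self, field):
        self._field, self._text = field, []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "logShowEntry" in classes:
            self.entries.append({})
        elif not self.entries:
            return  # ignore markup before the first entry
        elif tag == "div" and "logEntryTime" in classes:
            self._start("time")
        elif tag == "h3":
            self._start("title")
        elif tag in ("dt", "dd"):
            self._start(tag)

    def handle_data(self, data):
        if self._field:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._field is None or tag not in ("div", "h3", "dt", "dd"):
            return
        # Collapse the newlines and indentation around the text.
        text = " ".join("".join(self._text).split())
        if self._field == "dt":
            self._key = text
        elif self._field == "dd":
            if self._key is not None:
                # Assumption: a repeated label overwrites the earlier value.
                self.entries[-1][self._key] = text
        else:
            self.entries[-1][self._field] = text
        self._field = None


parser = LogEntryParser()
parser.feed(SAMPLE)
entry = parser.entries[0]
print(entry)
```

A library such as Beautiful Soup would make this shorter; the standard-library version is used here only so the sketch runs without extra installs.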

The description/definition list makes it easy to grab this data as a map (key-value pairs), and the time and title can then be added to it. Sometimes, especially for classical music aired during the “Choral Concert” segment, there are many other attributes in the dl. This script ignores them and limits its output to only the following fields: date,time,label,artist,composer,album,title,duration.

Here’s a truncated example of the output, which is valid CSV generated by the Python csv library.

$ ./cbc_radio_broadcast_logs.py --start=2015-01-01
# Results from 2015-01-01 to 2015-03-09.
date,time,label,artist,composer,album,title,duration
2015-01-01,12:00 AM,Soft Revolution,Stars,Campbell- Torquil,From The Night (Radio Edit),FROM THE NIGHT,04:06
2015-01-01,12:04 AM,Rubyworks,Hozier,Hozier,Hozier,FROM EDEN,03:42
2015-01-01,12:07 AM,Mungo Park,Bobby Bazini,,Better In Time,MELLOW MOOD,02:51
2015-01-01,12:10 AM,Last Gang,The New Pornographers,Newman- A C,The New Pornographers: Brill Bruisers,CHAMPIONS OF RED WINE,03:40
2015-01-01,12:14 AM,Bloodshot,Ryan Adams,Adams- Ryan,Ryan Adams,FEELS LIKE FIRE,04:25
2015-01-01,12:18 AM,Dine Alone,Ivan & Alyosha,"Wilson- Tim,Wilson- Pete,Kim- Tim,Carbary- Ryan",All The Times We Had,BE YOUR MAN,03:56
2015-01-01,12:22 AM,True North,Lynn Miles,,Love Sweet Love,NEVER COMING BACK,02:57
2015-01-01,12:25 AM,Universal,Sarah Mclachlan,"Marchand- Pierre,Mclachlan- Sarah",Monsters (Remix)(Single),MONSTERS,03:11
2015-01-01,12:28 AM,Sub Pop,Blitzen Trapper,Earley- Eric,Furr,FURR,04:07
2015-01-01,12:32 AM,Dangerbird,Butch Walker,Walker- Butch,Bed On Fire (Single),BED ON FIRE,03:57
2015-01-01,12:36 AM,Six Shooter,Amelia Curran,Curran- Amelia,They Promised You Mercy,NEVER SAY GOODBYE,03:54
2015-01-01,12:40 AM,Universal,U2,U2,U2: War,"""40""",02:37
2015-01-01,12:42 AM,Glassnote,Mumford & Sons,"Lovett- Ben,Marshall- Winston,Dwane- Ted,Mumford- Marcus",Babel,BABEL,03:22
...
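
For illustration, output in this shape can be produced with csv.DictWriter. This is a sketch under my own assumptions rather than the script’s actual code: the write_csv helper and the extra "producer" key are made up, and restval/extrasaction are used so that missing fields come out blank and unknown fields are dropped.

```python
import csv
import io

FIELDS = ["date", "time", "label", "artist", "composer", "album", "title", "duration"]


def write_csv(entries, out):
    """Write entries as CSV, keeping only the columns listed in FIELDS."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        restval="",             # missing fields become empty columns
        extrasaction="ignore",  # fields outside FIELDS are dropped
    )
    writer.writeheader()
    writer.writerows(entries)


entries = [
    {
        "date": "2015-01-01", "time": "12:40 AM", "label": "Universal",
        "artist": "U2", "composer": "U2", "album": "U2: War",
        "title": '"40"', "duration": "02:37",
        "producer": "(made-up extra field, ignored)",
    },
    {
        # No composer here, so that column is left blank.
        "date": "2015-01-01", "time": "12:07 AM", "label": "Mungo Park",
        "artist": "Bobby Bazini", "album": "Better In Time",
        "title": "MELLOW MOOD", "duration": "02:51",
    },
]

buffer = io.StringIO()
write_csv(entries, buffer)
print(buffer.getvalue())
```

The csv module takes care of quoting, which is why a title like "40" and comma-separated composer lists stay valid CSV.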

(Note: --search-artist allows you to limit the results to a specific artist, but I recommend leaving it out/empty to return all entries in the broadcast log. You can then save an offline copy and search that later, rather than continually hammering the site, in order to be a good citizen.)
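
Searching a saved offline copy takes only a few lines. The following is a rough sketch, assuming the saved file has the same layout as the output above (a leading "# Results from ..." line, then the header row); the function name search_saved_log is made up.

```python
import csv
import io


def search_saved_log(lines, artist):
    """Yield rows whose artist contains the given text, case-insensitively."""
    # Skip the leading "# Results from ..." line so the header row comes first.
    rows = csv.DictReader(line for line in lines if not line.startswith("#"))
    needle = artist.lower()
    for row in rows:
        if needle in row["artist"].lower():
            yield row


# Stand-in for a saved file, using rows from the sample output above.
saved = io.StringIO(
    "# Results from 2015-01-01 to 2015-03-09.\n"
    "date,time,label,artist,composer,album,title,duration\n"
    "2015-01-01,12:04 AM,Rubyworks,Hozier,Hozier,Hozier,FROM EDEN,03:42\n"
    '2015-01-01,12:40 AM,Universal,U2,U2,U2: War,"""40""",02:37\n'
)

matches = list(search_saved_log(saved, "u2"))
for row in matches:
    print(row["date"], row["time"], row["title"])
```

With a real file, the StringIO would be replaced by something like open("logs.csv", newline=""), where logs.csv is whatever name you saved the output under.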

You can grab the script from a gist here. One final usage note: Broadcast log data appears to be spotty before 2012.

14 Comments »

  1. How lucky that I stumbled across this great info! Well done, Peter! I have similar needs in that I want to record what is playing while I hear it on CBC R2. I was using a Google API (PageSpeed) to get a screen capture of the mobile version of the playlog but this doesn’t work well since the timezone that Google analyzes the webpage from is not EST so I get the wrong section of the playlog.

    With the information you’ve provided here, I hope to generate a PHP script to home in on the current playlog item.

    I’m a noobie programmer so hopefully you’ll be able to help me by answering some questions I’ll have along the way.

    Great post!

  2. Update: after some analysis of your Python script, it’s become clear that a much better approach for a noob like me is to have my PHP simply execute your Python script and then continue analyzing the results from PHP 🙂

  3. Whoa, serious parsing problems with the wonky play logs for the “The Radio 2 Top 20” aired ~7pm to ~8pm today. What do you make of that?

    There’s also overlap of play items from ~8pm to ~8:15pm where items from two different shows are shown in the play log. That will wreak havoc with my algorithm to narrow in on the exact item playing at an exact time.

  4. The broadcast log for today: http://music.cbc.ca/#!/broadcastlogs/broadcastlogs.aspx?broadcastdate=2015-06-15 has a parsing anomaly for the 6:06 AM entry for label “A&M”. Perhaps the Python script is not escaping the ‘&’ correctly?

  5. Hi Bastian,

    I’m unable to duplicate your issue. Here’s what the script output for June 15 at 6:06 AM:

    A&M:
    2015-06-15,6:06 AM,A&M,Pulp,"Cocker- Jarvis,Senior- Russell,Mackey- Steve,Banks- Nick,Doyle- Candida","A&M/Island/Motown Radio Sampler, Vol. 1, January 29, 1996",COMMON PEOPLE,04:08

    Also, I’m unable to see any issues with the parsing on June 12th at the times you noted.

    Can you describe, in more detail, the issues you are noticing? That is, what output are you expecting, what output is the script generating, and what are the differences/errors?

    Thanks.

  6. Thanks for your help, Peter. FYI, there’s no preview function when composing a post here so I can’t guarantee the accuracy of my formatting. I also cannot edit posts to delete or change information/formatting.

    Here are some more details: I’m calling your Python script from PHP like so:

        $pythonScriptResult = shell_exec($pythonScript);

    When I look at the raw string that is returned for 2015-06-15 using

        $HTMLmessage .= ''.str_replace( PHP_EOL, '', $pythonScriptResult );

    here is what I see:

    # Results from 2015-06-15 to 2015-06-15.
    date,time,label,artist,composer,album,title,duration
    2015-06-15,6:06 AM,A&M;,Pulp,"Cocker- Jarvis,Senior- Russell,Mackey- Steve,Banks- Nick,Doyle- Candida","A&M;/Island/Motown Radio Sampler, Vol. 1, January 29, 1996",COMMON PEOPLE,04:08
    2015-06-15,6:11 AM,Old Farm Pony,Fortunate Ones,"Allan- Catherine,O'brien- Andrew James",The Bliss,THE BLISS,03:41
    I’ll post more about the June 12 anomaly once I debug it again.

  7. In the previous post, the empty string '' actually has HTML BR tags that are stripped by your blog form post code.

    For the broadcast log of June 12, look between 7pm and 8pm at the Top 20 show entries. There’s no ‘artist’ field in the CBC R2 log so this is just blank. Maybe you could detect this case and use the first ‘composer’ field instead, which seems to be the artist.

  8. …although there’s a good argument for keeping your Python script pure – I can code the exception into my PHP for the TOP 20 show anomaly.

  9. I tried catching the exception of a null artist field and using the subsequent field but it’s not the correct composer field. For example, the June 12 7:11 PM entry: CBC log has:
    7:11 PM
    OH DOLORES

    composer
    LUKE DOUCET
    composer
    MELISSA MCCLELLAND
    composer
    GUS VAN GO
    producer
    GUS VAN GO
    producer
    WERNER F
    pop group
    WHITEHORSE
    album
    LEAVE NO BRIDGE UNBURNED/WHITEHORSE
    label
    SIX SHOOTER
    duration
    03:47

    but your script returns “2015-06-12,7:11 PM,SIX SHOOTER,,GUS VAN GO,LEAVE NO BRIDGE UNBURNED/WHITEHORSE,OH DOLORES,03:47”
    which only has the 3rd composer field whereas we really need the 1st composer field “LUKE DOUCET” who was the performing artist.

  10. Would your script work on: CBC’s webradio log for icimusique?
    http://www.icimusique.ca/#!radio/musiques-diffusees

  11. Hi Bastian,

    Unfortunately, my script cannot handle multiple composers – it’s rather simple and assumes a certain format for each of the entries, which may not hold for all entries on the website. I basically coded it to handle >99% of the cases and to output in a simple CSV format. (Multiple values for a single field don’t work well with CSV)

    Such is the nature of scraping a website. Same goes for the case where there is no “artist” entry; the entry in the CSV will simply be blank – the script is not “smart” enough to know to use the “composer” field in lieu of “artist” not being available.

    For the same reason, the script will not work for the icimusique log; it’s basically hard-coded to work against the HTML that’s used on the CBC Radio 2 website. (If Radio 2 changed their play log format, this script would stop working too.)

    If you need custom behaviour, I encourage you to modify the script to suit your needs – the code is not too complicated.

  12. No support? No further development? OK I’ll plough ahead on my own 🙁

  13. Hi Bastian,

    I’ll see what I can do about handling multiple composers. Since the output is CSV, I’d have to create multiple fields like “composer_1”, “composer_2”, etc. This obviously doesn’t scale and is the reason why I didn’t bother with this – CSV isn’t well-suited to handling multiple values for a single entry.

    An alternative would be to change the output format to something like JSON, which can easily handle scenarios like this. However, this will involve changes in your code as well. Note that I explicitly did not pick something like JSON because I intended for this to be a command-line tool, something that could be fed into grep/sed/awk/etc. and something row-oriented like CSV makes sense in this case. It seems like you require something substantially different – which is why I suggested that you modify the code to suit your needs – because the aim of my script was to be a command-line tool, not a library or API. (See: Unix Philosophy)

    As for the case where there is no “artist” field and only a “composer” field – I suggest that your calling code handle this case, as it would be inaccurate to put the composer details into the artist column merely because there was no artist entry from the source. (i.e., the CBC playlogs) The output should remain true to what the source reflected.

  14. Thanks, Peter. I’m actually coping quite well with your original script. I’m calling it from PHP and parsing the results successfully.

    The missing Artist field rarely happens – only in the Top 20 show right now and I can live without it rather than you or I coding a bunch of exception cases.

    Thanks for your help.
