History

Back in 2002 I wanted to time-shift radio programs for later listening and also to listen in the car. Since a computer can automatically record at the appropriate times, this was the obvious choice of method. All I had to do was to connect up the sound card to the tuner, set up Linux cron jobs and I would have WAV files which I could then convert to MP3. After a few months it occurred to me that I should also save details of what was broadcast so that I could later find out details of the program, the songs and the artists. So after the recording I would save a copy of the web page of the program details. So began a practice that has continued to the present day. I even recorded programs during vacations as my Linux workhorse is on 24/7 so I only needed to keep the tuner on.

Technology

Some comments about the technology I used, which has evolved over the years. In the early days I played some of the files at more convenient times. I also copied others onto cassette tape to play in the car. When car cassette players dwindled I switched to copying the programs onto a personal MP3 player connected to the AUX input of the car CD player. At the time car MP3 CD players were rare and audio CD R/Ws too slow to burn. Eventually I got a car MP3 CD player and I started making MP3 data CDs to listen to in the car. But that is also going the way of the dodo and I'm down to my last CD R/W discs. I'll probably switch to one of: 1. a flash drive containing the files (the player has a USB slot), 2. a personal player connected to the AUX input, 3. a smartphone beaming the audio via Bluetooth to the player. The first method would seem to be a direct replacement, but for some reason with flash drives it takes much longer for the player to find the restart point after starting the car.

The sound source also changed over the years. In the beginning I used a hifi tuner. Then FM radio cards became available and I installed one in my PC. The signal path was still analog though: a jumper cable from the audio out of the tuner to the audio in of the sound card. Finally when the programs were moved off the AM and FM bands to DAB+ I bought DAB+ radios to feed the sound card.

A few years ago I changed the compression method from MP3 to AAC for better quality, but I still generate MP3s on the side for consumption in the car.

So my history is a reflection of the progress in technology over the last 2 decades.

Methodology

When I saved the MP3 files the naming scheme I used was: program-YYYYMMDD.mp3. Later on when more than one program could be recorded in a day I appended HH. So it's possible to tell when a program was recorded without looking at the timestamps, which can be altered by copying. But it appears I have been careful to preserve the modification time (mtime) attribute through generations of workhorses copying from HD to HD.
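As a rough illustration of how this naming scheme can be taken apart, here is a minimal sketch. The regular expression and the function name splitname are my own assumptions for the example, not part of the indexer, and the filenames are made up.

```python
import re

# Hypothetical pattern for names like "am-20120214.mp3" or "am-2012021415.mp3"
NAME = re.compile(r"^(?P<program>[a-z]+)-(?P<ymd>\d{8})(?P<hh>\d{2})?\.mp3$")


def splitname(filename):
    """Return (program, YYYYMMDD, HH or None), or None if the name doesn't fit"""
    match = NAME.match(filename)
    if not match:
        return None
    return match.group("program"), match.group("ymd"), match.group("hh")


print(splitname("am-20120214.mp3"))
print(splitname("am-2012021415.mp3"))
```

The hour group is optional, so both the older date-only names and the later date-plus-hour names are handled by the same pattern.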

The HTML files are saved with the htm extension, originally as named on the website, but later I switched to saving them with the date and hour encoded in the filename. Fortunately there is metadata in the HTML that records the date. But not in all cases.

Algorithm

The desired output is an index.html file containing a table with one row for each program, with links to the playlist HTML file and the audio MP3 file(s), the date and a description of the program. Note that we are not doing a high volume of data processing since we only parse the HTML files and only handle the MP3 files by name, so run time isn't an issue. Here is an example of desired output:

Data structures

We need two main data structures. First, a dictionary of MP3 files, keyed by date. Note that a date could have more than one MP3 file, as some programs had two segments broadcast at different times on the day, so the mp3files attribute is actually a list. Second, as each HTML file is parsed, create an associative array of the attributes (title, HTML file link, description). The keys of this associative array are fixed, i.e. "title", "htmfile", "description".

The algorithm is then: populate the MP3 dictionary with the mp3 filenames, read in the HTML playlists one by one and augment the appropriate MP3 dictionary entry from the attributes in the playlist associative array, and finally render the output as HTML.
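To make the shapes concrete, here is a small sketch of the three stages an entry goes through. The filenames, title and description are invented for illustration, and the variable names are only assumptions for this example.

```python
# Stage 1: dates map to a plain list of MP3 filenames (filenames are made up)
pldict = {"20120214": ["am-2012021415.mp3", "am-2012021500.mp3"]}

# Stage 2: a parsed playlist yields an associative array with fixed keys
info = {"title": "Example title",
        "htmfile": "am-2012021415.htm",
        "description": "Example description"}

# Stage 3: the list is replaced by a full entry that keeps the MP3 list
ymd = "20120214"
pldict[ymd] = {"mp3files": pldict[ymd], **info}

print(pldict[ymd]["mp3files"])
print(sorted(pldict[ymd]))
```

An entry that is still a plain list after all playlists have been read is one for which no playlist was found.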

Program

I decided to use Python for this project. There is no need for efficiency but there is a need for a quick edit and test cycle so compiled languages are out. Python is not my favourite language (this can be a separate running argument), but I knew that Python had extensive libraries and modules to do the heavy lifting so I wouldn't have to write many lines of code, only to understand how to leverage existing resources. It's in widespread use and cross-platform so my opus would be usable by a wide audience. So I'll explain the program in sections, starting with the main routine.

Main

#!/usr/bin/env python3

"""Generate HTML index file from playlist HTMLs and MP3 filenames"""

import sys
import glob
import string
import re
from datetime import date, timedelta
from html.parser import HTMLParser
import pystache


CHARSET = "ISO8859-1"           # Windows-1252 for older playlists

[...]

HTMFILES = sorted(glob.glob("*.htm"))
MP3FILES = sorted(glob.glob("*.mp3"))
PREFIXES = re.compile("am-|los-")
DATEPREF = re.compile("am-")
WANTED = re.compile("title|dc.title|date|abc-datereversed|description", re.I)
DIGITS = re.compile("[0-9]{8}")
PLDICT = processmp3list(MP3FILES)
for htmfile in HTMFILES:
    INFO = {}
    # print(htmfile, file=sys.stderr)
    parseplaylist(htmfile)
    pldate = getbestdate(htmfile, INFO)
    if pldate in PLDICT:
        title = getbesttitle(INFO)
        description = INFO.get("description", "")
        makedictentry(pldate, htmfile, title, description)
    else:
        print(f"{htmfile} has no corresponding audio file", file=sys.stderr)
print(f"{len(PLDICT)} entries", file=sys.stderr)
for pldate in sorted(PLDICT):
    if isinstance(PLDICT[pldate], list):
        print(f"{pldate} has no playlist information, adding dummy info",
              file=sys.stderr)
        PLDICT[pldate] = {"htmfile": "",
                          "mp3files": PLDICT[pldate],
                          "ISODate": isodate(pldate),
                          "title": "Unknown",
                          "description": ""}
render(PLDICT)

First we get lists of the HTML and MP3 files in the current directory using the glob module and sort them.

Then we generate some preparsed regular expressions that will be used in the program.

We process the MP3 files in a function, which will be explained below.

We then iterate through the HTML files, parsing each one to get the date, title and description attributes. We attach these attributes to the dated entry in the PLDICT dictionary. If there is no audio file corresponding to the HTML file, we display an error to sys.stderr. This happened in only a handful of cases probably because I deleted the audio file because it wasn't suitable for one reason or another (ruined by recording accident, didn't like the program, or some other reason) and I didn't delete the corresponding HTML playlist.

The converse could be true, an audio file exists but there is no HTML playlist for it. This is detected by the dictionary entry being only a list of MP3 files instead of an associative array. In a handful of cases this happened because the playlist was never published or more rarely, I failed to fetch the HTML playlist after the recording. In a cluster of playlists it was because the date metadata fields were not populated so the audio file was effectively "orphaned". In those cases I was able to repopulate that attribute from the date in the title, using another quick and dirty Python script. In a few remaining cases I really don't have any information on the audio so I generate a dummy set of attributes rather than discard the links to the audio.

Finally we render the dictionary as a HTML table.

Now let's drill down to the functions called.

processmp3list

def processmp3list(mp3files):
    """Create a dictionary of MP3 files keyed by 8-digit date"""
    mp3dict = {}
    for mp3file in mp3files:
        ymd = getcorrdate(mp3file)
        if ymd in mp3dict:
            mp3dict[ymd].append(mp3file)
        else:
            mp3dict[ymd] = [mp3file]
    return mp3dict


def getcorrdate(mp3file):
    """Get corresponding date for MP3 file. If hour = 0, use previous day"""
    ymd = PREFIXES.sub("", mp3file)[0:10]
    if ymd[8:10] == "00":
        ymd = (date(int(ymd[0:4]), int(ymd[4:6]), int(ymd[6:8]))
               - timedelta(days=1)).strftime("%Y%m%d")
    return ymd[0:8]

This is straightforward: just extract the YYYYMMDD date from the MP3 filename by removing the program prefix and either create a 1-element list or append to the current list. However there is one twist. One radio program had two segments, one at 1500 in the afternoon and the other at 12 midnight. So if HH is 00, then we need to associate this MP3 file with the previous day. Fortunately Python has the datetime module to handle this. Rather than try to figure out the previous day in the calendar and also correctly handle programs that cross months, or even in one case cross years, I just subtract one day from the date and return that as a string. Hence that long one-liner. Laziness is also correctness.
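The month and year boundary cases can be checked in isolation. This is a minimal sketch of the same date arithmetic; the function name previousday is hypothetical and exists only for this example.

```python
from datetime import date, timedelta


def previousday(ymd):
    """Return the day before an 8-digit YYYYMMDD date, as YYYYMMDD"""
    day = date(int(ymd[0:4]), int(ymd[4:6]), int(ymd[6:8]))
    return (day - timedelta(days=1)).strftime("%Y%m%d")


print(previousday("20120301"))  # leap year: the day before is 29 February
print(previousday("20130101"))  # crosses a year boundary
```

The datetime module takes care of month lengths and leap years, so no calendar logic is needed in the caller.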

parseplaylist

HTML is not XML so I couldn't use an XML parser for this. However Python has the html.parser module, which provides the HTMLParser class. It has a simple API. You derive a class from it and then implement callbacks for various events as the text is read. Here it is:

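def parseplaylist(hfile):
    """Parse one playlist file"""
    parser = MyHTMLParser()
    try:
        # The with block closes the file even if reading fails
        with open(hfile, encoding=CHARSET) as file:
            parser.feed(file.read())
    except OSError:
        # Report on stderr, as stdout carries the generated HTML
        print(f"Cannot open {hfile}", file=sys.stderr)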


class MyHTMLParser(HTMLParser):
    #pylint: disable=W0223
    """Parser that will process a playlist HTML file"""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        name = findattr(attrs, "name")
        if not name:
            return
        name = name.lower()
        content = findattr(attrs, "content")
        if WANTED.match(name) and content:
            INFO[name] = content if name.find("date") >= 0 else bytes(
                content, "raw-unicode-escape").decode(CHARSET, "replace").encode(CHARSET)


def findattr(attrs, name):
    """Return value of attribute with name or None"""
    for attr in attrs:
        if attr[0] == name:
            return attr[1]
    return None

The main function is straightforward, open the file, signal an error if that fails, then read the entire contents and feed it to the parser.

I thought I would have to handle tags like <title> but it turned out that the files had <meta> tags, of the form <meta name="Date" content="20120214">. So I only had to handle the starttag event. The attributes are returned as a list and as we need to look for name and content, I wrote a utility function to extract those. WANTED is a regular expression that matches all the tags we want. It turns out there was more than one tag specifying the title, and similarly for the date. Also the names could be in any case so the match is case-insensitive, see the re.I flag passed to re.compile. For date attributes we know the characters are in the ASCII subset, but the title and description attributes are in whatever charset the author typed in. Hence the convoluted encoding and decoding to get the information in UTF-8, the natural representation of Python. This line is a bit of black magic; I had to suffer through invalid character exceptions from Python's reading routines to work it out.

Note that INFO is a global associative array. Maybe there's a way to pass the information back through the closure but I didn't look too hard.
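One possible way to avoid the global, sketched here under my own assumptions rather than taken from the indexer, is to keep the collected attributes on the parser instance and read them back after feeding the text. The names MetaCollector and collectmeta are hypothetical, and this sketch skips the WANTED filter and the charset handling.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs on the parser instance"""

    def __init__(self):
        super().__init__()
        self.info = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrdict = dict(attrs)          # attrs is a list of (name, value) pairs
        name = attrdict.get("name")
        content = attrdict.get("content")
        if name and content:
            self.info[name.lower()] = content


def collectmeta(text):
    """Parse HTML text and return the collected meta attributes"""
    parser = MetaCollector()
    parser.feed(text)
    parser.close()
    return parser.info


sample = '<html><head><meta name="Date" content="20120214"></head></html>'
print(collectmeta(sample))
```

Each call builds a fresh parser, so attributes from one playlist cannot leak into the next.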

makedictentry

def makedictentry(ymd, hfile, ttl, descr):
    """Make a (unique) entry for playlist dictionary"""
    if isinstance(PLDICT[ymd], list):
        PLDICT[ymd] = {"htmfile": hfile,
                       "mp3files": PLDICT[ymd],
                       "ISODate": isodate(ymd),
                       "title": ttl,
                       "description": descr}
    else:  # append some fields
        replace = None
        for letter in string.ascii_lowercase:
            if ymd + letter not in PLDICT:
                replace = letter
                break
        if not replace:
            return
        ymdl = ymd + replace
        PLDICT[ymdl] = {"htmfile": hfile,
                        "mp3files": PLDICT[ymd]['mp3files'],
                        "ISODate": isodate(ymd),
                        "title": ttl,
                        "description": descr}

Now that we have the attributes let's fill in the playlist dictionary entry. If the mp3files attribute is a list, then we can just create an entry and replace the list with an associative array, copying over the list. Otherwise it's a bit more complicated. Some other playlist has already claimed the mp3files because it was a different program on the same day. So we create a new dictionary entry under date + unique letter. We cycle through the alphabet until we find a free one. In practice we won't have more than 2 clashes so we'll never get to z.
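The search for a free suffix can be shown on its own. This is a minimal sketch with a hypothetical helper named freekey and made-up entries; it is not the code used in the indexer.

```python
import string


def freekey(existing, ymd):
    """Return ymd plus the first lowercase letter not yet used as a key, or None"""
    for letter in string.ascii_lowercase:
        if ymd + letter not in existing:
            return ymd + letter
    return None


entries = {"20120214": {}, "20120214a": {}}
print(freekey(entries, "20120214"))
```

Because the suffixed keys share the date as a prefix, sorting the keys keeps same-day programs next to each other in the output.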

isodate is a trivial routine to format the date as YYYY-MM-DD.

def isodate(ymd):
    """Return ISO formatted date from 8-digit date"""
    return ymd[0:4] + "-" + ymd[4:6] + "-" + ymd[6:8]

Utility functions

A couple of utility functions used in the main loop.

def getbestdate(hfile, info):
    """Use the best metadata for date"""
    best = None  # avoid an unbound name if no branch below applies
    if DATEPREF.match(hfile):
        best = DATEPREF.sub("", hfile)[0:8]
    elif "date" in info:  # use the parameter, not the global INFO
        best = info["date"]
    if not best or not DIGITS.match(best):  # use the other date
        best = info["abc-datereversed"]
    return best


def getbesttitle(info):
    """Use the best metadata for title"""
    if "dc.title" in info:
        return info["dc.title"]
    elif "title" in info:
        return info["title"]
    return "No title"

The best value for the date is obtained from the filename, if it's of the form program-YYYYMMDD[HH].htm, otherwise from the date attribute in a meta. However it turned out that the authors were sometimes in the habit of writing that attribute as DD/MM/YYYY. So if it isn't suitable we use the alternative abc-datereversed which is always in the YYYYMMDD form. In one case both date attributes were blank, causing an exception. I edited that file by hand to insert the right date.

Similarly the title is obtained from either the dc.title or the title attribute in a meta, in that order of preference.
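For comparison, an alternative to falling back on abc-datereversed would be to normalize a DD/MM/YYYY value directly. The indexer does not do this; the sketch below is only an illustration, and the function name normalizedate is hypothetical.

```python
from datetime import datetime


def normalizedate(text):
    """Return YYYYMMDD for a DD/MM/YYYY or YYYYMMDD string, else None"""
    for fmt in ("%Y%m%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue    # not in this format, try the next one
    return None


print(normalizedate("14/02/2012"))
print(normalizedate("20120214"))
```

A blank attribute gives None rather than an exception, which the caller would still have to handle.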

render

Finally we come to render the dictionary as a HTML file. I knew from working on web projects that a template was the way to go. Most templates are however tied to a framework. I didn't want to drag in a whole framework just to generate HTML. Fortunately I discovered the very lightweight Mustache template language which has ports to various languages. In Python this is available as pystache, and all I had to do was import it.

def render(proglist):
    """Render the list as a web page"""
    renderer = pystache.Renderer()

    print("""<!DOCTYPE html>
<html>
<head>
<title>Playlist</title>
<style>
table {
  border-collapse: collapse;
}
table, th, td {
  border: 1px solid black;
}
</style>
</head>
<body>
<table>
<tr>
<th align="left">Playlist</th>
<th align="left">Audio</th>
<th>Description</th>
</tr>""")

    parsed = pystache.parse("""<tr>
<td><a href="{{htmfile}}">{{{title}}}</a></td>
<td>{{#mp3files}}{{#.}}<a href="{{.}}">MP3</a> {{/.}}{{/mp3files}}</td>
<td><strong>{{ISODate}}</strong> {{{description}}}</td>
</tr>""")
    for ymd in sorted(proglist):
        program = proglist[ymd]
        print(renderer.render(parsed, program))

    print("""</table>
</body>
</html>""")

We simply generate the preamble and the start of the table, then loop through the playlist, now called a program list, rendering according to the template. A couple of things to note. First, there is a small inner loop in the line:

<td>{{#mp3files}}{{#.}}<a href="{{.}}">MP3</a> {{/.}}{{/mp3files}}</td>

This iterates through the list of MP3 files generating a hyperlink for each.

Secondly note that title and description are surrounded by triple mustaches, rather than double. This bypasses the HTML escaping for double mustaches, as those attributes can contain HTML tags and character entities.
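To see why the escaping matters, here is a small illustration using the standard library's html.escape as a stand-in for the escaping applied to double mustaches. That pystache escapes in exactly this way is an assumption on my part, and the title is made up.

```python
import html

title = "<em>Live</em> &amp; unplugged"   # made-up title containing markup

escaped = html.escape(title)   # roughly what {{title}} would emit (assumed)
raw = title                    # what {{{title}}} emits: the string unchanged

print(escaped)
print(raw)
```

With escaping, the browser would display the tags and the entity literally instead of interpreting them.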

Finally we generate the end of the table and the end of the HTML file.

Running the indexer on a directory of one year's recordings only takes a few seconds. As mentioned, only text files are processed.

Cleaning up

Well I don't want to show my dirty programs to the world so I checked the program for poor practices with pylint.

$ pylint amtoc.py
No config file found, using default configuration

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

and autopep8 for proper formatting

$ autopep8 amtoc.py | diff -u - amtoc.py
$

Clean bill of health. No corvids were harmed.

Closing remarks

Well that's it. You can view the complete program, just under 200 lines, in the files section. The main takeaway is that if you get to know the resources available in the programming language you can save a tremendous amount of work by building on the work of others before you.

Here's that screenshot again, the index page for 2012 for one program.