Think Twice, Cut Once

Hello y’all! I don’t want to foolishly head down a road of poor-thoughtoutness. I’ve done that. It can teach you a lot, and failure is important, but this time I’d much rather know where I’m headed and why I’m doing what I’m doing. This is what it looked like last time.

With that in mind, I’m sharing what I’m working on and requesting feedback before I keep on keeping on.

Project: Make a database/interface for my comic book collection akin to

Each comic will be encoded in RDF/XML using the Comic Book Ontology (CBO) and supplemented with other metadata schemes when that won’t satisfy. Check out more on CBO, via the creator’s thesis.

So here’s my big question (and I openly submit that as a total linked data newbie, it’s an ignorant one…)

I can put data into RDF, I can then transform it into HTML for web viewing, but then it’s not in linked triples anymore. My fundamental question is, how do you do both? Does it fulfill linked data requirements if I just have a human readable interface of HTML/CSS with RDFa attributes inside those HTML elements?

I’m genuinely asking here — This is a project intended to better help me understand how to create linked data, and then use it on the actual web. Does what I’ve laid out sound like a correct way of doing that?
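To make the second option concrete, here is a minimal sketch (in Python, purely for illustration) of an HTML fragment that a person can read while the statements stay machine-readable in RDFa attributes. The vocabulary URL and property names are placeholders I made up, not real CBO terms, and the little collector below is not a real RDFa processor; it only shows that the data is still sitting in the markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the vocab URL, type and property names are
# placeholders for illustration, not actual Comic Book Ontology terms.
PAGE = """
<div vocab="http://example.org/comics#" typeof="Comic" resource="#issue-1">
  <h2 property="title">Example Comic</h2>
  <span property="issueNumber">1</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from elements with a `property` attribute.

    This is a toy reader, not a conforming RDFa parser.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name waiting for its text

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop

    def handle_data(self, data):
        # Pair the pending property with the first non-blank text after it.
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None


def collect_properties(html):
    parser = PropertyCollector()
    parser.feed(html)
    return parser.pairs
```

A browser renders the heading and the issue number as ordinary text, while `collect_properties(PAGE)` pulls the same values back out as property/value pairs.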





Emflix – Index

Here’s a (growing) index of all my Emflix posts for both you and me!

Here’s a link to the project preserved in amber forever

Background and Beginnings

XML Structure

Genres and subgenres and subsubgenres


Emflix – Part 14 – Subgenres – Anime and Animation

Link to Part 13 – Rule of 5

Continuing with the subgenres now, I’ll move to “Anime and Animation”

Compare again Netflix’s listing of genres to mine.


Here are my subgenres for said category with asterisks for ones to be discussed (i.e. the ones which are absent or different from Netflix’s)
Animation for Grown-ups
Family Animation*


  • Anime
    • Due to the Rule of 5, all of the Anime ____ subgenres were consolidated into one. There just wasn’t enough Anime in the Emerson collection to justify subdividing them.
    • Also note that Netflix has “Anime Series” as a subgenre under ‘Anime & Animation’…but there’s a separate genre for ‘TV Shows’ in which it doesn’t appear! What’s going on there? (There are other seemingly TV genres which only appear under a film genre and not the TV one.) Netflix, you a mess.
  • Family Animation
    • Added for hybrid consistency

I went back and forth several times on trying to decide on distinguishing between “family friendly” animation and…less family friendly. I ended up keeping that distinction, because every animation thing I read seemed to differentiate them, though they disagreed on terminology.


Link to Part 15 – Subgenres – Children and Family

Emflix – Part 13 – Rule of 5

Link to Part 12 – SubGenres – Action and Adventure

Sure, you know the Rule of 3. Some of the coolest of you even know the Rule of 4.

But who knows the Rule of 5? No one, because I made it up. My rule is one higher than the highest rule LC has. So does that make me one higher? I don’t know, I’m just a simple cataloger doing his best. You decide.

As I’ve said a million times (and will say again…), Emflix was a lot smaller than Netflix. That is, whereas the Emerson Library had some 3000+ DVDs, Netflix has some 93,000+ (according to them). One of the things that means is that while Netflix can populate its numerous genres and categories, Emflix struggled to keep up. I started noticing subsubgenres which had a single item, and I thought it would be disappointing to users to click through to a subsubgenre only to discover that their options were nearly non-existent.

So I instituted the Rule of 5. Unless a subsubgenre had 5 or more films in it, it wouldn’t be used. This first meant running a quick little count of the number of films per subsubgenre.

I commented out all the subsubgenres which had fewer than 5 films so they wouldn’t be taken into account during the transformation into the interface.

Trying to think ahead for once, I realized that as movies were added to the Emerson collection, some of these commented out subsubgenres may make it up to the required 5.

Thus, I continued to record the commented-out subsubgenres whenever I encountered them in new films, while also keeping a count on them. When they hit 5, ta-da! Un-commented and returned to production.
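The counting step can be sketched like this. It is not the XSLT from the project, just a minimal Python illustration; the function name and the sample subsubgenre names in the usage note are made up.

```python
from collections import Counter

RULE_OF_5 = 5  # minimum number of films before a subsubgenre is shown


def active_subsubgenres(film_subsubgenres, threshold=RULE_OF_5):
    """Split subsubgenres into (active, parked) lists.

    film_subsubgenres holds one list of subsubgenre names per film.
    A name is active once at least `threshold` films carry it; the
    rest stay parked (the "commented out" state) but are still counted.
    """
    # set(film) makes sure a film counts once per subsubgenre.
    counts = Counter(name for film in film_subsubgenres for name in set(film))
    active = sorted(name for name, n in counts.items() if n >= threshold)
    parked = sorted(name for name, n in counts.items() if n < threshold)
    return active, parked
```

With five films tagged "Example A" and one tagged "Example B", only "Example A" is active; add four more "Example B" films and it comes back into production on the next run.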

(The Rule of 5 also applies to non-English languages, but we’ll get to those later…)


Emflix – Part 12 – Subgenres – Action and Adventure

Link to Part 11

As discussed last time, there were 19 primary genres. Each of those had many possible values for subgenres.

Compare again Netflix’s listing of genres to mine.

Before discussing specific changes I made, a comment:

I noticed that Netflix had many “hybrid subgenres” e.g. Action Sci-Fi and Fantasy, Indie Romance, Sci-Fi Thrillers, etc. The odd thing to me is that only some of these hybrid subgenres would appear under both of their ‘parent’ genres.

That is, in Netflix’s listings ‘Action Thrillers’ is listed under both ‘Action & Adventure’ and ‘Thrillers’, whereas ‘Action Comedies’ is listed only under ‘Action & Adventure’ and not under Comedies. Weird, right?

That kind of inconsistency is this cataloger’s bane. So my first rule was, if a subgenre is ‘hybrid’ (as judged by its name being made up of two different genres) it belongs under both genres.
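As a rough illustration of that rule (a Python sketch, not anything from the project), a hybrid subgenre's parents can be read off its name. The keyword table is an assumption of mine, including the guess that ‘Indie’ points at ‘Independent’; the real assignments were made by hand.

```python
# Hypothetical keyword table: maps a word fragment found in a subgenre
# name to the main genre it signals. Only a handful of genres are shown.
GENRE_KEYWORDS = {
    "Action": "Action and Adventure",
    "Adventure": "Action and Adventure",
    "Comed": "Comedy",
    "Thriller": "Thrillers",
    "Sci-Fi": "Sci-Fi and Fantasy",
    "Indie": "Independent",
}


def parent_genres(subgenre):
    """Return every main genre whose keyword appears in the subgenre name."""
    parents = []
    for keyword, genre in GENRE_KEYWORDS.items():
        if keyword in subgenre and genre not in parents:
            parents.append(genre)
    return parents
```

Under this sketch ‘Action Comedies’ lands under both ‘Action and Adventure’ and ‘Comedy’, which is exactly the consistency Netflix’s listing lacked.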

Got it? Okay, on to Action and Adventure!

Here are my subgenres for said category with asterisks for ones to be discussed (i.e. the ones which are absent or different from Netflix’s)
Action Classics
Action Comedies
Action Sci-Fi and Fantasy*
Action Thrillers
Adventure Sci-Fi and Fantasy*
African-American Action
Comic Books and Superheroes
Crime Action
Deadly Disasters
Espionage Action
Family Adventures*
Foreign Action and Adventure
Heist Films
Indie Action*
Martial Arts
Military and War Action
Super Swashbucklers


  1. Action Sci-Fi and Fantasy
    1. Added for hybrid consistency
  2. Adventure Sci-Fi and Fantasy
    1. While this one was still added for hybrid consistency, notice that I inverted it and added ‘ and Fantasy’. Netflix has this as ‘Sci-fi Adventures’. I think I did this because I wanted the parallelism with ‘Action Sci-Fi and Fantasy’. I don’t rightly remember though. #GoodJobGanin
  3. Blockbusters
    1. I chose not to use ‘Blockbusters’ because I felt it didn’t really capture a ‘genre’ of film but was rather a comment on the film’s reception or box-office.
  4. Family Adventures
    1. Added for hybrid consistency
  5. Indie Action
    1. Added for hybrid consistency


Tune in for Part 13 where I remember to tell you about the Rule of 5!



Emflix – Part 11 – Genres

Link to Part 10


Genres, subgenres, and sub-subgenres, oh my!

This was one of the main reasons I built the darn thing, and it involved many, many, many revisions, decisions, re-decisions, and changes. So bear with me as I try to remember the process.

First things first, the XML structure of genres was like this:

   <genreWrap>
      <genre>Some Genre</genre>
      <subGenre>great subgenre</subGenre>
      <subGenre>another subgenre</subGenre>
      <subSubGenre>wow! another level!</subSubGenre>
   </genreWrap>

Each <media> element could have as many <genreWrap> elements as necessary (in any order), but each <genreWrap> element had to have one (and only one) <genre> element.

Each <genreWrap> element could have as many <subGenre> elements as necessary (in any order), but each <subSubGenre> element had to follow directly after the <subGenre> element to which it belonged.


I used the enumeration declaration of xsd to control for every possible value of a <genre>, <subGenre>, and <subSubGenre> element.

Why three? How did I decide on three levels of hierarchy for genre-analysis? Easy! Remember, I didn’t want to have to analyze 3000+ films myself, and because I was going to be using Netflix for the genre analysis, I chose three levels because that’s their maximum.

Check out the full list of Netflix’s genres.

Here is the declaration of my genreType.

   <xs:simpleType name="genreType">
      <xs:restriction base="xs:string">
         <!--=============================== FILM GENRE ENUMERATION -->
         <xs:enumeration value="Action and Adventure"/>
         <xs:enumeration value="Anime and Animation"/>
         <xs:enumeration value="Children and Family"/>
         <xs:enumeration value="Classics"/>
         <xs:enumeration value="Comedy"/>
         <xs:enumeration value="Documentary"/>
         <xs:enumeration value="Drama"/>
         <xs:enumeration value="Faith and Spirituality"/>
         <xs:enumeration value="Foreign"/>
         <xs:enumeration value="Horror"/>
         <xs:enumeration value="Independent"/>
         <xs:enumeration value="LGBTQ"/>
         <xs:enumeration value="Music and Musicals"/>
         <xs:enumeration value="Romance"/>
         <xs:enumeration value="Sci-Fi and Fantasy"/>
         <xs:enumeration value="Sports"/>
         <xs:enumeration value="Special Interest"/>
         <xs:enumeration value="Thrillers"/>
         <!--=============================== TV GENRE ENUMERATION -->
         <xs:enumeration value="Television"/>
      </xs:restriction>
   </xs:simpleType>

Careful comparers will notice some differences in the main genres between Netflix and Emflix.

  1. No ampersands
  2. ‘Gay & Lesbian’ changed to ‘LGBTQ’
  3. ‘Sports & Fitness’ changed to ‘Sports’


  1. XML doesn’t like ampersands, and they have to be represented as the entity ‘&amp;’, so I chose to just change them all to ‘and’. (I’m sure a better computer coder person could’ve kept them all as entities and still handled it well on the other end with the HTML transformation and PHP, but I just ignored the issue.)
  2. ‘Gay & Lesbian’ is not a particularly inclusive category term. Frankly, Netflix is actually treating it as more inclusive than the term implies. Take a look at its subgenres, or NTs if you will: Bisexual is listed there. While I’m glad that they’re including such films and giving them metadata, it’s not great to subsume bisexuality as a subset of gay and lesbian sexualities. That’s the kind of shit bi folks have to put up with all day, every day, everywhere else, and I didn’t want Emflix to continue the same practice.
    1. There’s also the matter of films about trans folks or otherwise gender-nonconforming people. Our collection had a bunch of films which would fall into this type of category, and I didn’t feel comfortable including them under ‘Gay & Lesbian’, as again, that term doesn’t cover gender identity. I wanted to be explicit.
    2. I actually met with Emerson’s director of diversity education and human relations about terminology. While Emerson doesn’t have a house-style-guide, I thought it’d be good to run my choices by someone who worked professionally in the field of knowing things about inclusion.
  3. The dropping of ‘Fitness’ was a lot less fraught. While Netflix’s collection has LOTS of workout videos and exercise videos…our collection didn’t. Zero of them. So…that’s it.
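For what it's worth, keeping the ampersands would mostly have come down to escaping them on the way in and letting the parser un-escape them on the way out. Here is a minimal Python sketch of that round trip (not something Emflix ever did; the helper name is made up):

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def genre_element(label):
    """Build a <genre> element as text, escaping XML-special characters."""
    return "<genre>" + escape(label) + "</genre>"


# The ampersand is stored as the entity in the file...
xml_text = genre_element("Sports & Fitness")
# ...and comes back as a plain ampersand once the XML is parsed.
round_trip = ET.fromstring(xml_text).text
```

Here `xml_text` holds the `&amp;` form, and `round_trip` is the original label again.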


Okay, so that explains the main genres. A note here about potential for errors.

I had a file (that I’ll probably be referring back to from time to time) called “ErrorChecker.xsl”. Good name, I know. Whenever I encountered an error that I couldn’t control through the .xsd file (either through ignorance of how to do it, or because schema just can’t do all that I want it to do), I would add a tester to my error checker and then run it periodically.

Two that are relevant to main genres are the following:

<xsl:apply-templates select="mediaList/media[count(genreWrap[genre = 'Action and Adventure']) > 1]"  mode="repeated"/>

This one (yes, I had 18 additional versions, one for every other main genre) checks to see if a single <media> element had more than one <genreWrap> element with the same term as its primary genre. I was trying to prevent something like this:

<title>Cool Action Movie</title>
<genreWrap>
   <genre>Action and Adventure</genre>
   <subGenre>Action Sci-Fi and Fantasy</subGenre>
</genreWrap>
<genreWrap>
   <genre>Action and Adventure</genre>
   <subGenre>Action Thrillers</subGenre>
</genreWrap>

If I remember correctly, it didn’t actually screw anything up in the sorting, but it did screw up my statistical data. I couldn’t find a way to control that in the .xsd, so I just had to do it the way I did. Though now that I’m revisiting it, of course, I think rather than 19 versions of that apply-templates command, I could’ve done a for-each-group on each media element to see if it had a repeated genre element!

The other:

<xsl:apply-templates select="mediaList/media/genreWrap[count(genre) > 1]/../title[1]" mode="multGenre"/>

This second tester made sure that a single <genreWrap> element had only one <genre> element. The errors caused by this DID screw up the sorting as the XSLT which created the browsing interface was only looking for one <genre> element per <genreWrap>.
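Both checks can be written once, for every genre, by counting instead of naming each genre. Here is a minimal sketch in Python rather than XSLT (so not the actual ErrorChecker.xsl); the sample records and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample: the first record repeats a primary genre, the second
# has two <genre> elements inside one <genreWrap>.
SAMPLE = """
<mediaList>
  <media>
    <title>Cool Action Movie</title>
    <genreWrap><genre>Action and Adventure</genre><subGenre>Action Sci-Fi and Fantasy</subGenre></genreWrap>
    <genreWrap><genre>Action and Adventure</genre><subGenre>Action Thrillers</subGenre></genreWrap>
  </media>
  <media>
    <title>Confused Movie</title>
    <genreWrap><genre>Comedy</genre><genre>Drama</genre></genreWrap>
  </media>
</mediaList>
"""


def check_genres(xml_text):
    """Return (titles with a repeated primary genre, titles with a multi-genre wrap)."""
    repeated, multi = [], []
    for media in ET.fromstring(xml_text).iter("media"):
        title = media.findtext("title")
        wraps = media.findall("genreWrap")
        # Count the first <genre> of every wrap; any count above 1 is a repeat.
        primaries = Counter(wrap.findtext("genre") for wrap in wraps)
        if any(n > 1 for n in primaries.values()):
            repeated.append(title)
        # A wrap should hold exactly one <genre>.
        if any(len(wrap.findall("genre")) > 1 for wrap in wraps):
            multi.append(title)
    return repeated, multi
```

Running `check_genres(SAMPLE)` flags "Cool Action Movie" for the first problem and "Confused Movie" for the second, with no per-genre copies of the test.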
Check back for part 12 when we talk subgenres (or some anyway, probably not ALL of them at one go)!

Emflix – Part 10 – XML – TV Shows

Link to Part Nine

So, tv shows — for the most part I handled them similarly to the movies, but there were a few differences that I want to talk about.

  1. Creator
  2. Title attributes
  3. Year ranges

Unlike the movie elements, in which I recorded directors and writers, tv show elements got a creator element. It seemed to me that because tv shows often have so many different directors/writers/showrunners over the course of many seasons, the creator would be the most stable person-access-point to record.

It was programmatically generated just like those other person elements (see Part Six for LOTS more details about how that was done), and the trigger phrases were ‘Created by ’, ‘created by ’, and ‘creator ’. (Though now of course I acknowledge that I didn’t need to add variations for capitalization; I should’ve just converted the comparison text to a single case in the checking phase.)
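The single-case comparison would look something like this. It is a minimal Python sketch, not the XSLT from Part Six; the function name is made up, and it simply returns everything after the trigger rather than trimming the name the way the real templates did.

```python
def find_creator_text(note, triggers=("created by ", "creator ")):
    """Return the text following the first trigger phrase, ignoring case.

    The note is lowercased only for the comparison, so the returned text
    keeps its original capitalization. This assumes lowercasing does not
    change the string's length, which holds for plain ASCII notes.
    """
    lowered = note.lower()
    for trigger in triggers:
        position = lowered.find(trigger)
        if position != -1:
            return note[position + len(trigger):]
    return None
```

Two lowercase triggers now cover ‘Created by ’, ‘created by ’, ‘CREATED BY ’ and anything in between.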

I used the title attribute ‘differentiator’ consistently in tv shows. I also used it in movie elements when there was a need to separate two different versions of the same movie, but because every season release of a show has the same title, I recorded “Season $X” for each season of a show. If we had the complete series (or multiple seasons in a single box set), I recorded that in the differentiator as well.

Finally, I used yearRange elements rather than a year element if the release spanned more than a single calendar year. Here’s a complete tv show element depicting many of these pieces.

  <media id="1667754" dateCreated="2014-11-06">
       <title differentiator="Season 2">Gilmore Girls</title>
       <creator sort="5">Amy Sherman-Palladino</creator>
       <actor sort="8">Lauren Graham</actor>
       <actor sort="8">Alexis Bledel</actor>
       <actor sort="9">Melissa McCarthy</actor>
       <actor sort="7">Keiko Agena</actor>
       <actor sort="7">Yanic Truesdale</actor>
       <actor sort="7">Scott Patterson</actor>
       <actor sort="6">Liza Weil</actor>
       <actor sort="7">Jared Padalecki</actor>
       <actor sort="6">Milo Ventimiglia</actor>
       <actor sort="7">Kelly Bishop</actor>
          <subGenre>TV Comedies</subGenre>
          <subGenre>TV Dramas</subGenre>
          <!--TV Family Dramas-->
          <subGenre>TV Dramas</subGenre>
          <subSubGenre>TV Dramedy</subSubGenre>
       <summary>Those acclaimed Gilmore Girls are back for a second season of warmth, charm, zingy
          repartee, and heart-stopping moments of drama. Includes 22 episodes from the second
       <LCSpecialTopics>Individual programs</LCSpecialTopics>
       <callNumber href="">[DVD] PN1992.77
          .G52 G52 2008 v.2</callNumber>
       <coverArt href="Pics/GilmoreGirls2.jpg"/>



Check back for part 11 when I begin to dig into genres! (this may take a while…)

Emflix – Part Nine – XML – Foreign titles

Link to Part Eight


I want to start off this post by acknowledging that I haven’t really mentioned genres. Wasn’t part of the whole point of this project to display and index the films by more than a single genre? WASN’T IT?


And I will talk about them! This order of posts isn’t actually the order of how things ‘happened’. I was working on everything simultaneously, and making genre decisions (and revising those decisions) all the time. But the genre stuff is so self contained and yet so involved, that I want to talk about it on its own, without getting bogged down by the other pieces I pulled in.

Okay, so today — foreign titles!

While every movie that gets released in a foreign country (and here I mean foreign to the country of origin) may get another title, I didn’t necessarily want to spend the time tracking down everything that ‘Alien’ was called. My rule of thumb was:

  • If it was produced and released simultaneously in more than one country (and under different titles), give each title
  • If it was a foreign (to the US) film but was primarily known in the US by an English title, give the original and its English title.

Here are some examples:

   <media id="1737472" dateCreated="2015-04-10" lastModified="2015-04-10">
       <title xml:lang="cmn" type="foreign">Chuntian de kuangxiang</title>
       <title>Rhapsody of Spring</title>
   <media id="1616575" dateCreated="2014-11-09">      
       <title>Pelle the Conqueror</title>
       <title type="foreign" xml:lang="da">Pelle Erobreren</title>
       <title xml:lang="sv" type="foreign">Pelle erovraren</title>

Notice that I included an attribute ‘type=foreign’ on all non-English titles. This was my way of differentiating them from the primary display title. It’s also fairly other-ing and I regret having done it this way. For of course, these are not ‘foreign titles’. They are the native titles, and it’s the English title which is foreign.

I also included a piece of ACTUAL STANDARD data! I know, what a concept! ‘xml:lang’ is defined by the XML spec to indicate the language of the content in that element. The codes are pulled from the Internet Assigned Numbers Authority Language Subtag Registry. I never ended up doing anything that used that data…but there it is.

Learn From My Mistakes

All the titles were searchable but only the first ‘foreign’ title was displayed. This means that you’d get hits, without seeing why. This is very bad information retrieval.

Example here, a search of ‘khamas’ gives you a hit, because the Arabic title ‘Khamas Kamirat Muhattamah’ is contained in the record, but not displayed to the user. Not great.
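One way out is to have the search hand back the title that actually matched, so the interface can show it next to the display title. Here is a minimal Python sketch of that idea (not Emflix's actual search, which ran on the PHP side); the record layout and function name are assumptions for illustration.

```python
def search_titles(records, query):
    """Return (display title, matched title) for every record that hits.

    Each record is a dict with a 'display' title and a list of 'other'
    titles. Returning the matched title lets the results page explain
    why a record came back.
    """
    needle = query.lower()
    hits = []
    for record in records:
        for title in [record["display"]] + record["other"]:
            if needle in title.lower():
                hits.append((record["display"], title))
                break  # one explanation per record is enough
    return hits
```

Searching a record for ‘Pelle the Conqueror’ with the query ‘erobreren’ returns the pair (‘Pelle the Conqueror’, ‘Pelle Erobreren’), so the user sees both the film and the reason it matched.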


Check back for part 10 when I talk tv shows! (Mostly similar but some of its own weirdnesses)

Emflix – Part Eight – XML – Years, Boxes, Metametadata

Link to Part Seven

Today I’ll be discussing years, box-sets, and some meta-metadata: the dates created and updated.

I chose to use a single year for films: the earliest year I could find that it was released. I handled tv shows differently, but that’ll come up later.

Heading13 contained every 500 field mushed together. As anyone who’s cataloged DVDs knows, we usually type some phrase like “DVD of the original motion picture released in 1999”, transcribed from the back of the box. Bearing that in mind, I applied this template:

   <xsl:template match="Heading13">
      <xsl:variable name="Heading13Data" select="normalize-space(.)"/>
      <xsl:variable name="yearFind" as="xs:string*"
         select="('motion picture in ','release of the ','released in ')"/>
      <xsl:variable name="yearFound">
         <xsl:for-each select="1 to count($yearFind)">
            <xsl:variable name="x" select="."/>
            <xsl:value-of
               select="substring(substring-after($Heading13Data, $yearFind[$x]),1,4)"/>
         </xsl:for-each>
      </xsl:variable>
      <xsl:choose>
         <xsl:when test="matches($yearFound, '(19|20)\d{2}')">
            <xsl:value-of select="$yearFound"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:value-of select="normalize-space(../../Heading8)"/>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>

First I assign all the data in Heading13 (normalizing the space) to a variable. Next a variable is created holding the trigger phrases which often indicate a year. As with the writer/director templates, I added these as I encountered them. Then, counting from 1 to the number of trigger phrases (3 in this case), it takes the four characters found after each trigger phrase and assigns them to a variable called ‘yearFound’.

The choose statement then tests whether those four characters satisfy a regex match of 19 or 20 followed by two digits. If they match, that is the year it outputs; if they don’t, it outputs whatever was in Heading8 (which is the second of the two date fields in the fixed field).
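The same idea, sketched in Python for anyone who doesn’t read XSLT. It is not a line-for-line port: this version tests each trigger phrase on its own and requires all four characters to match, which is a little stricter than the template. The function name and the fallback argument (standing in for Heading8) are made up.

```python
import re

# Trigger phrases from the template; the year is expected right after one.
TRIGGERS = ("motion picture in ", "release of the ", "released in ")
YEAR = re.compile(r"(19|20)\d{2}")


def find_year(note, fallback):
    """Return the year following a trigger phrase, else the fallback value."""
    text = " ".join(note.split())  # same effect as normalize-space()
    for trigger in TRIGGERS:
        _, found, after = text.partition(trigger)
        candidate = after[:4]  # the four characters after the trigger
        if found and YEAR.fullmatch(candidate):
            return candidate
    return fallback
```

So a note reading “DVD of the original motion picture released in 1999.” yields 1999, while a note with no trigger phrase falls back to the fixed-field date.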

Box-sets were a real bugaboo for me. They were much more work and they ended up (or at least, the way I handled them) breaking a data rule. I had decided to break up box-sets and give equal access and treatment to every movie contained therein. Remember way back in part three when I said I couldn’t use unique IDs after a while? This is why. I had been using the bib ID from the catalog as the unique ID, but of course a box set that had been cataloged as such would have a single ID. That meant that every movie in the set would have the same ID. Bad move, Ganin.

It also meant that I had to spend loads more time manually inputting data because, as anyone who’s cataloged a box-set knows…you end up cramming a lot into a single field. Most of my clever little templates would either give me a single entry for the first movie in the set, or not even that. I had to hop over to Wikipedia or another source and find the directors/writers/years/titles for all the other movies in the set. Sigh. This is one of the few things where I’m not super sure what I would’ve done differently if I were starting over. I think ultimately it was beneficial (particularly because this was a visual display) to have the individual movies separated, which I think would just have to mean more work on my part.

As for that tasty bit of meta-metadata (not that I ever actually used it…): each media element had two attributes, “dateCreated” and “lastModified”, seen here:

<media id="{Heading16}" dateCreated="{$date}" lastModified="{$date}">

They were initially set to the same value, and then I would update the lastModified date when I changed something manually. There’s a globally scoped variable supplying their value, seen here:

    <xsl:variable name="date" select="substring(string(current-date()), 1, 10)"/>

The weirdness is just because I only wanted YYYY-MM-DD, but the current-date() function gives you LOTS more than that, so I first converted it into a string and kept only the first 10 characters.

Next time I’ll talk about foreign titles.

Emflix – Part Seven – XML – LC Special Topics, Language

Link to Part Six


I’m grouping today’s topics, the LC Special Topics and the language, together, because they both rely on the same thing — a set list of known possibilities.

                 <xsl:apply-templates select="Heading1"/>
                 <xsl:apply-templates select="Heading11"/>

The instructions are much simpler than the ones we’ve seen previously, just applying the data right from the XML, no special filtering or processing. Heading 1 was the “Display Call No.” field (doesn’t map to MARC, not actually sure where it was drawn from, maybe the 852 from MARC-Holdings?) and Heading11 is from the fixed field.

So here’re the matching templates (excerpted…because they’re long)

<xsl:template match="Heading1">
    <xsl:choose>
        <xsl:when test="contains(., '.A26 ')">Acting. Auditions</xsl:when>
        <xsl:when test="contains(., '.A3 ')">Adventure films</xsl:when>
        <xsl:when test="contains(., '.A43 ')">Africa</xsl:when>
        <xsl:when test="contains(., '.A45 ')">Alcoholism</xsl:when>
        <xsl:when test="contains(., '.A5 ')">Animals</xsl:when>
        <xsl:when test="contains(., '.A54 ')">Animation</xsl:when>
        <xsl:when test="contains(., '.A72 ')">Armed Forces</xsl:when>
        <xsl:when test="contains(., '.A73 ')">Art and the arts</xsl:when>
        <xsl:when test="contains(., '.A77 ')">Asian Americans</xsl:when>
        <!-- … -->
        <xsl:when test="contains(., '.W3 ')">War</xsl:when>
        <xsl:when test="contains(., '.W4 ')">Western films</xsl:when>
        <xsl:when test="contains(., '.W6 ')">Women</xsl:when>
        <xsl:when test="contains(., '.Y6 ')">Youth</xsl:when>
    </xsl:choose>
</xsl:template>
<xsl:template match="Heading11" mode="language">
    <xsl:choose>
        <xsl:when test=". = 'ara'">Arabic</xsl:when>
        <xsl:when test=". = 'arc'">Aramaic</xsl:when>
        <xsl:when test=". = 'arm'">Armenian</xsl:when>
        <xsl:when test=". = 'art'">Artificial (Other)</xsl:when>
        <!-- … -->
        <xsl:when test=". = 'und'">Undetermined</xsl:when>
        <xsl:when test=". = 'urd'">Urdu</xsl:when>
        <xsl:when test=". = 'wol'">Wolof</xsl:when>
        <xsl:when test=". = 'zul'">Zulu</xsl:when>
        <xsl:when test=". = 'zxx'">No linguistic content</xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="."/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

Get the idea? Just a simple ‘choose’ statement with a LOT of options. Drawn from Class Web (for the cutters) and from the MARC code list of languages (for the langs). Now I know what you’re thinking: but Netanel, aren’t you just Jurassic Park-ing?

Here I stop to tell a tale that for whatever reason…has stuck with me over the years. In the novel Jurassic Park, but not the movie, there’s an important scene where Ian Malcolm demonstrates that the animals must be breeding. The InGen scientists are sure that the dinos are not breeding because their cameras are equipped with fancy counting technology and on a regular basis count the animals in the park. It always matches their pre-determined correct number of animals, ergo: no breeding.

See, but Malcolm realizes that the camera-counting thing is only counting up to the number that it assumes will be there. When he has them try to count for higher and higher numbers of animals, and then eventually any number, they get wildly different results. The takeaway lesson (for me anyway) has always been: if you assume you know what the results will be, you’ll find those results. You must remember to account for the ones you couldn’t have known would be in the mix.

So here’s how I ensured that I wasn’t Jurassic Park-ing. The first thing I did was spit out a list of every cutter currently in use by the circulating DVD collection. Then I looked each one up in Class Web to get the value and added them all to that choose/when statement. But I also added them to this little number in my “error checker”.

<xsl:variable name="LCSpecialTopics"
       <xsl:for-each-group select="/root/row"
          group-by="substring-before(substring-after(Heading1,'1995.9.'),' ')">
      <xsl:if test="not(index-of($LCSpecialTopics,current-grouping-key()))">
             <xsl:value-of select="concat(current-grouping-key(),' isnt currently in my Circulation Stylesheet!')"/>

I wrote a similar one for the language codes. By running it each time I was given new movies by my boss, I could see if any of them had a cutter/lang which wasn’t accounted for in my pre-existing set!
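The anti-Jurassic-Park check boils down to a set difference: every cutter seen in the data, minus every cutter the stylesheet knows about. Here is a minimal Python sketch of it (not the XSLT above); the known set is only an excerpt, and the call numbers in the usage note are invented for illustration.

```python
# Excerpt of the cutters the stylesheet already handles.
KNOWN_CUTTERS = {"A26", "A3", "A43", "A45"}


def cutter_of(call_number):
    """Text between '1995.9.' and the next space, like the group-by expression."""
    _, found, after = call_number.partition("1995.9.")
    if not found or " " not in after:
        return ""  # mirrors XPath returning an empty string
    return after.split(" ", 1)[0]


def unknown_cutters(call_numbers, known=KNOWN_CUTTERS):
    """Return every cutter in the data that the known list doesn't cover."""
    seen = {cutter_of(number) for number in call_numbers}
    return sorted(c for c in seen if c and c not in known)
```

Fed a batch containing a made-up call number like "PN1995.9.Z99 X1 2010", it reports Z99 as not yet accounted for, instead of quietly counting only what it expected to find.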

As an aside, doing this did help me identify (and then correct) many invalid cutters, so that’s a bonus.

Well, that’ll do it for those two. Up next we’ll be talking:

  • Years
  • Box-Sets
  • Dates (of the records, i.e. meta-metadata)