First off, I want to assure folks that I have not given up or ceased to work on the other aspects of this blog. These technical things interest me as a cataloger just as much as the social (sociological?) aspects of cataloging do.
Alright, so I’ve begun — you can find the preliminary .xsd file here. I’ve included what was already on LC’s page, but made alterations. The assertions (that provide the validation) that I’ve added (so far) control for all the indicators for 1XX fields. At first I thought, oh sure, 100, 110, 111, 130. Easy. But then of course, I realized that that’s Bibliographic MARC. There’s also Authority, Classification, Holdings and Community.
(Full disclosure, I have no real idea what the heck MARC-Community is, but it’s on the spec!)
So I then added assertions to control the indicators for the 148, 150, 151, 155, 162, 180, 181, 182, and 185 (Authority) and the 153 and 154 (Classification). Fortunately, Holdings has no 1XX fields, and Community’s are identical (in indicator values) to those of Bibliographic.
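To give a rough idea of what those assertions check, here is a minimal Python sketch covering only the Bibliographic 1XX fields. It is not the XSD itself: the names `BIB_1XX_INDICATORS` and `check_indicators` are made up for this illustration, and the allowed values in the table are an assumption based on the MARC 21 Bibliographic format, so double-check them against the spec before relying on them.

```python
# Allowed indicator values for Bibliographic 1XX fields (assumed from the
# MARC 21 Bibliographic format; a blank indicator is written as " ").
BIB_1XX_INDICATORS = {
    "100": ("013", " "),
    "110": ("012", " "),
    "111": ("012", " "),
    "130": ("0123456789", " "),
}


def check_indicators(tag, ind1, ind2, rules=BIB_1XX_INDICATORS):
    """Return a list of problems; an empty list means the indicators pass."""
    if tag not in rules:
        return []  # this sketch only knows about the tags in the table
    allowed1, allowed2 = rules[tag]
    problems = []
    # Each indicator must be exactly one character from the allowed set.
    if len(ind1) != 1 or ind1 not in allowed1:
        problems.append(f"{tag}: ind1 {ind1!r} not allowed")
    if len(ind2) != 1 or ind2 not in allowed2:
        problems.append(f"{tag}: ind2 {ind2!r} not allowed")
    return problems
```

Covering the Authority and Classification fields would just mean a separate table per format, since the same tag can have different rules depending on the format.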
In terms of the non-filing characters, I considered adding assertions for ‘A’, ‘An’, and ‘The’. While the full list of possible articles on which to not file is extensive, the problem is that there are ALSO languages in which you would file on ‘A’. If the movie ‘A la mala’ were to get a uniform title of ‘A la mala (motion picture)’ for example, it would have to validate without the non-filing characters! As for ‘An’ and ‘The’ — my knowledge of all languages and their articles is simply too imperfect to be certain that I wouldn’t screw up something else like that as well. So for now, I simply required that those indicators be a digit.
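Here is a small Python sketch of why I backed off. The function names are hypothetical and the "naive" rule is one I made up for illustration (it only knows three English articles); the second function is the digit-only rule described above.

```python
def naive_nonfiling_count(title):
    """Naive rule: skip a leading English article plus the following space."""
    for article in ("The ", "An ", "A "):
        if title.startswith(article):
            return len(article)
    return 0


def is_digit_indicator(value):
    """The rule actually adopted: the indicator must be one ASCII digit."""
    return len(value) == 1 and value in "0123456789"
```

The naive rule says 2 for 'A la mala (motion picture)', so a schema built on it would reject the correct value of 0. The digit-only rule accepts both 0 and 2 and leaves the judgment to the cataloger.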
So now to test all this stuff, right? Well I figured I would download the examples given on the LC MARCXML page for Classification/Names/Subjects (I grabbed a bunch of Bibliographic and Holdings from my work catalog, still no clue where to get Community…).
After conforming all the records to the same namespace (LC seems to use slim: for some, marc: for some, marcxml: for some, and no prefix at all for others, because they hate me), wouldn’t you know it, suddenly I’m popping errors all over the place. Oh glob, what did I do wrong?
Nothing! The examples given in the MARCXML documentation aren’t even all well-formed XML! According to the XML specification,
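For the namespace cleanup, something like the following Python sketch would do the job. It assumes the target namespace is the MARC21 slim URI and that every element in the file should end up there; `normalize_namespace` is a name invented for this example, not part of any library.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def normalize_namespace(root, ns=MARC_NS):
    """Move every element into `ns`, whatever namespace (if any) it had."""
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        local = elem.tag.rsplit("}", 1)[-1]  # strip any "{uri}" prefix
        elem.tag = f"{{{ns}}}{local}"
    return root


# Serialize with a single, consistent "marc:" prefix.
ET.register_namespace("marc", MARC_NS)
```

Worth noting: to an XML parser the prefix itself (slim:, marc:, marcxml:) doesn’t matter as long as it is bound to the same URI. The records with no namespace at all are the ones that really differ.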
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section.
But (just as an example, there are MANY of these…) in the 642 field of one of the Authority examples (and indeed, it’s the same in the actual authority record in the NAF), there’s this:
<marcxml:datafield tag="642" ind1=" " ind2=" ">
<marcxml:subfield code="a">nuova ser., 4</marcxml:subfield>
<marcxml:subfield code="d">1983-< ></marcxml:subfield>
(emphasis mine) — INVALID XML, LC!! YEESH. So, yeah. First I had to fix all those basic formatting errors.
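If you want to see the parser choke on this yourself, here is a minimal Python sketch using the standard library. The `escape_stray_lt` helper is a heuristic I’m assuming for illustration: it escapes any `<` that can’t possibly start a tag, comment, or processing instruction, which covers the `1983-< >` case but would miss a literal `<` followed directly by a letter.

```python
import re
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the XML parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def escape_stray_lt(xml_text):
    """Heuristic: escape any '<' that cannot start a tag, comment, or PI."""
    return re.sub(r"<(?![A-Za-z_/!?])", "&lt;", xml_text)


broken = '<subfield code="d">1983-< ></subfield>'
fixed = escape_stray_lt(broken)  # '<subfield code="d">1983-&lt; ></subfield>'
```

Here `is_well_formed(broken)` is false and `is_well_formed(fixed)` is true. The literal `>` can stay as it is, since only `<` and `&` are forbidden in ordinary text.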
Then, after I fixed those errors, it turned out most of the leaders weren’t valid either! The leader is 24 characters long (positions 00-23), some of which are system-supplied and some of which are generated from the control fields. But so many didn’t conform, because they had too few spaces around the 8th/9th position. Super annoying and confusing. Grr.
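A quick way to catch the leader problem before the schema does is a length-and-pattern check like this Python sketch. The MARC leader has 24 character positions (00-23). The regular expression is my approximation of the leader pattern in LC’s MARC21slim schema, so treat it as an assumption, and `check_leader` is a hypothetical helper name.

```python
import re

# Assumed to mirror the leader pattern in LC's MARC21slim schema.
LEADER_PATTERN = re.compile(
    r"[\d ]{5}[\dA-Za-z ][\dA-Za-z][\dA-Za-z ]{3}(2| )(2| )"
    r"[\d ]{5}[\dA-Za-z ]{3}(4500|    )"
)


def check_leader(leader):
    """Return None if the leader looks fine, otherwise a short message."""
    if len(leader) != 24:
        return f"leader is {len(leader)} characters, expected 24"
    if not LEADER_PATTERN.fullmatch(leader):
        return "leader is 24 characters but does not match the pattern"
    return None


# Built by concatenation so the blanks in positions 08-09 are easy to see.
good = "01142cam" + "  " + "2200301" + " a " + "4500"
collapsed = "01142cam" + " " + "2200301" + " a " + "4500"  # one blank lost
```

`check_leader(good)` returns `None`, while `check_leader(collapsed)` reports a 23-character leader. That is what you get when two blanks get squashed into one somewhere along the way.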
So, once I fixed that, voila! All my assertions worked as expected, and I found additional errors, which of course is the point of validating: a few 130s had second indicators. At first I was confused: how ever could LC make such a mistake? But then, oy, I checked the spec again, and wouldn’t you know that while 130 in Bibliographic MARC has
- First indicator: 0-9
- Second indicator: undefined