Note: See also previous posts (1 and 2) on using the number building tool with standard subdivisions, posts 1 and 2 on using the tool in music, and posts on using it in literature and in natural resources. The general approach to building numbers described in those posts can be applied in any discipline. See also the WebDewey training modules for the WebDewey number building tool.
Are you having problems using the WebDewey number building tool to add standard subdivisions to three-digit numbers ending in zero? If so, let’s start with the simplest case: a division number where no extra zeros are needed with standard subdivisions. (We’ll look at more complicated situations in subsequent blog posts.) Division numbers have two significant digits plus one placeholder zero, e.g., 550 Earth sciences. Classifiers are taught to drop the placeholder zero when adding a standard subdivision to a division number. We plan to teach the number building tool to drop that placeholder zero automatically, but that has not happened yet. In the meantime, here are instructions for working around the problem.
Let’s consider an encyclopedia of earth sciences, e.g., the Macmillan Encyclopedia of Earth Sciences. Its first LCSH is "Earth sciences--Encyclopedias."
Browsing the Relative Index for "earth sciences" yields:
Earth sciences 550
If we click to see the full record, we get this Hierarchy box:
The focus is 550 Earth sciences, with downward hierarchy showing some numbers built with standard subdivisions, all with one zero. There is no indication that extra zeros are needed. One step up in the hierarchy is an entry with the placeholder zero greyed out (550) and the heading edited for browsing purposes as found in the DDC Summaries:
550 Earth sciences & geology
If we click to see the full record, we get this Hierarchy box:
In addition to the Hierarchy box, the full record has a Create Built Number box with the base number that we want: 55 without the placeholder zero:
If we click Start in that box, we get:
Now the Hierarchy box shows standard subdivisions:
In the Hierarchy box we click T1—03 Dictionaries, encyclopedias, concordances and get this Hierarchy box:
If we now click Add in the Create Built Number box, we get:
We have now built the number (550.3) with the correct number of zeros. If we click Save, the newly built number appears in the Hierarchy box:
We now have an opportunity to change the user term and add other user terms—but enough for now! We have successfully built the number.
The key to success is to find the record with the placeholder zero greyed out, and click Start in that record.
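For those who like to see the arithmetic spelled out, here is a minimal sketch of the rule involved (illustrative Python only, not WebDewey’s actual code; the function name is invented):

```python
def add_standard_subdivision(division, subdivision):
    """Add a Table 1 standard subdivision to a DDC division number
    (three digits ending in a placeholder zero) by dropping that zero."""
    assert len(division) == 3 and division.endswith("0")
    digits = division[:2] + subdivision  # drop the placeholder zero, append T1 notation
    # A decimal point follows the third digit of any longer DDC number.
    return digits if len(digits) <= 3 else digits[:3] + "." + digits[3:]

# 550 Earth sciences + T1--03 Dictionaries, encyclopedias, concordances
print(add_standard_subdivision("550", "03"))  # -> 550.3
```

The Start button in the greyed-zero record is, in effect, what selects the two-digit base 55 here.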
Note: The WebDewey number building tool may be confused by bracketed standard subdivisions. For example, if there are bracketed entries indicating that standard subdivisions have been relocated from a double zero to a single zero (e.g., at 380 or 730), you will need to use Edit local to get the single zero.
Fuzz testing, or fuzzing, is a way of stress testing services by sending them unexpected or malformed input data. I remember being very impressed by one of the early descriptions of testing software this way (Miller, Barton P., Louis Fredriksen, and Bryan So. 1990. "An empirical study of the reliability of UNIX utilities." Communications of the ACM 33 (12): 32-44), but I had never tried the technique.
Recently, however, Jenny Toves spent some time extending VIAF's date parsing software to handle dates associated with people in WorldCat. As you might imagine, running a hundred million new date strings through it exposed some holes in the software. While we can't guarantee that the parsing always gives the right answer, we would like to be as sure as we can that it won't blow up and cause an exception.
So, I looked into fuzzing. Rather than sending purely random strings to the software, the techniques in common use now tend to generate strings from a specification or by mutating existing test cases. Although we have something close to a specification in the regular expressions the code uses, I decided to try mutating the date strings we already have that are derived from VIAF dates.
Most fuzzing frameworks are quite loosely coupled; typically they pass the fuzzed strings to a separate process under test. Rather than do that, I read in each string, applied some simple transformations to it, and called the date parsing routine to see whether it raised an exception. Each test string was fuzzed roughly as many times as it was long, with the parser called at each step.
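In outline, the harness looked something like the following sketch (the names here are hypothetical: `parse_date` and the file name stand in for the actual routine and data under test, and the three mutations shown are representative rather than the exact set used):

```python
import random

def parse_date(s):
    """Stand-in for the real date parsing routine under test."""
    return s  # substitute the actual parser here

def mutate(s):
    """Return one randomly mutated copy of s: drop, double,
    or replace a single character."""
    pos = random.randrange(len(s))
    roll = random.random()
    if roll < 1 / 3:
        return s[:pos] + s[pos + 1:]               # drop a character
    if roll < 2 / 3:
        return s[:pos] + s[pos] + s[pos:]          # double a character
    return s[:pos] + chr(random.randrange(32, 127)) + s[pos + 1:]  # replace it

failures = []
with open("test_strings.txt", encoding="utf-8") as f:
    for line in f:
        original = line.rstrip("\n")
        for _ in range(len(original)):             # one round per character
            variant = mutate(original)
            try:
                parse_date(variant)
            except Exception as exc:               # any exception is a bug to log
                failures.append((variant, exc))

print(f"{len(failures)} fuzzed strings raised exceptions")
```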
For our 384K test strings this resulted in 1.9M fuzzed strings. This took about an hour to run on my desktop machine.
While the testing didn't find all the bugs we knew about in the code, it did manage to tickle a couple of holes in it, so I think the rather minimal time taken (less than a day) was worth it, given the confidence it gives us that the code won't blow up on strange input.
The date parsing code in GitHub will be updated soon. Jenny is adding support for Thai dates (different calendar) and generally improving things.
Possibly the reason I thought of trying fuzzing was an amazing post on lcamtuf's blog, "Pulling JPEGs out of thin air." By instrumenting some JPEG software so that his fuzzer could follow code paths at the assembly level, he was able to create byte strings representing valid JPEG images purely by sending in fuzzed strings, a truly remarkable achievement. My feeling on reading it was very similar to my reaction to the original UNIX testing article cited above.
A new version of MarcEdit has been made available. The update includes the following changes:
Linked Data Tool Improvements:
A couple of specific notes of interest about the linked data tool. First, over the past few weeks I’ve been collecting instances where id.loc.gov and VIAF have been returning results that were not optimal. On the VIAF side, some of that was related to the indexes being queried, and some of it to how queries are made and executed. I’ve done a fair bit of work and added additional data checks to ensure that links are made correctly. At the same time, there is one known issue I wasn’t able to correct while working with id.loc.gov, and that is deprecated headings: id.loc.gov currently provides no information, in any metadata returned through the service, that relates a deprecated heading to the current preferred one. This is something I’m waiting for LC to correct.
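For a sense of what such a lookup involves, here is a minimal sketch of a query against VIAF's public AutoSuggest endpoint (this is not MarcEdit's code, and the response field names reflect my understanding of the service, so treat them as assumptions):

```python
import json
import urllib.parse
import urllib.request

def viaf_suggest(heading):
    """Query VIAF's AutoSuggest service and return (term, VIAF URI)
    pairs for candidate matches to a heading."""
    url = ("http://viaf.org/viaf/AutoSuggest?query="
           + urllib.parse.quote(heading))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # "result" is null when there are no matches (assumed response shape).
    return [(hit["term"], "http://viaf.org/viaf/" + hit["viafid"])
            for hit in data.get("result") or []]

for term, uri in viaf_suggest("Austen, Jane"):
    print(term, uri)
```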
To improve the Linked Data Tool, I’ve added the ability to query by specific index. By default, the tool queries LC (NACO), but users can select from a wide range of vocabularies (including querying all the vocabularies at once). The new screen for the Linked Data tool looks like the following:
In addition to the changes to the Linked Data Tool – I’ve also integrated the Linked Data Tool with the MarcEditor:
And within the Task Manager:
The idea behind these improvements is to let users integrate data linking into normal cataloging workflows – or at least start testing how these changes might affect local workflows.
You can download the current version by using MarcEdit’s automatic update within the Help menu, or by going to http://marcedit.reeset.net/downloads.html and downloading it there.
Posting to Autocat
On 1/22/2015 12:37 AM, J. McRee Elrod wrote:
> James said:
>> It seems to me that the catalog *as a whole* handles this rather well right now …
> Yes there are resources outside the items themselves which identify hoaxes. The questions are, do the records for the items themselves need such indication, and if so, how does that impact on cataloguer neutrality, and who decides what is untrue?
That is not the idea I wished to convey. We need to see the catalog as a whole–as our users do–and not focus only on individual records. In the case of the “Protocols” the catalog is supposed to bring all the materials together, both the different versions of the Protocols and the items about the Protocols. At one time, it did that job rather well.
We are lucky that we can see how this was supposed to work by looking in Princeton’s scanned card catalog. (DISCLOSURE: I am not pining for “the old days” here. I am demonstrating a power that has been lost)
If we go to the first card of the Protocols (http://bit.ly/15dua9k) and browse the cards, we see the different versions of the text in different languages. As we continue along, we come to items that have the Protocols as a subject: http://bit.ly/1JjZEYU. (We know this because the subjects were typed in ALL CAPS, and they may have been in red ink, too.) As we browse those cards, we immediately become aware that there is some kind of controversy.
This is one example of how the catalog was supposed to work, but this capability, along with many others, was lost when keyword searching was introduced. It is true that keyword searching brought in many capabilities that were impossible before, but it should be recognized that it lost many as well.
In our catalogs today, there is no “browse” function that brings subjects and titles together in this very powerful and provocative way. I am absolutely not saying that the solution is to bring back such a browse, because that method was for physical catalogs and is 100% obsolete today. But our time and efforts would be better spent figuring out how to recreate that power for a new environment, instead of tediously recoding zillions of records according to what we deem “true” or “false” today, judgments that we know will change over time. As a profession, we don’t want to go there. Simply bringing similar things together can be both powerful and highly provocative.
Mac, I know you understand this, but you are one of the few. I believe these sorts of basic powers of the catalog have been forgotten for a long time now.
Posting to Autocat
On 1/21/2015 7:40 PM, J. McRee Elrod wrote:
> The word “Mythomane” has the meaning we are looking for, but would be unknown to most patrons I suspect.
> “Hoaxes” is perhaps the best suggestion so far.
> Deciding what is a hoax and what is simply inaccurate would be difficult. Rarely do people witnessing the same event agree about what they saw. Also, would such a judgement breach our neutrality policy?
It seems to me that the catalog as a whole handles this rather well right now. If we search for “The Protocols of the Wise Men of Zion” (the uniform title) in WorldCat as a subject (which is what we are talking about) we find: http://www.worldcat.org/search?q=su%3A%22Protocols+of+the+wise+men+of+Zion%22
The first individual records are (out of many):
1) The history of a lie, “The protocols of the wise men of Zion” : a study. by Herman Bernstein; John Retcliffe, Sir
2) The plot : the secret story of the Protocols of the Elders of Zion by Will Eisner; Umberto Eco
3) Warrant for genocide; the myth of the Jewish world-conspiracy and the Protocols of the elders of Zion by Norman Cohn
4) A lie and a libel : the history of the Protocols of the Elders of Zion by B W Segel; Richard S …
Something similar already exists in Google. When I search for the boy’s book “The boy who came back from heaven,” I see (at least):
1) The boy who didn’t come back from heaven: inside a bestseller’s ‘deception’
2) The Boy Who Came Back From Heaven, Or Not?
3) What If Heaven Is Not For Real?
4) and then there is the Wikipedia page, which says in its second sentence: “The book, published by Tyndale House Publishers in 2010, lists Alex’s father Kevin Malarkey as an author along with Alex, though in November 2012 Alex described the book as ‘1 of the most deceptive books ever.’”
It seems to me that this is more than adequate.
The catalog can be a very powerful tool so long as it is used correctly. After all, that is how it was designed to work, and our predecessors were pretty clever people. And as other tools come along–as they are now–we can use them so that the catalog becomes even more powerful. The task ahead of us is to make the power of these tools more obvious to the user. The information and the technology to do it all already exist.
In this case, the boy’s surname is a tip-off as well. (Yes, that is a joke!)
I hesitate to change the fundamental role of catalogers and their records. I think that somebody, somewhere has to play the role of “unbiased arbiter”. Otherwise, if we are to be the arbiters of truth and falsehood, I fear we may be going down the road to create our own, modernized form of the “Index of Forbidden Books” (http://en.wikipedia.org/wiki/Index_Librorum_Prohibitorum)
Here are five more exciting things:
- Agenda announced for DAMNY. Super early bird discount ends tomorrow!
- Really cool video about the photos in the Hulton Archive. See a real card catalog and learn about wacky research requests.
- And the Hulton Archive landing page on Getty Images.
- Trying to understand Linked Data like I am? These tutorials are helpful.
- What is an ontology versus a controlled…
I’ve been back from Chicago for just over a week now, but I’m still reflecting on a very successful Jane-athon pre-conference the Friday before Midwinter. And the good news is that our participant survey responses agree with the “successful” part, plus contain a lot of food for thought going forward. More about that later …
There was a lot of buzz in the Jane-athon room that day, primarily from the enthusiastic participants, working together at tables, definitely having the fun we promised. Afterwards, the buzz came from those who wished they’d been there (many on Twitter #Janeathon) and others who wanted us to promise to do it again. Rest assured–we’re planning on another one in San Francisco at ALA Annual, but it will probably be somewhat different, because by then we’ll have a better support infrastructure and will be able to be more concrete about the question of ‘what do you do with the data once you have it?’ If you’re particularly interested in that question, keep an eye on the rballs.info site, where new resources and improvements will be announced.
Rballs? What the heck are those? Originally they were meant to be ‘RIMMF-balls’, but then we started talking about ‘resource-balls’, and other such wanderings. The ‘ball’ part was suggested by ‘tar-balls’ and ‘mudballs’ (mudball was a term of derision in the old MARBI days, but Jon and I started using it more generally when we were working on aggregated records in NSDL).
So, how did we come up with such a crazy idea as a Jane-athon anyway? The idea came from Deborah Fritz, who’d been teaching about RDA for some time, plus working with her husband Richard on the RIMMF (RDA In Many Metadata Formats) tool, which is designed to allow creation of RDA data and export to RDF. The tool was upgraded to version 3 for the Jane-athon, and Deborah added some tutorials so that Jane-athon participants could get some practice with RIMMF beforehand (she also did online sessions for team leaders and coaches).
Deborah and I had discussed many times the frustration we shared with the ‘sage on the stage’ model of training, which left attendees at such events unhappy with its limitations. They wanted something concrete–they usually said–something they could get their teeth into, something that would help them visualize RDA out of the context of MARC. The Jane-athon idea promised to do just that.
I had done a prototype session of the Jane-athon with some librarians from the University of Hawaii (Nancy Sack did a great job organizing everything, even though a dodgy plane made me a day late to the party!) We got some very useful evaluations from that group, and those contributed to the success of the official Chicago debut.
So a crazy idea, bolstered by a lot of work and a whole lot of organizational effort, actually happened, and was even better than we’d dared to hope. There was a certain chaos on the day, which most people accepted with equanimity, and an awful lot of learning of the best kind. The event couldn’t have happened without Deborah and Richard Fritz, Gordon Dunsire, and Jon Phipps, each of whom had a part to play. Jamie Hennelly from ALA Publishing was instrumental in making the event happen, despite his reservations about herding the organizer cats.
And, as the cherry on top: after the five organizers finished their celebratory dinner later that evening, we were all out on the sidewalk looking for cabs. A long black limousine pulled up, and the driver asked us if we wanted a ride. Needless to say, we did, and soon pulled up in style in front of the Hyatt Regency on Wacker. Sadly, there was no one we knew at the front of the hotel, but many looked askance at the somewhat scruffy mob who piled out of the limo, no doubt wondering who the heck we were.
What’s up next? We think we’re on the path of a new data sharing paradigm, and we’ll run with that for the next few months, and maybe riff on that in San Francisco. Stay tuned! And do download a copy of RIMMF and play–there are rballs to look at and use for your purposes.
P.S. A report of the evaluation survey will be on RDA-L sometime next week.
I’m happy to announce that I’ve accepted a position at The UCI Libraries as Head, E-Research & Digital Scholarship Services. Today is my last day at Caltech. I will resume my professional blogging at infodiva.com, where I’ve maintained a presence since 1998. My long-time host recently decided to close shop, so I’m in the process of migrating nearly 20 years of server cruft. Everything should be ready to launch by March, when I begin at UCI.
Thanks for the connection here during my 7 years at Caltech; I look forward to continuing our conversations.