Planet Cataloging

May 17, 2012

025.431: The Dewey blog

Updates: Selected Topics in 297.1-.8 Islam

297.122    Koran
297.14      Religious and ceremonial laws and decisions
297.352    Mecca
297.81      Sunnites
297.82      Shiites

Updates for selected topics in 297.1-.8 Islam have been published in 200 Religion Class and in WebDewey.  The new provisions can be found by clicking "Updates" on the login page or other pages in WebDewey; the file is named 297_20120501_DDC23.  Most of the provisions were made available for review in an earlier version announced in a previous blog. The exception is an expansion to provide for two groups of Shiites:

297.8251 'Alawīs (Alawites)
297.8252 Alevis

Most of the new numbers will be used primarily by libraries with large collections on Islam.  That is probably true of the largest area of expansion and revision, for 297.122 Koran.  That is probably also true of the expansion at 297.14 Religious and ceremonial laws and decisions for schools of Islamic law.  The expansion, which can now be used for fiqh al-'ibādāt (rituals law), will also provide notation that can be used to facilitate work that is underway to make 340 Law more hospitable for Islamic law. The main change to 297.81 Sunnites is relocation of four groups that the Arabic translation team recommended be treated as schools of law rather than as Islamic sects; Hanafites, Shafiites, Malikites, and Hanbalites are now all classed in subdivisions of 297.14018 Schools of law.

The update includes a small expansion that may be more widely used, at 297.352 Mecca:

297.3524 Hajj
297.3525 Umrah (Lesser pilgrimage to Mecca)

Hajj Stories is an example of a work that can be classed in 297.3524 Hajj (or, since it is a juvenile work, it could be classed in the unchanged abridged number: 297.3 Islamic worship).  The Ultimate Guide to Umrah: Based on the Famous Book Getting the Best Out of Hajj: With Special Chapters on Umrah in Ramadaan and Visiting Madinah is an example of a work that can be classed in the new number 297.3525 Umrah (Lesser pilgrimage to Mecca).

by Juli at May 17, 2012 02:37 PM

May 16, 2012

First thus

Comment to: Improving the presentation of library data using FRBR and Linked data

Comment to:  Improving the presentation of library data using FRBR and Linked data by Anne-Lena Westrum, Asgeir Rekkavik and Kim Tallerås. Code4Lib Journal, Issue 16, 2012-02-03 This is a very interesting article for anyone interested in how the public works with library catalogs. One additional aspect that would be especially interesting would be a comparison among patrons for the result sets

by noreply@blogger.com (James Weinheimer) at May 16, 2012 03:03 PM

Re: [RDA-L] Part 2: Efficiency of DBMS operations Re: [RDA-L] [BIBFRAME] RDA, DBMS and RDF

Posting to RDA-L On 15/05/2012 17:53, Jonathan Rochkind wrote: <snip> Frankly, I no longer have much confidence that the library cataloging community is capable of any necessary changes in any kind of timeline fast enough to save us. Those that believe no significant changes to library cataloging or metadata practices are necessary will have a chance to see if they

by noreply@blogger.com (James Weinheimer) at May 16, 2012 07:50 AM

Re: [RDA-L] Part 2: Efficiency of DBMS operations Re: [RDA-L] [BIBFRAME] RDA, DBMS and RDF

Posting to RDA-L On 15/05/2012 16:50, Jonathan Rochkind wrote: <snip> I certainly agree that the way our data is currently recorded and maintained in MARC is not suitable for contemporary desired uses, as I've suggested many times before on this list and others and tried to explain why; it's got little to do with rdbms though. </snip> Although MARC needs to

by noreply@blogger.com (James Weinheimer) at May 16, 2012 07:48 AM

May 15, 2012

First thus

Re: [RDA-L] Part 2: Efficiency of DBMS operations Re: [RDA-L] [BIBFRAME] RDA, DBMS and RDF

Posting to RDA-L On 15/05/2012 02:52, Karen Coyle wrote: <snip> let's say you have a record with 3 subject headings: Working class -- France Working class -- Dwellings -- France Housing -- France In a card catalog, these would result in 3 separate cards and therefore should you look all through the subject card catalog you would see the book in question 3 times. In a keyword

by noreply@blogger.com (James Weinheimer) at May 15, 2012 09:44 AM

May 14, 2012

TSLL TechScans

Think Like a Startup

Mathews, Brian. Think like a startup: a white paper to inspire library entrepreneurialism. (3 April 2012). At: http://vtechworks.lib.vt.edu/handle/10919/18649

This is a working paper published in the VTechWorks digital repository. Brian Mathews is the Associate Dean for Learning & Outreach at Virginia Tech. Mathews intended to inspire transformative thinking using insight into startup culture and innovation methodologies. He stated that "We don’t just need change, we need breakthrough, paradigm-shifting, transformative, disruptive ideas." He listed his points in the summary section at the end of the article, including: "Launching a good idea is always better than not launching an awesome one," "The library is a platform, not a place, website, or person," "Libraries need less assessment and more R&D," "Good ideas are usable, feasible, and valuable," and "Build a strategic culture, not a strategic plan".

by noreply@blogger.com (Yumin Jiang) at May 14, 2012 09:57 PM

Evaluating Web-Scale Discovery Services

Hoeppner, Athena. "The Ins and Outs of Evaluating Web-Scale Discovery Services." Computers in Libraries 32(3)(April 2012). At: http://www.infotoday.com/cilmag/apr12/Hoeppner-Web-Scale-Discovery-Services.shtml

This introductory article explains web-scale discovery concepts and terminology, and provides a short checklist for examining the products. The products discussed include: EBSCO Discovery Service (EBSCO), Primo Central Index (Ex Libris), Summon (Serials Solutions), and WorldCat Local (OCLC).

Vaughan, Jason. "Investigations into Library Web-Scale Discovery Services." Information Technology and Libraries 31(1)(March 2012). At: http://ejournals.bc.edu/ojs/index.php/ital/article/view/1916

This article describes in depth how the University of Nevada Las Vegas Library investigated the discovery service tools. The Discovery Task Force conducted several internal staff surveys, prepared comprehensive question lists for vendors, organized onsite vendor visits, tracked product enhancements, and made a final recommendation. Many of the documents used in the process are available in the Appendix section.

by noreply@blogger.com (Yumin Jiang) at May 14, 2012 09:29 PM

Catalogue & Index Blog

CIG member is Brain of Britain

The winner of this year’s Brain of Britain is Ray Ward, a retired librarian and CIG member. Ray defeated three other finalists to win the long-running Radio 4 quiz in March. He is a veteran quizzer who has competed in Brain of Britain three times previously, as well as in Mastermind (twice) and University Challenge, and he has twice won Brain of Mensa. Well done, Ray, and belated congratulations!

by Esther Antoinette Arens at May 14, 2012 08:51 PM

Free Moth :: Flutterings

Sophistication of Library Resource Description Structures

This is a provocative statement by Ronald J. Murray writing with Barbara Tillett in their paper, “Cataloging Theory in Search of Graph Theory and Other Ivory Towers”:

“… library resource description structures — when teased out of their book and card and digital catalog implementations and treated as graphs — are arguably more sophisticated than those being explored in the World Wide Web Consortium’s (W3C) Library Linked Data initiative.”

Murray explores this idea in a series of presentation slides he’s posted on SlideShare.


by freemoth at May 14, 2012 02:49 PM

mod librarian

Metadata Monday: Top Tweets About Metadata from #DAMNY

The Henry Stewart Events Digital Asset Management conference in New York took place on May 10 and 11. As always, the topics looked salient, with a lot of focus on my favorite topics: metadata and creative workflow as related to DAM. Here is a synopsis of the best tweets from the two-day event, #DAMNY:

  • Seth Earley: content curation is key (along with metadata and taxonomies) to remove all the noise from the data...
  • Jake Athey: "You shouldn't plant it if you don't want to eat it" -- good gardening analogy for metadata...
  • John Horodyski: DAM needs to be mobile, current and linked
  • Theresa Regli: talking web 3.0 "data is the new oil" always enjoy his use of media and imagery (Mark Davey)
  • David Riecks: "Find the dancers and get the dancers on the dance floor" to infect adoption when implementing DAM
  • Seth Earley: Metadata must have a purpose
  • John Horodyski: DAM vendors...standards, standards, standards
  • John Horodyski: Metadata is cumulative

Sorry I missed it!

 


May 14, 2012 12:30 PM

First thus

Re: [RDA-L] RDA, DBMS and RDF

Posting to RDA-L On 13/05/2012 19:49, Karen Coyle wrote: <snip> All, After struggling for a long time with my frustration with the difficulties of dealing with MARC, FRBR and RDA concepts in the context of data management, I have done a blog post that explains some of my thinking on the topic: http://kcoyle.blogspot.com/2012/05/rda-dbms-rdf.html

by noreply@blogger.com (James Weinheimer) at May 14, 2012 10:45 AM

Bibliographic Wilderness

RDFa :: HTML5 microdata

RDFa and HTML5 microdata are, I think, basically interchangeable.

RDF and microdata both use the same fundamental triple data model. Please note that schema.org is just a specific set of vocabularies that can be used with HTML5 microdata; HTML5 microdata goes beyond this. schema.org is a pretty good microdata tutorial though, if you remember you don’t have to use its vocabularies.  Here’s the actual microdata spec. Here’s a good microdata tutorial that pre-dates schema.org and is not schema.org-specific.

You can take pretty much anything that’s RDF, from any vocabularies, and use an RDFa-style approach to express (basically) the same semantics in HTML5 microdata instead.

This is a good thing for RDF, because there’s no good way to do RDFa in HTML (or anything but xHTML which is basically an abandoned approach — RDFa needs XML namespaces).  You can go from (any) html5 microdata to RDF too — although there are a couple gaps I’ll discuss at the end.

First, let’s show how you’d do RDFa-style RDF semantics expressed in HTML5 microdata. Let’s take the complete example from the RDFa wikipedia article, as it’s small but makes us actually use a pretty complete complement of microdata features. There are in fact a couple weird details I’m not sure about.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    version="XHTML+RDFa 1.0" xml:lang="en">
  <head>
    <title>John's Home Page</title>
    <base href="http://example.org/john-d/" />
    <meta property="dc:creator" content="Jonathan Doe" />
    <link rel="foaf:primaryTopic" href="http://example.org/john-d/#me" />
  </head>
  <body about="http://example.org/john-d/#me">
    <h1>John's Home Page</h1>
    <p>My name is <span property="foaf:nick">John D</span> and I like
      <a href="http://www.neubauten.org/" rel="foaf:interest"
        xml:lang="de">Einstürzende Neubauten</a>.
    </p>
    <p>
      My <span rel="foaf:interest" resource="urn:ISBN:0752820907">favorite
      book is the inspiring <span about="urn:ISBN:0752820907"><cite
      property="dc:title">Weaving the Web</cite> by
      <span property="dc:creator">Tim Berners-Lee</span></span>
     </span>
    </p>
  </body>
</html>

Here’s the same thing, using the same vocabularies, with HTML5 microdata. (Yes, contrary to some belief, you can mix and match more than one vocabulary in microdata too, although you’ve got to spell out the complete URI for all but one in any given scope.)

<html lang="en">
  <head>
    <title>John's Home Page</title>
    <base href="http://example.org/john-d/" />
    <link rel="http://xmlns.com/foaf/0.1/primaryTopic" href="http://example.org/john-d/#me" />
  </head>
  <body itemscope itemtype="http://purl.org/dc/elements/1.1/"  itemid="http://example.org/john-d/#me">
    <h1>John's Home Page</h1>
    <p>My name is <span itemprop="http://xmlns.com/foaf/0.1/nick">John D</span> and I like
      <a href="http://www.neubauten.org/" itemprop="http://xmlns.com/foaf/0.1/interest"
        lang="de">Einstürzende Neubauten</a>.
    </p>
    <p>
      <span itemscope itemtype="http://purl.org/dc/elements/1.1/" itemprop="http://xmlns.com/foaf/0.1/interest" itemid="urn:ISBN:0752820907">
      My favorite
      book is the inspiring <cite
      itemprop="title">Weaving the Web</cite> by
      <span itemprop="creator">Tim Berners-Lee</span>
     </span>
    </p>
  </body>
</html>

Mismatches and missing semantics

While the fundamental approach is compatible, there are a few mismatches, and some semantics are lost or less clear in html5 microdata. Here are some I feel like noting; there may be others.

  • I’m not sure I did the right thing with the <link> in the <head> section. HTML5 has an odd fork between maintaining more or less backwards compatibility with old-style <link> and <meta> (in <head>, using ‘rel’) and the microdata style (in <body>, using ‘itemprop’).  There are weird things with ‘rel’ only being allowed in certain places and ‘itemprop’ in others; you are also never supposed to have both ‘rel’ and ‘itemprop’.  So I’m not sure what the proper way is, in microdata, to express a relationship with the document as the subject; I may have done something wrong here.
  • RDFa uses XML’s namespaces to express vocabularies. RDFa’s namespace+name is analogous to microdata’s itemtype+itemprop.  But:
    • In microdata, you can do something dear to RDF’s heart and express the predicate URL as a literal absolute URL, which is what you have to do to mix namespaces/vocabularies, and that’s really just fine. You can also do the equivalent of a namespace (in an itemtype) and a non-URI bare name belonging to that namespace (in an itemprop), but you only get one namespace at a time like this.
    • But also, RDFa, via XML, is quite clear that you concatenate a namespace and a bare name to get the complete URI. We used this same convention when putting our RDFa into microdata, which works because itemtypes are always URIs too. But it’s just a convention; microdata isn’t clear about that, and microdata examples often use itemtype URIs that clearly weren’t intended to work like this. Take schema.org: itemtype="http://schema.org/Book" + itemprop="bookFormat" concatenated == "http://schema.org/BookbookFormat". Um, that’s not quite sensible, not what anyone’s looking for… although it is a legal URI….
  • microdata makes it a lot ‘easier’ to use what the RDFistas call ‘blank nodes’: nodes whose ‘subject’ lacks a specified URI. Idiomatic microdata actually generally has a bunch of those, including the top-level one(s).   The microdata spec tries to tell you that you can only use an `itemid` for certain vocabularies that establish its use; ideally, I think this would be opened up, and even encouraged. The semantics should be made more clearly compatible with RDF: the itemid is an identifier for the ‘itemscope’d thing, that is, the ‘subject’ URI of any itemprops in that itemscope, and that should be made clear.
    • I personally think allowing idiomatic blank nodes is a good thing for microdata, making it more usable, letting people get started with the minimal semantics for their use cases, not making them spend time on metadata design/control they don’t need yet.  Even if RDFistas disagree, I suggest they focus on making it easier to avoid blank nodes — more idiomatic, more encouraged by docs, more generally legal — and give up on making it hard or impossible to have blank nodes in html5 microdata.
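
The itemtype+itemprop convention discussed in the list above can be sketched in a few lines of Python. This is only an illustration of the convention used in this post, not something the microdata spec defines; the function name predicate_uri is made up for the example.

```python
from urllib.parse import urlparse


def predicate_uri(itemtype, itemprop):
    """Resolve an itemprop to a predicate URI (hypothetical helper).

    Follows the convention used in this post, which the microdata
    spec itself does not mandate.
    """
    # An itemprop that is already an absolute URL is used unchanged.
    if urlparse(itemprop).scheme:
        return itemprop
    # Otherwise treat the itemtype as a namespace and append the bare name.
    return itemtype + itemprop


dc = "http://purl.org/dc/elements/1.1/"
print(predicate_uri(dc, "title"))
print(predicate_uri(dc, "http://xmlns.com/foaf/0.1/nick"))
# The schema.org case from the list above: legal, but not what anyone wants.
print(predicate_uri("http://schema.org/Book", "bookFormat"))
```

The last call reproduces the awkward "BookbookFormat" result, which is why the concatenation rule only works when the itemtype was chosen with it in mind.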

Whither RDF/RDFa

(That’s “whither”, not “wither”. Hopefully).

There are probably other rough spots than the ones I’ve identified. And the ones I mentioned include some tough ones (the itemtype+itemprop==URI issue).

But by and large, HTML5 microdata’s fundamental model is RDF compatible.  Hopefully the RDFistas are focused on figuring out how to lessen the impedance mismatches, if necessary by lobbying the html5 working group to make minimal interventions.  Hopefully they’re not still stuck on an xhtml/rdfa/why-didn’t-they-do-things-our-way train, because that train isn’t leaving the station.  Instead though, they can contribute to sanding off a few rough spots in microdata to make it quite capable of doing what they want (and, if they’re right, what everyone else will eventually realize they want too). Work on tools to turn microdata to RDF, too, hopefully.

microdata could actually be a great thing for RDF.  If handled correctly, it should be possible to express full RDF semantics in microdata; microdata can be the RDF-in-HTML-markup standard that RDFa wanted to be. (microdata’s designers clearly knew about RDF/RDFa and were influenced by it.) It’s also possible to leave a lot of semantics out when writing microdata, but often in ways you could do with RDF/RDFa too (lots of blank nodes, etc.); RDF/RDFa just tries to make it inconvenient and non-idiomatic.

While the RDFistas may be rueing that microdata makes it so easy to not have completely specified triples with no blank nodes everywhere, I think the flip side of this is actually what will allow it to possibly get more uptake, and be an easy start on the road to RDF, if RDF plays its cards right.   That’s because you have to think through the complete vocabularies and semantics less: you can get started with just the semantics you need, and not be forced to do more up-front metadata design than you need for your identified use cases, or more than you can afford or have the skills to do. That, and some of the immediate use cases in ‘Google will use it!’ of course. But if Google had tried to say they used RDFa (didn’t they once, maybe, sort of?), I don’t think it would have gone anywhere; RDFa is just too overwhelming.


Filed under: General

by jrochkind at May 14, 2012 05:38 AM

May 13, 2012

Coyle's InFormation

RDA, DBMS, RDF

I have written before about some issues relating to RDA and RDF. Today I want to look at some things that should cause us to question the concept of "RDA in RDF."

For many decades we have been using relational databases to store our bibliographic data, bibliographic data that we create and exchange using the MARC format. Doing so was not by any means natural or intuitive because there is nothing about the structure or content of the MARC record that lends itself to being stored and managed in a relational database. The results were often awkward, inefficient, and unsatisfying.

Part of the reason for this is the unitary and flat nature of MARC. In spite of the long history of creating separate authority files, each MARC record is a complete and closed document with no actual connections to data outside of itself. While some database implementations for MARC do create relational tables for headings, the degree to which a MARC record can be separated out into tables is minimal and gains us very little in terms of the functionality of an RDBMS.

The underlying problem, however, is not in the structure of the MARC record but in the content of our catalog records. Moving from the card to a database for our data requires more than adding mark-up coding around the catalog data; to do so successfully requires re-thinking the data in terms of relational database principles. There are two basic principles to relational database design: repetition and combination.

To design for relational databases you look at your data to see what elements will be repeated in many different records. Rather than carrying those data elements in multiple records, you create a separate database table for each repeating element, and you store that element once. For example, if you are creating a database of mailing addresses, you see quickly that elements like state and zip code will appear in multiple records. You therefore create a table of state names and one of zip codes, and perhaps even one that links zip codes to city names. In this way, your database carries the string "Mississippi" only once, and that string is replaced in the records with a database pointer that uses much less internal storage. Ditto the zip code. And if the zip code is associated in a table with a city name, you also only store city names once, and each address record needs only a pointer to the zip code, not a city name. In fact, with a zip code you can get the city and state, and your design might look like:



In this way you have saved a huge amount of storage space. You have also made selection of your records on zip code, city and state much more efficient than if they were stored in every address record, because a search on a zip code, for example, retrieves a single entry in the zip code table, and that entry has database-managed links to the relevant records.
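
The address design described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under my own assumptions: the table and column names, and the two sample address rows, are invented for the example and do not come from the original diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE zip     (code TEXT PRIMARY KEY, city TEXT,
                      state_id INTEGER REFERENCES state(id));
CREATE TABLE address (id INTEGER PRIMARY KEY, name TEXT, street TEXT,
                      zip_code TEXT REFERENCES zip(code));
""")

# The string "Mississippi" is stored exactly once.
conn.execute("INSERT INTO state VALUES (1, 'Mississippi')")
# The zip code row carries the city and points at the state.
conn.execute("INSERT INTO zip VALUES ('39201', 'Jackson', 1)")
# Address rows (invented sample data) carry only a pointer to the zip code.
conn.executemany(
    "INSERT INTO address (name, street, zip_code) VALUES (?, ?, ?)",
    [("A. Reader", "1 Main St", "39201"),
     ("B. Writer", "2 Oak Ave", "39201")],
)

# City and state are recovered through the zip code, not stored per address.
rows = conn.execute("""
    SELECT a.name, z.city, s.name
    FROM address a
    JOIN zip z   ON a.zip_code = z.code
    JOIN state s ON z.state_id = s.id
    WHERE s.name = 'Mississippi'
    ORDER BY a.name
""").fetchall()
print(rows)
```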

In a database of customer orders that has your inventory information along with customer addresses, you use the tables in your database to search for things like "all customers in Mississippi who have ordered WidgetX in the last six months." Information about your inventory and information about purchases are all in appropriate sets of tables in your database and you can combine the data elements to develop different views of the data.

Where the goal in relational database design is to identify and isolate data elements that are the same, the goal in library cataloging data is exactly the opposite: headings are developed primarily to differentiate at the data creation point rather than allow combination within the database management system. The goal is to have each data point be as unique as possible and to be assigned to as few records as possible. Thus, library cataloging creates headings whose purpose is to distinguish between entries:

Shakespeare, William, 1564-1616. As you like it
Shakespeare, William, 1564-1616. As you like it. 1905
Shakespeare, William, 1564-1616. As you like it. 1911.
Shakespeare, William, 1564-1616. As you like it. 1919.
Shakespeare, William, 1564-1616. As you like it. Czech
Shakespeare, William, 1564-1616. As you like it. French

These headings are counter to the functioning of a database management system. If moved to a database table to facilitate retrieval, they will each point to only one or a very small number of records. This negates the space-saving aspect of database management, and it does not facilitate combination of data elements for retrieval. In the case of headings, the combination of elements is pre-coordinated in the data, rather than post-coordinated in the database retrieval function.

A database approach might break this data into four tables:




In this way one could search for this data by title, by title + author, date + language, or by any other combination of these four data elements. To search the library headings as anything but a single keyworded string, that is to use these headings to perform searches on title or date or language, would be incredibly inefficient. The upshot is that library headings are not "relational" and do not contribute to the functionality that database management systems can provide. Instead, database management systems make use of the separate coded elements, such as date and language, for combinatorial retrieval. Names and titles, because they are text strings and do not have an identified presence in the stored records, must be searched separately rather than being available for relational combination. The results of this type of searching are less than optimal in speed and accuracy.
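
As a rough sketch of the post-coordinated approach, the Shakespeare headings above could be broken into author, title, date and language tables, again using Python's sqlite3. The table names, the extra linking table called record, and the use of NULL where a heading carries no date or language are all assumptions made for this illustration, not the original diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author   (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE title    (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE pub_date (id INTEGER PRIMARY KEY, year INTEGER UNIQUE);
CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
-- Hypothetical linking table: one row per catalog record.
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    author_id   INTEGER REFERENCES author(id),
    title_id    INTEGER REFERENCES title(id),
    date_id     INTEGER REFERENCES pub_date(id),
    language_id INTEGER REFERENCES language(id)
);
""")
conn.execute("INSERT INTO author (name) VALUES ('Shakespeare, William, 1564-1616')")
conn.execute("INSERT INTO title (title) VALUES ('As you like it')")

# (year, language) qualifiers taken from the six headings listed earlier.
qualifiers = [(None, None), (1905, None), (1911, None),
              (1919, None), (None, "Czech"), (None, "French")]
for year, lang in qualifiers:
    date_id = lang_id = None
    if year is not None:
        conn.execute("INSERT OR IGNORE INTO pub_date (year) VALUES (?)", (year,))
        date_id = conn.execute(
            "SELECT id FROM pub_date WHERE year = ?", (year,)).fetchone()[0]
    if lang is not None:
        conn.execute("INSERT OR IGNORE INTO language (name) VALUES (?)", (lang,))
        lang_id = conn.execute(
            "SELECT id FROM language WHERE name = ?", (lang,)).fetchone()[0]
    conn.execute(
        "INSERT INTO record (author_id, title_id, date_id, language_id) "
        "VALUES (1, 1, ?, ?)", (date_id, lang_id))


def count_records(where, params=()):
    """Count records matching any combination of the four elements."""
    sql = """SELECT COUNT(*) FROM record r
             JOIN author a ON r.author_id = a.id
             JOIN title t ON r.title_id = t.id
             LEFT JOIN pub_date d ON r.date_id = d.id
             LEFT JOIN language l ON r.language_id = l.id
             WHERE """ + where
    return conn.execute(sql, params).fetchone()[0]


by_title = count_records("t.title = ?", ("As you like it",))
by_title_lang = count_records("t.title = ? AND l.name = ?",
                              ("As you like it", "French"))
by_date = count_records("d.year = ?", (1911,))
print(by_title, by_title_lang, by_date)
```

The author and title strings are each stored once, and any combination of the four elements can be used in the WHERE clause without parsing a heading string.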

All of this may seem obvious to some of you, so you may be asking yourselves why I bring this up. I bring it up because even though RDA claims to have as its goal the creation of records in a relational design (see scenario one in this JSC document), it continues to instruct catalogers to create pre-coordinated headings like the ones above. Not only will these not be efficient or fruitful in a relational database, this brings into question whether RDA is truly modeled on the principles it claims to embrace. If it is not, we have cause to worry: we cannot move forward with data that does not conform to a modern model.

Note that in this post I have been emphasizing the use of relational database design for the data. The current plans for a new bibliographic framework appear to plan to create a data model for RDA that is based on semantic web principles. Those principles are yet another significant evolution following on the database model, which is now considered waning technology. Other communities, ones that have been designing for database management requirements for their data for decades, are now looking at ways to transform that data to RDF. It is possible that we can skip the relational database phase of our data development and move directly into a semantic web model. However,  to think that data created following RDA instructions, which is not even suitable for a relational database, could be made usable on the semantic web without major modifications is simply wrong. If we create a bibliographic framework that takes RDA as it has been described and ports that, unchanged, to RDF we will create a data model that does not serve us, does not serve our users, and that cannot reasonably interact with other linked data on the web.

What we need is an analysis of our data, not a transformation of it "as is" to a new technology. If we aren't ready to admit that some traditional practices, like headings, are no longer useful or usable in today's technological environment, we cannot have any hope that our data will be relevant in the future.

(p.s. I anticipate that someone will state that headings are needed for alphabetical displays, to "collocate" the records. To that I reply: 1) you can do the same collocation using the data elements, and in fact you could devise multiple collocations by combining the elements in different ways and 2) a linear, alphabetic display is so anachronistic with today's technology, and so seldom used when available, that it is hard to justify the use of human catalogers to create these fields. If you still believe that library records must contain hand-crafted headings, all I can say is: you can believe what you want, but some of us will be exploring other solutions.)

by Karen Coyle (noreply@blogger.com) at May 13, 2012 11:42 AM

May 12, 2012

Metadata Matters (Diane Hillmann)

Using the sub-property ladder

I discussed the utility of the sub-property relationship in Getting to higher MARC branches, Netting more MARC fruit, and Adding MARC fruit to the cornucopia. Coincidentally, Bob DuCharme posted Simple federated queries with RDF which outlines the same technique and provides additional information on its use for resource discovery. Those posts are somewhat technical, and I tried to lighten up in my presentation Turtle dreaming at the recent Dublin Core Metadata Initiative (DCMI) seminar Five years on. This post is another attempt to demonstrate in a non-technical way (I hope) how useful and powerful the sub-property relationship can be.

A metadata attribute, like ‘title’, that is to be used for linked data in the Semantic Web is usually represented in Resource Description Framework (RDF) as a property. A property can be used as the predicate part of a triple: “Subject – predicate – object”, where ‘Subject’ is what the triple is about (e.g. a resource), ‘predicate’ is the aspect of the subject, and ‘object’ is the value of that aspect for the specified subject. For example:

“This resource – (has) title – ‘Using the sub-property ladder’”

is a single metadata statement in triple format. We can think of this as conforming to the triple template:

“Specified resource – (has) attribute – value”.

Note that prefixing the predicate with ‘has’ turns it into a verbal phrase and renders the statement in (near) natural language.

We can also make meta-metadata statements in triple format. These are ‘data about metadata’ rather than ‘data about data’, and are often referred to as ontological triples to distinguish them from data triples such as the example above. The triple template for one type of meta-metadata statement is:

“Specified RDF element – (has) relationship – Other specified RDF element”

Note that a relationship between metadata elements is also represented in RDF as a property. In particular, ‘sub-property’ is a pre-defined relationship between two RDF property elements, giving the ontological triple:

“Property 1 – (is) sub-property of – Property 2”

Furthermore, such relationships can embed semantic rules that can be processed automatically by software known as ‘semantic reasoners’ or just plain ‘reasoners’. The rule embedded in the sub-property relationship is: If “P1 – (is) sub-property of – P2”, then any data triple using P1 as its predicate can generate another data triple using P2 as its predicate, with the same subject and object. Let’s call this kind of ontological triple a mapping triple, because it effectively maps one property to another.

Suppose we have two attributes ‘title’ and ‘varying form of title’. I can create the mapping triple:

“‘varying form of title’ – (is) sub-property of – ‘title’”.

If we have a data triple:

“This resource – (has) varying form of title – ‘Pat presents cataloguing for beginners’”

then a reasoner will automatically generate the data triple:

“This resource – (has) title – ‘Pat presents cataloguing for beginners’”

In a similar way, we can create the mapping triple:

“‘title statement’ – (is) sub-property of – ‘title’”

and from the data triple:

“This resource – (has) title statement – ‘Cataloguing for beginners’”

generate:

“This resource – (has) title – ‘Cataloguing for beginners’”

So what? Further suppose that the ‘title’ attribute is from the DCMI metadata terms, and the ‘varying form of title’ and ‘title statement’ attributes are from the MARC21 tags 245 and 246. So a MARC21 record for the resource might contain the set of data triples:

“This resource – (has) 245 [title statement] – ‘Cataloguing for beginners’”
“This resource – (has) 246 [varying form of title] – ‘Pat presents cataloguing for beginners’”

A reasoner can generate the set of data triples:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

In other words, we have generated a DC record from a MARC21 record. Or we have generated a title index for the MARC21 record. Or both.
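
For readers who prefer code, the MARC21-to-DC step above can be sketched as a toy reasoner in Python. It applies only the single sub-property rule described in this post; the identifiers such as "marc21:245" and "dc:title" are shorthand labels invented for the example, not real property URIs, and a real reasoner does far more than this.

```python
# Mapping triples: sub-property -> super-property (labels are illustrative).
SUB_PROPERTY_OF = {
    "marc21:245": "dc:title",  # title statement
    "marc21:246": "dc:title",  # varying form of title
}

# Data triples: (subject, predicate, object).
data = [
    ("this resource", "marc21:245", "Cataloguing for beginners"),
    ("this resource", "marc21:246", "Pat presents cataloguing for beginners"),
]


def infer(triples, mappings):
    """Generate one broader triple for each triple whose predicate is mapped."""
    inferred = []
    for subject, predicate, obj in triples:
        if predicate in mappings:
            # Same subject and object, broader predicate.
            inferred.append((subject, mappings[predicate], obj))
    return inferred


for triple in infer(data, SUB_PROPERTY_OF):
    print(triple)
```

The original triples are left untouched, so nothing is lost by generating the broader ones.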

Let’s add an RDA attribute and an ISBD attribute mapping to the mix:

“[RDA] ‘title proper’ – (is) sub-property of – [DC] ‘title’”
“[ISBD] ‘has title proper’ – (is) sub-property of – [DC] ‘title’”

The data triples:

“That resource – (has) [RDA] title proper – ‘Cataloguing for geeks’”
“Another resource – [ISBD] has title proper – ‘Does cataloguing have a future?’”

can generate the corresponding DC triples, and we end up with:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“That resource – (has) [DC] title – ‘Cataloguing for geeks’”
“Another resource – (has) [DC] title – ‘Does cataloguing have a future?’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

So now we have a title index to metadata from multiple heterogeneous sources. And the beginnings of a set of records in Dublin Core format.

Note that the attribute which is the sub-property must be entirely narrower in its semantics than the related super-property. If we create the mapping triple:

“‘title’ – (is) sub-property of – ‘varying form of title’”

then we generate the data triple:

“This resource – (has) 246 [varying form of title] – ‘Cataloguing for beginners’”

which is incorrect.

As a result, a data triple generated by a sub-property mapping triple is usually ‘dumber’ than the original data triple; detail is lost because the generated triple uses an attribute which is broader in meaning than the original. This ‘dumbing-up’ is necessary to produce interoperable metadata from different schemas – but data is not permanently lost because the original triple is still available for use in other applications. Needless to say, data triples created with broad attributes cannot be “smartened-down”, at least on their own.

The sub-property relationship can be chained. We can create a new attribute property, MARC21 ‘title’, which could be used in an application for making a title index to MARC21 records, as already mentioned. This new attribute is a super-property of all the MARC21 title-type attributes, and is also a sub-property of the DC ‘title’ attribute:

“[MARC21] ‘title statement’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘varying form of title’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘title’ – (is) sub-property of – [DC] ‘title’”

Doing this does not affect the previous mapping triples relating each MARC21 title-type attribute directly to the DC ‘title’ attribute, although it  makes them redundant because this new set of mapping triples generates exactly the same data triples at the DC level from the MARC21 originals.
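As a rough sketch of how such a chain could be followed, the Ruby below walks the sub-property links transitively. The property labels and the helper names `super_properties` and `infer` are assumptions made for this example; a real reasoner would work on RDF graphs rather than hashes and arrays.

```ruby
# Chained mapping triples: sub-property => list of direct super-properties.
# Labels are illustrative only.
CHAIN = {
  "marc21:title statement"       => ["marc21:title"],
  "marc21:varying form of title" => ["marc21:title"],
  "marc21:title"                 => ["dc:title"]
}.freeze

# Collect every super-property reachable from +property+ by following
# the chain; +seen+ guards against visiting a property twice.
def super_properties(property, mappings, seen = [])
  mappings.fetch(property, []).each do |parent|
    next if seen.include?(parent)
    seen << parent
    super_properties(parent, mappings, seen)
  end
  seen
end

# Generate one broader triple per reachable super-property.
def infer(triples, mappings)
  triples.flat_map do |subject, property, value|
    super_properties(property, mappings).map { |broader| [subject, broader, value] }
  end
end

INFERRED = infer(
  [["This resource", "marc21:title statement", "Cataloguing for beginners"]],
  CHAIN
)
```

Here `INFERRED` contains both a `marc21:title` triple and a `dc:title` triple for the same value, so the direct MARC21-to-DC mappings add nothing new once the chain exists.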

Different applications can therefore re-use and, if necessary, augment the sub-property chains for each of the high-level core attributes found in most bibliographic metadata schemas, such as title, author/creator/agent, subject, target audience, etc. The chains form a net(work) of mappings, or map, which can automatically dumb-up triples from any level of semantic granularity to any higher level.

We should only have to publish such maps or part-maps once, openly so that anyone can use them and add to them. If the professional communities develop the maps first, much effort will be saved and much authority imparted. This requires collaboration and action real soon now – the ISBD Review Group and the Joint Steering Committee for Development of RDA have started with the development of a mapping between the ISBD and RDA element sets.

These maps should remain valid forever, so the effort is worth expending. The original data triples use the original properties based on the schema attributes at the time and they will be valid “for their time”, in the same way that many catalogues are likely to contain records created under the Anglo-American Cataloguing Rules, with its ‘general material designation’ attribute long after the successor standard RDA: resource description and access has been adopted with its ‘content type’ and ‘carrier type’ attributes.

And mappings from the MARC21 element sets will show, we hope, that it may not be necessary to convert the entire contents of every MARC21 record as a result of the Bibliographic Framework Transition Initiative!

But the professional communities lack a framework to help them collaborate as a super-community. A network of mappings is more (socially) efficient than an aggregation of one-to-one mappings between pairs of schemas. We need (name)spaces to add intermediary attribute properties and publish the mappings; we need protocols for managing semantic change as schemas evolve; we need lightweight protocols for authorizing mappings; we need systems for ensuring the long-term preservation of RDF element sets and mapping triples.

by Gordon Dunsire at May 12, 2012 06:17 PM

May 11, 2012

Managing Metadata

Return to work

Milagros Valdes, my beautiful wife of 11 years, lost her battle with a rare and vicious form of breast cancer on April 15, 2012.  I want to thank all of my colleagues at Caltech for their understanding and support during this most difficult time.  I really do have the best job and coworkers imaginable.  I also want to thank my professional friends who sent so many kind messages and care packages.  Your support helped me and my family enormously. I surely could not have coped without my community.   Many hugs are coming your way next time we see one another at meetings/conferences.  I am deeply humble and grateful.

by laura at May 11, 2012 05:23 PM

May 10, 2012

Bibliographic Wilderness

The Semi-Isolated Rails Engine

(All of this is accurate, so far as I know, in Rails 3.2.3. If you are reading this later in future Rails versions, mileage may vary).

Rails 3 introduced plugins-as-gems, and the special case of Engines. An Engine is basically a library of code that can define its own views, controllers, models, assets, etc., in its own codebase, that will be available to the Rails app. (An Engine doesn’t actually need to be defined in its own gem; it can be defined anywhere that ends up in the load path, but its own gem is typical.)  You can have Rails generate a skeleton for an Engine plugin as a gem, with `rails plugin new enginename --full`.  (Without the `--full`, it’d be a less powerful plugin without full Engine features — actually it ends up being pretty much just an ordinary gem).

A “plain” engine (as opposed to the ‘isolated’ engine we’ll discuss later) basically “inserts” controllers, views, and models into the host app — they’re added to the load paths as part of the host app, the same as any locally defined controller, view or model.

Additionally, routes defined in your $engine/config/routes.rb will be automatically included in the host app. I’m not sure if they’ll be included before or after host app routes; route definition order matters in Rails3 routing.

Name collisions?

If there’s a name collision, the thing with the same name in the host app will usually ‘win’, and the one in the gem will be inaccessible to most code (in gem or in host app).  If there’s a name collision between two gems, it probably depends on load order (what order they’re referenced in the Gemfile, usually).

This is pretty much what you’d expect to happen, so long as the host app version really wins, I think it’s “right”.  (With helpers specifically, things can sometimes get confusing and not behave how you expect. I now can’t find the message I think I sent to the rails-user listserv on this at some point, and maybe it’s been changed/fixed in recent versions of rails.)

You can put your models, views, and controllers in module namespaces exactly the same as if you were adding them to any Rails app, in order to try and prevent namespace collision. They’ll work exactly the same way — the point of an Engine is that the stuff in an engine is in the host app’s load paths just the same as if it was really in the host app source locations.

Avoiding routing name collisions can be handled the same way, in a ‘plain’ engine, using the Rails3 router :namespace function, or any of the other related router functions (:as, :module, :path, etc.)

Some Engines handle routing by not including routes in $engine/config/routes.rb, where they’ll be automatically loaded by Rails, but instead loading routes into the host app using their own logic, so it can be done just so. This is especially useful for routes that should be changed by host app configuration. For instance, Devise and its `devise_for` method, which the host app calls manually in its own routes.rb.

Isolated Engines: Rails 3.1

Rails 3.1 introduced the “isolate_namespace” directive, which you can add to your engine module.

The one main effect this has is actually on routing. $engine/config/routes.rb are not added to the host app’s routing.  Instead, Rails creates a little Rack mini-app out of your engine (or maybe any Engine already is this?), with your engine’s routing in it, so that host app can mount the Engine into the host app’s own routing, using the standard Rails routing ‘mount’ directive for Rack apps. See the Engines guide (or the edgeguide version, with slightly expanded information).

It also makes the engine’s $config/routes.rb behave a bit differently as far as default routing params, assuming all routes are :namespace’d, making sure the routing helper methods are available to your Engine’s controllers and helpers (and at the right method names), etc.

On top of this, it changes how rails generators work inside your engine. You can use rails generators inside an engine to add controllers and models. In a ‘plain’ engine, if you call `rails generate controller foo`, it’ll add an $engine/controllers/foo_controller.rb, just like any rails app.  It’ll add an `$engine/views/foo` directory and an `$engine/helpers/foo_helper.rb`. Just like an app.

In an Engine with `isolate_namespace`, if you call `rails generate controller foo`, it’ll namespace everything it generates for you:  `$engine/controllers/$enginename/foo_controller.rb` will contain a controller whose class is EngineName::FooController.  Similarly, the view folder goes in `$engine/views/$enginename/foo`, etc.

Isolated engines are convenient for many cases.  You can have Rails generate a new skeleton for an isolated engine with `rails plugin new enginename --full --mountable`

There’s one aspect of them, though, that you may or may not want — and is fortunately pretty easy to change, giving you what I’ll call a Semi-Isolated Engine.

More Isolation Than you Might Want: Controller inheritance

There’s one aspect of isolated engines that ends up being a bit confusing — It’s actually not caused by the `isolate_namespace` directive in the Engine, but purely by the Rails generators — in fact, purely by the `--mountable` arg to `rails plugin new engine_name --full --mountable`.

Let’s look at how controller inheritance works.

If you use `rails generate controller` to generate a controller in your engine, if you look at it you’ll see that it’s defined as < ApplicationController — inheriting from the class called ApplicationController — just like a controller in a normal app. But your engine gem doesn’t have an ApplicationController (at least it ought not to, at least not a top-level-namespace ::ApplicationController) — what’s it inheriting?  Well, it’s inheriting from the ApplicationController in whatever host app it happens to be running in.

This means common logic in the host app’s ApplicationController is available to engine controllers. (Say, a current_user? method; the engine would obviously need to document its conventions).  It also means all the helper methods loaded into the host app in a way that they apply to all controllers will be available to engine controllers/views.  It also means that, by default, the default rails template layout for controllers in the engine is the host app’s `application` layout — or any other default layout specified in the host app ApplicationController.

Sometimes that’s all actually nice, but sometimes you want more isolation. If you generate an engine with `rails plugin new x --full --mountable`, you get it.  But how you get it is a bit confusing at first.

mountable/isolated generation of Engine::ApplicationController

If you generate a `mountable` (ie, isolated) engine, and then you use `rails g controller` to generate a controller, you’ll see it’s still defined as `< ApplicationController`. And yet it doesn’t actually inherit the behavior of the host app ApplicationController — it’s got no logic from host app ApplicationController, no helpers, won’t find its layout, etc.

What’s going on? It’s a different ApplicationController.  When you generate an engine with `rails plugin new engine_name --full --mountable`, it generates an EngineName::ApplicationController to $engine/controllers/$engine_name/application_controller.rb.

Because of the way Rails constant lookup works, it’s finding this ApplicationController.

And it generated a layout in your engine too at $engine/views/layouts/$engine_name/application.html.erb.

That’s the layout used by all your engine controllers, by default too.
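Here’s a plain-Ruby sketch (no Rails involved) of the constant lookup being described. The classes are bare stand-ins I made up for the host app’s and the engine’s controllers, and `Widgetizer` is just an example engine name. Real Rails adds autoloading on top of this, which can complicate matters, so treat it as an illustration of lexical lookup only.

```ruby
# Stand-in for the host app's top-level ApplicationController.
class ApplicationController
  def self.origin
    "host app"
  end
end

module Widgetizer
  # Stand-in for the class the --mountable generator puts in the engine.
  class ApplicationController
    def self.origin
      "engine"
    end
  end

  # Written the way the generator writes it: no explicit namespace.
  # Ruby looks in the enclosing module first, so this resolves to
  # Widgetizer::ApplicationController.
  class FooController < ApplicationController
  end

  # The leading :: forces the top-level constant instead.
  class BarController < ::ApplicationController
  end
end
```

With these definitions `Widgetizer::FooController.origin` is `"engine"` while `Widgetizer::BarController.origin` is `"host app"`, even though both class definitions mention a constant called ApplicationController.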

multiple ApplicationController’s, really?

While this level of isolation is perhaps useful for many (most?) Engines, I question the decision to ‘override’ the ApplicationController class name and count on ruby constant-lookup in namespaces to get to the right one. ruby namespaced constant lookup is notoriously confusing, and changes from ruby version to version not always in documented ways.  I think it’s just asking for developer confusion and bugs.

Fortunately, it’s only a feature of the Rails generators (both `rails plugin new` and `rails generate controller` within an isolate_namespace engine). It’s got nothing to do with actual Rails runtime logic.

If you want to do it differently, no problem.  Go change $engine/controllers/$engine_name/application_controller.rb to, say, engine_name_controller.rb instead (renaming the class inside to match), and the layout to engine_name.html.erb.  All of your engine controllers should now be “< EngineNameController” instead of “< ApplicationController”.

You’ve got the exact same behavior, just with less confusing and error-prone names.

Sadly, `rails g controller` in an isolate_namespace engine will still generate “< ApplicationController”; you’ll have to manually change it each time you use the generator.

Now, for the Semi-Isolated Engine

Okay, now we can get to the actual point. While isolating controllers like this can be useful sometimes, sometimes it’s not. You might still want the routing isolation that “isolate_namespace” gives you, and the convenient change in behavior of the rails generators under that condition.

But you do want your engine controllers to inherit from the host app ApplicationController. No problem!  Just change that engine ‘main’ controller to “< ApplicationController”. You could do that even without the name change we discussed above, by properly scoping to top-level namespace, but that would lead to the confusing (but correct!) EngineName::ApplicationController < ::ApplicationController.

Less confusing if we changed the name as recommended above, say if your engine is the Widgetizer, Widgetizer::WidgetizerController < ApplicationController.

Now,

  • any logic in the host app ApplicationController is available in engine controllers.
  • Your engine controllers are by default using your engine’s ‘main’ layout instead of the host app’s — just delete the engine layout and they’re by default using the host app’s, that’s it!  (Delete $engine/app/views/layouts/widgetizer/application.html.erb, or $engine/app/views/layouts/widgetizer/widgetizer.html.erb if you changed the names as recommended).
  • If you have logic which you do want available to all engine controllers but shouldn’t be in the host app, just add it to your intermediary engine main controller, right? Because SomeEngineController < EngineController < ApplicationController. (With Rails 3.2+ hierarchical view lookup, all views can be looked up through this chain, not just layouts).
  • Because of isolate_namespace, the host app is still not automatically given the helpers in the engine — great! (If you want to manually expose engine helpers in the host app, see advice in the Engine Guide).
  • Helpers in the engine are a bit more confusing. Since engine controllers subclass the host app ApplicationController, helpers from the host app are available in engine controllers. In some cases this is useful, in most others it probably won’t cause a problem.
    • If there is name collision between helper methods in host app and engine, when called from within an engine controller, the engine helper method ‘wins’. Which is great. (The engine helper can even call ‘super’ to get access to the host app version, although there are few cases where an engine helper could rely on ‘super’ existing.) However, this is reliant on details of how and in what order Rails include’s helper modules into controllers, something that’s changed in past Rails versions; I’d be a bit cautious of relying on this continuing to work, sadly.
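To make the inheritance chain and the helper ‘super’ behavior concrete, here’s a plain-Ruby sketch with no Rails in it. Every class, module and method name below (`GadgetsController`, `HostHelper`, `EngineHelper`, `ViewContext`, `page_title`, and so on) is invented for the example, and the include order shown is an assumption that matches the behavior described above rather than a documented Rails guarantee.

```ruby
# Stand-in for the host app's ApplicationController.
class ApplicationController
  def current_user
    "alice"
  end
end

module Widgetizer
  # The engine's renamed 'main' controller, explicitly inheriting from
  # the host app's top-level ApplicationController.
  class WidgetizerController < ::ApplicationController
    def engine_setting
      "shared by all engine controllers"
    end
  end

  # An ordinary engine controller.
  class GadgetsController < WidgetizerController
  end
end

# Helper modules with a colliding method name.
module HostHelper
  def page_title
    "Host"
  end
end

module EngineHelper
  def page_title
    "Widgetizer / " + super   # reaches the HostHelper version
  end
end

class ViewContext
  include HostHelper
  include EngineHelper   # included later, so it is found first
end
```

A `Widgetizer::GadgetsController` instance responds to both `current_user` (from the host app) and `engine_setting` (from the engine), and `ViewContext.new.page_title` goes through the engine helper before falling back to the host one via `super`. Swap the two `include` lines and the host helper wins instead, which is why relying on the order is fragile.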

So there you have it, the “Semi-Isolated Rails Engine”, a design that works well for me for certain kinds of engines. It’s a testament to Rails 3.x’s nice, clean, flexible, consistent, well-designed architecture that we don’t need to fight with Rails’ actual runtime logic at all to do this, we don’t even need to change it, we just need to make different choices than the Rails engine generators make. If someone wanted to, they could even make their own generators that behaved this way for a ‘semi-isolated rails engine’.


Filed under: General

by jrochkind at May 10, 2012 08:27 PM

First thus

Re: [NGC4LIB] Google Penguin and SEO

Posting to NGC4LIB On 09/05/2012 21:18, john g marr wrote: <snip> On Wed, 9 May 2012, Laval Hunsucker wrote: James Weinheimer wrote : A search for "ebooks" has www.ebooks.com come up second to Project Gutenberg. Now that rather depends, doesn't it, on ( i.a. ) the who and when and from where of the search? True, but it is also indicative of the fact that search results

by noreply@blogger.com (James Weinheimer) at May 10, 2012 01:01 PM

mod librarian

5 Things Thursday: iPad Photo Management, Cute Animal Photos, Metadata

Here are five more things and then some:

  1. Read about Photos Pro for the iPad.
  2. Ever wonder where the cute animal photos come from and the copyright implications?
  3. If I lived in NY, I would go to Using Metadata for Audiovisual Collection Management.
  4. How to set up your home office for remote work.
  5. Check out CONTENTdm in action.

BONUS: check out The Signal, the LOC digital preservation publication.

Permalink | Leave a comment  »

May 10, 2012 12:30 PM

Catalogablog

Open Annotation Core Data Model

The W3C has published the Open Annotation Core Data Model.
Annotating, the act of creating associations between distinct pieces of information, is a pervasive activity online in many guises but lacks a structured approach. Web citizens make comments about online resources using either tools built in to the hosting web site, external web services, or the functionality of an annotation client. Comments about photos on Flickr, videos on YouTube, people's posts on Facebook, or mentions of resources on Twitter could all be considered as annotations associated with the resource being discussed. In addition, there are a plethora of closed and proprietary web-based "sticky note" systems, and stand-alone multimedia annotation systems. The primary complaint about all of these systems is that the user created annotations cannot be shared or reused, due to a deliberate "lock-in" strategy within the environments where they were created, or at the very least the lack of a common approach to expressing the annotations.

The Open Annotation data model provides an extensible, interoperable framework for expressing annotations such that they can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.
Seen on Digital Koans.

by David (noreply@blogger.com) at May 10, 2012 10:51 AM

May 09, 2012

Catalogue & Index Blog

CIG visit to Defra, London - 21st June 2012

The visit will start at 2.30 pm and finish by 4 pm.

What will the visit cover?
There will be a small tour of the physical library collection. The main part of the visit will be a presentation on cataloguing a special collection, with particular reference to electronic publications. The presentation will also cover terminology control through the library thesaurus.

Aim:
To explore the collection and approaches to cataloguing at Defra Library

Objectives of the visit:

  • To learn about specialist library collections
  • To explore cataloguing practices at a government library
  • To develop professional networks by meeting other library and information services workers

The visit is free of charge.

Places are limited to 10 and bookable on a first come first served basis.

For more information or to book a place please contact Birgit Fraser, email: birgit.fraser@anglia.ac.uk

by Birgit Juliane Fraser at May 09, 2012 09:15 PM

Bibliographic Wilderness

Rails Engine Guide

The Rails Guides are actually really good overview documentation. The days of saying Rails documentation is terrible are over, with the good guides, and good api-level docs too.

I knew there was a Plugin Guide, but I only just noticed there’s an Engine Guide too.

For reasons I don’t know, the Engines guide is not listed on the Guides home table of contents page, even though it’s available at guides.rubyonrails.org.   It also doesn’t google very well.

So here’s my part to publicize it. Both the Engines and Plugin guides are pretty good. They’re also overlapping in coverage. As the Engines guide says:

Engines are also closely related to plugins where the two share a common lib directory structure and are both generated using the rails plugin new generator. The difference being that an engine is considered a “full plugin” by Rails as indicated by the --full option that’s passed to the generator command, but this guide will refer to them simply as “engines” throughout. An engine can be a plugin, and a plugin can be an engine.

Maybe ideally they’d be merged, but they’re both good guides; you’ll probably want to review both if you’re working on Rails plugins OR engines.

Actually, you won’t find those exact words above in the current stable release of the Engines Guide, you’ll find them on edgeguides.rubyonrails.org instead.  I believe the actual guides are versioned with Rails releases — after one is released to guides.rubyonrails.org with the most recent rails version, it’s never changed (except perhaps for very serious errors), rather any changes will be released with the next Rails release. But you can preview them on edgeguides (I’m not sure if new ‘stable’ guides at guides.rubyonrails are pushed with new patch-releases of Rails, or just new minor-releases).

So you might find edgeguides contains some content that doesn’t apply to the current Rails release; but it may also contain improved explication or better examples or more coverage, as a result of contributors improving things. It’s worth checking edgeguides for a complicated topic. In this case, the Engines guide is rather improved in the ‘edge’ version as of this writing, and I don’t believe it includes anything that won’t work in Rails 3.2, it’s worth checking out.

A while ago I learned about the `rails plugin new plugin_name` command to give you a gem plugin skeleton, including a very useful dummy app for testing.  Before that I had been doing it by hand!   But this generates a very lightweight plugin; I was going in by hand and adding the files and methods, turning it into a heavyweight engine. I only just now, from the Engines Guide, noticed you can do `rails plugin new plugin_name --full` to get a fully engine-ized plugin.

Note on plugins ‘vs’. gems

It’s not a “vs.”.  Since Rails 3.0, Rails plugins could be distributed as gems. Since Rails 3.2, distributing a plugin as anything but a gem is deprecated — vendor/plugins is probably going to go away in future releases.

Although the great architecture of Rails 3 is such that it wouldn’t be that hard to put it back into a particular app yourself, if you really needed to for legacy purposes. But in general, you don’t want to; plugins-as-gems are great, making dependency management a lot more sane.

The current guides.rubyonrails.org plugin guide still gives you the option of a /vendor/plugin or a plugin-as-gem — it ought not to, since the current Rails version deprecates /vendor/plugin.  So don’t do that with a new plugin.

The edgeguide is more clear.  (I made the commit myself! Did you know anyone can commit to docrails github repo?  The commits are reviewed by editors before being merged into actual rails, and in the case of guides, eventually deployed to guides.rubyonrails).


Filed under: General

by jrochkind at May 09, 2012 06:49 PM

Catalogue & Index Blog

Cataloguing in joint library/record office facilities - can you advise?

Posted on behalf of Sue Peach, Local Studies Librarian, Derbyshire.

"Our County Local Studies Library is merging with Derbyshire Record Office. We are keeping our library staff and open access to bookstock. However, I'm looking, with the archivists, at the best ways to make our holdings known to users. Currently the more recent library stock is on the SirsiDynix Symphony library management system for Derbyshire libraries, but put on via Cataloguing staff, not Local Studies librarians. We also have a magnificent card catalogue/subject index of all Local's holdings, which we are keeping until a digitised system as good becomes available. Record Office holdings are on CALM.

Does anyone have experience of best practice re cataloguing in joint library/Record Office facilities?"

Please reply directly to Sue: Sue.Peach@derbyshire.gov.uk if you can help with any advice.

by Katrina Marie Louise Clifford at May 09, 2012 02:05 PM

First thus

Google Penguin and SEO

Posting to NGC4LIB   The blogosphere has been discussing the latest updates to the Google Search algorithm, these called "Google Panda 3.5" and "Google Penguin" announced April 24 of this year http://googlewebmastercentral.blogspot.co.uk/

by noreply@blogger.com (James Weinheimer) at May 09, 2012 02:01 PM

Catalogue & Index Blog

CILIP rep to Joint Steering Committee on RDA - last days for applications

There is still time to apply for the role of CILIP rep to the Joint Steering Committee for RDA.  The closing date is this Friday, 11th May.

CIG is seeking applications for a CILIP representative to the Joint Steering Committee on RDA.

Background

The JSC on RDA is the international committee for the development of RDA.  The candidate will represent the CILIP membership at an international level on the JSC.

The Committee conducts much of its business by e-mail, alongside regular committee meetings.  The workload is considerable and would require agreement from potential candidates' employers.

Person specification

Candidates are expected to have at a minimum:

Essential

  • In-depth knowledge of RDA
  • In-depth knowledge of FRBR and FRAD
  • Interest in the development of cataloguing rules
  • Commitment from employer to support time for attendance at meetings and response to consultations

Desirable

  • Knowledge and experience of other metadata standards
  • Knowledge and experience of standardization
  • Knowledge and experience of developing online tools

Application

Applicants should submit a c.v. with supporting statement to Stuart Hunt, Chair, CIG (stuart.hunt@warwick.ac.uk).  The deadline for applications is 11th May 2012.  Candidates should expect to be notified approximately two weeks after the closing date.

by Stuart William Hunt at May 09, 2012 07:09 AM

May 08, 2012

Bibliographic Wilderness

Reddit’s actual? (or a variation?) story ranking algorithm explained (significant typos in previously published version (or not))

So it turns out there’s a significant typo, that keeps the algorithm from working right, in the several previously blogged descriptions of reddit’s story-ranking algorithm.


update 6:28PM ET 8 May On reddit, someone with a flag suggesting they ought to know tells me I got it wrong and the original algorithm was correct and is used in production. All I can say is I can’t figure out how that could be; I could not get it to work in a non-reddit codebase, but I could get it to work with my amendment.

If I haven’t corrected a typo, then I guess I’ve derived my own variation which works a lot better for me (that is, works at all for me, but also seems to mimic reddit), in my own codebase. Good enough for me. If you are trying to reimplement this algorithm in a non-reddit codebase, I suspect you’ll find my investigation useful.

Now back to the blog post as originally written.


More oddly, this same significant typo is in the public version of reddit’s code released on github.

I’ve found myself finding joy in code-for-code’s-sake like I haven’t since past days of being an undergrad staying up all the night in the CS lab working on the most fun homework ever. And so I found myself staying up into the wee hours last night investigating reddit’s story ranking algorithm (the one used for stories/posts in the default ‘hot’ ranking,  that is time-of-posting sensitive. A different algorithm is used for comments).

The wrong algorithm

The (typo’d) algorithm is most nicely described by Amir Salihefendic. He even provides a python implementation. I figured I’d translate it to my preferred comfortable language ruby, and play around with it changing different parameters to get a feel for it, and get a feel of how it might be modified/tuned to behave somewhat differently if one wanted.

My assumption was that this algorithm outputs a number which can be stored in the database, and stories can be ordered purely by this number, to produce the on-page ranking.   This seems indeed to be true, although I was doubting it a bit in the middle. (Another nice thing about this particular algorithm, that everyone did catch on to even in the typo’d version, is that a ranking order calculation only needs to be done when a ‘vote’ happens, then it can be stored in the db unchanging forever (until another vote happens)).

So I translated Amir’s python to ruby and started playing, and the results made no sense.  They didn’t match how things work on reddit, and they didn’t result in any kind of useful ranking algorithm.

Users of reddit know that the story listings are mostly chronological. The vote count will change a story’s position somewhat, but not put it dozens of weeks or pages ahead of or behind its strictly chronological order.  But that’s what this algorithm did.  It also gave any story with a net-negative vote a negative score. Which would put all the net-positive vote stories before any of the net-negative voted stories. Which is not how reddit works.

Looking at the math again now, I’m kind of embarrassed I didn’t immediately see the problem, it’s not complicated math. But I didn’t, it just made my head swim. I’ll give you the relevant line from Amir’s python version here, maybe you can do better than I did, now that I’ve primed you:

return round(order + sign * seconds / 45000, 7)

Before I give you the answer, I’m going to tell you all the things I did first:

  • Went back and from scratch re-ported Amir’s python to ruby, doing as literal a translation as possible. Same nonsensical result.
  • Took the more mathematical description on Amir’s page (the one in a giant png? That he took from SEOmoz?), and implemented that in ruby. Same nonsensical results (modulo some new bugs I introduced).
  • Googled around for anyone else’s description of the algorithm to see if they had a different version, or explained it better, or explained how it fit into the context of the software as a whole (maybe I was wrong that it was supposed to produce the actual bare ordering number?). No dice.
  • Found the relevant source in the reddit github open repository (Amir’s link was broken as reddit moved their public source repo. Hooray github for being so easy to navigate on the web). Translated this to ruby. Same results.
  • Okay, noted that the reddit source mentions an “equivalent function in postgres”. Okay, found the SQL implementation, translated THAT to ruby, same results.

(Yep, the implementation in github public reddit has the typo, and is wrong!).

At this point, I gave up on understanding the reddit algorithm, I figured there was something I was missing (wrong, only thing I was missing was the typo). But, okay, I dove back into the math, trying to understand it and convert it to something that would work for me.

Take a moment to note lesson learned

Like many programmers, I rather like working from fixed assumptions and constraints, and building on top. This is kind of the nature of abstraction, don’t question the lower levels, take em as assumptions, don’t question em,  build upon them.

This is the second time recently that’s led me astray into butting my head against a dead end wall repeatedly, assuming the problem was in my own implementation or understanding, instead of in the framework code I was using, or the published algorithm or explanation I was working from.

Sometimes you’ve got to start questioning the validity of the algorithm you’re working from or the correctness of the library/framework code you’re using sooner rather than later, to save yourself some time. However, do it privately; if you start questioning your dependencies in public without evidence, everyone’s probably (rightly) going to tell you “Occam’s razor, the bug is probably in your code, not the dependency.”

The Fix

WRONG:  return round(order + sign * seconds / 45000, 7)
RIGHT:  return round(  (order * sign) + (seconds / 45000),   7)

Is it obvious now that you see it, that the first one makes no sense, but the second one does?  Maybe if you see it in context, here’s my ruby implementation of the corrected algorithm. 

I feel kind of stupid  for not noticing this right away; on the other hand, as far as I can find on google, nobody’s pointed out the typo bug before, and several have commented on the (wrong, typo buggy) algorithm.

The Explanation of the Algorithm

With the typo corrected, it’s much easier to explain the algorithm. The crucial line, from my ruby version, with variables named how I think is clearer:

return (displacement * sign.to_f) + ( epoch_seconds(date) / 45000 )

It plots each story on a fixed timeline by posting time, and then displaces a story on the timeline by its votes.  It uses only the vote difference between up and down for displacement; the total number of votes is irrelevant. First:

... + ( epoch_seconds(date) / 45000 )

This just plots each story on a fixed timeline, with distance between two stories always exactly proportional to difference between absolute posting time.

The `/45000` fixes the units of the timeline as “12.5-hour periods” (45000 seconds is 12.5 hours), rather than seconds. This reduces the order of magnitude of the units by 4.5ish, making them conveniently less likely to overflow wherever you’re keeping them. But more importantly, the choice of units determines how much displacement the actual votes will cause, so the two have to match appropriately. Then:

(displacement * sign.to_f) + ....

Here’s our displacement. `displacement` is based on the vote difference (up – down), but on a logarithmic scale.  The way the logarithmic scale is calculated, it loses the sign, so the sign just has to be added back in, so that net-down-votes will displace the story to be older on the timeline, and net-up-votes will displace the story to be newer on the timeline.

Why is a logarithmic scale used? Other explainers have said something like “to weight the first votes higher than the rest.”  While it might have this effect because of reddit voters’ behavior, this is a misleading explanation.  The algorithm pays no attention to which votes were made first, either in absolute chronological time or in sequence. It’s just vote-difference.  “10 up, 1 down” has exactly the same effect as “100 up, 91 down” or “1000 up, 991 down”.  And it doesn’t matter what order the ups and downs were placed.

The logarithmic scale is in fact used to prevent the displacement-from-voting from displacing the display order too much.  Reddit doesn’t want a very high or low voted story to be months ahead or behind, the reddit ‘hot’ order is mostly chronological, with just some displacement from votes.
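
A minimal Python sketch of the corrected line as this post describes it may make the two halves of the sum easier to see. The name `hot_position` and its parameters are made up for illustration, and the posting time is passed in directly as seconds since some fixed epoch instead of going through an `epoch_seconds(date)` helper; this is not reddit's code.

```python
from math import log10

SECONDS_PER_UNIT = 45000  # 45000 seconds = 12.5 hours per timeline unit


def hot_position(ups, downs, seconds):
    """Timeline position: posting time plus a signed, log-scaled vote displacement.

    `seconds` is the story's posting time in seconds since some fixed epoch.
    """
    diff = ups - downs
    # Log scale of the vote difference; max(..., 1) keeps log10 defined for 0.
    displacement = log10(max(abs(diff), 1))
    # The log loses the sign of the difference, so add it back in.
    sign = 1 if diff > 0 else -1 if diff < 0 else 0
    return displacement * sign + seconds / SECONDS_PER_UNIT


# Only the vote difference matters, not the total number of votes.
posted = 1336500000  # hypothetical posting time, in epoch seconds
same = {hot_position(10, 1, posted),
        hot_position(100, 91, posted),
        hot_position(1000, 991, posted)}
print(len(same))  # 1: all three stories land on the same spot
```

All three calls have a vote difference of 9, so they produce the same number, which is the point made above about “10 up, 1 down” versus “100 up, 91 down”.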

I don’t do this kind of mathematical analysis much, and don’t know how to get, say, R, to make you a pretty plot (it ought to be an actual function plot, not a bar graph, for explanatory power). So I’ll just give you some samples of how much displacement a given vote-diff can get. Again, vote-diff is just ups minus downs; the total number of votes doesn’t matter. I’ve converted from the “12.5-hour units” the displacement is actually expressed in to more comprehensible ‘in hours’ units.

vote-diff    displacement in hours
0            0
1            0
2            3.8
3            6.0
4            7.5
5            8.7
6            9.7
7            10.6
8            11.3
9            11.9
10           12.5
100          25.0
1000         37.5
10000        50.0

As you see, even something with an absurdly high 10000 vote-diff only gets put 50 hours ahead of its usual place in the timeline. Likewise, if it had a -10000 vote-diff (10k more downvotes than upvotes), it would be only 50 hours behind its usual place in the timeline.

That keeps votes from changing the position of a story too much, whether by keeping it at the top forever or by moving it so many pages in that nobody ever sees it.  That’s what the log scale does.
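
The sample displacements can be recomputed with a few lines of Python. This is only a sketch: `displacement_hours` is a made-up name, and it assumes the base-10 log and the 45000-second unit discussed above.

```python
from math import log10

HOURS_PER_UNIT = 45000 / 3600  # one timeline unit is 12.5 hours


def displacement_hours(vote_diff):
    """Signed displacement, in hours, for a given ups-minus-downs difference."""
    magnitude = log10(max(abs(vote_diff), 1)) * HOURS_PER_UNIT
    return magnitude if vote_diff >= 0 else -magnitude


for diff in (0, 1, 2, 10, 100, 10000, -10000):
    print(diff, round(displacement_hours(diff), 1))
```

Each extra factor of ten in the vote difference buys only another 12.5 hours, which is why even a 10000 vote-diff moves a story just 50 hours.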

And that scale pretty well matches what we reddit users actually observe on reddit. I went and checked it against some popular reddits. Reddit only displays the approximate posting time of a story as far as I can tell (“1 day ago” could mean 28 hours or 32 hours or whatever), so I can’t check completely, but the actual ordering could be explained by the corrected algorithm.

Wrong in the public source on github?

update 6:55pm ET 8 May 2012.  reddit assures me that this code is what reddit runs live, and I have made some really stupid mistake. Fair enough. Struck out this section.

Unless I’m making some really stupid mistake, this typo-bug is present in the reddit source publicly shared on github as of time of this writing. [1], [2]

This means that there’s pretty much no way actual reddit.com is using the code they’ve posted publicly on github. At the very least, they’ve fixed this bug in the implementation they’re actually running.

It probably means nobody else is using the reddit github source either, cause it wouldn’t work right with ‘hot’ ranking. (Or someone else is using it, and fixed the bug in their source but didn’t send it back).

How did this bug end up in the publicly shared reddit source? Not fixed yet? I’m kind of curious, and curious as to what relationship this publicly shared source has to what reddit actually runs.

Considering tweaking the algorithm

Now that we understand the basic “timeline + displacement” algorithm, we can consider tweaks/modifications/tuning of the algorithm to behave differently in different environments. That curiosity was my original motivation for looking into this in the first place.

You might want vote displacement to have more of an effect, or have the effect trail off faster or slower. You’d still want to use a log-scale (or a mathematical function with similar properties) to keep very high vote-diffs from displacing a story too much; you still want a trail-off effect.

You could change the log from base-10 to base-something else to affect how quickly the effect trails off.  You could also introduce a factor into the operand of the log, taking `log(factor * vote-diff)` instead of just `log(vote-diff)`.   You potentially could change the units from 12.5-hour units to something else (the 45000 number), but that could get confusing quick; you might need to add another factor on the left hand of the sum to compensate. So actually, instead, you can just add a factor on the left-hand side, `factor * log(votediff)` instead of just `log(votediff)`.
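
The three knobs mentioned here can be put side by side in a small Python sketch. The function name `tuned_displacement` and the parameter names `base`, `inner` and `outer` are hypothetical, chosen only to label the base of the log, the factor inside the log, and the factor outside it.

```python
from math import log


def tuned_displacement(vote_diff, base=10, inner=1.0, outer=1.0):
    """Log-scaled displacement with three hypothetical tuning knobs.

    base  -- base of the logarithm
    inner -- factor applied inside the log: log(inner * |diff|)
    outer -- factor applied outside the log: outer * log(|diff|)
    """
    # Clamp to 1 so the log never goes negative or undefined.
    magnitude = max(inner * abs(vote_diff), 1)
    value = outer * log(magnitude, base)
    return value if vote_diff >= 0 else -value


for diff in (10, 100, 1000):
    print(diff,
          round(tuned_displacement(diff), 2),
          round(tuned_displacement(diff, base=2), 2),
          round(tuned_displacement(diff, inner=5), 2),
          round(tuned_displacement(diff, outer=2), 2))
```

Two properties of logarithms are worth keeping in mind when reading the output. Changing the base is the same as multiplying by a constant, since log_b(x) = log10(x) / log10(b), so `base` and `outer` are really one knob. A factor inside the log only adds a constant, since log(factor * x) = log(factor) + log(x), so for larger vote-diffs it shifts every story by the same amount rather than changing the trail-off.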

I’m not enough of a math guy to predict exactly what all those things would do, I’d want to actually plot the function in R (or something else) and see what it looks like when you change the various factors, and I don’t know enough R (or anything else) to do it. I think plotting vote-diff vs hours-of-displacement is the right thing to plot though to give you the right feedback.

You could also try to introduce something to the equation to take account of total number of votes, so “10 up, 1 down” and “100 up, 91 down” don’t have exactly the same effect. You’d want to base this on the Wilson score confidence interval used by reddit for default comment ranking somehow; that’s the right way to take account of total number of votes, but it’s not immediately clear to me where or how you’d introduce that into the equation (Did I mention I’m not a math guy?).  That would make it a bit harder to see what it does by plotting it, since it’ll be a multi-variate function now, doh.
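
For reference, here is a sketch of the lower bound of the Wilson score interval, using the standard textbook formula rather than anything taken from reddit's source. The function name and the default `z=1.96` (roughly 95% confidence) are assumptions for illustration. It does not answer the question of where to plug this into the ‘hot’ equation; it only shows that the score does separate the two cases.

```python
from math import sqrt


def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of upvotes.

    z=1.96 corresponds to roughly a 95% confidence level.
    """
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


print(round(wilson_lower_bound(10, 1), 3))
print(round(wilson_lower_bound(100, 91), 3))
```

Both stories have a vote difference of 9, but “10 up, 1 down” scores noticeably higher than “100 up, 91 down”, because nearly all of its votes are upvotes.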

And you might not want to trust that the algorithm found in reddit’s public github source for Wilson score confidence interval is actually bug free. Last year someone said they found a bug in at least one published implementation; I think I saw someone say it had been fixed on reddit.com, but I don’t know if it’s been fixed in the github public source.

You might also want to make upvotes worth more or less than downvotes, instead of equivalent. Not quite sure how you’d do that. You could make net-negative votes worth more or less than the same absolute value net-positive, just by using a factor in `factor * log(diff)` that depends on diff being positive or negative.
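
The sign-dependent factor is easy to sketch in Python. The names `asymmetric_displacement`, `up_factor` and `down_factor` are invented for this example; it covers only the net-positive versus net-negative case, not weighting individual up and down votes differently.

```python
from math import log10


def asymmetric_displacement(vote_diff, up_factor=1.0, down_factor=1.0):
    """Scale net-positive and net-negative vote differences differently."""
    magnitude = log10(max(abs(vote_diff), 1))
    if vote_diff >= 0:
        return up_factor * magnitude
    return -down_factor * magnitude


print(asymmetric_displacement(100, down_factor=2.0))
print(asymmetric_displacement(-100, down_factor=2.0))
```

With `down_factor=2.0`, a story at -100 is pushed back twice as far as a story at +100 is pushed forward.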


Filed under: General

by jrochkind at May 08, 2012 11:37 PM

TSLL TechScans

National Library of Medicine on Name Authority Records

NLM has decided to follow the British Library’s lead and try to avoid creating any further undifferentiated NARs for NACO or adding any further identities to existing NARs. If using RDA qualifiers such as period of activity or profession will allow the name to be differentiated, then these elements will be added to the heading and headings will be coded RDA. Catalogers who are not yet trained in RDA will work with or pass the work on to NLM catalogers who participated in the RDA test.

While NLM is not as optimistic as the BL that undifferentiated records can be avoided completely, NLM believes that minimizing the number of undifferentiated headings in the national authority file will be a benefit to the cataloging community.

Diane Boehr
Head of Cataloging
National Library of Medicine

from PCC list

by noreply@blogger.com (Christina Tarr) at May 08, 2012 09:17 PM

First thus

Re: Setting an Example for Academic Research

Comment to Daniel Stuhlman's blog posting Setting an Example for Academic Research, Monday, May 7, 2012. http://kol-safran.blogspot.it/2012/05/setting-example-for-academic-research.html   This shows the changes that are occurring now in the field of publishing and bibliography. I personally do not like some of these trends, but nevertheless, they are happening. Wikipedia is becoming an

by noreply@blogger.com (James Weinheimer) at May 08, 2012 03:00 PM

Outgoing

Just the Links, Please

We haven't done much with JSON in VIAF yet, but Ralph came up with a new feature for VIAF for which JSON seemed a natural fit: requesting a view of a VIAF cluster that just shows the links.  Here's the JSON for this new view of the Mark Twain cluster, http://viaf.org/viaf/50566653/justlinks.json

 

{ "viafID":"50566653",
    "BAV":["ADV10188047"],
    "BIBSYS":["x90056487"],
    "BNE":["XX945992"],
    "BNF":["http://catalogue.bnf.fr/ark:/12148/cb11927291n"],
    "DNB":["118624822"],
    "EGAXA":["vtls000823270"],
    "JPG":["500020427"],
    "LAC":["000002392","0105C0556"],
    "LC":["340758"],
    "NDL":["00459304"],
    "NKC":["jn19981002263"],
    "NLA":["000035028957"],
    "NLIara":["000160730"],
    "NLIcyr":["000156897"],
    "NLIheb":["000175478"],
    "NLIlat":["000133341"],
    "NUKAT":["vtls000634910"],
    "PTBNP":["46269"],
    "RERO":["vtls000164542"],
    "RSL":["nafpn-000083946"],
    "SELIBR":["98057"],
    "SUDOC":["027171876"],
    "VLACC":["000002392"],
    "WKP":"Mark_Twain"}

We return an array for each source because some may have multiple IDs (e.g. LAC in the example).  The array labels are VIAF's abbreviations (e.g. http://viaf.org/authorityScheme/LAC) for each of the source files, with WKP standing for Wikipedia.

For those of you that prefer content-negotiation to mangling URIs, the mime-type 'application/vnd.oclc.links+json' should also work (the older 'application/json+links' should still work too).

Already built into VIAF is the ability to go from the source ID to the VIAF cluster: http://viaf.org/viaf/sourceID/SUDOC|027171876 will take you to the VIAF cluster http://viaf.org/viaf/50566653, and http://viaf.org/viaf/sourceID/SUDOC|027171876/justlinks.json will give you just the links for that page.
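
A client for this view could look something like the following Python sketch. The function names are made up, the URLs and mime-type are the ones given above, and the fetch has not been checked against the live service, so treat it as an outline. The parsing step wraps any bare string in a list so every source can be handled the same way.

```python
import json
from urllib.request import Request, urlopen

LINKS_MIME_TYPE = "application/vnd.oclc.links+json"


def parse_justlinks(text):
    """Return (viaf_id, {source: [ids]}) from a justlinks JSON document."""
    data = json.loads(text)
    viaf_id = data.pop("viafID")
    # Most sources carry a list of IDs; wrap any bare string (WKP in the
    # example above) so every value can be handled the same way.
    links = {source: ids if isinstance(ids, list) else [ids]
             for source, ids in data.items()}
    return viaf_id, links


def fetch_justlinks(viaf_id, negotiate=False):
    """Fetch the links view, by URL suffix or by content negotiation."""
    if negotiate:
        request = Request(f"http://viaf.org/viaf/{viaf_id}",
                          headers={"Accept": LINKS_MIME_TYPE})
    else:
        request = Request(f"http://viaf.org/viaf/{viaf_id}/justlinks.json")
    with urlopen(request) as response:
        return parse_justlinks(response.read().decode("utf-8"))


sample = '{"viafID":"50566653", "LAC":["000002392","0105C0556"], "WKP":"Mark_Twain"}'
viaf_id, links = parse_justlinks(sample)
print(viaf_id, links["LAC"], links["WKP"])
```

Calling `fetch_justlinks("50566653")` would request the Mark Twain cluster shown above; passing `negotiate=True` asks for the same view through the Accept header instead of the URL suffix.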

If you want ALL the links out of VIAF then visit the VIAF Dataset page at http://viaf.org/viaf/data.  It has a pointer to a file that lists each of the 27 million VIAF ID to source ID links.

We're having fun here at VIAF Central, especially Ralph LeVan, who thought of and implemented this.

--Th

 Update: 2012.05.09  changed mime type (old one should still work)

by Thom at May 08, 2012 02:00 PM

Catalogue & Index Blog

Papers on repositories wanted for Catalogue & Index

The next issue of Catalogue & Index will be on repositories. Are you currently working in a repository-related role? How has your metadata knowledge been of use? Have you been doing any projects in repository metadata that you want to share with others?

Anyone who wants to contribute an article should contact Cathy Broad (library@ethicalsoc.org.uk).

by Katrina Marie Louise Clifford at May 08, 2012 01:53 PM

May 07, 2012

Celeripedean » cataloging

Jen

In my last post I left off with the teaser that I would talk about initiative. On second thought, I wanted to discuss the positive aspects of having volunteers or interns. There might never be a good moment to have a volunteer or intern. It requires a lot of time and energy that staff might not have or be willing to give. Despite this, I would contend that accepting a volunteer or intern is totally worth the time and effort.

A volunteer or intern provides an outside perspective to your everyday work routines. It doesn’t matter if it comes from a volunteer or intern. Because the person is external to the politics and inner workings of the department, this person can see things that you either take for granted or choose to ignore. Sometimes, the volunteer and/or intern can bring a fresh perspective and perhaps their own work experience. From this exchange, routines can be updated, refreshed, and/or revised.

Volunteers and interns also bring a lot of positive energy. This is especially true in the case of library school students or recent grads. Most of them are just so excited to be entering this profession. I think it helps those of us who have become jaded by politics or social dynamics in the workplace.

Last but not least, having a volunteer or intern is a great opportunity to give someone a chance to see what library work is. With the right mentor, a volunteer and/or intern can become a vital member of the department.

That leads me to being a mentor. It’s not easy. It’s not leaving the volunteer or intern with some project typically left for work studies. It’s also not leaving the volunteer to their own devices because you’re just too busy to help them out. With the right framework and a clear understanding of what is expected and required of both volunteers/interns and their mentors, the experience can be a success. And more importantly, any problems can be avoided or dealt with in a professional, clear, and timely manner.

 


Filed under: cataloging Tagged: volunteering

by Jen at May 07, 2012 11:20 PM

TSLL TechScans

Serials and RDA: An Ongoing Relationship

The 26th conference of the North American Serials Interest Group (NASIG), held in St. Louis, Missouri on June 2-5, 2011, featured a full-day pre-conference entitled Serials and RDA: An Ongoing Relationship. Judith Kuhagen, Senior Cataloging Policy Specialist at the Library of Congress, presented highlights of RDA as it applies to serials. This helpful presentation linked AACR2 to RDA vocabulary and emphasized the areas of greatest concern. Concrete examples of RDA in MARC were most helpful.

Full conference proceedings are available to NASIG members via the NASIG website, or in a special issue of The Serials Librarian.

by noreply@blogger.com (Jackie Magagnosc) at May 07, 2012 08:33 PM

Bibliographic Wilderness

Check your Apache httpds, make sure they’re KeepAlive On

So in the course of testing ruby http clients, I realized with full force that persistent HTTP connections really do matter, especially if you are doing SSL/https connections.

I knew in theory it improved performance, especially for SSL, but I guess I hadn’t really realized how much it mattered: a lot.  And it matters for your web apps/servers talking to ordinary browsers, not just for automated HTTP clients.

If you have an app that’s accessed over https, a browser connecting to that app might need to fetch half a dozen, a dozen, or more different assets over https. Sure, if things are working right they’ll be cached (until they change) after the first request. Sure, we try all sorts of things to minimize the number of separate http connections. But there’s still probably 6-10, including images and such, at least.

Over https/SSL, 6-10 separate HTTP connections vs 6-10 HTTP requests over a single persisted HTTP connection really does make a difference, as much as several hundred milliseconds, enough for your users to notice, possibly enough to affect CPU load on your machine.

It turns out, I have some web apps that enforce https only, which were not supporting persistent connections!

It looks like the default RHEL/CentOS 5 httpd install, for some reason, has “KeepAlive Off” in its httpd.conf.  Even though apache, for quite some time, has defaulted to KeepAlive On, for some reason RHEL installs override this default.

It’s worth checking. Check your httpd conf files, grep for “KeepAlive”, make sure there aren’t any “KeepAlive Off”s in there.  Or check by actually making a connection to your server — with curl, with a browser using Chrome’s developer tools ‘network’ tab, or LiveHTTPHeaders in Firefox.  When you make a request from your browser, in the response headers you do not want to see “Connection: close”. That means the server is insisting on closing the connection right away and not letting the client use a persistent connection.
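
Both checks can be scripted. The Python sketch below is one way to do it, with made-up function names: the first function scans config text for active “KeepAlive Off” lines (skipping commented-out ones), and the second makes a single HTTPS request and looks for “Connection: close” in the response headers. The second function assumes the server answers a HEAD request on the given path.

```python
import http.client
import re

# Apache directive names are case-insensitive; a leading "#" makes the line
# a comment, which this pattern does not match.
KEEPALIVE_OFF = re.compile(r"^\s*KeepAlive\s+Off\s*$", re.IGNORECASE)


def keepalive_off_lines(conf_text):
    """Return the 1-based line numbers of active "KeepAlive Off" directives."""
    return [number
            for number, line in enumerate(conf_text.splitlines(), start=1)
            if KEEPALIVE_OFF.match(line)]


def server_closes_connection(host, path="/"):
    """True if the server answers an HTTPS request with "Connection: close"."""
    connection = http.client.HTTPSConnection(host, timeout=10)
    try:
        connection.request("HEAD", path)
        response = connection.getresponse()
        return (response.getheader("Connection") or "").lower() == "close"
    finally:
        connection.close()


sample_conf = "Timeout 120\n# KeepAlive On\nKeepAlive Off\nMaxKeepAliveRequests 100\n"
print(keepalive_off_lines(sample_conf))  # [3]
```

To use it on a real server, read each httpd conf file and pass its contents to `keepalive_off_lines`, then call `server_closes_connection` with your own hostname.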

Some people reading this are going to be thinking “Duh, how can you have been deployed with such a silly configuration, do you know what you’re doing?” But others of my peers at similar institutions know how it is, and probably have never checked this either. It’s really worth a check.

(I am curious why the heck RHEL5 installs httpd with KeepAlive Off. All I’ve found is some other people asking the same question. )

update: some notes on performance tradeoffs of using KeepAlive.  I still believe that for a webapp that’s served over SSL, you pretty much have no choice but to enable keepalive for proper performance, and make sure your server can handle it.

It doesn’t look like there’s a way to tell apache “Allow X concurrent connections with KeepAlive, and if there are X connections and more connections come in, then for THOSE connections above X, insist on effectively ‘keepalive off’.” If there were such a way, it could be the best of both worlds, making sure you serve KeepAlive within the constraints your server can handle.  Can anyone figure out a way to do that?


Filed under: General

by jrochkind at May 07, 2012 02:41 PM

mod librarian

Metadata Monday: Self-Publishing Metadata

Here is a simple guide to maximizing metadata when self publishing online. Joel Friedlander, veteran writer/book designer/publishing acumen presents a concise overview of e-book metadata and then delves into how to use keywords with particular metadata fields to create a "secret sauce" that becomes the search-related verbal key to your book's discoverability online.

Taking it one step further, an author might perform some similar searches on similar items to amass relevant keyword ideas and then edit and customize them based on their particular work. Another powerful point that Joel makes is that you can always go back to refine the keywords over time to better sell your content.

 

 

 

Permalink | Leave a comment  »

May 07, 2012 12:30 PM

Catalogablog

MADS 2.0 User Guidelines

News from LC.
The MADS 2.0 User Guidelines http://www.loc.gov/standards/mads/userguide/index.html are now available on the Library of Congress' MADS Web site: http://www.loc.gov/mads, along with the XML schema itself, an Outline of Elements and Attributes, and a mapping and XSLT from the MARC 21 Authority Format to MADS 2.0.

by David (noreply@blogger.com) at May 07, 2012 10:52 AM

Crowdsourcing Cataloging at the Bodleian Library

Crowdsourcing cataloging at the Bodleian Library.
What's the Score at the Bodleian? is a project which aims to enlist the wider community's help in describing a selection of digitised scores from the Bodleian Library's extensive music collections, thereby facilitating access to valuable and interesting material which has not been catalogued and is therefore difficult to find. The approach is two-fold in that it combines a process of rapid digitization of the scores and the creation of descriptive metadata through crowd-sourcing, and it is hoped that the outcomes of the project can be used to inform an efficient yet cost-effective approach to creating access to other music-related material in the Bodleian in the future. It is hoped that there will also be scope in the final delivery of images and crowd-sourced data for additional enhancements such as the hosting of audio performances relating to the music scores and provision of external links to video performances.
My feeling is for some material this makes sense. For items that may take years or decades to fully catalog this may be a good interim solution. Or for items of low importance that may never get described, some metadata is better than none. I'm reminded of the 4 levels of access and description once proposed. Most stuff, little importance, indexed by search engines. More important stuff, some metadata like PDF and Word description fields. Materials of still more importance get Qualified Dublin Core or something on that level. Most important get full treatment by a trained professional. FGDC, MARC/RDA/ISBD, MODS, whatever standard fits. Crowdsourcing could move materials at the search index level up a level or two. It would improve access without using lots of resources.

by David (noreply@blogger.com) at May 07, 2012 10:34 AM

First thus

Re: [ACAT] Who actually needs to understand FRBR principles?

Posting to Autocat On 01/05/2012 15:40, Kathleen Lamantia wrote: <snip> I have heard several experienced presenters opine that a wide percentage of library staff need to be made familiar with FRBR principles, at a minimum the WEMI concepts. I am extremely hesitant to do this. I understand WEMI; I do not find it helpful in any way, and I don't think my staff or patrons will either. I work in

by noreply@blogger.com (James Weinheimer) at May 07, 2012 09:15 AM

May 06, 2012

habitually probing generalist

Two-Thirds Book Challenge Update 7

This is update 7 in the Two-Thirds Book Challenge.

It seems that Helen is the only one who got any books read and/or posted about this month … so, we’ll start with her.

Helen

The Big Cat Nap by Rita Mae Brown

I love this series. Through 20 years I feel like I’ve grown up with these characters. They’re effortless and real in a way that feels genuine, even in such a contrived environment as the murder mystery can be. … I hope she never stops this series!

Read her review to find out the topics covered in this book.

Why Be Happy When You Could Be Normal by Jeanette Winterson

This was a 5 star book for Helen.

This is a slice of her life across the singular topic of being adopted. That sounds so simple, but no one is better equipped to express the exquisite agony and beauty of this topic from childhood, with her severe, evangelical adopted mother, to the present, meeting her biological mother and family. Nothing about it is simple, nothing is expected.

She refuses to make a simple syrup of her experiences and so takes us all to a place where there is no separation between emotions and thought, where feeling and thinking happen simultaneously and equivalently and the mess that is. It sounds complicated, maybe overly so, and it is. That’s life.

Ragnarok: The End of the Gods by A.S. Byatt

Helen gave some good reasons for not liking this one very much:

There were a number of barriers to enjoyment for me reading this book. I was just glad it was so short, otherwise I would have quit.

First, this is the 15th in the Canongate Myths series (http://www.themyths.co.uk/) and it was only three stories ago that they covered a Norse myth. I love the Myths series, but not spacing these two stories out more was a big oversight, especially since the other story was so much better. I mean light years, so having them close like this made the superiority of the other story just that much more obvious.

Too much description, a bad transition, and a disjointed essay at the end are the other reasons. Read her review to get the details.

On the Canongate Myth series as a whole she writes:

Prior to this I have only disliked one other book in the Myths series, so I still think their batting average is pretty high! But, if I were just getting into the series, I wouldn’t start here. I might even skip it altogether.

Sara and I have both read the opening book in this series, and Sara has read a few more of them. I believe she has generally liked them.

Radioactive: Marie & Pierre Curie, a Tale of Love and Fallout by Lauren Redniss

A.Maz.Ing. This book is not only stunningly gorgeous to look at but beautifully written. Every page, even the filler pages, were a treat to explore. …

Just go read her review. And then, perhaps, read the book. I know I will be doing so.

Canning for a New Generation: Bold, Fresh Flavors for the Modern Pantry by Liana Krissoff

So even though a “wee bit too hipster homesteader for me in style,” the author’s “genuine and it makes me feel like I might actually be able to make these things. … I don’t think I’ve ever wanted to try to make so many recipes in a cookbook, and that’s all there is to say.”

Interesting review and if you want an introduction to canning, or are looking for good canning recipes, then this might be a book for you.

Everyone else

I apologize if I missed something by the rest of you but I poked the feed reader, your blogs and my diigo tag and didn’t find anything. Perhaps next month.

by Mark at May 06, 2012 09:26 PM

May 05, 2012

Terry's Worklog

Maybe all libraries do need coffee shops

Having gone 34 years without being a coffee drinker, I personally never got why people wanted coffee shops in libraries.  But over the last year, my wife and Greenhill Farms, a Kona Coffee Grower in Hawaii, convinced me that not all coffee is bad.  I’m so convinced, that having a morning cup of coffee (black, no sugar – yuck) has become a bit of a habit. 

Well, this morning, I was hanging out in Vancouver, BC killing time before heading to the airport.  Since I didn’t have anything to do, I grabbed my copy of Norman Mailer’s “The Castle in the Forest” and headed a couple blocks down the road to Tim Horton’s.  There, I grabbed a medium cup of black coffee, and found myself a quiet table to just sit and read.  And I have to admit, kicking back, nursing my cup of coffee and enjoying a good book was really appealing.  Without realizing it, I’d spent about an hour and a half in my little corner of the coffee shop.  I think I now definitely understand the draw.

Maybe now that I’ve had this breakthrough, I’ll be able to unwrap other mysteries – like why people enjoy watching talk shows and reality TV, who actually liked the show Firefly and why (because as a sci-fi fan, I don’t get it), the infatuation with Dr. Pepper, and why my cats always look like they are plotting to kill me in my sleep. 

–tr

by Administrator at May 05, 2012 08:09 PM

May 04, 2012

Free Moth :: Flutterings

Translating FRBR into RDF

Baker, Thomas. 2012. ‘Libraries, Languages of Description, and Linked Data: a Dublin Core Perspective’. Library Hi Tech 30 (1) (January 1): 116–133.

I finished reading this paper yesterday, which I’d recommend as a great historical overview of Dublin Core and its place in the emerging linked data environment.  One section that I found particularly interesting was Baker‘s account of the translation of FRBR into RDF.

He talks about how the FRBR group 1 entities (work, expression, manifestation, item) have been set up in an ontology using OWL, the W3C web ontology language.  He notes that this approach is in contrast to the method used to set up ISBD in RDF which “follows the Singapore framework by defining an RDF vocabulary of ISBD properties and using those properties in specifically constrained ways in a description set profile.” (p. 125)

Baker says the result is that the FRBR group 1 entities

“… are defined as disjoint classes and the relationships between entities are defined as disjoint properties. Declaring the entities to be disjoint means that in the FRBR [RDF] universe, a resource belongs clearly to one of the four classes. If one statement declares a resource to be a work, and another declares that same resource to be an expression, then by definition one of the statements must be wrong.” (p. 125)
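
What disjointness means in practice can be illustrated with a small Python sketch. This is plain Python standing in for what an OWL reasoner would flag, not an OWL or RDF implementation, and the resource names are hypothetical.

```python
from collections import defaultdict

# The four FRBR group 1 classes, treated as pairwise disjoint.
WEMI = {"Work", "Expression", "Manifestation", "Item"}


def disjointness_violations(statements):
    """Return resources asserted to belong to more than one WEMI class.

    `statements` is an iterable of (resource, class_name) pairs.
    """
    classes_by_resource = defaultdict(set)
    for resource, class_name in statements:
        if class_name in WEMI:
            classes_by_resource[resource].add(class_name)
    return {resource: classes
            for resource, classes in classes_by_resource.items()
            if len(classes) > 1}


statements = [
    ("ex:book1", "Work"),
    ("ex:book1", "Expression"),  # conflicts with the statement above
    ("ex:book2", "Manifestation"),
]
print(disjointness_violations(statements))
```

Here `ex:book1` is reported because it is typed as both a work and an expression, which is exactly the situation the quoted passage says must be an error under the disjoint-class definition.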

This strikes me as an odd way to look at FRBR.  These entities are not mutually exclusive.  There is a certain amount of overlap between all of them, certainly between a work and an expression and perhaps to a lesser extent between a manifestation and an item.  There are also relationships to be found between a work, expression and manifestation.

I’ve attended a couple of RDA training webinars put on through OCLC and presented by Mark Ehlert.  He looks at the FRBR group one entities in terms of layers.  The work is the first layer followed by the expression, etc.  I like this approach.  It makes sense to me.  A resource will display properties and attributes from all of the FRBR group 1 entities.  They should not be considered “disjoint classes.”

Baker points out that this interpretation of FRBR in OWL has been strongly criticized because it presents “an overly rigid interpretation of FRBR – one that imposes sharp ontological distinctions on users … [and that the] … rigidity of this conceptual universe becomes a particular problem when trying to merge FRBR-based data with non-FRBR-based data.” (p. 126)

He provides the following example to illustrate:

“Should the non-FRBR-based description of a book, for example, be considered comparable to the description of a work, an expression, a manifestation, or an item? It cannot be considered comparable to more than one without violating the laws of the conceptual universe delineated in the FRBR ontology.” (p. 126)

Baker talks about some “workarounds” that have been devised and points to two articles.  The first is this paper about the RDA vocabularies:

Hillmann, Diane, Karen Coyle, Jon Phipps, and Gordon Dunsire. 2010. ‘RDA Vocabularies: Process, Outcome, Use’. D-Lib Magazine 16 (1/2) (February). doi:10.1045/january2010-hillmann. http://www.dlib.org/dlib/january10/hillmann/01hillmann.html.

I’ll need to reread that one with this workaround idea in mind.  The second is a relatively new paper by Ronald Murray and Barbara Tillett that Baker says “suggest[s] an alternative interpretation of FRBR: one in which the WEMI entities are seen as ‘groups of statements that occupy different levels of abstraction’.” (p. 126)

Murray, Ronald J., and Barbara B. Tillett. 2012. ‘Cataloging Theory in Search of Graph Theory and Other Ivory Towers’. Information Technology and Libraries 30 (4) (January 19). doi:10.6017/ital.v30i4.1868. http://ejournals.bc.edu/ojs/index.php/ital/article/view/1868.

The title of this paper alone has me intrigued and I look forward to reading it.  It sounds like it will be closer to Ehlert‘s description of WEMI as layers that I mentioned earlier.


by freemoth at May 04, 2012 01:52 PM

Outgoing

VIAF Dataset

The VIAF dataset is now available for public consumption!  http://viaf.org/viaf/data describes and links to the files involved and describes how we expect the ODC-By license to be applied.  We are not sure just how popular the files will be, so if the site appears slow, please stop downloading and come back later.  From my machine here at OCLC my browser is estimating 20-30 minutes to download the larger files, from my home it was double that.

For more about this, see a previous post: http://outgoing.typepad.com/outgoing/2012/04/viaf-developments.html.

One question that has come up was whether it would be possible to incorporate VIAF identifiers and information into a dataset that is released under a CC-0 license.  The short answer is yes.  Here's a longer answer:

  • We would like to see acknowledgement of VIAF as a source somewhere on your site
  • We encourage the use of VIAF URIs where appropriate and they can be considered acknowledgement
  • Incorporation of those VIAF URIs and associated information from VIAF should not prevent you from releasing your dataset under CC-0, since the URIs are considered sufficient acknowledgement
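
As a rough illustration of the points above, a record in a local dataset could carry a VIAF URI along these lines. This is only a sketch under stated assumptions: the helper names `viaf_uri` and `add_viaf_link`, the record layout, and the `sameAs` field are hypothetical, and the identifier used is a placeholder rather than a real lookup. The only thing taken as given is the URI pattern of `http://viaf.org/viaf/` followed by a numeric identifier.

```python
VIAF_BASE = "http://viaf.org/viaf/"


def viaf_uri(viaf_id: str) -> str:
    """Build a VIAF URI from a numeric VIAF identifier."""
    if not viaf_id.isdigit():
        raise ValueError(f"expected a numeric VIAF identifier, got {viaf_id!r}")
    return VIAF_BASE + viaf_id


def add_viaf_link(record: dict, viaf_id: str) -> dict:
    """Return a copy of the record with a VIAF URI attached.

    The 'sameAs' key is an assumption made for this sketch; a real
    dataset would use whatever linking property its schema defines.
    """
    linked = dict(record)  # copy, so the original record is left untouched
    linked["sameAs"] = viaf_uri(viaf_id)
    return linked


# Hypothetical local record and placeholder identifier.
record = {"name": "Example, Author"}
linked = add_viaf_link(record, "12345")
print(linked["sameAs"])
```

Keeping the URI in the record itself means the acknowledgement travels with the data wherever the dataset is reused.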

--Th

 

 

by Thom at May 04, 2012 01:05 PM

Catalogablog

VIAF Now Available

Thom Hickey has details about the Virtual International Authority File being publicly available on his Outgoing weblog:
The VIAF dataset is now available for public consumption! http://viaf.org/viaf/data describes and links to the files involved and describes how we expect the ODC-By license to be applied. We are not sure just how popular the files will be, so if the site appears slow, please stop downloading and come back later. From my machine here at OCLC my browser is estimating 20-30 minutes to download the larger files, from my home it was double that.

by David (noreply@blogger.com) at May 04, 2012 12:56 PM

The Serials Cataloger

"RDA: End of the World Postponed?" by Kevin M. Randall

Kevin Randall, an RDA proponent and the Principal Serials Cataloger at Northwestern University, provides an overview of some of the major issues surrounding Resource Description and Access in his article, "RDA: End of the World Postponed?" In the article, Randall tackles questions such as:
  • Why should we switch to RDA if the records aren't really much different from AACR2 records? Can't we just fix AACR2?
  • What's all this about FRBR, and why are we rushing into it blindly?
  • Are we putting the rules cart before the format horse?
  • We finally got continuing resources, now where did they go?
  • Have we abandoned ISBD?
  • Isn't the U.S. RDA test really just for show? Isn't implementation a foregone conclusion?
  • That was then, this is now . . . what's the future?
If you've grown weary of following all the RDA discussions on various discussion lists, this article will help catch you up on some of the current debates. As catalogers well know, cataloging rules are all too often blamed for the shortcomings of OPACs, so I want to note one particular quote that I wish all reference librarians and library administrators would read:
"... the cataloger’s work could be aided to an extreme degree by truly suitable and modern cataloging interfaces. (For many years complaints about the difficulty and expense of cataloging have been largely misplaced. The problems have far less to do with the cataloging rules and the MARC format than they have to do with an electronic cataloging interface that after four decades still holds onto its original basic concept: read a book of cataloging rules, and apply those rules in filling out a MARC tag workform)"--P. 339.

Randall, Kevin M. "RDA: End of the World Postponed?" Serials Librarian 61, no. 3-4 (Oct. 2011): 334-345. doi:10.1080/0361526X.2011.617297


[Note: My apologies to readers and to Mr. Randall for the lateness of this posting. I had been relying on an RSS feed for information on new articles, but never received notice of this one. It wasn't until I saw a citation to the article in the NASIG Newsletter that I discovered the omission.]

by Lori (noreply@blogger.com) at May 04, 2012 09:20 AM

"The U.S. RDA Test Process" by Diane Boehr, Regina Romano Reynolds, and Tina Shrader


This article reports the content of a session given at the 2011 NASIG Conference on the process of testing Resource Description and Access (RDA) in the United States in late 2010 and the subsequent analysis of the data in 2011. The presenters represented three national libraries that were an integral part of the testing process: the Library of Congress, the National Agricultural Library, and the National Library of Medicine. While this article does not present the recommendations that came out of the test (those can be found on the Library of Congress's website), it does describe how the test was conducted, how the data were analyzed, and the general categories of recommendations that were arrived at. Given that this presentation was part of the NASIG conference, the topic of continuing resources in the test was highlighted. The presenters touched on issues such as successive entry, what format changes will constitute the need for a new bibliographic record, how translations and language additions will be handled, discrepancies between RDA and CONSER Standard Record practices, and the future of provider neutral records under RDA.

"The U.S. RDA Test Process" can be found in the Serials Librarian, volume 62, issues 1-4 (2012), pages 125-139. DOI: 10.1080/0361526X.2012.652485.

by Lori (noreply@blogger.com) at May 04, 2012 09:17 AM

Free Moth :: Flutterings

Book Meme

Remembered this blog of mine the other day and finally sought it out to see if it’s still around. Surprisingly yes.

Reading through some of the old posts and came across this book meme and thought I’d try it out again.

  1. Grab the nearest book.
  2. Open it to page 56.
  3. Find the fifth sentence.
  4. Post the text of the sentence in your journal along with these instructions.
  5. Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

“Proper selection among the many books available was crucial because ‘if a scholar does not have the books required for his subject, he does not enjoy the privileges of a scholar.’”

Blair, Ann M. Too Much to Know


by freemoth at May 04, 2012 12:43 AM