Planet Cataloging

July 27, 2014

Celeripedean

Jen

I’ve talked about how the term “metadata” has become associated with so many jobs, tasks, or titles that I wonder whether it has lost so much of its meaning that it is no longer helpful. I was reminded of my question while reading a recent thread on what to call catalogers. The term metadata came up. Another reminder was my supervisor, who stated that everyone in her department does metadata. I’ve heard over the years about how people create metadata, do metadata, transform metadata, perform metadata tasks, or are metadata librarians. I’ve read articles on the merits of metadata or the sad stories of what’s lost with no metadata. In each of these cases, metadata is slightly different.

My supervisor was referring to descriptive information entered and/or manipulated in Connexion, in our ILS, or perhaps in MarcEdit, a cataloging tool, before being loaded into our ILS. This descriptive information is not really converted from another format. It tends to all be in MARC21 and needs varying degrees of local manipulation to conform to local practices. Some of my colleagues who have said that they create metadata sometimes refer to this process I just described. Others create metadata by entering information in web forms for digital repositories. Those who speak about transforming metadata typically work with XML editors, XML, and XSLT to convert an XML file of one structure into a different XML file of a different structure. In general, structure here refers to a particular metadata schema or an XML file that is a data dictionary. If you write an XML file in English, you use the English dictionary. Similarly, if you write an XML file in MODS, you use the MODS standard. Some of my colleagues who say they transform metadata tend to know some scripting or programming languages such as Python. Some of my colleagues who have said they perform metadata tasks talk about developing best practices for encoding content and content standards.

In a sense, all of these examples illustrate that my colleagues create and edit information that contextualizes resources of all types, in different software applications and for different audiences. You could say that despite all of these differences, the glue is that metadata is just data. However, one colleague didn’t see metadata this way. For her, metadata was content entered into forms; of course, this could be just a matter of different perspectives.
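
To make the “transforming metadata” case concrete, here is a minimal sketch of the kind of XSLT-driven conversion described above. It assumes the lxml library is available, and the file names are hypothetical placeholders; any source XML and crosswalk stylesheet would do.

from lxml import etree

# Parse the source record and the stylesheet that maps one schema to another.
# File names below are hypothetical placeholders.
source = etree.parse("record_dc.xml")
transform = etree.XSLT(etree.parse("dc_to_mods.xsl"))

# Apply the transformation and print the resulting XML in the target structure.
result = transform(source)
print(etree.tostring(result, pretty_print=True).decode("utf-8"))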

It is these perspectives that highlight one important thing about this profession that deals with metadata. Namely, we are still growing in terms of the diversity of tasks out there for us to take on and learn. This is what makes this profession so much fun. There is just so much to learn. One can take their job in so many different directions (hopefully seen as a positive and encouraged by their institution). In this respect, it is a good thing that metadata is elusive. Hopefully we won’t be able to pin it down to one task, one job, or one type of librarian.


Filed under: cataloging Tagged: metadata

by Jen at July 27, 2014 07:45 PM

July 25, 2014

025.431: The Dewey blog

Informatics

Sometimes it seems that every field wants its own informatics subfield.  For example, in the health and medicine arena, you might come across health informatics, consumer health informatics, community health informatics, public health informatics, disease informatics, medical informatics, clinical informatics, nursing informatics, dental informatics, translational bioinformatics, medical imaging informatics, and the list goes on.  But it isn't only health and medicine that have gotten on the informatics bandwagon.  You might also come across museum informatics, business informatics, cheminformatics, education informatics, environmental informatics, legal informatics, music informatics, and on and on.

Since it's so popular, it would be good to find out just exactly what informatics is.  (Good luck with that endeavor!)  But, of course, informatics isn't something that occurs in nature, whose nature can be discovered.  Informatics is a label put on a set of perspectives, activities, phenomena etc., found useful in some context and varying across contexts.  The range of activities/phenomena belonging under the informatics umbrella is quite broad, including, inter alia, clinical decision support systems, interoperable electronic health records, identifying a consensus classification of biological organisms, the simulation of neural systems of specific animals, virtual screening of large libraries of chemical compounds, magnetic resonance imaging, in-car navigation systems, case-based legal prediction,  optical music recognition, etc.  What all versions of informatics share is a focus on the interaction of information and computation.

In the DDC, informatics is mentioned in the schedules in two places:  004 Computer science (indexed by the Relative Index heading "Informatics—computer science") and 020 Library and information sciences (indexed by the Relative Index heading "Informatics—information science").  No interdisciplinary number for informatics has been identified, since works on informatics almost invariably treat the informatics of specific fields.  The point of listing informatics in the class-here notes at 004 and 020 is to justify giving, at both locations, the following scatter class-elsewhere note:  "Class informatics of a specific subject with the subject, plus notation T1—0285, e.g., bioinformatics 570.285."

Accordingly, we find the following titles on the informatics of specific subjects classed in the Dewey numbers indicated:

Title:   Medical device data and modeling for clinical decision making
Topic:  Medical informatics
LCSH:  Medical informatics
DDC number:  610.285 (610 Medicine and health +  T1—0285 Computer applications)

Title:   Nursing informatics : where technology and caring meet
Topic:  Nursing informatics
LCSH:  Nursing informatics; Information storage and retrieval systems -- Nursing
DDC number:  610.730285 (610.73 Nursing and services of allied health personnel +  T1—0285 Computer applications)

Title:   Information technology in pharmacy : an integrated approach
Topic:  Pharmacy informatics
LCSH:  Pharmacy informatics; Pharmaceutical services -- Information technology
DDC number:  615.10285 (615.1 Drugs [Materia medica] + T1—0285 Computer applications)

Title:   Informatics for materials science and engineering : data-driven discovery for accelerated experimentation and application
Topic:  Materials informatics
LCSH:  Materials science -- Data processing
DDC number:  620.110285 (620.11 Engineering materials + T1—0285 Computer applications)

Title:   Semantic modeling and interoperability in product and process  engineering : a technology for engineering informatics
Topic:  Manufacturing informatics
LCSH:  Production engineering -- Data processing
DDC number:  670.285 (670 Manufacturing +  T1—0285 Computer applications)
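
All of the examples above follow the same mechanical pattern: take the base number for the subject and append the digits of T1—0285. The short sketch below (in Python, purely illustrative and not an official DDC tool) reproduces that pattern for these particular examples; real DDC number building involves many more instructions than this.

def add_standard_subdivision(base, subdivision="0285"):
    # Strip the decimal point and any trailing zeros from the base number,
    # append the Table 1 notation, then restore the decimal point after the
    # third digit. This mirrors the examples above; it is not a
    # general-purpose DDC number builder.
    digits = base.replace(".", "").rstrip("0")
    digits += subdivision
    return digits[:3] + "." + digits[3:]

for base in ["610", "610.73", "615.1", "620.11", "670", "570"]:
    print(base, "+ T1--0285 =", add_standard_subdivision(base))
# Prints 610.285, 610.730285, 615.10285, 620.110285, 670.285,
# and 570.285 (bioinformatics).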

by Rebecca at July 25, 2014 04:48 PM

First Thus

ACAT Concerning D’Souza’s “America” movie and how Search is changing

Posting to Autocat

On 7/25/2014 3:16 PM, Wojciech Siemaszkiewicz wrote:

It certainly has been happening for some time now. Facebook has been very aggressive with their advertisements based on your “political” activities and opinions posted on FB. The same happens with the pages you might like that FB presents to you. This happened when they redesigned the displays in FB. The key element of FB, your friends’ activities, is now displayed in a small window on the right side, while your preferences and likes are centrally located, with FB ads and pages you might like right in front of your eyes. It is all based on your past history of postings and activities there.

Google does the same, reading your past search history and suggesting results for your searches that more or less match your past preferences, based on your past activities, including your “political” searches. It works well if this is what you are looking for; however, it misses the target of your search if you are looking for nonpolitical subjects that you have not searched for in the past.

Yes. But I want to emphasize that I am not so much judging the positive or negative points of the Googles as pointing out what was a kind of “revelation” to me about 10 years ago: that Google (and Yahoo and Bing and Facebook…) are all advertising agencies, and that I had been incorrect to think of them as some kind of information agency, or super-indexes or super-catalogs or something similar. Certainly, the Googles provide access to some resources–just as any advertising helps you to find a beauty parlor, grocery store, book, or anything else. Yet the major point of all this advertising is to make people *happy* (at least for long enough), and that includes keeping happy the companies whose products are being advertised. If you don’t, you are dead.

Google … [et al.] all know very well that for someone to change their search engine/advertising agency all it would take is a click of a button! That must be terrifying for them!

I discussed this some time back http://blog.jweinheimer.net/2013/03/re-ngc4lib-google-ngram-results_9153.html (there I go, quoting myself again!).

But all of this is in the nature of an advertising agency, and we shouldn’t blame them. It’s their job: we should not expect the truth from them, nor unbiased information. Why should we expect anything different from the Googles? Look at how Google is “cleansing” its databases of negative information here in Europe. (It gets messier. See: http://uk.reuters.com/article/2014/07/24/us-google-eu-privacy-idUKKBN0FT1AZ20140724) Once this is understood and accepted, the difference between what the Googles offer and what a library is supposed to provide its communities becomes crystal clear. Google wants to make everybody happy, and they are finding it increasingly difficult to do so.

Libraries provide something very different, but the majority of people do not understand this–and unfortunately that includes many administrators–so they compare the Google “User Experience” (which makes them happy!) with the library “User Experience” and find libraries wanting.

As a fitting aside here, this just in: the “Let me Google that for you” Act
http://searchengineland.com/let-google-bill-aims-replace-government-agency-google-search-197907?utm_source=feedburner&utm_medium=feed&utm_campaign=feed-main

People need to understand the difference. A lot of us do. These senators do not, and many others, very intelligent people, obviously do not either.

A dangerous moment.

by James Weinheimer at July 25, 2014 04:17 PM

Concerning D’Souza’s “America” movie and how Search is changing

Lately, there have been several articles discussing how, when someone types “America movie” into Google, the top result is Dinesh D’Souza’s new film “America: Imagine the World Without Her” but it turns out that is not enough. The Google results apparently lack film locations and times, and D’Souza’s lawyers are claiming that Google needs to “fix” this “problem”. http://www.foxnews.com/entertainment/2014/07/16/google-denies-playing-politics-with-its-algorithms-for-dinesh-dsouza-america/

A Google rep acknowledged the “problem”, stating that their systems “have unfortunately confused the title of the movie ‘America’, because it’s a common term and appears in many movie titles.”

The representative went on to say: “We’ve updated the Knowledge Graph, our database that stores this type of information, but it will take some time to display show times and other details for this movie. We’re always working on improving our systems, and we appreciate the feedback,” adding that the show times are already there, but that the system just hasn’t fully updated yet to display them when you search directly for “America 2014 movie.”

Politics aside, this is interesting since it is a discussion about how people search, and what the expectations are. For instance, notice how the Google representative changed it from “America movie” to “America 2014 movie”–quite a different search. The article doesn’t mention this “little” difference, but as a cataloger, it struck me in the face!

“America movie” could be Captain America, Coming to America, Air America, or any of the titles shown on the IMDB http://www.imdb.com/search/title?title=america&title_type=feature, plus bunches more, I am sure. And yet, Google’s algorithms are supposed to “know” that somebody who searches “America movie” wants this latest one. That is obviously why the Google representative said “America 2014 movie”: there is much less ambiguity, although nobody asks how many people would actually search with the date.

In any case, this shows how “Search” has changed and continues to change before our eyes; no less important, it shows how users are expecting more and more. Not only does the public expect information about the movie, but also show times and directions–a staggering amount of information coming from all kinds of places–and if they don’t get those show times, the directors and producers might sue. Quite a change from 20 years ago!

If I were going to make a suggestion to the people who made this movie, it would be to consult their SEO (search engine optimization) experts, who probably would have told them to come up with a more distinctive name. Here is an excerpt from some of the latest best practices:
(http://www.weidert.com/whole_brain_marketing_blog/bid/123473/On-Page-SEO-Best-Practices-to-Follow-For-2014-Part-1)

“1. Page Titles
When it comes to optimizing your site pages for optimal SERP (search engine results page) performance, Page Titles are a crucial component that users evaluate when making the decision to click on a search result. This element is also heavily weighed by search crawlers when determining the relevancy of the page in relation to the user’s search query. Guidelines:
– Use the primary keyword at least once, preferably at the start of the title
– Limit to 70 characters or less (longer titles than this will get cut off in the SERP)
– Use pipe symbols ( | ) to divide specific keyword phrases but don’t go overboard. Remember these are “titles” so phrase them that way!”

Incidentally, the movie’s official website does follow that last guideline. Its title tag is:

<title>America: Imagine the World Without Her | Official Movie Site</title> (http://www.americathemovie.com/)

I just wonder how long it will be–that is, if it isn’t happening already–before the algorithms that analyze us all day long by reading our emails and watching what we surf will determine our political leanings and decide that we really don’t want to see a movie like “America: Imagine the World Without Her”, and as a result, we will never even know about it.

Or could it turn out that left-wing liberals who search “america movie”, and who have entirely different expectations about their SERPs (search engine results pages), will then get angry if they see a conservative film by D’Souza as the #1 result? Will they consider this a covert political act, as D’Souza’s people seem to be suggesting about the lack of show times and locations? Only time will tell.

This shows how the Googles are fundamentally different from libraries. It took some time for Google to understand what kind of a company it is: search engine? information? databases? and it finally figured out that it is an advertising agency, and probably the most successful one of all time. These sorts of stories make it obvious.

Can library methods be of help? I don’t know, but when I looked at the <title> above, I also noticed the keywords they had chosen:

<meta name="keywords" content="america, movie, dinesh, d'souza, film" />

Those keywords look pretty lame to me. I think any cataloger could do a much better job without thinking at all!

by James Weinheimer at July 25, 2014 12:57 PM

Resource Description & Access (RDA)

Name/Title NAR (Name Authority Record) in RDA : Questions & Answers : RDA Examples


QUESTION: I have to create an N/T NAR for selections of stories by an author, but the NAR for the author is established in AACR2. How do I create the N/T NAR in this case? Can we establish the N/T NAR straightaway in RDA?
ANSWER: In order to create a Name/Title Name Authority Record (NAR) for a MARC21 240 tag in Resource Description & Access (RDA), the base NAR for the person must also be in RDA. If the base NAR for the person is in AACR2, then it should first be converted to RDA. For example, see LCCN 2012350684:
  • Bibliographic Record

1000_ |a Jayabhikkhu,  |d 1908-1969,  |e author.
24010 |a Children's stories.  |k Selections
24510 |a Jayabhikkhunī śreshṭha bāḷavārtāo =  |b Jaybhikhkhuni shreshtha balvartao / |c śreṇī-sampādako, Yaśavanta Mahetā, Śraddhā Trivedī.
24631 |a Jaybhikhkhuni shreshtha balvartao
250__ |a Pahelī āvr̥tti.
264_1 |a Amadāvāda :  |b Gūrjara Grantharatna Kāryālaya,  |c 2011.
300__ |a 7, 152 pages :  |b illustrations ;  |c 21 cm.
336__ |a text  |2 rdacontent
337__ |a unmediated  |2 rdamedia
338__ |a volume  |2 rdacarrier
4900_ |a Gurjara bāḷavārtāvaibhava
520__ |a Representative collection of children's short stories.
546__ |a In Gujarati.
650_0 |a Children's stories, Gujarati.
[Source: Library of Congress Online Catalog]

  • NAR for Author Converted to RDA from AACR2

008 920113n| azannaabn |a aaa
010__ |a n 89260083
035__ |a (OCoLC)oca03089021
040__ |a DLC |b eng |c DLC |d DLC |d DLC-ON |d OCoLC |e rda
046__ |f 1908 |g 1969
053_0 |a PK1859.D368
1000_ |a Jayabhikkhu, |d 1908-1969
375__ |a male
377__ |a guj
4000_ |a Jayabhikhkhu, |d 1908-1969
4001_ |a Desāī, Bālābhāī, |d 1908-1969
4000_ |a Bālābhāī Vīracanda Desāī, |d 1980-1969
670__ |a Ṭhakkara, Naṭubhāī. Jayabhikkhu, vyaktitva ane vāṇmaya, 1991: |b t.p. (Jayabhikkhu) t.p. verso (Jayabhikhkhu)
670__ |a New Delhi MARC file, 10/15/91 |b (MLC hdg.: Jayabhikkhu, 1908-1969)
670__ |a New Delhi non-MARC file |b (hdg.: Desāī, Bālābhāī Vīracanda, 1908-1969; usage: Jayabhikkhu)

  • Name/Title NAR Created in RDA
LC control no.: n 2012217901
LCCN permalink: http://lccn.loc.gov/n2012217901
HEADING: Jayabhikkhu, 1908-1969. Children’s stories. Selections
000 00560cz a2200157n 450
001 9296325
005 20140212001715.0
008 130625n| azannaabn |a aaa
010__ |a n 2012217901
040__ |a DLC |b eng |c DLC |e rda
046__ |k 2012
1000_ |a Jayabhikkhu, |d 1908-1969. |t Children’s stories. |k Selections
370__ |g Ahmadābād, India
4000_ |a Jayabhikkhu, |d 1908-1969. |t Jayabhikkhunī śreshṭha bāḷavārtāo
4000_ |a Jayabhikkhu, |d 1908-1969. |t Jaybhikhkhunishreshtha balvartao
670__ |a Jayabhikkhunī śreshṭha bāḷavārtāo, 2011.



RDA Blog ➨ 150000 Pageviews
Dear all, I am pleased to inform you that RDA Blog has crossed 150,000 pageviews. Thanks to all of you for your support and encouragement. The layout of the blog has been changed for more convenient browsing. If you find RDA Blog helpful, please provide your feedback/comments through the RDA Blog Guest Book, and please Like/Share on Facebook and give a g+1 on Google+ through the widgets on the top right side of RDA Blog.




by Salman Haider (noreply@blogger.com) at July 25, 2014 12:20 PM

July 24, 2014

First Thus

ACAT An amazing record

Posting to Autocat

On 23/07/2014 16.33, Mary Beth Weber wrote:

I shared the original posting with a colleague who’s not on Autocat, and would like to share her response. She’s been charged with implementing Rutgers’ Open Access policy:

This is something we have been asked about in terms of the OA initiative as it was making its way through the Senate. Apparently it is quite typical for physics scholarship (much of which is in arxiv.org) to list literally hundreds of authors, all of them “primary”! I never thought about the fact that some would make their way to OCLC! Yikes.

As catalogs strive to become more “inclusive” in various ways, this will be the reality everyone will have to live with, and that will hold whether the records are physically added to the catalog or a federated search is implemented. If, as was suggested, these records have been created automatically, records like these can be pumped out by computers at a rate that makes our collective efforts look puny. These generated records may follow other rules, or no rules at all, and often serve all kinds of purposes. Records such as these, which add each and every person who has ever been attached to a project (that is my assumption), seem intended not so much for finding purposes (especially hard to demonstrate with names such as “Alison, J.” or “Becker, S.”), but, I suppose, to attach a certain prestige to the records, since they show how big and important the project is. Obviously, all of these people did not write all of these articles and the real authors are far fewer, but since the names appear to be in alphabetical order, there is no way to know which ones they are.

I don’t think we can change this trend, but it has the potential to change searching in some fundamental ways. If this goes on, almost any name, or group of names, someone searches will bring up these records. At the same time, there is this push/need/compulsion to be able to search “everything” in one search (whatever the word “everything” happens to mean). They could be handled the way Google handles its data: by pushing “unwanted” results down in the list, but I don’t know if that is any solution.

Is there anything we can do about it? Or do we just keep doing what we have always done?

by James Weinheimer at July 24, 2014 10:50 AM

July 23, 2014

NSDL Metadata Registry Blog

Server issues

Normally the Registry just hums along quietly and doesn’t demand too much attention. But the last system update we performed seems to have altered our memory usage pretty dramatically and we’re quite suddenly having out-of-memory issues and some heavy swapping. We’ve expanded the server capacity twice already as a stopgap while we investigate, but before we move to an even larger server we’re testing some alternative configurations.

In the meantime there may continue to be periodic slowdowns, but you should see some improvement in a few days.

The last thing we want is for you to think we’re not seeing, or ignoring, the problem.

by Jon at July 23, 2014 05:52 PM

First Thus

An amazing record

Posting to Autocat

I thought I would share one of the most amazing records I have ever seen on Worldcat. I found it by mistake searching for copy for the book “Vesta and the Vestal virgins” by A D Tani http://www.worldcat.org/search?qt=worldcat_org_all&q=vesta+tani
and noticed that the record just below that one, “Hunt for new phenomena using large jet multiplicities and missing transverse momentum with ATLAS in 4.7 fb⁻¹ of √s = 7 TeV proton-proton collisions,” has a truly stupefying number of authors attached. In fact, there are so many that when I click on it, the record won’t even load, and as a result, I can’t give anybody a direct link to the record.

There are other similar records too, all based on this “ATLAS” project.

Intrigued, I found the actual article in the arxiv.org site http://arxiv.org/pdf/1206.1760.pdf. I discovered that ATLAS is actually some kind of an experiment and is apparently based at CERN. The article is 48 p. (pdf), and the list of authors starts on p. 28 (pdf) and goes on for 12 pages! So, I guess it’s correct and all have been added.

I seriously doubt that the names in the WorldCat record are authority controlled; they appear to be just transcribed from the article (at least I suppose so; I confess I haven’t checked each one!).

My book came up no. 1, so I experienced no problem when searching for “vesta tani”. Still, I wonder what would/will happen if someone searched by other names that are in this record (these records) and how these developments could change retrieval for the users, especially as more and more records such as these are added to our databases.

by James Weinheimer at July 23, 2014 11:15 AM

Terry's Worklog

MarcEdit 6: Reintroduction of MARCCompare/RobertCompare

What is MARCCompare/RobertCompare?

Very rarely do I create programs for individuals to meet very specific user needs.  I’ve always taken the approach with MarcEdit that tools should be generalizable, and not tied to a specific individual or project.  RobertCompare was different.  The tool was created to support Mr. Robert (Bob) Ellett’s (ALA Tribute, Link to Dissertation Record in WorldCat) research for his Ph.D. dissertation, and only after completion was the tool generalized for wider use.

When I moved MarcEdit from the 4.x to the 5.x codebase, I dropped this utility because it seemed to have run its course.  This was something Bob would periodically give me a hard time about — I think that he liked the idea of RobertCompare kicking around.  Of course, the program was terribly complicated, and without folks asking for it, converting the code from assembly to C# just wasn’t a high priority.

Well, that changed last year when Bob suddenly passed away.  I liked Bob a lot — he was immeasurably kind and easy to get along with.  After his passing, I decided I wanted to bring RobertCompare back…I wanted to do something to remember my friend.  It’s taken a lot more time than I’d hoped, in part due to a move, a job change, and the complexity of the code.  However, after an extended absence, RobertCompare is being reintroduced into MarcEdit with MarcEdit 6.0.

MARCCompare/RobertCompare 2.0

The original version of RobertCompare was designed to answer a very specific set of questions.  The program didn’t just look for differences between records, but rather, utilizing a probability engine, made determinations regarding the types of changes that had been made in the records.  Bob’s research centered around the use of PCC records at non-PCC libraries, and he was particularly interested in the types of changes these libraries were making to the records when downloading them for use.  The original version of RobertCompare was very good at analyzing record sets and generating a change history based on the current state of the records.  But the program was incredibly complicated and slow…really, really slow. 

In order to make this tool more multi-use, I’ve removed much of the code centered around the probability matrix and instead created a tool that utilizes a differencing algorithm to generate an output file that graphically represents the changes between MARC files.  The output file is HTML and, at this point, pretty simple – but it has been created in a way that should let me add additional functionality if this tool proves to have utility within the community.

So what does it look like?  The program is pretty straightforward.  There is a home menu where you identify the two files that you want to compare, and then a place to designate a save file path.

Figure 1: MARCCompare/RobertCompare main window

The program can take MARC files and mnemonic files and compare them to determine what changes have been made between each record.  At this point, the files to be compared need to be in the same order.  This has been done primarily for performance reasons, as it allows the program to very quickly chew through very large files, which was what I was looking for as part of this re-release.

As noted above, the output of the files has changed.  Rather than breaking down changes into categories in an attempt to determine whether changes were updated fields, new fields, or deleted field data, the program now just notes additions/changes and deletions to the record and outputs this as an HTML record.  Figure 2 shows a sample of what the report might look like (the format is slightly fluid and still changing).

Figure 2: MARCCompare/RobertCompare output

Final thoughts

While I’m not sure that RobertCompare was ever widely used by the MarcEdit community, I do know that it had its champions.  Over the past year, I’ve heard from a handful of users asking about the tool, and letting me know that they still have MarcEdit 4.0 on their systems specifically to utilize this program.  Hopefully by adding this tool back into MarcEdit, they will finally be able to send MarcEdit 4.x into retirement and jump to the current version of the application.  For me personally, working on this tool again was a chance to remember a very good man, and finish something that I know probably would have given him a good laugh.  

–TR

by reeset at July 23, 2014 01:00 AM

July 22, 2014

TSLL TechScans

NASIG assumes management of Serialist

NASIG has announced that it has assumed management of the longstanding listserv SERIALIST.

The full text of the announcement is available via the NASIG blog.

by noreply@blogger.com (Jackie Magagnosc) at July 22, 2014 09:02 PM

July 21, 2014

Terry's Worklog

Code4Lib Article: Opening the Door: A First Look at the OCLC WorldCat Metadata API

For those interested in some code and feedback on experiences using the OCLC Metadata API, you can find my notes here: http://journal.code4lib.org/articles/9863

 

–TR

by reeset at July 21, 2014 06:24 PM

July 19, 2014

Cataloging Futures

RDA Training Booklet

One of my ATLA colleagues just pointed out this practical guide to RDA: RDA Training Booklet by Marielle Veve. It can be used as a quick reference for cataloging changes needed for the switch from AACR2 to RDA.

by Christine Schwartz at July 19, 2014 11:03 AM

July 17, 2014

Bibliographic Wilderness

ActiveRecord Concurrency in Rails4: Avoid leaked connections!

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating my app, which uses multi-threaded concurrency in unusual ways, to Rails4.  The good news is that the significant bugs I ran into in Rails 3.1, etc., reported in the earlier post, have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use.  So far so good.

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request has been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back in any connections that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do.  If you proceed blithely to use AR like you are used to in Rails, but have created Threads yourself — then connections will be automatically checked out to you when needed…. and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup.  That it uses the same method name is, I think, based on a misunderstanding by the Rails devs of what it’s doing. See Fear the Reaper below.)

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple dozen line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”.  Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

Thread.new do 
   ActiveRecord::Base.forbid_implicit_checkout_for_thread!

   # stuff
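   # With the patch applied, any ActiveRecord use in this thread has to be
   # explicit, e.g. (Widget is just a hypothetical model name):
   #
   #   ActiveRecord::Base.connection_pool.with_connection do
   #     Widget.find(some_id)
   #   end
   #
   # An AR call outside with_connection raises ImplicitConnectionForbiddenError
   # instead of silently leaking a checked-out connection.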
end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings — most databases will drop unused connections after a certain idle timeout, by default often hours long.  A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out, not-leaked connection can have its network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network connections, not checked-out-but-never-checked-in leaked connections). A dropped network connection is a legit problem you want handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, and I have not put it through its paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, and it’s triggered in entirely different circumstances. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, the fact that my errors resulted in that exception rather than a more meaningful one may possibly have been due to a bug in ruby 1.9.3 that’s fixed in ruby 2.0.

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too, and is the way background task pools are often set up.

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I were starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency-related stuff really only dates from the relatively recent Rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communication go through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do different? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other, safer main thread, convert it to a pure business object without any AR, and pass that into my worker threads.  Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.
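
Since the queue-and-writer-pool idea is language-agnostic, here is a rough sketch of it in Python rather than Ruby (the two stand-in functions are hypothetical placeholders for the real plugin work and the real AR writes): worker threads never touch the database, and a small fixed pool of writer threads drains a thread-safe queue.

import threading
import queue

results = queue.Queue()          # workers put plain business objects here
STOP = object()                  # sentinel telling a writer thread to shut down

def do_work(job):
    # Stand-in for the real HTTP-heavy plugin work; no database access at all.
    return {"job": job, "status": "done"}

def save_to_database(item):
    # Stand-in for the write a dedicated writer thread would do, using the one
    # connection it holds for its entire lifetime.
    print("writing", item)

def worker(job):
    results.put(do_work(job))

def writer():
    while True:
        item = results.get()
        if item is STOP:
            break
        save_to_database(item)

writer_pool = [threading.Thread(target=writer) for _ in range(2)]
for w in writer_pool:
    w.start()

workers = [threading.Thread(target=worker, args=(n,)) for n in range(10)]
for t in workers:
    t.start()
for t in workers:
    t.join()

for _ in writer_pool:            # one STOP per writer, once all workers are done
    results.put(STOP)
for w in writer_pool:
    w.join()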

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I’m using now, patterns that, while officially supported, AR isn’t very happy about.


Filed under: General

by jrochkind at July 17, 2014 08:08 PM

Mod Librarian

5 Things Thursday: Salaries, Rare Books, Metadata


Here are five more things for librarians and fans of information organization:

  1. Some statistics on employment for librarians and salaries from Library Journal.
  2. Your rare book questions answered here in a comprehensive FAQ page from RBMS.
  3. Five tips for librarians using web metrics.
  4. Metadata, Schema.org and getting your digital collections noticed.
  5. More summaries of the RBMS conference – one on copyright.


July 17, 2014 03:00 PM

First Thus

Books out, 3D printers in for reinvented US libraries – 16 July 2014 – New Scientist

Books out, 3D printers in for reinvented US libraries – 16 July 2014 – New Scientist.

“The image of a library as a building filled with books, quiet readers and shushing librarians is fading fast as we get ever more of our information through the internet. Websites like Wikipedia and vast online databases have largely replaced physical copies of reference books and back issues of journals. Other books can be offered in digital form, or physical copies stored out of sight and called up via an automated retrieval system.”

True. There is this idea that traditional libraries are no longer needed so it is understandable that librarians turn to “maker spaces” and so on.

There still needs to be a discussion about the quality of the information that we get through the internet. Is it really best that people use Wikipedia as an authoritative resource, or that a search that retrieves 1,000,000+ results provides the most “relevant” items in the top 10 while the rest are ignored because they are “not so relevant”?

It may be true, but I suspect not. People are settling for the ease of search retrieval and consider that “good enough” because if you seriously question it, you may be opening a Pandora’s box.

Still, it is happening and at least this way, libraries can keep their doors open!

by James Weinheimer at July 17, 2014 12:07 PM

July 16, 2014

Outgoing

Exploring Golang

Here is yet another blog post giving first impressions of a new language and comparing it with one the writer is familiar with.  In this case, comparing Google's Go language with Python.

A few of us at OCLC have been using Python fairly extensively for the last decade.  In fact, I have the feeling that I used to know it better than I do now, as there has been a steady influx of features into it, not to mention the move to Python 3.0.

Go is relatively new, but caught my eye because some groups are moving their code from Python to Go.  Looking at what they were trying to do in Python, one wonders why they thought Python was a good fit, but maybe you could say the same thing about how we use Python.  All of the data processing for VIAF is done in Python, as well as much of the batch processing that FRBRizes WorldCat.  We routinely push 1.5 billion MARC-ish records through it, processing hundreds or even thousands of gigabytes of data.

We use Python because of ease and speed of writing and testing.  But, it's always nicer if things run faster and Go has a reputation as a reasonably efficient language.  The first thing I tried was to just write a simple filter that doesn't do anything at all, just reads from standard input and writes to standard output one line at a time.  This is basic to most of our map-reduce processing and I've had more than one language fail this test (Clojure comes to mind).  In Python it's simple and efficient:

import sys
for line in sys.stdin:
    sys.stdout.write(line)

The Python script reads a million authority records (averaging 5,300 bytes each) in just under 10 seconds, or about 100,000 records/second.

Go takes a few more lines, but gets the job done fairly efficiently:

package main

import (
    "bufio"
    "io"
    "os"
)

func main() {
    ifile := bufio.NewReader(os.Stdin)
    for {
        line, err := ifile.ReadBytes('\n')
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        os.Stdout.Write(line)
    }
}

The Go filter takes at least 16 seconds to read the same file, about 62,000 records/second.  Not super impressive and maybe there is a faster way to do it, but fast enough so that it won't slow down most jobs appreciably.

My sample application was to read in a file of MARC-21 and UNIMARC XML records, parse them into an internal data structure, then write them out as JSON.

I had already done this in Python, but it took some effort to do it in Go.  The standard way of parsing XML in Go is very elegant (adding annotations to structures to show how the XML parser should interpret them), but it turned out both to burn memory and to run very slowly.  A more basic approach of processing the elements as a stream, which the XML package is glad to send you, was both more similar to how we were doing it in Python and much more efficient in Go.  Although there have been numerous JSON interpretations of MARC (and I've done my own), I came up with one more: a simple list (array) of uniform structures (dictionaries in Python) that works well in both Python and Go.
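
For anyone who hasn't seen the streaming style described above, here is a rough Python sketch of the idea. This is not OCLC's actual code, and the key names in the JSON structure are illustrative rather than the exact format mentioned in the post: each MARCXML record is handled as it streams by and emitted as one uniform list of dictionaries.

import json
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"

def record_to_fields(record):
    # Flatten one MARCXML record into a uniform list of dictionaries.
    fields = []
    for cf in record.findall(NS + "controlfield"):
        fields.append({"tag": cf.get("tag"), "value": cf.text or ""})
    for df in record.findall(NS + "datafield"):
        fields.append({
            "tag": df.get("tag"),
            "ind1": df.get("ind1", " "),
            "ind2": df.get("ind2", " "),
            "subfields": [{sf.get("code"): sf.text or ""}
                          for sf in df.findall(NS + "subfield")],
        })
    return fields

# Stream the records one element at a time instead of building the whole tree.
with open("authorities.xml", "rb") as handle:   # hypothetical input file
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == NS + "record":
            print(json.dumps(record_to_fields(elem)))
            elem.clear()                        # keep memory use flat while streaming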

Overall, the Go code was slightly more verbose, mostly because of its insistence on checking return codes (rather than Python's tendency to rely on exceptions), but very readable.  Single-threaded code in Go turns about a thousand MARC XML records/second into JSON.

Which sounds pretty good until you do the same thing in Python and it transforms them at about 1,700 records/second.  I profiled the Go code (nice profiler) and found that at least 2/3 of the time was in Go's XML parser, so no easy speedups there.  Rather than give up, I decided to try out goroutines, Go's rather elegant way of launching routines on a new thread.

Go's concurrency options seem well thought out.  I set up a pool of goroutines to do the XML parsing and managed to get better than a 5x speedup (on a 12-core machine). That would be worthwhile, but we do most of our processing in Hadoop's map-reduce framework, so I tested it in that.  The task was to read 47 million MARC-21/UNIMARC XML authority records stored in 400 files and write the resultant JSON to 120 files.

Across our Hadoop cluster we typically run a maximum of 195 mappers and 195 reducers (running out of memory on Linux is something to avoid!).  The concurrent Go program was able to do the transform of the records in about 7 minutes (at least a couple minutes of that is pushing the output through a simple reducer), and the machines were very busy, substantially busier than when running the equivalent single-threaded Python code.  Somewhat to my surprise, the Python code did the task in 6.5 minutes.  Possibly a single-threaded Go program could be sped up a bit, but my conclusion is that Go offers minimal speed advantages over Python for the work we are doing.  The fairly easy concurrency is nice, but map-reduce is already providing that for us.

I enjoyed working with it, though.  The type safety sometimes got in the way (I especially missed Python dictionaries/maps, which are indifferent to the types of the keys and values), but at other times the type checking caught errors at compile time.  The standalone executables are convenient to move around, the profiler was easy to use, and I really liked that it has a standard source file formatter.  I didn't try the built-in documentation generator, but it looked simple and useful, as does the testing facility. The libraries aren't as mature as Python's, but they are very good.  We never drop down into C for our Python work, but we do depend on libraries written in C, such as the XML parser cElementTree.  It would be nice to have a system where everything could be done at the same level (Julia or PyPy?), but right now we're still happy with straight Python and feel that its speed seldom gets in the way.

If nothing else, I learned a bit about Go and came up with a simple JSON MARC format that works quite a bit faster in Python (and Go) than my old one did.

--Th

The machine I used for standalone timings is a dual 6-core 3.1 GHz AMD Opteron box running Linux 2.6.18 (which precluded loading Go 1.3, so I used 1.1).  I got similar (but slower) timings with Go 1.3 on my 64-bit quad-core 2.66 GHz Intel PC running Windows 7, so I don't think that using Go 1.3 would have made much of a difference.  Both the Go and Python programs were executed as streaming map-reduce jobs across 39 dual quad-core 2.6 GHz AMD machines running Cloudera 4.7.

by Thom at July 16, 2014 08:06 PM

025.431: The Dewey blog

Sea Turtles: Zoology, Conservation Technology, Resource Economics

Works about sea turtles commonly treat them from the perspectives of zoology, conservation technology, or resource economics. If you browse the Relative Index for "Sea turtles," you find entries for the topic in three different disciplines:

Sea turtles                                 597.928
Sea turtles--conservation technology        639.977928
Sea turtles--resource economics             333.957928

The number 597.928 appears directly opposite the term "Sea turtles" with no subheading—an indication that it is the interdisciplinary number.

Class works on the zoology of sea turtles and interdisciplinary works on sea turtles in 597.928 Chelonioidea (Sea turtles).  In the upward hierarchy for 597.928 is 590 Animals (Zoology). An example of a work classed in 597.928 is Sea Turtles: An Extraordinary Natural History of Some Uncommon Turtles.

If one is familiar with the 590s, in which many numbers can be built following the add instructions under 592-599 Specific taxonomic groups of animals, one might expect to find sea turtles at a built number for turtles in the marine environment.  In other words, one might expect to start with 597.92 Chelonia and follow the add footnote "Add as instructed under 592-599," adding 1 from 592-599: 1 General topics of natural history of animals; then follow the instruction to add "the numbers following 591 in 591.3-591.7," adding 7 from 591.7 Animal ecology, animals characteristic of specific environments; then follow the instruction at 591.71-591.78 Specific topics in animal ecology; specific environments to add "the numbers following 577 in 577.1-577.8," adding 7 from 577.7 Marine ecology, to build 597.92177 Marine turtles. However, the number 597[.92177] Marine turtles is bracketed (to show that it is invalid), with a do-not-use note: "Do not use; class in 597.928."

Class works that focus on the leatherback turtle in a subdivision of the broad number for sea turtles: 597.9289 Dermochelyidae (Leatherback turtle).  An example of a work classed in 597.9289 is Leatherback Turtle: The World's Heaviest Reptile.

Class works that focus on the technology of saving sea turtles from extinction in 639.977928 Sea turtles—conservation technology.  That number has 639.9 Conservation of biological resources and 600 Technology in its upward hierarchy.  It appears already built in WebDewey: built with base number 639.97 Specific kinds of animals plus 7928 from 597.928 Chelonioidea (Sea turtles), following the add note at 639.971-639.978 Specific kinds of animals other than mammals. An example of a work classed in 639.977928 is Saving Sea Turtles: Extraordinary Stories from the Battle against Extinction.

Class broad works about sea turtles as a biological resource that include economic aspects in 333.957928 Sea turtles--resource economics. That number has 333.95 Biological resources and 330 Economics in the upward hierarchy. The number 333.957928 appears already built in WebDewey: built with base number 333.957 Reptiles and amphibians plus 928 from 597.928 Chelonioidea (Sea turtles), following the add note at 333.9578-333.9579 Amphibians, specific reptiles. More specific numbers can be built using the same add note, e.g., for Assessment of Leatherback Turtle (Dermochelys coriacea) Fishery and Non-Fishery Interactions in Atlantic Canadian Waters: 333.95792890916344 Leatherback turtle--resource economics--Gulf of Saint Lawrence, coastal waters of Newfoundland, coastal waters of eastern Nova Scotia.  (The area number T2—16344 Gulf of Saint Lawrence, coastal waters of Newfoundland, coastal waters of eastern Nova Scotia best reflects the Canadian portion of the summer range of the leatherback turtle: see the map in the Canadian Geographic article "Leatherback Sea Turtle: Endangered Species.") Here is a summary of the number-building process in WebDewey:

[WebDewey screenshot: summary of the number building for 333.95792890916344]
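
For anyone who would like to see the arithmetic spelled out, here is a small illustrative sketch (in Python, only because it makes the segments explicit) that assembles the built numbers from the segments named above; it is not how WebDewey itself builds numbers.

    # Illustration only: concatenate the DDC segments named above and place the
    # decimal point after the third digit.
    def build(*segments):
        digits = "".join(segments)
        return digits[:3] + "." + digits[3:]

    # 639.97 Specific kinds of animals + 7928 from 597.928 Chelonioidea
    print(build("63997", "7928"))                  # 639.977928

    # 333.957 Reptiles and amphibians + 928 from 597.928 Chelonioidea
    print(build("333957", "928"))                  # 333.957928

    # 333.957 + 9289 from 597.9289 Dermochelyidae + 09 geographic treatment
    # + area notation 16344 from Table 2
    print(build("333957", "9289", "09", "16344"))  # 333.95792890916344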

 

by Juli at July 16, 2014 07:58 PM

TSLL TechScans

Library of Congress Recommended Format Specifications

In June the Library of Congress released its Recommended Formats Specifications to assist in its acquisitions process. These recommendations were developed with long-term preservation in mind and address analog as well as digital formats. They cover six categories of creative works: Textual Works & Musical Compositions, Still Image Works, Audio Works, Moving Image Works, Software & Electronic Gaming & Learning, and Datasets/Databases.

While the primary purpose of these recommendations is to provide internal guidance at the Library of Congress, they will also serve as a best-practices guide for the library and related communities, helping to ensure long-term preservation of creative works. Bear in mind that these are recommended formats, not a mandate to exclude others; the document simply identifies the formats with the best potential for long-term access.

Check out the Press Release here.

by noreply@blogger.com (Lauren Seney) at July 16, 2014 03:31 PM

First Thus

ACAT ALIENS as a Subject Heading, why is it considered to be offensive?

Posting to Autocat

On 16/07/2014 3.19, Elena Stuart wrote:

The term ALIENS is a Library of Congress Subject Heading which covers “persons who are not citizens of the country in which they reside.” It seems like a neutral term. I am doing a paper for an MLIS class on Subject Analysis and we are discussing how the LOC changes Subject Headings. In reviewing this Subject Heading it seems to cover both positive and negative aspects of the term Alien. I am a recent immigrant to the US from Russia and am therefore an alien and do not find this term offensive, but I understand that many do. Some would like the LOC to change this Subject Heading to something else.

In addition to the other comments, I also want to emphasize that cataloging no longer exists in a vacuum, and that it is highly important today to look at other tools the public uses all the time, primarily Wikipedia, which in many ways works much better than our catalogs. In Wikipedia, there is the wonderfully named disambiguation page for Alien http://en.wikipedia.org/wiki/Alien. When our catalogs work correctly, you can get similar results, but you must search them in 19th-century fashion, like a card catalog, with a left-anchored text search, which nobody does, e.g. http://1.usa.gov/1mgOKqG (subject)
http://1.usa.gov/UceOwB (name)
http://1.usa.gov/1jxdpwg (title)

Another very interesting point I just discovered is that when I search “aliens” as a keyword in Wikipedia, it takes me directly to “alien,” and at the very bottom there are links to “All pages beginning with Alien” and “All pages with titles containing Alien.”

The disambiguation pages, while very nice, often don’t give you RT, NT, or BT relationships, but you do get them on the page for “Alien (law)” http://en.wikipedia.org/wiki/Alien_%28law%29, where you can click on the categories, in this case “Immigration law,” “Legal categories of people,” and “Expatriates.” That could stand improvement.

Another consideration is full text. If someone is interested in researching the history of African-Americans or gays, then searching today’s terms “African-Americans” or “Gays” won’t get you very far, because those are not the words used in documents written more than 20 years or so ago. The terms used in the 19th century may offend us today, but no matter: if people want to research those topics, they have absolutely no choice except to use those terms. It’s a different world.

The age-old problem of choosing words that may or may not occur to people, or that may offend, may gradually disappear as linked data takes hold and everyone begins to worry more about adding the “correct” URI instead of the “correct” string of characters. I, for one, would be very happy about that. In my experience, almost anything I might say, even something like “I love my cats” or “I like hot dogs,” will, I have no doubt, make someone, somewhere, angry. Some people are offended by the very sight, or existence, of something else.

If the overriding goal is never, ever to offend anyone, that is simply an impossibility, and anyone trying to live that way would make his or her life completely unbearable. Here in Rome, I see and hear things all the time that offend me, but people are (mostly) accepting so that relative peace and harmony can prevail. At least, as much as they can in Rome!

by James Weinheimer at July 16, 2014 12:12 PM

July 15, 2014

Resource Description & Access (RDA)

International Standard Serial Number (ISSN) - MARC to RDA Mapping

MARC 21 FIELD TAG  MARC 21 SUBFIELD CODE  MARC 21 FIELD / SUBFIELD NAME          RDA INSTRUCTION NUMBER  RDA ELEMENT NAME

022                                       International Standard Serial Number
022                a                      International Standard Serial Number  2.15                    Identifier for the Manifestation
022                l                      ISSN-L                                N/A
022                m                      Cancelled ISSN-L                      N/A
022                y                      Incorrect ISSN                        2.15                    Identifier for the Manifestation
022                z                      Cancelled ISSN                        2.15                    Identifier for the Manifestation

[Source: RDA Toolkit]
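
As a rough illustration of how such a mapping can be put to work in code, here is a short sketch using the pymarc library (an assumption on my part; any MARC toolkit would do). The dictionary simply restates the table above, and the file name is hypothetical.

    # Sketch: apply the 022 mapping above to a file of MARC records with pymarc.
    # The dictionary restates the table; subfields $l and $m have no RDA mapping.
    from pymarc import MARCReader

    RDA_022_MAP = {
        "a": ("2.15", "Identifier for the Manifestation"),  # ISSN
        "y": ("2.15", "Identifier for the Manifestation"),  # Incorrect ISSN
        "z": ("2.15", "Identifier for the Manifestation"),  # Cancelled ISSN
    }

    with open("records.mrc", "rb") as fh:                   # hypothetical file
        for record in MARCReader(fh):
            for field in record.get_fields("022"):
                for code, (rda_no, rda_name) in RDA_022_MAP.items():
                    for value in field.get_subfields(code):
                        print("022 $%s %s -> RDA %s (%s)" % (code, value, rda_no, rda_name))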

by Salman Haider (noreply@blogger.com) at July 15, 2014 08:23 AM