I recently attended the New England Technical Services Librarians 2014 Annual Spring Conference, or NETSL 2014. I’ve attended NETSL for some time now. My first time was in 2006 as a GSLIS student at Simmons. Later, I volunteered to help out with registration. Then I joined the board as Treasurer. This board year, I’m the Past President, finishing up a three-year term that began as Vice President, on top of my earlier year as Treasurer and my time as a volunteer.

Each year, NETSL hands out the NETSL Award for Excellence in Technical Services. This year, the NETSL Award went to Amira Aaron of Northeastern University and Diane Baden of Boston College. What I found particularly moving this year was Diane’s acceptance speech about networking and mentoring. In short, she explained that it was a great honor to receive this award; for her, involvement in the profession was all about learning, networking, reaching out, and giving back to the profession. This struck home.

Many of us are on committees of one sort or another. Many librarians are also tenured, or seek tenure or a promotion, which involves showing evidence of professional activity and scholarly research. Being professionally active because it will count toward moving up the career ladder is one reason. Granted, many might see this as myopic. But I see it as a good start to becoming involved, if it leads to a greater understanding of what it means to be part of the profession and what it means to be professional. I would like to emphasize the “good start”. Being professionally involved cannot be solely about working up the career ladder. Well, I guess it can, but that would mean a very ambitious sort of climb that might be helpful only to the climber. I would prefer to see professional involvement as giving back to the profession, which is where librarians really excel: sharing and communicating their expertise and experiences amongst themselves.
This passion to help our profession comes not from being ambitious towards greater professional peaks but the willingness to see our profession grow and evolve. Is this idealistic? Certainly, though not entirely! It is also practical. By being on committees, presenting, networking, being mentored or mentoring, we learn through engagement. With small steps, we breach the many silos that we work in and around each day to do a better job as librarians as a whole. To be engaged is to work collaboratively with our peers for better or worse.
This sentiment was echoed in the remembrance for Birdie MacLennan, a prominent New England librarian who passed away recently. The remembrance was delivered by two of Birdie’s mentees and one of Birdie’s mentors. They recalled how Birdie would selflessly help others navigate a new career as a librarian. This was done not only through networking but also through engagement with others to further the field of technical services in librarianship.
Out of all the presentations that day, the most moving were Diane’s acceptance speech and this remembrance. Each reminded me that engagement is so much more than just an accomplishment. It is giving back to the profession by learning and collaborating with your peers.
Developers from the New York Times have released some open source software meant for displaying and managing large digital content collections, and doing so client-side, in the browser with JS.
Developed for journalism, this has some obvious potential relevance to the business of libraries too, right? Large collections (increasingly digital), that’s what we’re all about, ain’t it?
Today we’re open-sourcing two internal projects from The Times:
- PourOver.js, a library for fast filtering, sorting, updating and viewing large (100k+ item) categorical datasets in the browser, and
- Tamper, a companion protocol for compressing categorical data on the server and decompressing in your browser. We’ve achieved a 3–5x compression advantage over gzipped JSON in several real-world applications.
…Collections are important to developers, especially news developers. We are handed hundreds of user submitted snapshots, thousands of archive items, or millions of medical records. Filtering, faceting, paging, and sorting through these sets are the shortest paths to interactivity, direct routes to experiences which would have been time-consuming, dull or impossible with paper, shelves, indices, and appendices….
…The genesis of PourOver is found in the 2012 London Olympics. Editors wanted a fast, online way to manage the half a million photos we would be collecting from staff photographers, freelancers, and wire services. Editing just hundreds of photos can be difficult with the mostly-unimproved, offline solutions standard in most newsrooms. Editing hundreds of thousands of photos in real-time is almost impossible.
Yep, those sorts of tasks sound like things libraries are involved in, or would like to be involved in, right?
The actual JS does some neat things with figuring out how to incrementally and just-in-time send deltas of data, etc., and provides some good UI tools. Look at the page for more.
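The core idea behind fast categorical filtering of this sort can be sketched with set intersections. This is a simplified illustration of the technique, not PourOver’s actual API or data structures:

```python
# Sketch of categorical (faceted) filtering via an inverted index:
# map each facet value to the set of item ids carrying it, then
# intersect the sets for the selected facet values.
from collections import defaultdict

def build_index(items, facets):
    """Map each (facet, value) pair to the set of item ids carrying it."""
    index = defaultdict(set)
    for i, item in enumerate(items):
        for facet in facets:
            index[(facet, item[facet])].add(i)
    return index

def filter_items(index, selections, universe):
    """Intersect the id sets for each selected facet value."""
    result = set(universe)
    for facet, value in selections:
        result &= index[(facet, value)]
    return sorted(result)

photos = [
    {"source": "staff", "event": "swimming"},
    {"source": "wire", "event": "swimming"},
    {"source": "staff", "event": "cycling"},
]
idx = build_index(photos, ["source", "event"])
print(filter_items(idx, [("source", "staff"), ("event", "swimming")], range(len(photos))))
# -> [0]
```

Because the per-value id sets are built once, each new facet selection is just another set intersection rather than a rescan of the whole collection, which is what makes this approach responsive on 100k+ item datasets.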
I am increasingly interested in what ‘digital journalism’ is up to these days. They are an enterprise with some similarities to libraries, in that they are an information-focused business which is having to deal with a lot of internet-era ‘disruption’. Journalistic enterprises are generally for-profit (unlike most of the libraries we work in), but still with a certain public service ethos. And some of the technical problems they deal with overlap heavily with our area of focus.
It may be that the grass is always greener, but I think the journalism industry is rising to the challenges somewhat better than ours is, or at any rate is putting more resources into technical innovation. When was the last time something that probably took as many developer-hours as this stuff, and is of potential interest outside the specific industry, came out of libraries?
I have seen several different approaches to division of labor in developing, deploying, and maintaining web apps.
The one that seems to work best to me is when the same team responsible for developing an app is the team responsible for deploying it and keeping it up, as well as for maintaining it. The same team — and ideally the same individual people (at least at first; job roles and employment changes over time, of course).
If the people responsible for writing the app in the first place are also responsible for deploying it with good uptime stats, then they have an incentive to create software that can be easily deployed and can stay up reliably. If it can’t at first, then the people who receive the pain of this are the same people best placed to improve the software to deploy better, because they are most familiar with its structure and how it might be altered.
Software is always a living organism; it’s never simply “done”. It’s going to need modifications in response to what you learn from how its users use it, as well as from changing contexts and environments. Software is always under development; the first time it becomes public is just one marker in its development lifecycle, not a clear boundary between “development” and “deployment”.
Compare this to other divisions of labor, where maybe one team does “R&D” on a nice prototype, then hands their code over to another team to turn it into a production service, or to figure out how to get it deployed and keep it deployed reliably and respond to trouble tickets. Sometimes these teams may be in entirely different parts of the organization. If it doesn’t deploy as easily or reliably as the ‘operations’ people would like, do they need to convince the ‘development’ people that this is legit and something should be done? And when it needs additional enhancements or functional changes, maybe it’s the crack team of R&Ders who do it, even though they’re on to newer and shinier things; or maybe it’s the operations people expected to do it, even though they’re not familiar with the code since they didn’t write it; or maybe there’s nobody to do it at all, because the organization is operating on the mistaken assumption that developing software is like constructing a building: when it’s done, it’s done.
I just don’t find that this arrangement works well for creating robust, reliable software that can evolve to meet changing requirements.
Recently I ran into a quote from an interview with Werner Vogels, Chief Technology Officer at Amazon, expressing these benefits of “You build it, you run it.”:
There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
I was originally directed to that quote by this blog post on the need for shared dev and ops responsibility, which I recommend too.
In this world of silos, development threw releases at the ops or release team to run in production.
The ops team makes sure everything works, everything’s monitored, everything’s continuing to run smoothly.
When something breaks at night, the ops engineer can hope that enough documentation is in place for them to figure out the dials and knobs in the application to isolate and fix the problem. If it isn’t, tough luck.
Putting developers in charge of not just building an app, but also running it in production, benefits everyone in the company, and it benefits the developer too.
It fosters thinking about the environment your code runs in and how you can make sure that when something breaks, the right dials and knobs, metrics and logs, are in place so that you yourself can investigate an issue late at night.
As Werner Vogels put it on how Amazon works: “You build it, you run it.”
The responsibility of maintaining your own code in production should encourage any developer to make sure that it breaks as little as possible, and that when it breaks you know what to do and where to look.
That’s a good thing.
None of this means you can’t have people who focus on ops and other people who focus on dev; but I think it means they should be situated organizationally close to each other, on the same teams, and that the dev people have to share some ops responsibilities, so they feel some pain from products that are hard to deploy, or hard to keep running reliably, or hard to maintain or change.
 Note some people think even constructing a building shouldn’t be “when it’s done it’s done”, but that buildings too should be constructed in such a way that allows continual modification by those who inhabit them, in response to changing needs or understandings of needs.
Here are five wonderful things:
- Happy National Library Week!
- Check out this fantastic postcard collection online at the Tacoma Public Library.
- What is DAM?
- America’s libraries photographed. And yes, The Seattle Public Library makes it again.
- I want this book on Description: Innovative Practices for Libraries and Special Collections.
I’ve always suspected the protective power of books, the sense of sanctuary in a library or bookshop, and the calm that descends when reading. In his novel, Steven Hall describes the intriguing premise that all human minds are linked by vast ‘streams’ of language and thought, and, swimming through these streams, are thought-fish. Not all fish are good, and from the most predatory of all, the Ludovician, we need protection or camouflage:
‘Books of Fact/Books of Fiction: Books of fact provide solid channels of information in many directions. Library books are best because they also link the book itself to every previous reader and any applications of the text. Fiction books also generate illusionary flows of people and events and things that have never been or maybe have only half-been from a certain point of view. The result is a labyrinth of glass and mirrors which can trap an unwary fish for a great deal of time. I have an old note written by me before I got so vague which says that some of the great and most complicated stories like the Thousand and One Nights are very old protection puzzles, or even idea nets by which ancient peoples would fish for and catch the smaller conceptual fish. I don’t know if this is true or not. Build the books into a small wall around yourself. My notes say three or five books high is best.’
There have been times, there is no denying, when the only thing to do would be to come home to my favourite books, stack them up around me, and find peace enough to relax, knowing that I was safe in my book tower. Now I know they only have to be three or five books high…
“thoughts from scarlettlibrarian”
I catalog a lot of fiction. A popular topic involves the historic Christian military order the Knights Templar – an organization that existed for nearly two centuries during the Middle Ages.
When cataloging works about this historic order, I notice that a lot of libraries incorrectly use the subject heading:
Knights Templar (Masonic order) n 80001259
which refers to an international philanthropic group affiliated with Freemasonry.
For works about the historic Christian military order of the Knights Templar the correct subject heading is:
Templars n 80113860
These groups are easily confused. I hope this posting sheds some light on this topic…
Summary: Tim, LibraryThing’s founder, is going to be giving a one-day, almost-free introduction to PHP programming on Friday, June 27, alongside the preconference day of ALA 2014 in Las Vegas, NV.
“Enough PHP to Be Dangerous” will cover the basics of PHP, the most common web programming language. It’s designed for people with little programming experience.(1)
Instruction will be project-based–a series of brief explanations followed by hands-on problem solving. You won’t emerge a PHP master, but you’ll know enough to be dangerous!(2)
We’ll presume some familiarity with the web, including basic HTML. You must bring your own laptop. We’ll ask you to set up a simple development environment before you come–we’ll send instructions. You should be connected to libraryland somehow. Prepare for a mental workout–there’s no point going slow when we only have a day.
Where? The session will be held Friday June 27, 9am-5pm at Embassy Suites Convention Center, three blocks from the Convention Center.
How do I sign up? Email firstname.lastname@example.org. Say who you are and put “Enough PHP to Be Dangerous” in the subject line.
We’ll close applications on Monday, April 14 at 4:00 PM EST. If more than 30 people sign up, we’ll pick the winners randomly. If fewer, we’ll allow people to sign up after the deadline on a first-come-first-served basis.
What Does it Cost? On the day of we’ll pass the hat, asking $55 to cover the $45 cost of hotel-provided muffins, coffee and sandwiches, and some of the cost of the room, equipment and wifi. If $55 is a hardship for you, no problem–we’ll waive the fee, and you’ll still get a sandwich.
Why do I need this? Libraryland needs more programmers, and people who know what programming is. Library software vendors exert outsized power and too often produce lousy software because the community has limited alternatives. The more library programmers, the better.
Why are you doing this? Conferences are hugely expensive to exhibit at. They’re worth it, but it’s a shame not to do more. If we’re going to be out there anyway, adding a day, a room and a projector doesn’t add much to the cost, and could help the community. Also, I’m a frustrated former Latin teacher, so it’ll be fun for me!(3)
Is this officially connected to ALA, LITA, Library Code Year, etc.? Nope. We’re doing this on our own. It’s easier that way. Of course, we love all these groups, especially our friends at LITA.(4)
Will the class be broadcast? No. That sounds fiddly. Maybe another time.
Want to help out? If you’re a programmer and want to help make this happen, email me. It would be great to have another programmer or two helping people figure out why their script won’t run. It’ll be fun, and you can put it on your resume.
Here are five things and a question:
- Should you take a cataloging course in grad school? Hack Lib School says yes and I say heck yeah!
- Want to learn more about rare books and go to California? California Rare Book School has some fabulous offerings. I hope to attend someday.
- Really nice job on CONTENTdm collections from Ball State University here.
- Does the future for public libraries include a…
Several of us here at OCLC have spent considerable time over the last decade trying to pull bibliographic records into work clusters. Lately we've been making considerable progress along these lines and thought it would be worth sharing some of the results.
Probably our biggest accomplishment is that work we have done to refine the worksets is now visible in WorldCat.org (as well as in an experimental view of the works as RDF). This is a big step for us, involving a number of people in research, development and production. In addition to making the new work clusters visible in WorldCat, this gives us in Research the opportunity to use the same work IDs in other services such as Classify. We also expect to move the production work IDs into services such as WorldCat Identities.
One of the numbers we keep track of is the ratio of records to works. When we first started, the record to work ratio was something like 1.2:1, that is, every work cluster averaged 1.2 records. The ratio is now close to 1.6:1, and for the first time the majority of records in WorldCat are now in work clusters with other records, primarily because of better matching.
Of records that have at least one match, we find the average workset size is 3.9 records. In terms of holdings we have 10.6 holdings/workset and over 43 holdings/non-singleton workset (worksets with more than one record). Another way to look at this is that 84% of WorldCat’s holdings are in non-singleton worksets and over 1.5 billion of WorldCat’s 2.1 billion holdings are in worksets of 3 or more records, so collecting them together has a big impact on many displays.
As the worksets become larger and more reliable we are finding many uses for them, not the least in improving the work-level clustering itself. We find the clustering helps find variations in names, which in turn helps find title variations. We are also learning how to connect our manifestation and expression level clustering with our work-level algorithms, improving both. The Multilingual WorldCat work reported here is also an exciting development growing out of this.
There is still more to do of course. One of our latest approaches is to build on the Multilingual WorldCat work by creating new authority records in the background that can be used to guide the automated creation of authority records from WorldCat, that in turn help generate better clusters. We are applying this technique at first on problem works such as Twain's Adventures of Huckleberry Finn and his Adventures of Tom Sawyer which are published together so often and cataloged in so many ways that it is difficult to separate the two. These generated title authority records are starting to show up in VIAF as 'xR' records.
So, we've been working on this off and on for a decade, but WorldCat and our computational capabilities have changed dramatically and it still seems like a fresh problem to us as we pull in VIAF to help and use matching techniques that just would not have been feasible a decade ago.
While many of us, both in and out of OCLC Research, have worked on this over the years, no one has done more than Jenny Toves who both designs and implements the matching code.
Did you know that today is “Day of DH” or Digital Humanities? Our Digital Humanities Librarian scheduled an entire day of digital humanities fun. It started off with a vote on sessions and workshops for the morning and afternoon. This was mixed in with hackfests and tool/software presentations. We had anywhere from 20-30 people all throughout the day! The best thing is that metadata featured prominently in many of the sessions.
During lunch, a colleague asked how to present the topic of linked data to an audience unfamiliar with it. The added challenge, as my colleague explained, is that the audience would not necessarily have the technology knowledge to understand linked data. We tried to look for examples. One example that my colleague showed me was the Virtual International Authority File. Essentially the issue that we skirted was name disambiguation. This is a huge problem, especially with journals. There are solutions such as PubMed and Scopus. Recently, ORCID, or Open Researcher and Contributor ID, has launched out of beta. ORCID assigns a permanent unique identifier to authors that is associated with a profile. Profiles can be created by authors or third parties, such as libraries. What ORCID does is basically bring together the variant forms of an author name. Though this is not necessarily linked data, it is related (no pun intended). You can use the unique ID in other places to link out to an author’s profile. In a sense, we’re talking about making relationships or linking one idea to another. In this particular case, it is linking variant names to one authorized form (an activity well known to the cataloging world).
After lunch, we played with NVivo. This is a software tool to analyze unstructured data. This is a fantastic tool if you need to do any text/data mining. It works with a number of different software tools such as NCapture, RefWorks, Twitter, Survey Monkey, etc. It also works with external data sets in various formats (xlsx, txt, csv). For the session, we wanted to analyze a Twitter search based on a hashtag. We captured the Twitter search based on the hashtag poetry using NCapture. Thanks to the functionality of Chrome, we were able to import the data set from the Twitter search into NVivo. From there, we saw the metadata associated with the export. We were able to visualize the data using a Map feature in NVivo. We also got details about the number of tweets and retweets and other great information. One of the issues that the session leader emphasized was that such text mining is driven by metadata. If you pull from social networking sites, there is always some sort of associated metadata. This is important as it will help indicate the data corresponding to the metadata labels. With Twitter, the metadata labels were easy to understand. However, in my next example they were not.
Lastly, we looked at Europeana. Our goal was to use the API service to extract data. First you need to get an API key. This just means registering with Europeana‘s API portal. With the key, you can construct a query. Europeana has plenty of examples of how to construct queries. If this is a little beyond you, then use the API Console, which constructs the query for you. Once you do this, then you put in your query (for example, search everything with Mozart). This will return a set of results in JSON. Now you might ask: What do I do with that? Cut and paste the JSON code into a simple text editor (TextWrangler, TextEdit, Notepad) and save it under whatever name you want with the extension .json. Now with the open-source software OpenRefine, you can upload this file and create a new project. OpenRefine is great for data cleanup and visualization. Once you get the JSON data set into OpenRefine, then you can start playing with the metadata and data. Of course, we did all of that! What great fun. The first thing we saw was that the metadata labels used by Europeana were not always helpful. At times we had to guess what the data were. Not always a good thing. That just made me think again of the necessity to include README files for any data set – but that’s a different topic.
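The query-then-save workflow can be sketched in a few lines of Python. The endpoint shape, API key, and response fields below are illustrative stand-ins, not a live call; check Europeana’s API documentation for the current URL and parameters:

```python
import json
from urllib.parse import urlencode

# Hypothetical key; you get a real one by registering with Europeana's API portal.
API_KEY = "YOUR_API_KEY"

# Build a search query URL (URL shape is an assumption for illustration).
query_url = "https://www.europeana.eu/api/v2/search.json?" + urlencode(
    {"wskey": API_KEY, "query": "Mozart"}
)

# A trimmed-down stand-in for the JSON the API would return.
response_text = (
    '{"itemsCount": 2, "items": '
    '[{"title": ["Requiem"]}, {"title": ["Don Giovanni"]}]}'
)
results = json.loads(response_text)

# Save the raw JSON with a .json extension, ready to load into OpenRefine.
with open("mozart_results.json", "w") as f:
    f.write(response_text)

print(results["itemsCount"])         # 2
print(results["items"][0]["title"])  # ['Requiem']
```

The point is simply that the API hands back structured text; once it is saved as a .json file, tools like OpenRefine can take over the cleanup and exploration.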
The great lesson from today was that metadata is really everywhere. It’s extremely important even if users don’t know that they rely on metadata to do the work they do. This is why it is important to consider data consistency and accuracy, and especially name disambiguation. And this was all from the perspective of primarily English majors! Go Digital Humanists!
One of the new functions in MarcEdit is the inclusion of an Edit Field Data function. For the most part, batch edits on field data have been handled primarily via regular expressions using the traditional Replace function. For example, if I had the following field:
And I wanted to swap the subfield order, I’d use a regular expression in the Replace function and construct the following:
This works well when you need to do advanced edits. The problem was that any field edit that didn’t fit into the Edit Subfield tool needed to be done as a regular expression. In an effort to simplify this process, I’ve introduced an Edit Field Data tool.
This tool exposes the data, following the indicators, for edit. So, in the following field:
=999 \\$aTest Data
The Edit Field Data tool could interact with the data: “$aTest Data”. This should simplify the process of doing most field edits. It also opens up the opportunity to do recursive group matching and replacement.
When harvesting data, subjects oftentimes come concatenated with a delimiter; for example, a single 650 field may represent multiple subjects separated by a semicolon. The new function will allow users to capture data and recursively create new fields from the groups. So, for example, if I had the following data:
=650 \7$adata – data; data – data; data – data;
And I wanted the output to look like:
=650 \7$adata – data
=650 \7$adata – data
=650 \7$adata – data
I could now use this function to achieve that result. Using a simple regular expression, I can create a recursively matching group, and then generate new fields using the “/r” parameter. So, to do this, I would use the following arguments:
Find: (data – data[; ]?)+
Check Use Regular Expressions.
The important part of the above expression is the replacement syntax: the mnemonic /r at the end of the replacement string tells MarcEdit that each recursive match should result in a new line.
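The split-and-regenerate idea behind this recursive replacement can be approximated in plain Python. This is a sketch of the same transformation, not MarcEdit’s internal implementation:

```python
import re

# The concatenated 650 field from the example above.
field = "=650 \\7$adata – data; data – data; data – data;"

# Separate the tag/indicators from the subfield $a contents, split the
# contents on the semicolon delimiter, and emit one 650 field per subject
# -- the result MarcEdit's /r mnemonic produces in a single regex pass.
prefix, data = field.split("$a", 1)
subjects = [s.strip() for s in re.split(r";\s*", data) if s.strip()]
for s in subjects:
    print(f"{prefix}$a{s}")
```

Run against the example field, this prints three separate `=650 \7$adata – data` lines, one per delimited subject.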
This new function will be available for use as of 4/7/2014.