The International Dewey Users Meeting will be held in conjunction with the IFLA World Library and Information Congress in Cape Town, South Africa, on Tuesday 18 August, 8:00-9:30 am, at the OCLC Hospitality Suite in the Conference Center, Room 1.41/1.42. Learn what’s new with Dewey! Hear from Elise Conradi, Dewey Project Manager at the National Library of Norway, who will discuss the latest developments in Terminology Mapping, and from Peter Werling, CEO of Pansoft, who will provide an update on Dewey software. Also share ideas and notes with translation partners.
Register for this and other OCLC IFLA Events.
Interoperability in heterogeneous library data landscapes
Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have previously characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity vary with the number of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.
In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks they manage a lot of data. Data can be regarded as the signals between collections and services.
These collections and services are administered using dedicated systems, each with its own datastore. The data formats in these datastores are tailored to the specific services the systems were designed to perform. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual or automated. These transformation procedures function as translators of the signals in the form of data.
Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies, using a number of different data formats, some of which talk to each other in some fashion. This is confusing not only for end users but also for library systems staff. End users are unsure which user interface to use, and miss relevant results from other sources as well as possible related information. Libraries need licenses and expertise for the ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments made elsewhere.
To take the linguistic analogy further, systems make use of a specific language (a data format) in which to encode their signals. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC or EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variation in formulations and formats.
The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases two separate translators are even needed, if the source and target systems do not speak each other’s language or dialect: the source signals are translated into some common language, which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract, Transform, Load). Moreover, most translators only know a subset of the source and target languages, depending on the data signals needed by the services provided. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals; it is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.
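To make the translator analogy concrete, here is a minimal sketch of such a transformation utility in Python. The record structure, field tags and mapping rules are invented for illustration; a real MARC-to-Dublin-Core crosswalk involves far more selection and transformation logic than this:

# A minimal ETL sketch: extract records from a source datastore,
# transform a MARC-like structure into a Dublin Core-like one,
# and load the results into a target datastore.

def extract(source_store):
    """Extract: read the raw records from the source datastore."""
    return list(source_store)

def transform(record):
    """Transform: translate the source 'language' into the target one,
    selecting only the signals the target service needs."""
    return {
        "title": record.get("245", {}).get("a", "").rstrip(" /:"),
        "creator": record.get("100", {}).get("a", ""),
        "date": record.get("260", {}).get("c", ""),
    }

def load(records, target_store):
    """Load: write the translated records into the target datastore."""
    target_store.extend(records)

source = [{"245": {"a": "A data maze /"}, "100": {"a": "Example, Author"}}]
target = []
load([transform(r) for r in extract(source)], target)
print(target)  # [{'title': 'A data maze', 'creator': 'Example, Author', 'date': ''}]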
To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, when systems’ borders are completely closed and no access whatsoever is possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.
Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish the maintenance, costs and vulnerability? Yes, there are.
First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.
Secondly, it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.
And finally, it is essential to know which transformations take place en route. A guidebook should be incorporated in the repository, describing the selections and transformations for every dataflow, as sketched below.
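As an illustration of how the map, dictionary and guidebook might come together, here is a sketch of a single dataflow repository entry. The structure and all field names are my own invention for the sake of the example, not an existing tool:

# An illustrative dataflow repository entry combining the "map"
# (source, target, direction), the "dictionary" (systems, datastores,
# formats) and the "guidebook" (selections and transformations).
dataflow_entry = {
    "id": "ils-to-discovery-01",
    "source": {"system": "ILS", "datastore": "bib_db", "format": "MARC21"},
    "target": {"system": "Discovery", "datastore": "index", "format": "PNX"},
    "schedule": "nightly",  # when the translation runs
    "selection": "bibliographic records, excluding suppressed items",
    "transformations": [
        "map 245 $a/$b to the display title",
        "normalize dates in 260 $c to ISO 8601",
    ],
}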
You could leave it there and be satisfied with these guiding tools that help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology- and vendor-agnostic way using interoperable building blocks (“services”). In this definition “services” refers to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.
Broadly speaking, there are four main routes to establishing a SOA-like condition, all of which can in theory be implemented on a global, intermediate or local level.
Surveying the possible routes out of the data maze, it seems that the first step should be to employ the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that, the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.
MarcEdit Mac users, a new preview update has been made available. This is getting pretty close to the first “official” version of the Mac version. And for those that may have forgotten, the preview designation will be removed on Sept. 1, 2015.
So what’s been done since the last update? Well, I’ve pretty much completed the last of the work that was scheduled for the first official release. At this point, I’ve completed all the planned work on the MARC Tools and the MarcEditor functions. For this release, I’ve completed the following:
** 1.0.9 ChangeLog
Over the next month, I’ll be working on trying to complete four other components prior to the first “official” release on Sept. 1. This means that I’m anticipating at least 1, maybe 2 more large preview releases before Sept. 1, 2015. The four items I’ll be targeting for completion are:
How do you get the preview? If you have the current preview installed, just open the program and, as long as you have notifications turned on, the program will notify you that an update is available. Download the update, and install the new version. If you don’t have the preview installed, just go to: http://marcedit.reeset.net/downloads and select the Mac app download.
If you have any questions, let me know.
On 7/1/2015 12:43 AM, J. McRee Elrod wrote:
> On either Autocat or RDA-L someone asked about repeating 520s in order to have paragraphs. I answered that SLC only uses repeating 520s for multiple descriptions, e.g., a set or kit.
> I suggested hyphens as in 505. In an offlist message John Marr suggests using 520$b to create a break; $b can only be used once per 520, and is for a fuller description than in $a.
For those who have access to the HTML/CSS coding of their catalogs, there is a CSS value “white-space: pre-wrap” that preserves the line breaks. With “white-space: pre-wrap”, the text wraps both on line breaks and when the browser window requires it.
What this means is, you could just put in line breaks where you want them in the 520, e.g. Line 1.[return]Line 2.[return]Line 3.
and with the correct style, it can display as:

Line 1.
Line 2.
Line 3.
I made a “fiddle” at https://jsfiddle.net/ouv4h56z/1/ where you can see it in action and even work with it yourselves. The top-left division (labelled HTML) has the code, and you can make it bigger to see everything. You can put in any text you want to see how it works. In fact, all the divisions can be resized, and it’s kind of fun to see how it works when you make the right-hand divisions narrower.
Here are the other values for “white-space” http://www.w3schools.com/cssref/pr_text_white-space.asp
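For those who just want the gist without opening the fiddle, here is a minimal sketch of the idea. The class name and surrounding markup are invented for illustration; your catalog’s actual HTML will differ:

<!-- A 520 summary with embedded line breaks, displayed with pre-wrap.
     The "summary" class is hypothetical; hook the rule onto whatever
     element your catalog wraps the 520 text in. -->
<style>
  .summary { white-space: pre-wrap; }
</style>
<div class="summary">Line 1.
Line 2.
Line 3.</div>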
This is yet another reason why open source is so important for libraries today! A simple 10-second fix to a problem.
James Weinheimer email@example.com
First Thus http://blog.jweinheimer.net
First Thus Facebook Page https://www.facebook.com/FirstThus
Personal Facebook Page https://www.facebook.com/james.weinheimer.35
Google+ https://plus.google.com/u/0/+JamesWeinheimer
Cooperative Cataloging Rules http://sites.google.com/site/opencatalogingrules/
Cataloging Matters Podcasts http://blog.jweinheimer.net/cataloging-matters-podcasts
The Library Herald http://libnews.jweinheimer.net/
Here are five more things:
I hadn’t planned on putting together an update for the Windows version of MarcEdit this week, but I’ve been working with someone putting the Linked Data tools through their paces and came across instances where some of the linked data services were not sending back valid XML data, and I wasn’t validating it. So, I took some time and added some validation. However, because these users are processing over a million items through the linked data tool, I also wanted to provide a more user-friendly option that doesn’t require opening the MarcEditor, so I’ve added the linked data tools to the command line version of MarcEdit as well.
Linked Data Command Line Options:
The command line tool is probably one of the under-used and unknown parts of MarcEdit. The tool is a shim over the code libraries, exposing functionality from the command line and making it easy to integrate with scripts written for automation purposes. The tool has a wide range of options available, and users unfamiliar with the command line tool can get information about the functionality offered by querying help. If you use the command line tool, you’ll likely want to create an environment variable pointing to the MarcEdit application directory so that you can call the program without needing to navigate to that directory. For example, on my computer I have an environment variable called %MARCEDIT_PATH% which points to the MarcEdit app directory. This means that if I wanted to run the help for the MarcEdit Command Line tool, I’d run the following and get the following results:
C:\Users\reese.2179>%MARCEDIT_PATH%\cmarcedit -help
***************************************************************
* MarcEdit 6.1 Console Application
* By Terry Reese
* email: firstname.lastname@example.org
* Modified: 2015/7/29
***************************************************************
Arguments:
-s:         Path to file to be processed. If calling the join utility, source must be files delimited by the ";" character
-d:         Path to destination file. If calling the split utility, dest should specify a folder where split files will be saved. If this folder doesn't exist, one will be created.
-rules:     Rules file for the MARC Validator.
-mxslt:     Path to the MARCXML XSLT file.
-xslt:      Path to the XML XSLT file.
-batch:     Specifies Batch Processing Mode
-character: Specifies character conversion mode.
-break:     Specifies MarcBreaker algorithm
-make:      Specifies MarcMaker algorithm
-marcxml:   Specifies MARCXML algorithm
-xmlmarc:   Specifies the MARCXML to MARC algorithm
-marctoxml: Specifies MARC to XML algorithm
-xmltomarc: Specifies XML to MARC algorithm
-xml:       Specifies the XML to XML algorithm
-validate:  Specifies the MARCValidator algorithm
-join:      Specifies join MARC File algorithm
-split:     Specifies split MARC File algorithm
-records:   Specifies number of records per file [used with split command].
-raw:       [Optional] Turns off mnemonic processing (returns raw data)
-utf8:      [Optional] Turns on UTF-8 processing
-marc8:     [Optional] Turns on MARC-8 processing
-pd:        [Optional] When a malformed record is encountered, it will modify the process from a stop process to one where an error is simply noted and a stub note is added to the result file.
-buildlinks: Specifies the Semantic Linking algorithm. This function needs to be paired with the -options parameter.
-options:   Specifies linking options to use. Example: lcid,viaf:lc,oclcworkid,autodetect
            lcid: utilizes id.loc.gov to link 1xx/7xx data
            autodetect: autodetects subjects and links to known values
            oclcworkid: inserts link to oclc work id if present
            viaf: linking 1xx/7xx using viaf. Specify index after colon. If no index is provided, lc is assumed.
            VIAF Index Values:
              all -- all of viaf
              nla -- Australia's national index
              vlacc -- Belgium's Flemish file
              lac -- Canadian national file
              bnc -- Catalunya
              nsk -- Croatia
              nkc -- Czech Republic
              dbc -- Denmark (dbc)
              egaxa -- Egypt
              bnf -- France (BNF)
              sudoc -- France (SUDOC)
              dnb -- Germany
              jpg -- Getty (ULAN)
              bnc+bne -- Hispanica
              nszl -- Hungary
              isni -- ISNI
              ndl -- Japan (NDL)
              nli -- Israel
              iccu -- Italy
              LNB -- Latvia
              LNL -- Lebanon
              lc -- LC (NACO)
              nta -- Netherlands
              bibsys -- Norway
              perseus -- Perseus
              nlp -- Polish National Library
              nukat -- Poland (Nukat)
              ptbnp -- Portugal
              nlb -- Singapore
              bne -- Spain
              selibr -- Sweden
              swnl -- Swiss National Library
              srp -- Syriac
              rero -- Swiss RERO
              rsl -- Russian
              bav -- Vatican
              wkp -- Wikipedia
-help:      Returns usage information
The linked data option uses the following pattern: cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options [linkoptions]
As noted in the list above, -options is a comma-delimited list that includes the values the linking tool should query. For example, for a user looking to generate work ids and URIs on the 1xx and 7xx fields using id.loc.gov, the command would look like:
cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options oclcworkid,lcid
Users interested in building all available linkages (using viaf, autodetecting subjects, etc.) would use:
cmarcedit.exe -s [sourcefile] -d [destfile] -buildlinks -options oclcworkid,lcid,autodetect,viaf:lc
Notice the last option, viaf:lc. This tells the tool to utilize VIAF as a linking option in the 1xx and the 7xx; the data after the colon identifies the index to utilize when building links. The available indexes are listed in the help output (see above). A scripted example follows below.
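Since the command line tool is meant to be integrated with automation scripts, here is a minimal sketch of what a batch run might look like from Python. The folder name, output naming scheme and option string are assumptions for illustration; only the flags shown in the help above are used:

# A minimal automation sketch: run cmarcedit's -buildlinks over every
# .mrc file in a folder. Assumes the MARCEDIT_PATH environment variable
# points at the MarcEdit application directory, as described above.
import os
import subprocess
from pathlib import Path

cmarcedit = os.path.join(os.environ["MARCEDIT_PATH"], "cmarcedit.exe")

for source in Path("records").glob("*.mrc"):      # hypothetical input folder
    dest = source.with_suffix(".linked.mrc")       # hypothetical output name
    subprocess.run(
        [cmarcedit, "-s", str(source), "-d", str(dest),
         "-buildlinks", "-options", "oclcworkid,lcid"],
        check=True,  # stop the batch if a file fails to process
    )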
The update can be found on the downloads page: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit. Direct links:
Mac Port Update:
Part of the reason I hadn’t planned on doing a Windows update of MarcEdit this week is that I’ve been heads down making changes to the Mac Port. I’ve gotten good feedback from folks letting me know that so far, so good. Over the past few weeks, I’ve been integrating missing features from the MarcEditor into the Port, as well as working on the Delimited Text Translation. I’ll now have to go back and make a couple of changes to support some of the update work in the Linked Data tool – but I’m hoping that by Aug. 2nd, I’ll have a new Mac Port Preview that will be pretty close to completing (and expanding) the initial port sprint.
Questions, let me know.
Whew – it’s been a wonderfully exhausting past few days here in Columbus, OH, as the Libraries played host to Code4LibMW. This has been something that I’ve been looking forward to ever since making the move to The Ohio State University; the C4L community has always been one of my favorites, and while the annual conference continues to be one of the most important meetings on my calendar, it’s within these regional events that I’m always reminded why I enjoy being a part of this community.
I shared a story with the folks in Columbus this week. As one of the folks who attended the original C4L meeting in Corvallis back in 2006 (BTW, there were 3 other original attendees in Columbus this week), there are a lot of things that I remember about that event quite fondly. Pizza at American Dream, my first experience doing a lightning talk, the joy of a conference where people were writing code as they were standing on stage waiting their turn to present, Roy Tennant pulling up the IRC channel while he was on stage so he could keep an eye on what we were all saying about him. It was just a lot of fun, and part of what made it fun was that everyone got involved. During that first event, there were around 80 attendees, and nearly every person made it onto the stage to talk about something that they were doing, something that they were passionate about, or something that they had been inspired to build during the course of the week. You still get this at times at the annual conference, but with its sheer size and weight, it’s become much harder to give everyone that opportunity to share the things that interest them, or to easily connect with other people who might have those same interests. And I think that’s the purpose that these regional events can serve.
By and large, the C4L regional events feel much more like those early days of the C4L annual conference. They are small, usually free to attend, with a schedule that shifts and changes throughout the day. They are also the place where we come together, meet local colleagues and learn about all the fantastic work that is being done at institutions of all sizes and all types. And that’s what the C4LMW meeting was for me this year. As the host, I wanted to make sure that the event had enough structure to keep things moving, but had a place for everyone to participate. For me, that was going to be the measure of success: not just did we put on a good program, but did this event help to make connections within our local community? And I think that in this, the event was successful. I was doing a little bit of math, and over the course of the two days, I think we had a participation rate close to 90%, and an opportunity for everyone who wanted to get up and just talk about something they found interesting. And to be sure, there is a lot of great work being done out here by my Midwest colleagues (yes, even those up in Michigan).
Over the next few days, I’ll be collecting links and making the slides available via the C4LMW 2015 home page as well as wrapping up a few of the last responsibilities of hosting an event, but I wanted to take a moment and again thank everyone that attended. These types of events have never been driven by the presentations, the hosts, or the presenters – but have always been about the people that attend and the connections that we make with the people in the room. And it was a privilege this year to have the opportunity to host you all here in Columbus.
Here are five more things for you:
- “Workflow. It’s like a buzzword without any buzz.” – from Jim Kidwell’s article Digital Asset Management Workflow: The Unsung Hero of DAM.
- Check out this archive of posters documenting radical history from the University of Michigan Library.
- Another list of the world’s most beautiful libraries, and The Seattle Public Library’s Central Library is on it again.