Planet Access

In orbit: Outgoing - Thom Hickey | Quædam cuiusdam - Peter Binkley | Science Library Pad - Richard Akerman | Library Cog - Art Rhyno | Panlibus - Paul Miller | Shushers - U of W | wplinfostuff - Christine | FRBR - William Denton | Inspyration - Gabriel Farrell | Digital Librarian | Technosophia | eby - Michael and Chris | One Big Library - Dan Chudnov | Solvitur ambulando - Bess | Loomware - Mark Leggott | blogdriverswaltz | The Night Librarian - Stan | Coffee|Code - Dan Scott | Library Web Chic - Karen A. Coombs | Photo Media - Tomasz Neugebauer | Dilettante's Ball - Ross Singer | Zzzoot - Glen Newton

Back-ups will remain for archival purposes.

I assure you that Access 2006 was longer than one day, I really do. My (!*#@$ laptop had the nerve to go and die on me the second night of the conference, after I’d resolved to take copious notes during each talk, and had mostly kept my promise. I’ll be posting some more notes when I get the machine fixed up, assuming my memory’s still relatively fresh (and my damn hard drive lives through the surgery).

Normally I might just give up and say “why bother?”, since the conference will likely be old news by then. Then again, this is the Access conference we’re talking about; it’s not likely to be a passé topic anytime soon. For a different sort of account of how Access stays with a person, read Dan Chudnov’s post.

In the meantime, there is photographic evidence that I was at the conference (or at least out drinking in Ottawa around the same time).

More to come…

Although Art clearly took a bullet for me (page forward to the "Taking The Bullet" award), Access definitely took its toll. I've not had much stamina - particularly for the combination of travel and public speaking and coding and socializing and PENNANT-WINNING unique to Access conferences - since my illness, which some of you know well. (Ross: point of information, it's "delicate bloom".)

Art's heroic illness-absconding notwithstanding, this has been about the least productive work week I've had all year, and I blame it on Access Exhaustion. I've been down for the count with only feeble progress on anything, paid or not. To anyone this might frustrate: I'm sorry, and I expect to be back at full speed next week.

This is a message I just sent to the martini-devel mailing list. I’m posting it here as well because it’s useful to me as a blueprint, and because I often get useful feedback on things I post here.


I’m in the process of re-writing the indexing module for Martini. Before I decide how to do it I would appreciate some feedback from community members about what the most useful thing for you will be.

Currently, you would have your XML files, probably Olive XML files, in a directory. You would then take an ant build file, configure it to tell it where the files you want to index are, run it, and some time later you have an index built. The current ant build file looks like this:

<!-- Index newspaper Olive files -->
<target name="newspapers" depends="buildjar">
        <arg value="-indexdir ${indexlocation}" />
        <arg value="-datadir /www/data/digitalobjects/newspapers/LSV" />

        [... snip for brevity ...]

        <arg value="-indexfield doctype%keyword%literal%newspapers" />
        <arg value="-indexfield title%unstored%true%/XMD-entity/Meta/@PUBLICATION" />
        <arg value="-indexfield displayTitle%keyword%true%/XMD-entity/Meta/@PUBLICATION" />
        <!-- language gets its own special processing instructions, so we can
             standardize how language fields are being indexed -->
        <arg value="-indexfield language%keyword%language%//XMD-entity/@LANGUAGE" />
        <arg value="-indexfield body%unstored%body%/XMD-entity/*" />
        <!-- date gets its own special processing instructions -->
        <arg value="-indexfield date%keyword%date%//Meta/@ISSUE_DATE" />

        [... snip for brevity ...]


So (for those who haven’t used Martini before) the above code creates an index at ${indexlocation}, and indexes the files in ${datadir}, creating an index with fields called doctype, title, displayTitle, language, body, and date, each of which has indexing instructions particular to the type of data it holds. You run “ant newspapers” on the command line, and it goes away and does its thing.

While this has worked pretty well in the past (and Peter, I know you’ve made changes that aren’t reflected here. Sorry.), I think it is unnecessarily complicated. If we’re using solr anyway, I would rather have the user configure a solr schema file. Solr schema files are easier to read, more configurable, and there’s a larger community writing documentation for them. I think that decision is a no-brainer. Here’s an example of a solr schema file:

<field name="doctype" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="language" type="string" indexed="true" stored="true" multiValued="false"/>

All the XPath determination of what part of the document belongs with which index field happens when the solr document is prepared. Solr won’t run against your XML files natively; you have to interpret them into a form solr can understand. So what does this mean for the indexing workflow? Assuming standard Olive XML files, we can include an XSL file to transform each article into a solr file. Then, we can POST each solr file to the solr servlet to add it to the index.
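
For what it’s worth, here is a minimal sketch of what that XSL file might look like, reusing the field names and Olive XPaths from the ant configuration above. Treat it as purely illustrative: the real Olive document structure may differ, and the body handling in particular is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Transform one Olive article into a Solr add/doc update message -->
  <xsl:template match="/XMD-entity">
    <add>
      <doc>
        <field name="doctype">newspapers</field>
        <field name="title"><xsl:value-of select="Meta/@PUBLICATION"/></field>
        <field name="displayTitle"><xsl:value-of select="Meta/@PUBLICATION"/></field>
        <field name="language"><xsl:value-of select="@LANGUAGE"/></field>
        <field name="date"><xsl:value-of select="Meta/@ISSUE_DATE"/></field>
        <!-- placeholder: concatenate all text content as the body field -->
        <field name="body"><xsl:value-of select="."/></field>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
```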

So here’s what I’m planning:

  1. Have distributed ant tasks. The ant tasks should be able to run on any machine, not just the machine hosting your xml files. (Added later: I might also do this with a simple .jar file. I know ant has been a big part of this project, but for ease of use, isn’t it easier to copy a .jar file to multiple servers than to make people install and configure ant in multiple places?)
  2. Each ant task is given a list of URLs to index. These should be URLs that can fetch each article’s XML file over the network.
  3. Ant is configured with an XSL to turn the article XML into a solr document. It grabs the document via HTTP, transforms it, and writes the output to a tempfile on local disk. (Better yet, each ant task could grab the XSL from the main cocoon instance each time; that way you don’t have to worry about updating all your ant instances every time you make a change to the XSL.)
  4. The tempfile is POSTed to solr.
  5. The tempfile is deleted.
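
The steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the planned implementation (which would live in the ant task or .jar): the function names are hypothetical, the `transform` callable stands in for applying the XSL, and only the Solr XML update message format (`<add><doc><field name="…">`) is standard.

```python
import os
import tempfile
import urllib.request
import xml.etree.ElementTree as ET

def solr_add_doc(fields):
    """Build the Solr <add><doc> update message the transform must produce."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")

def index_article(article_url, solr_update_url, transform):
    """Steps 2-5: fetch one article's XML over the network, transform it
    into a solr document, POST it to the solr update handler, and clean
    up the tempfile afterwards."""
    article_xml = urllib.request.urlopen(article_url).read()   # step 2
    solr_doc = transform(article_xml)                          # step 3 (XSL stand-in)
    fd, path = tempfile.mkstemp(suffix=".xml")                 # local tempfile
    try:
        with os.fdopen(fd, "w") as out:
            out.write(solr_doc)
        with open(path, "rb") as tmp:                          # step 4: POST to solr
            req = urllib.request.Request(
                solr_update_url, data=tmp.read(),
                headers={"Content-Type": "text/xml"})
            urllib.request.urlopen(req)
    finally:
        os.unlink(path)                                        # step 5
```

In the ant version, the `transform` step would of course be a real XSLT processor rather than a Python callable.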

This gets around one of my big questions, which is “We don’t want to keep all those solr files sitting around on disk, do we?” It seems like they would take up too much space and get stale too quickly. Instead, you could have a report that generates a list of all the xml files that have changed, and pass that to the indexer so that it only reindexes where needed.

So, that’s what I’m thinking. I hope this hasn’t been too incoherent. Writing it has helped me clarify exactly what I’m trying to do, at least.

Comments, anyone? Is there a reason I’m not seeing why this is a bad idea? How could I improve this process? Peter and Tricia, how might this fit with the pre-processing step you’ve been doing?


David Bigwood was thinking out loud the other day in his Catalogablog posting, P2P OPACs:

Here's an idea, not even half-baked, how about peer-to-peer (P2P) networks of OPACs? Only available items would display. I'd get to pick the institutions I'd have display and whether to display non-circulating items. Something like Limewire.

Having struggled both with the effects of teenage family members installing Limewire and its predecessors on the home PC, and with scaling the traditional search of a single library's collection up to a reliable, performant query across overlapping ad hoc groups of library collections, I have also wondered whether the peer-to-peer (P2P) technologies underpinning the former could be helpful with the latter.

When you deconstruct it, David's thought - using P2P, with the music-sharing application Limewire as an example - is attempting to address a few well-known problems in the library domain.

  • Identifying and locating library collections - how the collection is described, physically located, and accessed electronically are all concerns in this area, which resource directories, many of which have come and gone, have attempted to address. In the music-sharing P2P world, the major concern is getting a copy of the file, with little regard for where it comes from.

    There are several current examples of these library directories around, often limited by project, type/size of library, geographic location, commercial constraints, etc. Then there is the Silkworm Directory in the Talis Platform: an open, wiki-like-in-philosophy directory in which anyone can enter any library collection and then use an open API to query that information.

  • The grouping together of an ad hoc set of library collections to search within. - These could be as organized as all the academic libraries within 50 miles of a city, or as random as a student's university library, the local library near her dorm, and the library in her home town - totally logical to the student, random to everyone else.

    A little-known aspect of the Silkworm Directory, which Paul Miller only mentioned in his Access 2006 presentation (pdf) last week, is its ability to create ad hoc groups and then query by the members of those groups.

  • The constant searching across many dissimilar collections. - Anyone who has used or tried to pull together a federated search across many library catalogs, traditionally using Z39.50, will have horror tales of the way locally implemented indexing rules can make a mockery of search and results ranking.

    Now if we could consistently index, search, and rank in a single store all the holdings of the collections we are interested in, as defined in a directory, then, providing it was scalable and performant, this problem would disappear. This is the approach successfully taken by the Googles of the world. It is also how the Bigfoot element of the Talis Platform operates. (See my recent posting for a description of how Bigfoot APIs are driving the recently announced Project Cenote interface.)

  • Filter the results of a search by the libraries in a group that have holdings. - P2P, in the same way that Z39.50 federated search does, could help in this area by querying individual library collections directly. But I suspect it would suffer the same problem as current federated search: the fastest overall response you can get is dictated by the speed of the slowest resource. P2P addresses this with caching and by downloading from several places simultaneously, neither of which really applies when you are trying to get information from one specific collection.

    The Talis Platform's holdings stores address these issues by storing holdings statements, aggregated across many collections and freely contributed by libraries, alongside bibliographic stores. This is done in such a way as to enable bibliographic results to be augmented with holdings information on the fly as results are returned from an API call.

  • Filter the results of a search by libraries that have items in stock. - This final step is probably the most difficult to solve in a live situation, as any store can become out of date the moment a book is borrowed from a particular collection. P2P may well have valuable application in this area, be it filtering a result set against known holdings, or keeping stores up to date on a minute-by-minute basis.

It remains to be seen how P2P could be used, but it should not be dismissed as only a technique for [often illegal] music downloading.

David says his thought might be 'half-baked', but there are some useful ingredients in his recipe. How well some of them would scale in the wider library environment I'm not so sure, but a hybrid of P2P with some of the high-volume, scalable, performant, open data, open API aspects of the Talis Platform - now that may well have legs.


Kent Fitch of the National Library of Australia dropped me some e-mail about a very interesting project he’s doing: Searching Bibliographic Records, a test of using Lucene, the free search engine. Some FRBRizing is done, so you’ll want to go have a look.

They say on the home page:

The current Libraries Australia database contains many “duplicates”: records not merged due to subtle differences in metadata which are often inconsequential or errors. Many people also think it would be a good idea to combine various editions of works in the search results interface, although how far this combining should go is debatable. Should it be the equivalent of an FRBR work, or of an FRBR expression? Should it include works across languages and material types?

… What we’re trying to achieve is a set of groupings most likely to be useful to a searcher wanting to find a resource. The searcher probably has very strong preferences for the form and language of the resource they’re seeking, which is why they’re our top two layers/groupings. After that, they may have a preference for a particular edition or, less likely but possibly, even a particular manifestation (publisher, publication year, place of publication).

Of course, they don’t actually care about the bibliographic record; they want to get their hands on the resource, so we have to think about how they can easily tell the system to:

  • Locate any edition I can get today for free
  • Locate any edition published after 1960 I can get today for free
  • Locate either of these two editions I can get cheapest and soonest
  • Locate any French edition available for electronic access…
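
The layered grouping Fitch describes - form and language as the top layers, then edition - could be sketched like this. The field names here are hypothetical illustrations, not taken from the Libraries Australia records:

```python
from collections import defaultdict

def frbr_group(records):
    """Group bibliographic records into nested layers: form first, then
    language, then edition (the searcher's strongest preferences outermost)."""
    groups = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        groups[rec["form"]][rec["language"]][rec["edition"]].append(rec)
    return groups

# Three records for the same work: two English editions and a French one.
records = [
    {"id": 1, "form": "book", "language": "eng", "edition": "1st ed."},
    {"id": 2, "form": "book", "language": "eng", "edition": "2nd ed."},
    {"id": 3, "form": "book", "language": "fre", "edition": "1re éd."},
]
grouped = frbr_group(records)
```

A searcher browsing this structure would see the English book group first if that matches their preferences, with the French translation held apart, which is roughly the layering described above.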

Whenever I think of Australian literature I think of Sean McMullen, whose great novel Souls in the Great Machine is set a thousand years in the future, in an Australia where electricity cannot be used and librarians settle fights with shotgun duels. That link will take you to the basic display of the book, but notice the “This title can be viewed as part of an experimental FRBR group” link. That takes you here: FRBRized view of Souls in the Great Machine.

A better example is the FRBRized view of Harry Potter and the Prisoner of Azkaban by J.K. Rowling, which has lots of translations and is much juicier FRBRarily but, certainly, less Australian.

Go have a look yourself: poke around, try a search, see what results they show. Kent Fitch is interested in hearing your comments.

In my continuing series of publishing my Access 2006 notes, Roy Tennant's keynote on finishing the task of connecting our users to the information they need is something to which every librarian should pay attention.

If you don't understand something I've written, there's always the podcast of Roy's talk. In fact, there are a ton of podcasts of individual Access 2006 talks available from the Access 2006 Speakers, Presentations and Podcasts page. It's the next best thing to actually being there...

As always, any errors in capturing Roy's thoughts are undoubtedly mine.

Continue reading "Getting the Goods: Libraries and the Last Mile"

Ian Strang, a librarian not far from the FRBR Blog central office, has an interesting post on his blog: FRBRising with the Folks. He was thinking about the failures of FRBRizing by algorithms and automated processes:

The problem that I just don’t see them getting around is that often the “work” is simply not represented in the traditional bibliographic record, not even as a combination of elements. If this is the case no amount of processing by computer or librarian will be able to accurately and consistently identify and group “works”. What the FRBRisation process needs is just a little added information about each record. This seems like a perfect task for a social bookmarking application.

I’ve been thinking along the same lines as he was: that Amazon’s Mechanical Turk would be a good way of doing this. Strang found something interesting, though:

Interestingly, Amazon developed the Mechanical Turk initially for internal use, to do much the same thing as I’m suggesting. Amazon had a problem with duplicate records. They realized that many products were virtually the same and could be sold/inventoried as a single product but were in their database as two items. It was far too large a problem to give to one person or even a group of people, so they created a task marketplace, which would evolve into the Mechanical Turk. A program would identify two similar records and then submit them to the marketplace as a task. All the Amazon employee had to do to earn a few extra bucks was glance at each record and answer yes or no to the program. If the answer was yes the records were merged; if no, the program moved on. All I’m suggesting is that something like the “work set” algorithm replace the Amazon program. Sure it would cost, but looking at how things are priced, not as much as one might think.

My continuing summaries from Access 2006. Thursday, October 12th was the first "normal" day of the conference featuring the following presentations:

Continue reading "Access 2006 notes: October 12"

The more I think about it, the more I think RSS feeds from the OPAC are a waste of time and energy.

What is the appeal of this? I mean, really?
