releasing open data for illuminated manuscript collection records and research…
Opportunities and Possibilities
Knowledge Integration’s Collections Information Integration Middleware implements a couple of very useful ‘data models’. With refinement during Open Book, we have so far not found anything that, provided it is ‘object record’ centric, can’t be ‘put into’ one of these models.
This is because there are primarily three types of ‘data connections’ one might want to make in middleware:
Referring to the diagram above there are two models – Sets and Contexts. In the CIIM both of these are definable – in the diagram this is represented by ‘Definition’. A ‘Definition’ is a class (for the object based modellers amongst you) if you like – at its most basic it defines the field list for this type (say, of Set) and the type of ‘connections’ these (Sets’) records are allowed to make. Clearly you can create as many definitions as you like – for simplicity I am only representing a single definition of each type of model in the diagram.
With a Set you define its field list (schema) and then you can begin creating Set records of that ‘type’. Clearly you may have many Set records for any given Set definition. A Set’s principal characteristic is that any one Set record can be allowed to refer to as many object records as it likes. A simple practical example of this might be: the Set definition is for a Collection Level Description (RSLP schema) and then each Collection Level Description (a Set record) can refer to as many object records as it likes.
With a Context you define its field list (schema) and then you begin creating records of that ‘type’. Clearly you may have many Context records for any given Context definition. A Context’s principal characteristic is that its records may only have a one to one relationship with object records. This enforces the augmentation concept. A simple theoretical example is that you have a research project generating particular metadata per object – this is clearly an augmentation issue, as you have extra structured data per object (which we are assuming you don’t want to store in your collections management system). So your research metadata becomes a Context definition (defining its fields/schema) and each Context record will hold the research data for an individual object and have a one to one linkage to that object’s record.
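The difference between the two models can be sketched in a few lines of Python. This is only an illustration of the idea, not the CIIM's actual implementation: the class names, field names and object identifiers are all hypothetical, and "one to one" is interpreted here as "at most one Context record per object for a given definition", which is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    """A 'class' of Set or Context: a name plus the fields its records may hold."""
    name: str
    fields: tuple


@dataclass
class SetRecord:
    """One Set record; it may refer to any number of object records."""
    definition: Definition
    data: dict
    object_ids: set = field(default_factory=set)


class ContextStore:
    """Holds the Context records of one definition, at most one per object."""

    def __init__(self, definition):
        self.definition = definition
        self._by_object = {}

    def attach(self, object_id, data):
        # Enforce the one to one link between a Context record and an object.
        if object_id in self._by_object:
            raise ValueError(f"{object_id} already has a {self.definition.name} context")
        # Reject fields that the definition (schema) does not declare.
        unknown = set(data) - set(self.definition.fields)
        if unknown:
            raise ValueError(f"fields not in definition: {sorted(unknown)}")
        self._by_object[object_id] = data

    def for_object(self, object_id):
        return self._by_object.get(object_id)


# A Set: one collection level description pointing at many objects.
cld = Definition("collection_level_description", ("title", "description"))
manuscripts = SetRecord(cld, {"title": "Illuminated manuscripts"})
manuscripts.object_ids.update({"obj-1", "obj-2", "obj-3"})

# A Context: research data wrapped around exactly one object.
pigments = ContextStore(Definition("pigment_analysis", ("pigments", "method")))
pigments.attach("obj-1", {"pigments": ["azurite"], "method": "hypothetical"})
```

Attaching a second `pigment_analysis` record to `obj-1` raises an error, which is the "augmentation" constraint in miniature, while the Set record is free to keep collecting object references.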
The primary focus of Open Book has been to release open data about the collections in the Fitzwilliam Museum, as well as to explore how we go beyond item level data and link to related resource information and research activity data.
A fundamental building block of the project was to first establish an organisational position on open data – how we define it and what it means to us. We had to ensure that we had an understanding of open data and were prepared to release it. So, the issue of licensing reared its head. But before that we tried to formalise a description of the types of data that, as a museum, we deal with and, in particular, clarify what we meant by metadata. With that distinction in place, internal advocacy for open data was easier. In the end it has led to a layered approach, with a basic set of metadata (aimed specifically at resource discovery and aggregation) being dedicated to the public domain and more detailed data licensed as Creative Commons Attribution-Share Alike. The latter can still be defined as an open licence but recognises the curated nature of this data and the significance (in academic terms) of attribution of source.
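As a rough illustration of the layered approach, the sketch below splits one record into a basic discovery layer and a fuller layer, each carrying its own licence statement. The field names, the sample record and the choice of which fields count as "basic" are assumptions made for the example; the actual field list is not given here.

```python
# Hypothetical list of fields considered basic resource discovery metadata.
DISCOVERY_FIELDS = ("id", "title", "maker", "date")


def layer_record(record):
    """Return the two licensing layers for a single object record."""
    # Basic layer: only the discovery fields that are present in the record.
    basic = {key: record[key] for key in DISCOVERY_FIELDS if key in record}
    return {
        "basic": {"licence": "public domain", "data": basic},
        # Detailed layer: the full curated record, attribution required.
        "detailed": {"licence": "CC BY-SA", "data": dict(record)},
    }


# Made-up record, for illustration only.
layers = layer_record(
    {"id": "obj-1", "title": "Book of Hours", "description": "Longer curated text"}
)
```

Keeping the licence alongside the data in each layer means a harvester never has to guess which terms apply to what it has received.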
The next question was how we were going to release this data and I have to admit we have taken a scattergun approach. Metadata is made available via OAI-PMH for harvesting. Fuller data sets are offered via an API, returning JSON responses, and a SPARQL endpoint using RDF and mapping to the CIDOC-CRM. We have conceived this as a data service that complements our web presence and online catalogue. Achieving this has taken time and technical resources that would have been difficult to access without the Open Book project. The uptake and value of each of these can only be properly assessed over time. If we had to focus on a single aspect of this, perhaps the path with the potential for greatest returns would be the release of open metadata (whether through OAI or some other data feed) to the Culture Grid and subsequently to Europeana. The sector context of such aggregations is likely to offer users the most benefit, in the short term at least.
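For readers unfamiliar with OAI-PMH, harvesting boils down to plain HTTP requests with a `verb` parameter. The sketch below only builds a `ListRecords` request URL; the base URL is a placeholder rather than our real endpoint, and `oai_dc` is used because it is the one metadata format every OAI-PMH repository has to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    if from_date:
        params["from"] = from_date  # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, not a real service address.
url = list_records_url("https://example.org/oai", from_date="2012-01-01")
```

A harvester would fetch this URL, parse the XML response and follow any `resumptionToken` it contains to page through the remaining records.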
Development of use case scenarios and understanding of user requirements for the metadata released by Open Book has relied heavily on work carried out with the sister project Contextual Wrappers 2. In the focus group sessions for Contextual Wrappers 2 and Open Book, the concept of open data was well received by data managers and users alike, although it was recognised that there may still be management resistance within some organisations. The general approach, adopted by both Contextual Wrappers 2 and Open Book, of guided discovery (providing contextual information, including collection level descriptions, to supplement item level records) was seen as positive and is already pursued by some museums. The value in particular to users unfamiliar with the subject or collection was highlighted, as well as a role in creating course subject guides. However, the potential was seen also for academic research, creating links across multiple collections and using collection level descriptions as a way to draw together bibliographic information and research material across a group of objects.
In addition to releasing item level metadata, Open Book has explored the modelling of data related to the outputs from ongoing research projects (including pigment analysis of illuminated manuscripts) at the Fitzwilliam Museum. An initial proposition was that the CIIM (collections information integration module), which was being deployed during the project, could be used to capture, store and eventually publish the primary research data outputs. As such, one group of potential users would have been internal, employing the CIIM to manage the research data. However, it became clear at an early stage that this wasn’t going to be possible or desirable. The real strength of the CIIM is data augmentation and publishing; to use it as a database for research outputs and to integrate it within the research workflow (particularly given that the CIIM’s user interface is at an early phase of development) wasn’t a viable option. Although the notion of ‘contexts’ within the CIIM, which augment the item data extracted from the collections database, could be used to store item-specific research data, it was apparent that research projects don’t operate according to this kind of simple one-to-one relationship. Tools and applications with which the researcher is already familiar are a more efficient means of handling the research process and better able to capture the complexities of the data.
Ultimately what we have tried to do is encapsulate the research output in a structured form as metadata about the project. This kind of ‘abstract’ of the research sounds (and is) quite straightforward but it isn’t something that we have done before or have had the capacity to store and publish in a sensible way. We will continue to explore how we might vary the granularity of these abstracts, from global project metadata down to individual strands of a research project, and how they can be linked to specific context information augmenting the basic object data. We want to both publish the fact that research (which may as yet not have any associated academic publication) is ongoing and associate the relevant objects with that research, creating links between item and research metadata in the same way as there are links between items and collections. We have more work to do on assessing the potential use of this metadata but the emphasis is on resource discovery, particularly within a cross-disciplinary environment where an activity such as pigment analysis would be of interest to researchers in disciplines outside the usual museum subject areas. This leads into what are, for us, new areas of aggregation of research activity data. We are familiar, in our object-centric view of the world, with object data aggregation but structuring research related data, and how this is/can be used in a wider academic context, is new territory for us.
In addition to any benefits to external users, there also may be internal gains. Although dealing only with a relatively small number of projects, one by-product of structured research activity data will be an enhanced strategic and administrative capacity to search and produce reports (e.g. research activity over last 10 years funded by x). The role of documenting research in this way was highlighted in the focus group sessions, where it was suggested that it could help raise the profile of research within museums, provide a metric in the Research Excellence Framework and be used for internal advocacy.
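Once research activity is held as structured records, a report of the kind mentioned above becomes a simple filter. The sketch below assumes each project record carries `funder`, `start_year` and `end_year` fields; those names and the sample data are invented for illustration.

```python
def research_report(projects, funder, since_year):
    """List projects for one funder that were still active in or after since_year."""
    matches = [
        project
        for project in projects
        if project["funder"] == funder and project["end_year"] >= since_year
    ]
    # Oldest project first, so the report reads chronologically.
    return sorted(matches, key=lambda project: project["start_year"])


# Invented sample data.
projects = [
    {"title": "Project B", "funder": "Funder X", "start_year": 2010, "end_year": 2012},
    {"title": "Project A", "funder": "Funder X", "start_year": 2004, "end_year": 2006},
    {"title": "Project C", "funder": "Funder Y", "start_year": 2011, "end_year": 2012},
    {"title": "Project D", "funder": "Funder X", "start_year": 1998, "end_year": 2000},
]
report = research_report(projects, "Funder X", since_year=2002)
```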
lessons learned about modelling metadata
At the risk of following a philosophical path which I’m not qualified to tread, there is a Neo-Platonic tenet that states something along the lines of “The soldier is more real than the army.” By the same token a museum object could be said to be more real than a collection. An object has a physical presence. A collection exists only because someone says it does.
The problem with defining what a collection means, from a museum point of view at least, is that there are many different definitions depending on who is doing the defining and the perceived role and value (to the potential user) of the collection. Beyond collection records that group together items impractical to catalogue individually, they don’t fit well within a collections information management system and are not usually part of the normal curatorial cataloguing workflow.
Museum object records exist because museums have created them as a natural part of collections management. However, there is not the same imperative to create records at a collection level. They are more fluid, artificial constructs and don’t necessarily have a clear role within the organisation. Although published collection catalogues have always been within academic and curatorial scope, short interoperable collection level descriptions, which complement and provide context for object records, have perhaps only begun to find a real role with the advent of digital publication.
The JISC Resource Discovery programme has prompted us (and as a museum we are probably much further behind in this than libraries or archives) to make a greater distinction between publishing resources and publishing information about those resources. We have tended to think of item records in the former category but collection descriptions fall more naturally into the second.
In looking at how we represent the metadata for collections during this project we began also to question whether there was value still in having a distinct entity called a ‘collection level description’. Would a generic resource description, which encompasses collection descriptions (in all their various flavours) as well as online exhibitions, digital resources, and any other published grouping of objects, be a more useful concept? And who should create these resource descriptions – curators, education staff, documentation staff?
The notion of ‘sets’, implemented in the middleware CIIM (collections information integration module) for “Open Book”, has provided a way for us to model these entities. It also addresses the modelling and integration of metadata related to specialist research information generated by the museum (which has been another part of this project). As well as facilitating the publishing of open metadata derived from item records on our collections information management system (Adlib), the CIIM acts as a primary store for metadata describing collections, research outputs, electronic resources or anything else which we wish to publish as an aid to resource discovery. As a concept, the ‘set’ is generic but we are able to assign specific schema to each type of set, according to the metadata content and the purpose of the set. These sets can then be associated also with other sets as well as item level data and with ‘contexts’, additional data beyond that extracted from the collections information database, which are wrapped around the object records. This has provided a very adaptable framework, helping us move beyond publishing only object records. Collections might not exist but we can now at least capture, store, connect and publish the metadata for them.
As I alluded to in my previous post we feel like we have only just scratched the surface with what is possible on the foundations which ‘Open Book’ has laid. This is hardly surprising as this project was pretty ‘rapid fire’, achieving much proof of concept work over a very short project life of 7 months. We believe it’s been a valuable demonstrator of short, sufficiently funded development bringing together a small group of organisations each bringing their specialist experience into the mix.
One thing which is not immediately apparent is that to deploy the type of technical architecture we wished for Open Book (diagram below) you need plenty of flexibility in your infrastructure. The Fitzwilliam Museum’s investment in a virtualised server infrastructure in 2011 has been an important enabler to ‘Open Book’ – deploying the new servers required was trivial for us today, compared to our old infrastructure where server = physical box. The project required a mix of technologies and being flexible enough to just deploy whatever was needed, relatively quickly, meant we didn’t get bogged down in infrastructure matters.
Another less than obvious thought is that when seeking to create secondary stores and build services over them one needs to make very important decisions about just how structured or unstructured the data needs to be for that service.
Maybe only a very small subset of your data is required, which can be mapped into further consolidated fields (e.g. any number of fields from the source(s) may end up all being concatenated into a ‘global keyword field’ in the secondary data store) – this can create incredibly small and fast indexes which are quite suitable for the purpose at hand. On the other hand, where you want very fine granularity in your output services you will have to apply the more traditional highly designed schemas, structures and cross-linkages – this is very resource intensive but it will be reflected in the quality of the service you provide (e.g. a high value triple-store will have had an equally high level of effort expended on structuring & creating it).
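The ‘global keyword field’ end of that spectrum is easy to picture in code. The following is a minimal sketch, assuming source records are plain dictionaries; the field names and the sample record are hypothetical and not taken from our actual schema.

```python
def to_search_doc(record, keyword_sources=("title", "maker", "description")):
    """Collapse several source fields into one keyword field for a small index."""
    parts = []
    for name in keyword_sources:
        value = record.get(name)
        if value is None:
            continue  # the source record simply lacks this field
        if isinstance(value, (list, tuple)):
            parts.extend(str(item) for item in value)  # repeated field
        else:
            parts.append(str(value))
    return {"id": record["id"], "keywords": " ".join(parts).lower()}


# Hypothetical source record; "date" is deliberately left out of the keywords.
doc = to_search_doc(
    {"id": "obj-1", "title": "Psalter", "maker": ["Anon", "Workshop"], "date": "c. 1300"}
)
```

The structure of the original record is thrown away on purpose: the secondary store only needs enough to find the object, and the identifier leads back to the fuller data.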
Our collaboration during Open Book often came down to ‘modelling’ – not schemas or field definitions, but rather looking at the data we were trying to represent/use and seeing whether, rather than solving just a single specific instance problem, we could ‘generalise’ the problem up a level (create a model). Knowledge Integration’s CIIM concepts of ‘sets’ and ‘contexts’ are very good examples of this and we enjoyed participating in further refinement of those concepts within the CIIM.
Designing URIs for open data services is not trivial – but if you hunt about there’s a bit of prior knowledge and activity going on in this area. One very useful resource is the government’s open data URI guidelines and a useful forum for these matters in the Museum context is the Museums and the Machine Processable Web site.
There is bound to be much more that could be said – but instead I’ll invite commenters to ask specific questions if they wish, and we’ll try to answer them in the context of our project.
Part 3 – Putting it all together
Let’s start with a picture:
This is a block diagram of what the ‘Open Book’ project has deployed. Some brief explanations:
For those who are interested, below is the same diagram with some details about technicalities added.
So in summary the project is deploying the ‘middleware’ (which was the focus of the problem scenario and solutions/benefits outlined in earlier posts). The middleware’s primary purpose is to provide services.
I went through all the theoretical benefits of this approach in the previous post.
I don’t think it’s too much of a stretch to say this project (in conjunction with JISC-Contextual Wrappers, JISC-Contextual Wrappers#2 and our own re-development of our online catalogue – ‘Collections Explorer’) has brought us to a new level of understanding around how to build far more sophisticated, agile and adaptable information management systems.
Sophistication in that ‘best of breed’ software systems are brought to bear on previously intractable problems (when we were limited by existing technologies). Sophistication is also present as an opportunity – we have only just scratched the surface of what this architecture makes possible.
Creating ‘secondary stores’ whose design and purpose is to serve ‘end user’ or ‘end point machine’ services means, at its simplest, more rapid development of those services is possible. A secondary store is simply a re-arranged version of one or more primary data stores – in the diagram these secondary stores exist through the middle layer (CIIM, Collections Explorer and Triplestore).
The CIIM is the odd one out here, as it is able to act also as a ‘primary store’ – it can provide data & functionality which cannot be easily provided in other primary stores – namely the creation of ‘sets’ and ‘contexts’ either in relation to the secondary data it holds or brand new data. It’s a topic on its own which I’ll cover in my next blog. Suffice it to say that these features of the CIIM enable the storage of new data sets (e.g. for the project to implement and store, say, Collection Level Descriptions) and the augmentation of existing records (e.g. for the project to relate, say, research data to existing object records).
Agile and adaptable are not buzz words in this project – it is through development experience on the project that we know the new architecture provides these characteristics.
We are now finalising deployment of the services which the project has developed.
In trying to reach a position on the rights and licensing issues related to the range of stuff that a museum deals with, one approach we have taken is to tighten up our definition of metadata. It might seem an obvious thing to do but metadata means different things to different people and organisations. The definition of “Data about data” offers plenty of scope. It is common, for example, to think of a museum object record as metadata. CHIN’s guide to museum standards observes that the most obvious example of metadata in a museum context is “…the museum catalogue record (structured data about an object in the museum’s collection).”
The key distinguishing feature about metadata for us, working within the context of the Resource Discovery strand of this JISC programme, is precisely that it is about resource discovery. It is just a tool, a means to an end, in a way in which an object record, or even a collection level description, is not. It serves a different purpose from a catalogue record. It is not intended for collections management or interpretation but as a signpost which will be of value when aggregated with signposts to other things. As such it could hold information that we wouldn’t put in a collections record. Curators might be reluctant to record the term “Impressionist” in the record of a painting by Renoir, particularly if the painting is on the margin of what may be considered his Impressionist period, but the metadata might usefully contain the term. The museum is released from committing to a definitive interpretation (a rare thing in art history) but at the same time the user is given a pointer to something that could be of relevance to them. Storing these additional data, orientated to resource discovery, within a collections information management system is not ideal – hence the “middleware” approach that we are taking in this project.
By extension, we are looking beyond object records to collection records (something that we haven’t sought to maintain before, other than the initial foray in Cornucopia) and pointers to the other resources that we have begun to create, such as online exhibitions and educational resources (again something that we haven’t maintained since the initial batch of MICHAEL records).
The metadata record could be a generic resource description, typed according to whether it points to object data, collection level data, online exhibitions or other resources. The key factor evidently is how well these things aggregate – how they can be identified so that they can be targeted at different user groups and how the potential links between them are expressed. We don’t need to make object records from different sources/museums interoperable but we do need to give the metadata enough common ground to be effective as a means of resource discovery.
Initial draft model of the different classes of digital “stuff” that we generate and the different levels of licensing that we propose.
Part 2 – What’s “Middleware”
I like the beginning of the Wikipedia article for Middleware (I’m dropping the Middleware quotes now in the name of efficiency, too many keystrokes…). It spends most of the first paragraph saying what Middleware is not. But I’m not here to bore you with definitions, let’s keep it simple…
Basically, Middleware does something, to something, to produce something. My other jocular way to describe it is ‘Slurp, Stir and Serve’…
Figure 1 is awfully familiar to anyone with exposure to traditional ‘Computing 101’ – computers take input, process it, and create output. At its simplest level so does Middleware.
But let’s quickly get more concrete. Figure 2 is our ‘problem scenario’ from part 1 – using a Collections Management System (CollMS) and its data to feed directly to your web users.
Next, Figure 3 puts Middleware in the picture. Immediately the opportunity exists to change what you present to your web audience – because you’re ‘driving’ your web presence from a ‘different system and data-set’ – more formally you have de-coupled your web functionality from your Collections Management System (CollMS). Most importantly your Middleware will (be chosen to and) allow you to design for your web users (and services) rather than be constrained by the functionality of your CollMS.
Now I’ve lulled you into a false sense of simplicity, let me expand…
In Figure 4 we see a little more detail of how the Middleware part of this looks.
At a conceptual level it’s worth noting that the Middleware ‘box’ was itself described as an input-process-output ‘box’ at its highest level. And inside the Middleware box, to accomplish that, I’ve shown that we’ll need at least two more input-process-output ‘chains’ to achieve our purpose.
This kind of ‘model’ is valuable in its simplicity – it shows the potential benefit of breaking everything down into input-process-output ‘tasks’. And, wherever you see a ‘process’ box you can be pretty sure it has its own input-process-output ‘tasks’ inside it. So as the infamous saying goes, in our ‘model’, it really is just ‘turtles all the way down’…
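For the programmers amongst you, the ‘boxes within boxes’ idea can be sketched as ordinary function composition. This is a toy illustration only: the stage names borrow the ‘Slurp, Stir and Serve’ labels, while the record fields and what each stage does are invented for the example.

```python
def slurp(source):
    """Input: copy records out of the source system (here just a list)."""
    return [dict(row) for row in source]


def stir(records):
    """Process: reshape the records for the web audience."""
    return [{"id": r["id"], "label": r["title"].strip().title()} for r in records]


def serve(records):
    """Output: index the records by id so a service can look them up."""
    return {r["id"]: r for r in records}


def chain(*stages):
    """Compose stages into one 'box' with a single input and a single output."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run


# The Middleware box is itself a chain of input-process-output tasks...
middleware = chain(slurp, stir, serve)

# ...and, being a box, it can be one stage inside a bigger chain.
list_ids = chain(middleware, sorted)

index = middleware([{"id": "a", "title": "  book of hours "}])
```

Because `chain` returns something that looks exactly like a single stage, nesting can continue to any depth, which is the ‘turtles’ point made above.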
So far, so obvious…
More important is that this approach (Figure 4) has some appealing characteristics from a functional and developmental point of view:
A couple of notes before I finish this week’s instalment.
Presentation. I have completely ignored, for simplicity, that generally there is a whole separate layer, or module, between the Middleware and the web user called the presentation layer. It’s crucial, you should be aware it exists, but it would have messed up my simple pictures.
Integration. Another crucial characteristic of Middleware which I’ve chosen to not detail this week. But it’s not a great stretch to see that if you can ‘slurp’ from one data source, you can do it from many. This is a key in our “Open Book” project – to show how, and document issues, in bringing together (integrating) multiple data sources.
One final point to ponder – when you take the concepts and characteristics outlined here and multiply them many fold, you have a simple conceptual picture of what the web, and in particular the web of data, looks like. A loosely coupled, distributed system of modules, which are all able to ‘slurp, stir and serve’. You can plug these modules together, integrate different data sets, create new modules… you can create new modules which just serve a processing purpose and let others use them… and so on and so on…
In part 3 we’ll look at what ‘Open Book’ will ‘build’ as we move ahead in constructing The Fitzwilliam Museum’s ‘explore service(s)’…
P.S. When I decided I wanted to illustrate this blog, my heart sank – I have never really ‘got on’ with the likes of Visio and MS Office Draw… anyway I found a freeware drawing package, yEd, that is great and made ‘knocking out’ the figures for this post pretty straightforward.
Part 1 – The Problem
When developing online resources for an audience using a Museum Collections Management System (CMS) a couple of things become quickly apparent:
It’s obvious isn’t it?
It is hard because neither the data, nor the functions of the application (i.e. search), were probably ever designed to serve external audiences.
Despite this we have all been doing exactly this – building web services, from OAI-PMH data feeds to full online public access catalogues (OPACs), directly over collections management systems. Many of these CMSs have had a ‘web module’ tacked onto them at some stage in their development history and this is what has been used to build these web services.
Experience has shown this has its limitations. We have explored these issues in previous JISC projects as far back as 2002 (i.e. issues documents published during the ‘Harvesting the Fitzwilliam’ project). We have been handling the practical problems of this approach ever since. Despite this, we have built OAI-PMH services, have taken our OPAC through two development incarnations, and built varyingly successful ‘dynamic’ web resources based on the ‘underlying OPAC’ (e.g. this is a resource which combines static and OPAC derived data).
The problem of ‘re-purposing data’ is well rehearsed. The problem of ‘re-purposing’ an application brings yet another set of issues. Briefly looking at the main iterations The Fitzwilliam Museum has gone through with its OPAC is not a bad way to draw some of these issues out.
Phase 1 OPAC (circa 2001/2) was built entirely on the vendor’s ‘cmsopac’ module and all customisation was carried out in its own proprietary scripting language.
Phase 2 OPAC (circa 2005/6) coincided with an evolution of the vendor’s ‘cmsopac’ module which now had XML output. This provided the opportunity to ‘wrapper’ the ‘cmsopac’ module with in-house developed functionality (using PHP and XSLT technologies). Now, in-house development is not something one embarks on lightly. In-house development was considered necessary though to be able to provide the web experience we aspired to. Primarily what we achieved was:
Put another way we had built a ‘layer’ which partly de-coupled the web functionality, both on the input and output side, from the underlying ‘cmsopac’ application. This approach served us well for a time. Obviously, however, any limitations the ‘cmsopac’ application has are always present because it is still at the core of the system. In time, the limitations which really began to ‘hurt’ us were: its search functionality, search performance, and our desire to do data integration from multiple sources.
If you have read this far I’m guessing you may be reciting my title by now – “it’s obvious” – and in its simplest version it is – “middleware”. “Middleware” is one solution which could complete the job – meaning it would completely de-couple our CMS’s data and functionality from our OPAC. This is already being done in the museum sector – the most common problem being tackled by the “middleware” approach is integration of the CMS with a Digital Asset Management (DAM) system. An example of this which comes from the same JISC programme as our project is the Bodleian’s iNQUIRE system. Knowledge Integration, a partner in Open Book, has also done work in this direction – bringing CMS and DAM data together in their CIIM system to drive the Imperial War Museum’s Collection Search.
This should come as no surprise really – “middleware”, or “fusion service” as it is called in the JISC Information Environment Architecture, has been conceptually desirable for a very long time. Obviously the Collections Trust’s Culture Grid, as an aggregator, is by definition a sophisticated “middleware” system. Today, in medium to large museums, many components of the JISC IE Architecture are migrating ‘inside’ – a small private version of that architecture inside an organisation if you like.
“Middleware” has probably ‘come of age’ for this smaller, internal, deployment and development for a number of reasons. Firstly the entry barrier has been lowered – sophisticated purpose built open source components requiring less development effort have matured over the past years. This makes it possible for smaller organisations to consider “middleware” solutions from the perspective of the required resources to actually deploy such a system. Previously it was too ‘complicated’. More importantly the need, and aspirations, to provide better user experiences and web services simply require more flexible ‘systems’ to be achievable. “Middleware” becomes an obvious component choice in this new ‘system’.
The specific problems which have brought The Fitzwilliam Museum to the need for “middleware”, and what ‘Open Book’ will begin to address are:
In part 2 we’ll explore how that “middleware” ‘fits into the picture’…