JISC – Open Book Project

releasing open data for illuminated manuscript collection records and research…

Monthly Archives: August 2012

Project final report

Outputs

  • Release of open data, including guidance for the reuse of this metadata, for more than 175,000 item records of the collection of the Fitzwilliam Museum. This is made available via an open data service at: http://data.fitzmuseum.cam.ac.uk/
    This will now form an ongoing service at the Fitzwilliam and will expand with the addition of new object records to the Museum’s collection information database in line with the Museum’s digitisation programme.
  • The data is made available in the following ways:
  1. Open data permanent URI identifiers for objects, e.g. http://data.fitzmuseum.cam.ac.uk/id/object/656
  2. SPARQL endpoint, using RDF and mapped to the CIDOC-CRM
  3. OAI-PMH interface, supplying data in PNDS Dublin Core profile
  4. REST based API, returning JSON responses

Lessons Learned

Opportunities and Possibilities

  • Long-term assessment of uptake and value of the open data service offered by the Fitzwilliam Museum. We still regard the service as being at beta stage. Our next step will be to publicise what is available and invite community feedback to assist in refining and developing the service. As yet there has been relatively little work in the museum and heritage sector on developing RDF triples and mapping to CIDOC-CRM. The ultimate value of this is still to be proven but the Fitzwilliam Museum is now in a good position to participate in this early phase assessment.
  • Further work on how research activity data might be aggregated with data from other Museums, as well as the HE sector.
  • Continue to explore how ‘sets’ and ‘contexts’ can be used in the middleware system as part of a revised approach to the storage and publication of collections information, adding further scope and depth to the open data released.

Modelling: ‘Sets’ and ‘Contexts’

Knowledge Integration’s Collections Information Integration Middleware (CIIM) implements a couple of very useful ‘data models’. With refinement during Open Book, we have so far not found anything that, provided it is ‘object record’ centric, can’t be ‘put into’ one of these models.

This is because there are primarily three types of ‘data connection’ one might want to make in middleware:

  1. Augment existing object records (with related but different information);
  2. Group existing object records (so that arbitrary groupings of records can be made); and
  3. Create new data sets possibly without any object record linkage.

Models: Sets and Contexts

Referring to the diagram above, there are two models – Sets and Contexts. In the CIIM both of these are definable – in the diagram this is represented by ‘Definition’. A ‘Definition’ is a class, if you like (for the object-based modellers amongst you) – at its most basic it defines the field list for this type (say, of Set) and the type of ‘connections’ these (Set) records are allowed to make. Clearly you can create as many definitions as you like – for simplicity I am only representing a single definition of each type of model in the diagram.

With a Set you define its field list (schema) and then you can begin creating Set records of that ‘type’. Clearly you may have many Set records for any given Set definition. A Set’s principal characteristic is that any one Set record can refer to as many object records as it likes. A simple practical example of this might be: the Set definition is for a Collection Level Description (RSLP schema) and then each Collection Level Description (a Set record) can refer to as many object records as it likes.

With a Context you define its field list (schema) and then you begin creating records of that ‘type’. Clearly you may have many Context records for any given Context definition. A Context’s principal characteristic is that each record may only have a one-to-one relationship with an object record. This enforces the augmentation concept. A simple theoretical example: you have a research project generating particular metadata per object – this is clearly an augmentation issue, as you have extra structured data per object (which we are assuming you don’t want to store in your collections management system). So your research metadata becomes a Context definition (defining its fields/schema) and each Context record will hold the research data for an individual object and have a one-to-one linkage to that object’s record.
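To make the distinction concrete, here is a minimal sketch of the two models in Python. It is not the CIIM’s actual implementation or API: every class name, the field names (title, description, pigment, method) and the object ids are hypothetical, chosen only for illustration. It also assumes one particular reading of ‘one to one’, namely that an object has at most one Context record per Context definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    """A Set or Context 'type': a name plus the field list (schema) for its records."""
    name: str
    fields: tuple[str, ...]

    def validate(self, data: dict) -> None:
        # Reject any field that the definition's schema does not list.
        unknown = set(data) - set(self.fields)
        if unknown:
            raise ValueError(f"fields not in {self.name} schema: {sorted(unknown)}")


@dataclass
class SetRecord:
    """A Set record groups records: it may refer to any number of object records."""
    definition: Definition
    data: dict
    object_ids: set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        self.definition.validate(self.data)


@dataclass
class ContextRecord:
    """A Context record augments a record: it is linked to exactly one object record."""
    definition: Definition
    data: dict
    object_id: int

    def __post_init__(self) -> None:
        self.definition.validate(self.data)


class ContextStore:
    """Enforces the assumed rule: one Context record per (definition, object) pair."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, int], ContextRecord] = {}

    def add(self, record: ContextRecord) -> None:
        key = (record.definition.name, record.object_id)
        if key in self._records:
            raise ValueError(f"object {record.object_id} already has a {key[0]} context")
        self._records[key] = record

    def for_object(self, object_id: int) -> list[ContextRecord]:
        return [r for (_, oid), r in self._records.items() if oid == object_id]


# Hypothetical definitions and records, for illustration only.
cld = Definition("collection_level_description", ("title", "description"))
pigments = Definition("pigment_analysis", ("pigment", "method"))

manuscripts = SetRecord(cld, {"title": "Example collection"}, object_ids={1, 2, 3})

store = ContextStore()
store.add(ContextRecord(pigments, {"pigment": "azurite"}, object_id=1))
```

The Set carries a collection of object ids, while the Context carries a single one; the store is where the one-to-one constraint is checked, so a second pigment_analysis record for object 1 would be rejected.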

Benefits of our approach

The primary focus of Open Book has been to release open data about the collections in the Fitzwilliam Museum, as well as to explore how we go beyond item level data and link to related resource information and research activity data.

A fundamental building block of the project was to first establish an organisational position on open data – how we define it and what it means to us. We had to ensure that we had an understanding of open data and were prepared to release it. So, the issue of licensing reared its head. But before that we tried to formalise a description of the types of data that, as a museum,  we deal with and, in particular, clarify what we meant by metadata. With that distinction in place, internal advocacy for open data was easier. In the end it has led to a layered approach, with a basic set of metadata (aimed specifically at resource discovery and aggregation) being dedicated to the public domain and more detailed data licensed as Creative Commons Attribution-Share Alike. The latter can still be defined as an open licence but recognises the curated nature of this data and the significance (in academic terms) of attribution of source.

The next question was how we were going to release this data and I have to admit we have taken a scattergun approach. Metadata is made available via OAI-PMH for harvesting. Fuller data sets are offered via an API, returning JSON responses, and a SPARQL endpoint using RDF and mapping to the CIDOC-CRM. We have conceived this as a data service that complements our web presence and online catalogue. Achieving this has taken time and technical resources that would have been difficult to access without the Open Book project. The uptake and value of each of these can only be properly assessed over time. If we had to focus on a single aspect of this, perhaps the path with the potential for greatest returns would be the release of open metadata (whether through OAI or some other data feed) to the Culture Grid and subsequently to Europeana. The sector context of such aggregations is likely to offer users the most benefit, in the short term at least.
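For anyone wanting to harvest the metadata, the sketch below shows how a standard OAI-PMH ListRecords request is put together. The verb and argument names come from the OAI-PMH protocol itself; the base URL is a placeholder, because the address of our OAI-PMH interface is not given in this post. The default metadata prefix is oai_dc, which every OAI-PMH repository must support; the prefix under which the PNDS Dublin Core profile is exposed is not stated here, so it is left as a parameter.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Placeholder only: substitute the real address of the OAI-PMH interface.
OAI_BASE_URL = "http://data.example.org/oai"


def list_records_url(
    base_url: str = OAI_BASE_URL,
    metadata_prefix: str = "oai_dc",
    set_spec: str | None = None,
    from_date: str | None = None,
    resumption_token: str | None = None,
) -> str:
    """Build the URL for an OAI-PMH ListRecords request."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must be sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
        if from_date is not None:
            params["from"] = from_date  # e.g. "2012-08-01"
    return f"{base_url}?{urlencode(params)}"


first_page = list_records_url(from_date="2012-08-01")
```

A harvester would fetch the first page, read the resumptionToken element from the response if there is one, and call list_records_url again with that token until no token is returned.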

Users, collection descriptions and research activity data

Development of use case scenarios and understanding of user requirements for the metadata released by Open Book has relied heavily on work carried out with the sister project Contextual Wrappers 2. In the focus group sessions for Contextual Wrappers 2 and Open Book, the concept of open data was well received by data managers and users alike, although it was recognised that there may still be management resistance within some organisations. The general approach, adopted by both Contextual Wrappers 2 and Open Book, of guided discovery (providing contextual information, including collection level descriptions, to supplement item level records) was seen as positive and is already pursued by some museums. The value in particular to users unfamiliar with the subject or collection was highlighted, as well as a role in creating course subject guides. However, the potential was seen also for academic research, creating links across multiple collections and using collection level descriptions as a way to draw together bibliographic information and research material across a group of objects.

In addition to releasing item level metadata, Open Book has explored the modelling of data related to the outputs from ongoing research projects (including pigment analysis of illuminated manuscripts) at the Fitzwilliam Museum. An initial proposition was that the CIIM (Collections Information Integration Middleware), which was being deployed during the project, could be used to capture, store and eventually publish the primary research data outputs. As such, one group of potential users would have been internal, employing the CIIM to manage the research data. However, it became clear at an early stage that this wasn’t going to be possible or desirable. The real strength of the CIIM is data augmentation and publishing; to use it as a database for research outputs and to integrate it within the research workflow (particularly given that the CIIM’s user interface is at an early phase of development) wasn’t a viable option. Although the notion of ‘contexts’ within the CIIM, which augment the item data extracted from the collections database, could be used to store item-specific research data, it was apparent that research projects don’t operate according to this kind of simple one-to-one relationship. Tools and applications with which the researcher is already familiar are a more efficient means of handling the research process and better able to capture the complexities of the data.

Ultimately what we have tried to do is encapsulate the research output in a structured form as metadata about the project. This kind of ‘abstract’ of the research sounds (and is) quite straightforward but it isn’t something that we have done before or have had the capacity to store and publish in a sensible way. We will continue to explore how we might vary the granularity of these abstracts, from global project metadata down to individual strands of a research project, and how they can be linked to specific context information augmenting the basic object data. We want to both publish the fact that research (which may as yet not have any associated academic publication) is ongoing and associate the relevant objects with that research, creating links between item and research metadata in the same way as there are links between items and collections. We have more work to do on assessing the potential use of this metadata but the emphasis is on resource discovery, particularly within a cross-disciplinary environment where an activity such as pigment analysis would be of interest to researchers in disciplines outside the usual museum subject areas. This leads into what are, for us, new areas of aggregation of research activity data. We are familiar, in our object-centric view of the world, with object data aggregation but structuring research related data, and how this is/can be used in a wider academic context, is new territory for us.

In addition to any benefits to external users, there also may be internal gains. Although dealing only with a relatively small number of projects, one by-product of structured research activity data will be an enhanced strategic and administrative capacity to search and produce reports (e.g. research activity over last 10 years funded by x). The role of documenting research in this way was highlighted in the focus group sessions, where it was suggested that it could help raise the profile of research within museums, provide a metric in the Research Excellence Framework and be used for internal advocacy.

Do collections exist?

Lessons learned about modelling metadata

At the risk of following a philosophical path which I’m not qualified to tread, there is a Neo-Platonic tenet that states something along the lines of “The soldier is more real than the army.” By the same token a museum object could be said to be more real than a collection. An object has a physical presence. A collection exists only because someone says it does.

The problem with defining what a collection means, from a museum point of view at least, is that there are many different definitions depending on who is doing the defining and the perceived role and value (to the potential user) of the collection. Beyond those that group together items impractical to catalogue individually, collection records don’t fit well within a collections information management system and are not usually part of the normal curatorial cataloguing workflow.

Museum object records exist because museums have created them as a natural part of collections management. However, there is not the same imperative to create records at a collection level. They are more fluid, artificial constructs and don’t necessarily have a clear role within the organisation. Although published collection catalogues have always been within academic and curatorial scope, short interoperable collection level descriptions, which complement and provide context for object records, have perhaps only begun to find a real role with the advent of digital publication.

The JISC Resource Discovery programme has prompted us (and as a museum we are probably much further behind in this than libraries or archives) to make a greater distinction between publishing resources and publishing information about those resources. We have tended to think of item records in the former category but collection descriptions fall more naturally into the second.

In looking at how we represent the metadata for collections during this project we began also to question whether there was value still in having a distinct entity called a ‘collection level description’. Would a generic resource description, which encompasses collection descriptions (in all their various flavours) as well as online exhibitions, digital resources, and any other published grouping of objects, be a more useful concept? And who should create these resource descriptions – curators, education staff, documentation staff?

The notion of ‘sets’, implemented in the middleware CIIM (Collections Information Integration Middleware) for ‘Open Book’, has provided a way for us to model these entities. It also addresses the modelling and integration of metadata related to specialist research information generated by the museum (which has been another part of this project). As well as facilitating the publishing of open metadata derived from item records on our collections information management system (Adlib), the CIIM acts as a primary store for metadata describing collections, research outputs, electronic resources or anything else which we wish to publish as an aid to resource discovery. As a concept, the ‘set’ is generic but we are able to assign a specific schema to each type of set, according to the metadata content and the purpose of the set. These sets can then be associated with other sets, with item level data and with ‘contexts’: additional data, beyond that extracted from the collections information database, which is wrapped around the object records. This has provided a very adaptable framework, helping us move beyond publishing only object records. Collections might not exist but we can now at least capture, store, connect and publish the metadata for them.

Learning technical lessons

As I alluded to in my previous post we feel like we have only just scratched the surface with what is possible on the foundations which ‘Open Book’ has laid. This is hardly surprising as this project was pretty ‘rapid fire’, achieving much proof of concept work over a very short project life of 7 months. We believe it’s been a valuable demonstrator of short, sufficiently funded development bringing together a small group of organisations each bringing their specialist experience into the mix.

One thing which is not immediately apparent is that to deploy the type of technical architecture we wanted for Open Book (diagram below) you need plenty of flexibility in your infrastructure. The Fitzwilliam Museum’s investment in a virtualised server infrastructure in 2011 has been an important enabler for ‘Open Book’ – deploying the new servers required was trivial for us today, compared to our old infrastructure where server = physical box. The project required a mix of technologies, and being flexible enough to just deploy whatever was needed, relatively quickly, meant we didn’t get bogged down in infrastructure matters.

JISC Open Book Block Diagram

Another less than obvious thought is that when seeking to create secondary stores and build services over them one needs to make very important decisions about just how structured or unstructured the data needs to be for that service.

Maybe only a very small subset of your data is required, which can be mapped into further consolidated fields (e.g. any number of fields from the source(s) may all end up being concatenated into a ‘global keyword field’ in the secondary data store) – this can create incredibly small and fast indexes which are quite suitable for the purpose at hand. On the other hand, where you want very fine granularity in your output services you will have to apply the more traditional highly designed schemas, structures and cross-linkages – this is very resource intensive but it will be reflected in the quality of the service you provide (e.g. a high value triple-store will have had an equally high level of effort expended on structuring & creating it).
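The ‘global keyword field’ end of that spectrum can be sketched in a few lines of Python. This is an illustration of the idea only, not the code behind any of our services: the record layout, field names and sample values are all made up, and the inverted index here simply stands in for whatever search engine sits over the secondary store.

```python
from __future__ import annotations

import re
from collections import defaultdict


def to_search_doc(record: dict, source_fields: list[str]) -> dict:
    """Collapse the chosen source fields into a single lowercased keyword field."""
    parts = [str(record[name]) for name in source_fields if record.get(name)]
    return {"id": record["id"], "keywords": " ".join(parts).lower()}


def build_index(docs: list[dict]) -> dict[str, set[int]]:
    """Build a tiny inverted index: token -> ids of the records containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc in docs:
        for token in re.findall(r"\w+", doc["keywords"]):
            index[token].add(doc["id"])
    return index


# Made-up primary records, for illustration only.
primary_records = [
    {"id": 1, "title": "Book of Hours", "place": "France", "material": "vellum"},
    {"id": 2, "title": "Psalter", "place": "England", "material": "vellum", "notes": ""},
]

docs = [to_search_doc(r, ["title", "place", "material", "notes"]) for r in primary_records]
index = build_index(docs)
```

Everything that made the source records structured (which value was a place, which a material) has been thrown away, which is exactly why the result is small and fast, and also why it could not feed something like a triple-store.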

Our collaboration during Open Book often came down to ‘modelling’ – not schemas or field definitions, but rather looking at the data we were trying to represent/use and seeing whether, rather than solving just a single specific instance of a problem, we could ‘generalise’ the problem up a level (create a model). Knowledge Integration’s CIIM concepts of ‘sets’ and ‘contexts’ are very good examples of this and we enjoyed participating in further refinement of those concepts within the CIIM.

Designing URIs for open data services is not trivial – but if you hunt about there’s a bit of prior knowledge and activity going on in this area. One very useful resource is the government’s open data URI guidelines, and a useful forum for these matters in the museum context is the Museums and the Machine Processable Web site.

There is bound to be much more that could be said – but instead I’ll invite commenters to ask specific questions if they wish, and we’ll try to answer them in the context of our project.
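As a small illustration of what a URI pattern buys you, the sketch below mints and parses identifier URIs of the form seen in our published example, http://data.fitzmuseum.cam.ac.uk/id/object/656. The pattern is inferred from that single example; only the ‘object’ path segment is attested, so treating the segment as a general ‘kind’ is an assumption, as are the function names. This is not the code running behind the service.

```python
from __future__ import annotations

import re

BASE = "http://data.fitzmuseum.cam.ac.uk"

# Assumed pattern: <base>/id/<kind>/<numeric local id>. Only kind="object" is attested.
_ID_URI = re.compile(
    r"^http://data\.fitzmuseum\.cam\.ac\.uk/id/(?P<kind>[a-z]+)/(?P<local_id>\d+)$"
)


def mint_uri(kind: str, local_id: int) -> str:
    """Return the identifier URI for a thing of the given kind."""
    if not re.fullmatch(r"[a-z]+", kind):
        raise ValueError(f"kind must be lowercase letters only: {kind!r}")
    if local_id < 0:
        raise ValueError("local_id must not be negative")
    return f"{BASE}/id/{kind}/{local_id}"


def parse_uri(uri: str) -> tuple[str, int]:
    """Split an identifier URI back into (kind, local_id)."""
    match = _ID_URI.match(uri)
    if match is None:
        raise ValueError(f"not an identifier URI: {uri!r}")
    return match["kind"], int(match["local_id"])


uri = mint_uri("object", 656)
```

Keeping the identifier free of anything that might change (software names, file extensions, department names) is the main point: the URI names the object, and what is served in response can evolve separately.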

It’s not quite so obvious (#3)

Part 3 – Putting it all together

Let’s start with a picture:

JISC Open Book Block Diagram

This is a block diagram of what the ‘Open Book’ project has deployed. Some brief explanations:

  • blue represents existing infrastructure/services, orange represents new ‘project’ infrastructure/services
  • the bottom line (of boxes) represents internal, primary datastores & applications (more later)
  • the middle line represents ‘middleware’ components which do not hold primary data, the exception to this is the CIIM which can be both (more later)
  • the top line end-points represent services offered to the internet
  • the Digital Asset Management System in grey doesn’t exist – but it is obvious that this is a key primary store/application which is missing… put another way it’s a ‘gaping hole’…
  • the orange dotted arrows are also obvious near future integrations

For those who are interested, below is the same diagram with some details about technicalities added.

JISC Open Book Block Diagram (with some technical labels)

So in summary the project is deploying the ‘middleware’ (which was the focus of the problem scenario and solutions/benefits outlined in earlier posts).  The middleware’s primary purpose is to provide services.

I went through all the theoretical benefits of this approach in the previous post.

I don’t think it’s too much of a stretch to say this project (in conjunction with JISC-Contextual Wrappers, JISC-Contextual Wrappers#2 and our own re-development of our online catalogue – ‘Collections Explorer’) has brought us to a new level of understanding around how to build far more sophisticated, agile and adaptable information management systems.

Sophistication in that ‘best of breed’ software systems are brought to bear on previously intractable problems (when we were limited by existing technologies).  Sophistication is also present as an opportunity – we have only just scratched the surface of what this architecture makes possible.

Creating ‘secondary stores’ whose design and purpose is to serve ‘end user’ or ‘end point machine’ services means, at its simplest, more rapid development of those services is possible.  A secondary store is simply a re-arranged version of one or more primary data stores – in the diagram these secondary stores exist through the middle layer (CIIM, Collections Explorer and Triplestore).

The CIIM is the odd one out here, as it is also able to act as a ‘primary store’ – it can provide data & functionality which cannot easily be provided in other primary stores – namely the creation of ‘sets’ and ‘contexts’, either in relation to the secondary data it holds or as brand new data. It’s a topic on its own which I’ll cover in my next blog. Suffice it to say that these features of the CIIM enable the storage of new data sets (e.g. for the project to implement and store, say, Collection Level Descriptions) and the augmentation of existing records (e.g. for the project to relate, say, research data to existing object records).

Agile and adaptable are not buzz words in this project – it is through development experience on the project that we know the new architecture provides these characteristics.

We are now finalising deployment of the services which the project has developed.