JISC – Open Book Project

releasing open data for illuminated manuscript collection records and research…

Modelling: ‘Sets’ and ‘Contexts’

Knowledge Integration’s Collections Information Integration Middleware (CIIM) implements a couple of very useful ‘data models’. With refinement during Open Book we have so far not found anything that, provided it is ‘object record’ centric, can’t be ‘put into’ one of these models.

This is because there are primarily three types of ‘data connection’ one might want to make in middleware:

  1. Augment existing object records (with related but different information);
  2. Group existing object records (so that arbitrary groupings of records can be made); and
  3. Create new data sets possibly without any object record linkage.
Models: Sets and Contexts

Referring to the diagram above, there are two models – Sets and Contexts. In the CIIM both of these are definable – in the diagram this is represented by ‘Definition’. A ‘Definition’ is a class, if you like (for the object-based modellers amongst you) – at its most basic it defines the field list for this type (say, a Set) and the type of ‘connections’ records of this type (Set records) are allowed to make. Clearly you can create as many definitions as you like – for simplicity I am representing only a single definition of each type of model in the diagram.

With a Set you define its field list (schema) and then you can begin creating Set records of that ‘type’. Clearly you may have many Set records for any given Set definition. A Set’s principal characteristic is that any one Set record can be allowed to refer to as many object records as it likes. A simple practical example: the Set definition is for a Collection Level Description (RSLP schema), and each Collection Level Description (a Set record) can then refer to as many object records as it likes.

With a Context you define its field list (schema) and then you begin creating records of that ‘type’. Clearly you may have many Context records for any given Context definition. A Context’s principal characteristic is that it may only have a one-to-one relationship with object records. This enforces the augmentation concept. A simple theoretical example: you have a research project generating particular metadata per object – this is clearly an augmentation issue, as you have extra structured data per object (which we are assuming you don’t want to store in your collections management system). So your research metadata becomes a Context definition (defining its fields/schema), and each Context record will hold the research data for an individual object and have a one-to-one linkage to that object’s record.
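
To make the two models concrete, here is a minimal sketch in Python – the class names, fields and identifiers are our own illustration, not the CIIM’s actual API. A Definition fixes the schema, a Set record carries a one-to-many linkage, and a Context record is pinned to exactly one object record:

    # Minimal sketch of the Set/Context models; all names are illustrative,
    # not Knowledge Integration's actual CIIM API.

    class Definition:
        """Defines the field list (schema) for one type of Set or Context."""
        def __init__(self, name, fields):
            self.name = name
            self.fields = fields            # e.g. the RSLP field list

    class SetRecord:
        """A Set record: may refer to any number of object records."""
        def __init__(self, definition, values):
            self.definition = definition
            self.values = values            # dict keyed by the definition's fields
            self.object_ids = []            # one-to-many linkage

        def link(self, object_id):
            self.object_ids.append(object_id)

    class ContextRecord:
        """A Context record: augments exactly one object record."""
        def __init__(self, definition, values, object_id):
            self.definition = definition
            self.values = values
            self.object_id = object_id      # one-to-one linkage, fixed at creation

    # A Collection Level Description as a Set...
    cld_def = Definition("CollectionLevelDescription", ["title", "description"])
    cld = SetRecord(cld_def, {"title": "Illuminated manuscripts", "description": "..."})
    cld.link("object-001")
    cld.link("object-002")                  # ...linking as many objects as it likes.

    # ...and per-object research metadata as a Context.
    research_def = Definition("ResearchMetadata", ["pigment_analysis"])
    note = ContextRecord(research_def,
                         {"pigment_analysis": "lapis lazuli"},
                         "object-001")      # tied to this one object record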


Learning technical lessons

As I alluded to in my previous post, we feel we have only just scratched the surface of what is possible on the foundations ‘Open Book’ has laid. This is hardly surprising, as the project was pretty ‘rapid fire’, achieving much proof-of-concept work over a very short project life of seven months. We believe it has been a valuable demonstration of short, sufficiently funded development, bringing together a small group of organisations each contributing their specialist experience to the mix.

One thing which is not immediately apparent is that to deploy the type of technical architecture we wished for Open Book (diagram below) you need plenty of flexibility in your infrastructure. The Fitzwilliam Museum’s investment in a virtualised server infrastructure in 2011 has been an important enabler of ‘Open Book’ – deploying the new servers required was trivial for us today, compared with our old infrastructure where server = physical box. The project required a mix of technologies, and being flexible enough to deploy whatever was needed, relatively quickly, meant we didn’t get bogged down in infrastructure matters.

JISC Open Book Block Diagram

Another less than obvious lesson is that when seeking to create secondary stores and build services over them, one needs to make very important decisions about just how structured or unstructured the data needs to be for each service.

Maybe only a very small subset of your data is required, which can be mapped into further consolidated fields (e.g. any number of fields from the source(s) may all be concatenated into a ‘global keyword’ field in the secondary data store) – this can create incredibly small and fast indexes which are quite suitable for the purpose at hand. On the other hand, where you want very fine granularity in your output services you will have to apply the more traditional, highly designed schemas, structures and cross-linkages – this is very resource intensive, but it will be reflected in the quality of the service you provide (e.g. a high-value triplestore will have had an equally high level of effort expended on structuring and creating it).
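
As a sketch of the first (small, fast) approach – all field names here are invented for the example – a handful of source fields collapse into one consolidated keyword field in the secondary store:

    # Illustrative only: flatten a rich source record into a tiny
    # secondary-store document with a single consolidated keyword field.

    def to_secondary(record, keyword_fields):
        """Map a source record to {id, keywords} for a small, fast index."""
        keywords = " ".join(str(record[f]) for f in keyword_fields if record.get(f))
        return {"id": record["id"], "keywords": keywords}

    source = {
        "id": "obj-123",
        "title": "Book of Hours",
        "maker": "Unknown, Flemish",
        "materials": "vellum, gold leaf",
        "curatorial_notes": "...",  # deliberately excluded from the index
    }

    doc = to_secondary(source, ["title", "maker", "materials"])
    # {'id': 'obj-123', 'keywords': 'Book of Hours Unknown, Flemish vellum, gold leaf'}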

Our collaboration during Open Book often came down to ‘modelling’ – not schemas or field definitions – but rather looking at the data we were trying to represent/use and seeing whether, rather than solving just a single specific instance of a problem, we could ‘generalise’ the problem up a level (create a model). Knowledge Integration’s CIIM concepts of ‘sets’ and ‘contexts’ are very good examples of this, and we enjoyed participating in further refinement of those concepts within the CIIM.

Designing URIs for open data services is not trivial – but if you hunt about there is a useful body of prior knowledge and activity in this area. One very useful resource is the government’s open data URI guidelines, and a useful forum for these matters in the museum context is the Museums and the Machine Processable Web site.
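
For flavour, a scheme loosely following the id/doc distinction in those guidelines might look like the sketch below – the domain and the reference are purely hypothetical:

    # Hypothetical URI patterns, loosely following the UK public sector
    # guidance's distinction between identifier and document URIs.

    BASE = "http://data.example.org"  # invented domain

    def id_uri(obj_type, ref):
        """URI naming the thing itself (the manuscript, say)."""
        return f"{BASE}/id/{obj_type}/{ref}"

    def doc_uri(obj_type, ref):
        """URI of a document describing that thing."""
        return f"{BASE}/doc/{obj_type}/{ref}"

    print(id_uri("object", "ms-300"))   # http://data.example.org/id/object/ms-300
    print(doc_uri("object", "ms-300"))  # http://data.example.org/doc/object/ms-300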

There is bound to be much more that could be said – but instead I’ll invite commenters to ask specific questions if they wish, and we’ll try to answer them in the context of our project.

It’s not quite so obvious (#3)

Part 3 – Putting it all together

Let’s start with a picture:

JISC Open Book Block Diagram

This is a block diagram of what the ‘Open Book’ project has deployed. Some brief explanations:

  • blue represents existing infrastructure/services, orange represents new ‘project’ infrastructure/services
  • the bottom line (of boxes) represents internal, primary datastores & applications (more later)
  • the middle line represents ‘middleware’ components which do not hold primary data; the exception to this is the CIIM, which can be both (more later)
  • the top line end-points represent services offered to the internet
  • the Digital Asset Management System in grey doesn’t exist – but it is obvious that this is a key primary store/application which is missing… put another way it’s a ‘gaping hole’…
  • the orange dotted arrows are also obvious near future integrations

For those who are interested, below is the same diagram with some details about technicalities added.

JISC Open Book Block Diagram (with some technical labels)

So in summary the project is deploying the ‘middleware’ (which was the focus of the problem scenario and solutions/benefits outlined in earlier posts).  The middleware’s primary purpose is to provide services.

I went through all the theoretical benefits of this approach in the previous post.

I don’t think it’s too much of a stretch to say this project (in conjunction with JISC-Contextual Wrappers, JISC-Contextual Wrappers #2 and our own re-development of our online catalogue, ‘Collections Explorer’) has brought us to a new level of understanding of how to build far more sophisticated, agile and adaptable information management systems.

Sophistication in that ‘best of breed’ software systems are brought to bear on problems which were previously intractable (when we were limited by existing technologies). Sophistication is also present as an opportunity – we have only just scratched the surface of what this architecture makes possible.

Creating ‘secondary stores’ whose design and purpose is to serve ‘end user’ or ‘end point machine’ services means, at its simplest, that more rapid development of those services is possible. A secondary store is simply a re-arranged version of one or more primary data stores – in the diagram these secondary stores exist across the middle layer (CIIM, Collections Explorer and Triplestore).

The CIIM is the odd one out here, as it is also able to act as a ‘primary store’ – it can provide data & functionality which cannot easily be provided in other primary stores – namely the creation of ‘sets’ and ‘contexts’, either in relation to the secondary data it holds or as brand new data. It’s a topic on its own which I’ll cover in my next blog post. Suffice it to say that these features of the CIIM enable the storage of new data sets (e.g. for the project to implement and store, say, Collection Level Descriptions) and the augmentation of existing records (e.g. for the project to relate, say, research data to existing object records).

Agile and adaptable are not buzz words in this project – it is through development experience on the project that we know the new architecture provides these characteristics.

We are now finalising deployment of the services which the project has developed.

It’s Obvious #2

Part 2 – What’s “Middleware”?

I like the beginning of the Wikipedia article for Middleware (I’m dropping the Middleware quotes now in the name of efficiency – too many keystrokes…). It spends most of the first paragraph saying what Middleware is not. But I’m not here to bore you with definitions, let’s keep it simple…

Figure 1

Basically, Middleware does something, to something, to produce something. My other jocular way to describe it is ‘Slurp, Stir and Serve’…

Figure 1 is awfully familiar to anyone with exposure to traditional ‘Computing 101’ – computers take input, process it, and create output. At its simplest level so does Middleware.

But let’s quickly get more concrete. Figure 2 is our ‘problem scenario’ from part 1 – using a Collections Management System (CollMS) and its data to feed directly to your web users.

Figure 2

Next, Figure 3 puts Middleware in the picture. Immediately the opportunity exists to change what you present to your web audience – because you’re ‘driving’ your web presence from a ‘different system and data-set’ – more formally, you have de-coupled your web functionality from your Collections Management System (CollMS). Most importantly, your Middleware will (be chosen to) allow you to design for your web users (and services) rather than be constrained by the functionality of your CollMS.

Figure 3

Now that I’ve lulled you into a false sense of simplicity, let me expand…

Figure 4

In Figure 4  we see a little more detail of how the Middleware part of this looks.

At a conceptual level it’s worth noting that the Middleware ‘box’ was itself described as an input-process-output ‘box’ at its highest level. And inside the Middleware box, to accomplish that, I’ve shown that we’ll need at least two more input-process-output ‘chains’ to achieve our purpose.

This kind of ‘model’ is valuable in its simplicity – it shows the potential benefit of breaking everything down into input-process-output ‘tasks’. And wherever you see a ‘process’ box you can be pretty sure it has its own input-process-output ‘tasks’ inside it. So, as the infamous saying goes, in our ‘model’ it really is just ‘turtles all the way down’.
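
As a toy illustration of that nesting in Python (the record shapes are invented; ‘slurp’, ‘stir’ and ‘serve’ are just my metaphor), each stage below is an input-process-output task in its own right, and composing them gives the Middleware ‘box’ its overall input-process-output shape:

    # Toy illustration: Middleware as composed input-process-output stages.

    def slurp(source):
        """Input: pull raw records from a source system."""
        return list(source)

    def stir(records):
        """Process: reshape records for the audience being served."""
        return [{"id": r["id"], "label": r["title"].strip().title()} for r in records]

    def serve(records):
        """Output: hand the reshaped records to whatever sits above us."""
        return {"results": records, "count": len(records)}

    def middleware(source):
        # The whole box is itself just input -> process -> output.
        return serve(stir(slurp(source)))

    coll_ms = [{"id": 1, "title": " book of hours  "}]  # stand-in for the CollMS feed
    print(middleware(coll_ms))
    # {'results': [{'id': 1, 'label': 'Book Of Hours'}], 'count': 1}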

So far, so obvious…

More important is that this approach (Figure 4) has some appealing characteristics from a functional and developmental point of view:

  • it’s technology agnostic. It is not reliant on the system (in our case, the CollMS) it’s ‘slurping’ from. It’s possible to use a different piece of technology for each of the components (the datastore, each of the processes) – this gives you the potential to choose ‘best of breed’ technology for each component. Or, more pragmatically, to choose technology your organisation has skills in – i.e. the ‘Microsoft shop’ can use .NET stuff, while University environments like ours can leverage our skills in open source/standard technologies like PHP, XSLT, JSON etc.
  • it’s modular. Modular is good – it allows us to break problems into contained packets which humans can deal with. Through modularisation the development of such a system can be broken into simpler, smaller problems – or ‘black boxes’. Modularity should also aid sustainability – replacing one piece in a modular system won’t break the rest, if designed properly.
  • it’s loosely coupled. What’s good about loose coupling is exactly that it’s not like its opposite – a monolithic system. Figure 2 shows a simplified monolithic system, where if the CollMS ‘goes down’ everything is down. Conversely, in the loosely coupled example, Figure 3, our web users won’t even notice when the CollMS is down because they are being ‘served’ by the Middleware. Loose coupling matters most as a requirement for modularisation – together they enable the ‘black box’ development approach.
  • it can be distributed. Modularisation and loose coupling mean that different components of the system can ‘live’ on different machines, even in different localities if desired. In the technical world this characteristic gives us an easier path to dealing with issues such as continuity, recoverability & scalability.

A couple of notes before I finish this week’s instalment.

Presentation. I have completely ignored, for simplicity, the fact that there is generally a whole separate layer, or module, between the Middleware and the web user called the presentation layer. It’s crucial, and you should be aware it exists, but it would have messed up my simple pictures…

Integration. Another crucial characteristic of Middleware which I’ve chosen not to detail this week. But it’s not a great stretch to see that if you can ‘slurp’ from one data source, you can do it from many. This is key in our “Open Book” project – to show how to bring together (integrate) multiple data sources, and to document the issues in doing so.
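
A minimal sketch of that integration idea (record shapes invented for the example): ‘slurp’ from several sources and merge records which share an identifier.

    # Illustrative multi-source 'slurp': merge records sharing an identifier.

    def integrate(*sources):
        merged = {}
        for source in sources:
            for record in source:
                merged.setdefault(record["id"], {}).update(record)
        return merged

    catalogue = [{"id": "obj-1", "title": "Book of Hours"}]
    research  = [{"id": "obj-1", "pigment": "lapis lazuli"}]

    print(integrate(catalogue, research))
    # {'obj-1': {'id': 'obj-1', 'title': 'Book of Hours', 'pigment': 'lapis lazuli'}}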

One final point to ponder – when you take the concepts and characteristics outlined here and multiply them many fold, you have a simple conceptual picture of what the web, and in particular the web of data, looks like. A loosely coupled, distributed system of modules, all able to ‘slurp, stir and serve’. You can plug these modules together, integrate different data sets, create new modules… you can create modules which serve just a processing purpose and let others use them… and so on and so on…

In part 3 we’ll look at what ‘Open Book’ will ‘build’ as we move ahead in constructing The Fitzwilliam Museum’s ‘explore service(s)’…

P.S. When I decided I wanted to illustrate this blog, my heart sank – I have never really ‘got on’ with the likes of Visio and MS Office Draw… anyway, I found a freeware drawing package, yEd, that is great, and it made ‘knocking out’ the figures for this post pretty straightforward.

It’s Obvious

Part 1 – The Problem

When developing online resources for an audience using a Museum Collections Management System (CMS), a couple of things quickly become apparent:

  • re-purposing data designed for collections care and research is hard, and
  • building an online interface over an ‘internal application’ is hard.

It’s obvious, isn’t it?

It is hard because neither the data nor the functions of the application (e.g. search) were probably ever designed to serve external audiences.

Despite this we have all been doing exactly this – building web services, from OAI-PMH data feeds to full online public access catalogues (OPACs), directly over collections management systems. Many of these CMSs have had a ‘web module’ tacked onto them at some stage in their development history, and this is what has been used to build these web services.

Experience has shown this has its limitations. We explored these issues in previous JISC projects as far back as 2002 (e.g. the issues documents published during the ‘Harvesting the Fitzwilliam’ project), and we have been handling the practical problems of this approach ever since. Despite this, we have built OAI-PMH services, taken our OPAC through two development incarnations, and built variously successful ‘dynamic’ web resources based on the ‘underlying OPAC’ (e.g. this is a resource which combines static and OPAC-derived data).

The problem of ‘re-purposing data’ is well rehearsed.  The problem of ‘re-purposing’ an application brings yet another set of issues. Briefly looking at the main iterations The Fitzwilliam Museum has gone through with its OPAC is not a bad way to draw some of these issues out.

Phase 1 OPAC (circa 2001/2) was built entirely on the vendor’s ‘cmsopac’ module, and all customisation was carried out in its own proprietary scripting language.

Phase 2 OPAC (circa 2005/6) coincided with an evolution of the vendor’s ‘cmsopac’ module, which now had XML output. This provided the opportunity to ‘wrapper’ the ‘cmsopac’ module with in-house developed functionality (using PHP and XSLT technologies). Now, in-house development is not something one embarks on lightly, but it was considered necessary to provide the web experience we aspired to. Primarily what we achieved was:

  • to provide a search interface to the user which did its best to ‘hide’ the underlying application functionality (and its limitations)
  • to build a completely flexible presentation system (based on XSLT) above the ‘cmsopac’
  • to tinker with the ability (as unsophisticated as it is) to integrate simple related data, not held in the CMS, into OPAC results

Put another way, we had built a ‘layer’ which partly de-coupled the web functionality, on both the input and output sides, from the underlying ‘cmsopac’ application. This approach served us well for a time. Obviously, however, any limitations the ‘cmsopac’ application has are always present, because it is still at the core of the system. In time, the limitations which really began to ‘hurt’ us were its search functionality, its search performance, and our desire to do data integration from multiple sources.
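
To give a flavour of that wrapper pattern – this is a minimal reconstruction, not our actual Phase 2 code, written here in Python with lxml rather than our original PHP, and with an invented XML shape – the idea is simply to take the ‘cmsopac’ XML output and apply a presentation stylesheet:

    # Minimal reconstruction of the 'wrapper' pattern: take XML from the
    # underlying 'cmsopac' and apply a presentation stylesheet. Element
    # names are invented; the real module's output was nothing this simple.
    from lxml import etree

    cmsopac_xml = etree.XML(
        "<results><object><title>Book of Hours</title></object></results>"
    )

    stylesheet = etree.XML("""\
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/results">
        <ul><xsl:apply-templates select="object"/></ul>
      </xsl:template>
      <xsl:template match="object">
        <li><xsl:value-of select="title"/></li>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    print(transform(cmsopac_xml))  # <ul><li>Book of Hours</li></ul>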

If you have read this far I’m guessing you may be reciting my title by now – “it’s obvious” – and in its simplest version it is – “middleware”. “Middleware” is one solution which could complete the job – meaning it would completely de-couple our CMS’s data and functionality from our OPAC.  This is already being done in the museum sector – the most common problem being tackled by the “middleware” approach is integration of the CMS with a Digital Asset Management (DAM) system. An example of this which comes from the same JISC programme as our project is the Bodleian’s iNQUIRE system.  Knowledge Integration, a partner in Open Book, has also done work in this direction – bringing CMS and DAM data together in their CIIM system to drive the Imperial War Museum’s Collection Search.

This should come as no surprise really – “middleware”, or “fusion service” as it is called in the JISC Information Environment Architecture, has been conceptually desirable for a very long time. Obviously the Collections Trust’s Culture Grid, as an aggregator, is by definition a sophisticated “middleware” system. Today, in medium to large museums, many components of the JISC IE Architecture are migrating ‘inside’ – a small private version of that architecture inside an organisation, if you like.

“Middleware” has probably ‘come of age’ for this smaller, internal deployment and development for a number of reasons. Firstly, the entry barrier has been lowered – sophisticated, purpose-built, open-source components requiring less development effort have matured over the past few years. This makes it possible for smaller organisations to consider “middleware” solutions from the perspective of the resources required to actually deploy such a system. Previously it was simply too ‘complicated’. More importantly, the need, and the aspiration, to provide better user experiences and web services simply require more flexible ‘systems’. “Middleware” becomes an obvious component choice in this new ‘system’.

The specific problems which have brought The Fitzwilliam Museum to the need for “middleware”, and what ‘Open Book’ will begin to address are:

  • the ability to bring together collection, object catalogue and object research data
  • the ability to provide more sophisticated harvesting (OAI-PMH)
  • the ability to provide new services conforming to linked open data best practices

In part 2 we’ll explore how that “middleware” ‘fits into the picture’…

Project Plan

The project team met again on the 22nd of February at the Natural History Museum.

The main task of this meeting was to discuss and clarify all partners’ understanding of the deliverables.

Useful discussions also took place regarding data modelling, architecture and understanding any linkages with Contextual Wrappers #2.

The primary result of this meeting was the publishing of the project’s Project Plan.

Project ‘kick-off’ – 23 Jan 2012

A project team meeting was held at the Natural History Museum on the 23rd January 2012.

Open Book – releasing open data for museum collection records and research – is a JISC-funded project under the Digital Infrastructure Programme – Resource Discovery strand.

This is the project’s website; here you will find published outputs from the project, blog entries, etc.

Find out more about this project…