Resource Maps and ODaP

[Unfortunately, most of the links are obsolete, referring to servers that don’t exist anymore.]

One of the deliverables of the ODaP (Open Data and Publications) project is to have 40 Resource Maps with linked datasets and publications.
Our original plan was to use the ORE gateway that we developed a few years ago, see http://ore.place.pukurin.uvt.nl/. This turned out to be too cumbersome. We have now implemented support for OAI ORE Resource Maps directly in our search engine, which is based on Meresco from CQ2. I didn’t use the RDF and triple store support of Meresco itself. There was not enough time to get familiar with it, and what we want is rather simple. There is not even a need to store the triples: they are generated dynamically from the XML “parts” that the search engine stores for each document. Moreover, we can make use of the record processing that is already implemented. Perhaps at a later stage we can make use of an RDF library. The sources of our Meresco variant, which is called bzv, are here: https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/meresco/bzv/. For the ORE support we added a new module: see https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/meresco/bzv/src/ore.py – ugly but effective.

There are two test servers that have implemented the ore support:

These are test servers. They can be unavailable or broken. Resource Maps can be generated for all documents stored by a search engine. For the Get It! server (search.uvt.nl) this means 6.000.000 Resource Maps (there is a bug in generating Resource Maps for our 700.000 catalog records – this will be fixed).

The following URL templates are supported:

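  • http(s)://<server>/ore/<identifier> – Aggregation
  • http(s)://<server>/ore/<identifier>.rdf – Resource Map
  • http(s)://<server>/pub/<identifier> – Publication
  • http(s)://<server>/mods/<identifier>.xml – MODS
  • http(s)://<server>/ddi/<identifier>.xml – DDI, version 2
  • http(s)://<server>/ddi3/<identifier>.xml – DDI, version 3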

The first four URLs will always work, but the URLs for the DDI work only when there is a DDI document available.
Requests for the Aggregation and Publication URLs are answered with a 303 See Other redirect to the corresponding Resource Map.
Examples are:
http://evs.uvt.nl/search?displayType=single&query=evs-uvt-nl:oai:evs.uvt.nl:3256420 (Human Start Page)
http://evs.uvt.nl/ore/evs-uvt-nl:oai:evs.uvt.nl:3256420.rdf (Resource Map)
and
https://bzv.place.pukurin.uvt.nl/search?displayType=single&query=ir-uvt-nl:oai:wo.uvt.nl:167602 (Human Start Page)
https://bzv.place.pukurin.uvt.nl/ore/ir-uvt-nl:oai:wo.uvt.nl:167602.rdf (Resource Map)
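
The redirect behaviour described above can be sketched in a few lines of Python. This is only an illustration and not the code in ore.py: the function names are made up for this example, and the URL layout (server, then /ore/ or /pub/, then the record identifier) is inferred from the example URLs.

```python
from typing import Optional


def resource_map_url(base: str, identifier: str) -> str:
    """Build the Resource Map URL for a record identifier (layout assumed from the examples)."""
    return f"{base}/ore/{identifier}.rdf"


def redirect_target(base: str, path: str) -> Optional[str]:
    """Return the target of the 303 See Other redirect, or None if the path is served directly.

    Identifiers contain dots (e.g. 'evs.uvt.nl'), so the '.rdf' suffix is
    tested with endswith() instead of splitting on the first dot.
    """
    if path.startswith("/pub/"):
        # Publication URL: redirect to the Resource Map
        return resource_map_url(base, path[len("/pub/"):])
    if path.startswith("/ore/") and not path.endswith(".rdf"):
        # Aggregation URL: redirect to the Resource Map
        return resource_map_url(base, path[len("/ore/"):])
    # Resource Map, MODS and DDI URLs are not redirected in this sketch
    return None
```

Calling redirect_target("http://evs.uvt.nl", "/ore/evs-uvt-nl:oai:evs.uvt.nl:3256420") gives the Resource Map URL of the first example, while the same path with .rdf appended gives None.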

The SURFshare community has an application profile under development for Resource Maps in RDF XML, see http://wiki.surffoundation.nl/display/vp/Resource+Maps+in+RDF+XML
I will now indicate where I have not followed this document. I refer to version 0.9.

  • No dcterms:issued for Aggregations. It is unclear when Aggregations are issued. They are generated according to rules (computer code). “issued” implies a human act, but in our case there is no such act.
  • Eprint as it is used in the SURFshare document is overloaded. On the one hand it stands for the publication (that is enhanced) and on the other hand it stands for a corresponding object file. It is possible to have a publication without a corresponding object file, but with an enrichment in the form of a dataset or the operationalisations of the concepts used in the study. In the European Values Study, this is often the case. It is also possible to have several object files that are versions of each other, e.g. files in different formats, or to have separate files for parts of the publication (chapters). The metadata of a publication (work) are not the same as the metadata of a file. In the Aggregation we have one resource that is the publication, with the metadata of the publication, and we have zero, one or more object file resources. Our arguments for making this distinction between the publication and the object files are the same as in Use of MODS for institutional repositories. There is a describedBy relation from the publication resource to the MODS metadata resource. Both resources are in the Aggregation. The object files are treated as … files. They are also part of the Aggregation.
    The type of a publication resource is one of the publication types from info:eu-repo/semantics/. The type of the object file resources is info:eu-repo/semantics/objectFile.
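
To make the structure of the Aggregation concrete, here is a minimal Python sketch that generates the triples as plain tuples. It is not taken from ore.py. The ORE terms and rdf:type are real, but the describedBy predicate is a placeholder URI (I have not settled on a vocabulary for it), and using the /pub/ and /mods/ URLs as resource URIs is an assumption made for the example.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EU_REPO = "info:eu-repo/semantics/"
# Placeholder predicate, hypothetical: stands in for the describedBy relation
DESCRIBED_BY = "http://example.org/terms/describedBy"


def aggregation_triples(base, identifier, publication_type, object_file_urls):
    """Generate (subject, predicate, object) triples for one record.

    The Aggregation holds one publication resource, its MODS description,
    and zero, one or more object file resources.
    """
    aggregation = f"{base}/ore/{identifier}"
    resource_map = aggregation + ".rdf"
    publication = f"{base}/pub/{identifier}"   # assumed URI of the publication
    mods = f"{base}/mods/{identifier}.xml"     # assumed URI of the MODS record
    triples = [
        (resource_map, ORE + "describes", aggregation),
        (aggregation, RDF_TYPE, ORE + "Aggregation"),
        (aggregation, ORE + "aggregates", publication),
        (aggregation, ORE + "aggregates", mods),
        (publication, RDF_TYPE, EU_REPO + publication_type),
        (publication, DESCRIBED_BY, mods),
    ]
    for url in object_file_urls:
        # Object files are plain files that are also part of the Aggregation
        triples.append((aggregation, ORE + "aggregates", url))
        triples.append((url, RDF_TYPE, EU_REPO + "objectFile"))
    return triples
```

A publication without any object file (as is often the case in the European Values Study) is simply a call with an empty list for object_file_urls.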

Still to do and issues:

  • lots of details
  • use file details from the DIDL container in the metadata for the object file resources.
  • add data object resources. The data objects are described in the ddi. This can be compared to the Data and supplementary material tab of https://bzv.uvt.nl/search?displayType=single&query=ir-uvt-nl:oai:wo.uvt.nl:167602 that is also based on information from the ddi document, i.e. from https://bzv.uvt.nl/ddi/ir-uvt-nl:oai:wo.uvt.nl:167602. Note that we distinguish between datasets, supplementary materials, and combination files that contain both datasets and supplementary materials.
  • add operationalisations (European Values Study) to the Aggregation. This is the kind of information that is displayed in the Operationalisations tab of http://evs.uvt.nl/search?displayType=single&query=evs-uvt-nl:oai:evs.uvt.nl:3256420 and that comes from this ddi: http://evs.uvt.nl/ddi3/evs-uvt-nl:oai:evs.uvt.nl:3256420. The same goes for the information about the waves and the countries, see the corresponding tab. I am not sure how to represent this in rdf. One idea is to represent the operationalisations used in a study as a separate Aggregation.
  • expand the publication resource description with more metadata (extracted from the MODS). In the end the MODS becomes redundant because all the information in the MODS record is expressed as rdf triples.
  • dcterms:isPartOf is used to relate a publication to what is called in MODS a related item (type=”host” or type=”series”). At the moment the related item is represented by a literal, which is not the intention of dcterms:isPartOf. This literal is in the form of the traditional “source” field which we use in the user interface of the search engine. For “human consumption” this is very clear. Should the rdf description of the publication have the same granularity as the MODS? And how to link to the related items when we have no URIs for them? (I packed a lot of issues in one bullet point 😉)
  • I couldn’t find an ontology or vocabulary to express that a MODS resource “describes” a publication.
  • How to express relations between publications and object files?
  • Relate publication resource to DOI and other publisher controlled identifiers. The integration of the ‘place’ locator that knows about DOIs, etc. is at the moment at the level of Javascript (in the browser). That’s a problem.
  • Add extra information from Webwijs to the foaf descriptions of authors.
  • How to publish the Resource Maps? I have a preference for sitemaps. Very easy for me to implement.
  • What is the added value of the InContext visualizer for the Aggregations of the European Values Study or of ODaP?
  • And so on.
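
As an illustration of the sitemap idea, the following sketch writes a sitemap that lists Resource Map URLs. It assumes the same URL layout as the examples earlier (server, /ore/, identifier, .rdf); the function name is made up, and how the identifiers are obtained from the search engine is left open.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def resource_map_sitemap(base, identifiers):
    """Return a sitemap (as an XML string) with one <url> entry per Resource Map."""
    # Serialise the sitemap namespace as the default namespace (no prefix)
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for identifier in identifiers:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{base}/ore/{identifier}.rdf"
    return ET.tostring(urlset, encoding="unicode")
```

The sitemap protocol allows at most 50,000 URLs per file, so a server with millions of Resource Maps would need to split the list over many files and tie them together with a sitemap index.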

How datasets and publications are linked in ODaP

In ODaP the publications in the institutional repository (IR) and the datasets in DVN are linked in such a way that a user searching the IR can follow links to the datasets in DVN and, vice versa, a user accessing the Tilburg University dataverse in DVN can follow links to the repository records describing the related publications. So the linking in ODaP is symmetrical: if A links to B then B links to A. This is implemented in such a way that the links are maintained in only one system. The system that is the source of the links is regularly consulted for adding the reverse links to the other system.

The source system for the links is DVN. In the description of a dataset the permalink of the related publication is added. A permalink refers to a page of the Tilburg University search system Get It!. Such a permalink page functions as a splash page or a jump-off page of the publication in the repository. In this way studies in DVN link to the Open Access version of the related publications.
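
The idea of maintaining the links in one system and deriving the reverse links from it can be sketched as follows. This is a simplified illustration with made-up names, in which the links in the source system are represented as a mapping from dataset identifier to the permalinks of the related publications.

```python
from collections import defaultdict


def reverse_links(dataset_to_publications):
    """Derive publication -> datasets links from the dataset -> publications links.

    Only the forward mapping is maintained (in the source system); the
    reverse mapping is recomputed each time the source is consulted.
    """
    publication_to_datasets = defaultdict(list)
    for dataset, publications in dataset_to_publications.items():
        for publication in publications:
            publication_to_datasets[publication].append(dataset)
    return dict(publication_to_datasets)
```

Because the reverse links are always recomputed from the source, a link that is removed from a dataset description disappears from the publication side at the next run.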

DVN uses the DDI standard (version 2) as metadata format for the description of the datasets. We store the permalinks of the related publications in the DDI element /codebook/stdyDscr/othrStdyMat/relPubl (Related Publications). The DDI records of DVN can be harvested by using the OAI-PMH protocol. The ODaP harvester that harvests DVN sends the DDI records to the Enrichment Server by using the SRU Record Update protocol. The Enrichment Server uses the permalinks stored in the DDI records to determine which records of the Tilburg University search system Get It! have to be enriched with the DDI. The records in Get It! come from different sources, one of them being the Tilburg University Repository, which is based on the ARNO system. The ARNO system has no end-user interface of its own; for this, iPort and Get It! are used. Getting the DIDL/MODS records supplied by ARNO into Get It! is done by a harvester as depicted in the following diagram.

Note that the harvesting of the DIDL/MODS from the repository is first and the harvesting of the DDI from DVN comes next. In this way the DIDL/MODS as a representation of a publication is enriched with the DDI as a representation of a dataset and not the other way around. The Enrichment Server can also be used to enrich a search engine record with other information that is coming from an external source. This enriched whole can also be represented as an OAI ORE Resource Map.
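
The matching step of the Enrichment Server can be sketched roughly as below. This is not the actual implementation. The sample DDI is heavily simplified (no namespace, only the element path mentioned above), and it is an assumption that the permalink is the text content of relPubl and that the record identifier is carried in the query parameter of the permalink, as in the Get It! URLs shown earlier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Simplified, hypothetical DDI fragment; a real DDI record contains much more
SAMPLE_DDI = """<codebook>
  <stdyDscr>
    <othrStdyMat>
      <relPubl>https://bzv.uvt.nl/search?displayType=single&amp;query=ir-uvt-nl:oai:wo.uvt.nl:167602</relPubl>
    </othrStdyMat>
  </stdyDscr>
</codebook>"""


def related_record_ids(ddi_xml):
    """Extract the identifiers of the search engine records named in relPubl."""
    root = ET.fromstring(ddi_xml)  # root element is <codebook>
    ids = []
    for rel in root.findall("./stdyDscr/othrStdyMat/relPubl"):
        permalink = (rel.text or "").strip()
        query = parse_qs(urlparse(permalink).query)
        ids.extend(query.get("query", []))
    return ids


def enrich(records, ddi_xml):
    """Attach the DDI as an extra part to every matching record; return the matched ids."""
    enriched = []
    for record_id in related_record_ids(ddi_xml):
        if record_id in records:
            records[record_id]["ddi"] = ddi_xml
            enriched.append(record_id)
    return enriched
```

Here records stands for the search engine index, represented as a dictionary from record identifier to the parts stored for that record. The publication record must already be present, which is why the DIDL/MODS harvest has to come before the DDI harvest.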

This way of enriching bibliographical records is also implemented for Economists Online and for the European Values Study portal. ODaP is most similar to Economists Online. Because the ODaP implementation is still experimental, I will give an example of Economists Online.

This permalink http://www.economistsonline.org/publications?id=eprints-lse-ac-uk:oai:eprints.lse.ac.uk:3607 contains a link to a dataset in DVN. The DVN dataset descriptions have a handle as a unique identifier. In this case the handle is hdl:1902.1/12930, which is resolved by http://dvn.iq.harvard.edu/dvn/study?globalId=hdl:1902.1/12930. When we follow the latter link, DVN presents a record that lists the permalink of the publication under Related Publications.

ODaP and Dataverse Network

In ODaP a dataset belongs to a study that is defined by a particular publication. The data used in the study that resulted in the publication are what we call a dataset. These data can be part of a larger dataset or database. In many cases the data as used in a study are the result of processing existing data. The datasets as used for publications are stored and described in the Dataverse Network of the Institute for Quantitative Social Science at Harvard University. We use two dataverses. One is the existing dataverse for Economists Online that was set up in the European project NEEO and the other is a new dataverse for Tilburg University. The dataverse for Tilburg University is made up of collections that correspond to the Schools of the university. The collection of the Tilburg School of Economics and Management (TISEM) is a so-called dynamic collection, while the collections of the other Schools are static. A dynamic collection is populated by studies from other (static) collections. The TISEM collection is defined to be the same collection as the static collection of Tilburg University in the dataverse for Economists Online. In this way our economic datasets can live in two dataverses. The management of the studies that live in one or more dataverses is done in the dataverse that houses the static collection. In our case the economic studies are described in the dataverse for Economists Online and the studies (datasets) of the other Schools are described in the dataverse for our university.
We had several one-hour sessions to make our information specialists Corry, Ingrid and Trijnie familiar with the DVN system. We tried to follow the Guidelines as developed for the NEEO project. It turned out that it is better for ODaP to have its own Guidelines. These Guidelines are under review. We will make them available when they are finished.

ODaP and the acquisition of datasets

Acquiring datasets is a lot of work. It means a lot of talking and explaining. We had meetings with the management of the Schools. Mails are sent to leaders of research groups and to individual researchers. We have a (growing) list of 62 names that we want to contact. Where possible, we contact researchers with whom the library has already collaborated in the past. Old contacts can lead to new ones, expanding our network. We have already contacted 2/3 of the names on the list; in most cases we mail first and then make an appointment. So far we have collected 6 datasets, but we are just starting. In many cases researchers are sympathetic to the idea, but first want to organise (and archive) their data better. This means that ideally we collaborate with the researchers in the different phases of their research. If successful, one contact can connect datasets to several publications. We located more than 60 publications with Tilburg authors that are based on one (longitudinal) dataset. Hopefully we can use these 60 publications in our project. Personal meetings are in most cases necessary to explain the project and to convince researchers.

Open Data and Publications

We got funding from SURFfoundation for a 5-month project. In this project we will help researchers to publish datasets that they used for their publications. The project focusses on researchers from the School of Economics and Management and the School of Social and Behavioral Science. The goal is to connect 40 publications with the underlying datasets.

Technical information
The Dataverse Network system will be used for the description and storage of the datasets. In the description of the dataset there will be a link to the metadata record of the publication in the search system of Tilburg University. The dataset descriptions follow the DDI version 2 standard. The DDI records are harvested from the DVN system using the OAI-PMH protocol. In the search system, the DDI records are added to the metadata records of the corresponding publications. The metadata records are MPEG21/DIDL records that contain the bibliographical description in MODS and links to the full text in the institutional repository. The combination of the DDI record and the DIDL record represents a so-called enhanced publication that can be represented by an OAI-ORE Resource Map. With the exception of the Resource Maps, the same setup is used in Economists Online. The technical work for this was done in the European project NEEO. Also related is the portal of the European Values Study that is the outcome of the project DatapluS funded by SURFfoundation.

The real challenge
The real challenge is however not technical but organisational and behavioural. How to convince and motivate researchers to make their datasets available for open access (this also involves limited access in the sense that access requires the consent of the researcher or someone acting on his/her behalf)? At the end of the project we want to have in place procedures for the delivery of datasets comparable to and integrated with the procedures for the delivery of open access publications to the institutional repository. In following blogs, I will describe how we handle this challenge.

LAIRD is ODAP in Edinburgh

Rob and I met Robin Rice a year ago at the NEEO conference that was held in the British Library. She attended the workshop that we gave on enhancing economic publications with datasets. Rob was invited to contribute to a session of an Open Access Conference in Glasgow later that year.

Today Robin wrote to congratulate us on our new project Open Data and Publications. “This sounds similar to what we’ve been working on in LAIRD (Linking Articles Into Research Data), maybe we can keep in touch about it.”   Yes we will do that.