BiblioSight News

Integrating the Web of Science web-services API into the Leeds Met Repository

Project meeting – minutes

Posted by Nick on November 18, 2009

Present: Peter Douglas, Wendy Luker, Arthur Sargeant, Mike Taylor, Babita Bhogal, Nick Sheppard

1. Apologies

Sue Rooke

2. Minutes from last meeting and actions

As emphasised at the last meeting, it has not been possible, within our timescale, to engage a suitable academic replacement after Phil Jones left the institution earlier in the project. It is now anticipated that academic staff / researchers will be involved in evaluating the outcomes of the project beyond the formal end of jiscri. WL/NS do now have a meeting scheduled (30th November 2009) with Professor Richard Light, the recently appointed Chair of the Carnegie Research Institute, to discuss Bibliosight and the wider repository infrastructure.

NS/PD have done some work on clarifying use cases – see item 4.

Transformation of XML from WoS to LOM format for ingest into intraLibrary. See – http://bibliosightnews.wordpress.com/2009/11/16/mapping-fields-from-wos-api-lom/ – more work still needs to be done in this area. (Action – NS/MT)

AS has updated the schematic diagram to clarify what will be achieved by the end of November. See – http://bibliosightnews.wordpress.com/2009/11/13/332/

NS to contribute project management post to blog on day to day work – ongoing – NS to action ASAP.

PD has contributed a blog post on technical standards used in Bibliosight – http://bibliosightnews.wordpress.com/2009/11/17/the-role-of-standards-in-bibliosight/

3. Update on development of desk-top application

As emphasised at the last meeting, three discrete functional requirements of the desktop application (from now on referred to as Bib App) have been clearly identified:

• Retrieve records from WoS as XML
• Perform an appropriate XSLT transformation to LOM format suitable for ingest to intraLibrary
• Deposit LOM records into intraLibrary using SWORD

MT has been working primarily on stages 1 and 2 and has adopted a pragmatic approach, treating them as two discrete tasks before attempting to integrate the functionality in a single user interface. He has a desktop client that will take XML and perform an XSLT transformation so, once we have clarified the LOM format we require – see http://bibliosightnews.wordpress.com/2009/11/16/mapping-fields-from-wos-api-lom/ – it should be relatively straightforward to plug into the WoS API to retrieve XML from the Web of Science, which can then be transformed into appropriate LOM.
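To make the transformation stage a little more concrete, here is a minimal sketch of mapping one record onto a skeleton LOM document. It is not the Bib App code and it is not XSLT: it uses Python's standard library as a stand-in for the stylesheet. The input element names (record, ut, title, authors, keywords) and the "WoS-UT" catalog label are assumptions for illustration only, since the real WoS response schema and our final LOM mapping are still being clarified.

```python
import xml.etree.ElementTree as ET

LOM_NS = "http://ltsc.ieee.org/xsd/LOM"  # namespace of the IEEE LOM XML binding

# Hypothetical WoS-style record: element names are placeholders,
# not the real API response schema.
SAMPLE_RECORD = """
<record>
  <ut>000000000000001</ut>
  <title>An example article</title>
  <authors><author>Smith, J</author><author>Jones, A</author></authors>
  <keywords><keyword>repositories</keyword></keywords>
</record>
"""


def _langstring(parent, tag, text):
    """Append <tag><string language="en">text</string></tag> to parent."""
    wrapper = ET.SubElement(parent, f"{{{LOM_NS}}}{tag}")
    string = ET.SubElement(wrapper, f"{{{LOM_NS}}}string", {"language": "en"})
    string.text = text
    return wrapper


def record_to_lom(record_xml):
    """Map one (hypothetical) WoS record onto a minimal LOM element tree."""
    record = ET.fromstring(record_xml)
    lom = ET.Element(f"{{{LOM_NS}}}lom")
    general = ET.SubElement(lom, f"{{{LOM_NS}}}general")

    # Keep the Thomson Reuters UT as an identifier so the record can be traced.
    identifier = ET.SubElement(general, f"{{{LOM_NS}}}identifier")
    ET.SubElement(identifier, f"{{{LOM_NS}}}catalog").text = "WoS-UT"  # assumed label
    ET.SubElement(identifier, f"{{{LOM_NS}}}entry").text = record.findtext("ut")

    _langstring(general, "title", record.findtext("title"))
    for keyword in record.iterfind("keywords/keyword"):
        _langstring(general, "keyword", keyword.text)

    life_cycle = ET.SubElement(lom, f"{{{LOM_NS}}}lifeCycle")
    for author in record.iterfind("authors/author"):
        contribute = ET.SubElement(life_cycle, f"{{{LOM_NS}}}contribute")
        role = ET.SubElement(contribute, f"{{{LOM_NS}}}role")
        ET.SubElement(role, f"{{{LOM_NS}}}source").text = "LOMv1.0"
        ET.SubElement(role, f"{{{LOM_NS}}}value").text = "author"
        # LOM expects a vCard here; a bare name keeps the sketch short.
        ET.SubElement(contribute, f"{{{LOM_NS}}}entity").text = author.text
    return lom


ET.register_namespace("", LOM_NS)  # serialise with LOM as the default namespace
print(ET.tostring(record_to_lom(SAMPLE_RECORD), encoding="unicode"))
```

The point of the sketch is only that each WoS field needs an agreed home in LOM before the stylesheet can be written; the mapping itself is the open question.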

Deposit of the LOM into intraLibrary via SWORD should also be fairly straightforward – see – http://bibliosightnews.wordpress.com/2009/11/17/the-role-of-standards-in-bibliosight/ – however, in order to generate clean, consistent LOM, there are still a number of issues to be resolved.
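As a rough illustration of the deposit stage, the sketch below builds (but deliberately does not send) an HTTP POST of the kind a SWORD deposit uses. The header names reflect my reading of the SWORD 1.3 profile; the collection URL, the packaging identifier and the credentials are all made up, and what intraLibrary actually expects would need checking against its own SWORD documentation.

```python
import base64
import urllib.request


def build_sword_deposit(collection_url, payload, filename, packaging, username, password):
    """Build, without sending, a POST request for a SWORD-style deposit.

    All argument values used below are hypothetical placeholders.
    """
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/xml",
        "Content-Disposition": f"filename={filename}",
        "X-Packaging": packaging,  # tells the server how to interpret the payload
        "X-No-Op": "true",         # assumed dry-run flag, useful while testing
        "Authorization": f"Basic {credentials}",
    }
    return urllib.request.Request(collection_url, data=payload, headers=headers, method="POST")


request = build_sword_deposit(
    "http://repository.example.org/sword/deposit/research",  # hypothetical collection
    b"<lom/>",
    "record.xml",
    "http://example.org/packaging/lom",                      # hypothetical identifier
    "user",
    "secret",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a single call to urllib.request.urlopen(request); it is left out here so the sketch can be run without a repository to talk to.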

From a technical perspective, Mike is not a Java programmer* and is working very hard to master the language in order to implement an integrated UI that can unify these three discrete functional areas – the precise functionality of the Bib App will also be informed by developing use cases – see item 4 below.

*The WoS API is Java based which perhaps makes it less accessible than it could be – it may be that JISC wish to make recommendations to Thomson Reuters and others regarding the development of open web services APIs. See – http://blogs.ukoln.ac.uk/good-apis-jisc/

Action: NS/MT to continue to investigate issues around three functional areas

Action: MT to continue developing Bib App – development will necessarily take us beyond the formal end of jiscri projects at the end of November

4. Update on use cases

PD/NS have summarised our three use cases in some detail; these need writing up in full ASAP (Nick to action).

Particular issues that were identified include:

• In light of progress through the project, UC narratives need to be updated from the now outdated drafts proposed in the original bid
• UCs need to be fully itemised with an ‘actor’ clearly identified for each success scenario
• More thought needs to be given to extensions to each UC

There was particular discussion around UC_2, which centres on targeted communications to researchers to encourage deposit of an appropriate author-produced version of a recently published/cited article. It is clear that such a use case will need to identify the individual publisher’s copyright policy around deposit in an IR: if they do permit deposit, what restrictions / conditions do they impose? For example, a very common restriction is an embargo of 12 or 18 months, which would need to be incorporated into the workflow.
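The embargo step in that workflow is simple date arithmetic, sketched below. The function names and the dictionary returned are illustrative only; the embargo length itself would have to come from the publisher's policy (for example via SHERPA/RoMEO), which this sketch does not attempt to look up.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` after `start`, clamped to the end of the month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def embargo_status(published, embargo_months, today):
    """Work out when an embargoed full text could be released."""
    release = add_months(published, embargo_months)
    return {"release_date": release, "can_release": today >= release}


# An article published on 18 November 2009 under a 12 month embargo.
print(embargo_status(date(2009, 11, 18), 12, today=date(2010, 6, 1)))
```

In practice the metadata record could go into the repository straight away, with only the full text held back until the release date.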

Action: NS to explore use cases in more detail and write up in full.

5. JournalTOCsAPI workshop – 20th November 2009 – Nick attending

NS is attending a workshop being run by the JournalTOCsAPI project on Friday 20th November and has been invited to give a 15 minute presentation on Bibliosight.

The workshop has two main objectives:

1. To learn the techniques/methodologies that professionals managing repositories use to identify new content for their repositories and the potential benefits as well as the shortcomings that they have identified in the JournalTOCsAPI

2. To give an opportunity to repository managers and API developers to learn the thoughts of experts in institutional repositories for efficiently integrating and reusing up-to-date journal TOC RSS feeds within repository systems and forward looking research information systems.

Action: NS to attend and participate as required

6. Project management tasks – project evaluation

The project management task to be addressed on the blog will be project evaluation.

Action: NS/WL to liaise and post on project evaluation

7. Formal end of project

The formal end of the project in line with the jiscri programme is the end of November 2009, by which time we are confident we will have a detailed proof of concept for Bibliosight that is well documented on the blog. However, there is still a considerable amount to be done to implement a fully functional Bib App, which is a valuable outcome for the institution and the sector; work will therefore be ongoing beyond the end of the jiscri project, internal resources allowing.

8. A.O.B.

None

Posted in Bibliosight

Project meeting number 5: Draft agenda

Posted by Nick on November 16, 2009

Date of meeting:  Tuesday 17th November 2009

1. Apologies

2. Minutes from last meeting and actions

3. Update on development of desk-top application

4. Update on use cases

  • Identify new research in WoS on a regular basis (daily/weekly/monthly); retrieve available metadata associated with records – add to intraLibrary
  • Identify new research in WoS on a regular basis (daily/weekly/monthly); check copyright/SHERPA-RoMEO; generate targeted email

5. JournalTOCsAPI workshop – 20th November 2009 – Nick attending

6. Project management tasks – project evaluation

7. Formal end of project

8. A.O.B.

Posted in Agenda

Quick sketch #2

Posted by Nick on November 13, 2009

The diagram below is Arthur’s update of my earlier quick sketch to illustrate what Bibliosight will aim to achieve by the formal #jiscri deadline.

It is numbered and colour coded – stages 1 – 3 (shades of blue) are within the #jiscri timeframe; stages 4 (green) & 5 (buff) will require ongoing work beyond the deadline.

[Diagram: Bibliosight]

Posted in Bibliosight

Thinking out loud…

Posted by Nick on November 11, 2009

As the deadline for #jiscri draws close I have just returned to work after a month away from Bibliosight and I’m now desperately trying to catch up with the project and determine exactly what we can aim to achieve by the end of November. The candid truth is that we have only very recently got to the point where Mike can actually do some coding and begin to put together a prototype that fulfills the requirements of our (still formative) use-case[s].

Yesterday morning I had a stab at completing a more detailed template for a primary use-case (this comprises a narrative and the use case itself). In the afternoon I sat down with Mike to catch up with his progress from a technical perspective and to brainstorm around precisely what functions we require from our prototype and how this may be achieved. There are also some outstanding issues of clarity pertaining to Thomson Reuters’ API documentation, specifically “WoS Search Retrieve Codes and Descriptions”: we currently have unrestricted access to the API but it is my understanding that the free* service will actually be restricted.  We are not certain:

a)  Precisely which of the fields are associated with the restricted subset that we will be able to query and/or return under the current terms of our WoS subscription*

b)  What some of the fields actually are as they lack a description in the documentation

*Free to us under existing subscription

Disclaimer:  I’m very much thinking out loud here and attempting to translate what I understand are ongoing conceptual issues for Mike as he works through the documentation.

Note:  I’ve continued to refer to ResearcherID – see http://bibliosightnews.wordpress.com/2009/10/02/visit-from-thomson-reuters/ – though it is not a service we plan on implementing as part of Bibliosight, and not necessarily even in the longer term. I’m pretty sure, however, that we are likely to require some sort of unique identifier for authors – a subject that is currently receiving a lot of attention from the repository community.

Anyway…looking back over the blog it seems that:

The requesting system can query the Web of Science using the following fields:

  • Address (including Street, City, Province, Zip Code, or Country)
  • Author
  • Conference (including title, location, date, and sponsor)
  • Group Author
  • Organization or Sub-organization
  • Source Publication (journal, book or conference)
  • Title
  • Topic
  • Year Published

The service will support the AND, OR, NOT, and SAME Boolean operators.
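To show how the fields and operators above might combine, here is a small sketch that assembles a query string. The two-letter field tags are an assumption: they are the ones used by the WoS advanced search, and I have not confirmed that the web service accepts exactly the same syntax. The helper names are mine.

```python
# Assumed field tags (as in WoS advanced search); not confirmed for the API.
FIELD_TAGS = {
    "address": "AD",
    "author": "AU",
    "group_author": "GP",
    "organization": "OG",
    "source": "SO",
    "title": "TI",
    "topic": "TS",
    "year": "PY",
}
OPERATORS = {"AND", "OR", "NOT", "SAME"}


def _check(operator):
    if operator not in OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")


def clause(field, *terms, operator="OR"):
    """Build one field clause, e.g. AD=("leeds metropolitan univ" OR lmu)."""
    _check(operator)
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    joined = f" {operator} ".join(quoted)
    return f"{FIELD_TAGS[field]}=({joined})"


def combine(operator, *clauses):
    """Join complete clauses with a Boolean operator."""
    _check(operator)
    return f" {operator} ".join(clauses)


query = combine(
    "AND",
    clause("address", "leeds metropolitan univ", "leeds met univ"),
    clause("year", "2009"),
)
print(query)
```

Keeping the query builder separate from the code that calls the service means the query can be logged and checked by hand in the WoS search interface before it is ever sent through the API.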

The Web of Science Web Service returns five fields to the requesting system:

  • Article Title
  • Authors — All authors, book authors, and corporate authors
  • Source — Includes the source title, subtitle, book series and subtitle, volume, issue, special issue, pages, article number, supplement number, and publication date
  • Keywords — all author supplied keywords
  • UT — A unique article identifier provided by Thomson Reuters

The test queries that Mike has submitted to the API have returned XML that appears to be both more granular than indicated and that includes fields other than those that constitute these five (e.g. abstract) so the first thing to do, perhaps, is to contact Thomson Reuters and see if they can apply the restrictions that we will ultimately need to work with, if only to remove some of the noise and make it easier to see the wood for the trees.

The API documentation actually lists over 100 “fields”; only a handful of these are actually described in the documentation, however, and while many are reasonably transparent, others are a little less so and some look like they may duplicate information – or are they perhaps used as alternatives? (e.g. bib_id = Volume, issue, special, pages and year data / bib_issue = Volume and year data).  There is also some lack of consistency in this bibliographic info on a record by record basis; we need to ensure that we have consistent XML being returned for all records. Hopefully we can then develop a template in intraLibrary itself that reflects that consistent XML as closely as possible, such that we can devise an XSLT style-sheet to perform the appropriate transformation.
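One way to get a handle on the consistency problem is to report, record by record, which expected fields are missing and which unexpected ones turn up. The sketch below does this for a made-up response; the element names and the wrapper element are assumptions, not the real WoS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names standing in for the five documented fields.
EXPECTED_FIELDS = {"title", "authors", "source", "keywords", "ut"}

SAMPLE_RESPONSE = """
<records>
  <record><ut>A1</ut><title>First</title><authors/><source/><keywords/></record>
  <record><ut>A2</ut><title>Second</title><authors/><source/><abstract>text</abstract></record>
</records>
"""


def field_report(records_xml):
    """List the missing and unexpected child elements of each record."""
    root = ET.fromstring(records_xml)
    report = {}
    for index, record in enumerate(root.iterfind("record")):
        present = {child.tag for child in record}
        key = record.findtext("ut") or f"record-{index}"
        report[key] = {
            "missing": sorted(EXPECTED_FIELDS - present),
            "extra": sorted(present - EXPECTED_FIELDS),
        }
    return report


for key, result in field_report(SAMPLE_RESPONSE).items():
    print(key, result)
```

Run over a few hundred real records, a report like this would show fairly quickly which fields the stylesheet can rely on and which need a fallback.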

Mike already has a desktop client that will take XML and perform an XSLT transformation so, once we have clarified the LOM format we require (an action for me from the last meeting), it *should* be relatively straightforward to plug into the WoS API to retrieve XML from the Web of Science which can then be transformed into appropriate LOM.

Then we need to ingest that LOM into intraLibrary, preferably using SWORD…which I shall think about another time!

Posted in Progress post

Notes from the October meeting

Posted by wendyluker on November 11, 2009

Minutes of the Bibliosight Meeting

Tuesday 20th October 2009

1.  Apologies

Nick, Sue, Babita

2.  Minutes of the last meeting, and actions

Actions :

WL /NS to pursue academic contacts for a representative – this has been on-going, but at this stage of the project it seemed unlikely that we would now get a representative.  Academic staff / researchers to be involved in evaluating the outcomes of the project.

PD to clarify upload of XML to intraLibrary including LOM extensions – Peter confirmed that this could be done.

NS/BB/SR to meet with another member of the URO to clarify potential use cases: Wendy reported that Nick had met with Sue Rooke and Sam Armitage, and work had been done on use cases.  Nick would be able to clarify this on his return to work.

PD to contribute blog post on technical standards : on-going.
New action: Wendy to send Peter the required tags for the post.

All team members to contribute to on-going discussion on the blog – reiterated!

3. Update on meeting with Thomson Reuters

Mike updated the group on the meeting with Thomson Reuters.  We have access to the unrestricted API, but we are not entitled to use it to a greater extent than would be provided by the Web Services Lite version.  It appeared that the 100 record limit may not be an issue after all; in any case, if we download the initial set of records year by year then the limit should not present a problem.  Wendy and Arthur reported on some testing of the Web of Science search interface that they had been doing to check whether the ‘Leeds Metropolitan Univ’ search would be sufficiently robust, and it appeared to be so.
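The year by year approach can be sketched as a simple batching plan: for each year, work out which slices of at most 100 records to request. The per-year counts below are invented for the example, and the (year, first record, count) shape of a request is an assumption about how paging would be expressed.

```python
def plan_batches(counts_by_year, batch_size=100):
    """Yield (year, first_record, count) tuples covering every record.

    `counts_by_year` maps a publication year to the number of matching
    records; `first_record` is 1-based.
    """
    for year in sorted(counts_by_year):
        total = counts_by_year[year]
        for first in range(1, total + 1, batch_size):
            yield year, first, min(batch_size, total - first + 1)


# Invented counts, purely for illustration.
for batch in plan_batches({2008: 250, 2009: 40}):
    print(batch)
```

Any year with 100 or fewer records needs a single request, so the limit only matters for the busiest years.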

We will need to display WofS / Thomson Reuters terms and conditions alongside any material retrieved from WofS.  There is a place in LOM for this.
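The place in LOM is the rights category, which has a free text description element. A minimal sketch of adding an acknowledgement to an existing LOM record follows; the wording of the statement is a placeholder, not Thomson Reuters' actual terms, and the function name is mine.

```python
import xml.etree.ElementTree as ET

LOM_NS = "http://ltsc.ieee.org/xsd/LOM"  # namespace of the IEEE LOM XML binding


def add_source_acknowledgement(lom, statement):
    """Record a data-source statement in the LOM rights/description element."""
    rights = lom.find(f"{{{LOM_NS}}}rights")
    if rights is None:
        rights = ET.SubElement(lom, f"{{{LOM_NS}}}rights")
    description = rights.find(f"{{{LOM_NS}}}description")
    if description is None:
        description = ET.SubElement(rights, f"{{{LOM_NS}}}description")
    string = ET.SubElement(description, f"{{{LOM_NS}}}string", {"language": "en"})
    string.text = statement
    return lom


lom = ET.Element(f"{{{LOM_NS}}}lom")
# Placeholder wording only; the real text must come from the licence terms.
add_source_acknowledgement(lom, "Bibliographic data sourced from Web of Science.")
ET.register_namespace("", LOM_NS)
print(ET.tostring(lom, encoding="unicode"))
```

Because the statement is added at transformation time, every record that comes from WoS carries it automatically, whatever is done to the record later in intraLibrary.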

4. Update on Use Cases

The use cases will be a useful output of the project, and need further work at this stage, e.g. we need to ensure we capture the information around the intended alerting service: at what point will individuals be alerted? Where will the alert come from?
More work also needed on cataloguing workflows, and how we will deal with the initial 1485 items that will be downloaded.

5. API – next steps in the development

Mike updated the group on progress with the API.

At this stage we can:

  • Get records out of WofS
  • Transform them into XML
    Action Nick: what is the LOM XML?
  • Load them into intraLibrary

Mike needed several decisions to be made before he could progress further:

Would the process for downloading be manual or automated? MANUAL

Would the client be desktop or web based: DESKTOP

It was also decided that the XSLT should be easily swapped out so that it can be output in different formats, i.e. to other interfaces, whether they be Endnote, for example, or another repository.  This would be of benefit to the rest of the community.
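The idea of a swappable output stage can be sketched with plain functions standing in for stylesheets (this is not XSLT, just an illustration of the design). One formatter produces RIS, a tagged text format that EndNote can import; the other produces a simple citation line. The record fields and the registry are hypothetical.

```python
def to_ris(record):
    """Render one record as an RIS entry."""
    lines = ["TY  - JOUR", f"TI  - {record['title']}"]
    lines += [f"AU  - {author}" for author in record["authors"]]
    lines += [f"JO  - {record['journal']}", f"PY  - {record['year']}", "ER  - "]
    return "\n".join(lines)


def to_citation(record):
    """Render one record as a single human readable line."""
    authors = "; ".join(record["authors"])
    return f"{authors} ({record['year']}) {record['title']}. {record['journal']}."


# Registry of output formats; adding a format means adding one entry here.
FORMATTERS = {"ris": to_ris, "citation": to_citation}


def export(record, output_format):
    """Look up the requested formatter and apply it to the record."""
    try:
        formatter = FORMATTERS[output_format]
    except KeyError:
        raise ValueError(f"No formatter registered for {output_format!r}") from None
    return formatter(record)


SAMPLE = {
    "title": "An example article",
    "authors": ["Smith, J", "Jones, A"],
    "journal": "Journal of Examples",
    "year": 2009,
}
print(export(SAMPLE, "ris"))
```

With real stylesheets the registry would map a format name to an XSLT file rather than a function, but the retrieval code would stay the same either way.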

The group discussed the diagram that Nick had put up on the Blog recently, with regard to the intended scope of the current project, and which tasks might be part of further developments.

Action: Arthur to update the diagram to make it clear what would be achieved by the end of November (encompassing the intended outputs of the original project) and what the future developments might be.

6. Project management tasks: technical standards and value add

The next of the project management tasks to be addressed on the blog would be day to day work.

Action: Nick on his return

Peter would supply a blog on technical standards

Action: Wendy to send Peter the appropriate tags.

7. Other business

There was no other business

8. Date and time of next meeting

The next meeting will be held on Tuesday 17th November, starting at 1pm.
Peter will arrive at approx. 11am for a pre-meeting with Nick (and others) about use cases.

Posted in Bibliosight, SCRUM minutes

Bibliosight meeting: Tuesday 20th October

Posted by wendyluker on October 19, 2009

The agenda for the meeting tomorrow is as follows:

1. Apologies

2. Minutes of the last meeting, and actions

3. Update on meeting with Thomson Reuters

4. Update on Use Cases

5. API: next steps in the development

6. Project management tasks:  Value Add and Technical Standards

7. Any other business

[Lunch]

Posted in Agenda

Visit from Thomson Reuters

Posted by Nick on October 2, 2009

On Wednesday afternoon Mike and I were finally able to sit down with Jon and…Gareth? (sorry, I’m terrible with names) from Thomson Reuters to discuss Bibliosight and the work we are doing with the WoS API. It probably goes without saying just how useful this was, especially so soon after our Tuesday meeting.

As we have come to appreciate, Thomson are still very much in an ongoing process of developing their suite of tools and commercial services around the extraction of data from WoS using their API. Overall, I was given the impression that the company are currently practising something of a balancing act to weigh their commercial interests against providing appropriate value added services to their subscribers under existing licensing agreements – which is, of course, entirely reasonable.  Jon suggested that the Bibliosight project is something of a pioneer in using this technology and a useful case-study for the company, which certainly puts some of our early difficulties into context – though he did indicate that numerous other folk are also actively investigating the API; in particular he mentioned Queens College Belfast, an institution in Birmingham and R4R at Kings College London in collaboration with EPrints’ Les Carr at Soton.  R4R is the only project that I was hitherto aware of and have had any contact with; it would be really useful if we were able to communicate with others also using the API.

Thomson Reuters’ flagship commercial product is called InCites and “supplies all the data and tools you need to easily produce targeted, customized reports… all in one place. You can conduct in-depth analyses of your institution’s role in research, as well as produce focused snapshots that showcase particular aspects of research performance.” We discussed how, though such a service will be invaluable for the research oriented Russell Group institutions, it is likely to be overkill for a million plus institution like Leeds Met; nevertheless we do require a certain level of functionality to help us analyse our research performance which, alongside our traditional strengths in teaching and learning, is increasingly important, especially in view of the REF.

Hopefully this is where the developing ‘suite of tools’ comes in and our guests were keen to get a handle on precisely what we are hoping to achieve with Bibliosight (aren’t we all!).  I outlined our preliminary use-cases for them as a foundation for our discussion and was also keen to ask some of the specific questions that had arisen during the previous day’s meeting.

First of all I asked about the wording of the documentation that appears to suggest that it is only possible to return 100 records with a single query using the API – they weren’t aware of such an issue and agreed that the way it was expressed in the documentation was a little ambiguous; Jon will follow this up for us though Mike may also be able to elucidate the situation when he has investigated further.  They were able to say that another user had discovered that the API could be called twice every second, however, so they didn’t anticipate any problems with extracting all the data we need.
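If the twice-a-second figure does hold, a client would want to pace its own calls rather than find the limit by trial and error. Below is a minimal throttle sketch; the class and parameter names are mine, the rate is the anecdotal one reported to us rather than a documented limit, and the clock and sleep functions are injectable so it can be tried without waiting.

```python
import time


class Throttle:
    """Space out calls so that no more than `calls_per_second` are made."""

    def __init__(self, calls_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(calls_per_second=2.0)
for batch_number in range(3):
    throttle.wait()  # the API call for this batch would go here
    print("requesting batch", batch_number)
```

At two calls a second and 100 records a call, even a few thousand records would take well under a minute, which supports the view that volume is unlikely to be the problem.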

The major issue that came up at the meeting on Tuesday was how best to return all of the articles for a given institution, with the most appropriate field to query apparently being the address field.  It is not clear, however, how consistent the institutional address actually is. Jon confirmed that it is derived from information harvested from individual journals/papers, which preliminary manual searching of WoS has already demonstrated to be idiosyncratic – at least in the case of Leeds Metropolitan University and almost certainly other institutions as well (leeds metropolitan university; leeds met [uni]; lmu etc).  Jon suggested that the safest and most effective method of returning all records would actually be by using ResearcherID, though this would require all institutional authors to be registered and an additional paid subscription to ResearcherID download (as opposed to upload, which is free). In lieu of this, however, he did confirm that the address field was the only way and that it may be necessary to build a catch-all query to ensure that we don’t miss anything. Precisely how we achieve this is still a little bit of a moot point, though he did indicate that some work has been done on disambiguating institutional address formats within WoS and will follow up on this for us in due course.
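As a way of testing how well a catch-all would behave, the variants mentioned above can be expressed as a pattern and run against address strings from returned records. This is a sketch only: the pattern covers just the variants listed in this post, the sample addresses are invented, and a bare "lmu" could plausibly match other institutions, so anything it catches would still need a human check.

```python
import re

# Covers: leeds metropolitan university / univ / uni, leeds met [uni], lmu.
LEEDS_MET_PATTERN = re.compile(
    r"\b(leeds\s+met(ropolitan)?(\s+uni(v(ersity)?)?)?|lmu)\b",
    re.IGNORECASE,
)


def is_leeds_met(address):
    """Return True if the address looks like a Leeds Met variant."""
    return bool(LEEDS_MET_PATTERN.search(address))


# Invented address strings for illustration.
for address in [
    "Leeds Metropolitan Univ, Fac Hlth, Leeds, W Yorkshire, England",
    "Leeds Met Uni, Leeds, England",
    "LMU, Carnegie Fac, Leeds, England",
    "Univ Leeds, Sch Med, Leeds, W Yorkshire, England",
]:
    print(is_leeds_met(address), address)
```

A check like this is most useful in reverse: run a deliberately broad query, then see which returned addresses the pattern fails to recognise, and add those variants to the query.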

Through our discussion, Article Match Retrieval is finally beginning to make more sense to me, and Jon confirmed that this is the method that would be used in conjunction with the API to provide numbers of citations to an individual article. AMR can be queried by numerous fields including DOI and UT identifier (a unique identifier for a journal article assigned by Thomson Reuters). In terms of the current project, I think it makes sense to focus initially on extracting bibliographic data before worrying about citation metrics; via the API, we can also extract the UT identifier and then use this to query AMR.

We also touched on Terms & Conditions and Thomson, again reasonably, expect WoS as data source to be clearly acknowledged on each individual record. Mike wasn’t initially certain how this could easily be achieved from a technical perspective, at least in the case of bibliographic citation information (which may have been added manually); we have a few ideas on how this could actually be achieved but it is really just something to be aware of at this stage.

All in all I now feel that the overall shape of the project is beginning to be resolved and, in addition to the technical work required to extract, store, parse, convert (XML) records and then pass them somewhere else (intraLibrary/EndNote), a large part of Bibliosight will necessarily focus on developing use-cases for our institutional research administration, which is likely to continue well beyond the designated 6 month life-cycle of the #jiscri project!

Posted in Progress post, Research Excellence Framework, Thomson Reuters Research Analytics

Quick sketch

Posted by Nick on October 1, 2009

I’ve just had a meeting with Sue and Sam (from URO) about use cases.  It was useful, as always, to put our heads together and we came up with a quick sketch of a potential infrastructure for Bibliosight that we would like feedback on:

A quick sketch


N.B.  This assumes it is possible to programmatically bulk upload XML records to EndNote – I have no idea if this is possible – and to intraLibrary which should be possible based on discussion at the meeting on Tuesday.

The sketch arose from two things: research administrators currently use EndNote, which they wish to continue using as dedicated citation management software for its high level functionality, and we need to simplify their workflow rather than expect them to add records to two separate systems.  Such an approach could also inform our developing use cases to a) auto-populate the repository with metadata from WoS and b) alert repository/research administrators when an article is published so we can pursue an appropriate full text version for the repository.

Posted in Use cases