Title: Matching bibliographic data from publication lists with large databases using N-Grams
Other Titles: MSI Working Paper
Authors: Abdulhayoglu, Mehmet Ali
Thijs, Bart
Jeuris, Wouter
Issue Date: Jun-2014
Publisher: KU Leuven - Faculty of Economics and Business
Series Title: FEB Research Report MSI_1413 vol:MSI_1413
Abstract: This paper presents a text-matching process for identifying scholarly publications, extracted from publication lists provided by authors or research institutes, and correctly assigning them to their records in large bibliographic databases such as Thomson Reuters’ Web of Science (WoS). Candidate matches are identified by means of overlapping common 3-grams, and the record with the highest score on the applied cosine measure is taken as the match between the two sources. Levenshtein similarities based on N-grams are used as a complementary and confirmatory measure of the closeness between the given CV publication and the best WoS match retrieved. The results show that the suggested method has considerable potential to reduce the manual effort needed to determine whether a desired publication is indexed in WoS. The similarity scores derived from the Levenshtein measure are consistent with those derived from Salton’s similarity measure. Incorrect matches are examined in depth, and possible thresholds are suggested to reduce the effort required for manual cleaning.
Publication status: published
KU Leuven publication type: IR
Appears in Collections: Department of Managerial Economics, Strategy and Innovation (MSI), Leuven
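
The matching approach summarised in the abstract (3-gram overlap scored with the cosine measure, then a Levenshtein-based check on the best candidate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace and lower-case normalisation, and the use of plain character-level Levenshtein distance (rather than the N-gram-based variant the paper refers to) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased, whitespace-collapsed string."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine_similarity(a, b):
    """Cosine (Salton) measure between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(s, t):
    """Map edit distance to [0, 1]; assumed normalisation by the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def best_match(cv_entry, candidates, n=3):
    """Return (candidate, cosine score, Levenshtein similarity) for the top-scoring candidate."""
    query = char_ngrams(cv_entry, n)
    scored = [(cosine_similarity(query, char_ngrams(c, n)), c) for c in candidates]
    score, match = max(scored)  # raises ValueError if candidates is empty
    return match, score, levenshtein_similarity(cv_entry.lower(), match.lower())
```

Calling `best_match(cv_title, list_of_database_titles)` returns the candidate with the highest cosine score together with both similarity values, which could then be compared against a threshold to decide whether manual checking is needed. Any such threshold is left to the reader; the values suggested in the paper are not reproduced here.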

Files in This Item:
File          Description  Status     Size   Format
MSI_1413.pdf  MSI_1413     Published  356Kb  Adobe PDF

