Title: Making a large treebank searchable online. The SoNaR case
Authors: Vandeghinste, Vincent; Augustinus, Liesbeth
Issue Date: 31-May-2014
Publisher: ELRA
Host Document: Proceedings of the LREC2014 2nd workshop on Challenges in the Management of Large Corpora (CMLC-2), pages 15-20
Conference: Workshop on Corpus Management of Large Corpora at International Conference on Language Resources and Evaluation, edition 2, Reykjavik, May 2014
Abstract: We describe our efforts to scale up a syntactic search engine from a 1-million-word treebank of written Dutch text to a treebank of 500 million words, without increasing the query time by a factor of 500. This is not a trivial task. We have adapted the architecture of the database to allow querying the syntactic annotation layer of the SoNaR corpus in reasonable time. We reduce the search space by splitting the data into many small databases, each of which links similar syntactic patterns with sentence identifiers. Because we know which databases a given XPath query has to be applied to, we aim to reduce query times.
Description: no issn
ISBN: 978-2-9517408-8-4
Publication status: published
KU Leuven publication type: IC
Appears in Collections: Formal and Computational Linguistics (ComForT), Leuven

Files in This Item:
File: LREC2014-GrETELSoNaR.pdf
Description: LREC paper 2014
Status: Published
Size: 291Kb
Format: Adobe PDF

