Proceedings of the LREC2014 2nd workshop on Challenges in the Management of Large Corpora (CMLC-2) pages:15-20
Workshop on Corpus Management of Large Corpora at International Conference on Language Resources and Evaluation edition:2 location:Reykjavik date:may 2014
We describe our efforts to scale up a syntactic search engine from a 1-million-word treebank of written Dutch text to a treebank of 500
million words without increasing query time by a factor of 500. This is not a trivial task. We have adapted the architecture of the
database so that the syntactic annotation layer of the SoNaR corpus can be queried in reasonable time. We reduce the search space by
splitting the data into many small databases, each of which links similar syntactic patterns to sentence identifiers. By knowing in advance
which databases an XPath query must be applied to, we aim to reduce query times.
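
The partitioning idea can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the toy treebank, the `node`/`cat` XML layout, the choice of parent-child category pairs as the "syntactic pattern", and the function names `build_pattern_index` and `search` are all assumptions made for illustration. An in-memory dictionary stands in for the many small databases.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy treebank: sentence identifier -> parse tree (hypothetical XML layout).
TREEBANK = {
    "s1": '<node cat="smain"><node cat="np"/><node cat="pp"/></node>',
    "s2": '<node cat="smain"><node cat="np"><node cat="pp"/></node></node>',
    "s3": '<node cat="smain"><node cat="np"/></node>',
}


def build_pattern_index(treebank):
    """Map each (parent category, child category) pair to the sentences containing it.

    Each key plays the role of one small database in this sketch.
    """
    index = defaultdict(set)
    for sent_id, xml in treebank.items():
        root = ET.fromstring(xml)
        for parent in root.iter("node"):
            for child in parent.findall("node"):
                index[(parent.get("cat"), child.get("cat"))].add(sent_id)
    return index


def search(treebank, index, pattern, xpath):
    """Run the XPath query only on sentences whose index entry matches the pattern."""
    hits = []
    for sent_id in sorted(index.get(pattern, ())):
        root = ET.fromstring(treebank[sent_id])
        if root.findall(xpath):
            hits.append(sent_id)
    return hits


index = build_pattern_index(TREEBANK)
# Only sentences indexed under ("np", "pp") are parsed and queried.
result = search(TREEBANK, index, ("np", "pp"), ".//node[@cat='np']/node[@cat='pp']")
print(result)
```

In this sketch the query for a prepositional phrase directly under a noun phrase touches one candidate sentence instead of all three; the saving in a real setting depends on how selective the chosen patterns are.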