In pursuit of better performance, current speech recognition systems tend to use ever more complicated models for both the acoustic and the language component. Cross-word context-dependent (CD) phone models and long-span statistical language models (LMs) are now widely used. In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one-pass time-synchronous recognition system. Characteristic of our approach is (1) the decoupling of the two basic knowledge sources, namely pronunciation information and LM information, and (2) the representation of the pronunciation information (the lexicon in terms of CD units) by means of a compact static network. The LM information is incorporated into the search at run-time by means of a slightly modified token-passing algorithm. The decoupling of the LM and the lexicon allows great flexibility in the choice of LMs, while the static lexicon representation avoids the cost of dynamic tree expansion and facilitates the integration of additional pronunciation information such as assimilation rules. Moreover, the network representation results in a compact structure when words have multiple pronunciations, and, due to its construction, it offers partial LM forwarding at no extra cost.
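The decoupling described above can be illustrated with a minimal sketch of token passing over a static lexicon network, where the LM is consulted only when a token crosses a word-final arc. All data structures and scores below (the toy network, the stand-in acoustic and LM scores) are hypothetical illustrations, not the authors' actual implementation.

```python
import math

# Hypothetical static lexicon network: node -> list of (label, next_node).
# Acoustic arcs carry CD-unit labels; word-final arcs carry "WORD:" labels
# and re-enter the network root, so the network itself stays LM-independent.
ARCS = {
    0: [("k", 1), ("t", 4)],
    1: [("ae", 2)],
    2: [("t", 3)],
    3: [("WORD:cat", 0)],
    4: [("uw", 5)],
    5: [("WORD:two", 0)],
}

def acoustic_score(unit, frame):
    """Deterministic stand-in for an acoustic log-likelihood."""
    return -((len(unit) + frame) % 3) / 10.0

def lm_score(history, word):
    """Stand-in LM: uniform log-probability over a toy vocabulary."""
    return math.log(0.5)

def decode(num_frames):
    # A token is (log score, word history); keep one best token per node.
    tokens = {0: (0.0, ())}
    for frame in range(num_frames):
        new_tokens = {}
        # Acoustic transitions consume one frame.
        for node, (score, history) in tokens.items():
            for label, nxt in ARCS[node]:
                if label.startswith("WORD:"):
                    continue
                s = score + acoustic_score(label, frame)
                if nxt not in new_tokens or s > new_tokens[nxt][0]:
                    new_tokens[nxt] = (s, history)
        # Word-final arcs are followed without consuming a frame;
        # the LM is applied only here, at run-time.
        for node, (score, history) in list(new_tokens.items()):
            for label, nxt in ARCS[node]:
                if label.startswith("WORD:"):
                    word = label[5:]
                    s = score + lm_score(history, word)
                    if nxt not in new_tokens or s > new_tokens[nxt][0]:
                        new_tokens[nxt] = (s, history + (word,))
        tokens = new_tokens
    return max(tokens.values())

best_score, best_words = decode(6)
print(best_score, best_words)
```

Because the LM enters only through `lm_score` at word boundaries, the lexicon network can be built once and reused unchanged with any LM, which is the flexibility the decoupling is meant to provide.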
Demuynck, K., Duchateau, J., Van Compernolle, D., and Wambacq, P., "An efficient search space representation for large vocabulary continuous speech recognition," Speech Communication, vol. 30, no. 1, pp. 37-53, January 2000.