Best Thesis - 2009

Lexical Syntax for Statistical Machine Translation

-Hany Hassan


Statistical Machine Translation (SMT) is by far the dominant paradigm in Machine Translation, for reasons that include accuracy, scalability, computational efficiency and fast adaptation to new languages and domains. However, current Phrase-based SMT approaches lack the ability to produce more grammatical translations and to handle long-range reordering while maintaining the grammatical structure of the translation output. Recently, SMT researchers have started to extend Phrase-based SMT systems with syntactic knowledge; however, previous techniques are limited because they introduce redundantly ambiguous syntactic structures and rely on decoders with limited language models and a high computational cost.

In this thesis, we extend Phrase-based SMT with lexical syntactic descriptions that localize global syntactic information on the word without introducing redundant syntactic ambiguity. We present a novel model of Phrase-based SMT which integrates linguistic lexical descriptions (supertags) into the target language model and the target side of the translation model. We conduct extensive experiments on two language pairs, Arabic–English and German–English, which show significant improvements over state-of-the-art Phrase-based SMT systems.

Moreover, we introduce a novel Incremental Dependency-based Syntactic Language Model (IDLM), based on wide-coverage incremental CCG parsing, which we integrate into a direct translation SMT system. Our proposed approach is the first to integrate full dependency parsing into SMT systems at a very attractive computational cost, since it uses the linear decoders widely used in Phrase-based SMT systems. The experimental results show a good improvement over a top-ranked state-of-the-art system.