Machine Translation (MT) and Computer-Aided Translation (CAT)
Prof. Christian Boitet (Université Joseph Fourier, Grenoble, France)
Learn about essential aspects of automatic and automated translation. The first lecture will introduce several fundamental facts about MT, MT systems, and their variety. In particular, it will show that it is far too simplistic to divide MT systems into "statistical" and "rule-based", and that the popular belief among researchers that all operational MT systems are now statistical is totally erroneous. The next three lectures will present a deeper analysis of the linguistic, computational and operational (engineering) architectures of MT systems. The last lectures will go deeper into two important questions: how to evaluate MT "in operation", and how to model and handle "parallel corpora for modern MT", which (will) have to be multilingual, multi-annotated, and multimedia.
Christian Boitet is a full professor of computer science and NLP at Université Joseph Fourier (Grenoble, France). He learned MT with Prof. Bernard Vauquois, a famous pioneer and inventor in this field. He has been director of GETA (Study Group on MT) since 1985. His research activity concerns all theoretical, methodological and operational aspects of MT (Machine Translation), extended to other parts of NLP. He has supervised 38 PhD and State Doctoral Thesis students through to their defense, and currently advises or co-advises 11 PhD students, 2 of them at NII and 1 at Kyodai.

Although his main training was in mathematics and computer science, he has a degree in Russian and has studied many languages, in particular German, English, Spanish, Italian, Hungarian, Malay and Japanese, to attain proficiency or simply to understand their specificities and be able to work on them with linguists or translators. He has tried to study all MT systems and approaches from the very beginning of the field, down to technical details of tools, techniques, and resource building. He is one of the authors of Ariane-G5, GETA's generic environment for developing "rule-based" fully automatic MT systems. He participated in and led research on "static grammars" (Vauquois & Chappuy 1983, 1985), a formalism for semi-formal specification of string-tree correspondences that brought lingware engineering closer to software engineering. He has also done or led research on several other MT paradigms, such as interactive MT (DBMT, LIDIA project), translation memories, example-based MT (in particular, S-SSTC model of Tang, USM), analogical MT (with Y. Lepage at ATR), and probabilistic MT (for speech and more recently for small sublanguages such as classified ads sent by SMS).

He has presented papers at many national and international conferences and published in various journals and books (about 200 publications). He has also edited a volume presenting Prof. Vauquois' scientific work and the proceedings of DBMT-90, COLING-92, and MIDDIM-96. He has organized workshops on Dialog-Based MT and multimodal interactive disambiguation, and co-organized the COLING-92 international conference. In 1998, he chaired the program committee of COLING-ACL'98, with P. Whitelock as ACL co-chair. He is frequently asked to serve as a reviewer (about 30 conference papers and 2-3 journal papers each year).

He has been principal investigator for several industrial research contracts (about 150 contract reports). At the international level, he has also participated in or led GETA's involvement in several cooperative research efforts. He has been an invited or visiting researcher in several laboratories, notably TAUM (Montréal, 1 year), MGPIIA (Moscow), SFB-100 (Saarbrücken), UTMK (Penang, 6 months in total), KDD (Tokyo), NII (Tokyo), and ATR (almost 2 years in total), where he learned about speech translation before starting it in France in 1995.

His current research interests include multilingual interpersonal communication over networks (UNL project); machine-assisted human translation; speech translation; personal dialog-based MT for monolingual authors; portable and readable encodings for multilingual documents; computational tools for post-editors and occasional translators; contributive multilingual lexical data bases; specialized languages and environments for linguistic engineering and research; contributive construction of NLP resources; participative translation; and task-related MT evaluation. He is now working on Ariane-Y, a new "MT factory" able to integrate expert (rule-based) as well as empirical (SMT, EBMT) phases.
National Institute of Informatics, 20F, Lecture Room
This lecture will center on the following points.
- The various tasks of MT and their difficulty.
- The CxAxQ "MT theorem" (Coverage x Automaticity x Quality << 100% while 2 factors can approach 100%) and the intrinsic difficulty of HQ translation.
- The independence of linguistic and computational architectures of MT systems. That claim will be supported by numerous examples of MT systems adopting 1 of the 13 existing linguistic architectures.
- Some aspects of the operational architecture of MT systems and their influence on the choice of linguistic and computational architectures.
- The current evolution towards "deeper" linguistic architectures and hybrid computational architectures, coupled with user involvement.
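The C x A x Q relation stated above can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name `caq_product` and all three numeric profiles are invented for this example and do not come from the lecture or from any measured system. It shows how the product stays far below 100% even when two of the three factors are close to 100%.

```python
def caq_product(coverage: float, automaticity: float, quality: float) -> float:
    """Return Coverage x Automaticity x Quality, each given as a fraction in [0, 1]."""
    for name, value in (("coverage", coverage),
                        ("automaticity", automaticity),
                        ("quality", quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return coverage * automaticity * quality


# Hypothetical profiles (made-up numbers, for illustration only).
profiles = {
    "broad coverage, fully automatic": (0.95, 1.00, 0.50),
    "narrow sublanguage, fully automatic": (0.10, 1.00, 0.95),
    "broad coverage, heavy human revision": (0.95, 0.30, 0.95),
}

for label, (c, a, q) in profiles.items():
    print(f"{label}: C x A x Q = {caq_product(c, a, q):.2f}")
```

In each made-up profile two factors are high, and the third one pulls the product down.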
This lecture will center on the following points.
- Characteristics of various possible intermediate representations.
- Monolevel and multilevel structures.
- Units of translation: segments, infrasegments, supersegments, whole documents?
- Different sorts of "deep pivots" (hybrid, semantico-linguistic, semantico-pragmatic).
- Pros and cons of various linguistic architectures w.r.t. "translational situation" (a part of the operational context).
This lecture will center on the following points.
- Taxonomy of computational architectures (empirical vs. expert).
- Taxonomy of algorithmic methods (how to fight non-determinism, fuzziness and noise of correspondences between successive levels of description).
- Empirical computational architectures (statistical, example-based with and without annotations).
- Expert computational architectures ("procedural" methods, "rule-based" methods).
- Rules of well-formedness (grammars), transition rules (automata), rewriting rules (on strings or trees).
- Examples of SLLPs (Specialized Languages for Linguistic Programming) of 3 types (creation, addition, substitution).
- Ensuring and relaxing decidability of SLLPs. Examples: ATNs, Q-systems, ATEF, ROBRA, GRADE.
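To make "rewriting rules on strings" and the termination concern behind "decidability" a little more concrete, here is a minimal sketch of a generic string rewriting loop. It is not an implementation of Q-systems, ATEF, ROBRA, GRADE or any other SLLP named above; the function `rewrite`, the `max_steps` bound and the sample rules are all assumptions made for this illustration.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (left-hand side, right-hand side)


def rewrite(text: str, rules: List[Rule], max_steps: int = 100) -> str:
    """Apply the first applicable rule at its leftmost match until no rule applies.

    The step bound is a crude way to force termination: a rule set such as
    [("a", "aa")] would otherwise rewrite forever.
    """
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in text:
                text = text.replace(lhs, rhs, 1)  # rewrite one occurrence
                break
        else:
            return text  # no rule matched: a normal form has been reached
    raise RuntimeError("step bound exceeded; the rule set may not terminate")


# Hypothetical rules: contract "do not" and squeeze double spaces.
rules = [("do not", "don't"), ("  ", " ")]
print(rewrite("I  do not  know", rules))
```

Bounding the number of steps is only one possible design choice; it trades expressive power for a guarantee that the engine always stops.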
This lecture will center on the following points.
- From homogeneous MT systems to heterogeneous CAT systems.
- Convergence of evolutions since 1980.
- Environments for Developing Lingware (EDL): EDL specific to 1 MT system (MTS), Meta-EDL for heterogeneous MTS, towards Integrating EDL.
- Good practices in implementing specialized languages (SLLPs).
- Design of Lexical Data Bases for MT systems.
- PIVAX for heterogeneous MTS sharing a common lexical pivot.
The topic of MT evaluation began to be studied in the early 1950s and has taken new turns since the advent of empirical (most notably probabilistic) methods in MT. Last year, Ch. Boitet published an article with Hervé Blanchon on MT evaluation in the TAL journal. This lecture will draw from that article and center on the following points.
- Similarities and differences in text and speech MT evaluation.
- Arguments and proposals for task-related measures.
- Integrating evaluation as a "no-cost" measure in actual (needed) translation and post-editing tasks.
Last year, the author published a paper on this topic in RFLA, Revue Française de Linguistique Appliquée (French Journal of Applied Linguistics). C.P. Huynh (predoctoral fellow at NII for 5 months) is currently studying the topic in depth for his PhD. A main idea is that corpora for MT cannot be reduced to collections of "bi-segments", or source-target translation pairs. This lecture will center on the following points.
- The notion of "segment" varies with systems, so that corpora should be "multi-segmented".
- Translation units may be segments as well as infra-segments and super-segments.
- Segments may recursively contain subdocuments (e.g., text in balloons).
- Context is important, and has several aspects (linguistic, situational, dialogic), all crucial for solving some important problems (anaphora, ellipsis, tense agreement, gender of addressee, politeness expression, etc.).
- Varied and complex annotations have to be attached to segments and possibly to higher hierarchical levels (paragraphs, sections, etc.). They can concern 1 language at a time (e.g. POS or linguistic trees), or 2 languages (various alignments), or all languages (semantic or pragmatic representations).
- In the case of speech translation, sound files and various transcriptions have to be handled.
- All that justifies the development of systems, programmable at different specialization levels, to operate on translation corpora. This will be illustrated by the SECTra_w system built by Huynh C. P., and its evolution.
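The requirements listed above suggest a data model richer than a list of source-target pairs. The following is a minimal sketch of one possible model, assuming hypothetical class and field names (`Annotation`, `Segment`, `Document`, `segmentations`, and so on). It is not the data model of SECTra_w, whose internals are not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    kind: str              # e.g. "POS", "tree", "alignment", "semantic"
    languages: List[str]   # one language, two (alignment), or all of them
    value: object          # payload; its shape depends on the kind


@dataclass
class Segment:
    seg_id: str
    texts: Dict[str, str]                                  # language code -> text
    annotations: List[Annotation] = field(default_factory=list)
    audio_files: Dict[str, str] = field(default_factory=dict)  # language -> sound file
    subdocument: Optional["Document"] = None               # e.g. text in a balloon


@dataclass
class Document:
    doc_id: str
    # Several segmentations of the same content can coexist ("multi-segmented").
    segmentations: Dict[str, List[Segment]] = field(default_factory=dict)


# Hypothetical usage: one sentence-level and one clause-level segmentation.
sentence = Segment("s1", {"en": "I left because it rained.",
                          "fr": "Je suis parti parce qu'il pleuvait."})
sentence.annotations.append(Annotation("alignment", ["en", "fr"], [(0, 0)]))
clauses = [Segment("c1", {"en": "I left"}), Segment("c2", {"en": "because it rained."})]
doc = Document("d1", {"sentence": [sentence], "clause": clauses})
print(sorted(doc.segmentations))
```

Keeping segmentations in a dictionary lets different systems attach their own notion of "segment" to the same document without overwriting each other.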