Published on March 30, 2008
R&D of Japanese-Chinese Machine Translation System by NICT -- Mechanism, Resources and Collaboration -- Hitoshi Isahara, National Institute of Information and Communications Technology
Slide2: NICT, what? Overview of NLP research in NICT, Japan. Research on Computational Linguistics Research on NLP (C-J Machine Translation) Resource Compilation
Slide3: NICT, what? The sole national institute on information and communications technology in Japan.
Slide4: 2001 Communications Research Laboratory 1979 Telecommunications and Broadcasting Satellite Organization Incorporated Administrative Agency National Laboratory Certified Institution 1992 Telecommunications Advancement Organization 1988 Communications Research Lab 1952 Radio Research Lab ▼ ▼ ▼ National Institute of Information and Communications Technology The NICT was born in 2004 by merging CRL and TAO.
Basic Research, Applied Research, and Funding for New Business: Basic Research Applied Research New Business Intramural R&D R&D for New Business Promotion High-risk Long-term Extramural R&D Funding Crossing the Valley of Death Collaboration with Industries and Universities
Slide6: CORE (INTRAMURAL) R&D Information and Network Systems Wireless Communications Applied Research and Standards Basic and Advanced Research Basic and Fundamental Research, International Collaboration, and Standardization
Slide7: COLLABORATIVE (EXTRAMURAL) R&D Collaborative R&D, Test-bed PROMOTION AND FUNDING Funding New Business Promotion Extramural R&D
Slide8: Three Routine Services Type Approval & Calibration of Wireless Equipment Space weather forecast center Determination & Supply of Standard Time and Frequency Regular Observation of Ionosphere & Space Environment Info. Service Ootakadoya mtn. 
JJY LF station (40 kHz, 250m high antenna) Use of radio controlled clocks and watches (About 10 million sets have been sold) Standard site for antenna calibration Hagane mtn. JJY LF station (60 kHz, 200m high antenna)
NICT Overseas Bases: Mongolia Japan Thailand Singapore Wireless Communications Laboratory Asia Research Center Thai Computational Linguistics Laboratory Experimental Facility at University of Alaska Washington DC Office Paris Office
NICT Vision: Drive ICT 4-engines to create future society 4-Engines New ICT Infrastructure for ICT Society Challenge Test-bed and Promotion Develop ICT Value Chain President Dr. Nagao
Slide11: Overview of NLP research in NICT, Japan.
Slide12: NICT has 7 research centers Knowledge Creating Communication Research Center (at Keihanna Science City in Kyoto) Computational Linguistics Group Language Information Project Language Grid Project (Prof. Ishida of Kyoto University) Thai Computational Linguistics Laboratory (Dr. Virach) Kobe University Graduate School of Science and Technology Keihanna Open Laboratory
Slide13: Group Leader 1 Senior Researcher 4 Researcher 2 Expert Researcher 11 (full time) 8 (part time) Guest Researcher 5 (full time) 3 (part time) Technical Staff 5 Secretary 2 34 researchers We have trainees from universities TCL: 5 researchers Language Grid: 10 researchers Kobe University: 3 PhD candidates One of the biggest NLP research groups in Japan. One of the leading NLP research groups in Japan. Computational Linguistics Group
Slide14: NLP Analysis・Generation IR・IE Summarization・QA Machine Translation CALL Learning-based method Computational Linguistics Lexical Semantics Discourse Emotion (Honorific・Music) Intention Open lab. 
Joint University (Kobe Univ.) Collaboration Linguistic Resources NICT Corpus Multilingual corpus Learners corpus Corpus of spontaneous Japanese EDR Electronic Dictionary Theoretical background Research tools Breakthrough Objective Data Training Data Verification Tools Open resources Thai Computational Linguistic Laboratory Resource-based NLP
Slide15: Research on Computational Linguistics
Slide16: Hierarchy Word sense Mutual relation Self-Organizing Semantic Map Automatic extraction of lexical knowledge from corpora
Slide17: We made a list of collocations from the corpora. KIMOCHI (feeling): ureshii (glad), kanashii (sad), shiawasena (happy) ... OMOI (thought): ureshii (glad), tanoshii (pleased), hokorashii (proud) ... KANTEN (viewpoint): igakutekina (medical), rekishitekina (historical) ... The number of abstract nouns is 365. The number of different adjectives is 10,525. The total number of adjectives is 35,173. KIMOCHI: 111000001011000011000... OMOI: 100011001000110010000... KANTEN: 000110000100000101000...
Slide18: emotion viewpoint state/situation characteristics aspect range Semantic Map of Japanese abstract nouns Self-organizing Map using Neural Network Input: Collocation information
Slide19: Semantic Map with hierarchical information Using measure of inclusion Complementary Similarity Measure (CSM) as a measure of inclusion relation
Slide21: References Eiko Yamamoto et al., "Extraction of Hierarchies Based on Inclusion of Co-occurring Words with Frequency Information", IJCAI 2005. Kyoko Kanzaki et al., "Construction of an Objective Hierarchy of Abstract Concepts via Directional Similarity", COLING 2004.
Slide22: Chinese Semantic Map of sport domain Application to other languages and concrete words Application to Semantic Analysis Application to domain specific knowledge Medical domain. Based on dependency relations. 
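Slides 17 to 19 describe each abstract noun as a binary vector over co-occurring adjectives and name the Complementary Similarity Measure (CSM) as the measure of inclusion. The slides do not give the formula, so the sketch below assumes the commonly cited form CSM(F, T) = (ad - bc) / sqrt((a + c)(b + d)). The toy vectors are only the truncated 21-bit prefixes shown on Slide17, not the real data.

```python
from math import sqrt


def csm(f, t):
    """Complementary Similarity Measure between two equal-length binary vectors.

    Assumed formulation (not stated on the slides):
        (a*d - b*c) / sqrt((a + c) * (b + d))
    a = features present in both, b = only in f, c = only in t, d = in neither.
    """
    if len(f) != len(t):
        raise ValueError("vectors must have the same length")
    a = sum(1 for x, y in zip(f, t) if x and y)
    b = sum(1 for x, y in zip(f, t) if x and not y)
    c = sum(1 for x, y in zip(f, t) if not x and y)
    d = sum(1 for x, y in zip(f, t) if not x and not y)
    denom = sqrt((a + c) * (b + d))
    # Degenerate case (t is all ones or all zeros): report 0 rather than divide by zero.
    return (a * d - b * c) / denom if denom else 0.0


def to_bits(s):
    """Turn a string such as '1010' into a list of ints."""
    return [int(ch) for ch in s]


# Toy data: truncated prefixes from Slide17, for illustration only.
vectors = {
    "KIMOCHI": to_bits("111000001011000011000"),
    "OMOI": to_bits("100011001000110010000"),
    "KANTEN": to_bits("000110000100000101000"),
}

if __name__ == "__main__":
    for x in vectors:
        for y in vectors:
            if x != y:
                print(f"CSM({x}, {y}) = {csm(vectors[x], vectors[y]):.3f}")
```

The measure is asymmetric: swapping the arguments changes the denominator, which is what makes it usable as a directional (inclusion) score rather than a plain similarity.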
cypress character boy dream ability speech possibility
Slide23: Knowledge Extraction from Corpora Short term (for emergency): NLP without knowledge Automatic extraction of meaningful expressions (from Web document) Medium term (information gathering): NLP with knowledge Automatic extraction of lexical knowledge from corpora
Slide24: パイプライン_における_亀裂_発生_時_の_ガス_減圧_特性 (gas decompression characteristics when a crack appears on a pipeline) 図書_室_における_情報_サービス_と_業務_電算_化 (information service and job computerization at a library) Some results by our automatic extraction of meaningful expressions (from Web document) Phrases, longer than words and compound words Useful sets of words for web query Domain knowledge from Medical domain Web. latency period - hepatic cell - erythrocyte Malaria Many pages on hepatic trouble
Slide25: Hierarchy Word sense Mutual relation Multi-layered Semantic Frame Analysis (MSFA) Self-Organizing Semantic Map
Slide27: Research on NLP (C-J Machine Translation)
Slide28: Objective Create multilingual machine translation system Current focus is Japanese-Chinese and Japanese-English MT Method Corpus-based machine translation based on deep-analysis – Parsing of both languages (Japanese/Chinese, Japanese/English) – Semantic analysis using ontology
Slide29: Corpus-based machine translation in a nutshell Prepare a parallel corpus (e.g., Japanese-English) Train a corpus-based MT system on the parallel corpus Then, you have a running MT system for the language pair Two problems How to prepare a parallel corpus? Biggest issue for new language pairs How to train a corpus-based MT system?
Slide30: How to prepare a parallel corpus? Manual translation - Select texts to be translated (e.g., Japanese) Translate them manually into the target language (e.g., Chinese) We begin to translate Japanese texts (mainly scientific literature) into Chinese to create a large parallel corpus (over 100 million words) in 5 years. 
Automatic compilation Compiling parallel corpora from non-parallel corpora by using NLP techniques
Slide31: Five-year national project of developing machine translation system between Japanese and Chinese Methodology: Syntax-augmented Example-based Machine Translation Source text: Scientific papers in Japanese and Chinese Translation ratio: 80% Parallel corpora: 1 to 10 million pairs Fund: Japanese Ministry of Internal Affairs and Communications Japanese Ministry of Education and Science Participants: NICT, JST, Kyoto Univ., Univ. of Tokyo, Shizuoka Univ.
C-J MT system: China is achieving striking development in science and technology. To make scientific and technological information distributed in Asian countries easily usable in Japan To promote distribution of literature to other countries about science and technology in which Japan is at the forefront To contribute to scientific and technological development in Asian countries and Japan
System Overview: Analytical engine Translation engine Dictionary (terminology and general term bilingual dictionary, case frame, semantic system) Corpus (Treebank, parallel corpus) Linguistic resources Scientific and technological documents Scientific and technological literature Japanese Chinese Modify later according to users’ reference frequency Example-based translation, taking linguistic structures into further consideration Realize a practical machine translation system in a new paradigm Corpus compilation Creation of Japanese-Chinese and Chinese-Japanese dictionaries for translation and information retrieval
Research Status in Japan (1): R&D of language processing systems NICT conducts R&D of computational linguistics and NLP Several universities are conducting basic research Enterprises are developing commercial systems. 
Basic language analysis technology Japanese-related technology at practical level Chinese-related technology needs further accuracy Accuracy of Chinese analyzer for machine translation system is still lower than that of Japanese. R&D of Machine translation There is no project aimed at practical high-performance machine translation covering long sentences.
Research Status in Japan (2): Dictionary development NICT is proposing a method to automatically develop an English-mediated Japanese-Chinese dictionary. A method to semi-automatically develop a scientific and technological dictionary with millions of words, including synonyms and different notations, has not been realized. Japan Science and Technology Agency has been developing Japanese-English and English-Japanese machine translation systems since the 1980s. Development of a large-scale Japanese-English dictionary (with synonyms and different notations) for information retrieval started in 2004.
Situations and Utilization Plans of Existing Resources: Existing resources and their utilization plans Japan Science and Technology Agency (JST) has: A large-scale Japanese-English science and technology dictionary with 14 million words A database with about 4 million Japanese-English parallel corpus of scientific and technological literature NICT has: A Japanese-Chinese-English corpus with 40,000 sentences, with detailed language information annotated A Japanese-English electronic dictionary (EDR) with 400,000 words Now being expanded into Japanese-Chinese-English dictionary NICT plans to utilize these resources effectively in carrying out this project. Situations of other existing resources There is no large-scale Japanese-Chinese bilingual corpus in Japan. There is no available dictionary which is larger than JST’s and NICT’s in scale. 
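The English-mediated Japanese-Chinese dictionary mentioned under Research Status in Japan (2) can be pictured as a join of a Japanese-English dictionary and an English-Chinese dictionary on the shared English entries. The sketch below is an illustration only: the function name, the toy entries, and the ranking by number of shared English pivots are assumptions, not NICT's actual method.

```python
from collections import defaultdict


def pivot_dictionary(ja_en, en_zh):
    """Build Japanese -> Chinese candidates by joining two dictionaries on English.

    ja_en: {japanese_word: [english_word, ...]}
    en_zh: {english_word: [chinese_word, ...]}
    Returns {japanese_word: [(chinese_word, n_shared_pivots), ...]},
    sorted by descending pivot count, then by the Chinese string.
    """
    result = {}
    for ja, en_words in ja_en.items():
        counts = defaultdict(int)
        for en in set(en_words):
            for zh in set(en_zh.get(en, ())):
                counts[zh] += 1  # one vote per English pivot linking ja and zh
        if counts:
            result[ja] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return result


# Hypothetical toy entries, for illustration only.
ja_en = {"銀行": ["bank"], "土手": ["bank", "embankment"]}
en_zh = {"bank": ["银行", "堤岸"], "embankment": ["堤岸"]}
candidates = pivot_dictionary(ja_en, en_zh)

if __name__ == "__main__":
    for ja, zh_list in candidates.items():
        print(ja, zh_list)
```

The toy data also shows the weakness of a plain pivot: an ambiguous English word such as "bank" links 銀行 to both Chinese candidates with equal support, so some extra evidence (more pivots, corpus statistics, or manual checking) is needed to pick the right one.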
Cooperative Framework with China: Establishment of cooperative structures for development of linguistic resources and language processing technology Development of a bilingual corpus Beijing Foreign Studies University translates Japanese scientific and technological literature into Chinese Development of a dictionary Beijing University Library provided a list of English-Chinese terminology dictionaries as a reference material for developing a Chinese-Japanese terminology dictionary NICT and the Institute of Computing Technology of Chinese Academy of Sciences are jointly promoting multilingualization of the EDR dictionary Analytical technology and evaluation NICT and the Institute of Computing Technology of Chinese Academy of Sciences are jointly promoting development of a morphological analysis system and evaluation of machine translations.
Slide38: Cooperative Framework with China Japan-China Natural Language Processing Joint Research Promotion Conference Held annually by NICT since 2001 Researchers from more than 10 major research institutes in China (universities and research institutes) have participated in the conference every year, allowing Japanese research institutes to establish expansive cooperative relationships with Chinese research institutes. 
Technology Map of Multilingual Translation Technology: Application Basic Theory
Technology Map of Translation Methods in Multilingual Translation
Planned Scale and Period (1): 5 years Reason and basis for this scale and period: Enormous cost to develop linguistic resources Our experience: Development of a Japanese-English-Chinese corpus (approximately 40,000 sentences) Period: about 4 years Not only development of a bilingual corpus but also development of a dictionary, translation engine and analysis engine is necessary. Cost reduction Conduct translations in China with the cooperation of universities in China. Extract parallel sentences from existing comparable texts semi-automatically Align words and phrases semi-automatically Make the best use of existing linguistic resources and language processing technology owned by NICT and Japan Science and Technology Agency
Planned Scale and Period (2): Goal in the 3rd year Confirm the ability of Japanese-Chinese machine translation prototype system for specific target domains. Goal in the 5th year Enhance Chinese analysis performance fully, and complete demonstration experiments on a Japanese-Chinese and Chinese-Japanese machine translation prototype system
Strategy in Multilingual Development: Application to other Asian languages such as Thai Conducted at NICT’s Thai Computational Linguistics Laboratory (TCL) Technology and systems to be developed in this project are versatile. Application to other languages will be possible, without a substantial change of the system, if their corpora are developed. Findings and technologies obtained in this project can be utilized in development of corpora in Asian languages. 
Possible production of linguistic resources at low cost
Research System Chart: Sub-theme 1: Research and development of an example-based, Japanese-Chinese and Chinese-Japanese translation system (Organizations: NICT, Kyoto University) Responsible organization: NICT Sub-theme 3: Development and demonstration experiment of a prototype system (Organization: NICT) Responsible organization: NICT Sub-theme 2: Research on development of linguistic resources for Japanese-Chinese and Chinese-Japanese translation systems (Organizations: Tokyo University, Shizuoka University, JST) Responsible organization: Tokyo University Provision of linguistic resources including a bilingual corpus Feedback from translation results Control of research institutes, progress management, etc. Linguistic resources (dictionary for translation) Translation engine, etc. Develop a high-quality translation system prototype which realizes over 80% translation ratio of Japanese and Chinese scientific and technological literature Research on language processing of basic terms (verbs, adjectives, etc.) (Both sub-theme groups) Research Steering Committee (Chairman: Hitoshi Isahara (NICT))
Slide45: Sub-theme 1: Research and development of an example-based, Japanese-Chinese and Chinese-Japanese translation system (Organization: NICT, Kyoto University) Responsible organization: NICT Research and development of an analysis system (NICT) Research and development of a translation engine (Kyoto University) Morphological analysis methods, grammar rules, sentence structures, etc. 
Feedback from the results of the translation engine
Slide46: Sub-theme 2: Research on development of linguistic resources for Japanese-Chinese and Chinese-Japanese translation systems (Organization: Tokyo University, Shizuoka University, JST) Responsible organization: Tokyo University Research and development of automatic development of dictionaries for Chinese-Japanese and Japanese-English machine translations (Tokyo University) Development of an integrated system for development of a large-scale corpus and a dictionary (JST) Provision of semantic relation networks with technology for translating distinctively (Shizuoka University) Provision of linguistic resources including a dictionary and a bilingual corpus, etc. Provision of a semi-automatic development system algorithms, etc. Provision of semantic networks for optimal translation Methods of extracting technical terms and recognizing semantic relationships (Tokyo University) Provision of linguistic resources including a bilingual corpus Provision of semantic relation recognition scheme algorithms Provision of semantic perception framework (obtaining synonyms) algorithms
Evaluation of Prototype System: Evaluation dataset Apply to scientific and technological literature in specific fields Use part of a bilingual corpus to be produced in this proposal through translation Translation ratio Ratio of informative translations Informative means: Machine translation can convey most of the meaning of a sentence without modifications Machine translation is good enough to use as a preliminary translation by indicating the basis of translation (translation examples used) when it is modified manually when needed.
Publication and Copyright of Results: In principle, corpora, dictionaries and systems whose copyrights are owned by us will be made available for researchers promptly after completion of this task. 
Reference example: Spoken Japanese corpus No copyright problem In development of a bilingual corpus by translation, negotiations will be made with public institutions, including universities and national research institutes, and academic societies, and only literature of which utilization is permitted will be used.
Slide49: Resource Compilation
Slide50: Development of Huge Linguistic Resources NICT’s Linguistic Resources ・NICT Japanese-Chinese-English annotated corpus (40 thousand sentences) ・Japanese-English alignment corpora (200 thousand sentences) ・Corpus of Spontaneous Japanese (7.5 million words) ・EDR Lexicon (400 thousand words in Japanese and English) ・NICT JLE (Japanese Learner English) Corpus (1200 participant data) Target Huge parallel corpus: 10 million sentences Annotated corpus (morphology, syntax and semantics) Speech conversation corpus Ontology Huge electronic dictionary (special terms, general terms) Toward world-best center of Linguistic Resources
Slide51: Japanese national project (1999-2003 FY) “Spontaneous Speech: Corpus and Processing Technology” In collaboration with the National Institute for Japanese Language Corpus of Spontaneous Japanese
Slide52: A large scale spontaneous speech corpus of common Japanese. Mainly monologues: about 10-15 minutes per lecture ‘Academic Presentation Speech (APS)’ and ‘Simulated Public Speech (SPS)’ (personal narratives etc.) A fruitful resource for the research of spontaneous speech. Ex. speech recognition, automatic summarization, linguistic research etc.
Slide53: The Corpus of Spontaneous Japanese, CSJ (CRL’s part) Transcription and POS information (National Institute for Japanese Language, NIJLA) Sentence Boundaries (CORE: 50 hours, 180 lectures) Sentence Selection Syntactic Dependency Discourse Structure ・Spontaneous spoken data into syntactically and semantically useful processing units. ・Basic units for the subsequent annotation. 
Automatic Summarization and Syntactic and Discourse Parsing of Spontaneous Japanese Rate 10%, 50% extraction → Sentence revision Syntactic dependency between bunsetsus and Repair relation Discourse segmentation, hierarchy and Purposes Communications Research Laboratory, CRL http://www2.kokken.go.jp/~csj/public/index.html Morphological Annotation for all
Slide54: Japanese Analyzer using Maximum Entropy Model High quality Easily tuned to new domains and systems Output with reliability score
Slide55: POS tagging with post-editing Our POS tagger is fully statistical and can output a reliability score for each output POS. We start checking from the most doubtful POSs. The accuracy after post-editing Percentage of checked parts
Slide56: Yomiuri Shimbun (Japanese) 1987-2001 Daily Yomiuri (English) 1989-2001 Both are available for academic and business use Sentence Alignment 180,000 pairs Article Alignment 95,000 pairs Raw text of sentence alignment data, and link information of article alignment data are available for academic use. Newspaper Alignment Data
Slide57: Alignment Procedure Basic alignment methods 1. Use each English article as a query, search for a corresponding Japanese article 2. Align the sentences in the corresponding articles through DP matching → noisy article and sentence alignments Filtering (Focus) - reliable measures - sort article and sentence alignments to select appropriate ones.
Slide58: Japanese-English Thai-English Retrieval of newspaper articles via Internet Users whose mother tongues are Asian languages can easily access similar articles in English, simply choosing proper equivalents from the lists.
Slide59: The project started in 2002. 
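The alignment procedure on Slide57 (DP matching of sentences inside a matched article pair, followed by filtering of the noisy result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for the newspaper data: the function names, the monotone one-to-one alignment with skips, the Jaccard toy similarity, and the threshold are all hypothetical. A real system would score Japanese-English sentence pairs with a bilingual dictionary or a translation model.

```python
def align(src, tgt, sim, skip_penalty=0.0):
    """Monotone 1-to-1 sentence alignment with skips, maximizing total similarity.

    Returns a list of (src_index, tgt_index, similarity) in document order.
    """
    n, m = len(src), len(tgt)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((score[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1]), "match"))
            if i > 0:
                options.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1, sim(src[i - 1], tgt[j - 1])))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def filter_pairs(pairs, threshold):
    """Filtering step: keep only alignments whose similarity reaches the threshold."""
    return [p for p in pairs if p[2] >= threshold]


def jaccard(a, b):
    """Toy similarity: word-set overlap (stands in for a cross-lingual score)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


if __name__ == "__main__":
    src = ["the cat sat", "dogs bark loudly", "it rained"]
    tgt = ["a cat sat down", "it rained today"]
    noisy = align(src, tgt, jaccard)
    print(noisy)
    print(filter_pairs(noisy, 0.5))
```

Keeping the similarity score on every pair is what makes the later filtering and sorting step possible: low-scoring pairs can be dropped or sent to manual checking first.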
Focus on Asian languages Japanese-Chinese English-Japanese Annotate with detailed information Syntactic structure alignment at word and phrase levels Build the corpora with large size & high quality Multilingual annotated corpus (NICT Multilingual Corpus)
Overview of NICT Multilingual Corpora: Japanese Chinese English Penn Treebank English 19.5K sentences Japanese 18K sentences Chinese 38K sentences Original Data Translation
Slide61: 2001 2002 2003 2004 2005 2006 2007 Human Translation Japanese English Human Translation English Japanese Human Translation Japanese Chinese Annotation on Chinese Translations Annotation on English Translations Annotation of Alignment (J-E) Annotation of Alignment (J-C) Progress of Project
Chinese Translation: Translate one sentence to one sentence Aim at natural translation By supplementation, deletion or replacement Adjust word order or insert comma Reflect contextual information In an entire article, maintain the same meaning and information as those of the original sentences
Problems in Translation: Translations of proper nouns search on the web 埼玉県 埼玉县 琦玉县 Translations of special things in Japan add explanation 大相撲 (grand sumo tournament) 大相朴 OK “春闘” (spring labor offensive) “春斗”? 
add (“春季劳资纠纷”)
Slide64: Achieving High-quality Translation Translation (professional) Refinement (different professional) Revision of fluency (Chinese native) Japanese sentence Chinese Translation Chinese Translation Chinese Translation Translations with problems Revision in annotation Chinese Translation
Slide65: 新年伊始,村山富市首相于二十八日在首相官邸会见内阁记者会的记者,就社会党新民主联合所属议员脱党问题谈到：“此举不会给政权带来影响。即便有人脱党,我想也只是限定在那个范围之内。”从而表明了不至于有大量议员脱党的看法。 村山富市首相は年頭にあたり首相官邸で内閣記者会と二十八日会見し、社会党の新民主連合所属議員の離党問題について「政権に影響を及ぼすことにはならない。離党者がいても、その範囲にとどまると思う」と述べ、大量離党には至らないとの見通しを示した。 Example of Chinese translations (1/2) subject (this) object conjunction・adverb Supplement constituents in translations
Slide66: また、一九九五年中の衆院解散・総選挙の可能性に否定的な見解を表明、二十日召集予定の通常国会前の内閣改造を明確に否定した。 另外，村山富市首相还对一九九五年内解散众议院、进行大选的可能性表明了否定性的见解，并且明确否定了在预定二十日召开的通常国会前改组内阁。 来年一月一日に首相の座に座っているのはだれか。 また、一九九五年中の衆院解散・総選挙の可能性に否定的な見解を表明し、二十日召集予定の通常国会前の内閣改造を明確に否定した。 另外，村山富市首相还对一九九五年内解散众议院、进行大选的可能性表明了否定性的见解，并且明确否定了在预定二十日召开的通常国会前改组内阁。 明年一月一日坐在首相席上的该是谁呢？ Example (2/2) Supplement correlation part
Slide67: Tool for manual revision Look up dictionary Retrieve a word in the annotated corpus
Slide68: Retrieve a word in a file Sort in attributes of the word, the left word or the right word Tool for manual revision
Slide69: Automatically Aligned Result
The NICT JLE Corpus: What is it? 
Collection of 1,281 speech samples of Japanese learners of English Based on the oral proficiency interview test “SST” 2 million words (including fillers and repetitions) Divided into 9 proficiency levels
Research Background: New types of language learning environment on computer – CALL, E-learning, M-learning Recent foreign language education – focused on “communicative competence” Tasks for learners on existing computerized self-learning system – very passive Deficiency of the techniques to process “non-standard” language spontaneously produced by learners Corpus-based automatic error detection system “Eden (Error Detection System for English)” (Izumi, Uchimoto, & Isahara 2003)
Research Questions: “Communicative Approach” in foreign language education “Learners need the opportunity to practice language in the same conditions that apply in real-life situations – in communication, where their primary focus is on message conveyance rather than linguistic accuracy.” (Ellis, 2003) Accuracy and communicability are complementary How can these two be balanced? What kind of factors can change the level of intelligibility? To what extent can learner sentences be labeled with their level of intelligibility automatically?
The NICT Japanese Learner English (JLE) Corpus
Error tagging: Error-coding for 200 transcripts For morphological, grammatical and lexical errors Based on XML (Extensible Markup Language) syntax
Automatic Error Detection (Izumi, Uchimoto and Isahara, 2004): Eden (Error Detection for English) Automatic error detection based on machine learning
Slide76: Word Dictionary Japanese Word Dictionary (260,000) English Word Dictionary (190,000) Bilingual Dictionary Jpn.-Eng. Bilingual Dictionary (240,000) Eng.-Jpn. 
Bilingual Dictionary (160,000) Concept Dictionary (410,000) Co-occurrence Dictionary Japanese Co-occurrence Dictionary (930,000) EDR Corpus (200,000) English Co-occurrence Dictionary (460,000) EDR Corpus (120,000) Technical Terminology Dictionary (Japanese 110,000, English 70,000) EDR Lexicon
Slide77: We try to automatically build electronic bilingual lexicons of Asian language pairs by exploiting existing resources. Resources pairing English with one other language are well developed. Resources between non-English language pairs are less developed. Chinese expansion of EDR Lexicon
Slide78: Language Grid Semantic Web Knowledge as open source Machine Translation Intercultural collaboration
Slide79: What can be useful tools for collaboration? Personal Acquaintanceship Linguistic Resources Corpora and Lexical Resources Resource Compilation System Developments Fundamental Tools and Technologies on NLP, such as Analyzer and Generator Mutual Benefit Standardization (ISO activities)
Slide80: Thank you very much!