Corpus Linguistics

Published on November 16, 2007

Author: Haylee

Source: authorstream.com

Content

Corpus Linguistics: Developing a PolyU Language Bank
Sherman Lee
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan

Outline:
- Background
  - Goals of corpus linguistics
  - Types of corpora
  - Applications of corpus analysis
- As an illustration
  - Exploring units of meaning
  - Case study
- Developing a PolyU Language Bank
  - Aims and objectives of project
  - Similar existing projects
  - Procedures
- The PolyU Language Bank
  - Current status
  - Sample corpora
  - Sample search

Goals of corpus linguistics:
- Chomskyan linguistics: ‘Langue’ (competence); ideal speaker/hearer; language = innate mental faculty; intuitive evidence; universals; grammar
- Corpus linguistics: ‘Parole’ (performance); complexity/variation; language = social phenomenon; empirical evidence; differences; meaning

Basic tools:
- Corpus: a systematic collection of speech or writing that is built according to explicit design criteria for a specific purpose
  - cf. EAGLES’ broad definition: “A corpus can potentially contain any text type, incl. word lists, dictionaries, etc.”
- Concordancer: search engine (e.g. WordSmith; SARA)
- Concordance: occurrences of search item, displayed in list with immediate context shown

Types of corpora:
- Written vs Spoken
- General vs Specialised, e.g. ESP, Learner corpora
- Monolingual vs Multilingual, e.g. Parallel, Comparable
- Synchronic vs Diachronic; Monitor
- Annotated vs Unannotated

Written corpora: (slide content not captured in the transcript)

Specialised corpora: (slide content not captured in the transcript)

Other examples of available corpora: (slide content not captured in the transcript)

Some applications of corpus analysis:
- Language teaching & learning
  - Empirical teaching data: authentic examples of language use
  - Reference source: answering learners’ questions or explaining learner errors: “What’s the difference between ‘at last’ and ‘in the end’?” “How is ‘hardly’ used?”
  - Preparation of teaching materials, e.g. vocabulary lists, CLOZE tests
  - CALL; concordancing and data-driven learning
- Translation
  - Using parallel texts to find suitable translation equivalents
  - Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
  - Exploring units of meaning in texts
- Linguistics and language research
  - Lexicography & lexical studies, e.g. relative word frequency
  - Language variation, e.g. linguistic features across registers
  - Grammar: corpora used as data to test hypotheses, syntactic theory
  - Pragmatics & discourse, e.g. CA of discourse features in spoken (conversational) data

Exploring meaning, units of meaning:
- Focus on meaning because:
  - People interested in the meanings of texts, in how language is actually used in discourse
  - Meaning is a key problem for translation, language learning, information management…
- What are basic units of meaning?
  - Language teaching (TEFL): vocabulary often introduced in the form of new single words
  - Words considered to be basic units of meaning
- Is the word an ideal unit of meaning?
  - “… If you dog a dog during the dog days of summer, you’ll be a dog tired dog catcher…”
  - “… Can I sit down? My dogs are barking…”
  - Most lexical errors made by language learners result from failure to deal with ambiguities of single words

‘Unambiguous Units of Meaning’:
- Notion of an ‘Unambiguous Unit of Meaning’ necessary for understanding meaning
- UUoM = keyword and all words in the context that contribute to making the word unambiguous
- Compounds, idioms, multi-word units, collocations, set phrases
- Often determined by a syntactic pattern
  - Adj + N: friendly fire, closing remarks
  - V + N: invite proposals, draw conclusions
  - Adv + A: politically correct, environmentally friendly
  - N + of + N: cause of death, proof of identity, code of practice, duty of care

Case study:
- Search for units of meaning in online dictionaries and corpora
  - friendly fire
  - environmentally friendly
- Corpora from 1990s
  - British National Corpus (BNC): 100,000,000+ words
    - Written (90%): extracts from regional/national newspapers, specialist periodicals, academic books, popular fiction, un/published letters, memos, school/university essays
    - Spoken (10%): informal conversation, formal meetings (business, government), radio shows, phone-ins
  - The Times (1995, Jan – March): 10,220,367 words
    - Written: business, home news, readers’ letters, reviews
- Corpora from 1960s - 1970s
  - Brown corpus / LOB corpus: each 1 million words
    - Written, balanced corpora of 15 genres of text

Search results: (slide content not captured in the transcript)

What the results show:
- ‘friendly fire’, ‘environmentally friendly’
  - Represent fairly new concepts
  - Occur in the newer corpora (1990s) as units of meaning
  - Occur as entries in some of the online dictionaries only (not bilingual dictionaries)
- New terminology and terms of common usage not always recorded in dictionaries and termbanks
- One way of using corpora for learning and translation:
  - Use corpus evidence to help students recognise units of meaning; introduce notion of units of meaning into language learning

Aims of PULB project:
- To design and build an archive of language corpora = ‘language bank’
  - To be used by staff and students in the department
  - For teaching, language learning and research purposes
- To provide a user-friendly platform
  - A WWW interface via which users can freely access the language bank
  - With browse, search and concordance facilities

Ingredients of PULB:
- Sources: standard corpora, departmental collections
- Medium: written texts, transcribed spoken data
- Language types: native speaker, learner corpora
- Languages: English, Chinese, Japanese, French, German
- Genres: business, law, academia, media, social, literature
- Target size: 30 million words (European) / characters (Asian)

Why a language bank? - “What’s in it for us”:
- Free and simple shared access to a collection of language corpora
  - That you can utilise for your teaching
    - Authentic examples of language use at your fingertips
    - Empirical teaching data covering different specialisms (ESP, EAP)
  - That you can utilise for your research
    - A ready-made collection of data waiting for you to work on
    - Saving on time and resources
- Way of incorporating new methods and information technology into the department’s teaching and research activities
  - Increase students’ awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
- Way of integrating theory with technology in the classroom
  - Train students to be more computer-literate
- All of the above can
  - Motivate students to become active learners
  - Help students to more effectively learn the target language (cf. goals of DDL)

Similar existing projects:
- W3 Corpora Project (Essex) http://clwww.essex.ac.uk/w3c/
  - Access to corpora (Gutenberg texts, LOB, LOB-tagged)
  - Web interface for performing searches
  - Online tutorial and info on corpus linguistics
- Web Concordancer (VLC, PolyU) http://vlc.polyu.edu.hk/concordance/
  - Access to variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
  - Web interface for performing searches

Directions for PULB:
- Build a language bank with features that parallel those of similar sites
- ~ VLC
- Bring together corpora and texts of various types and genres, of different languages
- ~ Essex
- Make available different facilities for different categories of users (cf. legal considerations)
- Provide on-site tutorial, corpora-based info
- Include extra features
  - Allow searches in multiple texts / corpora simultaneously
  - Some form of parallel concordancing

Target composition of PULB:
(diagram; labels in the order captured) PolyU Language Bank; Chinese; Japanese; English; General corpora; Learner corpora; ICE; Business English (PUBC); Legal English; Academic English; BNC; BROWN; Spoken Corpora; Workplace English; HK spoken corpus; Conference speeches; Academic presentations; French; German; Legal Chinese; Business Chinese; Business Japanese; Japanese Literature; Student work; Social interactions; Teaching reflections; Business writing; Specialised corpora; English Literature

Procedures (i):
- Collate, sort, categorise data from various sources
  - Commercially available data
  - Departmental collections, incl.
    - PolyU Business Corpus (Li and Bilbow)
    - Bilingual corpora (Xu)
    - ESP / EAP corpora (Forey)
    - Learner corpora (Sengupta)
    - …

Procedures (ii):
For the departmental collections:
- Decide how to present each collection
  - E.g. sub-categories, macro categories
- Clean up texts
  - E.g. duplications of text samples
  - E.g. structural features (headings, typographic features)
  - E.g. personal information found in data
    - To protect anonymity or privacy of authors and speakers
- Annotate texts
  - Provide descriptive information about each corpus
    - Compiler, time of compilation, type of collection…
  - Provide descriptive information about the texts
    - Number, size, genre of subtexts
    - Bibliographic info (written text)
    - Ethnographic info (spoken data)
  - Provide structural information for texts if necessary
    - Mark texts for paragraph boundaries etc…

Procedures (iii):
- Put corpora together on platform; set up search and support facilities:
  - ‘PULB map’
  - Browse facility
  - Search and concordance facilities
  - Tutorial / general information
- Transplant PULB onto dept website for use by staff and students
- Promote PULB among corpora community
  - Data provider to data archives / distribution sites, e.g. OLAC; ICAME

The PolyU Language Bank:
- Current status
  - Range of corpora totalling 12M+ words
  - Individual corpus descriptions
  - Index of corpora
  - Simple to use built-in concordancer
  - Available at http://langbank.engl.polyu.edu.hk/

The PolyU Language Bank:
- Some of the currently available corpora
  - PolyU Business Corpus (Eng, Chi, Jap)
  - BNC Sampler Corpus (Spoken, Written)
  - Corpus of Multilingual Texts
  - Corpus of Nursing and Health Science Texts
  - Learner Corpus of Essays and Reports
  - HK Bilingual Corpus of Legal and Documentary Texts
  - ...

How you can contribute:
- Talk to us about your ideas
  - What would you like to see being incorporated into PULB?
    - In terms of corpora
    - In terms of search facilities and supplementary information
  - Can you think of other ways in which PULB can be organised and structured?
  - How likely are you to make use of PULB in your teaching and research?
  - Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
  - Do you know of similar projects being undertaken elsewhere that we can learn from?
- Talk to us about your collections / corpora
  - Do you have collections of language data from past research projects that are (could be) presented as a corpus (corpora)?
  - Can we help you put your collections to good use?
  - Can we work together to incorporate your collections into PULB?

Concluding remarks:
- Corpora represent a valuable but under-exploited resource for teaching and research
- PULB aims to bring together various corpora under a single departmental archive, accessible via WWW
- You can help us by contributing your ideas and/or your language collections
- Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form
- Thank you very much

Social grooming: (slide content not captured in the transcript)

CLOZE: (slide content not captured in the transcript)

PolyU Business Corpus:
- Compiled in 1999-2000 (Li & Bilbow)
- Multilingual, comparable corpora:
  - English (c. 1.3 M words)
  - Chinese (c. 1.2 M words)
  - Japanese (c. 1.1 M words)
- Business texts from: newspapers, government reports, company reports and brochures…
- Has been used for creating a bilingual English-Chinese business lexicon

PolyU Business Lexicon: (slide content not captured in the transcript)

Duplication: (slide content not captured in the transcript)
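
Illustration (not part of the original slides): the "Basic tools" slide describes a concordance as the occurrences of a search item, listed with their immediate context. The sketch below shows roughly what that looks like in Python. It assumes a deliberately naive tokenizer, and the function names `tokenize`, `kwic` and `format_kwic` are hypothetical; it is not how WordSmith, SARA or the PULB concordancer is implemented. The sample sentence is the "dog" example quoted in the slides.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple assumption)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def kwic(text, query, width=3):
    """Return (left, match, right) tuples for each occurrence of `query`.

    `query` may be a single word or a multi-word unit such as "friendly fire".
    `width` is the number of context tokens kept on each side of the match.
    """
    tokens = tokenize(text)
    target = tokenize(query)
    n = len(target)
    hits = []
    if n == 0:
        return hits
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, " ".join(target), right))
    return hits


def format_kwic(hits):
    """Right-align the left context so the search item forms a central column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  [{match}]  {right}" for left, match, right in hits]


sample = (
    "If you dog a dog during the dog days of summer, "
    "you'll be a dog tired dog catcher."
)

for line in format_kwic(kwic(sample, "dog", width=2)):
    print(line)
```

Searching for a multi-word unit, for example `kwic(sample, "dog days")`, returns only the occurrence where the two words appear together, which is the kind of evidence the case study uses for ‘friendly fire’ and ‘environmentally friendly’.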
