Published on November 14, 2007
Customizing the IMDI metadata schema for endangered languages
Heidi Johnson (AILLA)
Arienne Dwyer (DOBES)

Introduction:
- IMDI: International Standards for Language Engineering Metadata Initiative
- DOBES: Volkswagen Foundation’s Documentation of Endangered Languages initiative
- AILLA: the Archive of the Indigenous Languages of Latin America

Types of resources:
- Audio and video recordings in various digital formats
- Annotation text files, e.g. transcriptions and translations
- Standalone texts, e.g. dictionaries, poetry
- Wide range of genres: from verbal art to scholarly analyses

Bundles of resources:
- Session (IMDI, 2001): resources resulting from a linguistic elicitation session (recordings and annotations).
- Only models one kind of resource production: a recording session.
- Collections will include a greater variety of resources, in sets of related materials.

Types of bundles:
- Canonical bundle: the original session. A digitized recording, in different formats, and some textual annotation files, also in different formats.
- Minimal bundle: a single file. Examples: dictionary, poem, recording of uninterpretable chants.
- Meta-bundle: a bundle containing other bundles. Example: a book about a set of annotated recordings.
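The three bundle types can be pictured as one recursive data structure. The sketch below is only an illustration, not part of the IMDI schema: the class name `Bundle`, its fields, the file names, and the rule used to tell the types apart (nested bundles mean meta-bundle, exactly one file means minimal, anything else is treated as canonical) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """A set of related resources (hypothetical model, not the IMDI format)."""
    name: str
    files: List[str] = field(default_factory=list)           # resource file names
    children: List["Bundle"] = field(default_factory=list)   # nested bundles

    def kind(self) -> str:
        # Assumed classification rule, chosen for illustration only.
        if self.children:
            return "meta-bundle"
        if len(self.files) == 1:
            return "minimal"
        return "canonical"

    def all_files(self) -> List[str]:
        # Own files first, then the files of every nested bundle, in order.
        collected = list(self.files)
        for child in self.children:
            collected.extend(child.all_files())
        return collected


# Canonical bundle: one recording in two formats plus an annotation file.
session = Bundle("story_session",
                 ["story.wav", "story.mp3", "story_transcription.txt"])
# Minimal bundle: a single file.
poem = Bundle("poem", ["poem.pdf"])
# Meta-bundle: a book about a set of annotated recordings.
book = Bundle("book_about_stories", ["book.pdf"], [session])

for bundle in (session, poem, book):
    print(bundle.name, "->", bundle.kind())
```

Modelling all three types with a single class keeps the archive logic uniform: a query for every file belonging to the book can walk the nested bundles with `all_files()` without special cases.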
Bundle elements:
- Current:
  - Name of bundle
  - Date and place of production
- Proposed:
  - Resource relations
  - Date archived
  - Last modified

Major subschemas:
- Project
- Collector
- Content
- Participants
- Resources
- References

The Content Subschema:
Genre is the top-level category:
- Interaction: conversation, interview …
- Explanation: description, recipe …
- Performance: narrative, poem, oratory …
- Teaching: primer, textbook …
- Analysis: grammar, dictionary …

Other Content categories:
- Modality: speech, writing, gesture
- Communication context:
  - Interactivity
  - Planning
  - Involvement
- Languages
- Task
- Description
- Keys

AILLA’s Content Keys:
- Register: a characterization of how the discourse reflects the social context. Example: honorific speech.
- Style: poetic and stylistic effects. Examples: parallelism, metered verse.

The Project subschema:
- Current elements:
  - Name: a nickname or acronym
  - Title: official title
  - ID: a unique identifier
  - Contact information
- Proposed element:
  - Funder: name of funding organization

The Collector subschema:
- AILLA renames this Depositor, since this is the individual we have to keep track of (e.g. for Level 3 access permission).
- When the Depositor is not also the Collector, Collector can be listed under Participants.

The Participants subschema:
- Type: functional role, e.g. creator
- Role: family relationship
- Name/Full name
- Language(s)
- Ethnic group, age, sex, education
- Anonymous: True if participant’s Full name is reserved; False otherwise

AILLA additions to Participants:
- Origin: place (country, region, etc.) of origin of the creator of the primary resource in the bundle (e.g. the speaker whose voice is recorded).
- Occupation: can be relevant in assessing accuracy of some kinds of data.
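The Anonymous flag on a participant can be sketched as a small record type. Everything here is hypothetical: the class `Participant`, its field names, the sample values, and the assumption that Name holds a short public label that may be shown when Full name is reserved. None of it is taken from the IMDI specification itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    """Hypothetical participant record following the slide's element list."""
    participant_type: str              # functional role, e.g. "creator"
    name: str                          # assumed: short public label
    full_name: str                     # reserved when anonymous is True
    anonymous: bool = False
    role: Optional[str] = None         # family relationship
    origin: Optional[str] = None       # AILLA addition
    occupation: Optional[str] = None   # AILLA addition

    def display_name(self) -> str:
        # Show the full name only when it is not reserved.
        return self.name if self.anonymous else self.full_name


# Sample values are invented for the example.
speaker = Participant("creator", "Speaker A", "Maria Example",
                      anonymous=True, origin="Guatemala", occupation="weaver")
collector = Participant("collector", "HJ", "Heidi Johnson")

print(speaker.display_name())
print(collector.display_name())
```

Keeping the decision in one method means any catalogue page or export that lists participants applies the same rule for reserved names.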
The Resources subschema:
Resources contains information about formats and provenance of files in a bundle.
- Media Files: audio, video, etc.
- Annotation Files: text files.
- Proposal: call them all Media Files, to reduce redundancy in the database. (All have URL, size, etc. elements.)

Text resources:
- Current elements:
  - Type: type of annotation, e.g. phonetic transcription.
  - Content encoding: annotation encoding scheme, e.g. EUROTYP.
  - Character encoding: character set(s) used in a text file.

Text resources 2:
- Proposed elements:
  - Transcription type
  - Translation (aka Glossing) type
  - Software: used to produce transcriptions, translations, other annotations (e.g. Shoebox)
- Describe Annotator in Participants (along with Translator, etc.)

Proposed subschema:
- Place: composed of several elements:
  - Continent
  - Country
  - Region
  - Subregion (address)
- Repeated at least twice, in Bundle and in Participants (Origin).
- Might also be useful in the Language subschema.

Conclusion:
- IMDI schema is a flexible tool.
- Customization through Key/Value pairs allows local modifications.
- Most of the proposed changes are terminological, moving from the DOBES in-house terminology to more general usage.
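Customization through Key/Value pairs can be illustrated with a minimal sketch: a record with a few fixed elements plus an open dictionary of archive-specific keys. The class `Content`, the method names, and the `Key:` prefix in the flattened record are assumptions made for this example; they do not reproduce the actual IMDI XML format. The Register and Style keys and their example values come from AILLA's Content Keys above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Content:
    """Hypothetical Content record: fixed elements plus local Key/Value pairs."""
    genre: str
    modality: str
    keys: Dict[str, str] = field(default_factory=dict)

    def set_key(self, name: str, value: str) -> None:
        # A local archive adds its own category without changing the schema.
        self.keys[name] = value

    def as_record(self) -> Dict[str, str]:
        # Flatten fixed elements and local keys into one searchable record.
        record = {"Genre": self.genre, "Modality": self.modality}
        for name, value in self.keys.items():
            record["Key:" + name] = value
        return record


content = Content(genre="Performance", modality="speech")
content.set_key("Register", "honorific speech")
content.set_key("Style", "parallelism")

print(content.as_record())
```

The fixed elements stay comparable across archives, while the key dictionary holds whatever a single archive needs locally.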