Corpus Linguistics - Corpora Typology

CORPORA TYPOLOGY

- Hansard reports corpus:

Corpus of parliamentary debates produced in the UK

 Opportunistic corpus: no particular sampling frame and no aim of collecting an ever-larger body of data, just what it was possible to gather; built to exploit the first machine-readable material available at that time

Not very reliable because the transcripts are known to make certain changes: they add information about the speakers and the people referred to and omit situational references (e.g. turn-taking), so it seems that MPs speak one after the other without any apparent meta-comment on how and when to speak

- Canadian Hansard:

Parallel corpus: source texts (ST) in one language + translations into other languages
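The parallel layout described above can be pictured as a list of aligned units, each pairing a source segment with its translations. The following is only a minimal sketch: the class name `AlignedUnit`, the function `search_parallel`, and the two English/French sentence pairs are made up for illustration and are not taken from the Canadian Hansard itself.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AlignedUnit:
    """One aligned segment: a source sentence plus its translations."""
    source: str
    translations: dict = field(default_factory=dict)  # language code -> text


# Invented example data (NOT real Hansard text).
hansard_like = [
    AlignedUnit("The House met at two o'clock.",
                {"fr": "La Chambre s'est réunie à quatorze heures."}),
    AlignedUnit("The motion was agreed to.",
                {"fr": "La motion est adoptée."}),
]


def search_parallel(units, word, lang):
    """Return (source, translation) pairs whose source contains `word`."""
    hits = []
    for unit in units:
        tokens = re.findall(r"\w+", unit.source.lower())
        if word.lower() in tokens and lang in unit.translations:
            hits.append((unit.source, unit.translations[lang]))
    return hits


pairs = search_parallel(hansard_like, "house", "fr")
for src, tgt in pairs:
    print(src, "|", tgt)
```

Searching the source side and reading off the aligned translation is the basic operation that alignment makes possible.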

- Helsinki Corpus:

Historical corpus compiled in the 1980s

Temporal span: 850 CE to the 18th century (possibly the widest of any available corpus)

 Attempt to cover a variety of types of text

 1.5 million words (the coverage of some periods is inevitably scanty)

- ARCHER (A Representative Corpus of Historical English Registers):

Biber, at Northern Arizona University

 One of the most important diachronic corpora of recent years

 Represents both a spread of time periods and a spread of genres

 1.7 million words, focused more on the last 350 years

 Analysis of diachronic change with emphasis on genre variation

 Ideal for research on the emergence of grammatical structures that characterize present-day English

- Specialized historical corpora: Corpus of Early English Correspondence (solely letters); Lancaster Newsbooks Corpus (only very early news publications of the 1650s); Corpus of English Dialogues 1560-1760 (historical speech: there was no audio recording at that time, so it collects dialogues because they are likely to approximate speech more than other genres of writing, e.g. transcripts of court trials and scripts of plays)

- COLT corpus of London teenage speech:

Thoughtful approach to the issue of anonymity in corpus building

 Recordings made in London, recruits are pupils

 Time alignment (as in ONZE, Origins of New Zealand English, and ICE-GB)

- International Corpus of English (ICE):

Represents one language (English) and a number of international varieties of it

 Allows comparison and contrast among varieties

 It is a family of matched corpora

Uses ICECUP – parser (3rd generation, 1990s): brackets the main syntactic constituents and creates a clear graphical display of the output (treebank); applied to ICE-GB

Includes spoken data

 Covers areas where English is spoken mainly as a foreign or second language (Hong Kong, India)

Lacks the diachronic dimension of the Brown Family

- British National Corpus (BNC):

Sample corpus – reference corpus

 90% written texts, 10% (transcribed) spoken texts

 Time span 1960-1993

 100,000,000 words

 POS tagged with CLAWS

 Interface for queries: SARA

 Model for ANC, Korean National Corpus

- American National Corpus (ANC):

Same sampling frame as the BNC (its model)

 22 million words at the moment, but it aims at 100 million words

 Aborted sample corpus (skewed at the moment)

- The Bank of English (BoE):

University of Birmingham

 Best-known example of monitor corpus, open-ended

 Started in the 1980s and continually expanded since that time

 In 1987, it allowed the publication of the first corpus-based dictionary (Collins Cobuild English Language Dictionary)

In 1991, more material was added and it became the BoE (previously COBUILD)

 450 million words (56 million available)

 General English section + materials of use in language pedagogy (56 million words)

- Corpus of Contemporary American English (COCA):

Monitor corpus but (!) more explicit design than BoE

 Each extra section added complies with the same breakdown of text varieties

 Halfway between the sample and the monitor corpus: it is a monitor corpus that proceeds according to a sampling frame
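The idea of a monitor corpus that grows according to a sampling frame can be sketched as a check applied to every new section before it is added. This is only an illustration: the genre labels, the equal shares in `FRAME`, the tolerance, and the word counts are hypothetical placeholders, not COCA's documented design.

```python
# Hypothetical sampling frame: target share of words per text variety.
FRAME = {"spoken": 0.25, "fiction": 0.25, "news": 0.25, "academic": 0.25}


def complies(section_counts, frame=FRAME, tolerance=0.02):
    """True if a new section's word counts match the frame's proportions."""
    total = sum(section_counts.values())
    if total == 0 or set(section_counts) != set(frame):
        return False
    return all(
        abs(section_counts[genre] / total - share) <= tolerance
        for genre, share in frame.items()
    )


# Made-up word counts for two candidate sections.
balanced_section = {"spoken": 5000, "fiction": 5100, "news": 4900, "academic": 5000}
skewed_section = {"spoken": 12000, "fiction": 3000, "news": 3000, "academic": 2000}

print(complies(balanced_section))
print(complies(skewed_section))
```

A purely open-ended monitor corpus would accept both sections; a frame-driven one would accept only the first.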

- CANCODE: 5 million words of spoken English

 Cambridge University Press

 To explore the nature of language in speech (Brazil)

- Web as Corpus: the Web can be regarded as the largest monitor corpus: a massive collection of data that is ever-growing

WebCorp interface to explore the web for language studies: 1) uses search engines like Google to retrieve information 2) KWIC (keyword in context) view 3) wildcards 4) reports results as number of occurrences rather than number of pages 5) filters according to domain and document type 6) post-processing (possibility to eliminate useless examples)
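Two of the features listed above, the KWIC view and wildcard search, can be illustrated with a small local sketch. It does not query the web or reproduce WebCorp itself: the function name `kwic`, the use of `*` as the wildcard character, and the sample sentence are assumptions made for this example.

```python
import re


def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per occurrence of `pattern`.

    `*` in `pattern` stands for any run of word characters
    (an assumption of this sketch, not WebCorp's documented syntax).
    """
    regex = re.compile(
        r"\b" + re.escape(pattern).replace(r"\*", r"\w*") + r"\b",
        re.IGNORECASE,
    )
    lines = []
    for match in regex.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Made-up sample text.
sample = "A corpus is a body of text. Corpora are sampled; corpus design matters."
hits = kwic(sample, "corp*")

for line in hits:
    print(line)
print(len(hits), "occurrences")
```

Counting the returned lines gives the number of occurrences, which is the figure the notes say WebCorp reports instead of the number of pages.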

Problems of Web as Corpus:

 • Mixture of carefully prepared texts and casually prepared material

• Content is not divided by genre: the material retrieved is undifferentiated and may need a great deal of processing to sort into meaningful groups of texts

• Contains errors of all sorts (this may prove useful for investigating common spelling errors, though; if this is not the purpose of our research, such errors are just unwelcome noise)

• The number of examples may be overwhelming and a good deal of data may have to be discarded + problem of over-representation of some words (e.g. "click here")

• Difficult to replicate a study done on the web because it is forever changing; replicability is essential in experimental procedures

- London-Lund Corpus:

Jan Svartvik (Lund University)

 Result of 2 phases: 1975-1981 and 1985-1988

 500,000 words

 Spoken English

 2 sections: monologue and dialogue

 Prosodically annotated

 Was originally a part of the Survey of English Usage (SEU, 1959, University College London)

- Diachronic Corpus of Present-Day Spoken English:

Result of re-editing of SEU

 800,000 words

 Diachronic study of English through combining and contrasting spoken material from SEU and ICE

- Brown Corpus:

Standard Corpus of Present-Day American English

 Brown University

 Sample corpus

 1,000,000 words

 Written American English published in 1961

 Divided into 500 chunks of 2,000 words each

 It excludes drama (because it is a fictive recreation of spoken language) and fiction with more than 50% dialogue

15 categories grouped into 4 broad genres: press, general prose, learned and fiction

 Model of small-scale corpora but (!) problems with categories
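The figures in the Brown Corpus entry fit together arithmetically (500 chunks of 2,000 words give 1,000,000 words), and the exclusion rule can be written as a simple filter. The sketch below assumes a random start position for each chunk and uses made-up stand-in data; the function names `take_sample` and `is_eligible` are hypothetical, and the original compilers' exact sampling procedure is not described in these notes.

```python
import random

N_SAMPLES = 500      # number of text chunks (from the notes)
SAMPLE_SIZE = 2000   # words per chunk (from the notes)

total_words = N_SAMPLES * SAMPLE_SIZE  # 500 * 2000 = 1,000,000


def is_eligible(genre, dialogue_share):
    """Apply the exclusion rule: no drama, no fiction with >50% dialogue."""
    if genre == "drama":
        return False
    if genre == "fiction" and dialogue_share > 0.5:
        return False
    return True


def take_sample(words, size=SAMPLE_SIZE, rng=None):
    """Cut one contiguous chunk of `size` words from a longer text.

    The random start position is an assumption of this sketch.
    """
    rng = rng or random.Random(0)
    if len(words) <= size:
        return list(words)
    start = rng.randrange(len(words) - size + 1)
    return words[start:start + size]


# Made-up stand-in for a real 5,000-word text.
document = [f"w{i}" for i in range(5000)]
chunk = take_sample(document)

print(total_words, len(chunk))
print(is_eligible("fiction", 0.7), is_eligible("press", 0.0))
```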

- Lancaster-Oslo/Bergen (LOB) corpus:

British English match for the Brown Corpus, collected in the 1970s but sampling English from 1961

Annotated with parse trees

 Comparable with Brown Family

- Lancaster Corpus of Mandarin Chinese:

Comparable with the Brown Family (a group of corpora designed using the same sampling frame)

- CRATER (Lancaster University):

Multilingual parallel corpus: French, English, Spanish

 Aligned and annotated, unidirectional

- EMILLE: Parallel corpus, unidirectional

 15 South-Asian languages

- ENPC: English/Norwegian

 Both comparable and parallel

Details
Academic year 2013-2014
4 pages
Academic discipline (SSD): Sciences of antiquity, philology, literature and art history, L-LIN/12 Language and translation - English language

The contents of this page are the personal reworking by the publisher alessia.lento of information learned by attending the Corpus Linguistics lectures and through independent study of any reference books, in preparation for the final exam or the thesis. They are not to be understood as official material of the Università degli Studi di Pavia or of Prof. Maria Freddi.