CORPORA TYPOLOGY
- Hansard reports corpus:
Corpus of parliamentary debates produced in the UK
Opportunistic corpus: no particular sampling frame and no collection of an ever-larger body
of data, only what it was possible to gather; built to exploit the first
machine-readable material available at that time
Not very reliable, because the transcripts are known to make certain changes: they add
information about the speakers and the people referred to and omit situational references
(e.g. turn-taking), so it seems that MPs speak one after the other without any apparent
meta-comment on how and when to speak
- Canadian Hansard:
Parallel corpus: source texts (ST) in one language + their translations into other languages
- Helsinki Corpus:
Historical corpus compiled in the 1980s
Temporal span: 850 CE – 18th century (possibly the widest of any available corpus)
Attempt to cover a variety of types of text
1.5 million words (the coverage of some periods is inevitably scanty)
- ARCHER (A Representative Corpus of Historical English Registers):
Biber, at Northern Arizona University
One of the most important diachronic corpora of recent years
Represents both a spread of time periods and a spread of genres
1.7 million words, focused more on the last 350 years
Analysis of diachronic change with emphasis on genre variation
Ideal for research on the emergence of grammatical structures that characterize
present-day English
- Specialized historical corpora: Corpus of Early English Correspondence (letters only); Lancaster
Newsbooks Corpus (only very early news publications of the 1650s); Corpus of English Dialogues
1560-1760 (historical speech: there was no audio recording at that time, so it collects dialogues,
namely transcripts of court trials and scripts of plays, because they are likely to approximate
speech more than other genres of writing)
- COLT corpus of London teenage speech:
Thoughtful approach to the issue of anonymity in corpus building
Recordings made in London; the recruits are school pupils
Time-aligned (like ONZE, Origins of New Zealand English, and ICE-GB)
- International Corpus of English (ICE):
Represents one language (English) and a number of international varieties of it
Allows comparison and contrast among varieties
It is a family of matched corpora
Uses ICECUP – parser (3rd generation, 1990s): brackets the main syntactic
constituents and creates a clear graphical display of the output (treebank) – applied
to ICE-GB
Includes spoken data
Covers areas where English is spoken mainly as a foreign or second
language (Hong Kong, India)
Lacks the diachronic dimension of the Brown family
- British National Corpus (BNC):
Sample corpus – reference corpus
90% written texts, 10% (transcribed) spoken texts
Time span 1960-1993
100,000,000 words
POS tagged with CLAWS
Interface for queries: SARA
Model for the ANC and the Korean National Corpus
- American National Corpus (ANC):
Same sampling frame as the BNC (its model)
22 million words at the moment, but it aims at 100 million words
Unfinished sample corpus (skewed at the moment)
- The Bank of English (BoE):
University of Birmingham
Best-known example of monitor corpus, open-ended
Started in 1980s and continually expanded since that time
In 1987, it allowed the publication of the first corpus-based dictionary (Collins Cobuild
English Language Dictionary)
In 1991, more material was added and it became the BoE (previously COBUILD)
450 million words (56 million available)
General English section + materials of use in language pedagogy (56 million words)
- Corpus of Contemporary American English (COCA):
Monitor corpus but (!) more explicit design than BoE
Each extra section added complies with the same breakdown of text varieties
Halfway between the sample and the monitor corpus: it is a monitor corpus that
proceeds according to a sampling frame
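The idea of a monitor corpus that grows only through sections respecting one sampling frame can be sketched in a few lines. This is a minimal illustration, not COCA's actual procedure: the variety names, the equal shares, the tolerance value and the class and function names are all assumptions made for the example.

```python
# Hypothetical sampling frame: the share of each text variety that every
# new section must respect. Names and equal shares are illustrative.
SAMPLING_FRAME = {
    "spoken": 0.2,
    "fiction": 0.2,
    "magazine": 0.2,
    "newspaper": 0.2,
    "academic": 0.2,
}


def section_matches_frame(word_counts, frame=SAMPLING_FRAME, tolerance=0.02):
    """Return True if the section's variety proportions follow the frame."""
    if set(word_counts) != set(frame):
        return False
    total = sum(word_counts.values())
    if total == 0:
        return False
    return all(
        abs(word_counts[variety] / total - share) <= tolerance
        for variety, share in frame.items()
    )


class MonitorCorpus:
    """Open-ended corpus that only accepts sections matching the frame."""

    def __init__(self):
        self.sections = {}  # section label -> word counts per variety

    def add_section(self, label, word_counts):
        if not section_matches_frame(word_counts):
            raise ValueError(f"section {label!r} does not follow the sampling frame")
        self.sections[label] = dict(word_counts)

    def total_words(self):
        return sum(sum(counts.values()) for counts in self.sections.values())
```

A pure monitor corpus would accept any new material in `add_section`; the check against the frame is what keeps each added section comparable with the earlier ones.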
- CANCODE: 5 million words of spoken English
Cambridge University Press
To explore the nature of language in speech (Brazil)
- Web as Corpus: the Web can be regarded as the largest monitor corpus: a massive collection of data that
is ever-growing
WebCorp interface to explore the Web for language studies: 1) uses search engines
like Google to retrieve information 2) KWIC view 3) wildcards 4) reports results as
number of occurrences, not number of pages 5) filters according to domain and document
type 6) post-processing (possibility to eliminate useless examples)
Problems of Web as Corpus:
• Mixture of carefully prepared texts and casually prepared material
• Content is not divided by genre: the material retrieved is undifferentiated and
may need a great deal of processing to sort into meaningful groups of texts
• Contains errors of all sorts (this may prove useful to investigate common
spelling errors though; if this is not the purpose of our research, such errors
are just unwelcome noise)
• The number of examples may be overwhelming and a good deal of data may
have to be discarded + problem of over-representation of some words (e.g.
"click here")
• Difficult to replicate a study done on the Web because it is forever changing;
replicability is essential in experimental procedures
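The KWIC view and wildcard search listed among the WebCorp features can be illustrated with a small local sketch. It does not use WebCorp or any search engine: it works on a plain string, the function names `tokenize`, `kwic` and `format_kwic` are made up for the example, and the tokenizer is deliberately crude.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplification)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def kwic(text, pattern, window=4):
    """Return (left context, keyword, right context) for every token
    matching a shell-style wildcard pattern such as 'corp*'."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token, pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Align the keywords in one column, as in a concordance display."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return "\n".join(
        f"{left:>{width}}  [{keyword}]  {right}" for left, keyword, right in lines
    )


if __name__ == "__main__":
    sample = ("A corpus is a body of text. Corpora are sampled; "
              "a monitor corpus keeps growing.")
    print(format_kwic(kwic(sample, "corp*", window=2)))
```

The result is counted in occurrences (one line per matching token), which is the kind of count the notes contrast with a search engine's number of pages.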
- London-Lund Corpus:
Jan Svartvik (Lund University)
Result of 2 phases: 1975-1981 and 1985-1988
500,000 words
Spoken English
2 sections: monologue and dialogue
Prosodically annotated
Was originally a part of The Survey of English Usage (SEU, 1959, University
College London)
- Diachronic Corpus of Present-Day Spoken English:
Result of re-editing of SEU
800,000 words
Diachronic study of English through combining and contrasting spoken material from
SEU and ICE
- Brown Corpus:
Standard Corpus of Present-Day American English
Brown University
Sample corpus
1,000,000 words
Written American English published in 1961
Divided into 500 chunks of 2,000 words each
It excludes drama (because it is a fictive recreation of spoken language) and fiction with more
than 50% of dialogue
15 categories grouped into 4 broad genres: press, general prose, learned and fiction
Model for small-scale corpora but (!) problems with categories
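The Brown design figures (500 chunks of 2,000 words, fiction excluded when more than half of it is dialogue) can be made concrete with a short sketch. The constants come from the notes; everything else is an assumption for illustration, in particular the heuristic that treats text inside double quotes as dialogue, which is not how the Brown compilers worked.

```python
import re

SAMPLE_SIZE = 2000  # words per chunk, as stated in the notes
N_SAMPLES = 500     # number of chunks, as stated in the notes


def dialogue_share(text):
    """Rough share of words inside double quotes (illustrative heuristic)."""
    words = text.split()
    if not words:
        return 0.0
    quoted = re.findall(r'"([^"]*)"', text)
    quoted_words = sum(len(span.split()) for span in quoted)
    return quoted_words / len(words)


def eligible_fiction(text, max_dialogue=0.5):
    """Fiction is kept unless more than half of it is dialogue."""
    return dialogue_share(text) <= max_dialogue


def take_sample(text, size=SAMPLE_SIZE, start=0):
    """Return `size` consecutive words from `start`, or None if too short."""
    words = text.split()
    if start + size > len(words):
        return None
    return words[start:start + size]


if __name__ == "__main__":
    # 500 chunks of 2,000 words give the corpus size in the notes.
    print(SAMPLE_SIZE * N_SAMPLES)
```

Fixed-size chunks keep every text's contribution equal, so no single long text dominates the counts.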
- Lancaster-Oslo/Bergen (LOB) corpus:
British English match for the Brown Corpus, collected in the 1970s but sampling
English from 1961
Annotated with parse trees
Comparable with Brown Family
- Lancaster Corpus of Mandarin Chinese:
Comparable with Brown Family (groups of corpora designed using the same
sampling frame)
- CRATER (Lancaster University):
Multilingual parallel corpus: French, English, Spanish; aligned and annotated,
unidirectional
- EMILLE: Parallel corpus, unidirectional
15 South-Asian languages
- ENPC: English/Norwegian
Both comparable and parallel