Corpora typology
Hansard reports corpus
The Hansard reports corpus is a corpus of parliamentary debates produced in the UK. It is an opportunistic corpus: it follows neither a particular sampling frame nor the practice of collecting an ever-larger body of data, but simply represents the data that it was possible to gather. It was built to exploit the first machine-readable material available at the time. However, it is not a fully reliable record of what was said, because the transcribers are known to make certain changes, such as adding information about the speakers and the people referred to. The transcripts also omit situational references, such as those to turn-taking, so that MPs appear to speak one after another without any meta-comment on how and when to speak.
Canadian Hansard
This is a parallel corpus: it contains source texts (ST) in one language together with their translations into another (here, English and French).
Helsinki corpus
The Helsinki corpus is a historical corpus compiled in the 1980s. It spans the period from 850 CE to the 18th century, possibly the widest time span of any available corpus. It attempts to cover a variety of text types and contains 1.5 million words, although the coverage of some periods is inevitably scanty.
ARCHER (A Representative Corpus of Historical English Registers)
Created by Biber at Northern Arizona University, ARCHER is one of the most important diachronic corpora of recent years. It contains 1.7 million words and represents both a spread of time periods and a spread of genres. It focuses on roughly the last 350 years and supports the analysis of diachronic change with an emphasis on genre variation. It is ideal for research on the emergence of the grammatical structures that characterize present-day English.
Specialized historical corpora
- Corpus of Early English Correspondence: Consists solely of letters.
- Lancaster Newsbooks Corpus: Contains only very early news publications from the 1650s.
- Corpus of English Dialogues 1560-1760: Aims to represent historical speech. Since no audio recordings exist from that period, it collects written dialogues, such as transcripts of court trials and scripts of plays, because they are likely to approximate speech more closely than other genres of writing.
COLT corpus of London teenage speech
This corpus takes a thoughtful approach to the issue of anonymity in corpus building. The recordings were made in London, and the participants recruited were school pupils. Like ONZE (Origins of New Zealand English) and ICE-GB, it is time-aligned, that is, the transcription is linked to the audio recordings.
International Corpus of English (ICE)
ICE represents one language (English) in a number of its international varieties, allowing comparison and contrast among them. It is a family of matched corpora. One component, ICE-GB, is parsed: the main syntactic constituents are bracketed to form a treebank, which can be searched and displayed graphically with ICECUP (the ICE Corpus Utility Program, a third-generation tool from the 1990s). The corpus includes spoken data and also covers regions where English is mainly a second or foreign language, such as Hong Kong and India. However, it lacks the diachronic dimension of the Brown family of corpora.