Corpora typology

Hansard reports corpus

The Hansard reports corpus is a corpus of UK parliamentary debates. It is an opportunistic corpus: it follows no particular sampling frame and makes no attempt to collect an ever-larger body of data, but simply contains what it was possible to gather. It was built to exploit the first machine-readable material available at the time. It is not a very reliable record of what was actually said, however, because the transcripts are known to alter the original. They add information about the speakers and the people referred to, and they omit situational references such as turn-taking, so that MPs appear to speak one after the other without any meta-comment on how and when to speak.

Canadian Hansard

The Canadian Hansard is a parallel corpus: each source text (ST) in one language is paired with its translation into another language (here, English and French).
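To make the idea of a parallel corpus concrete, the following is a minimal Python sketch of sentence-aligned data. The `AlignedPair` class, the `find_pairs` helper, and the two sentence pairs are all invented for illustration; they are not taken from the Canadian Hansard or from any existing corpus tool.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One source sentence paired with its translation (hypothetical structure)."""
    source: str       # sentence in the source language
    translation: str  # aligned sentence in the target language


# Invented example sentences, NOT real Hansard data.
PAIRS = [
    AlignedPair("The House met at ten o'clock.",
                "La Chambre s'est réunie à dix heures."),
    AlignedPair("I thank the honourable member.",
                "Je remercie l'honorable député."),
]


def find_pairs(pairs, word):
    """Return the pairs whose source sentence contains `word` as a whole token."""
    needle = word.lower()
    return [p for p in pairs if needle in re.findall(r"\w+", p.source.lower())]


for pair in find_pairs(PAIRS, "house"):
    print(pair.source, "|", pair.translation)
```

Searching on the source side and reading off the aligned translation is the basic operation that makes a parallel corpus useful for translation studies.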

Helsinki corpus

The Helsinki corpus is a historical corpus compiled in the 1980s. It spans the period from 850 CE to the 18th century, possibly the widest time span of any available corpus. It attempts to cover a variety of text types and contains 1.5 million words, although the coverage of some periods is inevitably scanty.

ARCHER (A Representative Corpus of Historical English Registers)

Created by Biber at Northern Arizona University, ARCHER is one of the most important diachronic corpora of recent years. It covers both a spread of time periods and a spread of genres, and contains 1.7 million words. Its focus is mainly on the last 350 years, and it allows the analysis of diachronic change with an emphasis on genre variation. It is well suited to research on the emergence of the grammatical structures that characterize present-day English.

Specialized historical corpora

  • Corpus of Early English Correspondence: consists solely of letters.
  • Lancaster Newsbooks Corpus: contains only very early news publications, from the 1650s.
  • Corpus of English Dialogues 1560-1760: targets historical speech. Since no audio recording existed at that time, it collects written dialogues, such as transcripts of court trials and scripts of plays, because these are likely to approximate speech more closely than other genres of writing.

COLT corpus of London teenage speech

This corpus takes a thoughtful approach to the issue of anonymity in corpus building. Recordings were made in London, and the recruits were school pupils. The corpus is time-aligned, like ONZE (Origins of New Zealand English) and ICE-GB.
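Time alignment means that each stretch of transcript is linked to the point in the recording where it occurs. A minimal Python sketch of the idea follows; the `SEGMENTS` data and the `segment_at` helper are invented for illustration and do not reflect the actual file format of COLT, ONZE, or ICE-GB. The sketch assumes the segments are sorted by start time and do not overlap.

```python
from bisect import bisect_right

# (start_seconds, end_seconds, transcript) for each utterance; invented values.
SEGMENTS = [
    (0.0, 2.4, "you coming out later"),
    (2.4, 3.1, "yeah"),
    (3.5, 6.0, "meet you at the station then"),
]


def segment_at(segments, t):
    """Return the transcript being spoken at time t, or None during silence."""
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i < 0:
        return None
    _, end, text = segments[i]
    return text if t < end else None
```

With this kind of index, a concordance hit in the transcript can be traced back to the matching audio, and a point in the audio can be traced to its transcript.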

International Corpus of English (ICE)

The ICE represents one language (English) in a number of its international varieties, allowing these varieties to be compared and contrasted. It is a family of matched corpora. Its British component, ICE-GB, is parsed: the main syntactic constituents are bracketed (a treebank), and the output can be explored through ICECUP (3rd generation, 1990s), which gives a clear graphical display of the trees. The ICE includes spoken data and covers areas where English is spoken mainly as a foreign or second language, such as Hong Kong and India. However, it lacks the diachronic dimension of the Brown family of corpora.
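The bracketing of syntactic constituents can be illustrated with a short Python sketch that reads a labelled bracketing and displays it as an indented tree. The notation, the labels, and the example sentence are generic and invented for illustration; they are not the ICE-GB annotation scheme, and `parse_brackets` and `render` are hypothetical helpers, not part of ICECUP.

```python
def parse_brackets(text):
    """Parse a labelled bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]          # constituent label, e.g. NP
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                     # skip the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def render(node, depth=0):
    """Return an indented display with one constituent or word per line."""
    if isinstance(node, str):
        return "  " * depth + node
    label, children = node
    lines = ["  " * depth + label]
    lines += [render(child, depth + 1) for child in children]
    return "\n".join(lines)


# Invented example sentence with generic labels.
SENTENCE = "(S (NP (PRON I)) (VP (V like) (NP (N corpora))))"
print(render(parse_brackets(SENTENCE)))
```

The indented output is a crude stand-in for the graphical tree display that a treebank browser provides.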

The contents of this page are personal reworkings by the publisher alessia.lento of information learned by attending the Corpus Linguistics lectures and through independent study of any reference books, in preparation for the final exam or thesis. They should not be regarded as official material of the Università degli Studi di Pavia or of Prof. Maria Freddi.