Corpus-Based Lexicography

What is a corpus?

The word ‘corpus’ can refer to any body of texts in digital form. But for lexicographers, grammarians, and anyone else involved in general language description, a corpus means a large, representative sample illustrating the full repertoire of text-types in a language – including novels, journalism, academic writing, and many other varieties of text. Dozens of corpora now exist – or are under development – for languages as diverse as Estonian, Xhosa, Mandarin Chinese, and American English, and numerous dictionary projects worldwide are already benefiting from these resources.

Against this background, there is a clear expectation that any dictionary including English as one of its languages should be thoroughly corpus-based, in the interests of objectivity, completeness, accuracy, and indeed credibility. For this project, we will be providing the Irish language, too, with state-of-the-art linguistic resources that will underpin a new programme of dictionary development for the 21st century.

The Corpora

The New Corpus for Ireland

The new English-Irish dictionary will be based on the biggest and best linguistic resources available. These include:

  • A new corpus of Irish, containing 30 million words. It builds on the 8.5-million-word National Corpus of Irish developed by the ITÉ (Linguistics Institute of Ireland), together with a further 15.5 million words also collected by the ITÉ. A further 6 million words were added during Phase 1.
  • A new corpus of Irish English, containing 25 million words of text from newspapers, novels, and other sources, written by authors from the island of Ireland.
  • Existing resources for British and American English, including the British National Corpus (BNC) and data licensed from the Linguistic Data Consortium (LDC) in Pennsylvania.

Why do we need a corpus of Irish English?

Although the English spoken in Ireland and the English spoken in England are very similar, there are some major differences between the two dialects. From a lexicographical point of view, it was essential that English as it is spoken in Ireland be reflected in the dictionary. To this end, a corpus of Irish English (or Hiberno-English) was compiled to ensure that the linguistic patterns and structures found in Ireland could be captured and included.
