
Corpus-Based Lexicography

One of the most ambitious aspects of the project is the development of the ‘New Corpus for Ireland’, which includes large text collections for both Irish and Irish-English (see The New Corpus for Ireland).

What is a corpus?

The word ‘corpus’ can refer to any body of texts in digital form. But for lexicographers, grammarians, and anyone else involved in general language description, a corpus means a large, representative sample illustrating the full repertoire of text-types in a language – including novels, journalism, academic writing, and many other varieties of text. Dozens of corpora now exist – or are under development – for languages as diverse as Estonian, Xhosa, Mandarin Chinese, and American English, and numerous dictionary projects worldwide are already benefiting from these resources.

The first computer corpora were developed as far back as the early 1960s. But it was the computer boom of the 1980s and 1990s that established the corpus as an indispensable tool for lexicography. As processing power increased and the cost of data-storage fell, it became feasible to build corpora containing tens of millions – even hundreds of millions – of words.

Against this background, there is a clear expectation that any dictionary including English as one of its languages should be thoroughly corpus-based, in the interests of objectivity, completeness, accuracy, and indeed credibility. For this project, we will be providing the Irish language, too, with state-of-the-art linguistic resources that will underpin a new programme of dictionary development for the 21st century.

How can a corpus help us make better dictionaries?

A dictionary is, essentially, a set of generalizations about the way the words in a language behave. But how do we know these generalizations are reliable? Before corpus data became available, lexicographers relied on a combination of citations (short extracts from published works, illustrating a word in context) and their own intuitions as native speakers. Both these sources of data have their place in dictionary-making, but each has its limitations. Citational evidence, collected by human readers in a somewhat opportunistic way, tends to be sporadic and incomplete, while the intuitions of a single individual are inevitably subjective and partial. A large and diverse corpus, on the other hand, gives us enormous volumes of empirical data, showing how people have used a word or phrase in real communicative situations. And it is this evidence, analyzed using increasingly sophisticated software, that provides a secure basis for the generalizations in our dictionaries and gives us confidence that they are reliable.

What does a corpus tell us?

A corpus provides us with the information we need to create an authoritative description of the vocabulary of a language. For any given word, we can discover a range of information, including:

semantic: evidence for the word’s different meanings and nuances
syntactic: how the word combines grammatically with others
contextual: the environments in which the word typically appears, and the patterns or phrases it forms part of
stylistic: the kinds of text where we are most likely to encounter the word, whether these be literary or poetic, journalistic, academic, or email
statistical: evidence for the relative frequency of different words, or of the different meanings or grammatical constructions of the same word
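
As a rough illustration of the statistical evidence described above, the sketch below counts word frequencies in a tiny sample and normalizes them to occurrences per million tokens, the unit in which corpus frequencies are usually quoted. The sample text, the simple tokenizer, and the function names are assumptions made for this example and are not part of the project's software.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(tokens):
    """Return each word's frequency normalized to occurrences per million tokens."""
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Toy sample standing in for a corpus (invented for illustration).
sample = "He faces the biggest challenge of his career. Teachers face a new challenge."
freqs = per_million(tokenize(sample))
print(f"challenge: {freqs['challenge']:.1f} per million")
```

In a sample this small the normalized figure is meaningless; the point is only that normalization makes counts comparable across corpora of different sizes.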

How do lexicographers use corpus data?

The main software tool for analyzing a corpus is the concordancer – a program that searches the corpus, finds every instance of a word or phrase, and displays it with its context. Here is part of a concordance for the English noun challenge:


1. ological barrier against mounting a challenge when the party was in office
2. incial circuit presents a hell of a challenge The only advantage London ha
3. s wedding. Now he faces the biggest challenge of his career - the developme
4. face his greatest and most daunting challenge - the ‘fight for the right’
5. bane Broncos, threw down a dramatic challenge to record-breaking Wigan last
6. terms, such as the need for a fresh challenge . Now, it is a players’ manag
7. This presents an immediate challenge . Teachers will need to think
8. n crusade if we can't face the real challenge of getting rid of British new
9. uipment. Mr Parkinson now faces the challenge of selling to the electorate
10. is to adapt and change to meet the challenge of providing effective educat
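
The basic idea of a concordancer can be sketched in a few lines of Python. This is a minimal illustration, not the tool used on the project: the function name, the fixed context width, and the sample text are all assumptions made for the example, and a real concordancer would work on an indexed corpus rather than a single string.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: left context, keyword, right context."""
    # Collapse line breaks and repeated spaces so each result fits on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

# Toy text standing in for a corpus (invented for illustration).
sample = (
    "Now he faces the biggest challenge of his career. "
    "This presents an immediate challenge. Teachers will need to think."
)
kwic = concordance(sample, "challenge", width=20)
for line in kwic:
    print(line)
```

Aligning the keyword in a fixed column, as in the concordance above, is what lets the eye scan the words immediately to its left and right.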

 

Concordances like this – providing detailed evidence of the way words behave and combine – have helped to transform lexicography in English since the early 1980s. But analyzing data in this form is a labour-intensive task, and in recent years the sheer volume of available data has begun to limit the value of the concordance. If you are presented with several hundred (or even several thousand) ‘concordance lines’ like this, it becomes difficult for the human brain to process the information effectively.

In the last few years, however, this problem has been successfully addressed by a new generation of ‘lexical profiling’ software, developed by Adam Kilgarriff at the University of Brighton. Effectively, Kilgarriff’s software takes the output of a concordancer and adds another level of computer processing. It identifies the important grammatical relationships that a word like challenge appears in, then finds the words that most frequently fill these slots. The lexicographer is then presented with a ‘Word Sketch’ that summarizes the word’s behaviour. The extract below shows three common structures: adjectives that frequently modify challenge, verbs that take challenge as an object, and nouns that follow the expression ‘a challenge to ...’.

challenge noun
frequency: 64.4 per million
rank: 1586

adjectives    object_of_verbs    +to (prep-phrase)
greatest      face               leadership
biggest       meet               authority
serious       pose               status quo
formidable    withstand          order
daunting      mount              decision
exciting      prevent            power
direct        relish             idea
major         accept             government
new           resist             dominance
tough         enjoy              validity
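
The grouping step behind a summary like this can be sketched as follows. This is a simplified illustration and not Kilgarriff's implementation: it assumes the grammatical relations have already been extracted as (relation, headword, collocate) triples, the triples themselves are invented, and raw counts stand in for whatever ranking statistic the real software applies.

```python
from collections import Counter, defaultdict

# Hypothetical (relation, headword, collocate) triples, as a tagger or parser
# might extract them from a corpus; invented here for illustration.
triples = [
    ("adjective", "challenge", "biggest"),
    ("adjective", "challenge", "biggest"),
    ("adjective", "challenge", "daunting"),
    ("object_of", "challenge", "face"),
    ("object_of", "challenge", "face"),
    ("object_of", "challenge", "face"),
    ("object_of", "challenge", "meet"),
    ("to_phrase", "challenge", "authority"),
]

def sketch(triples, headword, top_n=3):
    """Group a headword's collocates by relation and rank them by raw count."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}

profile = sketch(triples, "challenge")
for relation, collocates in profile.items():
    print(relation, collocates)
```

Each relation becomes one column of the summary, with the most frequent fillers at the top.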


This is only a small extract from a Word Sketch – in the real thing, there is a wealth of frequency data and details about how the word combines, as well as direct links to sentences in the corpus that show each of these combinations in use. But this does at least give some idea of the benefits this software can bring. Not only does it save a lot of time and effort: it gives an accurate (and rapid) overview of the key features of any word, and provides the data we need to compile a more comprehensive and systematic account of word use than has ever been possible before.

For further information on the Corpus compiled during Phase 1, see The New Corpus for Ireland.
