Lemmatisation

The significance of lemmatisation for the users of text databases

Info: Lemmatisation is a linguistic procedure that uses numerical codes to assign each word form actually occurring in a text to its grammatical base form (lemma).

Disadvantages of non-lemmatised databases

Example: Search query for ‹lex, legis› (= law) in the Augustinian oeuvre in a conventional, non-lemmatised database:

Possible strategy 1: Search for lex and for leg*

Problem: In addition to ‹lex› and its inflected forms, you will also find all forms of the present stem and the active perfect stem of ‹legere› (= to read), as well as derivatives such as ‹legalis, -e› or ‹legitimus, -a, -um›.
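The problem with strategy 1 can be illustrated with a minimal sketch. The word list below is a toy sample chosen for illustration, not data from any Augustinian text, and the wildcard matching uses Python's standard `fnmatch` module rather than the search engine of any particular database.

```python
import fnmatch

# Toy word list (illustrative only, not taken from a real corpus).
tokens = ["lex", "legem", "legibus", "legit", "legimus", "legalis", "legitimus"]

# Strategy 1: exact match on "lex" plus the wildcard pattern "leg*".
hits = [t for t in tokens if t == "lex" or fnmatch.fnmatchcase(t, "leg*")]

# The 8 distinct forms that really belong to the noun lex, legis.
forms_of_lex = {"lex", "legis", "legi", "legem", "lege", "leges", "legum", "legibus"}

# Everything the wildcard caught that is not a form of lex.
false_positives = [t for t in hits if t not in forms_of_lex]

print(hits)
print(false_positives)
```

In this sample every token is returned, and four of the seven hits (two forms of ‹legere› and two derivatives) are unwanted.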

Possible strategy 2: Enter every form of ‹lex› individually:

Problem: You will have to search for 8 different forms, and 4 of them coincide with forms of the verb ‹legere›; these hits must then be sorted out of the search result.

Case	Singular	Plural
Nominative	lex	leges
Genitive	legis	legum
Dative	legi	legibus
Accusative	legem	leges
Ablative	lege	legibus
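The overlap mentioned under strategy 2 can be checked with a short sketch. The set of verb forms below is only a hand-picked selection from the conjugation of ‹legere›, assumed here for illustration; it is not a complete paradigm.

```python
# The 8 distinct forms of the noun lex, legis (from the paradigm above).
forms_of_lex = {"lex", "legis", "legi", "legem", "lege", "leges", "legum", "legibus"}

# A selection of forms of the verb legere (not the full conjugation).
forms_of_legere = {
    "lego", "legis", "legit", "legimus", "legitis", "legunt",  # present indicative active
    "legam", "leges", "leget",                                  # future indicative active
    "lege", "legite",                                           # imperative
    "legi", "legisti",                                          # perfect indicative active
    "legere",                                                   # present infinitive active
}

# Forms a plain string search cannot attribute to one word or the other.
ambiguous = sorted(forms_of_lex & forms_of_legere)

# Forms that can only belong to the noun.
unambiguous = sorted(forms_of_lex - forms_of_legere)

print(ambiguous)    # ['lege', 'leges', 'legi', 'legis']
print(unambiguous)  # ['legem', 'legibus', 'legum', 'lex']
```

Only four of the eight forms can be searched safely as plain strings; every hit for the other four has to be checked in context.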

Advantages of lemmatised databases

Example: Search query for ‹lex, legis› (= law) in the Augustinian oeuvre using the lemmatised text database of the Corpus Augustinianum Gissense a Cornelio Mayer editum (CAG-online):

Enter l:lex

Result: Within seconds you will find all 8,000 word forms of ‹lex›, and only these; identical forms of other words are not included.
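Why a lemma search returns no false hits can be sketched as follows. Everything in this example is an assumption made for illustration: the lemma codes, the toy annotated text, the `index` structure and the `search` function are hypothetical and do not describe the internal implementation of CAG-online. Only the query syntax `l:lex` is taken from the example above.

```python
from collections import defaultdict

# Hypothetical lemma codes; the real numbering scheme is not known here.
LEMMA_CODES = {"lex": 1, "legere": 2}

# Toy annotated text: (position, word form, lemma code).
annotated = [
    (0, "legis", 1),  # genitive singular of lex
    (1, "legis", 2),  # 2nd person singular present of legere
    (2, "lege", 1),   # ablative singular of lex
    (3, "lege", 2),   # imperative of legere
    (4, "legem", 1),  # accusative singular of lex
]

# Index the text by lemma code instead of by surface string.
index = defaultdict(list)
for pos, form, code in annotated:
    index[code].append((pos, form))

def search(query):
    """Resolve a query of the form 'l:<lemma>' against the lemma index."""
    prefix, _, lemma = query.partition(":")
    if prefix != "l" or lemma not in LEMMA_CODES:
        return []
    return list(index.get(LEMMA_CODES[lemma], []))

print(search("l:lex"))  # [(0, 'legis'), (2, 'lege'), (4, 'legem')]
```

Because each occurrence carries its lemma code, the identical strings at positions 1 and 3 are never returned for `l:lex`, although they would match a plain string search.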