The significance of lemmatisation for the users of text databases
Info: Lemmatisation is a linguistic procedure that uses numerical codes to assign each word form occurring in a text to its grammatical base form (its lemma).
Disadvantages of non-lemmatised databases
Example: Search query for ‹lex, legis› (= law) in the Augustinian oeuvre in a conventional, non-lemmatised database:
Possible strategy 1: Search for lex and for leg*
Problem: In addition to ‹lex› and its inflected forms, you will also find every form built on the present stem and the perfect active stem of ‹legere› (= to read), as well as derivatives such as ‹legalis, -e› or ‹legitimus, -a, -um›.
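Illustration: a minimal sketch of why this wildcard search over-matches, using Python's standard `fnmatch` module. The word list is an invented sample, not taken from the corpus, and the lemma labels are attached only to show what a reader would have to sort out by hand. How a particular database evaluates `leg*` is an assumption here.

```python
from fnmatch import fnmatchcase

# Invented sample of (word form, lemma) pairs; not corpus data.
SAMPLE = [
    ("lex", "lex"),
    ("legem", "lex"),
    ("legibus", "lex"),
    ("legit", "legere"),
    ("legerunt", "legere"),
    ("legalis", "legalis"),
    ("legitimum", "legitimus"),
]

def wildcard_search(patterns, tokens):
    """Return every (form, lemma) pair whose form matches any pattern."""
    return [(form, lemma) for form, lemma in tokens
            if any(fnmatchcase(form, p) for p in patterns)]

# Strategy 1: search for "lex" and for "leg*".
hits = wildcard_search(["lex", "leg*"], SAMPLE)

# Forms that matched although they do not belong to the noun lex.
false_hits = [form for form, lemma in hits if lemma != "lex"]
print(false_hits)
```

In this sample every word matches one of the two patterns, so four of the seven hits are unwanted.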
Possible strategy 2: Enter every form of ‹lex›:
Problem: You will have to search for 8 different forms, and 4 of them (legis, legi, lege, leges) coincide with forms of the verb ‹legere›, so those hits must be sorted out of the search result by hand.
Case | Singular | Plural
Nominative | lex | leges
Genitive | legis | legum
Dative | legi | legibus
Accusative | legem | leges
Ablative | lege | legibus
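Illustration: the overlap described under strategy 2 can be checked with two sets. This is a sketch under stated assumptions: the list of ‹legere› forms is a partial, hand-picked selection of active indicative and imperative forms, not a full paradigm.

```python
# The eight distinct forms of the noun lex (ten case/number slots,
# but "leges" and "legibus" each fill two of them).
LEX_FORMS = {"lex", "legis", "legi", "legem", "lege", "leges", "legum", "legibus"}

# Partial, hand-picked selection of forms of the verb legere.
LEGERE_FORMS = {
    "lego", "legis", "legit", "legimus", "legitis", "legunt",   # present indicative
    "legam", "leges", "leget", "legemus", "legetis", "legent",  # future indicative
    "legi", "legisti", "legistis", "legerunt",                  # perfect indicative
    "lege", "legite",                                           # imperative
}

# Forms a plain string search cannot attribute to one word or the other.
ambiguous = sorted(LEX_FORMS & LEGERE_FORMS)
print(ambiguous)
```

Each of these four strings returns hits for both words in a non-lemmatised database, which is why the result has to be checked manually.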
Advantages of lemmatised databases
Example: Search query for ‹lex, legis› (= law) in the Augustinian oeuvre using the lemmatised text database of the Corpus Augustinianum Gissense a Cornelio Mayer editum (CAG-online):
Enter l:lex
Result: Within a few seconds you will find all 8,000 occurrences of ‹lex› in any of its forms, and nothing else; identical forms of other words are not included.
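Illustration: a minimal sketch of how a lemma query can separate identical forms once every token carries a lemma code. The `l:` prefix is taken from the example above; the numeric codes, the `Token` structure, the `search` function and the five sample tokens are hypothetical and do not describe how CAG-online is implemented.

```python
from dataclasses import dataclass

# Hypothetical numeric lemma codes; a real lemmatised corpus defines its own.
LEMMA_CODES = {"lex": 1, "legere": 2}

@dataclass(frozen=True)
class Token:
    position: int    # running position in the invented sample
    form: str        # word form as it appears in the text
    lemma_code: int  # code of the base form assigned during lemmatisation

# Invented sample tokens; not corpus data.
TOKENS = [
    Token(0, "leges", LEMMA_CODES["lex"]),     # "laws"
    Token(1, "leges", LEMMA_CODES["legere"]),  # "you will read"
    Token(2, "legi", LEMMA_CODES["legere"]),   # "I have read"
    Token(3, "legi", LEMMA_CODES["lex"]),      # "to the law"
    Token(4, "lex", LEMMA_CODES["lex"]),       # "law"
]

def search(query, tokens):
    """Treat 'l:<lemma>' as a lemma query, anything else as a plain form query."""
    if query.startswith("l:"):
        code = LEMMA_CODES[query[2:]]
        return [t for t in tokens if t.lemma_code == code]
    return [t for t in tokens if t.form == query]

lemma_hits = [t.position for t in search("l:lex", TOKENS)]
form_hits = [t.position for t in search("leges", TOKENS)]
print(lemma_hits)  # only tokens lemmatised as lex
print(form_hits)   # every token spelled "leges", whatever its lemma
```

The lemma query returns positions 0, 3 and 4, while the plain form query for "leges" also returns position 1, a form of ‹legere›.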