# Setup and Testing

We'll bring plain text files into the notebook for processing. There are 7 text files, written in Latin in the first century BC, comprising Julius Caesar's _Commentaries on the Gallic Wars_.

## Prerequisites

1. Python versions 3.7, 3.8, or 3.9
2. __The Classical Language Toolkit (https://docs.cltk.org/en/latest/index.html)__

## First steps

In [1]:
from cltk import NLP

In [2]:
cltk_nlp = NLP(language="lat")

‎𐤀 CLTK version '1.1.1'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.


In [5]:
# read the first file, Gallic Wars Book 1, which is in the same directory as this notebook

with open("gall1.txt") as fo:
 caesar_book1 = fo.read()

In [6]:
# text snippet

caesar_book1[:200]

'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differun'

In [7]:
# let's get some estimates

print("Character count:", len(caesar_book1))
print("Approximate token count:", len(caesar_book1.split()))

Character count: 57955
Approximate token count: 8173


In [4]:
# removing ``LatinLexiconProcess`` before running cltk_nlp.analyze because it's slow

cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)

[, , , ]


In [9]:
# now execute NLP algorithms upon input text
# execution time is ~60 sec on a my Thinkpad T460s

%time cltk_doc = cltk_nlp.analyze(text=caesar_book1)

CPU times: total: 2min 56s
Wall time: 2min 16s


In [41]:
# have a look at the first 10 words

cltk_doc.tokens[:10] # note that punctuation is included here

['Gallia',
 'est',
 'omnis',
 'divisa',
 'in',
 'partes',
 'tres',
 ',',
 'quarum',
 'unam']

In [47]:
# let's remove punctuation

caesar_tokens_no_punct = [token for token in cltk_doc.tokens if token not in ['.', ',', ':', ';']]
caesar_word_tokens_no_punct[:10]

['Gallia',
 'est',
 'omnis',
 'divisa',
 'in',
 'partes',
 'tres',
 'quarum',
 'unam',
 'incolunt']

In [43]:
# instead of tokens(words), let's find the root words, or the "lemmata"

cltk_doc.lemmata[:10]

['Gallia', 'sum', 'omnis', 'divisa', 'in', 'pars', 'tres', ',', 'qui', 'unus']

In [49]:
# let's remove punctuation

caesar_lemmata_no_punct = [token for token in cltk_doc.lemmata if token not in ['.', ',', ':', ';']]
caesar_lemmata_no_punct[:10]

['Gallia',
 'sum',
 'omnis',
 'divisa',
 'in',
 'pars',
 'tres',
 'qui',
 'unus',
 'incaleo']

# Book 1

## A cursory look at Book 1 reveals the German King Ariovistus is the enemy most often mentioned by Caesar. Exactly how many times can we find Ariovistus in Book 1?

In [7]:
from collections import Counter

In [60]:
# here is a dictionary of every word in the Book 1 along with how many times each word appears

caesar_word_counts = Counter(caesar_lemmata_no_punct)
caesar_word_counts

Counter({'Gallia': 15,
 'sum': 223,
 'omnis': 66,
 'divisa': 2,
 'in': 177,
 'pars': 26,
 'tres': 7,
 'qui': 213,
 'unus': 23,
 'incaleo': 4,
 'Belgae': 3,
 'alius': 12,
 'Aquitani': 1,
 'tertius': 10,
 'ipse': 44,
 'lingua': 3,
 'Celtae': 1,
 'noster': 39,
 'Galli': 1,
 'appello': 9,
 'is': 269,
 'instituo': 3,
 'lex': 3,
 'inter': 17,
 'se': 162,
 'differo': 1,
 'Gallos': 4,
 'ab': 102,
 'Aquitanis': 1,
 'Garumna': 3,
 'flumen': 21,
 'Belgis': 1,
 'Matrona': 1,
 'et': 193,
 'Sequana': 1,
 'disco': 5,
 'fortis': 2,
 'propterea': 15,
 'quod': 82,
 'cultus': 2,
 'atque': 75,
 'humanitas': 2,
 'provinciae': 8,
 'longe': 7,
 'absum': 8,
 'minimeque': 1,
 'ad': 107,
 'mercator': 1,
 'saepe': 5,
 'commeo': 1,
 'effemino': 1,
 'animus': 10,
 'pertineo': 6,
 'importo': 1,
 'proximique': 1,
 'Germanis': 5,
 'trans': 7,
 'Rhenum': 15,
 'quicum': 3,
 'continenter': 2,
 'bellum': 29,
 'gero': 8,
 'Qua': 2,
 'de': 36,
 'causa': 25,
 'Helvetii': 23,
 'quoque': 1,
 'reliquus': 17,
 'virtute': 10,
 '

In [93]:
caesar_word_counts['Ariovistus']

20

In [76]:
caesar_word_counts['Ariovistum']

10

## The above two lines show that the lemmatizer does not work for proper names. We'll have to search the text for every grammatical case

In [84]:
# show me how many times Ariovistus is named in this text, for every case of the word "Ariovistus", namely: nominative, vocative, accusative, genative, dative, ablative


nom = caesar_word_counts['Ariovistus']
voc = caesar_word_counts['Arioviste']
acc = caesar_word_counts['Ariovistum']
gen = caesar_word_counts['Ariovisti']
abl = caesar_word_counts['Ariovisto'] # same as dative case

print(nom + acc + voc + gen + abl)

42


# Book 2
## Let's do the same processing on Book 2, simplifying the code as we go. We will choose another target, the Druid Diviciacus

In [100]:
fo.close()
with open("gall2.txt") as fo:
 caesar_book2 = fo.read()

cltk_doc2 = cltk_nlp.analyze(text=caesar_book2)

In [113]:
caesar_word_counts = Counter(cltk_doc2.tokens)
nom = caesar_word_counts['Diviciacus']
voc = caesar_word_counts['Diviciace']
acc = caesar_word_counts['Diviciacum']
gen = caesar_word_counts['Diviciaci']
abl = caesar_word_counts['Diviciaco'] # same as dative case
print(nom + acc + voc + gen + abl)


5


# Book 3
## Viridovix, the Gallic Chieftan

In [18]:
fo.close()
with open("gall3.txt") as fo:
 caesar_book3 = fo.read()

cltk_doc3 = cltk_nlp.analyze(text=caesar_book3)

In [19]:
caesar_word_counts = Counter(cltk_doc3.tokens)

nom = caesar_word_counts['Viridovix']
# voc = caesar_word_counts['Viridovix'] # same as nominative
acc = caesar_word_counts['Viridovigem']
gen = caesar_word_counts['Viridovigis']
dat = caesar_word_counts['Viridovigi']
abl = caesar_word_counts['Viridovige']
print(nom + acc + voc + gen + abl)


5


# Book 4
## Ariovistus mentioned again, but just one time. On to Book 5

# Book 5
## The Belgic King and Chieftan Ambiorix

In [15]:
fo.close()
with open("gall5.txt") as fo:
 caesar_book5 = fo.read()

cltk_doc5 = cltk_nlp.analyze(text=caesar_book5)

In [17]:
caesar_word_counts = Counter(cltk_doc5.tokens)

nom = caesar_word_counts['Ambiorix']
# voc = caesar_word_counts['Ambiorix'] # same as nominative
acc = caesar_word_counts['Ambiorigem']
gen = caesar_word_counts['Ambiorigis']
dat = caesar_word_counts['Ambiorigi']
abl = caesar_word_counts['Ambiorige']
print(nom + acc + voc + gen + abl)


20


# Book 6
## Ambiorix, once more

In [20]:
fo.close()
with open("gall6.txt") as fo:
 caesar_book6 = fo.read()

cltk_doc6 = cltk_nlp.analyze(text=caesar_book6)

In [21]:
caesar_word_counts = Counter(cltk_doc6.tokens)

nom = caesar_word_counts['Ambiorix']
# voc = caesar_word_counts['Ambiorix'] # same as nominative
acc = caesar_word_counts['Ambiorigem']
gen = caesar_word_counts['Ambiorigis']
dat = caesar_word_counts['Ambiorigi']
abl = caesar_word_counts['Ambiorige']
print(nom + acc + voc + gen + abl)


18


# Book 7
## Vercingetorix, King and Chieftan of the Arverni and leader of the unified Gallic revolt against the Romans. Of all the antagonists, he is mentioned most by Caesar.

In [24]:
fo.close()
with open("gall7.txt") as fo:
 caesar_book7 = fo.read()

cltk_doc7 = cltk_nlp.analyze(text=caesar_book7)

In [26]:
caesar_word_counts = Counter(cltk_doc7.tokens)

nom = caesar_word_counts['Vercingetorix']
# voc = caesar_word_counts['Vercingetorix'] # same as nominative
acc = caesar_word_counts['Vercingetorigem']
gen = caesar_word_counts['Vercingetorigis']
dat = caesar_word_counts['Vercingetorigi']
abl = caesar_word_counts['Vercingetorige']
print(nom + acc + voc + gen + abl)


46


# Results

### For these seven books of _The Gallic Wars_ we knew ahead of time who were the main foes Caesar mentions in each book. The task has been to count the number of mentions in each text and to infer their relative importance in the resistance to the Roman campaigns. The quantitative results we arrived at here could also have been found using the search function of a text editor. But the methods provided by the Classical Language Toolkit are appropriate for this text because they take into account the morphology and syntax of the language. Indeed, a long text such as a novel might not easily be handled by a text editor, and a more powerful set of instruments for natural language processing is often required.