Thank You, Fairy, for Bringing My Niece a Family After Years Apart

“Mummy, when will the fairy give me a daddy?” my daughter once asked, her wide eyes brimming with more hope than I could bear. She and I often played make-believe, drawing pictures and spinning tales. That day, she pulled a sketch from her box—a little girl whispering to a tiny winged figure. Then she found another drawing—herself doing morning stretches, laughter captured in pencil strokes.

“I’ll do my exercises like this, then splash my face with water, Mummy!” she chirped before drifting off to sleep, content.

That moment made me realise life’s unpredictability. But let me start properly.

Years ago, I studied at a teaching college with my dearest friend, Emily. We were inseparable—late-night study sessions, shared dreams of the future. After graduation, we both became schoolteachers. Emily had a gift; she illustrated children’s books in her spare time, her imagination boundless. Her talent caught the eye of a London publisher, and soon, she was offered a contract abroad. She left—for three long years. We wrote, called, missed each other terribly.

When Emily returned to Manchester, she wasn’t alone. A little girl clung to her hand—her daughter. She never spoke of the father. Her own parents were gone by then. She raised the child alone, and I did my best to help. Lily was sunshine itself. In quiet moments, Emily sketched—her daughter as a schoolgirl, a teen, a woman. The precision shook me.

“How do you know what she’ll look like?” I’d ask.

“Wait and

# February 27th, 2020

## Updates from yesterday
– I decided not to implement the line-length alignment yet, because data processing is more important right now.
– I think I might have to write a custom tokenizer, because the C++ tokenizer in CLTK we currently use seems oversimplified (see [the tokenizer’s test file](https://github.com/cltk/cltk/blob/master/cltk/tokenize/latin/tests.py) for an example of its outputs). It has no handling for proper nouns (e.g. *Rōma* being split into *Rō* and *ma*), abbreviations like *S.P.Q.R.*, or hyphenated forms like *līberōrum-que*. That’s bad, because the NER model will then see fragments of tokens that should carry a single label (e.g. *Rōma → LOC*). One option is to port [this](https://nlp.stanford.edu/software/tokenizer.shtml) to CLTK before training.
– I also have to write some kind of token-splitter/merger, since the Latin treebank is word-level but the Latin NER corpus is character-level (so I need a way to reconstruct words from characters to assign a label per word). That should be doable with some careful string processing.
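
A rough sketch of the character-to-word merging step, to make the idea concrete. It assumes the NER corpus can be read as a flat list of `(character, label)` pairs in which whitespace characters separate words; that layout, the function name `chars_to_words`, and the majority-vote rule are all assumptions for illustration, not the corpus's actual format.

```python
from collections import Counter


def chars_to_words(char_labels):
    """Group (character, label) pairs into (word, label) pairs.

    Assumption: whitespace characters mark word boundaries. Each word
    receives the most common label among its characters.
    """
    words = []
    chars, labels = [], []
    for ch, label in char_labels:
        if ch.isspace():
            # A boundary: flush the word collected so far, if any.
            if chars:
                majority = Counter(labels).most_common(1)[0][0]
                words.append(("".join(chars), majority))
                chars, labels = [], []
        else:
            chars.append(ch)
            labels.append(label)
    # Flush the final word when the input does not end in whitespace.
    if chars:
        majority = Counter(labels).most_common(1)[0][0]
        words.append(("".join(chars), majority))
    return words


# Hypothetical sample: "Rōma" tagged LOC character by character, then "est".
sample = [(c, "LOC") for c in "Rōma"] + [(" ", "O")] + [(c, "O") for c in "est"]
word_labels = chars_to_words(sample)
```

Majority voting is only one possible rule; if the corpus uses a B-/I- prefix scheme, taking the label of the first character may be a better fit.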

## Goals for today
### Main goal
– Start experimenting with alignment methods. My initial research has led me to the [SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher) class in Python’s `difflib`. Using this, I should be able to build an alignment between two documents (in this case, the treebank and the raw text). Since the Latin treebank is very small (~50k words, compared to ~3M words for the English one), this will have to do for now.
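
A minimal sketch of what a `SequenceMatcher`-based alignment could look like, assuming both documents are already tokenized into lists of strings. The helper name `align_tokens` and the sample tokens are hypothetical; only `SequenceMatcher` and `get_matching_blocks()` come from `difflib`.

```python
from difflib import SequenceMatcher


def align_tokens(treebank_tokens, raw_tokens):
    """Return (treebank_index, raw_index) pairs for tokens that match exactly."""
    # autojunk=False turns off the heuristic that treats very frequent
    # elements as junk in long sequences, which could otherwise skip
    # common words.
    matcher = SequenceMatcher(a=treebank_tokens, b=raw_tokens, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` equal tokens starting at a / b.
        # The final block is a zero-size sentinel and adds nothing.
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


# Hypothetical example: the treebank splits the enclitic, the raw text does not.
treebank = ["arma", "virum", "que", "cano"]
raw = ["arma", "virumque", "cano"]
pairs = align_tokens(treebank, raw)
```

Here only `arma` and `cano` align; the `virum` / `que` versus `virumque` mismatch is left as a gap. `matcher.get_opcodes()` reports such gaps as `replace`, `insert`, or `delete` spans, which is where any split/merge handling would need to hook in.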

### Secondary goal
– If time permits, try to get the NER tagging done (assuming I can get alignment working today).