The Tone of Two-Letter Words

In my spare time, when I get any, I’ve been fiddling around with Twitter again. I’m back to generating poems, and that work comes with a lot of text manipulation.

For the most part, it’s a lot of searching and discarding (I try not to modify too much). For example, if I find a link in a Twitter post, I discard the whole post. If I detect more than a handful of capitalized letters next to one another, I discard the whole post as well.
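
A rough sketch of what those two discard rules might look like in Python. The regex patterns, the reading of "a handful" as four or more capitals in a row, and the name `keep_post` are all assumptions for illustration, not the actual code behind the project.

```python
import re

# Assumed patterns: the post doesn't say how links or capitals are detected.
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
CAPS_RUN_RE = re.compile(r"[A-Z]{4,}")  # "a handful" assumed to mean 4 or more


def keep_post(text: str) -> bool:
    """Return True if the post survives both discard rules."""
    if LINK_RE.search(text):
        return False  # contains a link: discard the whole post
    if CAPS_RUN_RE.search(text):
        return False  # contains a run of capitals: discard the whole post
    return True
```

Note that the post is either kept untouched or thrown away entirely; nothing is edited out of it.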

As a sort of refinement, one step I’m using detects whether a post contains any two-letter words. The trick then becomes: which two-letter words are acceptable, and which are not? Some two-letter words I want to allow (on, it, or, be), while others tend more toward slang (dm, ur, ho).
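
One way the two-letter check could work, sketched in Python. Only the four allowed words come from the examples above; the word-splitting regex and the function name `has_disallowed_two_letter_word` are hypothetical, and a real allowlist would presumably be much longer.

```python
import re

# Only these four words are taken from the post; the rest of the
# allowlist is unknown, so this set is deliberately incomplete.
ALLOWED_TWO_LETTER = {"on", "it", "or", "be"}

# Assumed tokenizer: runs of ASCII letters and straight apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def has_disallowed_two_letter_word(text: str) -> bool:
    """Return True if any two-letter word is missing from the allowlist."""
    for word in WORD_RE.findall(text):
        w = word.lower()
        if len(w) == 2 and w.isalpha() and w not in ALLOWED_TWO_LETTER:
            return True
    return False
```

With this tiny allowlist, "the rain fell on it" passes, while "send a dm" gets flagged.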

More than any other step so far, this decision seems to really alter and shape the tone of the Twitter posts I allow into my program. Rather than simply allowing every two-letter word that’s legal in Scrabble, I’m finding that excluding some words results in more readable, self-contained sentences. They’re less informal.

But is this a good thing? I want to say yes, but it also feels a lot like censorship. I realize I’m curating, but this decision feels markedly different for me than anything else I’ve done. Everything else feels like code: bits of logic, string manipulation, regex patterns.

Even something as simple as: should I allow the word “ok” to be used? Should that word exist in a poem that’s generated? Is that too restrictive? Or does allowing the word “ok” result in too informal a tone?

It’s been a pretty fascinating exercise so far. I’ve gone pretty restrictive in my first pass, but I’ve taken the approach of commenting out the words I’m disallowing rather than deleting them. My concern is that once I omit a given word entirely, I’d have a hard time remembering to add it back.
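
The commenting-out approach might look something like this in Python. The variable name and layout are guesses, and the word list is limited to the examples mentioned earlier in the post.

```python
# Hypothetical layout: every candidate word stays in the file, and the
# disallowed ones are commented out instead of deleted.
TWO_LETTER_WORDS = {
    "be",
    # "dm",  # disallowed: reads as slang
    # "ho",  # disallowed: reads as slang
    "it",
    "on",
    "or",
    # "ur",  # disallowed: reads as slang
}
```

Re-allowing a word is then a one-character change, and the full list of candidates stays visible for a later second pass.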

This is such a small part of the logic I have going on, but it’s been the most thought-provoking step so far.

