
haiku-generator

What is a haiku?

I'm sure you have heard the term haiku at least once in your life. A haiku is a type of short-form poetry that originated in Japan. These poems consist of three phrases: the first and third have 5 syllables each and the second has 7, making a total of 17 syllables per poem in a 5-7-5 pattern.

Haikus have become popular in other cultures too. For example, one can find haikus written in English and many other languages, although these do not always deal with the same themes as traditional Japanese haikus. Haikus are now part of pop culture and one can easily find thousands of them with a simple Google search. So, since it seems everyone can now write a haiku and call themselves a poet, for this project I decided to create a haiku generator using Python.

How did you do this?

Briefly summarized, I compiled a bunch of texts and wrote a Python script that would output a random haiku using words from those texts, always following the 5-7-5 rule. Want to know how I did it? Keep on reading or check out the code here!

The data

To be honest, any long text would have worked for this project, as the script would simply extract words (or groups of words) from the text to write the haiku.

However, to keep it more "literary" I decided to use a variety of excerpts from famous books available at Project Gutenberg. More specifically, I used excerpts from Metamorphosis (Franz Kafka), The Great Gatsby (F. Scott Fitzgerald), Anthem (Ayn Rand), and Moby Dick (Herman Melville).

The final dataset can be found here.
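In the script, all the excerpts end up combined into a single string called texts, which is what gets fed to spaCy later on. As a rough sketch (the file names below are just placeholders), putting the dataset together could look something like this:

from pathlib import Path

# Placeholder file names; any collection of plain-text excerpts works
excerpt_files = ["metamorphosis.txt", "gatsby.txt", "anthem.txt", "moby_dick.txt"]

# Combine all excerpts into one string for spaCy to process
texts = " ".join(Path(f).read_text(encoding="utf-8") for f in excerpt_files)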

Installing libraries

The first thing I did was to install and import some helpful Python libraries. The most important ones for our purposes are spaCy and SyllaPy.

spaCy is a natural language processing library that lets you do plenty of things with language data, from tokenization to POS tagging to named entity recognition. I used it to identify patterns in the text data and build the haikus.

# Download and load the English language model
import spacy.cli
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

# Matcher will help identify patterns in the text
from spacy.matcher import Matcher

SyllaPy, on the other hand, allows you to calculate syllable counts for English words. This is important because I had to follow the 5-7-5 rule when building the haikus, so I needed to be able to count the number of syllables per word somehow.

!pip install syllapy
import syllapy
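For example, counting the syllables of a single word is a one-liner:

# SyllaPy takes a single word and returns its syllable count
print(syllapy.count("poetry"))  # a three-syllable word, so this should print 3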


Creating the script

The script has two main parts: a section where I extracted random groups of words from the text, and a section where I filtered those groups to keep only the ones with exactly 5 or 7 syllables.

Extracting groups of words

Here, I used Matcher to identify random groups of 2, 3, and 4 words in the dataset. Matcher works by specifying a linguistic pattern and applying it to your dataset. It will then identify and extract all words (or groups of words) that follow that pattern. Below, you can see an example:

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "NOUN"}, {"POS": "VERB"}]
matcher.add("SamplePattern", [pattern])

The key part here is the second line, where you specify the pattern you are interested in. The pattern always needs to appear within a list (hence the []). Also, for each token you want to identify in the text, you need to create a dictionary, where the keys name token attributes and the values specify what those attributes must be. For example, {"POS": "NOUN"} means we are looking for a token whose part of speech (POS) is a noun. Thus, all the second line does is tell Matcher to look for a noun that is followed by a verb. That's it!

One thing to keep in mind is that POS values follow the Universal Dependencies tag set, so if you are looking for an adjective, you write {"POS": "ADJ"} rather than {"POS": "ADJECTIVE"}. You can find all the Universal POS tags here.
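For instance, a pattern matching an adjective followed by a noun would look like this (just an illustration, not one of the patterns I actually used):

# An adjective followed by a noun, using the Universal POS tag "ADJ"
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]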

Once you establish a pattern, you need to identify it in the text. For that, you can use a code snippet like the one below (I added two extra print statements so you can see how it works).

doc = nlp(texts)
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text) # Print match info
    print(span.text) # Print match only

Now that you know how Matcher works, I will explain what patterns I tried to identify in the text to build my haiku. I decided to go for groups of words which were either 2, 3, or 4 words long. Furthermore, for each group of words, I specified what parts of speech I wanted the first and last word in the group to be. This is the code I used:

# Groups of 2 words

matcher2w = Matcher(nlp.vocab)

pattern = [{"POS":{"IN": ["NOUN", "ADV", "ADJ", "ADP"]}},
           {"POS":{"IN": ["VERB", "NOUN"]}}]
matcher2w.add("2words", [pattern])

# Groups of 3 words

matcher3w = Matcher(nlp.vocab)

pattern = [{"POS":{"IN": ["NOUN", "ADV", "ADJ", "VERB", "ADP"]}},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"POS":{"IN": ["VERB", "NOUN", "ADV", "ADJ"]}}]
matcher3w.add("3words", [pattern])

# Groups of 4 words

matcher4w = Matcher(nlp.vocab)

pattern = [{"POS":{"IN": ["NOUN", "ADV", "ADJ", "VERB", "ADP"]}},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"POS":{"IN": ["VERB", "NOUN", "ADV", "ADJ"]}}]
matcher4w.add("4words", [pattern])

# Identify patterns in the text

doc = nlp(texts)
matches2w = matcher2w(doc)
matches3w = matcher3w(doc)
matches4w = matcher4w(doc)

As you can see from the {"IN": [ ... ]} sequences, I allowed for plenty of flexibility: several different POS tags are accepted both at the beginning and at the end of each pattern. You can explore more patterns here.

Filtering out groups of words

After identifying groups of words, I had to discard those that had neither 5 nor 7 syllables, since they are of no use for building the haiku. To do that, I used SyllaPy to count the syllables in each group of words and kept only the groups I could actually use.

This is the code I used:

lines_5_syll = []
lines_7_syll = []

# Loop over all matches and keep spans with exactly 5 or 7 syllables
for match_id, start, end in matches2w + matches3w + matches4w:
    span = doc[start:end]

    # Count the syllables in the span
    syllable_count = 0
    for token in span:
        syllable_count += syllapy.count(token.text)

    # Store unique 5- and 7-syllable spans
    if syllable_count == 5 and span.text not in lines_5_syll:
        lines_5_syll.append(span.text)
    if syllable_count == 7 and span.text not in lines_7_syll:
        lines_7_syll.append(span.text)


Generating a haiku

Lastly, I wrote a simple line of code that prints a haiku by putting together a 5-syllable line, a 7-syllable line, and another 5-syllable line, following the 5-7-5 pattern.

This is the code I used for that (I had to import the random library first):

import random

print("{0}\n{1}\n{2}".format(random.choice(lines_5_syll),
                             random.choice(lines_7_syll),
                             random.choice(lines_5_syll)))


Sample Haiku

This is a haiku I was able to generate with this script!

closed in the evening

sense of the fundamental

now at any price

It probably doesn't make much sense, but does it have to?


If you found this post interesting, you can check out my other blog posts here!