Natural language processing


Haiku generator

A haiku is a form of short poetry that originated in Japan and consists of three lines: the first and third have 5 syllables each, and the second has 7. This makes a total of 17 syllables per poem, following a 5-7-5 pattern.

Using the natural language processing library spaCy and a dataset of texts taken from Project Gutenberg, I developed a haiku generator in Python. The code extracts groups of words from the dataset and assembles them into the 5-7-5 syllable pattern of a haiku.
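The line-assembly step described above can be sketched as follows. This is an illustrative sketch, not the project's actual code: it estimates syllables with a simple vowel-group heuristic rather than spaCy, and the function names (`count_syllables`, `build_line`) are my own.

```python
import re

def count_syllables(word):
    """Rough English syllable estimate: count vowel groups,
    subtracting a trailing silent 'e' (a heuristic, not spaCy's API)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def build_line(words, target):
    """Greedily take consecutive words, skipping any that would
    overshoot, until the line reaches exactly `target` syllables."""
    line, total = [], 0
    for word in words:
        syllables = count_syllables(word)
        if total + syllables <= target:
            line.append(word)
            total += syllables
        if total == target:
            return " ".join(line)
    return None  # no combination hit the target

# A haiku is then three such lines with targets 5, 7, and 5.
```

Calling `build_line` three times with targets 5, 7, and 5 over successive slices of the text yields one candidate poem; a real generator would also filter out lines that break mid-phrase.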

Check code on GitHub →
Read blog post →

 

Automatic dialect classifier

Language detection algorithms identify the language of a given text. For example, this type of algorithm could determine that a sentence like "Mary bakes a cake" is written in English, whereas "María cocina un pastel" is written in Spanish. This project goes a step further and tries to identify the particular dialect of a text. Specifically, using machine learning, the algorithm can detect whether a Spanish sentence belongs to the Mexican or to the Peninsular variety of the language.

The model was trained on subtitle text from movies produced either in Mexico or in Spain. The classification algorithm is a Naive Bayes classifier, and all code was written in Python. You can also learn more about this project by reading a blog post I wrote about it.
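The Naive Bayes approach can be sketched in a few lines of plain Python. This is a minimal illustration of the technique, not the project's code: the four training sentences are invented toy examples standing in for the subtitle corpus, and the helper names are my own.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Fit a multinomial Naive Bayes over word counts, with
    add-one (Laplace) smoothing on the word likelihoods."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        model[label] = {
            "prior": math.log(class_counts[label] / len(labels)),
            "likelihood": {
                w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
                for w in vocab
            },
            "unseen": math.log(1 / (total + len(vocab))),
        }
    return model

def classify(model, text):
    """Return the class with the highest posterior log probability."""
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for word in text.lower().split():
            score += params["likelihood"].get(word, params["unseen"])
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training sentences (illustrative, not the real subtitle data)
texts = ["está bien padre ese carro", "ahorita vengo amigo",
         "vale tío me mola ese coche", "venga ahora vengo"]
labels = ["MX", "MX", "ES", "ES"]
model = train_nb(texts, labels)
```

Because the classifier only compares sums of log probabilities, dialect-specific words like "ahorita" or "tío" pull a sentence strongly toward the class whose training texts contain them.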

Check code on GitHub →
Read blog post →