Automatic Dialect Classifier

Language detection algorithms help identify the language of a given text. For example, this type of algorithms could determine that a sentence like "Mary bakes a cake" is written in English, whereas "María cocina un pastel" is written in Spanish. This project goes a step further and tries to identify the particular dialect of a text. Specifically, by using machine learning, the algorithm can detect whether a Spanish sentence belongs to the Mexican or to the Peninsular variety of the language.

To train the model, text data from movie subtitles was used. In this case, all movies had been produced either in Mexico or in Spain. The machine learning classification algorithm used was a Naive Bayes classifier. All code was written in Python and can be found here. You can also learn more about this project by reading this blog post.