Linguist | Language Data Analyst | Conversation Designer | Digital Communications Specialist
I’m a linguist with extensive experience working with language data (from transcription and annotation to corpus creation) and conversational AI (from training NLU engines to designing and testing conversational flows).
I'm currently involved in projects that aim to make language technologies more inclusive and linguistically diverse, while also pursuing a PhD in Applied Linguistics.
I'm a very creative person and I also have experience working in digital communications and content creation.
In my free time, you can find me reading non-fiction books, cooking and trying out new recipes, or traveling around the world in search of new adventures.
Language corpora
The Bilinguals in the Midwest (BILinMID) Corpus documents the Spanish and English spoken in the Midwest of the United States by different types of bilinguals. The corpus comprises written transcriptions of the fairy tale Little Red Riding Hood, which participants narrated orally. It is the first corpus to document the Spanish and English spoken in this region of the United States.
The bilingual speakers who participated in the study were born either in the United States or in a Spanish-speaking country, but at the time of recording, they had all lived in the United States for an extended period. Many of them are heritage speakers, making this one of the few existing heritage language corpora. Information about each participant's background was also collected and is included in the corpus metadata.
The BILinMID Corpus is open access (released under a Creative Commons CC BY license). It has a user interface, developed with R Shiny, that allows users to perform a variety of queries. The corpus creation process is described in detail in a paper published in the ACL Anthology.
The Heritage Language Writing project sought to examine the development of Spanish heritage language learners' writing abilities over the course of a semester. For this project, a corpus was created using written data from 24 Spanish heritage language learners enrolled in a college-level writing course. The data consisted of essays and other writing assignments that learners completed throughout the semester.
After the corpus was created, the data was imported into R and analyzed using natural language processing tools and techniques to explore four areas: lexical density, lexical diversity, syntactic complexity, and lexical sophistication.
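To give a sense of what two of these metrics measure, here is a minimal Python sketch (the actual analyses were carried out in R with proper NLP tooling); the whitespace tokenizer, the tiny stopword list, and the sample sentence below are simplified stand-ins:

```python
def type_token_ratio(tokens):
    """Lexical diversity: distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def lexical_density(tokens, function_words):
    """Lexical density: proportion of content (non-function) words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in function_words) / len(tokens)

FUNCTION_WORDS = {"the", "a", "an", "and", "of", "in", "to"}  # toy stopword list
essay = "the students revised the essays and the essays improved".lower().split()
print(round(type_token_ratio(essay), 2), round(lexical_density(essay, FUNCTION_WORDS), 2))
```

Real analyses would of course use a proper tokenizer and a full function-word list, but the ratios themselves are computed exactly this way.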
Statistical analyses were conducted to examine how these areas of writing developed over the semester, and various data visualizations were created. A research paper discussing the corpus and the analyses in detail was published in the journal Languages.
Conversation design
Open D Express is a web-based chatbot that I developed for OpenDialog. Open D Express offers delivery services (e.g., scheduling the delivery of an online order). The chatbot was developed as a demo to showcase OpenDialog's software capabilities, and it is now part of OpenDialog Academy and OpenDialog's documentation.
This is a project I developed for a company that offers vehicle replacement services for insurers, brokers, and fleet providers. I used ChatGPT to create a testing dataset containing sample user utterances to evaluate the efficacy of an ML model that would classify users' intents into different categories.
Asa is a voice assistant that helps patients schedule medical appointments via phone. This is a project I developed after taking the courses Conversation Design and Designing Conversations with Voiceflow and CDI, both by the Conversation Design Institute.
Experimental research
Structural priming is a psycholinguistic phenomenon, observed in both language production and comprehension, that consists of a tendency to repeat a linguistic structure one has previously been exposed to. Structural priming has been attested not only in human-human interaction but also in human-computer interaction. There is currently considerable scientific debate about what structural priming means and why it occurs in the first place.
The Structural Priming project is actually a set of smaller projects that seek to shed light on this phenomenon. Questions examined in these projects include the nature of structural priming in various understudied populations of speakers, the influence of the language itself (i.e., why some linguistic structures are more "primeable" than others), and the effect of the interaction type (human-human vs. human-computer interaction).
Altogether, these projects contribute to our understanding of the psycholinguistic mechanisms at play in human communication. This is especially important today, as conversational technologies are being adopted on a large scale.
Below, you can find a list of published research papers reporting some of the findings of the Structural Priming project:
Hurtado, I., & Montrul, S. (2021). Priming dative clitics in spoken Spanish as a second and heritage language. Studies in Second Language Acquisition, 43(4), 729-752. Read paper →
Hurtado, I. (2021). Syntactic priming may not lead to language change. Proceedings of the 12th International Conference of Experimental Linguistics, 121-124. Read paper →
Hurtado, I. (2021). How do construction frequency effects modulate L2 priming? Proceedings of the 45th Annual Boston University Conference on Language Development, 346-359. Read paper →
Hurtado, I., & Montrul, S. (2020). Examining the effects of structural priming on three different populations: Spanish native speakers, Spanish L2 learners, and Spanish heritage speakers. Proceedings of the 44th Annual Boston University Conference on Language Development, 196-209. Read paper →
Natural language processing
A haiku is a type of short form poetry that originated in Japan and consists of three lines: the first and third have 5 syllables each, and the second has 7. This makes a total of 17 syllables per poem, following a 5-7-5 pattern.
Using the natural language processing library spaCy and a dataset of random texts taken from Project Gutenberg, I developed a haiku generator in Python. The code extracts groups of words from the dataset and puts them together following the pattern for a haiku (5-7-5 syllables per line).
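The core idea can be sketched as a heuristic syllable counter plus a greedy line builder. This simplified, standalone version does not use spaCy or the Project Gutenberg dataset; the vowel-run syllable heuristic and the sample word pool are rough stand-ins for the real pipeline:

```python
import re

def count_syllables(word):
    """Approximate English syllables as runs of vowel letters (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def build_line(pool, target):
    """Greedily take words from the pool until the syllable target is hit exactly."""
    line, total = [], 0
    for word in list(pool):  # iterate a copy so we can remove from the pool
        n = count_syllables(word)
        if total + n <= target:
            line.append(word)
            pool.remove(word)
            total += n
        if total == target:
            return " ".join(line)
    return None  # the pool could not fill this line exactly

pool = "soft rain falls tonight over the sleeping garden wall again".split()
print(build_line(pool, 5))
```

The vowel-run heuristic miscounts some words (e.g., silent final "e"), which is why the real generator relies on proper NLP tooling, but it is enough to show how lines are assembled against the 5-7-5 pattern.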
Language detection algorithms identify the language of a given text. For example, this type of algorithm could determine that a sentence like "Mary bakes a cake" is written in English, whereas "María cocina un pastel" is written in Spanish. This project goes a step further and tries to identify the particular dialect of a text: using machine learning, the algorithm detects whether a Spanish sentence belongs to the Mexican or the Peninsular variety of the language.
To train the model, text data from movie subtitles was used; all the movies had been produced either in Mexico or in Spain. The classification algorithm was a Naive Bayes classifier, and all code was written in Python. You can also learn more about this project in a blog post I wrote about it.
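To illustrate the approach, here is a minimal Naive Bayes classifier written from scratch, with word-count features and Laplace smoothing. The toy sentences below are invented stand-ins for the subtitle data (the actual model was trained on full subtitle corpora):

```python
import math
from collections import Counter, defaultdict

# Tiny stand-in training set: (sentence, dialect label) pairs.
TRAIN = [
    ("ustedes quieren un carro nuevo", "MX"),
    ("ahorita vengo con la computadora", "MX"),
    ("vosotros queréis un coche nuevo", "ES"),
    ("ahora vengo con el ordenador", "ES"),
]

def train(examples):
    """Count word frequencies per class for a multinomial Naive Bayes."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.split())
    vocab = {w for freqs in counts.values() for w in freqs}
    return counts, vocab

def classify(text, counts, vocab):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    best, best_score = None, -math.inf
    for label, freqs in counts.items():
        total = sum(freqs.values())
        score = 0.0  # uniform prior over classes, so priors cancel out
        for word in text.split():
            score += math.log((freqs[word] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

counts, vocab = train(TRAIN)
print(classify("quieren un carro", counts, vocab))
```

Dialect-marked vocabulary (carro/coche, computadora/ordenador, ustedes/vosotros) is exactly the kind of signal a word-level Naive Bayes picks up, which is why it works well for this task despite its simplicity.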