Language corpora

The BILinMID Corpus

The Bilinguals in the Midwest (BILinMID) Corpus documents the Spanish and English spoken in the Midwest of the United States by different types of bilinguals. The corpus comprises written transcriptions of participants' oral narrations of the fairy tale Little Red Riding Hood. It is the first corpus to document the Spanish and English spoken in this region of the United States.

The bilingual speakers who participated in the study were born either in the United States or in a Spanish-speaking country, but all of them had lived in the United States for a long time when they were recorded. Many of them are heritage speakers, which makes this one of the few heritage language corpora available. Information about each participant's background was also collected and is part of the corpus metadata.

The BILinMID Corpus is open access, licensed under a Creative Commons Attribution (CC BY) license. Its user interface, built with R Shiny, lets users run a variety of queries. The corpus creation process is described in detail in a paper published in the ACL Anthology.

Explore corpus →
Check code on GitHub →
Read paper →

The Heritage Language Writing Corpus

The Heritage Language Writing project examined how the writing abilities of Spanish heritage language learners developed over the course of a semester. For this project, a corpus was created from the written work of 24 Spanish heritage language learners enrolled in a college-level writing course. The data consisted of essays and other writing assignments that the learners completed throughout the semester.

After the corpus was created, the data were imported into R and analyzed with natural language processing tools and techniques to explore four areas: lexical density, lexical diversity, syntactic complexity, and lexical sophistication.
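As a rough illustration of two of these measures, the sketch below computes lexical diversity as a type-token ratio and lexical density as the share of content words. It is not the project's code: the original analysis was done in R, Python is used here only for illustration, and the sample sentence, the function names, and the tiny function-word list are all hypothetical. A real analysis would identify content words with a part-of-speech tagger rather than a stoplist.

```python
import re

# Hypothetical, deliberately tiny list of Spanish function words.
# Assumption for illustration only; a tagger would be used in practice.
FUNCTION_WORDS = {
    "el", "la", "los", "las", "un", "una", "de", "en",
    "y", "a", "que", "por", "con", "se", "su",
}


def tokenize(text):
    """Lowercase the text and return runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lexical_diversity(tokens):
    """Type-token ratio: unique word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Share of tokens that are not in the function-word list."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


# Hypothetical sample sentence, not taken from the corpus.
essay = "La niña camina por el bosque y la niña canta"
tokens = tokenize(essay)
print("diversity:", lexical_diversity(tokens))
print("density:", lexical_density(tokens))
```

The type-token ratio is sensitive to text length, so comparisons across essays of different lengths usually call for a length-corrected measure.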

Statistical analyses were conducted to examine how these four areas developed over the course of the semester, and various data visualizations were created. A research paper published in the journal Languages discusses the corpus and the analyses in detail.

Check code on GitHub →
Read paper →