Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers.
Further work on language identification is on our TwitterLangID page.
We also develop new Universal Dependency annotations and parsing models for AAE on Twitter (below).
This dataset is made available for research purposes only. If you use it in research, please cite Blodgett et al., EMNLP 2016 (above).
Our implementation of the mixed-membership demographic language model calculates demographic dialect proportions for a text, including the AAE proportion. It includes the parameters learned from Census+Twitter, as described in the EMNLP 2016 paper and used in its experiments. It is available at: github.com/slanglab/twitteraae.
This dataset and model are made available for research purposes only. If you use them in research, please cite Blodgett et al., ACL 2018 (above).
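To make the idea of "demographic dialect proportions for a text" more concrete, here is a minimal sketch of how a mixed-membership model can turn per-category word probabilities into proportions for one message. It is not the released implementation and does not use its API. The function name `estimate_proportions`, the category labels `group_a` and `group_b`, and all probabilities are hypothetical toy values, and the simple EM-style fixed-point update is an illustrative stand-in that may differ from the inference used in the paper.

```python
from typing import Dict, List

# Hypothetical toy parameters: P(word | category).
# These are NOT the parameters learned from Census+Twitter.
TOY_WORD_PROBS: Dict[str, Dict[str, float]] = {
    "group_a": {"hello": 0.6, "world": 0.3, "there": 0.1},
    "group_b": {"hello": 0.1, "world": 0.3, "there": 0.6},
}


def estimate_proportions(
    tokens: List[str],
    word_probs: Dict[str, Dict[str, float]],
    iterations: int = 50,
    smoothing: float = 0.01,
) -> Dict[str, float]:
    """Estimate category proportions for one text with fixed word probabilities."""
    categories = list(word_probs)
    # Ignore tokens that no category has a probability for.
    known = [t for t in tokens if any(t in word_probs[c] for c in categories)]
    # Start from uniform proportions; return them if nothing is recognised.
    theta = {c: 1.0 / len(categories) for c in categories}
    if not known:
        return theta

    for _ in range(iterations):
        # Smoothed expected number of tokens attributed to each category.
        counts = {c: smoothing for c in categories}
        for tok in known:
            # Responsibility of each category for this token.
            weights = {
                c: theta[c] * word_probs[c].get(tok, 1e-12) for c in categories
            }
            total = sum(weights.values())
            for c in categories:
                counts[c] += weights[c] / total
        # Renormalise expected counts into proportions.
        norm = sum(counts.values())
        theta = {c: counts[c] / norm for c in categories}
    return theta


if __name__ == "__main__":
    print(estimate_proportions("hello hello world".split(), TOY_WORD_PROBS))
```

The returned dictionary sums to one, and a text whose words are more probable under one category receives a larger proportion for that category. For real analyses, use the released code and parameters from the repository rather than this sketch.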