TwitterAAE: Research on African-American English on Twitter


Papers

Demographic Dialectal Variation in Social Media: A Case Study of African-American English. Su Lin Blodgett, Lisa Green, and Brendan O'Connor. Proceedings of EMNLP 2016. [pdf]

Twitter Universal Dependency Parsing for African-American and Mainstream American English. Su Lin Blodgett, Johnny Tian-Zheng Wei, and Brendan O'Connor. Forthcoming, Proceedings of ACL 2018.

Overview

Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers.

Further work on language identification is on our TwitterLangID page.

We also develop new Universal Dependencies annotations and parsing models for AAE on Twitter (below).

Universal Dependencies annotated tweets

This dataset and model are made available for research purposes only. If you use them in research, please cite Blodgett et al., ACL 2018 (above).

  • UD-TwitterAAE: Our Universal Dependencies annotations for 250 African-American English and 250 Mainstream American English tweets. Forthcoming.
  • We hope to also release a UD parsing model; our research details experiments with UDPipe and a deep biaffine model.

TwitterAAE corpus

This dataset is made available for research purposes only. If you use it in research, please cite Blodgett et al., EMNLP 2016 (above).

  • TwitterAAE-full-v1.zip (5.5 GB): messages and our model's inferences.
  • Also: TwitterAAE-deps-v1.zip, an older annotated dataset containing partial Stanford Dependencies-style syntactic annotations for 200 messages. This data is superseded by the UD dataset.