SLANG Lab at UMass

The UMass Global English on Twitter Dataset

This dataset is made available for research purposes only. If you use it in research, please cite Blodgett et al., WNUT 2017 (below).

umass_global_english_tweets-v1.zip: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages and sent from 130 different countries. The tweets are annotated for being English or non-English, containing code switching, being ambiguous in language, or having been automatically generated. (README) Our data annotations are licensed under the Creative Commons Attribution 4.0 International License.

Software: Language Identifier

This software is made available for research purposes only. If you use it in research, please cite Blodgett et al., EMNLP 2016 (below).

umass_demoens_langid-v1.zip: Our demographic ensemble language identifier. It provides the demographic topic proportions for messages and extends the open-source langid.py to improve recall for English messages.

Data: AAE and demographics-tagged English on Twitter

A subset of our TwitterAAE corpus used for further investigation into race/dialect disparity in language identification. (To be posted - check back for updates)

Papers

On the UMass Global English on Twitter Dataset:

A Dataset and Classifier for Recognizing Social Media English.
Su Lin Blodgett, Johnny Tian-Zheng Wei, and Brendan O'Connor.
3rd Workshop on Noisy User-generated Text (WNUT) at EMNLP 2017.

On the demographic ensemble English language identifier and racial disparity of language identification:

Demographic Dialectal Variation in Social Media: A Case Study of African-American English.
Su Lin Blodgett, Lisa Green, and Brendan O'Connor.
Proceedings of EMNLP 2016.

Further experiments in racial disparity and language identification:

Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English.
Su Lin Blodgett and Brendan O'Connor.
Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) workshop at KDD 2017.

Twitter Language Identification: Data and Software
