Twitter Language Identification: Data and Software


The UMass Global English on Twitter Dataset

This dataset is made available for research purposes only. If you use it in research, please cite Blodgett et al., EMNLP 2016 (below).

umass_global_english_tweets-v1.zip: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages, each annotated for whether it is in English, is not in English, contains code-switching, is ambiguous in language, or was automatically generated. It includes messages sent from 130 different countries. (README) Our data annotations are licensed under the Creative Commons Attribution 4.0 International License.

Software: Language Identifier

This software is made available for research purposes only. If you use it in research, please cite Blodgett et al., EMNLP 2016 (below).

umass_demoens_langid-v1.zip: Our demographic ensemble language identifier. It provides the demographic topic proportions for messages and extends the open-source langid.py to improve recall for English messages.

Data: AAE and demographics-tagged English on Twitter

A subset of our TwitterAAE corpus used for further investigation into race/dialect disparity in language identification. (To be posted - check back for updates)

Papers

On the UMass Global English on Twitter Dataset:

On the demographic ensemble English language identifier and racial disparity of language identification:

Further experiments in racial disparity and language identification: