Twitter Language Identification: Data and Software


The UMass Global English on Twitter Dataset

umass_global_english_tweets-v1.zip: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. Each tweet is annotated as English or non-English, and for code switching, language ambiguity, and automatic generation. The messages were sent from 130 different countries. (README) Our data annotations are licensed under the Creative Commons Attribution 4.0 International License.

Software: Language Identifier

Our demographic ensemble language identifier, which extends the open-source langid.py to improve recall for English messages. (To be posted; check back for updates.)
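
Until the software is posted, the general idea of an ensemble that raises English recall can be illustrated with a minimal sketch. This is not the released tool and not necessarily the method it uses: it assumes a base identifier that returns a (language code, score) pair, as langid.py's classify function does, and a hypothetical secondary scorer that returns an estimated probability of English. The names is_english, toy_base, toy_secondary and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Callable, Tuple

# Hypothetical signatures; neither is the released tool's API.
BaseIdentifier = Callable[[str], Tuple[str, float]]  # -> (language code, score)
EnglishScorer = Callable[[str], float]               # -> estimated P(English) in [0, 1]


def is_english(text: str,
               base: BaseIdentifier,
               secondary: EnglishScorer,
               threshold: float = 0.5) -> bool:
    """Accept a message as English if either component says so.

    Taking the union of the two decisions can only add English
    predictions, so recall cannot drop relative to the base identifier
    (precision may).
    """
    lang, _ = base(text)
    if lang == "en":
        return True
    return secondary(text) >= threshold


# Toy stand-ins so the sketch runs without any external package.
def toy_base(text: str) -> Tuple[str, float]:
    # Pretend the base identifier only recognizes text containing "the".
    return ("en", 1.0) if " the " in f" {text.lower()} " else ("und", 0.0)


def toy_secondary(text: str) -> float:
    # Fraction of tokens found in a tiny, made-up list of English cues.
    english_cues = {"lol", "bruh", "tho", "finna"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in english_cues for t in tokens) / len(tokens)
```

With langid.py installed, langid.classify could be passed as the base argument in place of toy_base; the secondary scorer would need to be replaced by a real model.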

Data: AAE and demographics-tagged English on Twitter

A subset of our TwitterAAE corpus used for further investigation into race/dialect disparity in language identification. (To be posted - check back for updates)

Papers

On the UMass Global English on Twitter Dataset:

On the demographic ensemble English language identifier and racial disparity of language identification:

Further experiments in racial disparity and language identification: