Twitter Language Identification: Data and Software

The UMass Global English on Twitter Dataset

This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. Each tweet is annotated for whether it is in English or non-English, contains code switching, is ambiguous as to language, or was automatically generated. The dataset includes messages sent from 130 different countries. (README) Our data annotations are licensed under the Creative Commons Attribution 4.0 International License.

Software: Language Identifier

Our demographic ensemble language identifier. It extends existing open-source software to improve recall for English messages. (To be posted - check back for updates)

Data: AAE and demographics-tagged English on Twitter

A subset of our TwitterAAE corpus used for further investigation into race/dialect disparity in language identification. (To be posted - check back for updates)


On the UMass Global English on Twitter Dataset:

On the demographic ensemble English language identifier and racial disparity of language identification:

Further experiments in racial disparity and language identification: