umass_global_english_tweets-v1.zip: This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages and annotated for being in English or non-English, or for exhibiting code switching, language ambiguity, or automatic generation. It includes messages sent from 130 different countries. (README) Our data annotations are licensed under the Creative Commons Attribution 4.0 International License.
Our demographic ensemble language identifier. It extends the open-source langid.py to improve recall for English messages. (To be posted - check back for updates)
A subset of our TwitterAAE corpus used for further investigation into race/dialect disparity in language identification. (To be posted - check back for updates)
On the UMass Global English on Twitter Dataset:
On the demographic ensemble English language identifier and racial disparity of language identification:
Further experiments in racial disparity and language identification: