The UMass Global English on Twitter Dataset
Version 1, released September 8, 2017
Su Lin Blodgett, Johnny Tian-Zheng Wei, and Brendan O'Connor
http://slanglab.cs.umass.edu/TwitterLangID/

This dataset contains 10,502 tweets, randomly sampled from all publicly
available geotagged Twitter messages.  Each tweet is annotated as English or
non-English, and for code switching, language ambiguity, and automatic
generation.  The dataset includes messages sent from 130 different countries.

The dataset is described in the paper:

@inproceedings{blodgett:2017wnut,
	Author = {Su Lin Blodgett and Johnny Tian-Zheng Wei and Brendan O'Connor},
	Booktitle = {Proceedings of the Workshop on Noisy User-Generated Text},
	Publisher = {Association for Computational Linguistics},
	Title = {Recognizing Global Social Media English with U.S. Demographic Modeling},
	Year = {2017},
}

If you make use of this dataset in research, please consider citing our paper.
Thanks!

The file all_annotated.tsv contains the dataset of 10,502 tweets used in the
paper.  Text is encoded as UTF-8.

The column headings (also given in the .tsv file) are: tweet ID, ISO country
code, tweet date, tweet text, definitely English, ambiguous, definitely not
English, code-switched, ambiguous due to named entities, and automatically
generated.
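
As a rough illustration of reading the file, here is a minimal Python sketch.
It assumes the ten columns appear in the order listed above and that the first
row holds the headings.  The field names in FIELDS and the function
read_tweets are hypothetical, chosen for this example; the actual heading
strings in the .tsv file may differ.  Rows are mapped by position for that
reason.

```python
import csv
import io

# Hypothetical field names, in the column order listed in this README.
# The real heading strings in all_annotated.tsv may be different.
FIELDS = [
    "tweet_id", "country", "date", "text",
    "definitely_english", "ambiguous", "definitely_not_english",
    "code_switched", "ambiguous_named_entities", "automatically_generated",
]


def read_tweets(handle, has_header=True):
    """Yield one dict per tweet from an open tab-separated text stream."""
    # QUOTE_NONE keeps quotation marks inside tweet text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the heading row
    for row in reader:
        if len(row) != len(FIELDS):
            # Assumption: a row with a different column count does not fit
            # the layout above, so this sketch skips it rather than guess.
            continue
        yield dict(zip(FIELDS, row))


# Usage (assumes the file is in the working directory):
# with open("all_annotated.tsv", encoding="utf-8", newline="") as f:
#     tweets = list(read_tweets(f))
```

All values are returned as strings, so the annotation columns still need to be
interpreted by the caller.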

All annotations are binary; the definitely English, ambiguous, and definitely
not English annotations are mutually exclusive.  The experiments reported in
our paper exclude tweets labeled as ambiguous or automatically generated.
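
The exclusion rule and the mutual-exclusivity property can be sketched in
Python as below.  This is an illustration, not the code used in the paper.
It assumes each tweet is a dict keyed by hypothetical field names, and that a
positive label is stored as 1 (or the string "1"); check the .tsv file for the
actual encoding.

```python
def is_flagged(value):
    """Interpret a binary annotation; assumes 1, "1" or True means positive."""
    return str(value).strip() in {"1", "True", "true"}


def keep_for_experiments(tweet):
    """Return True unless the tweet is ambiguous or automatically generated."""
    return not (
        is_flagged(tweet["ambiguous"])
        or is_flagged(tweet["automatically_generated"])
    )


def has_exclusive_language_label(tweet):
    """Check that at most one of the three mutually exclusive labels is set."""
    keys = ("definitely_english", "ambiguous", "definitely_not_english")
    return sum(is_flagged(tweet[key]) for key in keys) <= 1


# Usage: filtered = [t for t in tweets if keep_for_experiments(t)]
```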

Our paper also describes a language classifier, which is available on our
website: http://slanglab.cs.umass.edu/TwitterLangID/

Our data annotations are licensed under the Creative Commons Attribution 4.0
International License: https://creativecommons.org/licenses/by/4.0/