Public Datasets that might be of interest for projects (in
addition to the links already on the syllabus)
NOTE THAT THIS LIST IS A RANDOM SAMPLE OF THINGS I HAPPEN TO KNOW
ABOUT; THERE ARE PROBABLY OTHERS!
Graded Essays
ASAP: Automated Student Assessment Prize data
Native Language Identification Shared Task:
Each essay in the TOEFL11 corpus is labeled with an English language
proficiency level (high, medium, or low)
International Corpus of Learner English (several different types of grades)
CityU corpus of essay drafts of English learners: sentences in these drafts
are annotated with comments and error codes by language tutors and are
aligned to sentences in subsequent drafts; a final grade is also available
Argument Mining
Essays, User-Generated Web Discourse, Scientific Articles, News Articles
Internet argument corpus
Wikipedia
Discourse
as Dialog Acts
Text Segmentation
Reviews
Sentiment Corpus
Email and Blogs
Topic Segmentation and Labeling
My Data
You can also take a look at my papers; if there is data you are
interested in, I may be able to share it with you in some cases.
Papers by Zhang, Rahimi, and Nguyen might be particularly relevant.
Updates since original post: Your data? NB annotated papers
from the course?