Public Datasets that might be of interest for projects (in addition to the links already on the syllabus)

NOTE THAT THIS LIST IS A RANDOM SAMPLE OF THINGS I HAPPEN TO KNOW ABOUT; THERE ARE PROBABLY OTHERS!

Graded Essays

  • ASAP: Automated Student Assessment Prize data
  • Native Language Identification Shared Task: Each essay in the TOEFL11 corpus is labeled with an English language proficiency level (high, medium, or low)
  • International Corpus of Learner English (several different types of grades)
  • CityU corpus of essay drafts of English learners: Sentences in these drafts are annotated with comments and error codes by language tutors, and are aligned to sentences in subsequent drafts; final grade available

Argument Mining

  • Essays, User-Generated Web Discourse, Scientific Articles, News Articles
  • Internet argument corpus

Wikipedia

  • Discourse as Dialog Acts
  • Text Segmentation

Reviews

  • Sentiment Corpus

Email and Blogs

  • Topic Segmentation and Labeling

My Data

  • You can also look through my papers; if there is data you are interested in, I may be able to share it in some cases. Papers by Zhang, Rahimi, and Nguyen might be particularly relevant.

Updates since original post: Your data? NB annotated papers from the course?