Text Mining, Spring 2018

Elena Filatova, efilatova@citytech.cuny.edu

Office: GC 4410

Class is held in room: GC 6494

Tools and Books

Grading Policy


Azure Notebooks

Programming Assignments

Date Topic Reading Assignment
Week 1
Jan. 29
Introduction, word statistics, text similarity measures

lecture notes


Before the Internet, Librarians Would Answer Everything and Still Do

Alibaba’s and Microsoft’s results for the Stanford Reading Comprehension Test

Week 2
Feb. 5
Basic text processing concepts: Sentence Splitting, Word Tokenization, Types, Tokens
Vector space model: binary representation, classic IR system

lecture notes

Turney and Pantel, 2010. From Frequency to Meaning: Vector Space Models of Semantics, Journal of AI Research 37 (141-188)

Chung and Pennebaker, 2007. The Psychological Functions of Function Words, Social Communication (343-359)

Danescu-Niculescu-Mizil et al. 2012. Echoes of Power: Language Effects and Power Differences in Social Interaction

Niederhoffer and Pennebaker, 2002. Linguistic Style Matching in Social Interaction. Journal of Language and Social Psychology 2002 21: 337.

Edit distance

Feb. 12
Feb. 19:
No class: The Graduate Center is closed
Week 3
Feb. 20
Document Classification, feature representation, cosine similarity, inverse document frequency, tf*idf weighting, term-document weighting

lecture notes

The text classification problem

Naive Bayes text classification

Vector space classification

D. Blei and J. Lafferty, 2009. Topic Models

Check topic modeling visualization in the corresponding Wikipedia page

Mosteller and Wallace, 1964. Deciding Authorship

More references on Authorship and Style

Week 4
Classification Evaluation, Neural Nets

lecture notes

R. Collobert, et al. 2011. Natural Language Processing (Almost) from Scratch

Learning the Meaning Behind Words. Google Research Blog.

Mikolov et al. 2013. Efficient Estimation of Word Representations in Vector Space

M. Gales, 2016. Deep Learning notes

Week 5
Mar. 5
Unsupervised learning on textual data, Invited talk


lecture notes

Zhang and LeCun, 2015. Text Understanding from Scratch.
Week 6
Mar. 12
Word Embeddings for Information Extraction and Question Answering



lecture notes

 E. Agichtein, L. Gravano. 2000. Snowball: Extracting Relations from Large Plain-Text Collections.

M. Mintz, et al. 2009. Distant supervision for relation extraction without labeled data.

C. Sutton, A. McCallum, 2006. Tutorial: An Introduction to Conditional Random Fields for Relational Learning.

NLTK book, CH. 7: Extracting Information From Text

Week 7
Mar. 19
Proposal Presentation, Language modeling, topic modeling, LSI
Week 8
Mar. 26
Analyzing the Meaning of Sentences

Jurafsky: QA

Jurafsky: semantics

Palmer: PropBank

Navigli. 2009. Word Sense Disambiguation: a Survey.

Kingsbury, Palmer, 2002. From TreeBank to PropBank.

Reddy et al. 2014. Large-scale Semantic Parsing without Question-Answer Pairs.

NLTK book, Chapter 10


Apr. 2
No class: Spring break
Week 9
Apr. 9
Sentiment Analysis, Figurative language


 Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Ghosh et al. 2017. The Role of Conversation Context for Sarcasm Detection in Online Interactions

Thomas et al. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts

Laver et al. 2003. Extracting Policy Positions from Political Texts Using Words as Data


Apr. 16
Opinions and Trust: Using social information for sentiment analysis, Helpfulness

LDA topic modeling

lecture notes

Mohammad et al. 2016. Stance and Sentiment in Tweets

Misra and Walker, 2017. Topic Independent Identification of Agreement and Disagreement in Social Media Dialogue

Brody and Elhadad. 2010. An Unsupervised Aspect-Sentiment Model for Online Reviews

Yang et al. 2007. Semantic Analysis and Helpfulness Prediction of Text for Online Product Reviews

Hall et al. 2008. Studying the History of Ideas Using Topic Models

Kuang et al. 2017. An LDA Topic Model and Social Network Analysis of a School Blogging Platform

Apr. 23
Text mining and crowdsourcing



Callison-Burch and Dredze. 2010. Creating Speech and Language Data With Amazon’s Mechanical Turk

Sheng et al. 2012. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. (optional)

Pavlick et al. 2014. The Language Demographics of Amazon Mechanical Turk

Chilton et al. 2016. HumorTools: A Microtask Workflow for Writing News Satire.

MTurk Tutorial

Difallah et al. 2018. Demographics and Dynamics of Mechanical Turk Workers. (optional)


Apr. 30
Text mining and crowdsourcing

(Opinion, Trust, Helpfulness)


Regression notes

Linear Regression Assumptions

C. Danescu-Niculescu-Mizil et al. 2011. How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes

NYT article on fake reviews

M. Ott et al. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination

V. Niculae et al. 2015. Linguistic Harbingers of Betrayal (data, summary)

L. Fu et al. 2017. When Confidence and Competence Collide: Effects on Online Decision-Making Discussions


May 7
Building Grammars? Managing linguistic data? Authorship detection? Neural Nets?

(Clustering, dimensionality reduction / feature selection)


COLING 2012 tutorial on dimensionality reduction

CMU recitation

 D. Lin, 1998. “Automatic retrieval and clustering of similar words.”

T. Liu et al. 2003. “An evaluation on feature selection for text clustering.” 

Y. Zhao and G. Karypis, 2002. “Evaluation of hierarchical clustering algorithms for document datasets.”

May 14
Student presentations
May 21