Lecture 005

Lexical Density: (number of adjectives, adverbs, nouns, and verbs) / (total number of words) * 100
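
A minimal sketch of the lexical density formula above. It assumes the text has already been part-of-speech tagged with Penn Treebank tags (for example by a tagger such as NLTK's); the function name and the hand-tagged example are illustrative only.

```python
# Penn Treebank tag prefixes treated as content words (an assumption about the tagset):
# JJ* = adjectives, RB* = adverbs, NN* = nouns, VB* = verbs.
CONTENT_TAG_PREFIXES = ("JJ", "RB", "NN", "VB")


def lexical_density(tagged_tokens):
    """Return content words / total words * 100 for a list of (word, tag) pairs."""
    # Skip pure punctuation tokens so they do not inflate the word count.
    words = [(w, t) for w, t in tagged_tokens if any(ch.isalnum() for ch in w)]
    if not words:
        return 0.0
    content = sum(1 for _, tag in words if tag.startswith(CONTENT_TAG_PREFIXES))
    return content / len(words) * 100


# Tags written by hand for illustration, not produced by a real tagger.
example = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"),
    ("lazy", "JJ"), ("dog", "NN"), (".", "."),
]
print(lexical_density(example))  # 5 content words out of 8 words -> 62.5
```

Note that auxiliary verbs tagged VB* count as content words here; whether to exclude them is a design choice the formula leaves open.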

Good Software: https://www.laurenceanthony.net/software.html

Text Analyzer:

https://seoscout.com/tools/keyword-analyzer
https://www.analyzemywriting.com/
https://voyant-tools.org/

Features

word-count graph for each word, separated by word type
percentage of confidence words
word tagging (with )
sentiment analysis

Preprocessing

remove stop words (nltk.corpus.stopwords)
stemming with SnowballStemmer (standardizes affixes/suffixes) or lemmatization with NLTK's WordNetLemmatizer (e.g. plural s/es)
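
A rough, dependency-free sketch of the two preprocessing steps. The stop-word list here is a tiny hand-written sample and `strip_plural` is a crude, hypothetical stand-in that only handles plural s/es endings; in the actual project these would be replaced by a full stop-word list and a real stemmer or lemmatizer from NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def strip_plural(word):
    """Crude stand-in for a stemmer: only removes a plural -s / -es ending."""
    if len(word) > 4 and word.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then normalize plural endings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_plural(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The classes and the boxes are in the rocks of the basin"))
# ['class', 'box', 'rock', 'basin']
```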

Difficulty

sentence length
adv, adj
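
A small sketch of the sentence-length measure only (the adverb/adjective share would need a part-of-speech tagger). It assumes sentences end with `.`, `!`, or `?`, so abbreviations such as "e.g." will be split incorrectly; the function name is illustrative.

```python
import re


def average_sentence_length(text):
    """Mean number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"[A-Za-z0-9'-]+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)


sample = "Rocks form slowly. Students learn quickly when teachers give clear feedback!"
print(average_sentence_length(sample))  # (3 + 8) / 2 -> 5.5
```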

What good essay writing tends to have

focus on topic
information density: % of stop words (gensim.parsing.preprocessing.STOPWORDS), TF-IDF (gensim.models.TfidfModel)

Which field (geology or education) tends to have the greatest information density?

% of stop words
TF-IDF score using a model
number of in-text citations, found using regexp
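
A sketch of the first and third measures in the list above. The stop-word list is a small hand-written sample, and the two regular expressions assume author-year citations such as "(Smith, 2010)" or "(Smith et al., 2010)" and numeric citations such as "[3]"; other citation styles would need their own patterns. All names are illustrative.

```python
import re

# Tiny illustrative stop-word list (an assumption); use a full list in practice.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "that", "the", "to", "with"}

# Assumed citation styles: "(Smith, 2010)", "(Smith & Lee, 2010)",
# "(Smith et al., 2010)", and numeric "[12]" or "[3, 4]".
AUTHOR_YEAR = re.compile(
    r"\([A-Z][A-Za-z-]+(?: (?:&|and) [A-Z][A-Za-z-]+| et al\.)?,? \d{4}[a-z]?\)"
)
NUMERIC = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")


def tokenize(text):
    """Lowercase alphabetic tokens; years and numbers are not counted as words."""
    return re.findall(r"[a-z]+", text.lower())


def stop_word_percentage(text):
    """Share of tokens that are stop words, as a percentage."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in STOP_WORDS for t in tokens) / len(tokens) * 100


def count_citations(text):
    """Number of in-text citations matching either assumed style."""
    return len(AUTHOR_YEAR.findall(text)) + len(NUMERIC.findall(text))


def citations_per_word(text):
    """Citation count normalized by the number of word tokens."""
    tokens = tokenize(text)
    return count_citations(text) / len(tokens) if tokens else 0.0


sample = ("Feedback improves learning (Hattie, 2007). "
          "The basin is old (Smith et al., 2010) and deep [3].")
print(count_citations(sample), round(stop_word_percentage(sample), 1))
```

Author names inside citations are counted as words here; stripping the matched citations before tokenizing is an alternative.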

TF-IDF

https://zhuanlan.zhihu.com/p/67883024
https://radimrehurek.com/gensim/models/tfidfmodel.html
https://www.kaggle.com/pavelvpster/google-q-a-labeling-tf-idf-pytorch
https://zh.wikipedia.org/wiki/Tf-idf

https://www.kaggle.com/bryanlincoln/twitter-support-topic-modeling

External Data:

https://www.kaggle.com/joshkyh/glove-twitter
https://www.kaggle.com/facebook/fasttext-english-word-vectors-including-subwords

For my corpus project, I want to compare education research articles and geology research articles to answer the question: in which field do papers tend to have greater information density? For this theme, I chose the following variables: "percentage of stop words", "average TF-IDF score (with DF computed on other corpora)", and "number of in-text citations per word". First, because I assume stop words carry little information, the percentage of stop words is a reasonable inverse indicator of the overall information density of a corpus: the higher the percentage, the lower the density. Second, because the TF-IDF (term frequency * inverse document frequency) score, with DF calculated from an external dataset, indicates the relative importance of a word in a document, a high TF-IDF score averaged over distinct words can be a good indication of the information density of a document. Third, the number of citations normalized by word count indicates how much the author draws on other research, which I expect to be associated with the information density of a paper. For processing the raw .txt files, I will use Python in a Jupyter Notebook with libraries such as "gensim" and "nltk" for stop-word removal and TF-IDF calculation, and regular expressions to count in-text citations. Can you give me some feedback on my choice of theme and variables? Can you also suggest some external datasets I can use for calculating DF?
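
A pure-Python sketch of the second variable: TF-IDF averaged over distinct words, with DF taken from a separate reference corpus. The smoothed IDF formula, the function names, and the three toy reference documents are all assumptions for illustration; gensim's TfidfModel uses its own weighting and normalization and may give different numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(reference_docs):
    """Count, for each word, how many reference documents contain it."""
    df = Counter()
    for doc in reference_docs:
        df.update(set(tokenize(doc)))
    return df


def mean_tfidf(text, df, n_docs):
    """Average TF-IDF over the distinct words of one document."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    scores = []
    for word, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumption) so words unseen in the reference
        # corpus do not cause a division by zero.
        idf = math.log((1 + n_docs) / (1 + df.get(word, 0))) + 1
        scores.append(tf * idf)
    return sum(scores) / len(scores)


# Toy reference corpus standing in for the external dataset.
reference_docs = ["the cat sat on the mat", "the dog ate the bone", "the rock is old"]
df = document_frequencies(reference_docs)
print(mean_tfidf("the basalt the basalt", df, len(reference_docs)))
```

Because the term frequencies of a document sum to 1, this mean shrinks as the number of distinct words grows, so it is safest to compare documents of similar length or to report the sum alongside the mean.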

Table of Contents