Lexical density: (number of adjectives, adverbs, nouns, and verbs) divided by the total number of words, * 100
Good Software: https://www.laurenceanthony.net/software.html
Word-count graph for each word, separated by word type; percentage of confidence words; word tagging (with ...); sentiment analysis
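Lexical density as defined above can be computed from POS-tagged text. A minimal sketch, assuming Penn Treebank tags such as those produced by nltk.pos_tag (the hand-tagged example sentence here is invented for illustration):

```python
# Lexical density = content words (nouns, verbs, adjectives, adverbs)
# divided by total words, times 100.

def lexical_density(tagged_words):
    """tagged_words: list of (word, Penn-Treebank-tag) pairs."""
    content_prefixes = ("NN", "VB", "JJ", "RB")  # noun, verb, adjective, adverb
    content = sum(1 for _, tag in tagged_words if tag.startswith(content_prefixes))
    return 100.0 * content / len(tagged_words)

# Hand-made tagged sentence (in practice, use nltk.pos_tag on tokenized text).
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("dog", "NN")]
print(round(lexical_density(tagged), 1))  # 4 of 7 words are content words
```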
Remove stop words; lemmatize (nltk.stem.WordNetLemmatizer)
SnowballStemmer (standardize affixes/suffixes); NLTK's WordNetLemmatizer (handles -s/-es)
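A short sketch of suffix standardization with NLTK's SnowballStemmer. (WordNetLemmatizer would also need the WordNet corpus downloaded via nltk.download, so only the stemmer is shown; the word list is invented for illustration.)

```python
from nltk.stem import SnowballStemmer

# Snowball (Porter2) stemming strips and standardizes suffixes,
# collapsing inflected forms onto a common stem.
stemmer = SnowballStemmer("english")
words = ["running", "studies", "cats"]
print([stemmer.stem(w) for w in words])  # ['run', 'studi', 'cat']
```

Note that stems need not be dictionary words ("studies" becomes "studi"); a lemmatizer returns dictionary forms instead, at the cost of needing the WordNet data.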
Which field tends to produce good essay writing?
Focus on topic
Information density: % of stop words (gensim stopwords); TF-IDF (gensim.models)
Which field (geology/education) tends to have the highest information density?
% of stop words
TF-IDF score using a trained model
Number of in-text citations, counted with regular expressions
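Counting in-text citations with a regular expression could look like the following sketch. The pattern is a hypothetical one for simple APA-style parenthetical citations such as "(Smith, 2019)" or "(Lee & Kim, 2020)"; real articles would need a broader pattern (et al., page numbers, multiple works per parenthesis).

```python
import re

# Hypothetical APA-style citation pattern: "(Name, 2019)" or "(Name & Name, 2020)".
CITATION = re.compile(
    r"\([A-Z][A-Za-z'-]+(?:\s(?:&|and)\s[A-Z][A-Za-z'-]+)*,\s\d{4}[a-z]?\)"
)

# Invented example text for illustration.
text = ("Prior work links density to citation practice (Smith, 2019), "
        "and later studies agree (Lee & Kim, 2020).")

n_citations = len(CITATION.findall(text))
per_word = n_citations / len(text.split())  # normalized by word count
print(n_citations)  # two citations found in the example text
```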
https://www.kaggle.com/bryanlincoln/twitter-support-topic-modeling External Data: https://www.kaggle.com/joshkyh/glove-twitter, https://www.kaggle.com/facebook/fasttext-english-word-vectors-including-subwords
For my corpus project, I want to compare education research articles with geology research articles to answer the question: papers produced in which field tend to have greater information density? For this theme, I chose the following variables: percentage of stop words, average TF-IDF score (trained on other corpora), and number of in-text citations per word count. First, because stop words carry little information, the percentage of stop words is a good inverse indicator of the overall information density of a corpus. Second, because the TF-IDF (term frequency * inverse document frequency) score, with DF calculated from an external dataset, indicates the relative importance of a word in a document, a high TF-IDF score averaged over distinct words is a good indication of the information density of a document. Third, the number of citations normalized by word count indicates how much the author draws on other research, which is associated with the information density of a paper. For processing the raw .txt files, I will use Python in a Jupyter Notebook with libraries such as gensim and nltk for stop-word removal and TF-IDF calculation, and regular expressions to count in-text citations. Can you give me some feedback on my choice of theme and variables? Can you also suggest some external datasets I could use for calculating DF?