Lecture 005

Lexical Density: (number of adjectives, adverbs, nouns and verbs / total number of words) * 100
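A minimal sketch of the lexical density formula, assuming the text has already been POS-tagged with Penn Treebank style tags (for example by a tagger such as the one in nltk). The function name, the tag prefixes chosen as "content words", and the sample sentence are illustrative assumptions, not part of the lecture.

```python
# Assumption: Penn Treebank tags, where JJ* = adjectives, RB* = adverbs,
# NN* = nouns and VB* = verbs. Other tag sets need a different prefix list.
CONTENT_TAG_PREFIXES = ("JJ", "RB", "NN", "VB")


def lexical_density(tagged_tokens):
    """Return content words as a percentage of all words.

    tagged_tokens: list of (word, pos_tag) pairs.
    Tokens without any letter or digit (punctuation) are not counted as words.
    """
    words = [(w, t) for w, t in tagged_tokens if any(ch.isalnum() for ch in w)]
    if not words:
        return 0.0
    content = sum(1 for _, tag in words if tag.startswith(CONTENT_TAG_PREFIXES))
    return 100.0 * content / len(words)


# Hypothetical hand-tagged sample sentence.
sample = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("the", "DT"), ("dog", "NN"), (".", "."),
]
density = lexical_density(sample)  # 4 content words out of 7 words
```

Note that auxiliary verbs also carry VB* tags, so this simple version counts them as content words; whether that is desirable depends on the definition being followed.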

Good Software: https://www.laurenceanthony.net/software.html

Text Analyzer:

Features

The word-count graph for each word, separated by word type

Percentage of confidence words

Word tagging (with )

Sentiment analysis

preprocessing

difficulty

Which tend to have good essay writing

Which field (geology/education) tends to have the greatest information density?

TF-IDF
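A rough sketch of TF-IDF where the document frequencies (DF) come from a separate reference corpus, and the per-document score is the average over distinct words. The function names, the toy reference corpus, and the smoothed IDF form log((1 + N) / (1 + df)) + 1 are assumptions made for illustration; other IDF variants are equally valid.

```python
import math
from collections import Counter


def document_frequencies(reference_docs):
    """Count, for each word, how many reference documents contain it."""
    df = Counter()
    for tokens in reference_docs:
        df.update(set(tokens))  # set(): each document counts a word once
    return df


def mean_tfidf(tokens, df, n_docs):
    """Average TF-IDF over the distinct words of one tokenized document.

    TF is the word's relative frequency in the document.
    IDF is smoothed (an assumption) so that words missing from the
    reference corpus do not cause a division by zero.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    scores = []
    for word, count in counts.items():
        tf = count / total
        idf = math.log((1 + n_docs) / (1 + df.get(word, 0))) + 1.0
        scores.append(tf * idf)
    return sum(scores) / len(scores)


# Hypothetical toy data: three reference documents and one target document.
reference = [["the", "rock"], ["the", "school"], ["the", "rock", "layer"]]
df = document_frequencies(reference)
score = mean_tfidf(["the", "the", "basalt"], df, n_docs=len(reference))
```

Here "basalt" never appears in the reference corpus, so it receives the highest IDF, while "the" appears in every reference document and receives the lowest.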

https://www.kaggle.com/bryanlincoln/twitter-support-topic-modeling

External Data: https://www.kaggle.com/joshkyh/glove-twitter, https://www.kaggle.com/facebook/fasttext-english-word-vectors-including-subwords

For my corpus project, I want to compare "education research articles" and "geology research articles" to answer my question: papers produced in which field tend to have greater information density? For this theme, I chose the following variables to study: "percentage of stop words", "average TF-IDF score (with DF calculated on other corpora)", and "number of in-text citations per word". First, because I assume stop words carry little information, the percentage of stop words is a good indicator of the overall information density of a corpus: the lower the percentage, the denser the text. Second, because the TF-IDF (term frequency * inverse document frequency) score, with DF calculated on an external dataset, indicates the relative importance of a word in a document, a high TF-IDF score averaged over distinct words can be a good indication of the information density of a document. Third, the number of citations normalized by word count indicates how much the author draws on other research, which is associated with the information density of a paper. For processing the raw .txt files, I will use Python in Jupyter Notebook with libraries such as "gensim" and "nltk" for stop-word removal and TF-IDF score calculation. I will use regular expressions to find in-text citations. Can you give me some feedback on my choice of theme and variables? Can you also suggest some external datasets I can use for calculating DF?
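The stop-word percentage and the citations-per-word measure described above could be sketched roughly as follows. Everything here is an assumption for illustration: the tiny stop-word set (in practice a fuller list such as the one shipped with nltk would be used), the function names, and especially the two regular expressions, which only cover simple author-year forms like "(Smith, 2010)" or "Lee et al. (2012)" and bracketed numeric forms like "[3, 4]".

```python
import re

# Tiny illustrative stop-word list (assumption, not a real resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "that", "by", "as"}

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

# Author-year citations: a capitalized surname, optionally "et al." or a
# second surname, followed by a four-digit year starting with 19 or 20.
AUTHOR_YEAR_RE = re.compile(
    r"[A-Z][A-Za-z\-]+(?: et al\.| (?:and|&) [A-Z][A-Za-z\-]+)?,? "
    r"\(?(?:19|20)\d{2}[a-z]?\b\)?"
)
# Numeric citations such as [7], [3, 4] or [3-5]; each bracket counts once.
NUMERIC_RE = re.compile(r"\[\d+(?:\s*[,\-–]\s*\d+)*\]")


def words(text):
    """Lowercased alphabetic tokens; digits and punctuation are skipped."""
    return [w.lower() for w in WORD_RE.findall(text)]


def stop_word_percentage(text):
    tokens = words(text)
    if not tokens:
        return 0.0
    return 100.0 * sum(1 for w in tokens if w in STOP_WORDS) / len(tokens)


def count_citations(text):
    return len(AUTHOR_YEAR_RE.findall(text)) + len(NUMERIC_RE.findall(text))


def citations_per_word(text):
    n_words = len(words(text))
    return count_citations(text) / n_words if n_words else 0.0


# Hypothetical sample text.
text = "The basalt is dense (Smith, 2010). Lee et al. (2012) agree."
pct = stop_word_percentage(text)
rate = citations_per_word(text)
```

The author-year pattern will also match phrases such as "In 2010", so results on real papers should be spot-checked by hand, and the pattern adapted to the citation style each journal uses.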

Table of Contents