Lexical density: (number of adjectives, adverbs, nouns, and verbs / total number of words) × 100
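As a minimal sketch of the definition above, the function below computes lexical density from tokens that have already been part-of-speech tagged. It assumes Penn Treebank-style tags (the kind `nltk.pos_tag` produces), and it treats every tag beginning with JJ, RB, NN, or VB as a content word; the function name and the sample sentence are illustrative only.

```python
# Assumption: tags follow the Penn Treebank convention (JJ*, RB*, NN*, VB*).
CONTENT_TAG_PREFIXES = ("JJ", "RB", "NN", "VB")  # adjectives, adverbs, nouns, verbs


def lexical_density(tagged_tokens):
    """Return content words as a percentage of all words.

    tagged_tokens is a list of (word, tag) pairs. Tokens with no
    letters or digits (punctuation) are excluded from the word count.
    """
    words = [(w, t) for w, t in tagged_tokens if any(c.isalnum() for c in w)]
    if not words:
        return 0.0
    content = sum(1 for _, tag in words if tag.startswith(CONTENT_TAG_PREFIXES))
    return 100.0 * content / len(words)


# Hand-tagged sample: four content words out of five words.
sample = [("The", "DT"), ("old", "JJ"), ("rocks", "NNS"),
          ("erode", "VBP"), ("slowly", "RB"), (".", ".")]
print(lexical_density(sample))
```

Note that this counts auxiliary verbs as verbs, in line with the "all verbs" wording of the definition; a stricter definition would exclude them.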

For my corpus project, I want to compare "education research articles" and "geology research articles" to answer the question: in which field do papers tend to have greater information density? To address it, I chose three variables to study: "percentage of stop words", "average TF-IDF score (trained on other corpora)", and "number of in-text citations per word count".

First, because I assume stop words carry little information, the percentage of stop words should be a good inverse indicator of the overall information density of a corpus: the higher the percentage, the lower the density. Second, because the TF-IDF (term frequency × inverse document frequency) score, with DF calculated from an external dataset, indicates the relative importance of a word in a document, a high TF-IDF score averaged over the distinct words of a document can be a good indication of that document's information density. Third, the number of citations normalized by word count indicates how much the author draws on other research, which I expect to be associated with the information density of a paper.

To process the raw .txt files, I will use Python in a Jupyter Notebook, with libraries such as "gensim" and "nltk" for stop-word removal and TF-IDF score calculation. I will use regular expressions to find in-text citations.

Can you give me some feedback on my choice of theme and variables? Can you also suggest some external datasets I can use for calculating DF?
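The three variables described above can be sketched in plain Python as follows. This is a rough illustration under several assumptions, not a finished pipeline: `STOP_WORDS` is a tiny stand-in for a real list such as `nltk.corpus.stopwords.words("english")`; `doc_freq` and `n_docs` are hypothetical document frequencies that would come from whatever external corpus is chosen; the IDF uses a smoothed form, log((1 + N) / (1 + df)) + 1, which is one common variant; and the citation pattern only handles parenthetical author-year citations, so numeric styles such as [12] would need a different regular expression.

```python
import math
import re
from collections import Counter

# Assumption: tiny stand-in list; a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "that", "by"}

TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")       # alphabetic word tokens
PAREN_RE = re.compile(r"\(([^()]*)\)")             # text inside parentheses
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}[a-z]?\b")  # years such as 2010 or 2010a


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return TOKEN_RE.findall(text.lower())


def stop_word_percentage(tokens):
    """Percentage of tokens that appear in STOP_WORDS."""
    if not tokens:
        return 0.0
    return 100.0 * sum(t in STOP_WORDS for t in tokens) / len(tokens)


def mean_tfidf(tokens, doc_freq, n_docs):
    """Mean TF-IDF over the distinct words of one document.

    doc_freq maps a word to the number of external documents containing it;
    n_docs is the size of that external corpus (both hypothetical inputs).
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    scores = []
    for word, count in counts.items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1.0
        scores.append(tf * idf)
    return sum(scores) / len(scores)


def count_citations(text):
    """Count year mentions inside parentheses, e.g. (Smith, 2010; Wu, 2015)."""
    return sum(len(YEAR_RE.findall(group)) for group in PAREN_RE.findall(text))


def citations_per_word(text):
    """In-text citations divided by the number of word tokens."""
    tokens = tokenize(text)
    return count_citations(text) / len(tokens) if tokens else 0.0


text = ("The erosion of the basin is rapid (Smith, 2010; Wu et al., 2015). "
        "Lee (2012) disagrees.")
tokens = tokenize(text)
print(stop_word_percentage(tokens))
print(citations_per_word(text))
print(mean_tfidf(tokens, doc_freq={"the": 100, "of": 98, "is": 95}, n_docs=100))
```

Two caveats on this sketch: the year pattern will also count non-citation parentheticals that contain a four-digit number in the 1900 to 2099 range, such as "(n = 2000)", and the word count includes the author names inside citations. Whether to strip citations before counting words is a design choice that should be applied consistently to both corpora.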

