Lecture 008

Introduction: (present tense)


Abstract: 100~200

While text-level information density can be a good indicator of a successful writing strategy, few studies have examined this metric in the context of research articles. This article examines the difference in text-level information density between natural science and social science articles by comparing research articles from a geology corpus and an education corpus. We analyzed lexical density, sentence-wise TF-IDF, and the number of citations per 1000 words for each corpus. Our results showed a negligible difference in lexical density and sentence-wise TF-IDF but a noteworthy difference in the number of citations per 1000 words. Future studies might use a larger dataset to increase the validity of the findings.

This study aimed to compare the information density of a geology corpus and an education corpus. Our results indicate that although the two corpora differed minimally in lexical density and sentence-wise TF-IDF, the geology corpus contained twice as many citations per 1000 words as the education corpus.
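As a rough illustration of two of the metrics named above, the following Python sketch computes lexical density and citations per 1000 words. It is not the procedure used in the study. The function-word list is a tiny hypothetical stand-in for proper part-of-speech tagging, and the citation pattern assumes parenthetical author-year citations such as (Smith, 2010); both are assumptions made for this example only.

```python
import re

# Hypothetical, deliberately small function-word list (assumption).
# A real analysis would more likely rely on part-of-speech tagging.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "by", "with", "from", "as", "is", "are", "was", "were", "be",
    "been", "it", "this", "that", "these", "those", "et", "al",
}

# Assumed citation style: parentheses containing at least one 4-digit year.
CITATION_RE = re.compile(r"\(([^()]*?\b(?:19|20)\d{2}[a-z]?[^()]*?)\)")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}")


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_density(text):
    """Share of tokens that are content words (not in FUNCTION_WORDS)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def citations_per_1000_words(text):
    """Count author-year citations and normalize by text length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    count = 0
    for match in CITATION_RE.finditer(text):
        # One parenthesis may hold several citations separated by ';'.
        parts = match.group(1).split(";")
        count += sum(1 for part in parts if YEAR_RE.search(part))
    return 1000 * count / len(tokens)
```

Normalizing by 1000 words makes corpora of different sizes comparable, which is why the citation count is not reported as a raw total.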

The similarity in lexical density and sentence-wise TF-IDF could be explained by the similar academic settings of the two corpora. Because Geology and American Educational Research Journal are both credible academic journals, authors in both fields might have applied similar strategies to condense their articles while preserving readability. The difference in the number of citations per 1000 words between the two corpora could have been caused by a difference in the nature of the two fields. Geology, as a branch of natural science, might require more accumulated background knowledge and jargon to explain certain phenomena. In contrast, education studies could be more independent of each other.

This study has limitations. Wikipedia Sentences, the corpus chosen as the general corpus for calculating TF-IDF scores, might not be general enough to include all words in the English language. Therefore, some words were excluded from the analysis, making the scores less accurate. Future studies should increase the size of the general corpus as well as the geology and education corpora.
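The limitation described above can be made concrete with a small sketch. The Python below scores a sentence with TF-IDF, taking inverse document frequencies from a general corpus and skipping any word that corpus does not contain. The exact formula (term frequency within the sentence multiplied by log(N / df), summed over known words) and all function names are assumptions for illustration; the notes do not specify how the study computed its scores.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def build_idf(general_sentences):
    """Treat each general-corpus sentence as a document and compute IDF."""
    n_docs = len(general_sentences)
    doc_freq = Counter()
    for sentence in general_sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def sentence_tfidf(sentence, idf):
    """Return (score, skipped_words) for one sentence.

    Words missing from the general corpus have no IDF value, so they are
    left out of the score and reported separately.
    """
    tokens = tokenize(sentence)
    known = [t for t in tokens if t in idf]
    skipped = [t for t in tokens if t not in idf]
    if not known:
        return 0.0, skipped
    term_freq = Counter(known)
    score = sum((count / len(known)) * idf[word]
                for word, count in term_freq.items())
    return score, skipped


# Toy general corpus (hypothetical data, not Wikipedia Sentences).
general = ["the cat sat", "the dog ran", "a cat ran"]
idf = build_idf(general)
score, skipped = sentence_tfidf("the cat metamorphosed", idf)
```

In this toy run the rare word "metamorphosed" is skipped even though it is probably the most informative word in the sentence, which shows how a general corpus that is too small can lower the accuracy of scores for specialized text.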

intro method abstract

Table of Contents