Production and Consumption in the Knowledge Market: Solving Old Puzzles with New Techniques
The first chapter investigates the drivers of citation counts of academic papers. I match yearly citation data, full texts, and yearly author data for 4,482 papers in the top five economics journals, and use textual analysis to construct high-dimensional vectors of paper and author features. The 10-year citation distribution is highly right-skewed, and its upper tail is well approximated by a power law. In addition, higher 10-year citation counts are associated with higher popular-topic coverage, more authors, and higher total citations of authors' co-authors, and with lower "Micro" intensity, lower paper complexity, and fewer top-field publications by the authors. I apply several state-of-the-art machine learning methods and develop a hybrid method that combines variable construction via dictionary-based textual analysis, variable selection via regression shrinkage, and model fitting via Gradient Boosted Trees to predict papers' 10-year citations using the information available as of the year of publication. The proposed hybrid method yields the smallest mean squared error in the out-of-sample 10-year citation prediction test while using relatively few variables compared to other machine learning methods. It correctly predicts 72.7% of the papers in the upper half of the citation distribution and 76.7% of the papers in the lower half.

The second chapter analyzes editorial decision making in the academic publishing process. I analyze data on keywords, abstracts, referee recommendations, authors' historical records, and editorial decisions for 13,517 manuscripts submitted to four academic journals, linked with data on paper citation counts. I apply textual analysis to each paper's keywords and abstract to construct high-dimensional measures of research topics and fields.
I then estimate the effects of paper, author, and referee-recommendation features on editorial decisions, the duration from submission to decision, and paper citations. The empirical results suggest that papers with higher referee recommendation scores, higher scientific contribution scores, a lower standard deviation of referee recommendation scores, a higher share of positive referee recommendations, higher coverage of popular research topics, and authors with a longer and more solid submission history (more submissions and a lower rejection rate) are more likely to be published. Papers with lower coverage of popular research topics and authors with a shorter and weaker submission history are more likely to be desk rejected. Among non-desk-rejected papers, those with higher referee recommendation scores and a lower standard deviation of those scores have shorter first rounds of review. The citation results suggest that accepted papers on average receive more citations than rejected ones, and that higher citation counts are associated with higher coverage of popular research topics, higher referee recommendation scores, and higher scientific contribution scores. In the prediction part, I use machine learning methods (regression shrinkage methods, Random Forest, and Gradient Boosted Trees) to predict paper citations using the information available at the time of submission. The model that combines the Random Forest method with measures of publication information, measures of research fields and topics, and high-dimensional measures of the appearance of popular topic words gives the best out-of-sample prediction performance. Using the preferred prediction model, I test the possibility of combining artificial intelligence (AI) and human experts in the academic publishing process. The experiment shows that the average number of cumulative citations of the published papers is more than 24% higher than that of all submissions.
This result suggests that papers published by the human-intelligence-based academic publishing process turn out to have higher average citations than rejected ones, even though editors may not use a paper's expected citations as a criterion when deciding which papers to publish. As an exercise, I use the citation prediction model to decide which papers to publish so as to maximize citations. At an acceptance rate comparable to the human-based editorial process, the papers selected by the algorithm have 2% higher citation counts. In addition, the average number of cumulative citations of the papers selected by the AI from the publishable papers is 22% higher than that of all publishable papers. Admittedly, other factors affect editors' decisions on which papers to publish. Nevertheless, the AI-based prediction model may help editors identify, among publishable papers, those that are more likely to be highly cited.
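The three-stage hybrid approach described above (dictionary-based feature construction, shrinkage-based variable selection, then Gradient Boosted Trees) can be sketched roughly as follows. This is a minimal illustration, not the dissertation's actual pipeline: the topic dictionary, abstracts, design matrix, and outcome are all synthetic stand-ins for the real paper and author data.

```python
# Sketch of a dictionary + shrinkage + boosting pipeline (illustrative only).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Step 1: dictionary-based textual features (simplified): count how often
# each hypothetical "popular topic" word appears in each abstract.
topic_dictionary = ["auction", "experiment", "inequality"]
abstracts = ["auction experiment design", "inequality trends", "monetary policy"]
X_text = np.array([[a.split().count(w) for w in topic_dictionary]
                   for a in abstracts])

# In practice such text features would be joined with paper/author covariates.
# Here we simulate a larger synthetic design matrix and a right-skewed
# citation-like outcome with signal only in the first two columns.
n, p = 200, 30
X = rng.normal(size=(n, p))
y = np.exp(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n))

# Step 2: variable selection via regression shrinkage (Lasso); the log
# transform tames the skew of the outcome.
lasso = LassoCV(cv=5).fit(X, np.log(y))
selected = np.flatnonzero(lasso.coef_ != 0)

# Step 3: fit Gradient Boosted Trees on the selected variables only.
gbt = GradientBoostingRegressor(random_state=0).fit(X[:, selected], np.log(y))
```

The design choice mirrored here is that shrinkage prunes the high-dimensional feature space before the tree ensemble is fit, which keeps the final model's variable count small relative to feeding all features directly into the boosting step.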