Findings Seminal Papers Using Data Mining Techniques


The aim of this contribution is to show how seminal papers can be detected using data mining techniques. To this end, the Rapidminer Studio software and its data mining tools are applied to data sets built from information extracted from Google Scholar and Scopus in three different areas of knowledge. Other software, such as Microsoft Excel and Publish or Perish, is also used in the process. Comparing the results of the searches for Knowledge Management, Entrepreneurship and Marketing showed no marked similarity between the sets of articles obtained from Google Scholar and from Scopus. The values of the Similarity Index remained below 0.52%; they were similar for Knowledge Management and Entrepreneurship but lower for Marketing. Outlier detection with data mining techniques, and with Rapidminer in particular, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in the space VY = f(C) (validity in years as a function of citations received), in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest examining, for other search terms, whether the trend found is maintained.

Share and Cite:

Hernández, A. and Hidalgo, D. (2020) Findings Seminal Papers Using Data Mining Techniques. Open Journal of Social Sciences, 8, 293-305. doi: 10.4236/jss.2020.89023.

1. Introduction

Knowing the articles that laid the foundations of a specialty or a specific research topic has long been regarded as one of the essential objectives of a literature review (Hart, 1998). The literature review, necessary in any investigation, has been aptly defined by Webster & Watson (2002) as the analysis of the past as essential preparation for the vision of the future that every good scientific article should contain. Producing the so-called “state of the art” has become an essential step in carrying out research. Surprising as it may seem, the value of such reviews was recognized almost from the very appearance of scientific journals, many years ago (Sciences, 1823).

One of the main objectives of a state of the art is to identify those articles that have laid the conceptual as well as the methodological foundations of a discipline, that is, those contributions that in fact “do not age” (Singer, 2009). It is therefore usual in the specialized literature to find studies that determine the seminal articles published in a given journal (Parkinson et al., 2013), the role of one of these contributions in a particular discipline (Dolman, Miralles, & de Jeu, 2014), the seminal articles on a specific technique (Nash, Walker, Gidwani, & Ajuied, 2015; Nash, Walker, Lucas, & Ajuied, 2016), or the most important articles in a given branch of science (Riordon, Zubritsky, & Newman, 2000).

Identifying the so-called seminal articles has been recognized as a de facto standard when preparing a state of the art in the most diverse disciplines. To identify the articles of unquestionable significance for an investigation (Berkani, Hanifi, & Dahmani, 2020; Silva, Villa, & Cabrera, 2020), different alternatives have been proposed, such as collaborative models (Wang & Blei, 2011) and personalized systems that recommend the most relevant articles (Pera & Ng, 2011). How to identify seminal articles and their possible genealogy has been studied less (Bae, Hwang, Kim, & Faloutsos, 2011, 2014). The fact is that today's researcher faces a volume of information that makes finding the most relevant works anything but simple, and doing so requires considerable time and effort (Alonso, Perez, & Hidalgo, 2016; Bravo Hidalgo & León González, 2018).

Within this problem area, this contribution started from the working idea that seminal articles are recognized as such, and do not age, for two reasons:

1) They have been cited in a significant way, that is, they are recognized by the scientific community.

2) They remain valid for several years.

These two simple reasons should make them stand out as outliers in the space:

VY = f(C)

where VY is the Validity in Years of a given article, that is, the time elapsed from the publication of the article until the current date, and C is the number of citations received by the article during that period.

Data mining offers different possibilities for data analysis (Berkhin, 2006), including different techniques (Bakar, Mohemad, Ahmad, & Deris, 2006; Buthong, Luangsodsai, & Sinapiromsaran, 2013) and algorithms for the detection of atypical values (Ramaswamy, Rastogi, & Shim, 2000). At the same time, different applications have been developed that facilitate the use of data mining (Rangra & Bansal, 2014). Among these, Rapidminer offers a whole set of possibilities for data analysis (Amer & Goldstein, 2012; Jungermann, 2009) and, in particular, for the detection of outliers (Buthong et al., 2013). An outlier has long been defined (Barnett & Lewis, 1974) as an observation, or set of observations, that seems to be inconsistent with the data set under analysis.

Based on these considerations, this contribution set out to determine whether seminal articles can be distinguished as outliers in the space VY = f(C), using the possibilities offered by Rapidminer to classify them in that space. Another aspect that cannot be ignored is how the articles, and the number of citations received by each one, are determined. For this purpose, this research also explored the degree of coincidence between the articles considered seminal when using Google Scholar (Martin-Martin, Orduna-Malea, Harzing, & Delgado López-Cózar, 2017) and when using another database widely recognized in the scientific world, Scopus (Burnham, 2006).

2. Material and Methods

To form the space VY = f(C), both Scopus and Google Scholar were searched for the following terms in English, in the title of the articles, for the period 1960-2019:

1) Knowledge management

2) Marketing

3) Entrepreneurship

The Publish or Perish (POP) tool (Harzing, 2007), which has been applied in different bibliometric studies (Harzing & Alakangas, 2016; Jacsó, 2009), was used to search Google Scholar.

For each search term, the 990 most-cited articles were selected. These were exported to Excel using the export options offered by Scopus and POP. The database thus consists of the fields Cites, Authors, Title, Year, and Validity, where Validity is calculated by subtracting the year of publication from the last year of the search range (2019).
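The construction of the database described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Excel workflow actually used: the `Article` class, the `top_cited` helper, and the sample records are hypothetical names and values introduced here, and only the field names and the 2019 reference year come from the text.

```python
from dataclasses import dataclass

LAST_SEARCH_YEAR = 2019  # last year of the search range stated in the text


@dataclass
class Article:
    """One exported record; field names follow the database described in the text."""
    cites: int
    authors: str
    title: str
    year: int

    @property
    def validity(self) -> int:
        """Validity in Years (VY): last search year minus publication year."""
        return LAST_SEARCH_YEAR - self.year


def top_cited(articles, limit=990):
    """Return the `limit` most-cited articles, highest citation count first."""
    return sorted(articles, key=lambda a: a.cites, reverse=True)[:limit]


# Hypothetical records, for illustration only (not data from the study).
sample = [
    Article(cites=120, authors="Author A", title="Paper one", year=2001),
    Article(cites=950, authors="Author B", title="Paper two", year=1994),
    Article(cites=40, authors="Author C", title="Paper three", year=2015),
]

ranked = top_cited(sample, limit=2)
points = [(a.cites, a.validity) for a in ranked]  # (C, VY) pairs
```

Each resulting pair (C, VY) is one point of the space VY = f(C) on which the outlier detection operates.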

In order to compare the similarity between the two sets of articles determined for each term, a Similarity Index (SI) was calculated from:

$$SI = \frac{2C}{A_{\text{Google Scholar}} + B_{\text{Scopus}}}$$

where:

SI is the Similarity Index. It reproduces the original idea of Sorensen, formulated many years ago to establish the similarity of groups of equal size.

$A_{\text{Google Scholar}}$ and $B_{\text{Scopus}}$ are the numbers of articles in each of the sets considered (990 for each).

C is here the number of items shared by both sets (not the citation count used in VY = f(C)). This number is easily calculated in Excel with a formula that counts the matches between the two title columns (Title Google Scholar, Title Scopus).
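As an alternative to the Excel formula, the Similarity Index can be sketched in Python, reading the expression above as the Sorensen index, that is, twice the number of shared titles divided by the sum of the sizes of the two sets. The function names and the sample titles are hypothetical, and the title normalization (lower-casing and collapsing whitespace) is an assumption made here; the text does not state how titles were matched.

```python
def normalize_title(title: str) -> str:
    """Lower-case and collapse whitespace (an assumed matching rule, not from the text)."""
    return " ".join(title.lower().split())


def similarity_index(titles_a, titles_b) -> float:
    """Sorensen similarity SI = 2C / (A + B) for two collections of titles."""
    set_a = {normalize_title(t) for t in titles_a}
    set_b = {normalize_title(t) for t in titles_b}
    if not set_a and not set_b:
        return 0.0  # avoid division by zero when both sets are empty
    shared = len(set_a & set_b)  # C: titles present in both sets
    return 2 * shared / (len(set_a) + len(set_b))


# Hypothetical titles, for illustration only (not data from the study).
google_titles = ["Paper One", "Paper Two", "Paper Three", "Paper Four"]
scopus_titles = ["paper  one", "Paper Five", "Paper Three", "Paper Six"]

si = similarity_index(google_titles, scopus_titles)
```

With two sets of four titles sharing two items, the index is 2·2/(4 + 4) = 0.5; identical sets give 1 and disjoint sets give 0.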

Outlier detection was carried out in Rapidminer; the process configured for this purpose is shown in Figure 1.

The first operator reads the Excel file and processes the Cites and Validity fields; this was done for each search term and for each of the databases used (Google Scholar and Scopus). The second operator identifies the outliers in the data set and allows both the number of neighbors (k) and the number of outliers (n) to be specified. To make the different search terms comparable, these parameters were set, after some preliminary tests, to:

n = 10

k = 10

The distances between each point and its k nearest neighbors were computed as Euclidean distances. In practical terms, an attempt was made to answer the question: how can the 10 articles that can be considered seminal be determined for each of the search terms analyzed?
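The idea behind this kind of detection can be illustrated with a minimal Python sketch of one common distance-based formulation, in which each point is scored by the Euclidean distance to its k-th nearest neighbor and the n points with the largest scores are reported as outliers (in the spirit of Ramaswamy, Rastogi, & Shim, 2000). This is an assumption about the technique: the text does not name the exact Rapidminer operator or state whether the fields were rescaled, so the sketch is not a reproduction of the process in Figure 1. The function names and sample points are hypothetical, and no normalization of Cites and Validity is applied.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Euclidean distance from points[index] to its k-th nearest neighbor."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def detect_outliers(points, n=10, k=10):
    """Return the indices of the n points with the largest k-th-neighbor distance."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    scores = [kth_neighbor_distance(points, i, k) for i in range(len(points))]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical (Cites, Validity) points, for illustration only.
sample_points = [(10, 2), (12, 3), (11, 2), (13, 4), (9, 3), (5000, 25)]

outlier_indices = detect_outliers(sample_points, n=1, k=2)
```

In this toy data set the last point, with many citations and a long validity, lies far from all the others and is the one reported. With the values used in the study (n = 10, k = 10), the call would be `detect_outliers(points, n=10, k=10)` on the 990 points of each search term and database.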

3. Results

Table 1 below summarizes the Number of Cites, Years, Cites/Year and Cites/Paper as well as the Similarity Index for the three search terms considered. This information is of great importance for the purpose of characterizing the papers detected in the different databases used.

The Similarity Index remains similar between Knowledge Management and Entrepreneurship but decreases for Marketing.

Figure 1. Outlier detection process in Rapidminer. The data used are those obtained under the search criteria previously defined.

Table 1. Summary data. Google Scholar matches with Scopus. Similarity Index.

Analysis of Seed Articles

Figure 2 shows the outliers detected in the space $VY_{\text{Scopus}} = f(C_{\text{Scopus}})$ for the knowledge management case.

Table 2 presents the SI results for the articles detected as outliers, which can therefore be categorized as seminal, in Google Scholar and Scopus for the three search terms used. In other words, this table identifies each of the documents detected as outliers.

Figure 2. Outliers in the space $VY_{\text{Scopus}} = f(C_{\text{Scopus}})$; knowledge management case.

Table 2. Seminal papers found: knowledge management, entrepreneurship and marketing.

4. Conclusion

Comparing the results obtained for the searches in Scopus and Google Scholar for Knowledge Management, Entrepreneurship and Marketing showed no marked similarity between the sets of articles obtained in the two databases. The values of the Similarity Index remained below 0.52%; they were similar for Knowledge Management and Entrepreneurship but lower for Marketing.

Outlier detection with data mining techniques, and with Rapidminer in particular, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in the space VY = f(C) in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest examining, for other search terms, whether the trend found is maintained.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References


[1] Acs, Z. J., Braunerhjelm, P., Audretsch, D. B., & Carlsson, B. (2009). The Knowledge Spillover Theory of Entrepreneurship. Small Business Economics, 32, 15-30.
[2] Alavi, M., & Leidner, D. E. (2001). Knowledge Management and Knowledge Management Systems: Conceptual Foundations and Research Issues. MIS Quarterly, 25, 107-136.
[3] Alonso, J. A. G., Perez, Y. M., & Hidalgo, D. B. (2016). Empleo de indicadores bibliométricos para la realización de un estado del arte. Un enfoque práctico. Revista Publicando, 3, 81-97.
[4] Amer, M., & Goldstein, M. (2012). Nearest-Neighbor and Clustering Based Anomaly Detection Algorithms for RapidMiner. In the 3rd RapidMiner Community Meeting and Conference.
[5] Bae, D. H., Hwang, S. M., Kim, S. W., & Faloutsos, C. (2011). Constructing Seminal Paper Genealogy. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 2101-2104).
[6] Bae, D. H., Hwang, S. M., Kim, S. W., & Faloutsos, C. (2014). On Constructing Seminal Paper Genealogy. IEEE Transactions on Cybernetics, 44, 54-65.
[7] Bakar, Z. A., Mohemad, R., Ahmad, A., & Deris, M. M. (2006). A Comparative Study for Outlier Detection Techniques in Data Mining. In IEEE Conference on Cybernetics and Intelligent Systems.
[8] Barnett, V., & Lewis, T. (1974). Outliers in Statistical Data. Hoboken, NJ: Wiley.
[9] Baron, R. A. (2003). Human Resource Management and Entrepreneurship: Some Reciprocal Benefits of Closer Links. Human Resource Management Review, 13, 253-256.
[10] Baumol, W. J. (1996). Entrepreneurship: Productive, Unproductive, and Destructive. Journal of Business Venturing, 11, 3-22.
[11] Berkani, L., Hanifi, R., & Dahmani, H. (2020). Hybrid Recommendation of Articles in Scientific Social Networks Using Optimization and Multiview Clustering. In 3rd International Conference on Smart Applications and Data Analysis for Smart Cyber-Physical Systems (pp. 117-132, Vol. 1207). Berlin: Springer.
[12] Berkes, F., Colding, J., & Folke, C. (2000). Rediscovery of Traditional Ecological Knowledge as Adaptive Management. Ecological Applications, 10, 1251-1262.
[13] Berkhin, P. (2006). A Survey of Clustering Data Mining Techniques Grouping Multidimensional Data (pp. 25-71). Berlin: Springer.
[14] Berry, L. L. (1995). Relationship Marketing of Services—Growing Interest, Emerging Perspectives. Journal of the Academy of Marketing Science: Official Publication of the Academy of Marketing Science, 23, 236-245.
[15] Bravo Hidalgo, D., & León González, J. L. (2018). Divulgación de la investigación científica en el Siglo XXI. Revista Universidad y Sociedad, 10, 88-97.
[16] Burnham, J. F. (2006). Scopus Database: A Review. Biomedical Digital Libraries, 3, 1.
[17] Buthong, N., Luangsodsai, A., & Sinapiromsaran, K. (2013). Outlier Detection Score Based on Ordered Distance Difference. In Computer Science and Engineering Conference.
[18] Churchill Jr., G. A. (1979). A Paradigm for Developing Better Measures of Marketing Constructs. Journal of Marketing Research, 16, 64-73.
[19] Dalkir, K. (2013). Knowledge Management in Theory and Practice. Abingdon-on-Thames: Taylor and Francis.
[20] Davenport, T. H., David, W., & Beers, M. C. (1998). Successful Knowledge Management Projects. Sloan Management Review, 39, 43-58.
[21] De Long, D. W., & Fahey, L. (2000). Diagnosing Cultural Barriers to Knowledge Management. Academy of Management Executive, 14, 113-127.
[22] Dolman, A. J., Miralles, D. G., & de Jeu, R. A. M. (2014). Fifty Years since Monteith’s 1965 Seminal Paper: The Emergence of Global Ecohydrology. Ecohydrology, 7, 897-902.
[23] Drucker, P. (2014). Innovation and Entrepreneurship. Abingdon-on-Thames: Routledge.
[24] Fayolle, A. (2007). Entrepreneurship and New Value Creation: The Dynamic of the Entrepreneurial Process. Cambridge: Cambridge University Press.
[25] Gold, A. H., Malhotra, A., & Segars, A. H. (2001). Knowledge Management: An Organizational Capabilities Perspective. Journal of Management Information Systems, 18, 185-214.
[26] Gomez-Perez, A., Fernández-López, M., & Corcho, O. (2006). Ontological Engineering: with Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Berlin: Springer Science & Business Media.
[27] Gronroos, C. (1984). A Service Quality Model and its Marketing Implications. European Journal of Marketing, 18, 36-44.
[28] Hart, C. (1998). Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications.
[29] Harzing, A.-W. (2007). Publish or Perish.
[30] Harzing, A.-W., & Alakangas, S. (2016). Google Scholar, Scopus and the Web of Science: A Longitudinal and Cross-Disciplinary Comparison. Scientometrics, 106, 787-804.
[31] Hedlund, G. (1994). A Model of Knowledge Management and the N-Form Corporation. Strategic Management Journal, 15, 73-90.
[32] Henseler, J., Ringle, C. M., & Sinkovics, R. R. (2009). The Use of Partial Least Squares Path Modeling in International Marketing. In Advances in International Marketing (pp. 277-319, Vol. 20). Bingley: Emerald Group Publishing Ltd.
[33] Hoffman, D. L., & Novak, T. P. (1996). Marketing in Hypermedia Computer-Mediated Environments: Conceptual Foundations. Journal of Marketing, 60, 50-68.
[34] Hunt, S. D., & Vitell, S. (1986). A General Theory of Marketing Ethics. Journal of Macromarketing, 6, 5-16.
[35] Jacsó, P. (2009). Calculating the h-Index and Other Bibliometric and Scientometric Indicators from Google Scholar with the Publish or Perish Software. Online Information Review, 33, 1189-1200.
[36] Jarvis, C. B., Mackenzie, S. B., Podsakoff, P. M., Giliatt, N., & Mee, J. F. (2003). A Critical Review of Construct Indicators and Measurement Model Misspecification in Marketing and Consumer Research. Journal of Consumer Research, 30, 199-218.
[37] Jungermann, F. (2009). Information Extraction with Rapidminer. In Proceedings of the GSCL Symposium Sprachtechnologie und eHumanities.
[38] King, R. G., & Levine, R. (1993). Finance, Entrepreneurship and Growth. Journal of Monetary Economics, 32, 513-542.
[39] Kirzner, I. M. (2015). Competition and Entrepreneurship. Chicago, IL: University of Chicago Press.
[40] Kotler, P., & Armstrong, G. (2015). Principles of Marketing-Global Edition. London: Pearson.
[41] Kotler, P., & Gertner, D. (2002). Country as Brand, Product, and Beyond: A Place Marketing and Brand Management Perspective. Journal of Brand Management, 9, 249-261.
[42] Kotler, P., & Keller, K. (2000). Marketing Administration. Sao Paulo: Prentice Hall Publishers.
[43] Kozinets, R. V. (2002). The Field behind the Screen: Using Netnography for Marketing Research in Online Communities. Journal of Marketing Research, 39, 61-72.
[44] Lee, H., & Choi, B. (2003). Knowledge Management Enablers, Processes, and Organizational Performance: An Integrative View and Empirical Examination. Journal of Management Information Systems, 20, 179-228.
[45] Malhotra, N. K. (2012). Pesquisa de marketing: Uma orientação aplicada. Porto Alegre: Bookman Editora.
[46] Malhotra, N., Hall, J., Shaw, M., & Oppenheim, P. (2006). Marketing Research: An Applied Orientation. Melbourne: Pearson Education Australia.
[47] Martin-Martin, A., Orduna-Malea, E., Harzing, A.-W., & Delgado López-Cózar, E. (2017). Can We Use Google Scholar to Identify Highly-Cited Documents? Journal of Informetrics, 11, 152-163.
[48] Miller, D. (1983). The Correlates of Entrepreneurship in Three Types of Firms. Management Science, 29, 770-791.
[49] Morgan, R. M., & Hunt, S. D. (1994). The Commitment-Trust Theory of Relationship Marketing. Journal of Marketing, 58, 20-38.
[50] Nash, W., Walker, R., Gidwani, S., & Ajuied, A. (2015). Seminal Papers in Hand and Wrist Surgery. Orthopaedics and Trauma, 29, 408-418.
[51] Nash, W., Walker, R., Lucas, J., & Ajuied, A. (2016). Seminal Papers in Spinal Surgery. Orthopaedics and Trauma, 32, 263-274.
[52] Palmatier, R. W., Dant, R. P., Grewal, D., & Evans, K. R. (2006). Factors Influencing the Effectiveness of Relationship Marketing: A Meta-Analysis. Journal of Marketing, 70, 136-153.
[53] Parkinson, L., Richardson, K., Sims, J., Wells, Y., Naganathan, V., Brooke, E., & Lindley, R. (2013). Identifying Seminal Papers in the Australasian Journal on Ageing 1982-2011: A Delphi Consensus Approach. Australasian Journal on Ageing, 32, 6-11.
[54] Pera, M. S., & Ng, Y.-K. (2011). A Personalized Recommendation System on Scholarly Publications. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
[55] Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Record (ACM Special Interest Group on Management of Data), 29, 427-438.
[56] Rangra, K., & Bansal, K. (2014). Comparative Study of Data Mining Tools. International Journal of Advanced Research in Computer Science and Software Engineering, 4, 2277.
[57] Riordon, J., Zubritsky, E., & Newman, A. (2000). Top 10 Articles. Analytical Chemistry Looks at 10 Seminal Papers. Analytical Chemistry, 72, 324A-329A.
[58] Sanchez, R., & Mahoney, J. T. (1996). Modularity, Flexibility, and Knowledge Management in Product and Organization Design. Strategic Management Journal, 17, 63-76.
[59] Sciences, T. P. (1823). ART. I. A Comparative View of the State of Medical Science among the Ancients and Moderns, Its Revolutions in Different Periods of the World, and an Enumeration of Some of the Errors Which Check Its Progress. The Philadelphia Journal of the Medical and Physical Sciences, 7, 211-226.
[60] Shane, S. (2003). A General Theory of Entrepreneurship: The Individual-Opportunity Nexus. Cheltenham: Edward Elgar Publishing.
[61] Shane, S., & Venkataraman, S. (2000). The Promise of Entrepreneurship as a Field of Research. Academy of Management Review, 25, 217-226.
[62] Silva, J., Villa, J. V., & Cabrera, D. (2020). An Intelligent Approach to Design and Development of Personalized Meta Search: Recommendation of Scientific Articles. In 16th International Conference on Distributed Computing and Artificial Intelligence, DCAI 2019 (pp. 99-106, Vol. 1003). Berlin: Springer Verlag.
[63] Singer, W. (2009). Seminal Papers Don’t Age. Perception, 38, 799-802.
[64] Smallbone, D., & Welter, F. (2012). Entrepreneurship and Institutional Change in Transition Economies: The Commonwealth of Independent States, Central and Eastern Europe and China Compared. Entrepreneurship and Regional Development, 24, 215-233.
[65] Stevenson, H. H., & Jarillo, J. C. (2007). A Paradigm of Entrepreneurship: Entrepreneurial Management. In Entrepreneurship: Concepts, Theory and Perspective (pp. 155-170). Berlin: Springer.
[66] Tax, S. S., Brown, S. W., & Chandrashekaran, M. (1998). Customer Evaluations of Service Complaint Experiences: Implications for Relationship Marketing. Journal of Marketing, 62, 60-76.
[67] Timmons, J. A., & Spinelli, S. (2004). New Venture Creation: Entrepreneurship for the 21st Century (Vol. 6). New York: McGraw-Hill.
[68] Tranfield, D., Denyer, D., & Smart, P. (2003). Towards a Methodology for Developing Evidence-Informed Management Knowledge by Means of Systematic Review. British Journal of Management, 14, 207-222.
[69] Vargo, S. L., & Lusch, R. F. (2004). Evolving to a New Dominant Logic for Marketing. Journal of Marketing, 68, 1-17.
[70] Wang, C., & Blei, D. M. (2011). Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[71] Webster, J., & Watson, R. T. (2002). Analyzing the Past to Prepare for the Future: Writing a Literature Review. MIS Quarterly, 26, 13-23.
[72] Wilson, A., Zeithaml, V. A., Bitner, M. J., & Gremler, D. D. (2012). Services Marketing: Integrating Customer Focus across the Firm. New York: McGraw Hill.
[73] Zahra, S. A. (1993). A Conceptual Model of Entrepreneurship as Firm Behavior: A Critique and Extension. Entrepreneurship Theory and Practice, 17, 5-21.
[74] Zahra, S. A. (2012). Organizational Learning and Entrepreneurship in Family Firms: Exploring the Moderating Effect of Ownership and Cohesion. Small Business Economics, 38, 51-65.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.