Intelligences Journal revue en intelligence économique

Evaluation of the potential applicability of a research project: towards a methodology based on content analysis

Ivana Roche, Nathalie Vedovotto, Claire François, Dominique Besagni, Pascal Cuxac, Marianne Hörlesberger, Dirk Holste and Edgar Schiebel

Abstract

When scientific output becomes too complex and sophisticated to be assessed by a single reviewer, informetric methods can prove valuable. They may be used either to support the decision-making process or to evaluate the decision. Informetric evaluation can meet a growing need in the follow-up of scientific results. It is in this context that the current study was undertaken, proposing a methodology to evaluate the applicability potential of a research project submitted to a funding agency. The approach relies on content analysis carried out with the help of natural language processing tools and automatic classification. The aim is to facilitate the expert assessment step, which nevertheless remains unavoidable. Our work is illustrated by a real case study with results drawn from a prestigious European funding agency whose selection process uses the scientific excellence of exploratory research as its only decision criterion.

Keywords

selection process, project evaluation, expertise, content analysis, applicability, indicator

Full text

Introduction

The question addressed in this study is the evaluation of the potential applicability of a scientist’s research presented during conferences and/or published in the scientific and technical literature. We were confronted with this problem in the context of a European project [1] where we developed an analysis methodology based on the modelling of criteria defined by the European Research Council (ERC) and employed by its scientific experts during the selection of research projects for financing.

We do not expect that a numerical model can substitute either for human expertise or for recognition by the scientific community, both of which are difficult to quantify. However, representations of this type can serve to verify decisions, provide complementary information or highlight bias in the selection procedure [2]. The nature and goals of the criteria used in the selection of research projects are highly diverse: identity of the principal investigator, risk associated with the project, novelty of the subject, etc. This is particularly obvious in the choice of indicators defined to model the strategy, missions and scientific policy that funding agencies employ to establish useful and interpretable causal relationships. Divergences can be noted between the choices of experts and the suggestions that arise from informetric indicators. Several studies examine the underlying reasons in depth [2-4].

Applicability is one of the ERC criteria and is characterised as follows: “… may well be concerned with both new knowledge about the world and with generating potentially useful knowledge at the same time. Therefore, there is a much closer and more intimate connection between the resulting science and technology, with few of the barriers that arise when basic research and applied research are carried out separately.” [5].

Donald Stokes [6] introduced an approach that allows fundamental research to be distinguished from applied research, defining a two-dimensional chart known as Pasteur’s Quadrant. This is a characterisation of research that allows existing relationships between fundamental sciences and technological innovation to be classified. The work of Louis Pasteur is considered to be a perfect example of scientifically important research that is likely to give rise to practical applications. The chart distinguishes three types of research:

  • pure fundamental research, as illustrated by the work of Niels Bohr, Danish physicist from the beginning of the 20th century;

  • purely applied research, as illustrated by the work of Thomas Edison, North American inventor and industrialist;

  • research based on theory, but with an applied aim, described as Pasteur’s Quadrant, after the French scientist, originally a chemist and physicist, pioneer of Microbiology.

In this article, we first define the approach used to evaluate the applied orientation of a project. We then describe the data processing. To conclude, we present and discuss the obtained results.

Context

A usual way to determine how applied a scientist’s research is consists in studying the patents with which he is associated [e.g. 7-9]. A patent application is often the outcome of a technology transfer process. It is therefore considered a practical result of research endeavours and presents strictly applied characteristics. Another possibility is to examine directly the research published by the scientist and to classify it as applied or fundamental. In a previous study, we modelled these two criteria with the help of indicators calculated from data provided by the researcher at the time of project submission, namely the number of patents to which the scientist had contributed and the titles of the journals in which his research was published, according to his Curriculum Vitae.

The first indicator is a simple count that takes non-negative integer values, i.e. values in [0, +∞). However, the number of patents cited in a Curriculum Vitae is often low, which reduces the accuracy of this indicator. The second indicator is obtained by calculating the share of the publications that appear in journals whose content is categorised as applied. This indicator takes real values between 0 and 1. For both indicators we assume that the higher the value, the more applied the research is. Although this approach may seem pragmatic, it has several weaknesses, due notably to the journal categorisation step. Indeed, once a binary characterisation "applied or fundamental" has been assigned to the journals, it seems a priori effortless to transpose the journal category automatically to all articles that appear therein. The first difficulty is the determination of criteria that allow the journal type to be decided. A great deal of effort has gone into the characterisation of journals according to scientific domains, but the problem has not been addressed in terms of whether a journal is firmly applied or resolutely fundamental.
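The two CV-based indicators described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `CurriculumVitae` structure, the function names and the toy journal names are hypothetical, and the set of "applied" journals is assumed to be given, which is precisely the step the text identifies as problematic.

```python
from dataclasses import dataclass, field


@dataclass
class CurriculumVitae:
    """Hypothetical container for the data declared at project submission."""
    patents: list = field(default_factory=list)
    journals: list = field(default_factory=list)  # one journal title per publication


def patent_indicator(cv):
    """First indicator: number of patents, a non-negative integer."""
    return len(cv.patents)


def applied_share_indicator(cv, applied_journals):
    """Second indicator: share of publications in journals tagged as applied (0 to 1)."""
    if not cv.journals:
        return 0.0  # assumption: no publications means a share of zero
    applied = sum(1 for title in cv.journals if title in applied_journals)
    return applied / len(cv.journals)


# Toy data, not taken from the case study.
cv = CurriculumVitae(
    patents=["patent-1"],
    journals=["Journal A", "Journal B", "Journal A", "Journal C"],
)
applied_journals = {"Journal A"}

print(patent_indicator(cv))                           # 1
print(applied_share_indicator(cv, applied_journals))  # 0.5
```

The second function makes the weakness visible: every article inherits the category of its journal, whatever its actual content.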

Furthermore, a journal may not belong to a single category; its type may depend on the scientific domain of each publishing researcher. For example, consider Biology as essentially a fundamental research area. All journals classified in this domain, and all articles published on the subject, would risk being called "fundamental", which is not always true. Indeed, a publication written by a computer scientist that presents the development of an application serving biologists should be tagged as "applied".

In the current study, we bypass this categorisation step by proposing a more advanced indicator, based on content analysis and applied by a scientific expert. It relies firstly on all publications that cite at least one of the researcher’s articles, and secondly on the publications that cite at least one reference in common with the bibliography of the researcher’s project, on which the project’s reasoning is assumed to be partly founded.

Concerning the first set of publications, our hypothesis is that, via citation, articles express the exploitation of the researcher’s publications. Consequently, these articles constitute a genuine and pragmatic source of information about the use of the researcher’s former results in the context of new works. Content analysis applied to this group of articles provides the means to assess the use of the researcher’s work up to the time of project submission. For the second set, we hypothesize that the publications having one or more cited references in common with a project may represent the scientific context of which the project could be part. The content analysis of this corpus qualifies the applicateness degree of research that, like the project, draws on the same body of knowledge. By analogy, the same applicateness degree is attributed to the project. Finally, comparing the results of the two analyses answers the question of how the applicateness degree of a scientist’s research evolves, and allows its potential applicability to be inferred.

Moreover, this approach takes two levels of applicateness into account: the first corresponds to the simple use of published results in new work that refers to them, while the second characterises, via content analysis, that new work itself on a scale that admits only two values: applied or fundamental.


Figure 1. Methodological schema of an evaluation process of a research project’s applicability potential

The content analysis results in a cartographic representation for each of the two corpora considered, in the form of a thematic map that organizes their content into clusters of similar publications. The examination of these two maps allows scientific experts to detect the presence of themes with applied characteristics and to evaluate their relative importance. Furthermore, comparing the results obtained for the two thematic maps makes it possible to characterize the evolution of the applicateness degree of the researcher’s work. Four types of scientific "history" can be observed. Two of these show a true stability: past and present research are and remain of either an applied or a fundamental aim. The two other trajectories are more interesting, since they show a true transition:

  • either towards the applied: the researcher changes from fundamental to applied;

  • or towards the fundamental: the researcher changes from applied to fundamental.

We have represented these different elements with the help of Pasteur’s Quadrant (Figure 1).

Although a history towards applied research is easily interpreted, one showing a transition towards fundamental research raises questions. Could this transition simply be a necessary passage, required to acquire fundamental theories that the scientist can then use in future applications?

Methods

Primary data were extracted directly from the documents deposited at the time of project submission, namely the researcher’s name and the references cited in the project bibliography. From the name of the researcher, a list of his works published in the scientific and technical literature was determined, and the corresponding bibliographic records were extracted from a database that also gives access to the cited references. Once this list was obtained, it was easy to identify the publications that cite at least one of the researcher’s articles. This first corpus is considered to represent the scientific context of the citing articles, which use, to a lesser or greater extent, the knowledge covered by the scientist’s previous research.
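The construction of the first corpus amounts to a simple selection over citation links. The sketch below assumes that each bibliographic record is available as an identifier mapped to the set of references it cites; the function name and the toy identifiers are hypothetical and do not reflect the database interface actually used.

```python
def citing_corpus(own_publications, cited_by_record):
    """Return the records that cite at least one of the researcher's publications.

    own_publications: set of identifiers of the researcher's publications.
    cited_by_record: dict mapping a record identifier to the set of
    identifiers it cites (assumed data layout).
    """
    return {
        record
        for record, references in cited_by_record.items()
        if references & own_publications  # non-empty intersection = at least one citation
    }


# Toy data, not taken from the case study.
own_publications = {"R1", "R2"}
cited_by_record = {
    "A": {"R1", "X"},  # cites R1, so it is kept
    "B": {"Y"},        # cites none of the researcher's work, so it is dropped
    "C": {"R2"},       # cites R2, so it is kept
}

first_corpus = citing_corpus(own_publications, cited_by_record)
print(sorted(first_corpus))  # ['A', 'C']
```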

Moreover, with the exception of self-citations, all the references cited by the researcher in his project proposal were used to extract the other publications in which these same references are also cited. This group of publications forms the second corpus, which is taken as a representation of the knowledge that these publications share with the researcher’s project.

A content analysis was carried out on each of these two corpora. To this end, bibliographic records were re-formatted and integrated into the information analysis platform Stanalyst. A data mining step, which applied techniques based on NLP (natural language processing), performed an assisted indexing of the records, associating keywords with them. A clustering approach was then applied to this enriched corpus of bibliographic records. The computing tool used, implemented in Stanalyst, employs an unsupervised and non-hierarchical automatic classification algorithm, the axial K-means method, which is inspired by the formalism of Kohonen self-organizing maps. This method uses keywords as indicators of the content of bibliographic records, which in turn are considered as indicators of research themes. This step was followed by a principal component analysis that allows clusters to be positioned on a 2D map. The relationships between the clusters were then used to construct thematic networks that constitute, according to a geographical metaphor, a map of the research domain represented by the corpus.
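To give an idea of the clustering and mapping steps, here is a minimal sketch on a toy record-by-keyword matrix. It is not the Stanalyst implementation: a plain cosine-based (spherical) K-means is used as a stand-in for the axial K-means method, the farthest-first seeding is an illustrative choice made to keep the example deterministic, and all names and data are hypothetical.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50):
    """Cluster the rows of X (records x keywords) by cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)  # unit-length rows

    # Deterministic farthest-first seeding (assumption, for reproducibility).
    chosen = [0]
    while len(chosen) < k:
        sims = (Xn @ Xn[chosen].T).max(axis=1)
        sims[chosen] = np.inf  # never pick the same record twice
        chosen.append(int(np.argmin(sims)))
    centroids = Xn[chosen].copy()

    for _ in range(n_iter):
        labels = np.argmax(Xn @ centroids.T, axis=1)
        updated = centroids.copy()
        for c in range(k):
            total = Xn[labels == c].sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:  # keep the old centroid if the cluster is empty
                updated[c] = total / length
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = np.argmax(Xn @ centroids.T, axis=1)
    return labels, centroids


def pca_2d(points):
    """Project points onto their first two principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy corpus: 6 records indexed with 4 keywords (1 = keyword present).
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

labels, centroids = spherical_kmeans(X, k=2)
coords = pca_2d(centroids)  # one 2D position per cluster
print(labels.tolist())      # [0, 0, 0, 1, 1, 1]
print(coords.shape)         # (2, 2)
```

In this toy case the records sharing keywords end up in the same cluster, and each cluster receives a position on a 2D map, which is the kind of output the expert examines.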

The expert then analysed the results obtained for each of the two corpora, namely the content of the clusters as well as their position and their relationships within the cluster map. During this analysis, the expert applied a specific assessment grid to evaluate whether the content of each corpus is mainly applied or fundamental. To achieve this, the expert examined the content of each cluster of the two maps, studying the title and keywords associated with each record, in order to evaluate the applicateness score of these clusters. For each cluster, the expert:

  • determined the proportion of fundamental and applied subjects, Pf and Pa respectively (the sum of which equals 1);

  • determined the applicateness score of the cluster, equal to (Pa - Pf) and lying in the interval [-1, 1]. A value of -1 corresponds to a totally fundamental cluster, a value of 1 to a totally applied one, and intermediate values indicate clusters that present both characteristics.

The sum of the applicateness scores attributed to the clusters of a map gives the applicateness degree of the corpus. If this value is negative, the corpus concerns mainly fundamental research; if it is positive, the corpus concerns mainly applied research.
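The scoring rule can be written down directly. The sketch below assumes the expert supplies the applied proportion Pa of each cluster, with Pf = 1 - Pa; the function names and the proportions are illustrative, not values from the case study.

```python
def cluster_score(p_applied):
    """Applicateness score of one cluster: Pa - Pf, with Pf = 1 - Pa."""
    if not 0.0 <= p_applied <= 1.0:
        raise ValueError("the applied proportion must lie in [0, 1]")
    p_fundamental = 1.0 - p_applied
    return p_applied - p_fundamental


def corpus_degree(applied_proportions):
    """Applicateness degree of a corpus: sum of its cluster scores."""
    return sum(cluster_score(p) for p in applied_proportions)


# Illustrative proportions for a map of four clusters.
proportions = [0.0, 0.5, 0.75, 1.0]

print([cluster_score(p) for p in proportions])  # [-1.0, 0.0, 0.5, 1.0]
print(corpus_degree(proportions))               # 0.5
```

Because the degree is a plain sum, its range depends on the number of clusters: a map of n clusters yields a value between -n and n.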

This expertise task carried out on the two corpora allowed them to be compared in order to determine whether the scientific history of the researcher is stably localised within the boundaries of either applied or fundamental research, or whether this exhibits a transition between the two.
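The comparison of the two corpora can be expressed as a small decision rule over their applicateness degrees. This is a sketch only: the labels are hypothetical, and the handling of a degree of exactly zero, which the text does not specify, is an assumption.

```python
def orientation(degree):
    """Orientation of a corpus from the sign of its applicateness degree."""
    if degree > 0:
        return "applied"
    if degree < 0:
        return "fundamental"
    return "undetermined"  # assumption: the zero case is not defined in the text


def trajectory(first_corpus_degree, second_corpus_degree):
    """Classify the scientific history from the past (first corpus)
    to the project context (second corpus)."""
    past = orientation(first_corpus_degree)
    present = orientation(second_corpus_degree)
    if "undetermined" in (past, present):
        return "undetermined"
    if past == present:
        return f"stable {past}"
    return f"transition towards {present}"


print(trajectory(-2.0, 3.0))   # transition towards applied
print(trajectory(1.5, -0.5))   # transition towards fundamental
print(trajectory(4.17, 8.15))  # stable applied
```

The rule only captures the four qualitative histories; it says nothing about the size of the change between the two corpora, which still has to be read from the maps by the expert.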

Results

The method presented in this report was applied to a real case study drawn from the 2009 ERC call for projects. Among the 25 major domains covered by the ERC we chose Communication and Systems Engineering. In this domain 31 projects were submitted, and we chose one of the 4 that were selected by the ERC panel of experts in charge of this scientific domain. For reasons of confidentiality we cannot publish nominative data or information that would allow the researcher to be identified by cross-checking. The researcher responsible for the project is therefore referred to as CHE. A search in the WoS (Web of Science) on the name CHE returned 24 publications from 2000 to 2009, after verifying the absence of homonyms or artefacts. According to WoS these publications were cited in 663 articles, which we collected and which form the first corpus (called FC). Furthermore, in his project CHE presented a bibliography of 45 references. Five of these were eliminated as they were self-citations and 25 were not indexed in the WoS. The 15 remaining references, published between 2000 and 2007, were cited in 4612 articles, which we extracted and which form the second corpus, known henceforth as SC.
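The filtering of the project bibliography and the extraction of the second corpus can be sketched as follows. The data layout (sets of identifiers, and a mapping from each record to the references it cites) and all names are assumptions for illustration; only the counts 45, 5, 25 and 15 come from the text.

```python
def usable_references(project_refs, own_publications, indexed):
    """Drop self-citations and references that the database does not index."""
    return {
        ref for ref in project_refs
        if ref not in own_publications and ref in indexed
    }


def coupled_corpus(references, cited_by_record):
    """Records citing at least one of the retained project references."""
    return {
        record
        for record, cited in cited_by_record.items()
        if cited & references
    }


# Toy data, not taken from the case study.
project_refs = {"R1", "E1", "E2", "E3"}
own_publications = {"R1"}       # R1 is a self-citation
indexed = {"R1", "E1", "E2"}    # E3 is not indexed in the database
cited_by_record = {
    "A": {"E1"},
    "B": {"E3"},
    "C": {"E2", "R1"},
    "D": {"R1"},
}

retained = usable_references(project_refs, own_publications, indexed)
second_corpus = coupled_corpus(retained, cited_by_record)
print(sorted(retained))       # ['E1', 'E2']
print(sorted(second_corpus))  # ['A', 'C']

# Counts reported for the case study: 45 references, 5 self-citations,
# 25 not indexed, leaving 15 usable references.
remaining = 45 - 5 - 25
print(remaining)              # 15
```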

The FC and SC records were re-formatted so that they could be processed by the Stanalyst analysis platform and were enriched by an indexing with keywords obtained by applying text mining tools based on NLP techniques. Moreover, this assisted indexing step benefited from scientific expertise at two levels: during the constitution of the vocabulary that serves as the terminological reference, and during the validation of the final result, with the elimination of terms judged too generic. Table 1 summarizes the principal characteristics of the two indexed corpora.

Corpus | Number of records in the corpus | Number of indexing keywords | Number of clusters | Number of keywords in the clusters | Records in the clusters (number) | Records in the clusters (% of initial corpus)
FC     | 663                             | 2210                        | 20                 | 1000                               | 662                              | 99.85 %
SC     | 4612                            | 4915                        | 20                 | 3010                               | 4608                             | 99.91 %

Table 1. Characteristics of the two indexed corpora
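The coverage percentages in Table 1 follow from the record counts. A short check, in which the dictionary layout and the function name are illustrative:

```python
# (records in the corpus, records assigned to clusters), as reported in Table 1.
table_1 = {"FC": (663, 662), "SC": (4612, 4608)}


def coverage(total, clustered):
    """Share of the initial corpus found in the clusters, in percent."""
    return round(100 * clustered / total, 2)


for corpus, (total, clustered) in table_1.items():
    print(corpus, coverage(total, clustered))
# FC 99.85
# SC 99.91
```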

The maps obtained for FC and SC, along with the results of the expert assessment in terms of the applicateness score of each cluster, are presented in figures 2 and 3. For example, the clusters “synchronization”, “energy consumption”, “interference”, “channel estimation” and “monitoring” in figure 2 gave values of -1, -0.34, 0, 0.5 and 1 respectively.

The applicateness degree was calculated for each corpus, giving 4.17 for FC and 8.15 for SC. These values show that both corpora are applied, but a clear evolution is observed between FC and SC. Indeed, when the two maps are compared, it can be seen that FC shows 5 totally fundamental clusters, whereas SC has none. Furthermore, some clusters that are present on both maps, for example “energy consumption”, show a notable increase in their applicateness score in SC.


Figure 2. Mapping of the first corpus (FC), showing the applied and fundamental proportion of each cluster as textured rectangles (chequered background = applied, hatched = fundamental)


Figure 3. Mapping of the second corpus (SC), showing the applied and fundamental proportion of each cluster as textured rectangles (chequered background = applied, hatched = fundamental)

This increase in applicateness degree between FC and SC allowed us to hypothesize that, provided the scientific history of a researcher remains within an applied perimeter, the potential applicability of his project is high. We concluded that this project meets the applicability criterion defined by the funding body.

Conclusions and perspectives

The approach developed aims to provide a method to estimate the potential applicability of a research project. It should be noted that this study relies on a set of representative data provided by the funding agency itself. A source of bibliographic data is then used to enrich these data: on the one hand, publications that cite those of the researcher, and on the other hand, articles that share at least one of the references cited in the project. Content analysis is applied to these two sets of data to support the work of the scientific experts. Overall, this methodology could provide an a priori bibliometric evaluation of the applicability potential of submitted projects, helping the funding agency in the selection process.

In the future, our work could be directed towards an improvement of this method, by

- weighting the "applicateness score" obtained for each cluster, using a parameter derived from the results of clustering, to put the cluster into context with respect to its environment, i.e., the cluster map

- introducing a tool to help the expert with the decision, by proposing a value of the applicateness score of each cluster calculated by means of an existing semantic categorization of keywords.

However, the final decision will remain the prerogative of the funding agency and will not rely on a single indicator.

Bibliography

[1] Holste D., Roche I., Hörlesberger M., Besagni D., Scherngell T., Francois C., Cuxac P., Schiebel E., submitted to Scientometrics, 2012

[2] Juznic P., Peclin S., Zaucer M., Mandelj T., Pusnik M., Demsar F., Scientometric indicators: peer review, bibliometric methods and conflict of interest. Scientometrics, 2010, 85, p. 429-441

[3] Besselaar P.v.d., Leydesdorff L., Past performance, peer review and project selection: a case study in the social and behavioral sciences. Research Evaluation, 2009, 18, p. 273-288

[4] Bornmann L., Leydesdorff L., Besselaar P.v.d., A Meta-evaluation of Scientific Research Proposals: Different Ways of Comparing Rejected to Awarded Applications. Journal of Informetrics, 2010, 4, p. 211-220

[5] EC – EUROPEAN COMMISSION, Frontier research: The European Challenge. High Level Expert Group Report, EUR 21619, 2005

[6] Stokes D., Pasteur's Quadrant - Basic Science and Technological Innovation, Brookings Institution Press, 1997

[7] Glänzel W., Meyer M., Patents cited in the scientific literature: An exploratory study of “reverse” citation relations. Scientometrics, 2003, 58, p. 415-428

[8] Moed H.F., Glänzel W., Schmoch U., Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems. Kluwer Academic Publishers, 2004

[9] Glänzel W., Zhou P., Publication activity, citation impact and bi-directional links between publications and patents in biotechnology. Scientometrics, 2011, 86, p. 505-525

To cite this document:

Ivana Roche, Nathalie Vedovotto, Claire François, Dominique Besagni, Pascal Cuxac, Marianne Hörlesberger, Dirk Holste and Edgar Schiebel, «Evaluation of the potential applicability of a research project: towards a methodology based on content analysis», Intelligences Journal [Online], Number 3, Full text issues, URL: http://lodel.irevues.inist.fr/isj/index.php?id=336

Authors

Ivana Roche
INIST-CNRS, 2 allée du Parc de Brabois, 54519 Vandoeuvre-lès-Nancy Cedex, France
Nathalie Vedovotto
INIST-CNRS, 2 allée du Parc de Brabois, 54519 Vandoeuvre-lès-Nancy Cedex, France
Claire François
INIST-CNRS, 2 allée du Parc de Brabois, 54519 Vandoeuvre-lès-Nancy Cedex, France
Dominique Besagni
INIST-CNRS, 2 allée du Parc de Brabois, 54519 Vandoeuvre-lès-Nancy Cedex, France
Pascal Cuxac
INIST-CNRS, 2 allée du Parc de Brabois, 54519 Vandoeuvre-lès-Nancy Cedex, France
Marianne Hörlesberger
AIT, Austrian Institute of Technology GmbH, Donau-City-Strasse 1, 1220 Vienna, Austria
Dirk Holste
AIT, Austrian Institute of Technology GmbH, Donau-City-Strasse 1, 1220 Vienna, Austria
Edgar Schiebel
AIT, Austrian Institute of Technology GmbH, Donau-City-Strasse 1, 1220 Vienna, Austria