A new approach for automatizing diachronic analysis of research domain
Jean-Charles Lamirel
Abstract
The acquisition of new scientific knowledge and the evolution of society's needs regularly call the orientations of research into question. Means to track and visualize these evolutions are thus necessary. Existing tools for research surveys provide only a fixed view of research activity, which does not allow dynamic topic mining tasks to be performed. The objective of this paper is thus to propose a new incremental approach for following the evolution of research themes and research groups in a given scientific discipline, in terms of emergence or decline.
Keywords
neural networks, time-varying data, data analysis, multi-view model, Bayesian reasoning
Introduction
The literature taking the chronological aspect of information flows into account is focused on "DataStream" approaches, whose main idea is the "on the fly" management of incoming (i.e. not stored) data [1] [41]. But the algorithms resulting from this work are intended to treat very large volumes of data (i.e. data streams) and are thus not optimal for detecting emergent topics or for precisely following up the evolution of a research field.
Some of our recent works [25] highlighted that most clustering algorithms show high performance in the usual context of the analysis of homogeneous textual datasets. However, this work also clearly highlighted the drastic performance decrease of these algorithms when the input is a heterogeneous or polythematic textual dataset, which can be considered a static simulation of a time-evolving dataset. To cope with the current defects of existing incremental clustering methods, an alternative approach for analyzing information evolving over time consists in performing diachronic analysis. This type of analysis relies on the application of a clustering method to data associated with two or more successive periods of time, and on the study of the evolution of the clusters' contents and of their mappings between the different periods. For analyzing the evolution of the vocabulary describing the clusters of different periods, Schiebel et al. [37] propose to construct a keyword comparison matrix based on the percentage of keywords of one period which pre-exist in the clusters of another period. Thanks to this matrix, it is then possible for an expert of the domain to highlight different cluster behaviours: stability, but also merging or splitting. Even if it avoids exploiting the clustering methods in their critical area, an important limitation of this approach is that the process of comparison between clustering models must be achieved in a supervised way.
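Such a keyword comparison matrix is straightforward to compute; the sketch below (with invented cluster contents, not data from [37]) builds the matrix whose entry (i, j) is the share of the keywords of cluster i in one period that pre-exist in cluster j of the other period.

```python
# Hypothetical sketch of a Schiebel-style keyword comparison matrix:
# entry (i, j) is the share of the keywords of cluster i (period 1)
# that pre-exist in cluster j (period 2). Cluster contents are invented.

def keyword_overlap_matrix(clusters_p1, clusters_p2):
    """Return M with M[i][j] = |K_i ∩ K_j| / |K_i|."""
    matrix = []
    for kw_i in clusters_p1:
        row = []
        for kw_j in clusters_p2:
            shared = len(set(kw_i) & set(kw_j))
            row.append(shared / len(kw_i) if kw_i else 0.0)
        matrix.append(row)
    return matrix

p1 = [["laser", "diode", "GaAs"], ["sensor", "imaging"]]
p2 = [["laser", "diode", "InP"], ["photodetector", "imaging", "sensor"]]
M = keyword_overlap_matrix(p1, p2)
```

An expert reading such a matrix would interpret one row spreading over several columns as a split, and several rows concentrating on one column as a merge.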
In [14], Lamirel et al. first introduced the dynamic cooperation between clustering models in the context of information retrieval.
The MVDA paradigm thus consists of several clustering models that have been generated from the same data, or from data that share the same overall description space. Each model is issued from a specific viewpoint and can be generated by any clustering method. The relation between the models is established through an inter-model communication mechanism which itself relies on unsupervised Bayesian reasoning.
The inter-model communication mechanism makes it possible to highlight semantic relationships between different topics (i.e. clusters) belonging to different viewpoints related to the same data, or even to the same data description space (see Fig. 1).
Figure 1 - The MVDA inter-model communication principle.
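As an illustration of the communication principle of Fig. 1, the following sketch estimates the activity transmitted from a source cluster to the clusters of a target model from the data they share; it is a simplified stand-in for the unsupervised Bayesian reasoning of the MVDA model, with hypothetical cluster contents.

```python
# Simplified stand-in (not the authors' implementation) for inter-model
# communication: the activity transmitted from a source cluster to each
# target cluster is estimated from the data the two clusters share.
# Cluster contents ("d1", ...) are hypothetical record identifiers.

def transmitted_activity(source_cluster, target_model):
    """Estimate P(t|s) as the fraction of s's data found in each target t."""
    activities = {}
    for name, target_cluster in target_model.items():
        shared = len(source_cluster & target_cluster)
        activities[name] = shared / len(source_cluster)
    return activities

source = {"d1", "d2", "d3", "d4"}
target_model = {"t1": {"d1", "d2", "d9"}, "t2": {"d3"}, "t3": {"d7"}}
act = transmitted_activity(source, target_model)
```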
The MVDA paradigm thus represents a promising framework in the context of the analysis of time-varying information. In section 2, we show how to exploit the principles of this paradigm in two different ways to automatically perform such analyses. Section 3 describes our first experiment and its results. Section 4 draws our conclusions and perspectives.
A new approach for analyzing time-varying information
Analyzing the difference between time periods involves the different kinds of topic changes or similarities that can occur between the periods (appearing topics, disappearing topics, splitting topics, merging topics, stable topics). Our approach is a data-properties-oriented approach, in which a second step of cluster labelling is carried out after the construction of the clustering model for each time period. The goal of the labelling step is to figure out which peculiar properties, or labels, can be associated with each cluster of a given time period. The identification of the topic relationships between two time periods is then achieved through Bayesian reasoning relying on the extracted labels that are shared by the compared periods (see Fig. 2). The main advantages of this approach are both its intrinsic precision, due to the direct model comparison process, and its preservation of the independence between compared models. Nevertheless, this precision can only be obtained under the constraint of using a very efficient cluster labelling technique.
Figure 2  The proposed approach
Anyone aiming to compare clustering methods, or even to evaluate clustering results, faces the problem of choosing reliable clustering quality indexes. The classical evaluation indexes for clustering quality are based on intra-cluster inertia and inter-cluster inertia [18] [21] [27]. Thanks to these two indexes, a clustering result is considered good if its intra-cluster inertia is low compared to its inter-cluster inertia. However, as has been shown in [18], these distance-based indexes are often strongly biased1 and highly dependent on the clustering method. They thus cannot easily be used for comparing different methods, or even different clustering results issued from data whose description spaces have different sizes. Moreover, as has also been shown in [9], they are often unable to identify an optimal clustering model whenever the dataset consists of complex data that must be represented in a description space that is both highly multidimensional and sparse, as is often the case with textual data. To cope with such problems, our own approach takes its inspiration both from the behavior of symbolic classifiers and from the evaluation principles used in Information Retrieval. Our Recall/Precision and F-measure indexes exploit the properties of the data associated with each cluster after the clustering process, without prior consideration of cluster profiles [18]. Their main advantage is thus to be independent of the clustering methods and of their operating mode.
In IR, the Recall R represents the ratio between the number of relevant documents returned by an IR system for a given query and the total number of relevant documents that should have been found in the documentary database [29]. The Precision P represents the ratio between the number of relevant documents returned by an IR system for a given query and the total number of documents returned for that query. Recall and Precision generally behave in an antagonistic way: as Recall increases, Precision decreases, and conversely. The F-function has thus been proposed in order to highlight the best compromise between these two values [36].
It is given by:

F = 2 · (R · P) / (R + P)
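This compromise is the standard harmonic mean of Precision and Recall; a minimal implementation:

```python
# Standard F-measure (harmonic mean of Precision and Recall), with the
# usual convention that it is 0 when both values are 0.

def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```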
Based on the same principles, the Recall and Precision quality indexes which we introduce hereafter evaluate the quality of a clustering method in an unsupervised way2 by measuring the relevance of the clusters' content in terms of shared properties. In our further descriptions, a cluster's content is supposed to be represented by the data associated with it after the clustering process, and the descriptors (i.e. the properties) of the data are supposed to be weighted by values within the [0,1] range.
Let us consider a set of clusters C resulting from a clustering method applied on a set of data D. The local Recall (Rec) and Precision (Prec) indexes for a given property p of the cluster c can be expressed as:

Rec(p,c) = Σ_{x ∈ c_p} W_x^p / Σ_{x ∈ D_p} W_x^p,   Prec(p,c) = Σ_{x ∈ c_p} W_x^p / Σ_{x ∈ c, p' ∈ x} W_x^{p'}

where the notation X_p represents the restriction of the set X to the set members having the property p.
Then, for estimating the overall clustering quality, the averaged Recall (R) and Precision (P) indexes can be expressed as:

R = (1/|C|) Σ_{c ∈ C} (1/|S_c|) Σ_{p ∈ S_c} Rec(p,c),   P = (1/|C|) Σ_{c ∈ C} (1/|S_c|) Σ_{p ∈ S_c} Prec(p,c)
where S_c is the set of properties which are peculiar to the cluster c, described as:

S_c = { p ∈ c | W_c^p = Max_{c' ∈ C_p} (W_{c'}^p) },   with W_c^p = Σ_{x ∈ c_p} W_x^p

where C_p represents the peculiar set of clusters extracted from the clusters of C, which verifies:

C_p = { c ∈ C | c_p ≠ ∅ }

and where W_x^p represents the weight of the property p for element x.
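The local indexes and the peculiar property sets can be illustrated on toy weighted data; the sketch below follows the definitions above, with a peculiar property approximated here as one whose local Recall peaks on the cluster (a simplification of the exact criterion):

```python
# Toy illustration of the local Recall/Precision indexes on weighted
# property vectors. The peculiar property set S_c is approximated as the
# properties whose local Recall peaks on the cluster; data are invented.

def local_recall(p, cluster, all_data):
    num = sum(x.get(p, 0.0) for x in cluster)
    den = sum(x.get(p, 0.0) for x in all_data)
    return num / den if den else 0.0

def local_precision(p, cluster):
    num = sum(x.get(p, 0.0) for x in cluster)
    den = sum(w for x in cluster for w in x.values())
    return num / den if den else 0.0

def peculiar_properties(cluster_idx, clusters, all_data):
    """S_c: properties whose local Recall is maximal on cluster c."""
    props = {p for x in all_data for p in x}
    S = set()
    for p in props:
        recalls = [local_recall(p, c, all_data) for c in clusters]
        if recalls[cluster_idx] == max(recalls) and recalls[cluster_idx] > 0:
            S.add(p)
    return S

c1 = [{"laser": 1.0, "diode": 1.0}, {"laser": 1.0}]
c2 = [{"sensor": 1.0, "laser": 1.0}]
data = c1 + c2
```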
Similarly to IR, the F-measure (described by Eq. 3) can be used to combine the averaged Recall and Precision results. Moreover, we have demonstrated in [14] that if both the averaged Recall and the averaged Precision reach unity, the peculiar set of clusters represents a Galois lattice. The combination of these two measures therefore makes it possible to evaluate to what extent a numerical clustering model can be assimilated to a Galois lattice natural classifier. The stability of our quality criteria has also been demonstrated in [5].
The Macro-Recall and Macro-Precision indexes defined by (Eq. 2) can be considered cluster-oriented measures because they provide average values of Recall and Precision for each cluster. They have opposite behaviors with respect to the number of clusters. These indexes thus make it possible to estimate, in a global way, an optimal number of clusters for a given method and a given dataset. The best data partition, or clustering result, is in this case the one which minimizes the difference between their values. However, similarly to the classical distance-based indexes, their main defect is that they are not sufficiently sensitive to the presence of a small number of large heterogeneous clusters, especially when a large number of small clusters exists alongside them [16].
To correct that, we propose to construct complementary property-oriented indexes (i.e. micro-measures), Micro-Recall and Micro-Precision, by averaging the Recall/Precision values of the peculiar properties independently of the structure of the clusters:

R_micro = (1/L) Σ_{c ∈ C} Σ_{p ∈ S_c} Rec(p,c),   P_micro = (1/L) Σ_{c ∈ C} Σ_{p ∈ S_c} Prec(p,c)

where L represents the size of the data description space.
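The contrast between the cluster-averaged (macro) and property-averaged (micro) indexes can be seen on toy values: one large heterogeneous cluster full of mediocre properties barely lowers the macro average, while it dominates the micro average. The per-cluster Recall values below are invented for illustration.

```python
# Toy contrast between cluster-averaged (macro) and property-averaged
# (micro) indexes; the per-cluster Recall values below are invented.

def macro_average(per_cluster_values):
    """Mean of the per-cluster means: every cluster weighs equally."""
    return sum(sum(v) / len(v) for v in per_cluster_values) / len(per_cluster_values)

def micro_average(per_cluster_values):
    """Mean over all peculiar properties, ignoring the cluster structure."""
    flat = [x for v in per_cluster_values for x in v]
    return sum(flat) / len(flat)

# one large heterogeneous cluster (eight mediocre properties)
# and two small, clean ones
values = [[0.5] * 8, [1.0], [1.0]]
```

Here the macro average stays high (about 0.83) while the micro average drops to 0.6, which is why the micro indexes better expose a few large heterogeneous clusters.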
In a complementary way, the role of cluster labeling is to highlight the peculiar characteristics or properties of the clusters associated with a cluster model at a given time. Labeling can thus be used both for visualizing or synthesizing clustering results [24] and for validating or optimizing the learning of a clustering method [5]. Labeling can rely either on endogenous data properties or on exogenous ones. Endogenous data properties are those used during the clustering process. Exogenous data properties represent either complementary properties or specific validation properties. Some label relevance indexes can be derived from our former quality indexes using a probabilistic approach.
The Label Recall LR derives directly from Eq. 4. It is expressed as:

LR(p,c) = Σ_{x ∈ c_p} W_x^p / Σ_{x ∈ D_p} W_x^p
The Label Precision LP can be expressed as:

LP(p,c) = Σ_{x ∈ c_p} W_x^p / Σ_{x ∈ c, p' ∈ x} W_x^{p'}
Consequently, the set of labels L_c that can be attributed to a cluster c can be expressed as the set of endogenous or exogenous cluster data properties which verifies:

L_c = { p | F_L(p,c) = Max_{c' ∈ C} (F_L(p,c')) }
where the Label F-measure F_L can be defined as:

F_L(p,c) = 2 · LR(p,c) · LP(p,c) / (LR(p,c) + LP(p,c))
Since the Label Recall is equivalent to the conditional probability P(c|p) and the Label Precision is equivalent to the conditional probability P(p|c), this labeling strategy can be classified as an expectation maximization approach, with respect to the original definition given by Dempster et al. [11].
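A minimal sketch of this label maximization step, assuming toy property weights: each property is retained as a label of the cluster on which its Label F-measure is maximal.

```python
# Sketch of label maximization on toy weights: a property is kept as a
# label of the cluster on which its Label F-measure (harmonic mean of the
# estimates of P(c|p) and P(p|c)) is maximal.

def label_f(p, c_idx, clusters):
    total_p = sum(x.get(p, 0.0) for c in clusters for x in c)
    in_c = sum(x.get(p, 0.0) for x in clusters[c_idx])
    mass_c = sum(w for x in clusters[c_idx] for w in x.values())
    lr = in_c / total_p if total_p else 0.0   # estimate of P(c|p)
    lp = in_c / mass_c if mass_c else 0.0     # estimate of P(p|c)
    return 2 * lr * lp / (lr + lp) if lr + lp else 0.0

def cluster_labels(c_idx, clusters):
    props = {p for c in clusters for x in c for p in x}
    return {p for p in props
            if label_f(p, c_idx, clusters) > 0
            and label_f(p, c_idx, clusters) == max(label_f(p, j, clusters)
                                                   for j in range(len(clusters)))}

clusters = [[{"laser": 2.0, "diode": 1.0}],
            [{"sensor": 3.0, "laser": 1.0}]]
```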
Experimentation and results
Data preprocessing
For our project, we chose to work with the INIST PASCAL database and to rely on its classification plan to analyze the dynamics of the various identified topics. We first employed a simple search strategy, consisting in selecting the bibliographic records having both a code in Physics and a code corresponding to a technological scope of application. The two selected applicative fields are Engineering and the Life Sciences (Biological Sciences and Medicine). By successive selections combining statistical techniques and expert approaches, we finally selected 10 promising sets of themes [35] [39].
We finally chose to use the set of themes of optoelectronic devices here, because this field is one of the most promising of the last decade [37]. 3890 records related to these topics were thus selected from the PASCAL database.
Our approach then consisted in splitting our corpus into two periods, (1996-1999: period 1) and (2000-2003: period 2), and in carrying out an automatic classification for each one using the content of the bibliographic records. The structure of the records makes it possible to distinguish the titles, the summaries, the indexing keywords and the authors as representatives of the contents of the information published in the corresponding article. In our experiment, only the research topics associated with the indexing keywords field are considered. For each period, a specific dataset is generated. For that purpose, a set of preprocessing steps is applied to the keywords field of the corresponding records in order to obtain a weighted vector representation of the information it contains. Keywords with an overall frequency of less than 3 are first removed from the record descriptions. 1797 records indexed by 1256 keywords are consequently kept for period 1, and 2074 records indexed by 1352 keywords for period 2. In a further step, the resulting vectors associated with each record are weighted in both periods using the IDF weighting scheme [33], in order to decrease the effect of the more frequent indexes.
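The preprocessing chain described above (frequency threshold, then IDF weighting) can be sketched as follows, with invented records; the exact IDF variant used in the original pipeline may differ from the plain log(N/df) assumed here.

```python
# Sketch of the preprocessing chain on invented records: keywords with an
# overall frequency below 3 are dropped, then records become IDF-weighted
# vectors. The plain log(N/df) IDF used here is an assumption.
import math
from collections import Counter

def preprocess(records, min_freq=3):
    freq = Counter(kw for rec in records for kw in set(rec))
    kept = {kw for kw, n in freq.items() if n >= min_freq}
    n_docs = len(records)
    idf = {kw: math.log(n_docs / freq[kw]) for kw in kept}
    return [{kw: idf[kw] for kw in set(rec) if kw in kept} for rec in records]

records = [["laser", "diode"], ["laser", "sensor"], ["laser", "diode"],
           ["diode", "laser"], ["sensor"]]
vectors = preprocess(records)
```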
Selection of the best clustering method
A first step of our experiment consists in identifying the best clustering method, in order to integrate it into a further step of time-varying analysis in the MVDA context. Even in a static context, based on a single time period, the planned clustering task represents a challenging goal for a clustering method. Hence, the chosen method should be able to produce homogeneous clustering results highlighting at least the different research sub-domains covered by the period. Moreover, for an efficient further comparison between periods, the obtained clusters should be sufficiently precise to discriminate between subtopics belonging to the same sub-domain.
For our experiment we selected two reference methods in the category of non-neural methods, namely the K-means method [28] and the Walktrap method [32], as well as several neural methods, ranging from static ones, like SOM and NG, to incremental ones, like GNG, IGNG or I²GNG. For each method, we ran many different experiments, varying the number of clusters in the case of the static methods and the neighbourhood parameters in the case of the incremental ones. We finally kept the best clustering result for each method with regard to the value of the Recall/Precision F-measure defined in section 2. With regard to these quality indexes, the best results are provided by the GNG method.
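The selection step amounts to a simple grid search over each method's free parameter, keeping the setting that maximizes the F-measure. In the sketch below, run_method and the quality function are hypothetical stand-ins for the real clustering call and the indexes of section 2.

```python
# Generic sketch of the selection step: each method is run over a range of
# settings (number of clusters, or neighbourhood parameters) and the
# setting maximizing the F-measure is kept. `run_method` and the quality
# function below are hypothetical stand-ins for the real clustering call
# and the quality indexes of section 2.

def select_best(run_method, quality_f, settings):
    best_setting, best_score, best_result = None, -1.0, None
    for s in settings:
        result = run_method(s)
        score = quality_f(result)
        if score > best_score:
            best_setting, best_score, best_result = s, score, result
    return best_setting, best_score, best_result

# toy stand-in whose quality peaks at 12 clusters
scores = {k: 1.0 - abs(k - 12) / 20.0 for k in range(4, 25, 4)}
best_k, best_f, _ = select_best(lambda k: k, scores.get, range(4, 25, 4))
```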
Time varying analysis
The time-varying analysis consists in integrating the clustering method selected in the first step of our experiment, which is the GNG method, into the context of the MVDA paradigm, in order to perform Bayesian reasoning between the time periods. The two different approaches exploiting Bayesian reasoning that we described in section 2, namely the merged model approach and the label-based approach, are tested during this step.
We first make use of our quality measures to generate two different optimal GNG clustering models built from the datasets associated with the two time periods. Furthermore, a global optimal model built from the fusion of the datasets of the two periods is generated in order to be exploited in the context of the merged models approach.
To enhance the quality of the comparison, we applied two different kinds of post-processing steps to our resulting optimal clustering models. We first chose to apply a threshold of 3 on cluster size in order to discard non-significant clusters (chunk clusters). The optimal results of the GNG method for the two time periods are presented in Table 2.
To compute the probability of matching between clusters belonging to two time periods, we slightly modify the standard computation of the Bayesian communication of the MVDA model given by Eq. 1.
Let P(t|s) be defined as the probability of activity of a target period cluster t given the activity of a source period cluster s. It can be expressed as:

P(t|s) = |L_s ∩ L_t| / |L_s|
where L_x represents the set of labels associated with the cluster x, using the cluster label maximization approach defined by Eq. 12, and L_x ∩ L_y represents the common labels between the clusters x and y, which can be called the label matching kernel.
The average matching probability P_A(s) of a source period cluster s can be defined as the average probability of activity generated by its associated labels on the clusters of the target period:

P_A(s) = (1/|Env(s)|) Σ_{t ∈ Env(s)} P(t|s)
where Env(s) represents the set of target period clusters activated by the labels of the source period cluster s.
The global average activity A_S generated by a source period model S on a target period model T can be defined as:

A_S = (1/|S|) Σ_{s ∈ S} P_A(s)
Its standard deviation is denoted σ_S.
The similarity between a cluster s of the source period and a cluster t of the target period is established if the two following similarity rules are verified:
1) P(t|s) > P_A(s) and P(t|s) > A_S + σ_S;
2) P(s|t) > P_A(t) and P(s|t) > A_T + σ_T.
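A sketch of this matching test, with toy label sets: P(t|s) is taken here as the share of the source cluster's labels found in the target cluster. One assumption departs from the strict inequalities above: the comparisons are made non-strict, since a cluster activating a single target cluster would otherwise always fail the average-activity test.

```python
# Sketch of the two-way matching test on toy label sets. P(t|s) is taken
# as the share of the source cluster's labels found in the target cluster.
# Assumption: non-strict comparisons (>=) instead of the strict rules,
# so that a cluster activating a single target cluster can still match.
import statistics

def p_match(ls, lt):
    """P(t|s): share of the source cluster's labels found in the target."""
    return len(ls & lt) / len(ls) if ls else 0.0

def avg_activity(s, target_model):
    """P_A(s): mean activity over the target clusters actually activated."""
    active = [p_match(s, t) for t in target_model if p_match(s, t) > 0]
    return statistics.mean(active) if active else 0.0

def global_threshold(source_model, target_model):
    """A_S + sigma_S over all source/target cluster pairs."""
    probs = [p_match(s, t) for s in source_model for t in target_model]
    return statistics.mean(probs) + statistics.pstdev(probs)

def one_way(s, t, source_model, target_model):
    p = p_match(s, t)
    return (p > 0
            and p >= avg_activity(s, target_model)
            and p >= global_threshold(source_model, target_model))

def similar(s, t, source_model, target_model):
    """Rules 1 and 2: the match must hold in both directions."""
    return (one_way(s, t, source_model, target_model)
            and one_way(t, s, target_model, source_model))

s1, s2 = {"a", "b", "c"}, {"x", "y"}
t1, t2, t3 = {"a", "b", "c", "x"}, {"a", "q"}, {"x", "y", "w"}
S, T = [s1, s2], [t1, t2, t3]
```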
Cluster splitting is detected if more than one cluster of the target period verifies the similarity rules with a given cluster of the source period. Conversely, cluster merging is detected if more than one cluster of the source period verifies the similarity rules with a given cluster of the target period.
Clusters of the source period that have no similar cluster in the target period are considered vanishing clusters. Conversely, clusters of the target period that have no similar cluster in the source period are considered appearing clusters.
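Given the similarity relation, this change typology (splitting, merging, vanishing, appearing) reduces to counting matches per row and per column of the similarity matrix; the boolean matrix below is illustrative.

```python
# Sketch of the change typology applied to a precomputed boolean
# similarity matrix sim[i][j] between source cluster i and target
# cluster j; the matrix values are invented.

def classify_changes(sim):
    n_src, n_tgt = len(sim), len(sim[0])
    changes = {"splitting": [], "merging": [], "vanishing": [], "appearing": []}
    for i in range(n_src):
        matches = [j for j in range(n_tgt) if sim[i][j]]
        if not matches:
            changes["vanishing"].append(i)
        elif len(matches) > 1:
            changes["splitting"].append(i)
    for j in range(n_tgt):
        matches = [i for i in range(n_src) if sim[i][j]]
        if not matches:
            changes["appearing"].append(j)
        elif len(matches) > 1:
            changes["merging"].append(j)
    return changes

# source 0 splits into targets 0 and 1; source 2 vanishes; target 3 appears
sim = [[True, True, False, False],
       [False, False, True, False],
       [False, False, False, False]]
changes = classify_changes(sim)
```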
Table 3 summarizes the results of our time period comparison experiment. As shown in Fig. 3, similarities between clusters are identified through label matching kernels. These matching kernels can be considered significant groups of labels that are shared between clusters belonging to different time periods and that themselves represent global temporal topic matches. On the one hand, small temporal changes can be identified within these global temporal topic matches; on the other hand, as shown in Fig. 4, big temporal changes can be associated with isolated clusters that do not participate in any matching kernel (i.e. appearing or vanishing clusters).
The results on the two periods produced by our automated time period comparison approach were finally compared with the analysis of the results produced on separate time periods by a domain expert in the former experiment of Schiebel et al. [37]. This analysis highlighted that the general set of themes of this corpus corresponds to optoelectronic devices containing mineral or organic semiconductors, and that the applications of optoelectronics evolve in the following way:

In the first period, the applications of the "photodetector" type (imagery, sensors, measuring instruments, …) were judged dominant.

In the second period, the electroluminescent diodes were judged to have become a central area of research and application.
Similar results were found with our automated method. Furthermore, the latter was able to highlight more precisely in which sub-areas of research and application the changes occurred.
Conclusion
We have shown in this paper the feasibility of an incremental approach based on a time-step analysis of bibliographical data. This analysis was carried out thanks to the exploitation of a specific data analysis model managing multiple views of the data, namely the MVDA model. It also relied on the exploitation of original and stable measures for evaluating the quality and coherence of clustering results, and even for precisely synthesizing cluster contents. To our knowledge, our approach is the first that has been proposed for automatizing the process of analyzing time-evolving textual information. Our experimentation proved that this approach is reliable and that it can produce precise and significant results on a complex dataset constituted of bibliographic records related to the research domain of opto-electronic devices.
In a further step, we plan to make use of the MVDA generalization mechanism [3] in order to enhance the quality of the results by allowing the level of generality of the clusters used in the comparison between periods to be tuned.
We also plan to apply this approach to the challenging domain of the analysis of time-evolving genomics data, and to evaluate its application within the French INIST institute for tasks of scientific and technological survey based on larger available scientific databases. Within this framework, the automated detection of the evolutions, emergences and reorganizations of research themes and groups is essential, because it gives information analysts the possibility of carrying out exploratory studies at a large scale.
Bibliography
[1] Allan J., Carbonell J., Doddington G., Yamron J., Yang Y. (1998). Topic detection and tracking pilot study, final report. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia.
[2] Al Shehabi S., Lamirel J.C. (2004). Inference Bayesian Network for Multi-topographic neural network communication: a case study in documentary data. Proceedings of ICTTA, Damascus, Syria, April 2004.
[3] Al Shehabi S., Lamirel J.C. (2005). Multi-Topographic Neural Network Communication and Generalization for Multi-Viewpoint Analysis. International Joint Conference on Neural Networks - IJCNN'05, Montreal, Canada, August 2005.
[4] Al Shehabi S., Lamirel J.C. (2006). Evaluation of collaboration between European universities using dynamic interaction between multiple sources. Journal of Information Management and Scientometrics (JIMS), 1(3), 2006.
[5] Attik M., Lamirel J.C., Al Shehabi S. (2006). Clustering Analysis for Data with Multiple Labels. Proceedings of the IASTED International Conference on Databases and Applications (DBA), Innsbruck, Austria, February 2006.
[6] Batagelj V., Zaversnik M. (2002). An O(m) algorithm for cores decomposition of networks. University of Ljubljana, preprint series, Vol. 40, 799.
[7] Besagni D., François C., Polanco X., Roche I. Stanalyst®: une station pour l'analyse de l'information. Proceedings of VSST'2004, Toulouse, pp 319-320, 25-29 October 2004.
[8] Calinski T. and Harabasz J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1-27.
[9] Cuxac P., Lelu A., Cadot M. (2009). Suivi incrémental des évolutions dans une base d'information indexée : une boucle évaluation / correction pour le choix des algorithmes et des paramètres. Systèmes d'Information et Intelligence Economique (SIIE'09), 12-14 février 2009, Hammamet, Tunisie.
[10] Davies D. and Bouldin W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell., 1, 224-227.
[11] Dempster A.P., Laird N.M. and Rubin D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, vol. B39: 1-38.
[12] Ester M., Kriegel H.P., Sander J., and Xu X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, Menlo Park, CA, pp. 226-231.
[13] François C. , Hoffmann M., Lamirel J.C., Polanco X. (2003). Artificial Neural Network mapping experiments. EICSTES (IST199920350) Final Report (WP 9.4), 86 p., September 2003.
[14] Fritzke B. (1995). A growing neural gas network learns topologies. In Tesauro G., Touretzky D.S., Leen T.K., Eds., Advances in Neural Information Processing Systems 7, pp 625-632, MIT Press, Cambridge, MA.
[15] Gaber M., Zaslavsky A. and Krishnaswamy S. (2005). Mining Data Streams: A Review. SIGMOD Record, 34(2).
[16] Ghribi M., Cuxac P., Lamirel J.C., Lelu A. (2010). Mesures de qualité de clustering de documents : prise en compte de la distribution des mots-clés. Atelier EvalECD'2010 Workshop, Hammamet, Tunisia.
[17] Hamza H., Belaïd Y., Belaïd A., Chaudhuri B.B. (2008). Incremental classification of invoice documents. 19th International Conference on Pattern Recognition - ICPR 2008.
[18] Kassab R., Lamirel J.C. (2008). Feature Based Cluster Validation for High Dimensional Data. IASTED International Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria, February 2008.
[19] Kohonen T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, vol. 43, pp 59-69.
[20] Lamirel J.C., Créhange M. (1994). Application of a symbolic-connectionist approach for the design of a highly interactive documentary database interrogation system with on-line learning capabilities. Proceedings of ACM-CIKM 94, Gaithersburg, Maryland, USA, November 1994.
[21] Lamirel J.C., AlShehabi S., François C., Hoffmann M. (2004). New classification quality estimators for analysis of documentary information: application to patent analysis and web mapping. Scientometrics, 60(3).
[22] Lamirel J.C. and Al Shehabi S. (2004). Comparison of unsupervised neural clustering methods for mining Web and textual data, SCI 2004, Orlando, FL, USA, July 2004.
[23] Lamirel J.C., Al Shehabi S. (2006). MultiSOM: a multi-view neural model for accurately analyzing and mining complex data. Proceedings of the 4th International Conference on Coordinated & Multiple Views in Exploratory Visualization (CMV), London, UK, July 2006.
[24] Lamirel J.C., Ta A.P. and Attik M. (2008). Novel Labeling Strategies for Hierarchical Representation of Multidimensional Data Analysis Results. IASTED International Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria, February 2008.
[25] Lamirel J.C., Boulila Z., Ghribi M., Cuxac P. (2010). A new incremental growing neural gas algorithm based on clusters labeling maximization: application to clustering of heterogeneous textual data. 23rd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEAAIE 2010), Cordoba, Spain, June 2010.
[26] Lelu A. (1993). Modèles neuronaux pour l'analyse de données documentaires et textuelles. PhD Thesis. University of Paris 6.
[27] Martinetz T. and Schulten K. (1991). A "neural gas" network learns topologies. In Kohonen T., Makisara K., Simula O., and Kangas J., editors, Artificial Neural Networks, pp 397-402. Elsevier, Amsterdam.
[28] MacQueen J.B. (1967). Some methods of classification and analysis of multivariate observations. L. Le Cam and J. Neyman (Eds.), Proc. 5th Berkeley Symposium in Mathematics, Statistics and Probability, vol. 1, pp. 281-297, Univ. of California, Berkeley, USA, 1967.
[29] Merkl D., Shao Hui He, Dittenbach M., and Rauber A. (2003). Adaptive hierarchical incremental grid growing: an architecture for high-dimensional data visualization. In Proceedings of the 4th Workshop on Self-Organizing Maps, Advances in Self-Organizing Maps, pp 293-298, Kitakyushu, Japan, September 11-14, 2003.
[30] Mitton N., Busson A., Fleury E. (2004). Selforganization in large scale ad hoc networks. The Third Annual Mediterranean Ad Hoc Networking Workshop, (MEDHOCNET 04). Bodrum, Turkey.
[31] Polanco X., François C. (2000). Data Classification and Cluster Mapping or Visualization in Text Processing and Mining. Proceedings of the Sixth International ISKO Conference, Toronto, Canada. Edited by Clare Beghtol, Lynne C. Howarth, Nancy J. Williamson: Ergon Verlag, pp 359-365, 10-13 July 2000.
[32] Pons P., Latapy M. (2006). Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 2006.
[33] Prudent Y., Ennaji A. (2005). An Incremental Growing Neural Gas Learns Topologies. ESANN 2005, 13th European Symposium on Artificial Neural Networks, Bruges, Belgium, 27-29 April 2005; also in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), vol. 2, pp 1211-1216, July-August 2005.
[34] Robertson, S. E., & Sparck Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27:129–146.
[35] Roche I., François C., Besagni D. (2007). Les méthodes bibliométriques en soutien d'une approche expert dans la détection de technologies prometteuses. Proceedings of VSST'2007, Marrakech, 21-25 octobre 2007.
[36] Salton G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, New Jersey.
[37] Schiebel E., Hörlesberger M., Roche I., François C., Besagni D. (2010). An advanced diffusion model to identify emergent research issues: the case of optoelectronic devices. Scientometrics, Vol. 83, No. 3, pp 765-781, 2010.
[38] Trémolières R.C. (1979). The percolation method for an efficient grouping of data. Pattern Recognition, 11(4).
[39] Von Oertzen J. (2007). Results of evaluation and screening of 40 technologies. Deliverable 04 for Project PROMTECH (contract N° 15615), 32 pages + appendix.
[40] Voorhees E.M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, vol. 22, pp 465-476.
[41] Wayne C.L. (1998). Topic detection & tracking (TDT): Overview & perspective. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia.
Footnotes
1 A bias can occur when the intrinsic dimensions of the obtained clusters (the number of non-zero components in the reference vectors describing the clusters) are not of the same order of magnitude as the intrinsic dimensions of the data profiles (see [18] for more details).
2 Unlike the classical Recall and Precision indexes, which are supervised.