Data Analysis and Data Mining: Current Issues in Biomedical Informatics

Authors' contacts: Riccardo Bellazzi; Telephone: +39-0382-985720; Fax: +39-0382-985373; riccardo.bellazzi@unipv.it; Address: Dipartimento di Informatica e Sistemistica, Via Ferrata 1, 27100 Pavia (PV), Italy. Marianna Diomidous; mdiomidi@nurs.uoa.gr; Address: University of Athens, Department of Nursing, Athens, Greece. Indra Neil Sarkar; Telephone: +1-802-656-8283; Fax: +1-802-656-4589; Neil.Sarkar@uvm.edu; Address: University of Vermont, Center for Clinical and Translational Science, 89 Beaumont Avenue, Given Courtyard N309, Burlington, VT 05405, USA. Katsuhiko Takabayashi; takaba@ho.chiba-u.ac.jp; Address: Chiba University Hospital, Japan. Andreas Ziegler; Telephone: +49 451 500 2780; Fax: +49 451 500 2999; ziegler@imbs.uni-luebeck.de; Address: Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Str. 1, 23562 Lübeck, Germany. Alexa McCray; Telephone: +1 617 432-2144; alexa_mccray@hms.harvard.edu; Address: Harvard Medical School, 10 Shattuck Street, Boston, MA 02115, USA.

Summary

Background

Medicine and the biomedical sciences have become data-intensive fields, which at the same time enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context for integrating data and knowledge when processing available information, with the aim of providing effective decision-making support in clinical practice and translational research.

Objectives

To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics.

Methods

On the occasion of the 50th year of Methods of Information in Medicine, a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field.

Results

The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology.

Conclusions

Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers.

Keywords: Biomedical informatics, data mining, data analysis, data-driven methods, translational bioinformatics

Introduction

The current era can be considered a golden age for Biomedical Informatics (BI) [1,2,3]. After the early days of enthusiasm, followed by a period of disillusion [4,5], BI has now matured to the level of being an essential component of health care activities and biomedical research [6,7]. On the one hand, health care institutions are leveraging hospital information systems, which are a crucial asset of these complex organizations, and increasing investments of the public and private sectors show that medical and health informatics have become a solid field [7]. On the other hand, the ‘–omics’ data explosion and the need to translate research results into clinical practice have boosted the activities of BI, which is now an irreplaceable component of molecular medicine [8].

On the occasion of the 50th year of Methods of Information in Medicine a scientific symposium was organized, which took place in Heidelberg, Germany from June 9 to 11, 2011. A select number of distinguished colleagues from around the world gathered in Heidelberg to participate in the symposium which had as its theme: ‘Biomedical Informatics: Confluence of Multiple Disciplines’, reflecting on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care.

As one outcome of this symposium, the contributions of experts with different backgrounds in the area of biomedical data analysis have been collected in this paper, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field.

Looking at the current scenario of the biomedical sciences, it is easy to notice that, now more than ever, data analysis and information processing have become basic and crucial components of the day-to-day activities of researchers, scientists, clinicians, nurses and decision-makers. It is not surprising, therefore, that on the occasion of its 50th anniversary, Methods of Information in Medicine proposes a careful reflection on current perspectives on the role of data analysis and data mining in BI.

Rather interestingly, since their beginnings, data-driven approaches and data mining methods have been a source of controversy. First of all, the transformation of biology and medicine into “data-intensive” fields has conferred validity on experiments whose goal is to gather data in order to generate new, unbiased knowledge [9]. The risk of false discoveries, however, has raised skepticism, and several studies have partially lowered the initial enthusiasm [10]. Second, there still exists an unresolved tension between data miners, who agnostically use methods from computer science, signal processing, optimization and statistics, and data analysts, who mainly ground their approaches in well-established statistical theory and tools.

Being at the intersection of many disciplines, BI represents the natural space for reconciling different paradigms and perspectives in a coherent scenario for the benefit of scientific progress (see Figure 1). The basic dilemma between empiricism and rationalism, underlying many of the above-mentioned disputes, is resolved in BI by following a pragmatic strategy, aimed at solving problems in the best possible way given the current status of knowledge and taking into account technological constraints and limitations [11,12]. Moreover, the availability of knowledge repositories in electronic format so strongly empowers biomedical research that data analysis and knowledge generation steps are now part of a unique, continuous cycle [13].

Figure 1. Data-driven and knowledge-driven approaches will co-operate and make rapid progress in biomedical science [8].

The first two sections of the paper deepen the insight into this crucial theme. The role of data-driven approaches and the integration of data and knowledge in BI are analyzed, and future challenges are outlined.

As science progresses, data mining and statistical approaches are no longer seen as alternative ways of dealing with data analysis problems. On the contrary, they are beginning to be seen as fully complementary. One aspect of this relationship is the ability to evaluate predictive models, such as classification or regression models, on the basis of sound strategies. The third section of the paper describes a number of suitable methods for assessing the performance of learning “machines”, grounded in confidence interval theory.

Certainly, one of the main engines of the data-driven revolution in biomedicine has been high-throughput –omics biotechnology and the related bioinformatics needs. The natural conjunction of genomic medicine, bioinformatics and medical informatics is represented by translational bioinformatics, which bridges the different fields into a single, purposive discipline aimed at exploiting research results in clinical practice. The fourth and fifth sections of the paper describe the data analysis aspects of this field, with a particular focus on information integration and genetic epidemiology.

The paper ends by stressing two main issues: i) the potential enabling role that BI may play in providing open access to clinical data; ii) the need to keep the BI field open to diverse methodological contributions that will strengthen its innovation capabilities.

Data Accumulation and Data-Driven Approaches across Biomedical Informatics

One of the most outstanding features of electronic information, especially in biomedicine, is its integrity and expandability. In the last two decades, data and knowledge have rapidly accumulated in each subdiscipline of biomedical informatics, and this wealth of information is now about to change the circumstances and methods of investigation in every field. In clinical medicine, not only the rapid progress of modern medical technology but also the progress of computational science has contributed to dramatic medical advances. For example, the electronic medical literature can easily be searched through PubMed on the Internet, which enables scientists and medical doctors to obtain the most up-to-date knowledge relevant to their work quickly and easily, thereby expediting the progress of medicine. Likewise, electronic documentation systems have made it possible to collect a very high volume of active patient data, on a scale that was impossible in the paper-based era.

Once these data have been collected, we can create a data warehouse and retrieve special case data or apply data mining to find hidden facts and rules [14]. Electronic discharge summaries are now being collected and can be reused to retrieve similar cases by means of text mining [15]. Laboratory data can be recorded over a person’s entire life, integrated from several facilities, and made ready to be analyzed for disease management. In addition, as the volume of data grows beyond single medical facilities, the re-utilization of databases becomes more important. Regional healthcare information systems can provide more data than one medical facility, and electronic health records (EHRs) can be further expanded to a national or global scale [16]. For example, monthly billing data, which include main disease names, types and times of laboratory tests, and names and doses of drugs administered or injected in hospitals, can be electronically collected from all medical facilities in Korea and Japan. Even with this information alone, nationwide trends in clinical practice for a disease can be analyzed [17]. Now that EHRs have evolved, all the events affecting a person’s health can be collected electronically throughout that person’s lifetime, which can be considered a personal health record (PHR), or a personal life record (PLR). A PHR includes not only medical data but also health data, such as blood pressure and body weight measurements taken at a health club, or the list of lunch items consumed in a company cafeteria; it can thus contain a complete personal history of health.

The same, or an even more extreme, phenomenon of huge data accumulation is occurring in genetic research. This research includes genome and sequence analysis, microarray and gene expression data analysis, high-throughput genome mapping, gene regulation, protein structure prediction and classification, and disease classification, for all of which informatics plays a very important role [18]. In this rapidly emerging field, an extensive amount of knowledge has also been accumulated in many genomic databases for reuse by researchers seeking to discover new relations with the techniques of informatics.

Going forward, research activity will increasingly span the individual disciplines in pursuit of the final goal of biomedical informatics; in other words, the confluence of disciplines will lead to the discovery of new relations beyond the limits of individual disciplines. The final destination is to connect a person’s PHR, which records the extrinsic and environmental factors affecting that person’s health together with outcomes, to the complete DNA sequence that identifies the intrinsic factors. To accomplish this, many steps and phases must be carried out, because genomic and clinical information cannot be connected so easily; specific tools must be developed to complete these steps and phases one by one. Translational research is a field whose goal is to integrate biology and clinical medicine in order to bridge the gap between basic medical research and clinical care [8]. Consequently, translational research provides a vast and challenging field for biomedical informatics researchers [19].

The traditional approach in biomedical science has been knowledge-driven, aimed at generating hypotheses from domain knowledge in a top-down fashion. Instead, we are about to enter the era of data-intensive science, in which hypotheses are generated automatically from the enormous amounts of available data by means of computational science and inductive reasoning [20]. These two approaches are not conflicting; they can be combined or integrated to discover new knowledge [21]. Thus, biomedical informaticians will play a significant role in developing new methods in the fields of data mining and machine learning that will then be available to domain experts.

Another role of biomedical informaticians will be as supervisors and administrators of biomedical data management. In this broad map covering the entire biomedical field, a new discipline is needed to comprehensively oversee all steps of biomedical informatics, from the micro- to the macro-level of information, and to identify which parts are unknown, which limiting factors remain to be resolved, and which areas need to be linked to other areas. These administrators are distinct from domain experts and must fulfill their tasks so as to accelerate the progress of all biomedical science. As the current disciplines of biomedical informatics interconnect, new roles for biomedical and computer scientists will emerge.

Knowledge and Data Integration

As mentioned in the previous section, an increasing flood of data in electronic format nowadays characterizes a variety of human activities, including health care and biomedical research. For this reason, fifteen years after their rise, the fields of data mining and knowledge discovery in databases are still topical and represent a crucial sector of biomedical informatics [22,23]. Certainly, over the last few years, the very nature of the collected data has changed, as have data mining methods and tools. Data are available in a variety of formats, including not only numeric or codified values, signals and images, but also textual reports and summaries, multivariate time series and data streams, event logs, mobility information, social networks and interaction databases [24,25]. As a consequence, a noteworthy effort has been devoted to designing and applying a number of recent technologies, such as text mining [26,27], temporal data mining [28], workflow mining [29], and network analysis [30]. Within biomedical data mining, one of the most interesting aspects is the exploitation of domain knowledge and the integration of different data sources in the data analysis process. As a matter of fact, data analysis is strongly empowered by the knowledge available in electronic format, which can either be already formalized, say through ontology and annotation repositories, or still informal but novel, as, for example, reported in PubMed abstracts and papers [31].

The integration of data and knowledge is being crucially stimulated by bioinformatics applications, where the joint availability of publicly available databases, annotation systems and biomedical ontologies, has given rise to the field called “integrative bioinformatics” [32].

Rather interestingly, in medical informatics and computer science, too, attention has been devoted to this problem since the late nineties. Intelligent data analysis (IDA) is a research field that encompasses all methods devoted to (automatically) transforming data into information by exploiting the available domain knowledge. IDA and data mining have been the focus of one of the working groups of the International Medical Informatics Association (IMIA) since 2000 [33]. The IMIA working group on IDA and data mining has produced a variety of interesting results, papers and research projects [31, 34-36].

The “natural” step forward is to build on the results obtained so far to define new methodologies able to merge data exploration, visual analytics and data mining with inductive reasoning, as also underlined in the previous section. Efforts towards the combination of reasoning approaches with data analysis have recently been published [37, 38], as have very interesting software products, including open source frameworks [39]. IDA and reasoning require different disciplines to converge: knowledge representation, automated reasoning, statistical and mathematical methods, new algorithms, efficient and modern IT technologies, and advanced interfaces based on cognitive science need to be properly integrated in this context [38]. Novel IT systems empowered by IDA tools hold the promise of advancing biomedical research and clinical decision making.

While some IDA and data mining methods, such as classification, regression and clustering models, are now mature and ready to be used in clinical practice, other instruments still need to be more widely studied and applied [35]. Temporal reasoning and data mining represent one such interesting area, which deserves an increasing level of attention and further research. Dealing with time is a crucial and challenging problem that is widespread in biomedical applications [40,41]. Even though a variety of methods is available to deal with biomedical signals and time-varying data, none of these tools alone can cope with the inherent complexity of temporal information and temporal reasoning. For example, data are very often irregularly collected due to an uneven schedule of measurements and visits, which may depend on the organizational setting or the severity of the disease. Moreover, the interpretation of temporal data is highly context-sensitive, so that the same pattern of the same variable may assume a different meaning in different clinical settings, say in the ICU or during home monitoring. Temporal reasoning and data mining are being combined to tackle this difficult task in the field of so-called Temporal Data Mining (TDM) [42-44]. The main goal of TDM is to extract relevant patterns from data: a temporal pattern is thus a sequence of events that is (clinically) important in a particular problem.

Rather interestingly, TDM methods have been designed to deal with different temporal data types, including time series of physiological variables, such as arterial pressure or blood glucose levels, and sequences of clinical events, such as hospital admissions and discharges, or drug prescriptions. These methods are therefore well suited to integrating information from a variety of data sets, including clinical records, monitoring devices, and large warehouses of administrative records. The IDA IMIA working group has worked extensively in this context, proposing methods to extract temporal patterns from time series data and to synthesize temporal information into temporal features [45]. Such methods strongly depend on the knowledge available about the domain; their application therefore requires the integration of signal processing, algorithm design, and knowledge representation and formalization.
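
To make the idea of temporal feature extraction concrete, the following minimal Python sketch converts an irregularly sampled clinical time series into qualitative trend episodes (“increasing”, “steady”, “decreasing”); the function name, the labels and the slope threshold are illustrative assumptions for this paper, not the working group’s actual method [45].

```python
from dataclasses import dataclass

@dataclass
class Episode:
    label: str    # qualitative trend: "increasing", "steady" or "decreasing"
    start: float  # start time of the episode
    end: float    # end time of the episode

def trend_abstraction(times, values, slope_threshold):
    """Turn an irregularly sampled series into qualitative trend episodes.

    Each consecutive pair of samples is labelled by its slope per unit
    time (so uneven sampling intervals are handled), and adjacent
    intervals with the same label are merged into a single episode.
    """
    episodes = []
    for (t0, v0), (t1, v1) in zip(zip(times, values),
                                  zip(times[1:], values[1:])):
        slope = (v1 - v0) / (t1 - t0)
        if slope > slope_threshold:
            label = "increasing"
        elif slope < -slope_threshold:
            label = "decreasing"
        else:
            label = "steady"
        if episodes and episodes[-1].label == label:
            episodes[-1].end = t1  # extend the current episode
        else:
            episodes.append(Episode(label, t0, t1))
    return episodes

# Example: blood glucose (mg/dL) measured at uneven intervals (hours)
times = [0.0, 1.0, 3.0, 4.0, 8.0]
values = [90.0, 95.0, 140.0, 142.0, 100.0]
for e in trend_abstraction(times, values, slope_threshold=5.0):
    print(f"{e.label:10s} from t={e.start} to t={e.end}")
```

Because slopes are computed per unit time, the uneven measurement schedules mentioned above are handled naturally; a real TDM system would add context-dependent thresholds and a richer pattern language on top of such episodes.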

The broad coverage of biomedical informatics, which spans from the molecular to the population level, is a tremendous enabler for the cross-fertilization of its different converging disciplines. Looking again at the temporal processes domain, we can easily note that different problems can be studied with similar approaches. For example, the so-called “workflow” modeling approach can be used to model care-flow processes [46], but it can be conveniently applied also to describe and analyze the complex intertwined processes underlying molecular studies [47]. For this reason, the algorithms able to automatically analyze process data (event logs) seem to have a wide potential application in all areas covered by Biomedical Informatics [48].

Together with advances in data mining algorithms, over the last few years there has been great growth in the number and sophistication of data warehouses and integrated data repositories. Looking at recent developments, one of the most exciting advances is represented by the implementation of complex IT infrastructures designed to support clinical and biological research. As a matter of fact, the NIH-funded i2b2 research center [49], as well as the EHR4CR project [50,51], funded by the EU-IMI initiative, have shown that it is now possible to profoundly innovate biomedical research by relying on newly designed IT systems. In a nutshell, the challenge is to create IT infrastructures able to support research by providing access to data collected in a data warehouse, which is populated from different data sources, including hospital and laboratory information systems, biobanks and the variety of small databases collected for single research studies. If properly implemented, this type of infrastructure can be a great accelerator for the entire research process [52]. However, such integration poses several challenges in terms of data and knowledge representation, standard interfaces between software systems, data access, security and privacy policies, user interfaces and data querying functionality [8]. Moreover, in order to make such systems really effective in day-to-day activities, it is necessary to implement data analysis methods that help researchers in scientific discovery and health care providers in clinical decision-making [37]. Finally, natural language processing tools should be included in order to improve data gathering from textual documents and to summarize the knowledge available in the bibliome [27]. Such an ambitious project needs the confluence of many of the disciplines underlying biomedical informatics, ranging from IT systems design, database management and software interoperability, to ontology and terminology handling, data and text mining, human-computer interfaces and, finally, expertise in research and clinical processes [23]. Once available, these infrastructures will clearly show that biomedical informatics may be the ultimate enabler for the application of bioinformatics methods and algorithms within a clinical context [52, 53].

Evaluating the Performance of Biomedical Data Mining Algorithms with Statistical Tools

Novel regression and classification methods are developed in various areas of research, such as medical informatics, bioinformatics, data mining and biostatistics. The performance of several competing approaches is usually evaluated in benchmark experiments [54]. The most important question to be answered in such a classification experiment is whether two automated learning “machines” differ from one another by a relevant magnitude and/or statistically significantly. Here, it is important to note that simple algorithms are often quite good, and they may even be superior to complicated machines [55]; generally, one cannot necessarily expect a pronounced superiority of a highly sophisticated approach [56].

One aspect of comparing learning machines with each other deserves specific attention. If various machines are trained on training data, their performance can only be compared fairly by applying all machines to the same test data. In this case, the procedures described above lead to valid estimates.

If only a training data set is available, so that machines are both trained and compared on the same data, prediction accuracy varies systematically with the way the machines are trained [55]. For example, logistic regression utilizes all available data in the model-building step and is therefore more prone to overfitting than ensemble methods, which use only a portion of the available data and rarely overfit. As a consequence, error fractions and performance estimates are more reliable for ensemble methods but may also be substantially higher. As an alternative, 5-fold or 10-fold cross validation, which has been shown to yield satisfactory results, may be used to validate the models internally. Here, it is important to note that all steps of model building, model-dependent data transformations, and variable selection need to be repeated within each loop of the cross validation; otherwise, overfitting may result [57]. However, even 10-fold cross validation does not use the same amount of data, i.e., the same number of degrees of freedom, as bootstrapping. Specifically, bootstrapping, which is often used in ensemble methods, tends to draw approximately 2/3 of the data set, whereas 5-fold or 10-fold cross validation uses 80% or 90% of the training data, respectively, to train the machine.
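
As a concrete sketch of this requirement, the following Python example (assuming scikit-learn and synthetic data; the specific preprocessing steps and classifier are arbitrary illustrative choices) wraps standardization, variable selection and classification into a single pipeline, so that every model-building step is re-fitted inside each fold of a 10-fold cross validation rather than once on the full data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 patients, 50 variables, 5 of them informative
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# Every model-building step lives inside the pipeline, so scaling and
# variable selection are re-estimated on the training part of each fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("classify", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Selecting variables on all of the data first and then cross-validating only the classifier would leak information into the held-out folds and yield optimistically biased estimates, which is exactly the overfitting pitfall described above [57].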

If the same set of training data is generated in the first step and all of these data are used to train the machines, this source of bias in comparing prediction accuracies can be overcome. Specifically, all machines can be tested on the same out-of-bag samples, i.e., the samples not drawn in the bootstrap, which yields paired results for the machines. These can then be compared by appropriate averaging across all bootstrap samples [55], and standard statistics for comparing machines can easily be calculated.
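
The following sketch illustrates this scheme with two arbitrary example machines (scikit-learn is again assumed): both are trained on each bootstrap sample, tested on the same out-of-bag observations, and the paired differences in error fractions are then averaged across the bootstrap replicates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
n = len(y)

diffs = []  # paired differences in out-of-bag error fractions
for _ in range(100):
    boot = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)  # out-of-bag indices
    if oob.size == 0:
        continue
    m1 = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    m2 = RandomForestClassifier(random_state=0).fit(X[boot], y[boot])
    err1 = np.mean(m1.predict(X[oob]) != y[oob])  # same OOB samples
    err2 = np.mean(m2.predict(X[oob]) != y[oob])  # for both machines
    diffs.append(err1 - err2)

diffs = np.asarray(diffs)
print(f"mean paired OOB error difference: {diffs.mean():+.4f} "
      f"(bootstrap SE {diffs.std(ddof=1) / np.sqrt(len(diffs)):.4f})")
```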

For classification methods, these include the Brier score [58,59], sensitivity, specificity, and the error fraction [55]. More specifically, the predictions from two machines for the same patient are expected to be correlated; such paired predictions can be formally compared with McNemar’s test, and corresponding confidence intervals for the differences in error fractions, sensitivity, or specificity can easily be calculated [60-62].
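
For illustration, the sketch below builds the paired 2×2 table of correct and incorrect predictions for two machines evaluated on the same patients and applies McNemar’s exact test via statsmodels (an assumed dependency; the indicator vectors are made-up data):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired correctness indicators for two machines on the same test patients
correct1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1], dtype=bool)
correct2 = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1], dtype=bool)

# Paired 2x2 table: rows = machine 1 correct/incorrect, columns = machine 2
table = np.array([
    [np.sum(correct1 & correct2),  np.sum(correct1 & ~correct2)],   # a, b
    [np.sum(~correct1 & correct2), np.sum(~correct1 & ~correct2)],  # c, d
])

# McNemar's exact test uses only the discordant cells b and c
result = mcnemar(table, exact=True)
print(f"b={table[0, 1]}, c={table[1, 0]}, p-value={result.pvalue:.3f}")
```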

To give an example, using the notation of Table 1 (a paired $2 \times 2$ table with cell counts $a$, $b$, $c$, $d$ and total $n = a + b + c + d$), Wilson's score method (method [63] in the review of Newcombe [61]) yields the following confidence interval at level $1 - \alpha$ for the difference of the two proportions $\theta = (\pi_1 + \pi_2) - (\pi_1 + \pi_3) = \pi_2 - \pi_3$. The interval is $[\hat{\theta} - \delta;\ \hat{\theta} + \varepsilon]$, where $\delta$ and $\varepsilon$ are the positive values

$$\delta = \sqrt{d_{l_2}^2 - 2\hat{\varphi}\, d_{l_2} d_{u_3} + d_{u_3}^2} \qquad \text{and} \qquad \varepsilon = \sqrt{d_{u_2}^2 - 2\hat{\varphi}\, d_{u_2} d_{l_3} + d_{l_3}^2},$$

with $d_{l_2} = (a + b)/n - l_2$ and $d_{u_2} = u_2 - (a + b)/n$. Here, $l_2$ and $u_2$ are the roots of

$$\left| \xi - \frac{a + b}{n} \right| = z_{1 - \alpha/2} \sqrt{\frac{\xi(1 - \xi)}{n}}.$$

Similarly, $d_{l_3} = (a + c)/n - l_3$ and $d_{u_3} = u_3 - (a + c)/n$, where $l_3$ and $u_3$ are the roots of

$$\left| \xi - \frac{a + c}{n} \right| = z_{1 - \alpha/2} \sqrt{\frac{\xi(1 - \xi)}{n}}.$$

Finally,

$$\hat{\varphi} = \begin{cases} \dfrac{\max(ad - bc - n/2,\ 0)}{\sqrt{(a + b)(c + d)(a + c)(b + d)}} & \text{if } ad - bc > 0, \\[1ex] 0 & \text{if } ad - bc = 0, \\[1ex] \dfrac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}} & \text{if } ad - bc < 0. \end{cases}$$
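
A direct transcription of these formulas into Python may help clarify the computation (a sketch for illustration, assuming scipy for the normal quantile; the closed-form Wilson roots follow from squaring the defining equation and solving the resulting quadratic in $\xi$):

```python
from math import sqrt
from scipy.stats import norm

def wilson_roots(p, n, z):
    """Roots l, u of |xi - p| = z * sqrt(xi * (1 - xi) / n)."""
    center = p + z**2 / (2 * n)
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half) / (1 + z**2 / n), (center + half) / (1 + z**2 / n)

def paired_diff_ci(a, b, c, d, alpha=0.05):
    """Score-based CI for theta = pi2 - pi3 from a paired 2x2 table."""
    n = a + b + c + d
    z = norm.ppf(1 - alpha / 2)
    p2, p3 = (a + b) / n, (a + c) / n    # marginal proportions
    theta_hat = p2 - p3

    l2, u2 = wilson_roots(p2, n, z)
    l3, u3 = wilson_roots(p3, n, z)
    dl2, du2 = p2 - l2, u2 - p2
    dl3, du3 = p3 - l3, u3 - p3

    det = a * d - b * c                  # drives the phi-hat correction
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if det > 0:
        phi = max(det - n / 2, 0) / denom  # with continuity correction
    elif det < 0:
        phi = det / denom
    else:
        phi = 0.0

    delta = sqrt(dl2**2 - 2 * phi * dl2 * du3 + du3**2)
    eps = sqrt(du2**2 - 2 * phi * du2 * dl3 + dl3**2)
    return theta_hat - delta, theta_hat + eps

# Example with made-up counts a=30, b=10, c=5, d=15
print(paired_diff_ci(30, 10, 5, 15))
```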

For a comparison of error fractions between different independent data sets, for example, to compare differences in the performance of temporal and external validation [55], appropriate tests and confidence intervals have also been developed [63].