Data Mining Techniques in Grid Computing Environments (eBook)

Werner Dubitzky (Editor)

eBook Download: PDF
2008 | 1st edition
288 pages
Wiley (Publisher)
978-0-470-69989-8 (ISBN)

110.99 incl. VAT
  • Download available immediately
Based around eleven international real-life case studies and including contributions from leading experts in the field, this groundbreaking book explores the need for grid-enabling data mining applications and provides a comprehensive study of the technology, techniques and management skills necessary to create them. It serves simultaneously as a design blueprint, user guide, and research agenda for current and future developments, and will appeal to a broad audience, from developers and users of data mining and grid technology to advanced undergraduate and postgraduate students interested in this field.

Werner Dubitzky, PhD, is Chair of Bioinformatics at the Biomedical Sciences Research Institute in the Faculty of Life and Health Sciences at the University of Ulster. His research investigates systems biology, knowledge management in biology, grid computing, and data mining.

Krzysztof Kurowski, PhD, leads the Applications Department at Poznan Supercomputing and Networking Center in Poland. His research is focused on the modeling of advanced applications, scheduling, and resource management in networked environments.


Data Mining Techniques in Grid Computing Environments
Contents
Preface
List of Contributors
1 Data mining meets grid computing: Time to dance?
1.1 Introduction
1.2 Data mining
1.2.1 Complex data mining problems
1.2.2 Data mining challenges
1.3 Grid computing
1.3.1 Grid computing challenges
1.4 Data mining grid – mining grid data
1.4.1 Data mining grid: a grid facilitating large-scale data mining
1.4.2 Mining grid data: analyzing grid systems with data mining techniques
1.5 Conclusions
1.6 Summary of Chapters in this Volume
2 Data analysis services in the knowledge grid
2.1 Introduction
2.2 Approach
2.3 Knowledge Grid services
2.3.1 The Knowledge Grid architecture
2.3.2 Implementation
2.4 Data analysis services
2.5 Design of Knowledge Grid applications
2.5.1 The VEGA visual language
2.5.2 UML application modelling
2.5.3 Applications and experiments
2.6 Conclusions
3 GridMiner: An advanced support for e-science analytics
3.1 Introduction
3.2 Rationale behind the design and development of GridMiner
3.3 Use Case
3.4 Knowledge discovery process and its support by the GridMiner
3.4.1 Phases of knowledge discovery
3.4.2 Workflow management
3.4.3 Data management
3.4.4 Data mining services and OLAP
3.4.5 Security
3.5 Graphical user interface
3.6 Future developments
3.6.1 High-level data mining model
3.6.2 Data mining query language
3.6.3 Distributed mining of data streams
3.7 Conclusions
4 ADaM services: Scientific data mining in the service-oriented architecture paradigm
4.1 Introduction
4.2 ADaM system overview
4.3 ADaM toolkit overview
4.4 Mining in a service-oriented architecture
4.5 Mining web services
4.5.1 Implementation architecture
4.5.2 Workflow example
4.5.3 Implementation issues
4.6 Mining grid services
4.6.1 Architecture components
4.6.2 Workflow example
4.7 Summary
5 Mining for misconfigured machines in grid systems
5.1 Introduction
5.2 Preliminaries and related work
5.2.1 System misconfiguration detection
5.2.2 Outlier detection
5.3 Acquiring, pre-processing and storing data
5.3.1 Data sources and acquisition
5.3.2 Pre-processing
5.3.3 Data organization
5.4 Data analysis
5.4.1 General approach
5.4.2 Notation
5.4.3 Algorithm
5.4.4 Correctness and termination
5.5 The GMS
5.6 Evaluation
5.6.1 Qualitative results
5.6.2 Quantitative results
5.6.3 Interoperability
5.7 Conclusions and future work
6 FAEHIM: Federated Analysis Environment for Heterogeneous Intelligent Mining
6.1 Introduction
6.2 Requirements of a distributed knowledge discovery framework
6.2.1 Category 1: knowledge discovery specific requirements
6.2.2 Category 2: distributed framework specific requirements
6.3 Workflow-based knowledge discovery
6.4 Data mining toolkit
6.5 Data mining service framework
6.6 Distributed data mining services
6.7 Data manipulation tools
6.8 Availability
6.9 Empirical experiments
6.9.1 Evaluating the framework accuracy
6.9.2 Evaluating the running time of the framework
6.10 Conclusions
7 Scalable and privacy preserving distributed data analysis over a service-oriented platform
7.1 Introduction
7.2 A service-oriented solution
7.3 Background
7.3.1 Types of distributed data analysis
7.3.2 A brief review of distributed data analysis
7.3.3 Data mining services and data analysis management systems
7.4 Model-based scalable, privacy preserving, distributed data analysis
7.4.1 Hierarchical local data abstractions
7.4.2 Learning global models from local abstractions
7.5 Modelling distributed data mining and workflow processes
7.5.1 DDM processes in BPEL4WS
7.5.2 Implementation details
7.6 Lessons learned
7.6.1 Performance of running distributed data analysis on BPEL
7.6.2 Issues specific to service-oriented distributed data analysis
7.6.3 Compatibility of Web services development tools
7.7 Further research directions
7.7.1 Optimizing BPEL4WS process execution
7.7.2 Improved support of data analysis process management
7.7.3 Improved support of data privacy preservation
7.8 Conclusions
8 Building and using analytical workflows in Discovery Net
8.1 Introduction
8.1.1 Workflows on the grid
8.2 Discovery Net system
8.2.1 System overview
8.2.2 Workflow representation in DPML
8.2.3 Multiple data models
8.2.4 Workflow-based services
8.2.5 Multiple execution models
8.2.6 Data flow pull model
8.2.7 Streaming and batch transfer of data elements
8.2.8 Control flow push model
8.2.9 Embedding
8.3 Architecture for Discovery Net
8.3.1 Motivation for a new server architecture
8.3.2 Management of hosting environments
8.3.3 Activity management
8.3.4 Collaborative workflow platform
8.3.5 Architecture overview
8.3.6 Activity service definition layer
8.3.7 Activity services bus
8.3.8 Collaboration and execution services
8.3.9 Workflow Services Bus
8.3.10 Prototyping and production clients
8.4 Data management
8.5 Example of a workflow study
8.5.1 ADR studies
8.5.2 Analysis overview
8.5.3 Service for transforming event data into patient annotations
8.5.4 Service for defining exclusions
8.5.5 Service for defining exposures
8.5.6 Service for building the classification model
8.5.7 Validation service
8.5.8 Summary
8.6 Future directions
9 Building workflows that traverse the bioinformatics data landscape
9.1 Introduction
9.2 The bioinformatics data landscape
9.3 The bioinformatics experiment landscape
9.4 Taverna for bioinformatics experiments
9.4.1 Three-tiered enactment in Taverna
9.4.2 The open-typing data models
9.5 Building workflows in Taverna
9.5.1 Designing a SCUFL workflow
9.6 Workflow case study
9.6.1 The bioinformatics task
9.6.2 Current approaches and issues
9.6.3 Constructing workflows
9.6.4 Candidate genes involved in trypanosomiasis resistance
9.6.5 Workflows and the systematic approach
9.7 Discussion
10 Specification of distributed data mining workflows with DataMiningGrid
10.1 Introduction
10.2 DataMiningGrid environment
10.2.1 General architecture
10.2.2 Grid environment
10.2.3 Scalability
10.2.4 Workflow environment
10.3 Operations for workflow construction
10.3.1 Chaining
10.3.2 Looping
10.3.3 Branching
10.3.4 Shipping algorithms
10.3.5 Shipping data
10.3.6 Parameter variation
10.3.7 Parallelization
10.4 Extensibility
10.5 Case studies
10.5.1 Evaluation criteria and experimental methodology
10.5.2 Partitioning data
10.5.3 Classifier comparison scenario
10.5.4 Parameter optimization
10.6 Discussion and related work
10.7 Open issues
10.8 Conclusions
11 Anteater: Service-oriented data mining
11.1 Introduction
11.2 The architecture
11.3 Runtime framework
11.3.1 Labelled stream
11.3.2 Global persistent storage
11.3.3 Termination detection
11.3.4 Application of the model
11.4 Parallel algorithms for data mining
11.4.1 Decision trees
11.4.2 Clustering
11.5 Visual metaphors
11.6 Case studies
11.7 Future developments
11.8 Conclusions and future work
12 DMGA: A generic brokering-based Data Mining Grid Architecture
12.1 Introduction
12.2 DMGA overview
12.3 Horizontal composition
12.4 Vertical composition
12.5 The need for brokering
12.6 Brokering-based data mining grid architecture
12.7 Use cases: Apriori, ID3 and J4.8 algorithms
12.7.1 Horizontal composition use case: Apriori
12.7.2 Vertical composition use cases: ID3 and J4.8
12.8 Related work
12.9 Conclusions
13 Grid-based data mining with the Environmental Scenario Search Engine (ESSE)
13.1 Environmental data source: NCEP/NCAR reanalysis data set
13.2 Fuzzy search engine
13.2.1 Operators of fuzzy logic
13.2.2 Fuzzy logic predicates
13.2.3 Fuzzy states in time
13.2.4 Relative importance of parameters
13.2.5 Fuzzy search optimization
13.3 Software architecture
13.3.1 Database schema optimization
13.3.2 Data grid layer
13.3.3 ESSE data resource
13.3.4 ESSE data processor
13.4 Applications
13.4.1 Global air temperature trends
13.4.2 Statistics of extreme weather events
13.4.3 Atmospheric fronts
13.5 Conclusions
14 Data pre-processing using OGSA-DAI
14.1 Introduction
14.2 Data pre-processing for grid-enabled data mining
14.3 Using OGSA-DAI to support data mining applications
14.3.1 OGSA-DAI’s activity framework
14.3.2 OGSA-DAI workflows for data management and pre-processing
14.4 Data pre-processing scenarios in data mining applications
14.4.1 Calculating a data summary
14.4.2 Discovering association rules in protein unfolding simulations
14.4.3 Mining distributed medical databases
14.5 State-of-the-art solutions for grid data management
14.6 Discussion
14.7 Open Issues
14.8 Conclusions
Index

Publication date (according to publisher): 13 October 2008
Language: English
Subject areas: Computer Science > Databases > Data Warehouse / Data Mining
Mathematics / Computer Science > Computer Science > Networks
Academic Studies > Cross-Sectional Subjects > Infectiology / Immunology
Natural Sciences > Biology
Technology > Electrical Engineering / Energy Technology
Keywords: Bioinformatics • Computer Science • Data Mining • Data Mining & Knowledge Discovery • Grid & Cloud Computing • Grid Computing
ISBN-10 0-470-69989-2 / 0470699892
ISBN-13 978-0-470-69989-8 / 9780470699898
PDF (Adobe DRM)
Size: 4.2 MB

Copy protection: Adobe DRM
Adobe DRM is a form of copy protection intended to protect the eBook against misuse. The eBook is authorised against your personal Adobe ID at download time. You can then read the eBook only on devices that are also registered to your Adobe ID.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to specialist books with columns, tables and figures. A PDF can be displayed on almost all devices, but is of only limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need an Adobe ID and the Adobe Digital Editions software (free of charge). We advise against using the OverDrive Media Console, as experience shows that problems with Adobe DRM occur frequently with it.
eReader: This eBook can be read on (almost) all eBook readers. It is, however, not compatible with the Amazon Kindle.
Smartphone/Tablet: You can read this eBook on both Apple and Android devices. You need an Adobe ID and a free app.

Additional feature: Read online
In addition to downloading this eBook, you can also read it online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfil eBook orders from other countries.
