Entity Information Life Cycle for Big Data - John R. Talburt, Yinle Zhou

Entity Information Life Cycle for Big Data (eBook)

Master Data Management and Information Integration

John R. Talburt, Yinle Zhou (Autoren)

eBook Download: PDF | EPUB

2015 | 1. Auflage
254 Seiten
Elsevier Science (Verlag)
978-0-12-800665-8 (ISBN)

Entity Information Life Cycle for Big Data walks you through the ins and outs of managing entity information so you can successfully achieve master data management (MDM) in the era of big data. This book explains big data's impact on MDM and the critical role of entity information management system (EIMS) in successful MDM. Expert authors Dr. John R. Talburt and Dr. Yinle Zhou provide a thorough background in the principles of managing the entity information life cycle and provide practical tips and techniques for implementing an EIMS, strategies for exploiting distributed processing to handle big data for EIMS, and examples from real applications. Additional material on the theory of EIIM and methods for assessing and evaluating EIMS performance also make this book appropriate for use as a textbook in courses on entity and identity management, data management, customer relationship management (CRM), and related topics. - Explains the business value and impact of entity information management system (EIMS) and directly addresses the problem of EIMS design and operation, a critical issue organizations face when implementing MDM systems - Offers practical guidance to help you design and build an EIM system that will successfully handle big data - Details how to measure and evaluate entity integrity in MDM systems and explains the principles and processes that comprise EIM - Provides an understanding of features and functions an EIM system should have that will assist in evaluating commercial EIM systems - Includes chapter review questions, exercises, tips, and free downloads of demonstrations that use the OYSTER open source EIM system - Executable code (Java .jar files), control scripts, and synthetic input data illustrate various aspects of CSRUD life cycle such as identity capture, identity update, and assertions

Dr. John R. Talburt is Professor of Information Science at the University of Arkansas at Little Rock (UALR) where he is the Coordinator for the Information Quality Graduate Program and the Executive Director of the UALR Center for Advanced Research in Entity Resolution and Information Quality (ERIQ). He is also the Chief Scientist for Black Oak Partners, LLC, an information quality solutions company. Prior to his appointment at UALR he was the leader for research and development and product innovation at Acxiom Corporation, a global leader in information management and customer data integration. Professor Talburt holds several patents related to customer data integration and the author of numerous articles on information quality and entity resolution, and is the author of Entity Resolution and Information Quality (Morgan Kaufmann, 2011). He also holds the IAIDQ Information Quality Certified Professional (IQCP) credential.

Preface

The Changing Landscape of Information Quality

Since the publication of Entity Resolution and Information Quality (Morgan Kaufmann, 2011), a lot has been happening in the field of information and data quality. One of the most important developments is how organizations are beginning to understand that the data they hold are among their most important assets and should be managed accordingly. As many of us know, this is by no means a new message, only that it is just now being heeded. Leading experts in information and data quality such as Rich Wang, Yang Lee, Tom Redman, Larry English, Danette McGilvray, David Loshin, Laura Sebastian-Coleman, Rajesh Jugulum, Sunil Soares, Arkady Maydanchik, and many others have been advocating this principle for many years.

Evidence of this new understanding can be found in the dramatic surge of the adoption of data governance (DG) programs by organizations of all types and sizes. Conferences, workshops, and webinars on this topic are overflowing with attendees. The primary reason is that DG provides organizations with an answer to the question, “If information is really an important organizational asset, then how can it be managed at the enterprise level?” One of the primary benefits of a DG program is that it provides a framework for implementing a central point of communication and control over all of an organization’s data and information.

As DG has grown and matured, its essential components become more clearly defined. These components generally include central repositories for data definitions, business rules, metadata, data-related issue tracking, regulations and compliance, and data quality rules. Two other key components of DG are master data management (MDM) and reference data management (RDM). Consequently, the increasing adoption of DG programs has brought a commensurate increase in focus on the importance of MDM.

Certainly this is not the first book on MDM. Several excellent books include Master Data Management and Data Governance by Alex Berson and Larry Dubov (2011), Master Data Management in Practice by Dalton Cervo and Mark Allen (2011), Master Data Management by David Loshin (2009), Enterprise Master Data Management by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson (2008), and Customer Data Integration by Jill Dyché and Evan Levy (2006). However, MDM is an extensive and evolving topic. No single book can explore every aspect of MDM at every level.

Motivation for This Book

Numerous things have motivated us to contribute yet another book. However, the primary reason is this. Based on our experience in both academia and industry, we believe that many of the problems that organizations experience with MDM implementation and operation are rooted in the failure to understand and address certain critical aspects of entity identity information management (EIIM). EIIM is an extension of entity resolution (ER) with the goal of achieving and maintaining the highest level of accuracy in the MDM system. Two key terms are “achieving” and “maintaining.”

Having a goal and defined requirements is the starting point for every information and data quality methodology from the MIT TDQM (Total Data Quality Management) to the Six-Sigma DMAIC (Define, Measure, Analyze, Improve, and Control). Unfortunately, when it comes to MDM, many organizations have not defined any goals. Consequently these organizations don’t have a way to know if they have achieved their goal. They leave many questions unanswered. What is our accuracy? Now that a proposed programming or procedure has been implemented, is the system performing better or worse than before? Few MDM administrators can provide accurate estimates of even the most basic metrics such as false positive and false negative rates or the overall accuracy of their system. In this book we have emphasized the importance of objective and systematic measurement and provided practical guidance on how these measurements can be made.

To help organizations better address the maintaining of high levels of accuracy through EIIM, the majority of the material in the book is devoted to explaining the CSRUD five-phase entity information life cycle model. CSRUD is an acronym for capture, store and share, resolve and retrieve, update, and dispose. We believe that following this model can help any organization improve MDM accuracy and performance.

Finally, no modern day IT book can be complete without talking about Big Data. Seemingly rising up overnight, Big Data has captured everyone’s attention, not just in IT, but even the man on the street. Just as DG seems to be getting up a good head of steam, it now has to deal with the Big Data phenomenon. The immediate question is whether Big Data simply fits right into the current DG model, or whether the DG model needs to be revised to account for Big Data.

Regardless of one’s opinion on this topic, one thing is clear – Big Data is bad news for MDM. The reason is a simple mathematical fact: MDM relies on entity resolution, and entity resolution primarily relies on pair-wise record matching, and the number of pairs of records to match increases as the square of the number of records. For this reason, ordinary data (millions of records) is already a challenge for MDM, so Big Data (billions of records) seems almost insurmountable. Fortunately, Big Data is not just matter of more data; it is also ushering in a new paradigm for managing and processing large amounts of data. Big Data is bringing with it new tools and techniques. Perhaps the most important technique is how to exploit distributed processing. However, it is easier to talk about Big Data than to do something about it. We wanted to avoid that and include in our book some practical strategies and designs for using distributed processing to solve some of these problems.

Audience

It is our hope that both IT professionals and business professionals interested in MDM and Big Data issues will find this book helpful. Most of the material focuses on issues of design and architecture, making it a resource for anyone evaluating an installed system, comparing proposed third-party systems, or for an organization contemplating building its own system. We also believe that it is written at a level appropriate for a university textbook.

Organization of the Material

Chapters 1 and 2 provide the background and context of the book. Chapter 1 provides a definition and overview of MDM. It includes the business case, dimensions, and challenges facing MDM and also starts the discussion of Big Data and its impact on MDM. Chapter 2 defines and explains the two primary technologies that support MDM – ER and EIIM. In addition, Chapter 2 introduces the CSRUD Life Cycle for entity identity information. This sets the stage for the next four chapters.

Chapters 3, 4, 5, and 6 are devoted to an in-depth discussion of the CSRUD life cycle model. Chapter 3 is an in-depth look at the Capture Phase of CSRUD. As part of the discussion, it also covers the techniques of truth set building, benchmarking, and problem sets as tools for assessing entity resolution and MDM outcomes. In addition, it discusses some of the pros and cons of the two most commonly used data matching techniques – deterministic matching and probabilistic matching.

Chapter 4 explains the Store and Share Phase of CSRUD. This chapter introduces the concept of an entity identity structure (EIS) that forms the building blocks of the identity knowledge base (IKB). In addition to discussing different styles of EIS designs, it also includes a discussion of the different types of MDM architectures.

Chapter 5 covers two closely related CSRUD phases, the Update Phase and the Dispose Phase. The Update Phase discussion covers both automated and manual update processes and the critical roles played by clerical review indicators, correction assertions, and confirmation assertions. Chapter 5 also presents an example of an identity visualization system that assists MDM data stewards with the review and assertion process.

Chapter 6 covers the Resolve and Retrieve Phase of CSRUD. It also discusses some design considerations for accessing identity information, and a simple model for a retrieved identifier confidence score.

Chapter 7 introduces two of the most important theoretical models for ER, the Fellegi-Sunter Theory of Record Linkage and the Stanford Entity Resolution Framework or SERF Model. Chapter 7 is inserted here because some of the concepts introduced in the SERF Model are used in Chapter 8, “The Nuts and Bolts of ER.” The chapter concludes with a discussion of how EIIM relates to each of these models.

Chapter 8 describes a deeper level of design considerations for ER and EIIM systems. It discusses in detail the three levels of matching in an EIIM system: attribute-level, reference-level, and cluster-level matching.

Chapter 9 covers the technique of blocking as a way to increase the performance of ER and MDM systems. It focuses on match key blocking, the definition of match-key-to-match-rule alignment, and the precision and recall of match keys. Preresolution blocking and transitive closure of match keys are discussed as a prelude to Chapter 10.

Chapter 10 discusses the problems in implementing the CSRUD Life Cycle for Big Data. It gives examples of how the Hadoop Map/Reduce framework can be used to...

Erscheint lt. Verlag	20.4.2015
Sprache	englisch
Themenwelt	Mathematik / Informatik ► Informatik ► Datenbanken
	Informatik ► Office Programme ► Outlook
	Mathematik / Informatik ► Informatik ► Software Entwicklung
	Wirtschaft ► Betriebswirtschaft / Management ► Wirtschaftsinformatik
ISBN-10	0-12-800665-X / 012800665X
ISBN-13	978-0-12-800665-8 / 9780128006658

Informationen gemäß Produktsicherheitsverordnung (GPSR)
Haben Sie eine Frage zum Produkt?

PDF (Adobe DRM)
Größe: 6,9 MB

Kopierschutz: Adobe-DRM
Adobe-DRM ist ein Kopierschutz, der das eBook vor Mißbrauch schützen soll. Dabei wird das eBook bereits beim Download auf Ihre persönliche Adobe-ID autorisiert. Lesen können Sie das eBook dann nur auf den Geräten, welche ebenfalls auf Ihre Adobe-ID registriert sind.
Details zum Adobe-DRM

Dateiformat: PDF (Portable Document Format)
Mit einem festen Seitenlayout eignet sich die PDF besonders für Fachbücher mit Spalten, Tabellen und Abbildungen. Eine PDF kann auf fast allen Geräten angezeigt werden, ist aber für kleine Displays (Smartphone, eReader) nur eingeschränkt geeignet.

Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen eine Adobe-ID und die Software Adobe Digital Editions (kostenlos). Von der Benutzung der OverDrive Media Console raten wir Ihnen ab. Erfahrungsgemäß treten hier gehäuft Probleme mit dem Adobe DRM auf.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen eine Adobe-ID sowie eine kostenlose App.
Geräteliste und zusätzliche Hinweise

Zusätzliches Feature: Online Lesen
Dieses eBook können Sie zusätzlich zum Download auch online im Webbrowser lesen.

Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.

EPUB (Adobe DRM)
Größe: 7,7 MB

Dateiformat: EPUB (Electronic Publication)
EPUB ist ein offener Standard für eBooks und eignet sich besonders zur Darstellung von Belletristik und Sachbüchern. Der Fließtext wird dynamisch an die Display- und Schriftgröße angepasst. Auch für mobile Lesegeräte ist EPUB daher gut geeignet.

Zusätzliches Feature: Online Lesen
Dieses eBook können Sie zusätzlich zum Download auch online im Webbrowser lesen.

Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.