Data Mining

Authors: Mehmed Kantardzic

Table of Contents

Cover

Series page

Title page

Copyright page

DEDICATION

PREFACE TO THE SECOND EDITION

PREFACE TO THE FIRST EDITION

1 DATA-MINING CONCEPTS

1.1 INTRODUCTION

1.2 DATA-MINING ROOTS

1.3 DATA-MINING PROCESS

1.4 LARGE DATA SETS

1.5 DATA WAREHOUSES FOR DATA MINING

1.6 BUSINESS ASPECTS OF DATA MINING: WHY A DATA-MINING PROJECT FAILS

1.7 ORGANIZATION OF THIS BOOK

2 PREPARING THE DATA

2.1 REPRESENTATION OF RAW DATA

2.2 CHARACTERISTICS OF RAW DATA

2.3 TRANSFORMATION OF RAW DATA

2.4 MISSING DATA

2.5 TIME-DEPENDENT DATA

2.6 OUTLIER ANALYSIS

3 DATA REDUCTION

3.1 DIMENSIONS OF LARGE DATA SETS

3.2 FEATURE REDUCTION

3.3 RELIEF ALGORITHM

3.4 ENTROPY MEASURE FOR RANKING FEATURES

3.5 PCA

3.6 VALUE REDUCTION

3.7 FEATURE DISCRETIZATION: CHIMERGE TECHNIQUE

3.8 CASE REDUCTION

4 LEARNING FROM DATA

4.1 LEARNING MACHINE

4.2 SLT

4.3 TYPES OF LEARNING METHODS

4.4 COMMON LEARNING TASKS

4.5 SVMs

4.6 KNN: NEAREST NEIGHBOR CLASSIFIER

4.7 MODEL SELECTION VERSUS GENERALIZATION

4.8 MODEL ESTIMATION

4.9 90% ACCURACY: NOW WHAT?

5 STATISTICAL METHODS

5.1 STATISTICAL INFERENCE

5.2 ASSESSING DIFFERENCES IN DATA SETS

5.3 BAYESIAN INFERENCE

5.4 PREDICTIVE REGRESSION

5.5 ANOVA

5.6 LOGISTIC REGRESSION

5.7 LOG-LINEAR MODELS

5.8 LDA

6 DECISION TREES AND DECISION RULES

6.1 DECISION TREES

6.2 C4.5 ALGORITHM: GENERATING A DECISION TREE

6.3 UNKNOWN ATTRIBUTE VALUES

6.4 PRUNING DECISION TREES

6.5 C4.5 ALGORITHM: GENERATING DECISION RULES

6.6 CART ALGORITHM & GINI INDEX

6.7 LIMITATIONS OF DECISION TREES AND DECISION RULES

7 ARTIFICIAL NEURAL NETWORKS

7.1 MODEL OF AN ARTIFICIAL NEURON

7.2 ARCHITECTURES OF ANNS

7.3 LEARNING PROCESS

7.4 LEARNING TASKS USING ANNS

7.5 MULTILAYER PERCEPTRONS (MLPs)

7.6 COMPETITIVE NETWORKS AND COMPETITIVE LEARNING

7.7 SOMs

8 ENSEMBLE LEARNING

8.1 ENSEMBLE-LEARNING METHODOLOGIES

8.2 COMBINATION SCHEMES FOR MULTIPLE LEARNERS

8.3 BAGGING AND BOOSTING

8.4 ADABOOST

9 CLUSTER ANALYSIS

9.1 CLUSTERING CONCEPTS

9.2 SIMILARITY MEASURES

9.3 AGGLOMERATIVE HIERARCHICAL CLUSTERING

9.4 PARTITIONAL CLUSTERING

9.5 INCREMENTAL CLUSTERING

9.6 DBSCAN ALGORITHM

9.7 BIRCH ALGORITHM

9.8 CLUSTERING VALIDATION

10 ASSOCIATION RULES

10.1 MARKET-BASKET ANALYSIS

10.2 ALGORITHM APRIORI

10.3 FROM FREQUENT ITEMSETS TO ASSOCIATION RULES

10.4 IMPROVING THE EFFICIENCY OF THE APRIORI ALGORITHM

10.5 FP GROWTH METHOD

10.6 ASSOCIATIVE-CLASSIFICATION METHOD

10.7 MULTIDIMENSIONAL ASSOCIATION–RULES MINING

11 WEB MINING AND TEXT MINING

11.1 WEB MINING

11.2 WEB CONTENT, STRUCTURE, AND USAGE MINING

11.3 HITS AND LOGSOM ALGORITHMS

11.4 MINING PATH–TRAVERSAL PATTERNS

11.5 PAGERANK ALGORITHM

11.6 TEXT MINING

11.7 LATENT SEMANTIC ANALYSIS (LSA)

12 ADVANCES IN DATA MINING

12.1 GRAPH MINING

12.2 TEMPORAL DATA MINING

12.3 SPATIAL DATA MINING (SDM)

12.4 DISTRIBUTED DATA MINING (DDM)

12.5 CORRELATION DOES NOT IMPLY CAUSALITY

12.6 PRIVACY, SECURITY, AND LEGAL ASPECTS OF DATA MINING

13 GENETIC ALGORITHMS

13.1 FUNDAMENTALS OF GAs

13.2 OPTIMIZATION USING GAs

13.3 A SIMPLE ILLUSTRATION OF A GA

13.4 SCHEMATA

13.5 TSP

13.6 MACHINE LEARNING USING GAs

13.7 GAS FOR CLUSTERING

14 FUZZY SETS AND FUZZY LOGIC

14.1 FUZZY SETS

14.2 FUZZY-SET OPERATIONS

14.3 EXTENSION PRINCIPLE AND FUZZY RELATIONS

14.4 FUZZY LOGIC AND FUZZY INFERENCE SYSTEMS

14.5 MULTIFACTORIAL EVALUATION

14.6 EXTRACTING FUZZY MODELS FROM DATA

14.7 DATA MINING AND FUZZY SETS

15 VISUALIZATION METHODS

15.1 PERCEPTION AND VISUALIZATION

15.2 SCIENTIFIC VISUALIZATION AND INFORMATION VISUALIZATION

15.3 PARALLEL COORDINATES

15.4 RADIAL VISUALIZATION

15.5 VISUALIZATION USING SELF-ORGANIZING MAPS (SOMs)

15.6 VISUALIZATION SYSTEMS FOR DATA MINING

APPENDIX A

A.1 DATA-MINING JOURNALS

A.2 DATA-MINING CONFERENCES

A.3 DATA-MINING FORUMS/BLOGS

A.4 DATA SETS

A.5 COMMERCIALLY AND PUBLICLY AVAILABLE TOOLS

A.6 WEB SITE LINKS

APPENDIX B: DATA-MINING APPLICATIONS

B.1 DATA MINING FOR FINANCIAL DATA ANALYSIS

B.2 DATA MINING FOR THE TELECOMMUNICATIONS INDUSTRY

B.3 DATA MINING FOR THE RETAIL INDUSTRY

B.4 DATA MINING IN HEALTH CARE AND BIOMEDICAL RESEARCH

B.5 DATA MINING IN SCIENCE AND ENGINEERING

B.6 PITFALLS OF DATA MINING

BIBLIOGRAPHY

Index

IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board

Lajos Hanzo, Editor in Chief

R. Abhari
M. El-Hawary
O. P. Malik
J. Anderson
B-M. Haemmerli
S. Nahavandi
G. W. Arnold
M. Lanzerotti
T. Samad
F. Canavero
D. Jacobson
G. Zobrist

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewers

Mariofanna Milanova, Professor

Computer Science Department

University of Arkansas at Little Rock

Little Rock, Arkansas, USA

Jozef Zurada, Ph.D.

Professor of Computer Information Systems

College of Business

University of Louisville

Louisville, Kentucky, USA

Witold Pedrycz

Department of ECE

University of Alberta

Edmonton, Alberta, Canada

Copyright © 2011 by Institute of Electrical and Electronics Engineers. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Kantardzic, Mehmed.

 Data mining : concepts, models, methods, and algorithms / Mehmed Kantardzic. – 2nd ed.

p. cm.

 ISBN 978-0-470-89045-5 (cloth)

 1. Data mining. I. Title.

 QA76.9.D343K36 2011

 006.3'12–dc22

2011002190

oBook ISBN: 978-1-118-02914-5

ePDF ISBN: 978-1-118-02912-1

ePub ISBN: 978-1-118-02913-8

To Belma and Nermin

PREFACE TO THE SECOND EDITION

In the seven years that have passed since the publication of the first edition of this book, the field of data mining has made good progress both in developing new methodologies and in extending the spectrum of new applications. These changes in data mining motivated me to update my data-mining book with a second edition. Although the core of the material in this edition remains the same, the new version of the book attempts to summarize recent developments in our fast-changing field, presenting the state of the art in data mining, both in academic research and in deployment in commercial applications. The most notable changes from the first edition are the addition of

  • new topics such as ensemble learning, graph mining, and temporal, spatial, distributed, and privacy-preserving data mining;
  • new algorithms such as Classification and Regression Trees (CART), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Balanced and Iterative Reducing and Clustering Using Hierarchies (BIRCH), PageRank, AdaBoost, support vector machines (SVM), Kohonen self-organizing maps (SOM), and latent semantic indexing (LSI);
  • more details on practical aspects and business understanding of a data-mining process, discussing important problems of validation, deployment, data understanding, causality, security, and privacy; and
  • some quantitative measures and methods for comparison of data-mining models, such as the ROC curve, lift chart, ROI chart, McNemar’s test, and the K-fold cross-validation paired t-test.

Keeping in mind the educational aspect of the book, many new exercises have been added. The bibliography and appendices have been updated to include work that has appeared in the last few years, as well as to reflect the change in emphasis when a new topic gained importance.

I would like to thank my colleagues around the world who used the first edition of the book for their classes and who sent me support, encouragement, and suggestions to put together this revised version. My sincere thanks are due to all my colleagues and students in the Data Mining Lab and Computer Science Department for their reviews of this edition and their numerous helpful suggestions. Special thanks go to graduate students Brent Wenerstrom, Chamila Walgampaya, and Wael Emara for their patience in proofreading this new edition and for useful discussions about the content of new chapters, numerous corrections, and additions. To Dr. Joung Woo Ryu, who helped me enormously in the preparation of the final version of the text and all additional figures and tables, I would like to express my deepest gratitude.

I believe this book can serve as a valuable guide to the field for undergraduate and graduate students, researchers, and practitioners. I hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining on modern business, science, and even society as a whole.

MEHMED KANTARDZIC

Louisville

July 2011
