DELVE - Data Exploration, Learning, and Visualization Environment

A project by the Datamining Group at Hofstra University


Link to Sun AEG proposal

Intro to Datamining

Datamining is the extraction of interesting information from large repositories of data.  Such data can come from many sources, such as retail sales transactions in supermarkets, service center call logs, web access records, and weather measurements.  With new tools that draw upon machine learning (artificial intelligence), statistics, and databases, administrators can now discover the deeper rules that underlie the observational records of their business enterprise.

Irrespective of the context of the application, decision makers can use datamining tools to make at least three different (but related) kinds of inferences:

  1. Associations:  These are rules that group objects together based on some relation between them.  Consider a grocery store manager whose database records individual purchase transactions.  The manager may be interested in a rule such as: people who buy diapers tend to also buy beer.  Such rules can help with product placement, advertising, pricing, and so on.

  2. Predictions: Prediction (or classification) rules are used to categorize data into disjoint groups.  Consider a bank that makes loans.  Based on the performance of loans made in the past, it may create profiles for safe and for risky loan seekers.  When a new customer applies for a loan, the bank has to decide on the creditworthiness of the applicant based on the information available to it.  By matching the profile of this customer against prior customer profiles, the bank might classify the loan application for approval or for rejection.

  3. Sequences: Association rules and prediction rules usually involve intra-record data.  There are, however, situations where the occurrence of a prior event or transaction influences the occurrence of subsequent transactions.  Sequence problems are those in which such temporal connections need to be uncovered.  Such situations occur frequently in the stock market and in other domains with seasonal variation.

Datamining draws upon techniques initially developed in the machine learning and statistics communities.  These methods, however precise they may claim to be, suffer from one serious flaw: they do not scale well to large quantities of data.  Datamining alleviates this performance problem by adding the database perspective, in which handling large volumes of data efficiently has been studied for a long time.
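
To make the notion of an association rule more concrete, the following is a minimal Python sketch of how the "diapers and beer" rule could be scored using two standard measures: support (the fraction of all transactions that contain both items) and confidence (the fraction of transactions containing diapers that also contain beer).  The transactions and the function name rule_stats are hypothetical, made up for illustration; this is not DELVE code and not necessarily how DELVE computes rules.

```python
# Hypothetical toy transactions (illustrative only, not real sales data).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both item sets
    confidence = share of antecedent transactions that also contain
                 the consequent
    """
    total = len(transactions)
    # Transactions that contain every item in the antecedent.
    with_antecedent = [t for t in transactions if antecedent <= t]
    # Of those, the ones that also contain every item in the consequent.
    with_both = [t for t in with_antecedent if consequent <= t]

    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(transactions, {"diapers"}, {"beer"})
print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}")
```

This sketch scans every transaction in memory for a single candidate rule.  That is fine for a handful of records, but it is the kind of approach that stops being practical on very large repositories, which is where database techniques come in.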