Position Abstract for UW/Microsoft Summer Research Institute on Data Mining

Rajeev Motwani

1. What do you think data mining is about?

To me, data mining is the search for patterns and structure in large data sets and the discovery of information not explicitly present in the data. In a sense, the term "mining" is a misnomer: there is no one particular material being mined, so the activity is more like data "fishing." In fact, one of the difficult problems in data mining is to define concretely the classes of patterns that may be of interest.

2. What do you think the "real" problems are?

The first major problem is to define the classes of patterns that may be of interest; a key issue is whether such patterns can be defined in a domain-independent fashion. The second major problem is to distinguish real patterns from spurious ones, given that we would expect Ramsey-theoretic patterns in any large enough body of data. Statistical techniques are geared more towards testing the validity of a single pattern than towards finding all interesting patterns and separating them from those that arise by chance; certainly, the techniques that would apply are not efficient enough for the data sizes under consideration. This leads to the third major problem, which is performance and efficiency. Finally, the fourth major problem is that of extracting value from the discovered patterns. My fear is that in most cases there may not be any significant value.
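
The Ramsey-theoretic point can be made concrete with the smallest classical case. The sketch below is an illustration added here, not part of the original abstract, and the function names are hypothetical. It checks by brute force that every red/blue coloring of the edges of a complete graph on 6 vertices contains a single-colored triangle, while 5 vertices is not enough to force one. In other words, once the object is large enough, some "pattern" is guaranteed to appear no matter how the data was generated.

```python
from itertools import combinations, product


def has_mono_triangle(n, coloring):
    """Return True if some triangle has all three edges the same color.

    coloring maps each edge (i, j) with i < j to a color, 0 or 1.
    """
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_has_triangle(n):
    """Exhaustively test all 2-colorings of the edges of K_n."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )


print(every_coloring_has_triangle(5))  # False: a triangle-free coloring exists
print(every_coloring_has_triangle(6))  # True: the pattern is unavoidable
```

The exhaustive search is only feasible because the graphs are tiny (2^15 colorings for 6 vertices), which also hints at the efficiency problem raised above: naive enumeration of candidate patterns does not scale.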

3. What are the research challenges in data mining?

See the response to Question 2.

4. What is your "wish list" or "set of demands" for a data mining system?

It's far too early in the game for such a list. We can, of course, make the generic demands that one makes of any system: usability, scalability, efficiency, and so on.

5. What do you think will be the factors that may bring about the failure of data mining efforts?

AI-like hype without any substantive success stories is perhaps the greatest danger. Assuming that we manage to solve the first three problems listed in my response to Question 2, my fear is that even if data mining does produce the required results, these may not have sufficient value to the user to justify the investment.

6. Do you think people know what they want from their large databases? Is exploratory data analysis useful at all?

At this point in time, I suspect that the answer is "no." But hopefully that will change as the field evolves.

7. What do you think are the "systems" issues in data mining?

See the response to Question 4.

8. What fields do you think are relevant to data mining?

Algorithms, databases, statistics.

9. Will data mining be a new area of research? Will a new science and methodology for dealing with large databases emerge or is it all straightforward extensions of existing techniques?

It is hard to imagine a new science emerging here, but I doubt that the extension of known techniques will be straightforward either. I believe it will involve exciting interdisciplinary research that requires creative use of known techniques and extremely non-trivial extensions of the related fields.

10. What do I think I will get out of this meeting in Seattle?

Two things: a chance to understand the viewpoint of people from very different fields, and ideas for future research.