Position Abstract for UW/Microsoft Summer Research Institute on Data Mining
Rajeev Motwani
1. What do you think data mining is about?
To me, data mining is the search for patterns and structure in large
data sets and the discovery of information not explicitly present in
the data. In a sense, the term ``mining'' is a misnomer, since in data
mining there is no one particular material being mined; it is more like
data ``fishing.'' In fact, one of the difficult problems in data mining
is to concretely define the classes of patterns that may be of interest.
2. What do you think the "real" problems are?
The first major problem is to define the classes of patterns that may be
of interest. A major issue is whether such patterns can be defined in a
domain-independent fashion. A second major problem is to distinguish real
patterns from spurious ones, given that we would expect Ramsey-theoretic
patterns in any large enough body of data. Statistical techniques are
geared more towards testing the validity of a single pattern than towards
finding all interesting patterns and separating them from those that arise
purely by chance; and the techniques that do apply are certainly not
efficient enough for the data sizes under consideration. This leads to
the third major problem,
which is performance and efficiency. Finally, the fourth major problem is
that of extracting value from the discovered patterns. My fear is that
there may not be any significant value in most cases.
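
The second problem can be made concrete with a small simulation. The
following is only an illustrative sketch in Python, not a method proposed
in this abstract: the function names, the table shape (200 columns by 20
rows), and the agreement threshold are all assumptions chosen for the
example. Every column is pure noise, yet many pairs of columns look
strongly related.

```python
import random
from itertools import combinations
from math import comb


def count_chance_patterns(num_columns, num_rows, min_agree, seed=0):
    """Count column pairs in pure noise that look strongly related.

    Hypothetical helper for illustration; assumes min_agree > num_rows / 2.
    """
    rng = random.Random(seed)
    # Every column is an independent sequence of fair coin flips.
    columns = [[rng.randrange(2) for _ in range(num_rows)]
               for _ in range(num_columns)]
    flagged = 0
    for a, b in combinations(columns, 2):
        agree = sum(x == y for x, y in zip(a, b))
        # Flag near-perfect agreement or near-perfect disagreement.
        if agree >= min_agree or agree <= num_rows - min_agree:
            flagged += 1
    return flagged


def expected_chance_patterns(num_columns, num_rows, min_agree):
    """Expected number of flagged pairs when all columns are independent."""
    # Upper binomial tail; the lower tail is equal by symmetry.
    tail = sum(comb(num_rows, k) for k in range(min_agree, num_rows + 1))
    p_pair = 2 * tail / 2 ** num_rows
    return comb(num_columns, 2) * p_pair


if __name__ == "__main__":
    print("flagged pairs:", count_chance_patterns(200, 20, 16))
    print("expected by chance:", expected_chance_patterns(200, 20, 16))
```

Under these assumptions, about 235 of the 19,900 column pairs are expected
to agree (or disagree) on at least 16 of the 20 rows by chance alone, even
though no real relationship exists. A test designed to validate one
pattern at a time says nothing about how many such pairs to expect.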
3. What are the research challenges in data mining?
See the response to Question 2.
4. What is your "wish list" or "set of demands" for a data mining system?
It's far too early in the game for such a list. We can, of course, make
the generic demands that one makes of any system -- usability, scalability,
efficiency, and so on.
5. What do you think will be the factors that may bring about the
failure of data mining efforts?
AI-like hype without any substantive success stories is perhaps the greatest
danger. Assuming that we manage to solve the first three problems listed in my
response to Question 2, my fear is that even if data mining does produce the
required results, these may not have sufficient value to the user to justify
the investment.
6. Do you think people know what they want from their large databases?
Is exploratory data analysis useful at all?
At this point in time, I suspect that the answer is ``no.'' But hopefully
that will change as the field evolves.
7. What do you think are the "systems" issues in data mining?
See the response to Question 4.
8. What fields do you think are relevant to data mining?
Algorithms, databases, statistics.
9. Will data mining be a new area of research? Will a new science
and methodology for dealing with large databases emerge or is
it all straightforward extensions of existing techniques?
It is hard to imagine a new science emerging here, but I doubt that
the extension of known techniques will be straightforward either.
I believe it will involve exciting interdisciplinary research
that requires creative use of known techniques and extremely
non-trivial extensions of the related fields.
10. What do I think I will get out of this meeting in Seattle?
Two things: a chance to understand the viewpoint of people from
very different fields, and ideas for future research.