Data mining projects

Data cleaning
Data from real-world sources is often erroneous, incomplete, or inconsistent as a result of operator error, system implementation flaws, and similar causes. Such low-quality data is not suitable for effective data mining. Data cleaning has been identified as an important problem, but little progress has been made thus far. In this project, we will study the issues related to data cleaning with the aim of developing an engineering approach that is useful to the user. The project consists of three phases: (1) identify and categorize the possible errors in data from multiple sources; (2) survey the available and potentially usable techniques for addressing the problem; and (3) develop a system that can identify and resolve some of the errors.
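As a rough illustration of phase (1), the sketch below sorts problems in a small record set into the three categories named above: incomplete, erroneous, and inconsistent. The function name, the field names, the range check, and the rule that two records with the same key must agree are all assumptions made for this example; they are not the project's actual design.

```python
from collections import defaultdict


def categorize_errors(records, key, required, ranges):
    """Map each error category to a list of (record index, field) pairs.

    All checks here are illustrative assumptions:
      - incomplete:   a required field is missing or empty
      - erroneous:    a numeric field falls outside its allowed range
      - inconsistent: two records share a key but disagree on a field
    """
    errors = defaultdict(list)
    first_seen = {}  # key value -> first record carrying that key
    for i, rec in enumerate(records):
        for field in required:
            if rec.get(field) in (None, ""):
                errors["incomplete"].append((i, field))
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if isinstance(value, (int, float)) and not (low <= value <= high):
                errors["erroneous"].append((i, field))
        first = first_seen.setdefault(rec.get(key), rec)
        if first is not rec:
            for field in required:
                a, b = first.get(field), rec.get(field)
                if a not in (None, "") and b not in (None, "") and a != b:
                    errors["inconsistent"].append((i, field))
    return dict(errors)


# Hypothetical records merged from two sources.
records = [
    {"id": 1, "name": "Ann Lee", "age": 34},
    {"id": 2, "name": "", "age": 29},
    {"id": 3, "name": "Bob Tan", "age": 240},
    {"id": 1, "name": "Ann Lee", "age": 43},
]
report = categorize_errors(
    records, key="id", required=["name", "age"], ranges={"age": (0, 120)}
)
```

A real system would need far richer checks (for example, approximate matching of names across sources); the point here is only that the categories can be reported separately so that each can be resolved by a different technique.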
Data mining in multiple databases
Organizations commonly keep many databases, each collected to serve a different purpose. Data mining in individual databases has attracted a lot of attention, and some encouraging results have been achieved. It is now time to consider how all the databases in an organization can be used together for data mining. Many issues remain unresolved. Significant ones are: (1) Will one database help in the data mining of another database? (2) How can we consider multiple databases simultaneously for data mining? (3) Can current data mining techniques help in this new situation, or should we develop new techniques? (4) What is distinctive about mining multiple databases?
Intelligent Web document management using data mining techniques
With the development of Internet/Web technology, the volume of Web documents has increased dramatically, and effective management of these documents is becoming an important issue. The objective of this project is to build an intelligent system that helps webmasters manage Web documents so that they can serve users better. We will first survey the available and potentially useful techniques for discovering access patterns of Web documents stored on an information provider’s Web server. The major issues include establishing measurements and heuristics on user access patterns and developing techniques to discover and maintain such patterns. The results will then be extended to use the discovered access patterns to manage Web documents so that information subscribers can access information of interest more efficiently. Techniques to be investigated include clustering of Web documents, pre-fetching and caching, and customized linkage of Web documents.
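One simple measurement of user access patterns, given here only as an assumed example, is how often two documents are requested within the same user session. The sketch below counts such co-accesses and keeps the pairs that reach a minimum count; pairs like these could be candidates for pre-fetching or customized linkage. The session format, the function names, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions):
    """Count, for every pair of documents, the sessions containing both."""
    counts = Counter()
    for pages in sessions:
        # Deduplicate within a session and sort so each pair has one form.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(sessions, min_support):
    """Keep only the pairs co-accessed in at least min_support sessions."""
    return {
        pair: n
        for pair, n in co_access_counts(sessions).items()
        if n >= min_support
    }


# Hypothetical sessions, each a list of document paths requested by one user.
sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b"],
    ["/b", "/c"],
    ["/a"],
]
pairs = frequent_pairs(sessions, min_support=2)
```

This ignores request order and timing, both of which matter for pre-fetching; it is meant only to show the kind of pattern that can be derived from server access logs.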
Data mining with neural networks
While neural networks are commonly used for pattern classification in practice, they have not been widely applied in data mining applications. The reason is that the decision process of a neural network is not easily explained in terms of rules that human experts can verify. Over the past two years, we have investigated the problem of extracting rules from trained neural networks. Our results have been encouraging. The rules extracted by our algorithms are not only more concise than those generated from decision trees, but are also, in general, more accurate. Our algorithms can extract rules of the following forms:
Symbolic rules, e.g. if (married = yes) and (sex = male), then …
M-of-N rules, e.g. if 2 of the 3 conditions {number of children is not more than 3, married more than 10 years, owns private property} are satisfied, then …
Oblique rules, e.g. if (monthly salary - 1.5 * monthly mortgage), then …
In this project, we plan to implement these algorithms on a 32-processor Fujitsu AP3000 parallel computer to speed up the training process.
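To make the three rule forms listed above concrete, the sketch below writes each one as a predicate over a record. It shows only how such rules might be evaluated once extracted, not the extraction algorithms themselves. The field names are hypothetical, the M-of-N rule is read as "at least M of the N conditions", and the comparison and threshold in the oblique rule are assumptions, because the example above gives only the linear expression.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def symbolic_rule(record: Record) -> bool:
    """Conjunction of attribute tests: (married = yes) and (sex = male)."""
    return record["married"] == "yes" and record["sex"] == "male"


def m_of_n(m: int, conditions: List[Condition]) -> Condition:
    """Build a rule that fires when at least m of the conditions hold."""
    def rule(record: Record) -> bool:
        return sum(1 for cond in conditions if cond(record)) >= m
    return rule


def oblique_rule(record: Record, threshold: float) -> bool:
    """Test a linear combination of attributes against a threshold.

    The '>' comparison and the threshold value are assumptions.
    """
    value = record["monthly_salary"] - 1.5 * record["monthly_mortgage"]
    return value > threshold


# The "2 of the 3 conditions" example, with assumed field names.
two_of_three = m_of_n(2, [
    lambda r: r["children"] <= 3,
    lambda r: r["years_married"] > 10,
    lambda r: bool(r["owns_property"]),
])

person = {
    "married": "yes", "sex": "male",
    "children": 2, "years_married": 12, "owns_property": False,
    "monthly_salary": 5000.0, "monthly_mortgage": 2000.0,
}
```

An oblique rule corresponds to a decision boundary that is not parallel to any attribute axis, which is why a single such rule can stand in for several axis-parallel tests in a decision tree.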