CISC7107 Data Mining and Decision Support Systems

Assignment 2.0

Exploiting the Powers of Clustering

Objectives:

Students are to gain experience in using WEKA and other data mining software, covering clustering, association analysis and the different ways clustering can be combined with other techniques. Learn how to analyze real-life datasets using these techniques and how to interpret the meaning of the results.

Tasks to do:

Clustering – How to evaluate Clusters and Clustering algorithms? (Warm-up exercises, NO NEED to put into assignment report)

Firstly, familiarize yourself with the experiments on “mouse.arff” and “Iris.arff” using K-means, DBSCAN, EM, SOM, etc. in Weka.

In a 2D dataset such as “mouse.arff”, only use evaluation indicators such as StDev-x and StDev-y, which are default performance outputs of the clustering algorithms.
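To make the StDev-x and StDev-y indicators concrete, here is a minimal sketch that computes the per-cluster spread of a 2D dataset. The function name, the toy points and the use of the sample standard deviation are assumptions for illustration; Weka's own output may be computed slightly differently.

```python
from collections import defaultdict
from statistics import stdev


def per_cluster_stdev(points, labels):
    """Return {cluster: (stdev_x, stdev_y)} for 2D points grouped by cluster label."""
    groups = defaultdict(list)
    for (x, y), label in zip(points, labels):
        groups[label].append((x, y))
    result = {}
    for label, members in groups.items():
        xs = [p[0] for p in members]
        ys = [p[1] for p in members]
        # stdev needs at least two values; report 0.0 for singleton clusters
        result[label] = (
            stdev(xs) if len(xs) > 1 else 0.0,
            stdev(ys) if len(ys) > 1 else 0.0,
        )
    return result


# Hypothetical 2D points with cluster labels already assigned by some clusterer.
points = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0), (10.0, 10.0), (12.0, 10.0)]
labels = [0, 0, 0, 0, 1, 1]
spread = per_cluster_stdev(points, labels)
for cluster, (sx, sy) in sorted(spread.items()):
    print(f"cluster {cluster}: StDev-x={sx:.3f} StDev-y={sy:.3f}")
```

Smaller values mean a tighter cluster along that axis, which is one simple way to compare algorithms on the same 2D data.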

Now find a dataset of your own choice from the given online links. You may reuse the datasets you used in Assignment 1 or try other datasets. It is fine to use a high-dimensional dataset; you can choose any two attributes to display for 2D clustering evaluation. Optionally, instead of randomly picking two or three attributes for visualization, you may use Select Attributes → InfoGainAttributeEval → Ranker to find the top two or three attributes.
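The ranking idea behind the information-gain attribute selection mentioned above can be sketched as follows. This is an illustrative version that assumes nominal attributes only (Weka handles numeric attributes by discretizing them first); the function names and the toy data are made up.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def info_gain(column, classes):
    """Information gain of one nominal attribute column with respect to the class."""
    total = len(classes)
    remainder = 0.0
    for value in set(column):
        subset = [c for v, c in zip(column, classes) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(classes) - remainder


def rank_attributes(rows, attribute_names, classes):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {
        name: info_gain([row[i] for row in rows], classes)
        for i, name in enumerate(attribute_names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'humidity' not at all.
rows = [("sunny", "high"), ("sunny", "low"), ("rainy", "high"), ("rainy", "low")]
classes = ["no", "no", "yes", "yes"]
ranking = rank_attributes(rows, ["outlook", "humidity"], classes)
print(ranking)
```

The top two or three names in the ranking would then be the attributes chosen for the 2D or 3D plots.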

In a slightly higher-dimensional dataset such as “Iris.arff” (which has 4 dimensions), usually intended for a classification problem, you will find a class column. The purpose of clustering here is to group similar data for data segmentation and then compare the groups against the class. The technical term for this is “Classes-to-clusters evaluation”.
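One simple way to score such an evaluation is to label each cluster with its most frequent class and count how many instances agree with that label. The sketch below assumes this majority-class mapping; Weka's own classes-to-clusters mapping may differ in detail, and the names and toy data here are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its majority class and return (mapping, accuracy)."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    # Majority class per cluster
    mapping = {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}
    correct = sum(mapping[c] == label for c, label in zip(cluster_ids, class_labels))
    return mapping, correct / len(class_labels)


# Hypothetical cluster assignments and true classes for seven instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "b", "a"]
mapping, accuracy = classes_to_clusters(cluster_ids, class_labels)
print(mapping, f"{accuracy:.3f}")
```

Running the same function on the output of several clustering algorithms gives one comparable number per algorithm for the report table.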

In this case, you will evaluate the clustering performance:

  • Produce a clustering performance comparison in simple 2D using your chosen dataset, similar to the sample file “mouse_visualization.xls”.
  • Use DBSCAN to find the outliers, either in simple 2D (any pair of attributes of your chosen dataset) or using all the attributes, and record the result in your Excel file.
  • Produce a clustering performance comparison for “Classes-to-clusters evaluation” using all the attributes of your chosen dataset. In this case, you may want to compare four or more clustering algorithms without visualizing the results. You only need to record the performance results and tabulate them in your report. You may need the Weka meta-classifier called “ClassificationViaClustering”:

– It works by building a clusterer while ignoring the class attribute, then assigning to each cluster its most frequent class.

– It is obviously not competitive with other classification techniques in terms of classification accuracy, but it is a good way of comparing clusterers.

– In Weka, select Classify → ClassificationViaClustering.

  • Again, use DBSCAN to find the outliers using all the attributes of your chosen dataset, and record the result in your Excel file.
  • Briefly summarize your findings and discuss your observations.
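For the DBSCAN outlier steps in the list above, the following is a minimal pure-Python sketch of the algorithm, shown only to illustrate how noise points arise. It is not Weka's implementation; the parameter names `eps` and `min_pts`, the Euclidean distance and the toy points are assumptions.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
        cluster += 1
    return labels


# Hypothetical 2D data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
print(labels, outliers)
```

The points labelled as noise are the ones to export as outliers; changing `eps` and `min_pts` changes how many points end up in that group.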

1. Clustering – How to find outliers using a clustering algorithm?

In this case, you will show that you can find outliers in your chosen dataset and visualize them in 3D, using the following two functions in Weka:

  • Use the Weka filter called InterquartileRange to find outliers and extreme values. Show that you can trim off the outliers and extreme values. Visualize the data BEFORE and AFTER removing the outliers and extreme values using the Visualize 3D function in Weka. You can either select any three attributes or use feature selection to pick the most important attributes for visualization.
  • Use the DBSCAN algorithm in Weka (or any similar clustering algorithm) that reports noise points as outliers. Extract the data and the outliers into a data file, and visualize them in Visualize 3D with any three attributes you choose. For example, for the emotion-train.arff dataset*, using DBSCAN with eps=1.7, minpt=8 gives exactly 16 outliers, which account for 4% of the whole dataset. Repeat the same task on your chosen dataset. Alternatively, you can find the outliers using RapidMiner or other data mining software.
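The interquartile-range approach in the first bullet above can be sketched as below. The factors 3.0 (outlier) and 6.0 (extreme value) are assumed here and should be checked against the settings you actually use in Weka; the quartile method of Python's `statistics.quantiles` may also differ from Weka's, and the data are made up.

```python
from statistics import quantiles


def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Label each value 'normal', 'outlier' or 'extreme' using IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    flags = []
    for v in values:
        # Distance outside the [Q1, Q3] box; negative when the value is inside it
        distance = max(q1 - v, v - q3)
        if distance > extreme_factor * iqr:
            flags.append("extreme")
        elif distance > outlier_factor * iqr:
            flags.append("outlier")
        else:
            flags.append("normal")
    return flags


# Hypothetical single attribute: eleven ordinary values plus two suspicious ones.
values = list(range(10, 21)) + [50, 100]
flags = iqr_flags(values)
trimmed = [v for v, flag in zip(values, flags) if flag == "normal"]
print(flags[-2:], trimmed)
```

The `values` list corresponds to the BEFORE view and `trimmed` to the AFTER view; in a multi-attribute dataset the same test would be applied per attribute.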

Source:

https://github.com/rivolli/mlmlbr/blob/master/dataset/emotions/emotions-train.arff

2. Clustering – How to interpret the results from a clustering algorithm?

In this case, you will show that you can visualize clusters and interpret them, similar to the two examples given: the Bank dataset (“bank-data-final.arff”) and the Human Activity Recognition dataset. Use your chosen dataset.

  • Apply Weka K-means or any other clustering algorithm you prefer from any DM software. Create an appropriate number of clusters over your chosen dataset(s).
  • Visualize your clusters and interpret the results in your own words. Write a short paragraph commenting on what you observe from the clustered data.

3. Clustering – How to combine clustering and association rule mining?

It is well known that in business, especially in the marketing industry, companies want to do “targeted marketing” or “segmented marketing” instead of mass marketing. Try to find a suitable dataset of your own for association analysis from an online archive.

  • Firstly, find the most appropriate clustering method for dividing up your chosen dataset into x number of clusters (where x ≥ 2)
  • Run Apriori in Weka on the original dataset and on each cluster, then copy and paste the interesting association rules into your Excel file, focusing on what you want to discover from the rules.
  • Compare the qualities of the rules you produced from the original (whole) dataset and from the clustered datasets.
  • Briefly summarize your findings and discuss your observations.
  • It is important to show that you are able to give meaningful interpretation.
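To compare rule quality between the whole dataset and a cluster, as the steps above ask, it helps to recompute support and confidence for the same rule on each subset. The sketch below does this for one hand-picked rule; the transactions, the cluster labels and the function name are all hypothetical, and Apriori itself is not implemented here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data; the second field is a made-up cluster label.
labelled = [
    ({"bread", "butter"}, 0),
    ({"bread", "butter", "milk"}, 0),
    ({"bread", "butter"}, 0),
    ({"bread", "milk"}, 1),
    ({"bread"}, 1),
    ({"milk", "beer"}, 1),
]
whole = [items for items, _ in labelled]
cluster0 = [items for items, cluster in labelled if cluster == 0]

whole_metrics = rule_metrics(whole, {"bread"}, {"butter"})
cluster_metrics = rule_metrics(cluster0, {"bread"}, {"butter"})
print("whole dataset:", whole_metrics)
print("cluster 0:    ", cluster_metrics)
```

In this toy case the rule is much stronger inside cluster 0 than over the whole dataset, which is the kind of difference the comparison is meant to surface.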

Please refer to these papers for details:

Visual clustering-based apriori ARM methodology for obtaining quality association rules.

(https://dl.acm.org/citation.cfm?id=3108450)

Fast Cluster-learning with Prior Probability from Big Dataset.

(https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8703219)

4. Clustering – How to combine clustering and classification?

Clustering can be used as a pre-processing step. In this case, clustering adds extra information as a new feature in the training dataset. The cluster membership included as a new feature is supposed to improve the training of the classification model. Please familiarize yourself with the demo experiment using the “emotion-train.arff” and “emotion-test.arff” datasets. Then choose your own dataset to try out this clustering pre-processing for enhancing classification training performance.

  • Select several clustering algorithms, apply them over your chosen dataset.
  • The membership information generated from each clustering algorithm is added as a new feature to the training dataset.
  • Select no more than 3 classification algorithms of your choice.
  • Test the trained classification models with all of the options (full training set, 10-fold CV, 66% split and a separate testing dataset) and record the performance.
  • Discuss your results in your report.
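The pre-processing idea in the steps above (cluster first, then add the membership as a new feature) can be sketched with a tiny k-means. This is an illustration only: the deterministic initialization, the helper names and the toy rows are assumptions, and in Weka the same effect is usually obtained with a filter such as AddCluster rather than hand-written code.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def add_cluster_feature(rows, assignment):
    """Insert the cluster ID as an extra nominal attribute just before the class column."""
    return [row[:-1] + (f"cluster{a}",) + row[-1:] for row, a in zip(rows, assignment)]


# Hypothetical training rows: two numeric features followed by the class label.
rows = [(1.0, 1.0, "calm"), (9.0, 9.0, "angry"), (1.2, 0.8, "calm"), (8.8, 9.1, "angry")]
features = [row[:-1] for row in rows]  # the class column is ignored while clustering
assignment, centroids = kmeans(features, k=2)
augmented = add_cluster_feature(rows, assignment)
print(augmented)
```

When a separate testing dataset is used, the clusters should be learned from the training data only, and each test instance should receive the ID of its nearest training centroid so that no test information leaks into training.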

Submission:

Submit your experiment report with a write-up and screen captures of the process and results in MS Excel, together with all the materials (both training and testing datasets in ARFF or CSV format, before and after your preprocessing), as a single RAR compressed file to UMMOODLE by the due date.

Additional Options:

The tasks listed above are the fundamental requirements for passing this assignment with good marks. If you want to score an extremely high mark, consider doing the following more challenging tasks.

Challenge 1: Try using Weka Knowledge Flow or RapidMiner files to better organize the 4 tasks.

Challenge 2: In Task 4, what extra features other than cluster membership can be added to classification model training for better machine learning?

Challenge 3: In Task 4, does adding an extra feature from clustering also work well for data stream mining in MOA?

Consultation & Assistance:

Whenever you encounter any problem in this assignment process, please feel free to contact me for help. I would be very happy to assist you in solving problems relating to your assignments.

Reference:

UCI Dataset:

http://archive.ics.uci.edu/ml/datasets.html

KEEL imbalanced data site:

http://sci2s.ugr.es/keel/imbalanced.php

Kaggle Data Mining Competition Website

https://www.kaggle.com/

Datatang (big data trading platform)

http://www.datatang.com/
