Data Modelling and Analysis
Instructions
This coursework is organised into three parts, each focusing on a different and important aspect of data modelling: Data Analysis and Pre-processing, Data Mining, and Data Classification. All parts use the same data.
The first part focuses on describing and visualising the data and preparing it for subsequent treatment. The second part focuses on studying the effects that different methods or different datasets have on clustering. Finally, the third part focuses on classification and prediction.
The main goal is to give you first-hand experience of working with a relatively large, real data set, from the earliest stages of data description to the later stages of knowledge extraction and prediction.
Data Set
The data set is a slightly modified version of a real-world plant data set. The data concerns the classification of plant species from different measurements taken from photographs of these plants. Each record consists of several attribute columns (input), and one class column (output) corresponding to the information about the type of plant.
The attributes are markers that have been determined by assessment of different features of each plant, and the class variable is a provisional labelling of the type of plant. The entire data set consists of over 700 instances (plants studied). Some of the variables contain missing values, which are indicated by empty entries.
Software
You are required to use only R and Weka, as indicated in the details below.
Data Report Deliverable
You will need to submit a written report describing all the analysis conducted. The report should be between 2000 and 3000 words long and no more than twenty sides of A4, excluding the cover page but including all tables and figures.
Number all of your pages and make sure to include your name and student ID on the front page.
The minimum font size allowed is 11pt (a full page of text in a similar style to this document would contain about 500 words, so the majority of the 20 sides will be tables and figures). The report should clearly explain what you did with the data, how you did it and why you did it, and it should be well structured and illustrated.
Your report should contain three sections in total, as described below. You must not include any code or raw output (e.g. the output of R commands) in the main body of your report. Note that appendices do not contribute to the word count and are not explicitly marked: they are for reference only.
Code Deliverable
You will also have to submit a copy of your code. Your code should be organised in such a way that I can run it and replicate your results. A separate link on Moodle will be available for you to submit your code.
You should only make use of the libraries that we have seen in class.
The simplest way of organising your code is to create two folders, one for each part. Then have a main.r in each, and include the code for each exercise either as a) external functions (in separate R files) that you source in your main script, or as b) auxiliary functions inside your main script.
Write your name and student ID at the beginning of each and every script you submit.
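As an illustration of the layout described above, a main.r might look like the minimal sketch below. The file name helpers.r, the function count_missing and the toy data frame are hypothetical and are not part of the coursework materials.

```r
# main.r (Part 1)
# Name: <your name>        (placeholder)
# Student ID: <your ID>    (placeholder)

# Option (a): helper functions kept in a separate file and sourced here.
# "helpers.r" is a hypothetical file name; the guard keeps this sketch
# runnable even when the file does not exist.
if (file.exists("helpers.r")) source("helpers.r")

# Option (b): an auxiliary function defined inside main.r itself.
count_missing <- function(df) {
  colSums(is.na(df))  # number of NA entries per column
}

# Toy data standing in for the real data set (hypothetical column names).
toy <- data.frame(area = c(1.2, NA, 3.4), orientation4 = c(NA, NA, 0.5))
missing_per_attribute <- count_missing(toy)
print(missing_per_attribute)
```

Keeping each exercise in its own function makes it easier for a marker to run main.r from top to bottom and replicate the results.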
Marks
Part 1 carries 35 marks, while Part 2 and Part 3 carry 20 marks each. In total, the report accounts for 75 marks. Marks will only be awarded for the first twenty pages of the main body of your report.
The remaining 25 marks will be awarded for the quality of the code for Part 1 and Part 2.
Assessment Criteria
The main assessment criteria for the report are:
- Correctness: that is, do you apply techniques correctly; do you make correct assumptions; do you interpret the results in an appropriate manner; etc.?
- Completeness: that is, do you apply a technique only to small subsets of the data; do you apply only one technique, when there are multiple alternatives; do you consider all options; etc.?
- Originality: that is, do you combine techniques in new and interesting ways; do you make any new and/or interesting findings with the data?
- Argumentation: that is, do you explain and justify all of your choices?
The main assessment criteria for the code are:
- Correctness: is the code working as it is supposed to? Does it solve the questions in the coursework? Do you use the correct functions?
- Completeness: is your code doing everything it is supposed to? Are you applying it to the correct datasets?
- Organisation: is your code well organised? Is it easy to follow? Is it consistent (i.e. consistent names for variables, functions, etc.)?
- Style: what is the quality of your code? Are you using informative names for variables and functions? Are you taking advantage of R's functionality (e.g. using apply() or aggregate() instead of nested loops)?
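To make the style criterion concrete, the sketch below computes per-attribute and per-class means without explicit loops. The data frame and its column names are made up for illustration.

```r
# Toy data (hypothetical values and column names).
plants <- data.frame(
  class        = c("A", "A", "B", "B"),
  area         = c(1, 3, 10, 14),
  orientation4 = c(0.2, 0.4, 0.6, 1.0)
)

# One mean per attribute, instead of a for-loop over columns.
attribute_means <- sapply(plants[, c("area", "orientation4")], mean)

# One mean per attribute and class, instead of nested loops over
# classes and columns.
class_means <- aggregate(cbind(area, orientation4) ~ class,
                         data = plants, FUN = mean)

print(attribute_means)
print(class_means)
```

The vectorised versions are shorter and state the intent (a summary per column, a summary per group) directly.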
Plagiarism and Collusion vs. Group Discussions
As you should know, plagiarism and collusion are completely unacceptable and will be dealt with according to the University’s standard policies. Having said this, we do encourage students to have general discussions regarding the coursework with each other in order to promote the generation of new ideas and to enhance the learning experience. Please be very careful not to cross the boundary into plagiarism. The important part is that when you sit down to actually do the data analysis/mining and write about it, you do it individually. If you do this, and you truly understand what you have written, you will not be guilty of plagiarism. Do NOT, under any circumstances, share code or share figures, graphs or charts, etc. As examples, saying to someone, “I used a Pivot Table in Excel to do the cross tabulations” is completely fine; whereas Copying & Pasting the actual Pivot Table itself would be plagiarism.
- The submission deadline is on the 11th of May (Monday) at 15:00.
- Name your report DMA-Cwk-XXX.pdf, where XXX should be replaced with your student ID number (e.g. DMA-Cwk-4078181), and submit the single PDF document via Moodle (see website for details).
- Save all of your code in a ZIP file. Name it DMA-Code-XXX.zip, where XXX should be replaced with your student ID number, and submit a single file via Moodle (see website for details).
- Make sure your full name and student ID are shown on the first page of your
1 ANALYSIS AND PRE-PROCESSING
This part of the coursework carries 35 marks for the report and 10 marks for the code quality.
1.1 Explore the data
- Provide a table for all the attributes of the dataset, including measures of centrality, dispersion, and how many missing values each attribute has.
- Produce histograms for each attribute and characterise all the distributions. Provide details on how you created the histograms and comment on the distribution of data. You may also use descriptive statistics to help you characterise the shape of the distribution.
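A minimal sketch of the summary table and a histogram in R follows. It assumes the data arrive as CSV with empty entries for missing values; the inline CSV text and the column names are invented for illustration, and mean/median/standard deviation/IQR are just one possible choice of centrality and dispersion measures.

```r
# Inline CSV standing in for the real file; empty fields are missing values.
csv_text <- "area,orientation4,class
1.5,0.2,A
,0.4,A
3.5,,B
5.0,0.9,B"
plants <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Restrict the table to the numeric (input) attributes.
numeric_cols <- names(plants)[sapply(plants, is.numeric)]

summary_table <- data.frame(
  attribute = numeric_cols,
  mean      = sapply(plants[numeric_cols], mean,   na.rm = TRUE),
  median    = sapply(plants[numeric_cols], median, na.rm = TRUE),
  sd        = sapply(plants[numeric_cols], sd,     na.rm = TRUE),
  iqr       = sapply(plants[numeric_cols], IQR,    na.rm = TRUE),
  missing   = sapply(plants[numeric_cols], function(x) sum(is.na(x))),
  row.names = NULL
)
print(summary_table)

# Histogram bins for one attribute; plot = FALSE returns the bin counts
# without drawing. Drop that argument to draw the histogram.
area_hist <- hist(plants$area, breaks = "Sturges", plot = FALSE)
print(area_hist$counts)
```

Recording the bin rule used (here Sturges, R's default) is one way to document how the histograms were created.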
1.2 Explore the relationships between the attributes, and between the class and the attributes
- Calculate the correlations and produce scatterplots for the variables: orientation 4 and orientation
- What does this correlation tell you about the relationships of these variables?
- Produce scatterplots between the class variable and orientation 4, orientation 6 and area. What do these tell you about the relationships between these three variables and the class?
- Produce boxplots for all of the appropriate attributes in the dataset. Group each variable according to the class attribute.
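The correlation, scatterplot and grouped boxplot steps could be sketched as below. The values are invented, the column names orientation4, orientation6 and area are assumed spellings of the attributes named above, and a null graphics device is opened only so that the sketch runs without producing files.

```r
# Toy data (hypothetical values); one missing entry in orientation4.
plants <- data.frame(
  orientation4 = c(0.10, 0.20, NA, 0.40, 0.50, 0.60),
  orientation6 = c(0.30, 0.55, 0.70, 0.85, 1.10, 1.30),
  area         = c(12, 15, 11, 30, 28, 33),
  class        = factor(c("A", "A", "A", "B", "B", "B"))
)

# Pearson correlation using only rows where both values are present.
r_46 <- cor(plants$orientation4, plants$orientation6, use = "complete.obs")
print(r_46)

pdf(NULL)  # null device: nothing is written; remove to see the plots

# Attribute against attribute.
plot(plants$orientation4, plants$orientation6,
     xlab = "orientation 4", ylab = "orientation 6")

# Attribute against class: one jittered column of points per class.
stripchart(area ~ class, data = plants, vertical = TRUE,
           method = "jitter", xlab = "class", ylab = "area")

# Boxplot of an attribute grouped by class; the returned list holds
# the five-number summaries that were drawn.
box_stats <- boxplot(area ~ class, data = plants,
                     xlab = "class", ylab = "area")

invisible(dev.off())
```

The third row of box_stats$stats contains the per-class medians, which can be quoted in the report instead of raw R output.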
1.3 General Conclusions
Take into consideration all the descriptive statistics, the visualisations and the correlations you produced, together with the missing values, and comment on the importance of the attributes. Which of the attributes seem to hold significant information, and which can you regard as insignificant? Provide an explanation for your choice.
1.4 Dealing with missing values in R
- Replace missing values in the dataset using three strategies: replacement with 0, mean and
- Define, compare and contrast these approaches and their effects on the
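A minimal sketch of missing-value replacement is shown below for the zero and mean strategies; any further strategy can be plugged in by passing a different fill function. The helper impute_numeric and the toy data are hypothetical.

```r
# Replace NA in every numeric column with the value returned by fill_fun,
# which receives the column (including its NAs) and returns one number.
impute_numeric <- function(df, fill_fun) {
  numeric_cols <- sapply(df, is.numeric)
  df[numeric_cols] <- lapply(df[numeric_cols], function(x) {
    x[is.na(x)] <- fill_fun(x)
    x
  })
  df
}

# Toy data (hypothetical values); the class column is left untouched.
raw <- data.frame(
  area         = c(2, NA, 4),
  orientation4 = c(NA, 1, 3),
  class        = c("A", "B", "B")
)

data_zero <- impute_numeric(raw, function(x) 0)
data_mean <- impute_numeric(raw, function(x) mean(x, na.rm = TRUE))

print(data_zero)
print(data_mean)
```

Each call returns a new data frame and leaves raw unchanged, so every strategy starts from the same raw data.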
1.5 Attribute transformation
Using the three datasets generated in 1.4 (the missing-value replacements), explore the use of three transformation techniques (mean centering, normalisation and standardisation) to scale the attributes. Define, compare and contrast these approaches and their effects on the data.
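The three transformations can be sketched in a few lines of R. This sketch assumes that "normalisation" means min-max scaling to the range [0, 1] and "standardisation" means z-scores; if the course defines them differently, the functions should be adjusted. The toy values are invented.

```r
# Toy numeric attributes with no missing values (hypothetical).
x <- data.frame(area = c(10, 20, 30), orientation4 = c(0.2, 0.4, 0.9))

# Mean centering: subtract the column mean, keep the original spread.
centred <- as.data.frame(scale(x, center = TRUE, scale = FALSE))

# Normalisation (assumed here to be min-max): rescale each column to [0, 1].
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
normalised <- as.data.frame(lapply(x, min_max))

# Standardisation: subtract the mean and divide by the standard deviation.
standardised <- as.data.frame(scale(x, center = TRUE, scale = TRUE))

print(centred)
print(normalised)
print(standardised)
```

Comparing the column means, ranges and standard deviations before and after each transformation is a simple way to describe their effects.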
1.6 Attribute / instance selection
- Starting again from the raw data, consider attribute and instance deletion strategies to deal with missing values. Choose a number of missing values per instance or per attribute and delete instances or attributes accordingly. Explain your choices and their effects on the dataset.
- Starting from the raw data, use correlations between attributes to reduce the number of attributes. Try to reduce the dataset to contain only uncorrelated attributes and no missing values. Explain your choices and their effects on the dataset.
- Starting from an appropriate version of the dataset, use Principal Component Analysis to create a data set with eight attributes. Explain the process and the result obtained.
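The deletion and correlation-based steps could look like the sketch below. The thresholds (50% missing per attribute, no missing values per instance, absolute correlation above 0.9) are arbitrary illustrative choices, the two steps are chained only for brevity, and drop_correlated is a hypothetical helper implementing one simple rule among several possible ones.

```r
# Toy raw data (hypothetical): b is almost 2 * a, d is mostly missing.
raw <- data.frame(
  a = c(1, NA, 3, 4, 5, 6),
  b = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.1),
  c = c(5, 1, 4, 2, 6, 3),
  d = c(NA, NA, NA, NA, 1, 2)
)

# Attribute deletion: drop columns with too large a share of missing values.
max_missing_fraction <- 0.5
missing_fraction <- colMeans(is.na(raw))
kept_attributes <- raw[, missing_fraction <= max_missing_fraction, drop = FALSE]

# Instance deletion: drop rows with more missing values than allowed.
max_missing_per_instance <- 0
row_missing <- rowSums(is.na(kept_attributes))
kept_instances <- kept_attributes[row_missing <= max_missing_per_instance, ,
                                  drop = FALSE]

# Correlation filter: drop a column if it is strongly correlated with any
# column that appears before it in the data frame.
drop_correlated <- function(df, cutoff = 0.9) {
  cors <- abs(cor(df))
  cors[upper.tri(cors, diag = TRUE)] <- 0  # look at each pair only once
  to_drop <- apply(cors, 1, function(r) any(r > cutoff))
  df[, !to_drop, drop = FALSE]
}
reduced <- drop_correlated(kept_instances)
print(names(reduced))
```

Reporting how many instances and attributes survive each threshold helps to justify the chosen values.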
As a result, you will end up with several different sets of data to be used in Sections 2 & 3. Give each set of data a clear and distinct name, so that you can easily refer to it again in the later stages.
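For the Principal Component Analysis step and the naming advice above, a minimal sketch is given below. It uses prcomp() from base R on random stand-in data with ten attributes; scaling the attributes before PCA is an assumption of this sketch, and the dataset names in the list are only examples.

```r
set.seed(42)

# Random stand-in for a complete (no missing values) numeric dataset:
# 30 instances, 10 attributes.
complete_data <- as.data.frame(matrix(rnorm(30 * 10), nrow = 30, ncol = 10))

# PCA on centred and scaled attributes.
pca <- prcomp(complete_data, center = TRUE, scale. = TRUE)

# Keep the first eight principal components as the new attributes.
n_components <- 8
pca_data <- as.data.frame(pca$x[, 1:n_components])

# Share of the total variance explained by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(explained), 3))

# Keep every derived dataset under a clear, distinct name.
datasets <- list(
  complete    = complete_data,
  pca_reduced = pca_data
)
print(names(datasets))
```

A named list such as datasets makes it straightforward to loop over the alternatives in the clustering and classification parts.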
2 CLUSTERING
This part of the coursework carries 20 marks for the report and 15 marks for the code quality.
Using only R, explore the use of clustering techniques to find natural groupings in the data, without using the class variable – i.e. use only the numeric (input) attributes to perform the clustering. Once the data is clustered, you may use the class variable to evaluate or interpret the results (how do the new clusters compare to the original classes?).
- Choose an appropriate dataset and use hierarchical, k-means, and PAM clustering algorithms to partition the data into five clusters, and write up the results. Which dataset have you used? Use a combination of internal and external metrics to evaluate which algorithm produces better results when compared to the class attribute.
- Choose an appropriate dataset. Optimise each clustering method according to two or more parameters. Which parameters produce the best results for each clustering algorithm? Explain the reasoning behind the techniques you used to find the optimal parameters.
- Choose one of the clustering algorithms above and apply it to these alternative datasets that you produced in Part 1:
  - The reduced data set featuring 10 Principal
  - The dataset after deletion of instances and
  - The three datasets after you replaced missing values with the three
  - Which of these datasets had a positive impact on the quality of the clustering? Provide explanations using the results for each clustering of the alternative data set.
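The clustering tasks above could be approached along the lines of the sketch below. It uses synthetic data with five well-separated groups, assumes the cluster package (for pam() and silhouette()) is among the libraries covered in class, and picks purity as the external metric and mean silhouette width as the internal metric purely as examples.

```r
library(cluster)  # pam() and silhouette()

set.seed(123)

# Synthetic data: five groups of 20 points around distinct centres.
centres <- cbind(x = c(0, 10, 20, 30, 40), y = c(0, 10, 0, 10, 0))
true_class <- rep(1:5, each = 20)
features <- centres[true_class, ] + matrix(rnorm(200, sd = 0.5), ncol = 2)
distances <- dist(features)

# Five clusters from each algorithm; the class is not used here.
k <- 5
hc_labels  <- cutree(hclust(distances, method = "average"), k = k)
km_labels  <- kmeans(features, centers = k, nstart = 50)$cluster
pam_labels <- pam(features, k = k)$clustering

# External metric: purity, the share of points in their cluster's
# majority class.
purity <- function(labels, classes) {
  sum(apply(table(classes, labels), 2, max)) / length(classes)
}

# Internal metric: mean silhouette width.
mean_sil <- function(labels) {
  mean(silhouette(labels, distances)[, "sil_width"])
}

results <- data.frame(
  algorithm  = c("hierarchical", "k-means", "PAM"),
  purity     = sapply(list(hc_labels, km_labels, pam_labels),
                      purity, classes = true_class),
  silhouette = sapply(list(hc_labels, km_labels, pam_labels), mean_sil)
)
print(results)

# Parameter search example: average silhouette width of PAM for several k.
pam_sil_by_k <- sapply(2:7, function(kk) pam(features, k = kk)$silinfo$avg.width)
print(round(pam_sil_by_k, 3))
```

The same purity and mean_sil helpers can be reused unchanged when the clustering is repeated on the alternative datasets from Part 1.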
3 CLASSIFICATION
This part of the coursework carries 20 marks. You must use Weka to perform the classification, but you may use R to present results. Use Weka classification techniques to create models that predict the given class from the input attributes.
- Choose an appropriate dataset to obtain predictions using the following classifiers: ZeroR, OneR, NaïveBayes, IBk (k-NN) and J48 (C4.5). Which evaluation protocol did you use? Which dataset have you used? Which algorithm produces the best results? Use a combination of metrics to justify your reasoning.
- Choose one of the classification algorithms above and use 5-fold cross-validation. Optimise the classifier of your choice with respect to at least two parameters. Describe each parameter and show the results of your experimentation.
- Use J48 and the datasets below. Provide explanations on the performance of the datasets using a combination of metrics.
  - A reduced data set using 10 Principal
  - The dataset after deletion of instances and
  - The three datasets after you replaced missing values with the three
  - Which of the datasets had a good impact on the predictive ability of the algorithm? Provide explanations using the results for each classification of the alternative data set.
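Since R may be used to present the Weka results, the sketch below derives a few common metrics from a confusion matrix. The matrix is entirely made up and does not come from the coursework data; in practice its entries would be copied from Weka's output.

```r
# Made-up confusion matrix: rows are actual classes, columns are predictions.
confusion <- matrix(
  c(8, 2, 0,
    1, 7, 2,
    0, 1, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("A", "B", "C"), predicted = c("A", "B", "C"))
)

# Overall accuracy: correctly classified instances over all instances.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class precision (column-wise) and recall (row-wise).
precision <- diag(confusion) / colSums(confusion)
recall    <- diag(confusion) / rowSums(confusion)

# Per-class F1 score: harmonic mean of precision and recall.
f1 <- 2 * precision * recall / (precision + recall)

metrics <- data.frame(class = rownames(confusion),
                      precision = precision, recall = recall, f1 = f1,
                      row.names = NULL)
print(accuracy)
print(metrics)
```

A table like metrics, built once per classifier or dataset, supports the comparison with a combination of metrics rather than accuracy alone.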