Data analysis assignment help – R language assignment help – Linear model assignment help – STA302H1S

STA302H1S / 1001 HS Autumn 2019 Assignment # 3

A Multiple Linear Model for Property Sale Price

 

 

Posted: 7am, Thursday, November 21, 2019

Due: In Crowdmark by 10pm on Wednesday, December 4, 2019.

Late assignments will be subject to a penalty of 20% per day late. Submissions will not be accepted more than 48 hours after the due date. Email submissions are not allowed.

 

Instructions:

  • Use R (or RStudio) to do the data analysis.
  • Use a benchmark significance level of 5%. Where appropriate, use 4 decimal places in your numerical answers.
  • Compile your solution as a PDF document (Word, LaTeX or R Markdown can be your base).
  • Presentation of solutions is very important! Your assignment should have two main sections: Solutions and Appendix. In the Solutions section, include relevant plots and quote relevant numbers from your R output. In the Appendix, include a legible copy of your R code and other extraneous output.
  • If you work with other students on this assignment, indicate their names on your cover page or at the top of your assignment.
  • Write and submit your own work; your solutions should not be the same as another student’s solutions. For instance, personalize your code as much as possible, using your first name. All plots produced must be given a title with the last 4 digits of your student number.
  • Where appropriate, write your interpretations in plain English.

Grading Scheme:

This assignment has 6 parts (or 7 parts for Graduate Students); each part is worth 3 marks. Additionally, a maximum of 3 marks will be awarded for excellent presentation and your appendix. A general marking scheme for each part is given below:

The Data

Over the past 10 years, house prices in the Greater Toronto Area (GTA) have been a major issue. For this assignment we extend our work in Assignment 2 to find a more complex linear model that home buyers can use to predict the sale price of single-family, detached homes across two GTA neighbourhoods. The data for this assignment is available in the file “reale a3data.csv” on the assignment 3 page. The property-based variables in the dataset are:

  • Case ID: property identification
  • sale: the actual sale price of the property
  • list: the last list price of the property
  • bedrooms: the total number of bedrooms
  • bathrooms: the number of bathrooms
  • lotwidth: the frontage in feet
  • lotlength: the length in feet of one side of the property
  • maxsqfoot: the maximum possible square feet of the property
  • taxes: previous year’s property tax
  • location: X – Neighbourhood X, O – Neighbourhood O
  1. First, clean the data by removing the three cases with missing values. Second, create a variable named ‘lotsize’ by multiplying lotwidth by lotlength. Then use this updated data for this part and the successive parts of this assignment.

Produce the pairwise correlations and scatterplot matrix for all pairs of quantitative variables in the data. The quantitative variables are sale price, as the response variable, and 8 other predictors. Rank the quantitative predictors of sale price by their correlation coefficients, from highest to lowest.
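The cleaning and correlation steps in part 1 might be sketched as follows. This is only an illustration under stated assumptions: the course CSV is not available here, so the block builds a simulated stand-in whose column names (ID, sale, list, and so on) are assumed to match the variable list above, and every number it produces is synthetic. With the real data, the simulated data frame would be replaced by a read.csv() call on the course file.

```r
# Simulated stand-in for the course data (ASSUMPTION: column names follow the
# variable list in the handout; all values are synthetic, not real sales).
set.seed(302)
n <- 60
a3data <- data.frame(
  ID        = seq_len(n),
  bedrooms  = sample(2:6, n, replace = TRUE),
  bathrooms = sample(1:5, n, replace = TRUE),
  lotwidth  = round(runif(n, 20, 80)),
  lotlength = round(runif(n, 80, 200)),
  location  = factor(sample(c("X", "O"), n, replace = TRUE))
)
a3data$maxsqfoot <- 800 + 400 * a3data$bedrooms + round(rnorm(n, 0, 300))
a3data$taxes     <- 2000 + 2 * a3data$maxsqfoot + round(rnorm(n, 0, 800))
a3data$list      <- 2e5 + 150 * a3data$taxes + round(rnorm(n, 0, 1e5))
a3data$sale      <- a3data$list * runif(n, 0.9, 1.1)
a3data$lotwidth[c(5, 17)] <- NA   # inject three incomplete cases to mimic the raw file
a3data$taxes[40] <- NA

# Step 1: drop the cases with missing values, then create lotsize.
a3clean <- na.omit(a3data)
a3clean$lotsize <- a3clean$lotwidth * a3clean$lotlength

# Step 2: pairwise correlations for the quantitative variables only
# (ID and location are excluded).
quant <- c("sale", "list", "bedrooms", "bathrooms", "lotwidth",
           "lotlength", "maxsqfoot", "taxes", "lotsize")
cors <- cor(a3clean[, quant])
print(round(cors, 4))

# Step 3: rank the 8 predictors by their correlation with sale, highest first.
sale_rank <- sort(cors["sale", quant != "sale"], decreasing = TRUE)
print(round(sale_rank, 4))

# Step 4: scatterplot matrix, written to a PDF in the session's temp directory.
# "1234" stands in for the last 4 digits of a student number.
pdf(file.path(tempdir(), "scatterplot_matrix.pdf"))
pairs(a3clean[, quant], main = "Scatterplot matrix (1234)")
invisible(dev.off())
```

The ranking here sorts by signed correlation; if some predictors were negatively correlated with sale, sorting by absolute value would be the alternative to consider.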

  2. (i) Based on the scatterplot matrix in part 1, for which single predictor of sale price would the assumption of constant variance be strongly violated? (ii) Confirm your answer by showing an appropriate plot of the (standardized) residuals. (iii) Suggest the next step to take to overcome this violation if we are to build a model for sale price with this predictor.

  3. (i) Fit a multiple linear regression model with all 9 explanatory variables for sale price. (ii) List the estimated regression coefficients and the p-values for the corresponding t-tests for these coefficients. (iii) Interpret each estimated model coefficient whose t-test result is significant.
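A standardized-residual plot for a single-predictor model and the full nine-predictor fit could look roughly like the sketch below. It runs on simulated data with assumed column names, and the predictor used in the simple regression (lotsize) is an arbitrary placeholder, not a suggested answer to the question of which predictor violates constant variance.

```r
# Simulated stand-in data (ASSUMPTION: column names follow the handout;
# all values are synthetic).
set.seed(302)
n <- 120
a3data <- data.frame(
  ID        = seq_len(n),
  bedrooms  = sample(2:6, n, replace = TRUE),
  bathrooms = sample(1:5, n, replace = TRUE),
  lotwidth  = round(runif(n, 20, 80)),
  lotlength = round(runif(n, 80, 200)),
  location  = factor(sample(c("X", "O"), n, replace = TRUE))
)
a3data$maxsqfoot <- 800 + 400 * a3data$bedrooms + round(rnorm(n, 0, 300))
a3data$taxes     <- 2000 + 2 * a3data$maxsqfoot + round(rnorm(n, 0, 800))
a3data$list      <- 2e5 + 150 * a3data$taxes + round(rnorm(n, 0, 1e5))
a3data$sale      <- a3data$list * runif(n, 0.9, 1.1)
a3data$lotsize   <- a3data$lotwidth * a3data$lotlength

# Simple regression on ONE predictor (placeholder choice), then a plot of
# standardized residuals against fitted values to inspect the spread.
slr <- lm(sale ~ lotsize, data = a3data)
std_res <- rstandard(slr)
pdf(file.path(tempdir(), "standardized_residuals.pdf"))
plot(fitted(slr), std_res,
     xlab = "Fitted sale price", ylab = "Standardized residuals",
     main = "Standardized residuals vs fitted (1234)")
abline(h = 0, lty = 2)
invisible(dev.off())

# Full model with all 9 explanatory variables (8 quantitative plus location).
fullmodel <- lm(sale ~ list + bedrooms + bathrooms + lotwidth + lotlength +
                  maxsqfoot + taxes + lotsize + location, data = a3data)

# Estimates and t-test p-values, to 4 decimal places.
coef_table <- summary(fullmodel)$coefficients[, c("Estimate", "Pr(>|t|)")]
print(round(coef_table, 4))

# Terms whose t-test is significant at the 5% benchmark level.
sig_terms <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(sig_terms)
```

Because location is a two-level factor, R reports it as a single indicator coefficient, so the table has ten rows: the intercept plus nine slopes.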

  4. (i) Using a 2-by-2 layout, show the 4 diagnostic plots that are obtained in R by plotting the model in part 3 above. (ii) List the Case IDs for the points that may be considered influential. (iii) Specify the threshold (or rule) used to identify influential points.

  5. One commonly used method to find a parsimonious model is stepwise regression with AIC. Backward elimination starts with all the potential predictors in the model, then removes the predictor with the largest p-value each time to give a smaller AIC.

The forward selection method is the reverse of the backward method. It starts with no explanatory variable in the model, then adds one predictor at a time (the one with the smallest p-value) until no further variables can be added to produce a smaller AIC value. Stepwise regression alternates forward steps with backward steps. The idea is to end up with a model where no variables are redundant given the other variables in the model. Often, in practice, backward elimination and forward selection will produce the same ‘final’ model.

Start with the full model (‘fullmodel’) fitted in part 3 and use backward elimination with AIC. What is the final model (write the fitted model)? Are the results consistent with those in part 3?
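The diagnostic plots of part 4 and the backward elimination of part 5 both start from the same full model, so one sketch can cover them. The data below is simulated with assumed column names. The influence rules used (Cook's distance above 4/n, leverage above twice the average) are common rules of thumb chosen for illustration; the rule expected in the course may differ. Note also that R's step() drops the term whose removal lowers AIC the most, rather than literally the term with the largest p-value.

```r
# Simulated stand-in data (ASSUMPTION: column names follow the handout;
# all values are synthetic).
set.seed(302)
n <- 120
a3data <- data.frame(
  ID        = seq_len(n),
  bedrooms  = sample(2:6, n, replace = TRUE),
  bathrooms = sample(1:5, n, replace = TRUE),
  lotwidth  = round(runif(n, 20, 80)),
  lotlength = round(runif(n, 80, 200)),
  location  = factor(sample(c("X", "O"), n, replace = TRUE))
)
a3data$maxsqfoot <- 800 + 400 * a3data$bedrooms + round(rnorm(n, 0, 300))
a3data$taxes     <- 2000 + 2 * a3data$maxsqfoot + round(rnorm(n, 0, 800))
a3data$list      <- 2e5 + 150 * a3data$taxes + round(rnorm(n, 0, 1e5))
a3data$sale      <- a3data$list * runif(n, 0.9, 1.1)
a3data$lotsize   <- a3data$lotwidth * a3data$lotlength

fullmodel <- lm(sale ~ list + bedrooms + bathrooms + lotwidth + lotlength +
                  maxsqfoot + taxes + lotsize + location, data = a3data)

# Part 4 (i): the four default diagnostic plots in a 2-by-2 layout.
pdf(file.path(tempdir(), "diagnostics.pdf"))
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(fullmodel)
mtext("Diagnostics for fullmodel (1234)", outer = TRUE)
invisible(dev.off())

# Part 4 (ii)-(iii): flag cases with rule-of-thumb thresholds (an assumption).
cd  <- cooks.distance(fullmodel)
lev <- hatvalues(fullmodel)
p1  <- length(coef(fullmodel))            # number of estimated coefficients
flagged_ids <- a3data$ID[cd > 4 / n | lev > 2 * p1 / n]
print(flagged_ids)

# Part 5: backward elimination with AIC (step() uses penalty k = 2 by default).
back_aic <- step(fullmodel, direction = "backward", trace = 0)
print(formula(back_aic))
print(round(coef(back_aic), 4))
```

Setting trace = 1 instead would print each elimination step with its AIC, which can be useful output for an appendix.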

  6. Use BIC instead of AIC and repeat part 5. What is the final model? Are the results consistent with what you saw in parts 3 and 5? Explain. Here is some R code you may use for parts 5 & 6:
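The listing referred to above is not reproduced in this copy. The block below is not that listing; it is a separate, hedged sketch of one way to switch the penalty from AIC to BIC, using the k argument of step() with k = log(n). It runs on simulated data with assumed column names.

```r
# Simulated stand-in data (ASSUMPTION: column names follow the handout;
# all values are synthetic).
set.seed(302)
n <- 120
a3data <- data.frame(
  ID        = seq_len(n),
  bedrooms  = sample(2:6, n, replace = TRUE),
  bathrooms = sample(1:5, n, replace = TRUE),
  lotwidth  = round(runif(n, 20, 80)),
  lotlength = round(runif(n, 80, 200)),
  location  = factor(sample(c("X", "O"), n, replace = TRUE))
)
a3data$maxsqfoot <- 800 + 400 * a3data$bedrooms + round(rnorm(n, 0, 300))
a3data$taxes     <- 2000 + 2 * a3data$maxsqfoot + round(rnorm(n, 0, 800))
a3data$list      <- 2e5 + 150 * a3data$taxes + round(rnorm(n, 0, 1e5))
a3data$sale      <- a3data$list * runif(n, 0.9, 1.1)
a3data$lotsize   <- a3data$lotwidth * a3data$lotlength

fullmodel <- lm(sale ~ list + bedrooms + bathrooms + lotwidth + lotlength +
                  maxsqfoot + taxes + lotsize + location, data = a3data)

# AIC penalizes each parameter by 2; BIC penalizes it by log(n).
back_aic <- step(fullmodel, direction = "backward", k = 2, trace = 0)
back_bic <- step(fullmodel, direction = "backward", k = log(n), trace = 0)

# Compare which terms each criterion keeps.
terms_aic <- attr(terms(back_aic), "term.labels")
terms_bic <- attr(terms(back_bic), "term.labels")
print(terms_aic)
print(terms_bic)
print(round(coef(back_bic), 4))
```

Since log(n) exceeds 2 once n is 8 or more, the BIC penalty is the heavier one for data of this size, and the BIC search tends to keep no more terms than the AIC search.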

  7. (For credit for graduate students; optional, with no credit, for undergraduate students.) k-fold cross-validation is a standard approach to assessing the predictive ability of models by evaluating their performance on a new data set. The data set used to establish the model is called the training data set, and the data set used to evaluate the model is called the test data set. Carry out 2-fold cross-validation with stepwise regression using BIC by the following steps:

a. Set the seed of your randomization to be the last four digits of your student number. Randomly divide the data into two subsets called ‘onehalf’ and ‘nexthalf’. Here is some R code to do so:

    set.seed(1234) # note: replace 1234 with the last 4 digits of your student ID
    n=nrow(a3data) # assumed definition: n is not defined in the code shown here
    id2=sample(1:n, n/2)
    onehalf=a3data[id2,]
    nexthalf=a3data[-id2,]

b. Using the onehalf dataset, find the model chosen by stepwise regression. Then fit a model with these chosen variables to the data in the nexthalf dataset. Compare the estimated regression coefficients, and the p-values for the tests that these coefficients are zero, between the model fit to the onehalf dataset and the model fit to the nexthalf dataset.

c. Repeat part (b), reversing the roles of the two data sets. That is, use nexthalf as the training data and onehalf as the test data.
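Steps (a) to (c) can be strung together as in the sketch below. It uses simulated data with assumed column names, and the helper compare_fits() is a hypothetical name introduced only for this illustration. The stepwise calls are kept at top level, next to the data they use, because step() re-evaluates the model call and can fail to find the data when the fit is buried inside a function.

```r
# Simulated stand-in data (ASSUMPTION: column names follow the handout;
# all values are synthetic).
set.seed(302)
n <- 120
a3data <- data.frame(
  ID        = seq_len(n),
  bedrooms  = sample(2:6, n, replace = TRUE),
  bathrooms = sample(1:5, n, replace = TRUE),
  lotwidth  = round(runif(n, 20, 80)),
  lotlength = round(runif(n, 80, 200)),
  location  = factor(sample(c("X", "O"), n, replace = TRUE))
)
a3data$maxsqfoot <- 800 + 400 * a3data$bedrooms + round(rnorm(n, 0, 300))
a3data$taxes     <- 2000 + 2 * a3data$maxsqfoot + round(rnorm(n, 0, 800))
a3data$list      <- 2e5 + 150 * a3data$taxes + round(rnorm(n, 0, 1e5))
a3data$sale      <- a3data$list * runif(n, 0.9, 1.1)
a3data$lotsize   <- a3data$lotwidth * a3data$lotlength

# (a) Random split into two halves; 1234 stands in for a student number.
set.seed(1234)
id2 <- sample(1:n, n / 2)
onehalf  <- a3data[id2, ]
nexthalf <- a3data[-id2, ]

full_formula <- sale ~ list + bedrooms + bathrooms + lotwidth + lotlength +
  maxsqfoot + taxes + lotsize + location

# Hypothetical helper: estimates and p-values from both fits, side by side.
compare_fits <- function(train_fit, test_fit) {
  cols <- c("Estimate", "Pr(>|t|)")
  tr <- summary(train_fit)$coefficients[, cols, drop = FALSE]
  te <- summary(test_fit)$coefficients[, cols, drop = FALSE]
  out <- cbind(tr, te[match(rownames(tr), rownames(te)), , drop = FALSE])
  colnames(out) <- c("est_train", "p_train", "est_test", "p_test")
  round(out, 4)
}

# (b) Select on onehalf with stepwise BIC, then refit the chosen terms on nexthalf.
sel_one  <- step(lm(full_formula, data = onehalf), direction = "both",
                 k = log(nrow(onehalf)), trace = 0)
fit_next <- lm(formula(sel_one), data = nexthalf)
cmp_b <- compare_fits(sel_one, fit_next)
print(cmp_b)

# (c) Reverse the roles: select on nexthalf, refit on onehalf.
sel_next <- step(lm(full_formula, data = nexthalf), direction = "both",
                 k = log(nrow(nexthalf)), trace = 0)
fit_one  <- lm(formula(sel_next), data = onehalf)
cmp_c <- compare_fits(sel_next, fit_one)
print(cmp_c)
```

If the two halves select different terms, or a term that is significant in the training half loses significance in the test half, that is a sign the selected model may not generalize well.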

 
