STA302H1S / 1001 HS Autumn 2019 Assignment # 3
A Multiple Linear Model for Property Sale Price
Posted: 7am, Thursday, November 21, 2019
Due: In Crowdmark by 10pm on Wednesday, December 4, 2019.
Late assignments will be subject to a penalty of 20% per day late. Submissions will not
be accepted more than 48 hours after the due date. Email submissions are not allowed.
Instructions:
- Use R (or RStudio) to do the data analysis.
- Use a benchmark significance level of 5%. Where appropriate, use 4 decimal places in your numerical answers.
- Compile your solution as a PDF document (Word, LaTeX or R Markdown can be your base).
- Presentation of solutions is very important! Your assignment should have two main sections: Solutions and Appendix. In the Solutions section, include relevant plots and quote relevant numbers from your R output. Then, in the Appendix, include a legible copy of your R code and other extraneous output.
- If you work with other students on this assignment then indicate the names of the students on your cover page or on the top of your assignment.
- Write and submit your own work; your solutions should not be the same as another student’s solutions. For instance, personalize your code as much as possible, using your first name. All plots produced must be given a title with the last 4 digits of your student number.
- Where appropriate, write your interpretations in plain English.
Grading Scheme:
This assignment has 6 parts (or 7 parts for Graduate Students); each part is worth 3 marks. Additionally, a maximum of 3 marks will be awarded for excellent presentation and your appendix. A general marking scheme for each part is given below:
The Data
Over the past 10 years, house prices in the Greater Toronto Area (GTA) have been a major issue. For this assignment we extend our work in Assignment 2 to find a more complex linear model that home buyers can use to predict the sale price of single-family, detached homes across two GTA neighbourhoods. The data for this assignment is available in the file “reale a3data.csv” on the assignment 3 page. The property-based variables in the dataset are:
- Case ID: property identification
- sale: the actual sale price of the property
- list: the last list price of the property
- bedrooms: the total number of bedrooms
- bathrooms: the number of bathrooms
- lotwidth: the frontage in feet
- lotlength: the length in feet of one side of the property
- maxsqfoot: the maximum possible square feet of the property
- taxes: previous year’s property tax
- location: X - Neighbourhood X, O - Neighbourhood O
1. First, clean the data by removing the three cases with missing values. Second, create a variable named ‘lotsize’ by multiplying lotwidth by lotlength. Then use this updated data for this part and the successive parts of this assignment.
Produce the pairwise correlations and scatterplot matrix for all pairs of quantitative variables in the data. The quantitative variables include sale price as the response variable and 8 other predictors. Describe the ranking of the quantitative predictors of sale price in terms of their correlation coefficients, from highest to lowest.
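A minimal sketch of the cleaning and correlation steps, run on simulated stand-in data rather than the assignment file. The values in `a3data`, the planted missing cases, and the `1234` plot title are assumptions for illustration; only the column names follow the variable list above.

```r
# Simulated stand-in for the assignment data (all values are made up).
set.seed(1234)
n <- 60
a3data <- data.frame(
  list      = runif(n, 0.5, 3),                 # list price (assumed units)
  bedrooms  = sample(2:6, n, replace = TRUE),
  lotwidth  = runif(n, 20, 60),
  lotlength = runif(n, 80, 150)
)
a3data$sale <- 0.95 * a3data$list + rnorm(n, sd = 0.1)
a3data$lotwidth[c(3, 17, 40)] <- NA             # plant three incomplete cases

# Remove cases with missing values, then create lotsize.
a3data <- na.omit(a3data)
a3data$lotsize <- a3data$lotwidth * a3data$lotlength

# Pairwise correlations, rounded to 4 decimal places.
cors <- round(cor(a3data), 4)

# Predictors ranked by their correlation with sale, highest first.
sale_rank <- sort(cors["sale", colnames(cors) != "sale"], decreasing = TRUE)
print(sale_rank)

# Scatterplot matrix; the title carries a placeholder student number.
pairs(a3data, main = "Scatterplot matrix (1234)")
```

Here `na.omit()` drops every row that has at least one missing value, so it should be applied before `lotsize` is computed.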
2. (i) Based on the scatterplot matrix in part 1, for which single predictor of sale price would the assumption of constant variance be strongly violated? (ii) Confirm your answer by showing an appropriate plot of the (standardized) residuals. (iii) Suggest the next step to take to overcome this violation if we are to build a model for sale price with this predictor.
3. (i) Fit a multiple linear regression model with all 9 explanatory variables for sale price. (ii) List the estimated regression coefficients and the p-values for the corresponding t-tests for these coefficients. (iii) Interpret the estimated model coefficients for which the t-test result was significant.
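The residual plot and the coefficient listing asked for above could be produced along the following lines. This is a sketch on simulated data: the values, the choice of `taxes` as the single predictor, and the `1234` title are assumptions, not results from the assignment data.

```r
# Simulated stand-in data (made-up values; column names follow the variable list).
set.seed(1234)
n <- 80
a3data <- data.frame(
  list     = runif(n, 0.5, 3),
  taxes    = runif(n, 2, 12),
  location = factor(sample(c("X", "O"), n, replace = TRUE))
)
# Error spread grows with taxes, to mimic non-constant variance.
a3data$sale <- 0.2 * a3data$list + 0.1 * a3data$taxes +
  rnorm(n, sd = 0.03 * a3data$taxes)

# Simple regression on one predictor, then standardized residuals vs fitted values.
fit1 <- lm(sale ~ taxes, data = a3data)
plot(fitted(fit1), rstandard(fit1),
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Standardized residuals (1234)")
abline(h = 0, lty = 2)

# Multiple regression using every other column as a predictor.
fullmodel <- lm(sale ~ ., data = a3data)

# Estimates and t-test p-values, rounded to 4 decimal places.
coef_table <- round(
  summary(fullmodel)$coefficients[, c("Estimate", "Pr(>|t|)")], 4)
print(coef_table)
```

Because `location` is a factor with levels O and X, R reports a single coefficient, `locationX`, measured relative to the reference level O.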
4. (i) Using a 2-by-2 layout, show the 4 diagnostic plots that are obtained in R by plotting the model in part 3 above. (ii) List the Case IDs for the points that may be considered influential. (iii) Specify the threshold (or rule) used to identify influential points.
One commonly used method to find a parsimonious model is stepwise regression with AIC. Backward elimination starts with all the potential predictors in the model, then removes the predictor with the largest p-value each time to give a smaller AIC. The forward selection method is the reverse of the backward method. It starts with no explanatory variable in the model, then adds one predictor at a time (with the smallest p-value) until no further variables can be added to produce a smaller AIC value. Stepwise regression alternates forward steps with backward steps. The idea is to end up with a model where no variables are redundant given the other variables in the model. Often, in practice, backward elimination and forward selection will produce the same ‘final’ model.
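For the diagnostic plots and influential points asked about above, one possible approach is sketched here on simulated data with a deliberately planted unusual case. The cutoff 4/(n - p - 1) on Cook's distance is one common rule of thumb and is an assumption here; use whichever rule your course materials specify.

```r
# Simulated data with one planted high-leverage, large-residual case (row 5).
set.seed(1234)
n <- 80
x1 <- rnorm(n)
x2 <- rnorm(n)
x1[5] <- 4
y <- 1 + 2 * x1 - x2 + rnorm(n)
y[5] <- y[5] + 15
dat <- data.frame(y, x1, x2)

fullmodel <- lm(y ~ x1 + x2, data = dat)

# The four default diagnostic plots in a 2-by-2 layout.
par(mfrow = c(2, 2))
plot(fullmodel)
par(mfrow = c(1, 1))

# Flag cases whose Cook's distance exceeds the rule-of-thumb cutoff.
cd <- cooks.distance(fullmodel)
p <- length(coef(fullmodel)) - 1      # number of predictors
cutoff <- 4 / (n - p - 1)
influential <- names(cd)[cd > cutoff]
print(influential)
```

The names returned are the row names of the data frame; with the assignment data they would need to be matched back to the Case ID column.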
5. Start with the full model (‘fullmodel’) fitted in part 3 and use backward elimination with AIC. What is the final model (write the fitted model)? Are the results consistent with those in part 3?
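As an illustration of backward elimination with AIC, the sketch below uses `step()` from base R on simulated data in which two predictors matter and two are pure noise. The data frame `dat` and its columns are hypothetical; only the name `fullmodel` comes from the text.

```r
# Simulated data: x1 and x2 drive y; x3 and x4 are pure noise.
set.seed(1234)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

fullmodel <- lm(y ~ ., data = dat)

# Backward elimination; step() uses the AIC penalty (k = 2) by default.
back_aic <- step(fullmodel, direction = "backward", trace = 0)

print(formula(back_aic))
print(round(coef(back_aic), 4))
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints the AIC of each candidate model, which can be useful for the Appendix.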
6. Use BIC instead of AIC and repeat part 5. What is the final model? Are the results consistent with what you saw in parts 3 and 5? Explain.
Here is some R code you may use for parts 5 & 6:
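A minimal sketch of such code, not the instructor's original listing, run on simulated data. It relies on the `k` argument of `step()`: `k = 2` gives AIC and `k = log(n)` gives BIC. The data frame `dat` and its columns are hypothetical.

```r
# Simulated data: x1 and x2 drive y; x3 and x4 are pure noise.
set.seed(1234)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

fullmodel <- lm(y ~ ., data = dat)

# Backward elimination with AIC (penalty k = 2, the default).
back_aic <- step(fullmodel, direction = "backward", trace = 0)

# Backward elimination with BIC (penalty k = log(n)).
back_bic <- step(fullmodel, direction = "backward", k = log(n), trace = 0)

print(formula(back_aic))
print(formula(back_bic))
```

Since log(n) exceeds 2 once n is 8 or more, BIC penalizes each extra term more heavily and tends to keep no more predictors than AIC.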
7. (For credit for Graduate students; optional (no credit) for Undergraduate students) k-fold cross-validation is a standard approach to assess the predictive ability of models by evaluating their performance on a new data set. The data set used to establish the model is called the training data set, and the data set used to evaluate the model is called the test data set. Carry out 2-fold cross-validation with stepwise regression using BIC by the following steps:
a. Set the seed of your randomization to be the last four digits of your student number. Randomly divide the data into two subsets called ‘onehalf’ and ‘nexthalf’. Here is some R code to do so:
    set.seed(1234)  # note: replace 1234 with the last 4 digits of your student ID
    n = nrow(a3data)  # number of cases in the cleaned data
    id2 = sample(1:n, n/2)
    onehalf = a3data[id2,]
    nexthalf = a3data[-id2,]
b. Using the onehalf dataset, find the model chosen by stepwise regression. Then fit a model with these chosen variables to the data in the nexthalf dataset. Compare the estimated regression coefficients and p-values for the test that these coefficients are zero from the model fit to the onehalf dataset to the model fit to the nexthalf dataset.
c. Repeat part (b), reversing the roles of the two data sets. That is, use nexthalf as the training data and onehalf as the test data.
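Steps (a) to (c) can be sketched as follows on simulated stand-in data. The helper `select_and_refit()` is a hypothetical name introduced for illustration, the data values are made up, and `1234` is a placeholder seed.

```r
# Simulated stand-in data (made-up values; column names follow the variable list).
set.seed(1234)   # placeholder: last 4 digits of a student number
n <- 120
a3data <- data.frame(
  list     = runif(n, 0.5, 3),
  taxes    = runif(n, 2, 12),
  bedrooms = sample(2:6, n, replace = TRUE),
  lotsize  = runif(n, 2000, 8000)
)
a3data$sale <- 0.9 * a3data$list + 0.05 * a3data$taxes + rnorm(n, sd = 0.1)

# Step (a): random split into two halves.
id2 <- sample(1:n, n / 2)
onehalf  <- a3data[id2, ]
nexthalf <- a3data[-id2, ]

# Hypothetical helper: choose a model on `train` by stepwise BIC,
# then refit the same formula on `test` and return both coefficient tables.
select_and_refit <- function(train, test) {
  full   <- lm(sale ~ ., data = train)
  chosen <- step(full, direction = "both", k = log(nrow(train)), trace = 0)
  refit  <- lm(formula(chosen), data = test)
  cols   <- c("Estimate", "Pr(>|t|)")
  list(train = summary(chosen)$coefficients[, cols, drop = FALSE],
       test  = summary(refit)$coefficients[, cols, drop = FALSE])
}

fold1 <- select_and_refit(onehalf, nexthalf)   # step (b)
fold2 <- select_and_refit(nexthalf, onehalf)   # step (c)

print(lapply(fold1, round, 4))
print(lapply(fold2, round, 4))
```

Each returned list holds two tables with the same rows, so estimates and p-values from the training and test fits can be compared side by side.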