
## COMS 4771-002 (Fall 2017) Exam 1, October 19, 2017

There are thirteen pages in total, including two blank pages at the end for scratch work. You have 1 hour and 15 minutes to complete this exam.

You may use any result from the lectures or the homeworks without proof.

Name & UNI:

### 1. (16 points) For each statement below, state either True or False. Justify your answer by providing a short explanation/proof sketch (for true statements) or a counterexample (for false statements).

(a) The Bayes classifier (i.e., the optimal classifier) always yields zero classification error.

(b) Kernel regression with a box kernel typically yields a discontinuous predictor.

(c) If we train a Naïve Bayes classifier using infinite training data that satisfies all of its modeling assumptions (e.g., features in the training data are independent given the class labels), then it will achieve zero training error over these training examples.

(d) Suppose you are given a dataset of cellular images from patients with and without cancer. If you are required to train a classifier that predicts the probability that the patient has cancer, you would prefer to use decision trees over logistic regression.

(e) Given a d × d symmetric positive semidefinite matrix A, the transformation φ : x ↦ xᵀAx is a linear transformation.

(f) Ordinary Least Squares (OLS) regression is consistent. That is, as the number of samples approaches infinity, the OLS error approaches the error of the optimal regressor.

(g) Consider the constrained optimization problem: This optimization problem is equivalent to the following unconstrained optimization problem: That is, the optimal solution to the given constrained optimization problem is equal to the optimal solution of the given unconstrained optimization problem.

(h) For any d ≥ 1, the unit L0-ball in ℝᵈ, that is, the set {x ∈ ℝᵈ : ||x||₀ ≤ 1}, is not a convex set.
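For statement (h), a quick numeric check illustrates the standard counterexample: the standard basis vectors lie in the unit L0-ball, but their midpoint does not. A minimal sketch:

```python
import numpy as np

def l0_norm(x):
    """Number of nonzero entries of x (the L0 'norm')."""
    return int(np.count_nonzero(x))

# Two points in the unit L0-ball of R^2 (each has at most one nonzero entry).
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
assert l0_norm(e1) <= 1 and l0_norm(e2) <= 1

# Their midpoint (a convex combination) has two nonzero entries,
# so it falls outside the ball: the set is not convex.
midpoint = 0.5 * e1 + 0.5 * e2
print(l0_norm(midpoint))  # 2
```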

### 2. [Maximum Likelihood Estimation and beyond (13 points)]

Your friend gives you a coin with bias b (that is, tossing the coin turns up '1' with probability b, and '0' with probability 1 − b). You make n independent tosses and get the observation sequence x1, . . . , xn ∈ {0, 1}.

(a) You want to estimate the coin's bias. What is the Maximum Likelihood Estimate (MLE) b̂ given the observations x1, . . . , xn?

(c) Derive a simple expression for the variance of this coin.

(d) What is the MLE for the coin's variance?

(e) Your friend reveals to you that the coin was minted from a faulty press that biased it towards 1. Suppose the model for the faulty bias is given by the following distribution:

Having this extra knowledge, what is the best estimate for the coin's bias b given the observation sequence? That is, compute: arg max_b P[b | x1, . . . , xn].
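Parts (a)–(d) have well-known closed forms for a Bernoulli coin: the MLE of the bias is the sample mean, the coin's variance is b(1 − b), and plugging b̂ in gives the MLE of the variance. A minimal numeric sketch (the function name is my own):

```python
import numpy as np

def bernoulli_mle(xs):
    """MLE of bias and variance for i.i.d. Bernoulli({0,1}) observations."""
    xs = np.asarray(xs, dtype=float)
    b_hat = xs.mean()                 # (a) MLE of the bias: the sample mean
    var_hat = b_hat * (1.0 - b_hat)   # (d) MLE of the variance b(1 - b)
    return b_hat, var_hat

b_hat, var_hat = bernoulli_mle([1, 0, 1, 1])
print(b_hat, var_hat)  # 0.75 0.1875
```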

### 3. [Analyzing Classifier Performance (6 points)]

Consider the plot below showing training and test set accuracy curves for decision trees of different sizes, using the same set of training data to train each tree.

(a) What is the most likely correspondence between the training and test curves, and the curves marked 'A' and 'B' in the figure?

(b) Describe in one sentence how the training data curve will change if the number of training examples approaches infinity.

(c) How will the test data curve change under the same condition as in part (b) (i.e., when the number of training examples approaches infinity)?

### 4. [Linear Classification]

(a) (3 points) The figure below depicts the decision boundaries of three linear classifiers (labeled A, B, and C) and the locations of n labeled training points (positive points are circles, negative points are triangles).

Consider the following algorithms for learning linear classifiers which could have produced the depicted classifiers given the depicted training data:

i. The Perceptron (making one pass over the training data, i.e., T = n).

ii. An arbitrary but fixed algorithm that minimizes training error for homogeneous linear classifiers.

iii. An algorithm that exactly solves the non-separable SVM optimization criterion. That is, the SVM-with-slack-variables optimization criterion.

What is the most likely correspondence between these algorithms and the depicted linear classifiers? Explain your matching to receive credit.

#### (b) (12 points)

You are given a binary classification problem in ℝ² using features f1 and f2. For each of the following data preprocessing techniques, state and justify what happens to the training accuracy of any training-error-minimizing algorithm for homogeneous linear classification.

You should choose between the following options:

A. The preprocessing can only improve the performance.

B. The preprocessing absolutely does not change performance.

C. The preprocessing can only worsen the performance.

D. The preprocessing can either improve or worsen the performance.

i. Scaling the feature space: that is, apply the map (f1, f2) ↦ (cf1, cf2) (for some fixed constant c > 0 that is selected independently, without considering the training data).

ii. Translating the feature space: that is, apply the map (f1, f2) ↦ (f1 + c, f2 + c) (for some fixed constant c ≠ 0 that is selected independently, without considering the training data).

iii. Augmenting the feature space: that is, apply the map (f1, f2) ↦ (f1, f2, c) (for some fixed constant c ≠ 0 that is selected independently, without considering the training data).

iv. z-scoring the feature space: that is, apply the map fi ↦ (fi − µi)/σi, where µi and σi are computed as the training data mean and standard deviation for the feature fi, respectively.

v. Combining the feature space: by applying the map (f1, f2) ↦ (f1², f1f2, f2²).

vi. Combining the features: by applying the map (f1, f2) ↦ (f1 + f2).
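The six maps above are straightforward to express in code; a NumPy sketch (the helper names are my own, not from the exam):

```python
import numpy as np

# Each map acts on a feature matrix F of shape (n, 2): one (f1, f2) row per example.
scale     = lambda F, c: c * F                                   # i.  (f1, f2) -> (c f1, c f2)
translate = lambda F, c: F + c                                   # ii. (f1, f2) -> (f1 + c, f2 + c)
augment   = lambda F, c: np.hstack([F, np.full((len(F), 1), c)]) # iii. append constant feature c
zscore    = lambda F: (F - F.mean(0)) / F.std(0)                 # iv. per-feature z-scoring
quadratic = lambda F: np.column_stack(
    [F[:, 0]**2, F[:, 0] * F[:, 1], F[:, 1]**2])                 # v.  (f1^2, f1 f2, f2^2)
summed    = lambda F: (F[:, 0] + F[:, 1]).reshape(-1, 1)         # vi. (f1 + f2)

F = np.array([[1.0, 2.0], [3.0, 4.0]])
print(augment(F, 1.0).shape)  # (2, 3)
```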

### 5. [Nearest Neighbor Classification]

(a) (5 points) Two classifiers are said to be equal if on every test example they output the same prediction. Alice, Bob, and Carol each construct a 3-NN classifier using the same training dataset S. They use the following distance metrics:

d_alice(x, x′) := ||x − x′||

d_bob(x, x′) := 2e^(||x − x′||) + 1

d_carol(x, x′) := ||x − x′||₁

Whose classifiers would be equal for all non-empty training sets S? Justify your answer.
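For intuition on part (a): any distance that is a strictly increasing function of the Euclidean distance induces the same neighbor ordering, and hence the same k-NN classifier. A numeric sketch, assuming Bob's distance is the monotone transform 2e^(||x − x′||) + 1:

```python
import numpy as np

def neighbor_order(dist, x, X):
    """Indices of training points in X sorted by distance to the query x."""
    return np.argsort([dist(x, xi) for xi in X])

d_alice = lambda x, z: np.linalg.norm(x - z)                  # Euclidean
d_bob   = lambda x, z: 2 * np.exp(np.linalg.norm(x - z)) + 1  # strictly increasing in Euclidean
d_carol = lambda x, z: np.abs(x - z).sum()                    # L1

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
x = rng.normal(size=2)

# Alice and Bob rank every training point identically, so their 3-NN
# classifiers agree on every query; Carol's L1 ranking can differ.
print(np.array_equal(neighbor_order(d_alice, x, X), neighbor_order(d_bob, x, X)))  # True
```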

#### (b)(5 points)

Define hk(x; S) as the k-nearest-neighbor classifier that takes in a training dataset S and a test example x and outputs a prediction ŷ ∈ {0, 1}. (So h5(x, Si) denotes the 5-NN classifier that is trained on Si and predicts on x.) Let S1 and S2 be two disjoint training datasets drawn from the same underlying distribution, and consider the following statements:

A. If h1(x, S1) = 1 and h1(x, S2) = 1, then h1(x, S1 ∪ S2) = 1.

B. If h1(x, S1) = 1 and h1(x, S2) = 1, then h3(x, S1 ∪ S2) = 1.

C. If h3(x, S1) = 1 and h3(x, S2) = 1, then h3(x, S1 ∪ S2) = 1.

i. Only A. is correct

ii. Only B. is correct

iii. Only C. is correct

iv. Only A. and B. are correct

v. Only A. and C. are correct

vi. Only B. and C. are correct

vii. All of A., B. and C. are correct

viii. None of A., B. and C. are correct

(c) Just like the Perceptron or the SVM algorithm, you want to design a 'kernelized' version of k-Nearest Neighbor. Since distance computations are at the core of any NN algorithm, you realize that in order to develop a successful 'kernel'-NN algorithm, you need a way to efficiently compute distances in a (possibly infinite-dimensional) kernel space. Let Φ : ℝᵈ → ℝᴰ denote an explicit mapping into a kernel space (where D can possibly be infinite).

i. (5 points) Show that Euclidean distances between data points in a kernel space can be written in terms of dot products, and thus can be computed efficiently for certain kernel transformations. That is, show that one can write

||Φ(x) − Φ(x′)||

just in terms of dot products of the form ⟨Φ(·), Φ(·)⟩.
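The key identity here is ||Φ(x) − Φ(x′)||² = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x′)⟩ + ⟨Φ(x′), Φ(x′)⟩. A numeric check using the homogeneous quadratic kernel (chosen only because its explicit feature map is small):

```python
import numpy as np

def K(x, z):
    """Homogeneous quadratic kernel: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for K on R^2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel_dist(x, z):
    """||phi(x) - phi(z)|| computed from kernel evaluations alone."""
    return np.sqrt(K(x, x) - 2 * K(x, z) + K(z, z))

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = np.linalg.norm(phi(x) - phi(z))
print(np.isclose(kernel_dist(x, z), explicit))  # True
```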

ii. (5 points) Write down the pseudo-code for kernelized 1-Nearest Neighbor. Assume that you have access to the kernel function K_φ : (x, x′) ↦ ⟨φ(x), φ(x′)⟩, which gives you the dot product in the feature space φ.

#### Kernel 1-Nearest Neighbor Algorithm

Inputs:

Training data: (x1, y1), (x2, y2), . . . , (xn, yn)

Kernel function: K_φ(·, ·)

Test point: xt

Output:

Prediction on the test point: ŷt

Pseudo-code:
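One possible sketch of the requested pseudo-code, written as runnable Python for concreteness (not an official solution; the linear kernel below is only a stand-in for a generic K_φ):

```python
import numpy as np

def kernel_1nn(X, y, Kphi, x_t):
    """Predict the label of x_t by 1-NN in the feature space induced by Kphi.

    Uses ||phi(x_t) - phi(x_i)||^2 = K(x_t, x_t) - 2 K(x_t, x_i) + K(x_i, x_i),
    so the explicit feature map is never formed.
    """
    d2 = [Kphi(x_t, x_t) - 2 * Kphi(x_t, xi) + Kphi(xi, xi) for xi in X]
    return y[int(np.argmin(d2))]  # label of the closest training point

# Toy usage with a linear kernel as the stand-in for K_phi.
linear = lambda a, b: float(np.dot(a, b))
X = np.array([[0.0, 0.0], [5.0, 5.0]])
y = np.array([0, 1])
print(kernel_1nn(X, y, linear, np.array([4.0, 4.0])))  # 1
```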

### 6. [Regression (10 points)]

You want to design a regressor with good prediction accuracy. Knowing that various kinds of regularization help, you decide to incorporate both the lasso and ridge penalties in the design of your optimization objective. Given training data (x1, y1), . . . , (xn, yn) (where each xi ∈ ℝᵈ and each yi ∈ ℝ), you come up with the following optimization, where: (i) X is a d × n data matrix whose ith column corresponds to training example xi, (ii) y is a 1 × n output vector whose ith component corresponds to the training output yi, and (iii) w is a 1 × d parameter weight vector. λ ≥ 0 and α ∈ [0, 1] are known trade-off parameters.

Show that this optimization can be turned into a lasso regression by appropriately augmenting the training data.
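Assuming the objective is the usual elastic net, ||y − wX||₂² + λ(α||w||₁ + (1 − α)||w||₂²), the standard trick appends √(λ(1 − α)) · I as extra data columns with zero targets, after which only the L1 penalty remains. A numeric sketch of the augmentation (the function name is my own):

```python
import numpy as np

def augment_for_lasso(X, y, lam, alpha):
    """Turn the elastic-net problem into a pure lasso via data augmentation.

    Assumes the objective ||y - wX||^2 + lam*(alpha*||w||_1 + (1-alpha)*||w||_2^2),
    with X of shape (d, n), one training example per column. Appending
    sqrt(lam*(1-alpha)) * I as d extra columns with zero targets makes the
    augmented squared loss reproduce the ridge term exactly.
    """
    d = X.shape[0]
    c = np.sqrt(lam * (1.0 - alpha))
    X_aug = np.hstack([X, c * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return X_aug, y_aug  # solve lasso on (X_aug, y_aug) with penalty lam*alpha*||w||_1

# Check: augmented squared loss == original squared loss + ridge penalty.
rng = np.random.default_rng(1)
X, y, w = rng.normal(size=(3, 5)), rng.normal(size=5), rng.normal(size=3)
lam, alpha = 0.7, 0.4
X_aug, y_aug = augment_for_lasso(X, y, lam, alpha)
lhs = np.sum((y_aug - w @ X_aug) ** 2)
rhs = np.sum((y - w @ X) ** 2) + lam * (1 - alpha) * np.sum(w ** 2)
print(np.isclose(lhs, rhs))  # True
```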