$30
CSE4334/5334 Data Mining Assignment 1
You have two late days — use it at as you
wish. Once you run out of this quota, the penalty for late submission will be applied. You can either use
your late days quota (or let the penalty be applied). Clearly indicate in your submission if you seek to use
the quota.
What to turn in:
1. Your submission should include your complete code base in an archive file (zip, tar.gz) and q1/,
q2/, and so on), and a very very clear README describing how to run it.
2. A brief report (typed up, submit as a PDF file, NO handwritten scanned copies) describing what you solved, implemented and known failure cases.
3. Submit your entire code and report to Blackboard.
Notes from instructor:
• Start early!
• You may ask the TA or instructor for suggestions, and discuss the problem with others (minimally).
But all parts of the submitted code must be your own.
• Use Matlab or Python for your implementation.
• Make sure that the TA can easily run the code by plugging in our test data.
Problem 1
(k-means, 40pts) Generate 2 sets of 2-D Gaussian random data, each set containing 500 samples using
parameters below.
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 =
0.9 0.4
0.4 0.9
, Σ2 =
0.9 0.4
0.4 0.9
(1)
1. (20pts) Write a function cluster = mykmeans(X, k, c) that clusters data X ∈ R
n×p
(n number of
objects and p number of attributes) into k clusters. The c here is the initial centers, although this is
usually not necessary, we will need it to test your function. Terminate the iteration when the `2-norm
between a previous center and an updated center is ≤ 0.001 or the number of iteration reaches 10000.
2. (10pts) Apply your code to the data generated above with k = 2 and initial centers c1 = (10, 10) and
c2 = (−10, −10). In your report, report the centers found for each cluster. How many iterations did it
take? Show a scatter plot of the data and the centers of clusters found.
3. (10pts) Apply your code to the data generated above with k = 4 and initial centers c1 = (10, 10) and
c2 = (−10, −10), c3 = (10, −10) and c4 = (−10, 10). In your report, report the centers found for each
cluster. How many iterations did it take? Show a scatter plot of the data and the centers of clusters
found.
Problem 2
(Non-parameteric density estimation 60pts)
1. (30pts) Write a function [p, x] = mykde(X,h) that performs kernel density estimation on X with
bandwidth h. It should return the estimated density p(x) and its domain x where you estimated the
p(x) for X in 1-D and 2-D.
Instructor: W. H. Kim (won.kim@uta.edu), TA: Xin Ma (xin.ma@mavs.uta.edu) Page 1 of 2
CSE4334/5334 Data Mining Assignment 1
2. (10pts) Generate N = 1000 Gaussian random data with µ1 = 5 and σ1 = 1. Test your function mykde
on this data with h = {.1, 1, 5, 10}. In your report, report the histogram of X along with the figures of
estimated densities.
3. (10pts) Generate N = 1000 Gaussian random data with µ1 = 5 and σ1 = 1 and another Gaussian
random data with µ2 = 0 and σ2 = 0.2. Test your function mykde on this data with h = {.1, 1, 5, 10}.
In your report, report the histogram of X along with the figures of estimated densities.
4. (10pts) Generate 2 sets of 2-D Gaussian random data with N1 = 500 and N2 = 500 using the following
parameters:
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 =
0.9 0.4
0.4 0.9
, Σ2 =
0.9 0.4
0.4 0.9
. (2)
Test your function mykde on this data with h = {.1, 1, 5, 10}. In your report, report figures of estimated
densities.
Instructor: W. H. Kim (won.kim@uta.edu), TA: Xin Ma (xin.ma@mavs.uta.edu) Page 2 of 2