What is a matrix or vector norm?


Usually, numpy.linalg.norm is used to compute the Euclidean distance between data points and cluster centroids.

Specifically, I am analyzing the KMeans implementation presented here, which contains the following:

# Importing the libraries and the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('xclara.csv')

(3000, 2)

Store the V1 and V2 columns in the variables f1 and f2:

# Getting the values and plotting it
f1 = data['V1'].values
f2 = data['V2'].values

# Pair each value of f1 with the corresponding value of f2 and collect the pairs in an array
X = np.array(list(zip(f1, f2)))

# Array X

[  2.072345  -3.241693]
[ 17.93671   15.78481 ]
[  1.083576   7.319176]
[ 64.46532  -10.50136 ]
[ 90.72282  -12.25584 ]
[ 64.87976  -24.87731 ]]

# Plot the data as a scatter plot
plt.scatter(f1, f2, c='black', s=7)

Euclidean distance calculator

# Euclidean Distance Calculator
# https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.norm.html
# Getting the Euclidean distance between data points
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
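A minimal sketch of what `dist` returns with the default `ax=1`, assuming a small made-up set of points and a single centroid (the names `points`, `centroid`, `per_point`, and `manual` are hypothetical and not part of the original code). With `axis=1`, `np.linalg.norm` treats each row of `a - b` as a vector and returns one Euclidean (L2) norm per row, which is the square root of the sum of squared differences:

```python
import numpy as np

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Hypothetical sample values, chosen only for illustration
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centroid = np.array([0.0, 0.0])

# axis=1: one Euclidean norm per row, i.e. one distance per point
per_point = dist(points, centroid)

# The same computation written out by hand: sqrt(dx**2 + dy**2) for each row
manual = np.sqrt(((points - centroid) ** 2).sum(axis=1))

print(per_point)  # distances 0, 5 and 10
print(manual)     # same values
```

So in this context the "norm" of the difference vector `a - b` is simply its length, and that length is the Euclidean distance between `a` and `b`.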

How should I understand the vector norm in the context of Euclidean distance?

# Number of clusters
k = 3

# Generate random integers between 0 and the maximum value of the X array
# (which holds the f1 and f2 columns) minus 20. These values are used as the
# X and Y coordinates at which the initial centroids are placed

# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)

print(" x coordinates" +'\n', C_x)
print(" y coordinates" +'\n', C_y) 

x coordinates
[51 41 25]
y coordinates
[18 76 53]

We combine the lists C_x and C_y with zip so that a single array holds the (x, y) location of each centroid in the scatter plot:

C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Coordinates pair x,y associated inside list to" +'\n', "INITIALIZE RANDOM CENTROIDS" +'\n', C)

Coordinates pair x,y associated inside list to
[[51. 18.]
[41. 76.]
[25. 53.]]

# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')

# To store the previous centroid values each time the centroids are updated
C_old = np.zeros(C.shape)
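A minimal sketch of how a loop of this kind typically uses `C_old`, assuming toy data and starting centroids that are made up for illustration (this is a guess at the general pattern, not necessarily the author's exact code). The previous centroids are kept so that, after each update, the movement of the centroids can be measured; when they stop moving, the loop ends:

```python
import numpy as np

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Hypothetical toy data: two well separated groups of points
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
C = np.array([[1., 1.], [9., 9.]])      # hypothetical starting centroids
C_old = np.zeros(C.shape)               # previous centroids
clusters = np.zeros(len(X), dtype=int)  # cluster label of each point

# How far the centroids moved since the last update
error = dist(C, C_old, None)
iterations = 0
while error != 0:
    # Assign each point to its nearest centroid
    for i in range(len(X)):
        clusters[i] = np.argmin(dist(X[i], C))
    # Remember the current centroids before moving them
    C_old = C.copy()
    # Move each centroid to the mean of the points assigned to it
    for j in range(len(C)):
        C[j] = X[clusters == j].mean(axis=0)
    # Zero means no centroid moved, so the assignment is stable
    error = dist(C, C_old, None)
    iterations += 1

print(clusters)
print(C)
```

Without `C_old` there would be nothing to compare the new centroids against, so the loop would have no stopping condition based on convergence.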

Why is it necessary to store the old coordinates of the centroids?

# Cluster Labels (0, 1, 2)
clusters = np.zeros(len(X))

How is this distance computed, in terms of a vector or matrix norm?

# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
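A minimal sketch of what the call with `axis=None` computes, reusing the centroid values printed above (the names `manual` and `per_centroid` are hypothetical). When `np.linalg.norm` receives a 2-D array with `axis=None` and the default `ord`, it returns a single number: the Frobenius norm, which is the square root of the sum of the squares of all entries. Here that is one scalar summarizing how far all centroids moved together:

```python
import numpy as np

C = np.array([[51., 18.], [41., 76.], [25., 53.]], dtype=np.float32)
C_old = np.zeros(C.shape)

# axis=None on a 2-D array: one scalar for the whole matrix (Frobenius norm)
error = np.linalg.norm(C - C_old, axis=None)

# The same value written out by hand
manual = np.sqrt(((C - C_old) ** 2).sum())

# For comparison, axis=1 gives one distance per centroid instead
per_centroid = np.linalg.norm(C - C_old, axis=1)

print(error)         # a single number
print(per_centroid)  # three numbers, one per centroid
```

The two views are related: the square of `error` equals the sum of the squares of the entries in `per_centroid`, so `error` is zero exactly when no centroid has moved.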

I understand the implementation the author wrote, but the vector norm is a new topic for me.