I'm using sklearn to cluster some lines of text, but trying to understand the format of the returned cluster labels. It looks like this:
km_model.labels_
array([ 5, 35, 1, 29, 49, 2, 6, 28, 5, 4, 4, 19, 40, 52, 6, 20, 4,\n 40, 40, 7, 10, 13, 14, 4, 10, 29, 14, 22, 24, 13, 24, 5, 4, 21,\n ...
So it's kind of like an array, but are those \n elements separating the clusters?
Is that really the format?
Is this some kind of shortcut for packing matrices in scikit-learn? Why don't they return a 2D array of labels, e.g. one list of labels per cluster?
After that, what is the best way to iterate through this kind of data and group the labels per cluster?
Your clusters are the numeric values; the index of each label corresponds to the index of the sample you passed into your model. I suspect the \n comes from however your IDE rendered this output.
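If you then want to group the samples by cluster, a minimal sketch (reusing km_model from the question) is to collect the sample indices for each label:
import numpy as np
labels = km_model.labels_  # labels[i] is the cluster assigned to sample i
# map each cluster label to the indices of the samples assigned to it
clusters = {k: np.where(labels == k)[0] for k in np.unique(labels)}
print(clusters[5])  # indices of all samples in cluster 5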
There exists a numpy ndarray A of shape [100, 50, 5], and I want to expand A as follows: a one-dimensional array of shape (50,) will be appended, so that the resulting A has shape [100, 50, 6].
Each element of this appended slice is based on data already in the original ndarray, i.e. A[:,:,4], via a given formula: A[:,i,5] = A[:,i,4]*B[i] + 5 for i = 0..49. Here A[:,:,5] corresponds to the added slice, and B is another array acting as a weight.
Besides writing a for loop, how can I fulfill this task in a vectorized/efficient way using numpy operations?
Make 2 arrays, with sizes small enough that we can look at them:
In [371]: A = np.arange(24).reshape(2,3,4); B = np.array([10,20,30])
Due to broadcasting we can add a (3,) array to a (2,3) array:
In [372]: A[:,:,-1]+B
Out[372]:
array([[13, 27, 41],
[25, 39, 53]])
We can then convert that to a (2,3,1) array:
In [373]: (A[:,:,-1]+B)[:,:,None]
Out[373]:
array([[[13],
[27],
[41]],
[[25],
[39],
[53]]])
In [374]: A
Out[374]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
and join them on the last axis:
In [375]: np.concatenate((A, Out[373]), axis=-1)
Out[375]:
array([[[ 0, 1, 2, 3, 13],
[ 4, 5, 6, 7, 27],
[ 8, 9, 10, 11, 41]],
[[12, 13, 14, 15, 25],
[16, 17, 18, 19, 39],
[20, 21, 22, 23, 53]]])
Or we can make a target array of the right size, and copy values to it:
In [376]: A1 = np.zeros((2,3,5),int)
In [377]: A1[:,:,:-1]=A
In [379]: A1[:,:,-1]=Out[372]
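Applied to the shapes in the question (a hypothetical (100, 50, 5) array A and a (50,) weight array B), the same broadcasting idea computes the new slice in one step:
import numpy as np
A = np.random.rand(100, 50, 5)
B = np.random.rand(50)
new_slice = A[:, :, 4] * B + 5  # shape (100, 50), i.e. A[:, i, 4]*B[i] + 5 for each i
A = np.concatenate((A, new_slice[:, :, None]), axis=-1)  # A is now (100, 50, 6)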
I'm trying to learn more about unsupervised learning in Python. When I was doing a couple of courses on DataCamp, I noticed that scikit-learn had feature extraction, but only for text. However, what if the lists I have are all numbers? Is there an equivalent for numeric lists?
Here's the example:
[12, 21, 3, 33, 42, 16, 10, 46, 32, 1, 17, 50]
[3, 18, 19, 21, 11, 44, 36, 17, 2, 42, 6, 11]
... etc ...
I've implemented a Rubik's cube using permutations of tuples. The cube with no changes is represented as (0, 1, 2, ..., 45, 46, 47).
To apply a 'turn' to the cube, the numbers are shuffled around. I've tested all of my turns fairly thoroughly, to the point that I'm reasonably sure there are no typos.
I've been trying to implement a method that checks whether a cube is valid or not, because only 1 in 12 random permutations of (0, 1, ..., 46, 47) is a valid cube. For a permutation to be a valid Rubik's cube it must meet 3 requirements. This is well documented in this Math.SE thread: https://math.stackexchange.com/questions/127577/how-to-tell-if-a-rubiks-cube-is-solvable
The 3 steps are:
Edge orientation: Number of edge flips has to be even.
Corner orientation: Number of corner twists has to be divisible by 3.
Permutation parity: This is where I'm having troubles. The permutation parity must be even, meaning that the corner parity must match the edge parity.
The SymPy library provides a great way for me to work with a number of permutation group properties so I included it in my attempt at computing permutation parity.
The simplest test input that it fails on, when it should succeed, is the back turn of the cube, represented as B.
Here's the code:
from sympy.combinatorics import Permutation

def check_permutation_parity(cube):
    corners = cube[:24]
    edges = cube[24:]
    edges = [e - 24 for e in edges]
    # corner_perms_only / edge_perms_only are helpers defined elsewhere in the repo
    corner_perms = corner_perms_only(corners)
    edge_perms = edge_perms_only(edges)
    # drop the orientation component so only the piece permutation remains
    normalized_corners = [int(c / 3) for c in corner_perms]
    normalized_edges = [int(e / 2) for e in edge_perms]
    sympy_corners = Permutation(list(normalized_corners))
    sympy_edges = Permutation(list(normalized_edges))
    corners_perm_parity = sympy_corners.parity()
    edges_perm_parity = sympy_edges.parity()
    if corners_perm_parity != edges_perm_parity:
        return False
    return True
Using a bunch of print statements I've outlined what happens throughout the code:
This is the initial state. It's the B permutation of the cube and looks as expected.
cube:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 12, 13, 14, 21, 22, 23, 15, 16, 17, 24, 25, 26, 27, 30, 31, 28, 29, 32, 33, 36, 37, 34, 35, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
Next we look at the corners and edges. Remember that 24 has been subtracted from every edge; this is necessary for the eventual conversion to a SymPy permutation.
corners, edges
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 12, 13, 14, 21, 22, 23, 15, 16, 17)
[0, 1, 2, 3, 6, 7, 4, 5, 8, 9, 12, 13, 10, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Then we extract just every 3rd corner and every 2nd edge. This lets us look at just the permutation of each piece, because we don't care about orientation.
corner_perms_only, edges_perms_only
(0, 3, 6, 9, 18, 12, 21, 15)
(0, 2, 6, 4, 8, 12, 10, 14, 16, 18, 20, 22)
Then we divide by 3 (corners) or 2 (edges) to convert to SymPy indices:
normalized_corners, edges
[0, 1, 2, 3, 6, 4, 7, 5]
[0, 1, 3, 2, 4, 6, 5, 7, 8, 9, 10, 11]
After converting to SymPy the corners look as such:
sympy corners
(4 6 7 5)
[(4, 5), (4, 7), (4, 6)]
[[0], [1], [2], [3], [4, 6, 7, 5]]
And the edges look as such:
sympy edges
(11)(2 3)(5 6)
[(2, 3), (5, 6)]
[[0], [1], [2, 3], [4], [5, 6], [7], [8], [9], [10], [11]]
Giving us these parities, because the corner permutation is a single 4-cycle (three transpositions, so odd) while the edge permutation is two 2-cycles (two transpositions, so even):
corners, edges perm parity
1
0
Because the parities differ, the function returns False.
B: False
We know that the parities should match, but I can't get that result, and I'm kind of lost as to where to go next with debugging. All of the code can be found on my GitHub here: https://github.com/elliotmartin/RubikLite/blob/master/Rubik.py
My issue had nothing to do with SymPy and the permutation parities. To check this I implemented my own algorithm for cyclic decomposition and then checked the parities. In the end the issue had to do with how I set up the permutations for each move.
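For what it's worth, a minimal sketch of that kind of independent check (a hypothetical helper, not the code from the repo) computes parity straight from the cycle decomposition:
def permutation_parity(perm):
    # perm is a list mapping index -> image; returns 0 for even, 1 for odd,
    # matching SymPy's .parity() convention
    seen = [False] * len(perm)
    transpositions = 0
    for start in range(len(perm)):
        i = start
        length = 0
        while not seen[i]:
            seen[i] = True
            i = perm[i]
            length += 1
        if length:
            transpositions += length - 1  # a k-cycle equals k-1 transpositions
    return transpositions % 2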
I guess I've learned a lot about testing - if your tests don't test for the correct thing then they're not that useful.
It seems one of my assumptions was incorrect regarding ordering in RDDs (related).
Suppose I wish to repartition an RDD after having sorted it.
import random
l = list(range(20))
random.shuffle(l)
spark.sparkContext\
.parallelize(l)\
.sortBy(lambda x:x)\
.repartition(3)\
.collect()
Which yields:
[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
As we can see, the order is preserved within a partition but the total order is not preserved over all partitions.
I would like to preserve total order of the RDD, like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
I am having difficulty finding anything online which could be of assistance. Help would be appreciated.
It appears that we can provide the argument numPartitions=partitions to the sortBy function to partition the RDD and preserve total order:
import random
l = list(range(20))
random.shuffle(l)
partitions = 3
spark.sparkContext\
.parallelize(l)\
.sortBy(lambda x: x, numPartitions=partitions)\
.collect()
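To confirm that each partition then holds a contiguous, ordered slice, one quick check (hypothetical, assuming the same variables as above) is to inspect the partitions with glom():
parts = spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .glom()\
    .collect()
print(parts)  # three lists, each internally sorted, covering 0..19 in order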
I am trying to program a Lotto simulator, where the code generates 6 unique random numbers out of 45 for about 1000 players, each with a unique ID. I want to place it into an array that looks like this:
lotto[0...n-1][0...5]
Here [0...n-1] contains the players' IDs, and [0...5] their 6 unique game numbers.
So it should look something like this when printed
lotto[1][32, 34, 24, 13, 20, 8]
lotto[2][1, 27, 4, 41, 33, 17]
...
lotto[1000][6, 12, 39, 16, 45, 3]
What is the best way of doing something like this without actually merging the two arrays together?
Later on I want to use a merge-sort algorithm to numerically order the game numbers for each player, so it would look something like this, without the players' IDs interfering with the game numbers:
lotto[1][8, 13, 20, 24, 32, 34]
lotto[2][1, 4, 17, 27, 33, 41]
So far I've got:
playerID = list(range(1, 1001))
playerNum = random.sample(range(1, 45), 6)
print(playerID + playerNum)
But that just prints and joins:
[1, 2, 3, ..., 1000, 32, 5, 19, 27, 6, 22]
Thanks for the help.
import random
n_players = 1000
lotto = [random.sample(range(1, 46), 6) for _ in range(n_players)]  # range(1, 46) covers 1..45
OR
import random
n_players = 1000
tup = tuple(range(1, 46))  # the number pool 1..45
lotto = []
for _ in range(n_players):
    lotto.append(random.sample(tup, 6))
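As a follow-up sketch for the sorting and look-up part of the question (using Python's built-in sort rather than a hand-written merge sort): each player's numbers can be sorted in place, and player IDs 1..n map to list indices 0..n-1.
for numbers in lotto:
    numbers.sort()  # numerically order each player's 6 numbers
player_id = 1
print(player_id, lotto[player_id - 1])  # that player's sorted game numbers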