RDD: Preserve total order when repartitioning - apache-spark

It seems one of my assumptions was incorrect regarding order in RDDs (related).
Suppose I wish to repartition an RDD after having sorted it.
import random
l = list(range(20))
random.shuffle(l)
spark.sparkContext \
    .parallelize(l) \
    .sortBy(lambda x: x) \
    .repartition(3) \
    .collect()
Which yields:
[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
As we can see, the order is preserved within each partition, but the total order is not preserved across partitions (repartition performs a full shuffle and makes no ordering guarantee).
I would like to preserve total order of the RDD, like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
I am having difficulty finding anything online which could be of assistance. Help would be appreciated.

It appears that we can provide the argument numPartitions=partitions to the sortBy function to partition the RDD and preserve total order:
import random
l = list(range(20))
random.shuffle(l)
partitions = 3
spark.sparkContext \
    .parallelize(l) \
    .sortBy(lambda x: x, numPartitions=partitions) \
    .collect()
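To double-check that the order really is total, glom() returns each partition as a list, so we can inspect the partition boundaries directly (the exact splits below are illustrative; sortBy samples the data to choose its range boundaries):

spark.sparkContext \
    .parallelize(l) \
    .sortBy(lambda x: x, numPartitions=partitions) \
    .glom() \
    .collect()
# e.g. [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13], [14, 15, 16, 17, 18, 19]]

Each partition is sorted and the partitions themselves are in ascending order, which is why collect() returns the fully sorted list.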

Related

Python for loop one-line difference from regular

I just came across a difference between a one-line and a regular for loop.
As an example:
obs = [6, 12, 8, 10, 20, 16]
freq = [5, 4, 3, 2, 1, 5]
data = []
data.extend(obs[i:i+1] * freq[i] for i in range(len(obs)))
outputs
[[6, 6, 6, 6, 6], [12, 12, 12, 12], [8, 8, 8], [10, 10], [20], [16, 16, 16, 16, 16]]
However,
for i in range(len(obs)):
    data.extend(obs[i:i+1] * freq[i])
outputs
[6, 6, 6, 6, 6, 12, 12, 12, 12, 8, 8, 8, 10, 10, 20, 16, 16, 16, 16, 16]
Can someone kindly explain what causes this?
Extending x by y means appending each entry of y to x.
Since the entries of the generator obs[i:i+1] * freq[i] for i in range(len(obs)) are lists of integers, data.extend(obs[i:i+1] * freq[i] for i in range(len(obs))) will append lists of integers to data, not integers.
On the other hand, the elements of obs[i:i+1] * freq[i] are integers, and therefore data.extend(obs[i:i+1] * freq[i]) will append integers to data.
This loop would generate the same (nested) output as the one-liner:
obs = [6, 12, 8, 10, 20, 16]
freq = [5, 4, 3, 2, 1, 5]
data = []
for i in range(len(obs)):
    data.extend([obs[i:i+1] * freq[i]])
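If the flat list is what you want while keeping the generator form, one option is to flatten it with itertools.chain.from_iterable, which concatenates the inner lists (a minimal sketch):

from itertools import chain

obs = [6, 12, 8, 10, 20, 16]
freq = [5, 4, 3, 2, 1, 5]
# Flatten the generator of lists into a single list of integers
data = list(chain.from_iterable(obs[i:i+1] * freq[i] for i in range(len(obs))))
# [6, 6, 6, 6, 6, 12, 12, 12, 12, 8, 8, 8, 10, 10, 20, 16, 16, 16, 16, 16]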

Can anyone explain why I can't concatenate these two matrices?

Here are my matrices and the line of code:
d = np.array([[1, 2, 3], [6, 7, 8], [11, 12, 13], [16, 17, 18]])
e = np.array([[4, 5], [9, 10], [14, 15], [19, 20]])
np.concatenate(d,e)
and this is the error that I get:
TypeError: only integer scalar arrays can be converted to a scalar index
It is not really a syntax mistake but a wrong call signature: the arrays must be passed as a single tuple, np.concatenate((d, e)), because the second positional argument of np.concatenate is axis. Passing e as the axis is what triggers the "only integer scalar arrays" error. Since d and e have different numbers of columns but the same number of rows, axis=1 is also required for it to work:
np.concatenate((d, e), axis=1)
is the solution.
Since those arrays have different shapes, you should specify the axis along which to concatenate, like the following:
1) np.concatenate((d,e), axis=1)
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]])
or
2) np.concatenate((d,e), axis=None)
array([ 1, 2, 3, 6, 7, 8, 11, 12, 13, 16, 17, 18, 4, 5, 9, 10, 14,
15, 19, 20])
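For 2-D arrays, np.hstack is a common shorthand that produces the same result as concatenating along axis=1:

np.hstack((d, e))  # equivalent to np.concatenate((d, e), axis=1) for 2-D arrays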

(pandas) access all rows except for the first 3 for the specific columns at index

I want to access all rows except for the first 3, for the specific columns at index 1, 2, 4, 5, 7, 8, 10, 11, 13, 14 of a csv file. How can I do this? All examples I have found show how to slice a contiguous range (for example 1:14), but I do not want all columns in between, only specific ones.
When I try:
p = df[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
I get an error:
p = df[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1840, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type: 'slice'
and the notation p = df[[3:], [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]] does not work either (it is a SyntaxError).
IIUC you need DataFrame.iloc to filter by position: all rows except the first 3, and the columns selected by their integer positions:
df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
p = df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
You were mostly right: the 3: slice and the column list were fine, you just have to go through .iloc instead of plain df[...] (note that [3:] inside a list literal is invalid syntax).
Using loc/iloc is the recommended way of indexing by label/position.
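A minimal sketch with a hypothetical 15-column frame, just to show the resulting shape:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(90).reshape(6, 15))  # hypothetical 6x15 frame
p = df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
print(p.shape)  # (3, 10): all rows except the first 3, ten selected columns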

SymPy Permutation groups parity not working as expected

I've implemented a Rubik's cube using permutations of tuples. The cube with no changes is represented as (0, 1, 2, ..., 45, 46, 47).
To apply a 'turn' to the cube, the numbers are shuffled around. I've tested all of my turns thoroughly enough that I'm fairly sure there are no typos.
I've been trying to implement a method that checks whether a cube is valid or not, because only 1 in 12 random permutations of (0, 1, ..., 46, 47) is a valid cube. For a permutation to be a valid Rubik's cube it must meet 3 requirements, which are well documented in this Math.SE thread: https://math.stackexchange.com/questions/127577/how-to-tell-if-a-rubiks-cube-is-solvable
The 3 steps are:
Edge orientation: the number of edge flips has to be even.
Corner orientation: the number of corner twists has to be divisible by 3.
Permutation parity: this is where I'm having trouble. The overall permutation parity must be even, meaning the corner parity must match the edge parity.
The SymPy library provides a great way for me to work with a number of permutation group properties so I included it in my attempt at computing permutation parity.
The simplest test input that it fails on, when it should succeed, is a back turn of the cube, represented as B.
Here's the code:
def check_permutation_parity(cube):
    corners = cube[:24]
    edges = cube[24:]
    edges = [e - 24 for e in edges]
    corner_perms = corner_perms_only(corners)
    edge_perms = edge_perms_only(edges)
    normalized_corners = [int(c/3) for c in corner_perms]
    normalized_edges = [int(e/2) for e in edge_perms]
    sympy_corners = Permutation(list(normalized_corners))
    sympy_edges = Permutation(list(normalized_edges))
    corners_perm_parity = Permutation(list(normalized_corners)).parity()
    edges_perm_parity = Permutation(list(normalized_edges)).parity()
    if corners_perm_parity != edges_perm_parity:
        return False
    return True
Using a bunch of print statements I've outlined what happens throughout the code:
This is the initial state. It's the B permutation of the cube and looks as expected.
cube:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 12, 13, 14, 21, 22, 23, 15, 16, 17, 24, 25, 26, 27, 30, 31, 28, 29, 32, 33, 36, 37, 34, 35, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
Next we look at the corners and edges. Remember that every edge has had 24 subtracted from it; this is necessary for the eventual conversion to a SymPy permutation.
corners, edges
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18, 19, 20, 12, 13, 14, 21, 22, 23, 15, 16, 17)
[0, 1, 2, 3, 6, 7, 4, 5, 8, 9, 12, 13, 10, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Then we extract just every 3rd corner and every 2nd edge. This lets us look at just the permutation of each piece, because we don't care about orientation.
corner_perms_only, edges_perms_only
(0, 3, 6, 9, 18, 12, 21, 15)
(0, 2, 6, 4, 8, 12, 10, 14, 16, 18, 20, 22)
Then we divide by 3 or 2 to convert to SymPy-friendly indices:
normalized_corners, edges
[0, 1, 2, 3, 6, 4, 7, 5]
[0, 1, 3, 2, 4, 6, 5, 7, 8, 9, 10, 11]
After converting to SymPy the corners look as such:
sympy corners
(4 6 7 5)
[(4, 5), (4, 7), (4, 6)]
[[0], [1], [2], [3], [4, 6, 7, 5]]
And the edges look as such:
sympy edges
(11)(2 3)(5 6)
[(2, 3), (5, 6)]
[[0], [1], [2, 3], [4], [5, 6], [7], [8], [9], [10], [11]]
Giving us these parities, because the corners consist of a single 4-cycle (odd) while the edges consist of two 2-cycles (even):
corners, edges perm parity
1
0
Because the parities differ the function returns false.
B: False
We know that the parities should match, but I can't get that result, and I'm somewhat lost as to where to go for further debugging. All of the code can be found on my GitHub here: https://github.com/elliotmartin/RubikLite/blob/master/Rubik.py
My issue had nothing to do with SymPy and the permutation parities. To check this I implemented my own algorithm for cyclic decomposition and then checked the parities. In the end the issue had to do with how I set up the permutations for each move.
I guess I've learned a lot about testing: if your tests don't test for the correct thing, then they're not that useful.
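For reference, a minimal sketch of the cross-check described above: computing parity by hand via cycle decomposition (assuming the permutation is given as a list p where p[i] is the image of i), which can then be compared against SymPy's Permutation(p).parity():

def parity(p):
    # A cycle of length k decomposes into k - 1 transpositions,
    # so we count transpositions cycle by cycle.
    seen = [False] * len(p)
    transpositions = 0
    for i in range(len(p)):
        if seen[i]:
            continue
        j, length = i, 0
        while not seen[j]:
            seen[j] = True
            j = p[j]
            length += 1
        transpositions += length - 1
    return transpositions % 2  # 0 = even, 1 = odd

# e.g. parity([0, 1, 3, 2, 4, 6, 5, 7, 8, 9, 10, 11]) == 0 (two 2-cycles)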

Change one-dimensional tuple using slicing / object reassignment

I understand that tuples are immutable objects; however, tuples support indexing and slicing. Thus, if I have a tuple assigned to a variable, I can reassign the variable to a new tuple object and change the value at the desired index position.
When I attempt to do this using index slices, I get back a tuple containing multiple tuples. I understand why this happens: I am building a tuple whose elements are the comma-separated slices of the original. But I can't figure out how (if possible) to get back a one-dimensional tuple with a single element changed when working with larger sets of data.
Example:
someNumbers = tuple(i for i in range(0, 20))
print(someNumbers)
someNumbers = someNumbers[:10], 2000, someNumbers[11:]
print(someNumbers)
Outputs the following:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
((0, 1, 2, 3, 4, 5, 6, 7, 8, 9), 2000, (11, 12, 13, 14, 15, 16, 17, 18, 19))
Can I return a one-dimensional tuple and change only the desired index value?
Use concatenation:
someNumbers = someNumbers[:10] + (2000,) + someNumbers[11:]
You can use tuple concatenation:
someNumbers = tuple(i for i in range(0, 20))
print(someNumbers)
# (2000, ) to differentiate it from (2000) which is a number
someNumbers = someNumbers[:10] + (2000,) + someNumbers[11:]
print(someNumbers)
Outputs:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2000, 11, 12, 13, 14, 15, 16, 17, 18, 19)
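The same idea generalizes to any index; a small helper (a sketch with the hypothetical name replace_at) makes the intent explicit:

def replace_at(t, i, value):
    # Rebuild the tuple with the element at index i swapped out
    return t[:i] + (value,) + t[i+1:]

someNumbers = replace_at(someNumbers, 10, 2000)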
