Pivot and transpose dataset using PySpark - apache-spark

I have about 30M records of sales data that look like this:
item     type  days_diff  placed_orders  cancelled_orders
console  ps5         -10              8                 1
console  xbox         -8              6                 0
console  ps5          -5              4                 4
console  xbox         -1             10                 7
console  xbox          0              2                 3
games    ps5         -11             48                 9
games    ps5          -3              2                 4
games    xbox          5             10                 2
I would like to reduce the number of rows by creating a list of lists for each item, like this:
item     types            days_diff                 placed_orders         cancelled_orders
console  ['ps5', 'xbox']  [[-10, -5], [-8, -1, 0]]  [[8, 4], [6, 10, 2]]  [[1, 4], [0, 7, 3]]
games    ['ps5', 'xbox']  [[-11, -3], [5]]          [[48, 2], [10]]       [[9, 4], [2]]
How can I achieve this using PySpark?

You can achieve this with two groupBy operations: the first on the pair ("item", "type"), the second on "item" alone:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    data=[["console", "ps5", -10, 8, 1], ["console", "xbox", -8, 6, 0],
          ["console", "ps5", -5, 4, 4], ["console", "xbox", -1, 10, 7], ["console", "xbox", 0, 2, 3],
          ["games", "ps5", -11, 48, 9], ["games", "ps5", -3, 2, 4], ["games", "xbox", 5, 10, 2]
    ], schema=["item", "type", "days_diff", "placed_orders", "cancelled_orders"])
# First pass: one row per (item, type), each metric collected into a list
df = df.groupBy("item", "type").agg(
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)
# Second pass: one row per item, collecting the per-type lists into lists of lists
df = df.groupBy("item").agg(
    collect_list("type").alias("types"),
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)
df.show(10, False)
+-------+-----------+------------------------+--------------------+-------------------+
|item   |types      |days_diff               |placed_orders       |cancelled_orders   |
+-------+-----------+------------------------+--------------------+-------------------+
|console|[ps5, xbox]|[[-10, -5], [-8, -1, 0]]|[[8, 4], [6, 10, 2]]|[[1, 4], [0, 7, 3]]|
|games  |[ps5, xbox]|[[-11, -3], [5]]        |[[48, 2], [10]]     |[[9, 4], [2]]      |
+-------+-----------+------------------------+--------------------+-------------------+
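To make clear what each of the two grouping passes contributes, here is a minimal plain-Python sketch of the same logic on the sample rows. It is only an illustration, not a replacement for the Spark job: the function name two_stage_group and the dict layout are made up for this example.

```python
from collections import defaultdict

METRICS = ("days_diff", "placed_orders", "cancelled_orders")

rows = [
    ("console", "ps5", -10, 8, 1), ("console", "xbox", -8, 6, 0),
    ("console", "ps5", -5, 4, 4), ("console", "xbox", -1, 10, 7),
    ("console", "xbox", 0, 2, 3), ("games", "ps5", -11, 48, 9),
    ("games", "ps5", -3, 2, 4), ("games", "xbox", 5, 10, 2),
]


def two_stage_group(rows):
    """Hypothetical helper mirroring the two groupBy passes."""
    # Pass 1: group by (item, type) and collect each metric into a list.
    by_item_type = defaultdict(lambda: {m: [] for m in METRICS})
    for item, typ, *values in rows:
        for metric, value in zip(METRICS, values):
            by_item_type[(item, typ)][metric].append(value)

    # Pass 2: group by item and collect the per-type lists into lists of lists.
    by_item = {}
    for (item, typ), lists in by_item_type.items():
        out = by_item.setdefault(item, {"types": [], **{m: [] for m in METRICS}})
        out["types"].append(typ)
        for metric in METRICS:
            out[metric].append(lists[metric])
    return by_item


result = two_stage_group(rows)
print(result["console"]["types"])      # ['ps5', 'xbox']
print(result["console"]["days_diff"])  # [[-10, -5], [-8, -1, 0]]
print(result["games"]["placed_orders"])  # [[48, 2], [10]]
```

In plain Python the lists keep input order. Spark documents collect_list as non-deterministic with respect to ordering after a shuffle, so on a real cluster the order of elements inside each list may differ from the order of the input rows.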

Related

Simplify numpy expression [duplicate]

How can I simplify this:
import numpy as np
ex = np.arange(27).reshape(3, 3, 3)
def get_plane(axe, index):
    return ex.swapaxes(axe, 0)[index] # is there a better way ?
I cannot find a numpy function to get a plane in a higher dimensional array, is there one?
EDIT
The ex.take(index, axis=axe) method is great, but it copies the array instead of returning a view, which is what I originally wanted.
So what is the shortest way to index an n-dimensional array (without copying) to get a 2d slice of it, given an index and an axis?
Inspired by this answer, you can do something like this:
def get_plane(axe, index):
    slices = [slice(None)]*len(ex.shape)
    slices[axe]=index
    return ex[tuple(slices)]
get_plane(1,1)
output:
array([[ 3,  4,  5],
       [12, 13, 14],
       [21, 22, 23]])
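Since the question is specifically about getting a view rather than a copy, the following sketch checks that with np.shares_memory. It assumes a variant of get_plane that takes the array as an argument (the parameter names arr and axis are chosen here for illustration) and compares it with take:

```python
import numpy as np


def get_plane(arr, axis, index):
    """Index `arr` at `index` along `axis` using basic slicing only."""
    slices = [slice(None)] * arr.ndim
    slices[axis] = index
    return arr[tuple(slices)]


ex = np.arange(27).reshape(3, 3, 3)

plane = get_plane(ex, 1, 1)     # same as ex[:, 1, :]
copied = ex.take(1, axis=1)     # same values, but a new array

print(np.shares_memory(plane, ex))   # True
print(np.shares_memory(copied, ex))  # False

# Writing through the view changes the original array.
plane[0, 0] = -1
print(ex[0, 1, 0])  # -1
```

Because the index tuple contains only slices and one integer, numpy treats it as basic indexing, which is why the result shares memory with ex.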
What do you mean by a 'plane'?
In [16]: ex = np.arange(27).reshape(3, 3, 3)
Names like plane, row, and column, are arbitrary conventions, not formally defined in numpy. The default display of this array looks like 3 'planes' or 'blocks', each with rows and columns:
In [17]: ex
Out[17]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
Standard indexing lets us view any 2d block, in any dimension:
In [18]: ex[0]
Out[18]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [19]: ex[0,:,:]
Out[19]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [20]: ex[:,0,:]
Out[20]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [21]: ex[:,:,0]
Out[21]:
array([[ 0,  3,  6],
       [ 9, 12, 15],
       [18, 21, 24]])
There are ways of saying I want block 0 in dimension 1 etc, but first make sure you understand this indexing. This is the core numpy functionality.
In [23]: np.take(ex, 0, 1)
Out[23]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [24]: idx = (slice(None), 0, slice(None)) # also np.s_[:,0,:]
In [25]: ex[idx]
Out[25]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
And yes, you can swap axes (or transpose), if that suits your needs.

How to get the number of rows if you have both 1D and 2D arrays?

I have the following two arrays and would like to get the number of rows using .shape.
import numpy as np

X = np.array([0, 4, 3, 5, 1, 2])
Y = np.array([[-1, 0, 4, 4],
              [ 1, 0, 5, 0],
              [ 2, 7, 4, 0],
              [ 3, 0, 4, 9],
              [ 4, 6, 4, 0]])
X.shape[0]
Y.shape[0]
The result is
6
5
Because X is a matrix with 1 row, I expected X.shape[0] to return 1. However, it returns 6, which is the number of columns. Could you please suggest a function that achieves this?
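Following @Divakar's comment, the function that achieves this is np.atleast_2d. X is a 1-D array of shape (6,), so X.shape[0] is its length; np.atleast_2d(X) has shape (1, 6), so its shape[0] is 1, while an array that is already 2-D, such as Y, keeps its shape.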
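As a short sketch of how np.atleast_2d can be used here, the helper below (n_rows is a name invented for this example) promotes its input to at least two dimensions before reading shape[0]:

```python
import numpy as np

X = np.array([0, 4, 3, 5, 1, 2])
Y = np.array([[-1, 0, 4, 4],
              [ 1, 0, 5, 0],
              [ 2, 7, 4, 0],
              [ 3, 0, 4, 9],
              [ 4, 6, 4, 0]])


def n_rows(a):
    """Row count, treating a 1-D array as a single row."""
    return np.atleast_2d(a).shape[0]


print(X.shape)                 # (6,)
print(np.atleast_2d(X).shape)  # (1, 6)
print(n_rows(X))               # 1
print(n_rows(Y))               # 5
```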

How to use Numpy .tobytes() to serialize objects

How do you serialize/deserialize a numpy array?
import numpy as np

A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
print(A.dtype)
snapshot = A
serialized = snapshot.tobytes()
[[9 5 5 7 4]
 [3 8 8 1 0]
 [5 7 1 0 2]
 [2 2 7 1 2]
 [2 6 3 5 4]
 [7 5 4 8 3]
 [2 4 2 4 7]
 [3 4 2 6 2]]
int64
Deserializing it returns only zeros:
deserialized = np.frombuffer(serialized).astype(np.int64)
print (deserialized)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0]
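The dtype used to create A does not match the one np.frombuffer assumes: np.frombuffer defaults to dtype=float, so the integer bytes are reinterpreted as floating-point values, which here truncate to 0 under astype(np.int64). It works as expected when you pass the correct dtype (which may depend on the machine / Python / numpy version) and restore the shape: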
# Python 3.6 64-bits with numpy 1.12.1 64-bits
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
>>> array([[3, 3, 5, 3, 9],
           [1, 4, 7, 1, 8],
           [1, 7, 4, 3, 0],
           [9, 2, 9, 1, 2],
           [2, 8, 9, 1, 1],
           [3, 3, 5, 2, 6],
           [5, 0, 2, 7, 6],
           [2, 8, 8, 0, 7]])
A.dtype
>>> dtype('int32')
deserialized = np.frombuffer(A.tobytes(), dtype=np.int32).reshape(A.shape)
print(deserialized)
>>> array([[3, 3, 5, 3, 9],
           [1, 4, 7, 1, 8],
           [1, 7, 4, 3, 0],
           [9, 2, 9, 1, 2],
           [2, 8, 9, 1, 1],
           [3, 3, 5, 2, 6],
           [5, 0, 2, 7, 6],
           [2, 8, 8, 0, 7]])
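Since the default integer dtype differs between machines, one way to avoid hard-coding it is to carry the dtype and shape along with the raw bytes. The sketch below assumes two small helpers, serialize and deserialize, whose names and tuple layout are invented for this example; tobytes itself stores neither dtype nor shape.

```python
import numpy as np


def serialize(arr):
    """Return the raw bytes plus the metadata needed to rebuild the array."""
    return arr.tobytes(), arr.dtype.str, arr.shape


def deserialize(raw, dtype_str, shape):
    """Rebuild the array from bytes using the recorded dtype and shape."""
    return np.frombuffer(raw, dtype=np.dtype(dtype_str)).reshape(shape)


A = np.random.randint(0, 10, 40).reshape(8, 5)
raw, dtype_str, shape = serialize(A)
B = deserialize(raw, dtype_str, shape)

print(np.array_equal(A, B))  # True
print(B.dtype == A.dtype)    # True
print(B.flags.writeable)     # False
```

The array returned by np.frombuffer on a bytes object is read-only; call .copy() on it if you need to modify the result.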

Compare lists of column rows and using filters on them in pandas

import pandas as pd

sales = [(3588, [1,2,3,4,5,6], [1,38,9,2,18,5]),
         (3588, [2,5,7], [1,2,4,8,14]),
         (3588, [3,10,13], [1,3,4,6,12]),
         (3588, [4,5,61], [1,2,3,4,11,5]),
         (3590, [3,5,6,1,21], [3,10,13]),
         (3590, [8,1,2,4,6,9], [2,5,7]),
         (3591, [1,2,4,5,13], [1,2,3,4,5,6])
        ]
labels = ['goods_id', 'properties_id_x', 'properties_id_y']
df = pd.DataFrame.from_records(sales, columns=labels)
df
Out[4]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
1      3588           [2, 5, 7]      [1, 2, 4, 8, 14]
2      3588         [3, 10, 13]      [1, 3, 4, 6, 12]
3      3588          [4, 5, 61]   [1, 2, 3, 4, 11, 5]
4      3590    [3, 5, 6, 1, 21]           [3, 10, 13]
5      3590  [8, 1, 2, 4, 6, 9]             [2, 5, 7]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]
I have a DataFrame of goods and their properties. I need to compare properties_id_x with properties_id_y row by row and return only those rows where both lists contain both 1 and 5. I cannot figure out how to do it.
Desired output:
0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
Option 1:
In [176]: mask = df.apply(lambda r: {1,5} <= (set(r['properties_id_x']) & set(r['properties_id_y'])), axis=1)
In [177]: mask
Out[177]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [178]: df[mask]
Out[178]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]
Option 2:
In [183]: mask = df.properties_id_x.map(lambda x: {1,5} <= set(x)) & df.properties_id_y.map(lambda x: {1,5} <= set(x))
In [184]: df[mask]
Out[184]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]
You could also use a set intersection:
df["intersect"] = df.apply(lambda x: set(x["properties_id_x"]).intersection(x["properties_id_y"]), axis=1)
df[df["intersect"].map(lambda x: (1 in x) and (5 in x))]
>> 0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
>> 6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
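The approaches above all reduce to the same check: the required values must be a subset of every list column. A possible generalization, sketched here with a made-up helper name rows_containing_all, takes the columns and the required values as parameters:

```python
import pandas as pd


def rows_containing_all(df, columns, required):
    """Keep rows where every listed column contains all `required` values."""
    required = set(required)
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # set.issubset accepts any iterable, so the lists need no conversion
        mask &= df[col].map(lambda values: required.issubset(values))
    return df[mask]


sales = [(3588, [1, 2, 3, 4, 5, 6], [1, 38, 9, 2, 18, 5]),
         (3588, [2, 5, 7], [1, 2, 4, 8, 14]),
         (3588, [3, 10, 13], [1, 3, 4, 6, 12]),
         (3588, [4, 5, 61], [1, 2, 3, 4, 11, 5]),
         (3590, [3, 5, 6, 1, 21], [3, 10, 13]),
         (3590, [8, 1, 2, 4, 6, 9], [2, 5, 7]),
         (3591, [1, 2, 4, 5, 13], [1, 2, 3, 4, 5, 6])]
labels = ['goods_id', 'properties_id_x', 'properties_id_y']
df = pd.DataFrame.from_records(sales, columns=labels)

result = rows_containing_all(df, ['properties_id_x', 'properties_id_y'], {1, 5})
print(list(result.index))  # [0, 6]
```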

Nested List using indexing and slicing

How do I slice or index this list in order to get the answer below? I've tried doing multiple methods of slicing and nothing has worked for me.
L = [0, [], [1,2,3,4], [[5],[6,7]], [8,9,10]]
newL = [L[0],L[2][1],L[2][2],L[3][0]]
Desired answer: [0, 2, 3, [5, 6], 8, 10]
newL is what I have so far, but I can't seem to get the [6,7] split in the nested list.
We start with:
L = [0, [], [1, 2, 3, 4], [[5], [6, 7]], [8, 9, 10]]
We want:
[0, 2, 3, [5, 6], 8, 10]
Let's start from the farthest inside. We need [5, 6]. These are both buried:
>>> L[3][0][0]
5
>>> L[3][1][0]
6
>>> [L[3][0][0], L[3][1][0]]
[5, 6]
One layer farther out we need 2, 3:
>>> L[2][1]
2
>>> L[2][2]
3
Now put it together:
>>> [L[0], L[2][1], L[2][2], [L[3][0][0], L[3][1][0]], L[4][0], L[4][2]]
[0, 2, 3, [5, 6], 8, 10]

Resources