Pivot and transpose dataset using PySpark


I have about 30M records of sales data that look like this:
item     type  days_diff  placed_orders  cancelled_orders
console  ps5   -10        8              1
console  xbox  -8         6              0
console  ps5   -5         4              4
console  xbox  -1         10             7
console  xbox  0          2              3
games    ps5   -11        48             9
games    ps5   -3         2              4
games    xbox  5          10             2
I would like to reduce the number of rows by building, for each item, a list of lists with one inner list per type, like this:
item     types            days_diff                 placed_orders         cancelled_orders
console  ['ps5', 'xbox']  [[-10, -5], [-8, -1, 0]]  [[8, 4], [6, 10, 2]]  [[1, 4], [0, 7, 3]]
games    ['ps5', 'xbox']  [[-11, -3], [5]]          [[48, 2], [10]]       [[9, 4], [2]]
How can I achieve this using PySpark?

You can achieve this with two groupBy calls: the first on the pair ("item", "type"), the second on "item" alone:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    data=[
        ["console", "ps5", -10, 8, 1], ["console", "xbox", -8, 6, 0],
        ["console", "ps5", -5, 4, 4], ["console", "xbox", -1, 10, 7], ["console", "xbox", 0, 2, 3],
        ["games", "ps5", -11, 48, 9], ["games", "ps5", -3, 2, 4], ["games", "xbox", 5, 10, 2],
    ],
    schema=["item", "type", "days_diff", "placed_orders", "cancelled_orders"],
)
# First pass: one row per (item, type), with each metric collected into a list.
df = df.groupBy("item", "type").agg(
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders"),
)
# Second pass: one row per item, collecting the per-type lists into lists of lists.
df = df.groupBy("item").agg(
    collect_list("type").alias("types"),
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders"),
)
df.show(10, False)
+-------+-----------+------------------------+--------------------+-------------------+
|item   |types      |days_diff               |placed_orders       |cancelled_orders   |
+-------+-----------+------------------------+--------------------+-------------------+
|console|[ps5, xbox]|[[-10, -5], [-8, -1, 0]]|[[8, 4], [6, 10, 2]]|[[1, 4], [0, 7, 3]]|
|games  |[ps5, xbox]|[[-11, -3], [5]]        |[[48, 2], [10]]     |[[9, 4], [2]]      |
+-------+-----------+------------------------+--------------------+-------------------+
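To make the two grouping passes easier to follow, here is a minimal plain-Python sketch of the same logic on the sample rows. It is only an illustration, not Spark code: the names rows, by_item_type and by_item are made up for this example. It also behaves differently in one respect: plain Python keeps the input order, whereas Spark's collect_list does not guarantee element order after a shuffle, so on real data the inner lists may come back in a different order.

```python
from collections import defaultdict

rows = [
    ("console", "ps5", -10, 8, 1), ("console", "xbox", -8, 6, 0),
    ("console", "ps5", -5, 4, 4), ("console", "xbox", -1, 10, 7),
    ("console", "xbox", 0, 2, 3), ("games", "ps5", -11, 48, 9),
    ("games", "ps5", -3, 2, 4), ("games", "xbox", 5, 10, 2),
]

# Pass 1: group by (item, type) and collect each metric into its own list.
by_item_type = defaultdict(lambda: ([], [], []))
for item, type_, days, placed, cancelled in rows:
    d, p, c = by_item_type[(item, type_)]
    d.append(days)
    p.append(placed)
    c.append(cancelled)

# Pass 2: group by item and collect the per-type lists into lists of lists.
by_item = defaultdict(
    lambda: {"types": [], "days_diff": [], "placed_orders": [], "cancelled_orders": []}
)
for (item, type_), (d, p, c) in by_item_type.items():
    out = by_item[item]
    out["types"].append(type_)
    out["days_diff"].append(d)
    out["placed_orders"].append(p)
    out["cancelled_orders"].append(c)

for item, out in by_item.items():
    print(item, out)
```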

Related

Simplify numpy expression [duplicate]

This question has been marked as a duplicate of "Access n-th dimension in python".
How can I simplify this:
import numpy as np
ex = np.arange(27).reshape(3, 3, 3)
def get_plane(axe, index):
    return ex.swapaxes(axe, 0)[index]  # is there a better way?
I cannot find a numpy function that extracts a plane from a higher-dimensional array. Is there one?
EDIT
The ex.take(index, axis=axe) method is great, but it copies the array instead of returning a view, which is what I originally wanted.
So what is the shortest way to index an n-dimensional array (without copying) to get a 2D slice of it, given an index and an axis?

Inspired by this answer, you can do something like this:
def get_plane(axe, index):
    slices = [slice(None)] * len(ex.shape)
    slices[axe] = index
    return ex[tuple(slices)]

get_plane(1, 1)
output:
array([[ 3,  4,  5],
       [12, 13, 14],
       [21, 22, 23]])
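As a quick check that this approach returns a view rather than a copy, here is a small self-contained sketch. It assumes a variant of get_plane that takes the array as a parameter (the arr argument is not in the original) and uses np.shares_memory to compare against take:

```python
import numpy as np

def get_plane(arr, axis, index):
    """Index `arr` at `index` along `axis` using basic slicing (returns a view)."""
    slices = [slice(None)] * arr.ndim
    slices[axis] = index
    return arr[tuple(slices)]

ex = np.arange(27).reshape(3, 3, 3)
plane = get_plane(ex, 1, 1)

# Basic indexing shares memory with the source array.
view_shares = np.shares_memory(plane, ex)
# take() builds a new array, so it does not share memory.
take_shares = np.shares_memory(ex.take(1, axis=1), ex)
print(view_shares, take_shares)
```

Because the result is a view, writing to plane also modifies ex.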

What do you mean by a 'plane'?
In [16]: ex = np.arange(27).reshape(3, 3, 3)
Names like plane, row, and column, are arbitrary conventions, not formally defined in numpy. The default display of this array looks like 3 'planes' or 'blocks', each with rows and columns:
In [17]: ex
Out[17]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
Standard indexing lets us view any 2d block, in any dimension:
In [18]: ex[0]
Out[18]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [19]: ex[0,:,:]
Out[19]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [20]: ex[:,0,:]
Out[20]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [21]: ex[:,:,0]
Out[21]:
array([[ 0,  3,  6],
       [ 9, 12, 15],
       [18, 21, 24]])
There are ways of saying "I want block 0 in dimension 1", and so on, but first make sure you understand this indexing; it is core numpy functionality.
In [23]: np.take(ex, 0, 1)
Out[23]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [24]: idx = (slice(None), 0, slice(None))  # also np.s_[:,0,:]
In [25]: ex[idx]
Out[25]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
And yes, you can swap axes (or transpose), if that suits your needs.
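One possible variant of the axis-swapping idea is np.moveaxis, sketched below under the assumption that you want the remaining axes to keep their original order. The helper name get_plane_moveaxis is made up for this example. Note that swapaxes(axis, 0) exchanges the two axes, so for axis 2 of a 3D array it yields the transposed plane, whereas moveaxis does not.

```python
import numpy as np

ex = np.arange(27).reshape(3, 3, 3)

def get_plane_moveaxis(arr, axis, index):
    # Move `axis` to the front, then index it; the other axes keep their order.
    return np.moveaxis(arr, axis, 0)[index]

plane = get_plane_moveaxis(ex, 2, 0)

same_as_slice = np.array_equal(plane, ex[:, :, 0])
is_view = np.shares_memory(plane, ex)
# swapaxes(2, 0)[0] gives the same values, but transposed.
swap_is_transposed = np.array_equal(ex.swapaxes(2, 0)[0], ex[:, :, 0].T)
print(same_as_slice, is_view, swap_is_transposed)
```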

How to get the number of rows if you have both 1D and 2D arrays?

I have the following two arrays and would like to get the number of rows of each using .shape.
X = np.array([0, 4, 3, 5, 1, 2])
Y = np.array([[-1, 0, 4, 4],
              [ 1, 0, 5, 0],
              [ 2, 7, 4, 0],
              [ 3, 0, 4, 9],
              [ 4, 6, 4, 0]])
X.shape[0]
Y.shape[0]
The result is
6
5
Because I think of X as a matrix with 1 row, I expected X.shape[0] to return 1. However, it returns 6, which is the number of elements (the columns, in my view). Could you please suggest a function that achieves my goal?

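Following @Divakar's comment, the function that achieves this is np.atleast_2d: it views a 1D array of length 6 as a 2D array of shape (1, 6) and leaves arrays that are already 2D unchanged.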
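A minimal sketch of that suggestion, using the arrays from the question. The helper name n_rows is an assumption made for this example:

```python
import numpy as np

X = np.array([0, 4, 3, 5, 1, 2])
Y = np.array([[-1, 0, 4, 4],
              [ 1, 0, 5, 0],
              [ 2, 7, 4, 0],
              [ 3, 0, 4, 9],
              [ 4, 6, 4, 0]])

def n_rows(a):
    # atleast_2d turns shape (6,) into (1, 6); 2D input keeps its shape.
    return np.atleast_2d(a).shape[0]

print(n_rows(X))  # 1
print(n_rows(Y))  # 5
```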

How to use Numpy .tobytes() to serialize objects

How do you serialize and deserialize a numpy array?
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
print(A.dtype)
snapshot = A
serialized = snapshot.tobytes()
[[9 5 5 7 4]
 [3 8 8 1 0]
 [5 7 1 0 2]
 [2 2 7 1 2]
 [2 6 3 5 4]
 [7 5 4 8 3]
 [2 4 2 4 7]
 [3 4 2 6 2]]
int64
Deserializing, however, returns only zeros:
deserialized = np.frombuffer(serialized).astype(np.int64)
print(deserialized)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]

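There is a mismatch between the dtype of A (an integer type) and the default dtype of np.frombuffer, which is float: the integer bytes are reinterpreted as floating-point numbers before the cast. It works as expected when you pass the correct dtype (which may depend on the machine / Python / numpy version) and restore the shape, since tobytes() stores neither: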
# Python 3.6 64-bits with numpy 1.12.1 64-bits
A = np.random.randint(0, 10, 40).reshape(8, 5)
print(A)
>>> array([[3, 3, 5, 3, 9],
           [1, 4, 7, 1, 8],
           [1, 7, 4, 3, 0],
           [9, 2, 9, 1, 2],
           [2, 8, 9, 1, 1],
           [3, 3, 5, 2, 6],
           [5, 0, 2, 7, 6],
           [2, 8, 8, 0, 7]])
A.dtype
>>> dtype('int32')
deserialized = np.frombuffer(A.tobytes(), dtype=np.int32).reshape(A.shape)
print(deserialized)
>>> array([[3, 3, 5, 3, 9],
           [1, 4, 7, 1, 8],
           [1, 7, 4, 3, 0],
           [9, 2, 9, 1, 2],
           [2, 8, 9, 1, 1],
           [3, 3, 5, 2, 6],
           [5, 0, 2, 7, 6],
           [2, 8, 8, 0, 7]])
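To avoid hard-coding a dtype that differs between platforms, one option is to carry the dtype and shape along with the bytes. This is a minimal sketch of that idea; the payload tuple is a made-up container for this example, not a numpy format:

```python
import numpy as np

A = np.random.randint(0, 10, 40).reshape(8, 5)

# tobytes() keeps only the raw buffer, so record dtype and shape alongside it.
payload = (A.tobytes(), A.dtype.str, A.shape)

# Later: rebuild the array from the three pieces.
raw, dtype_str, shape = payload
restored = np.frombuffer(raw, dtype=np.dtype(dtype_str)).reshape(shape)

round_trip_ok = np.array_equal(A, restored) and restored.dtype == A.dtype
print(round_trip_ok)
```

An array created by np.frombuffer from a bytes object is read-only; call restored.copy() if you need to modify it.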

Compare lists of column rows and using filters on them in pandas

import pandas as pd

sales = [(3588, [1,2,3,4,5,6], [1,38,9,2,18,5]),
         (3588, [2,5,7], [1,2,4,8,14]),
         (3588, [3,10,13], [1,3,4,6,12]),
         (3588, [4,5,61], [1,2,3,4,11,5]),
         (3590, [3,5,6,1,21], [3,10,13]),
         (3590, [8,1,2,4,6,9], [2,5,7]),
         (3591, [1,2,4,5,13], [1,2,3,4,5,6])
        ]
labels = ['goods_id', 'properties_id_x', 'properties_id_y']
df = pd.DataFrame.from_records(sales, columns=labels)
df
Out[4]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
1      3588           [2, 5, 7]      [1, 2, 4, 8, 14]
2      3588         [3, 10, 13]      [1, 3, 4, 6, 12]
3      3588          [4, 5, 61]   [1, 2, 3, 4, 11, 5]
4      3590    [3, 5, 6, 1, 21]           [3, 10, 13]
5      3590  [8, 1, 2, 4, 6, 9]             [2, 5, 7]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]
I have a df of goods and their properties. I need to compare properties_id_x with properties_id_y row by row and return only those rows where both lists contain both 1 and 5. I cannot figure out how to do it.
Desired output:
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]

Option 1:
In [176]: mask = df.apply(lambda r: {1,5} <= (set(r['properties_id_x']) & set(r['properties_id_y'])), axis=1)
In [177]: mask
Out[177]:
0     True
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
In [178]: df[mask]
Out[178]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]
Option 2:
In [183]: mask = df.properties_id_x.map(lambda x: {1,5} <= set(x)) & df.properties_id_y.map(lambda x: {1,5} <= set(x))
In [184]: df[mask]
Out[184]:
   goods_id     properties_id_x       properties_id_y
0      3588  [1, 2, 3, 4, 5, 6]  [1, 38, 9, 2, 18, 5]
6      3591    [1, 2, 4, 5, 13]    [1, 2, 3, 4, 5, 6]

You could also use a set intersection:
df["intersect"] = df.apply(lambda x: set(x["properties_id_x"]).intersection(x["properties_id_y"]), axis=1)
df[df["intersect"].map(lambda x: (1 in x) and (5 in x))]
>> 0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
>> 6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
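The answers above all rely on the same idea: {1, 5} <= set(x) is a subset test, true only when x contains both values. As a hedged sketch, the same test can be wrapped in a reusable helper that works for any required values and any list-valued columns. The name rows_containing and its parameters are assumptions made for this example:

```python
import pandas as pd

sales = [(3588, [1, 2, 3, 4, 5, 6], [1, 38, 9, 2, 18, 5]),
         (3588, [2, 5, 7], [1, 2, 4, 8, 14]),
         (3588, [3, 10, 13], [1, 3, 4, 6, 12]),
         (3588, [4, 5, 61], [1, 2, 3, 4, 11, 5]),
         (3590, [3, 5, 6, 1, 21], [3, 10, 13]),
         (3590, [8, 1, 2, 4, 6, 9], [2, 5, 7]),
         (3591, [1, 2, 4, 5, 13], [1, 2, 3, 4, 5, 6])]
labels = ['goods_id', 'properties_id_x', 'properties_id_y']
df = pd.DataFrame.from_records(sales, columns=labels)

def rows_containing(frame, required, columns):
    """Keep rows where the list in every given column contains all `required` values."""
    required = set(required)
    mask = [
        all(required <= set(cell) for cell in cells)  # subset test per column
        for cells in zip(*(frame[c] for c in columns))
    ]
    return frame[mask]

result = rows_containing(df, {1, 5}, ['properties_id_x', 'properties_id_y'])
print(result.index.tolist())
```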

Nested List using indexing and slicing

How do I slice or index this list in order to get the answer below? I've tried doing multiple methods of slicing and nothing has worked for me.
L = [0, [], [1,2,3,4], [[5],[6,7]], [8,9,10]]
newL = [L[0],L[2][1],L[2][2],L[3][0]]
Answer: [0, 2, 3, [5, 6], 8, 10]
newL is what I have so far, but I can't seem to get the [6,7] split in the nested list.

We start with:
L = [0, [], [1, 2, 3, 4], [[5], [6, 7]], [8, 9, 10]]
We want:
[0, 2, 3, [5, 6], 8, 10]
Let's start from the farthest inside. We need [5, 6]. These are both buried:
>>> L[3][0][0]
5
>>> L[3][1][0]
6
>>> [L[3][0][0], L[3][1][0]]
[5, 6]
One layer farther out we need 2, 3:
>>> L[2][1]
2
>>> L[2][2]
3
Now put it together:
>>> [L[0], L[2][1], L[2][2], [L[3][0][0], L[3][1][0]], L[4][0], L[4][2]]
[0, 2, 3, [5, 6], 8, 10]
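Put together as a standalone script, the steps above look like this. The intermediate names inner and newL are just for illustration:

```python
L = [0, [], [1, 2, 3, 4], [[5], [6, 7]], [8, 9, 10]]

# Innermost piece first: take the first element of each sublist of L[3].
inner = [L[3][0][0], L[3][1][0]]

# Then assemble the final list from the individual indexed elements.
newL = [L[0], L[2][1], L[2][2], inner, L[4][0], L[4][2]]
print(newL)  # [0, 2, 3, [5, 6], 8, 10]
```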
