I have around 30M records of sales data, looking like this:
item    | type | days_diff | placed_orders | cancelled_orders
console | ps5  | -10       | 8             | 1
console | xbox | -8        | 6             | 0
console | ps5  | -5        | 4             | 4
console | xbox | -1        | 10            | 7
console | xbox | 0         | 2             | 3
games   | ps5  | -11       | 48            | 9
games   | ps5  | -3        | 2             | 4
games   | xbox | 5         | 10            | 2
I would like to decrease the number of rows by collecting the values into lists of lists per item, like this:
item    | types           | days_diff                | placed_orders        | cancelled_orders
console | ['ps5', 'xbox'] | [[-10, -5], [-8, -1, 0]] | [[8, 4], [6, 10, 2]] | [[1, 4], [0, 7, 3]]
games   | ['ps5', 'xbox'] | [[-11, -3], [5]]         | [[48, 2], [10]]      | [[9, 4], [2]]
How can I achieve this using PySpark?
You can achieve this by performing two groupBy operations: the first on the pair ("item", "type"), and then a second on the column "item":
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    data=[["console", "ps5", -10, 8, 1], ["console", "xbox", -8, 6, 0],
          ["console", "ps5", -5, 4, 4], ["console", "xbox", -1, 10, 7], ["console", "xbox", 0, 2, 3],
          ["games", "ps5", -11, 48, 9], ["games", "ps5", -3, 2, 4], ["games", "xbox", 5, 10, 2]],
    schema=["item", "type", "days_diff", "placed_orders", "cancelled_orders"])

# First groupBy: collect the per-(item, type) values into flat lists.
df = df.groupBy("item", "type").agg(
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)

# Second groupBy: nest those lists per item.
df = df.groupBy("item").agg(
    collect_list("type").alias("types"),
    collect_list("days_diff").alias("days_diff"),
    collect_list("placed_orders").alias("placed_orders"),
    collect_list("cancelled_orders").alias("cancelled_orders")
)

df.show(10, False)
+-------+-----------+------------------------+--------------------+-------------------+
|item |types |days_diff |placed_orders |cancelled_orders |
+-------+-----------+------------------------+--------------------+-------------------+
|console|[ps5, xbox]|[[-10, -5], [-8, -1, 0]]|[[8, 4], [6, 10, 2]]|[[1, 4], [0, 7, 3]]|
|games |[ps5, xbox]|[[-11, -3], [5]] |[[48, 2], [10]] |[[9, 4], [2]] |
+-------+-----------+------------------------+--------------------+-------------------+
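One caveat worth noting: collect_list does not guarantee the order of the collected values, so the lists are mutually aligned within one aggregation, but their order does not necessarily follow the input order. If order matters, a minimal sketch of one workaround (not part of the original answer) is to collect whole rows as structs, sort them by the first struct field (days_diff), and pull each field back out as its own array; this would be applied to the original df, before the two groupBy calls above:

from pyspark.sql.functions import collect_list, sort_array, struct

# Hypothetical variant: sort_array orders the structs by days_diff first,
# then field access on the array of structs splits each field into its own array.
grouped = df.groupBy("item", "type").agg(
    sort_array(collect_list(struct("days_diff", "placed_orders", "cancelled_orders"))).alias("r")
)
grouped = grouped.select(
    "item", "type",
    grouped.r.days_diff.alias("days_diff"),
    grouped.r.placed_orders.alias("placed_orders"),
    grouped.r.cancelled_orders.alias("cancelled_orders")
)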
I want to create a 2-dimensional array and populate it with the following information.
n = 7
j = [0, 1, 2, 3, 4, 5, 6]
k = [0, 2, 4, 3, 3, 2, 1]
l = [0, 46, 52, 30, 36, 56, 40]
So the first list of the list L should be [0, 0, 0], the second should be [1, 2, 46], the third should be [2, 4, 52], and so on. But it's not working; it keeps overriding the values.
#!/usr/bin/python3.6
from operator import itemgetter

n = 7
j = [0, 1, 2, 3, 4, 5, 6]
k = [0, 2, 4, 3, 3, 2, 1]
l = [0, 46, 52, 30, 36, 56, 40]

L = [[0]*3]*n
print(L)

x = 0
y = 0
z = 0
while x < len(j):
    L[x][0] = j[x]
    x = x + 1
while y < len(k):
    L[y][1] = k[y]
    y = y + 1
while z < len(l):
    L[z][2] = l[z]
    z = z + 1
print(L)
The current output is:
Initialization of L, and that's OK:
[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
And L after modification, which is very wrong
[[6, 1, 40], [6, 1, 40], [6, 1, 40], [6, 1, 40], [6, 1, 40], [6, 1, 40], [6, 1, 40]]
First and foremost:
L = [[0]*3]*n
is the wrong way to initialize a 2D list: it creates n references to the same inner list, not n separate lists. If I set L[0][0] to 1, look what happens:
n = 7
L = [[0]*3]*n
L[0][0] = 1
print(L)
Output:
[[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]
To properly initialize a 2D list, do:
L = [[0] * 3 for _ in range(n)]
If you change it to this, your code will work.
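You can see the aliasing directly by checking the identity of the rows; a quick demonstration:

n = 7

L = [[0] * 3] * n                     # n references to one shared inner list
print(len({id(row) for row in L}))    # 1 -- every "row" is the same object

L = [[0] * 3 for _ in range(n)]       # n distinct inner lists
print(len({id(row) for row in L}))    # 7 -- each row is its own object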
But actually, your problem can be solved in a much simpler way. Just use zip:
j = [0, 1, 2, 3, 4, 5, 6]
k = [0, 2, 4, 3, 3, 2, 1]
l = [0, 46, 52, 30, 36, 56, 40]
result = [list(x) for x in zip(j, k, l)]
print(result)
Output:
[[0, 0, 0],
[1, 2, 46],
[2, 4, 52],
[3, 3, 30],
[4, 3, 36],
[5, 2, 56],
[6, 1, 40]]
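One design note: zip stops at the shortest input. If your lists could have different lengths and you want padding instead of truncation, a hedged variant using itertools.zip_longest (the fillvalue of 0 is an assumption, pick whatever makes sense for your data):

from itertools import zip_longest

# Pads shorter lists with 0 instead of dropping the surplus elements.
result = [list(x) for x in zip_longest(j, k, l, fillvalue=0)]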
import pandas as pd

sales = [(3588, [1,2,3,4,5,6], [1,38,9,2,18,5]),
(3588, [2,5,7], [1,2,4,8,14]),
(3588, [3,10,13], [1,3,4,6,12]),
(3588, [4,5,61], [1,2,3,4,11,5]),
(3590, [3,5,6,1,21], [3,10,13]),
(3590, [8,1,2,4,6,9], [2,5,7]),
(3591, [1,2,4,5,13], [1,2,3,4,5,6])
]
labels = ['goods_id', 'properties_id_x', 'properties_id_y']
df = pd.DataFrame.from_records(sales, columns=labels)
df
Out[4]:
goods_id properties_id_x properties_id_y
0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
1 3588 [2, 5, 7] [1, 2, 4, 8, 14]
2 3588 [3, 10, 13] [1, 3, 4, 6, 12]
3 3588 [4, 5, 61] [1, 2, 3, 4, 11, 5]
4 3590 [3, 5, 6, 1, 21] [3, 10, 13]
5 3590 [8, 1, 2, 4, 6, 9] [2, 5, 7]
6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
I have a df of goods and their properties. I need to compare properties_id_x with properties_id_y row by row and return only those rows where both lists contain both 1 and 5. I cannot figure out how to do it.
Desired output:
0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
Option 1:
In [176]: mask = df.apply(lambda r: {1,5} <= (set(r['properties_id_x']) & set(r['properties_id_y'])), axis=1)
In [177]: mask
Out[177]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [178]: df[mask]
Out[178]:
goods_id properties_id_x properties_id_y
0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
Option 2:
In [183]: mask = df.properties_id_x.map(lambda x: {1,5} <= set(x)) & df.properties_id_y.map(lambda x: {1,5} <= set(x))
In [184]: df[mask]
Out[184]:
goods_id properties_id_x properties_id_y
0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
You could also use a set intersection:
df["intersect"] = df.apply(lambda x: set(x["properties_id_x"]).intersection(x["properties_id_y"]), axis=1)
df[df["intersect"].map(lambda x: (1 in x) and (5 in x))]
>> 0 3588 [1, 2, 3, 4, 5, 6] [1, 38, 9, 2, 18, 5]
>> 6 3591 [1, 2, 4, 5, 13] [1, 2, 3, 4, 5, 6]
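Note that this version stores an extra intersect column on df. If you would rather not mutate the frame, the same check can be done inline; a minimal sketch of that variant:

# Compute the intersection on the fly and keep rows where it contains both 1 and 5.
required = {1, 5}
mask = df.apply(lambda r: required <= (set(r["properties_id_x"]) & set(r["properties_id_y"])), axis=1)
print(df[mask])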
First few examples:
Input:
10
1
4 5 6
Output:
6
another one:
Input:
10
2
3 3 3
7 7 4
Output:
4
I wrote this code; it is correct for some cases but not for all. Where is the problem?
n = int(input())
q = int(input())
z = 0
repeat = 0
ans = 0
answ = []
arrx = []
arry = []
for i in range(q):
    maxi = 0
    x, y, w = [int(i) for i in input().split()]
    x, y = x+1, y+1
    if arrx.count(x) >= 1:
        index = arrx.index(x)
        if y == arry[index]:
            if answ[index] == ans:
                repeat += answ[index]
                z = answ[index]
    arrx.append(x)
    arry.append(y)
    if (w > x or w > y) or (w > (n-x) or w > (n-y)):
        maxi = max(x, y, (n-x), (n-y))
    if ((x >= w) or (y >= w)) or (((n-x) >= w) or ((n-y) >= w)):
        maxi = w
    ans = max(ans, maxi)
    answ.append(ans)
    if ans > z:
        repeat = 0
print(ans + repeat)
The problem I see with your code is that you are handling the data as two one-dimensional arrays, arrx and arry, when the problem calls for a two-dimensional array. You should be able to print out your data structure and see the heat map for the volcanoes. For the first example, you've got a single hot volcano in the middle of the map:
10
1
4 5 6
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
[2, 3, 3, 3, 3, 3, 3, 3, 2, 1]
[2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
[2, 3, 4, 5, 5, 5, 4, 3, 2, 1]
[2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
[2, 3, 4, 5, 5, 5, 4, 3, 2, 1]
[2, 3, 4, 4, 4, 4, 4, 3, 2, 1]
[2, 3, 3, 3, 3, 3, 3, 3, 2, 1]
[2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
Where the hottest (6) spot is obviously the one volcano itself. For the second example, you've got two cooler volcanoes:
10
2
3 3 3
7 7 4
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
[0, 1, 2, 2, 2, 1, 0, 0, 0, 0]
[0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
[0, 1, 2, 2, 3, 2, 1, 1, 1, 1]
[0, 1, 1, 1, 2, 3, 2, 2, 2, 2]
[0, 0, 0, 0, 1, 2, 3, 3, 3, 2]
[0, 0, 0, 0, 1, 2, 3, 4, 3, 2]
[0, 0, 0, 0, 1, 2, 3, 3, 3, 2]
[0, 0, 0, 0, 1, 2, 2, 2, 2, 2]
Where the hot spot will either be the hotter of the two volcanoes or potentially some spot that falls in their overlap and gets heated by both. In this case, the overlap spots don't get hotter than the hottest (4) volcano. But if the volcanoes were closer, one or more might have been.
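To illustrate, here is a minimal sketch of that 2D approach. It assumes, as the maps above suggest, that a volcano at column x, row y with heat w contributes max(0, w - d) to every cell at Chebyshev distance d, and that overlapping volcanoes add their contributions; treat the heat model as an inference from the examples, not a statement of the original problem:

n = int(input())                      # size of the n x n map
q = int(input())                      # number of volcanoes
heat = [[0] * n for _ in range(n)]    # properly initialized 2D grid

for _ in range(q):
    x, y, w = map(int, input().split())
    for r in range(n):
        for c in range(n):
            d = max(abs(r - y), abs(c - x))   # Chebyshev (ring) distance
            heat[r][c] += max(0, w - d)       # each volcano adds its heat

for row in heat:                      # print the heat map for inspection
    print(row)
print(max(max(row) for row in heat))  # the hottest spot is the answer

Run against the two examples above, this reproduces the maps shown and prints 6 and 4 respectively.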
I have one key/value pair RDD
{(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)}
How could I get the sparse matrix:
0 1 3 0
1 0 0 0
3 0 0 5
0 0 5 0
i.e.
from pyspark.mllib.linalg import Matrices
Matrices.sparse(4, 4, [0, 2, 3, 5, 6], [1, 2, 0, 0, 3, 2], [1, 3, 1, 3, 5, 5])
or
import numpy as np
from scipy.sparse import csc_matrix
data = [1, 3, 1, 3, 5, 5]
indices = [1, 2, 0, 0, 3, 2]
indptr = [0, 2, 3, 5, 6]
csc_matrix((data, indices, indptr), shape=(4, 4), dtype=float)
Could you apply pivot to the dataframe and then convert it to a matrix?
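If the RDD is small enough to collect, one straightforward route is to build COO triplets on the driver and hand them to scipy; a sketch, assuming the keys map to indices a->0, b->1, c->2, d->3 and that the matrix is symmetric (each entry also fills its mirror), both of which are inferred from the desired output:

from scipy.sparse import csc_matrix

entries = [(("a", "b"), 1), (("a", "c"), 3), (("c", "d"), 5)]  # e.g. rdd.collect()
idx = {"a": 0, "b": 1, "c": 2, "d": 3}                         # assumed key -> index map

rows, cols, data = [], [], []
for (r, c), v in entries:
    rows += [idx[r], idx[c]]   # add both (r, c) and its mirror (c, r)
    cols += [idx[c], idx[r]]
    data += [v, v]

m = csc_matrix((data, (rows, cols)), shape=(4, 4))
print(m.toarray())
# [[0 1 3 0]
#  [1 0 0 0]
#  [3 0 0 5]
#  [0 0 5 0]]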