I have a data set in pyspark like this :
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
| 1| 1| speed| 50|
| 1| 1| speed| 60|
| 1| 2| door| open|
| 1| 2| door| open|
| 1| 2| door|close|
| 1| 2| speed| 75|
| 2| 10| speed| 30|
| 2| 11| door| open|
| 2| 12| door| open|
| 2| 13| speed| 50|
| 2| 13| speed| 40|
+---+----+--------+-----+
What I want to get is something like below where grouping by id and time and pivot on category and if it is numeric return the average and if it is categorical it returns the mode.
+---+----+--------+-----+
| id|time| door|speed|
+---+----+--------+-----+
| 1| 1| null| 55|
| 1| 2| open| 75|
| 2| 10| null| 30|
| 2| 11| open| null|
| 2| 12| open| null|
| 2| 13| null| 45|
+---+----+--------+-----+
I tried this but for categorical value it returns null (I am not worry about nulls in speed column)
df = user_df\
.groupBy('id','time')\
.pivot('category')\
.agg(avg('value'))\
.orderBy(['id', 'time'])\
df.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|null| 75.0|
| 2| 10|null| 30.0|
| 2| 11|null| null|
| 2| 12|null| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
You can do an additional pivot and coalesce them. try this.
import pyspark.sql.functions as F
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
#%%
#user_df.show()
df = user_df.groupBy('id','time')\
.pivot('category')\
.agg(F.avg('value').alias('avg'),F.max('value').alias('max'))\
#%%
expr1= [x for x in df.columns if '_avg' in x]
expr2= [x for x in df.columns if 'max' in x]
expr=zip(expr1,expr2)
#%%
sel_expr= [F.coalesce(x[0],x[1]).alias(x[0].split('_')[0]) for x in expr]
#%%
df_final = df.select('id','time',*sel_expr).orderBy('id','time')
df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|open| 75.0|
| 2| 10|null| 30.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
Try collecting the data and transforming as required
spark 2.4+
user_df.groupby('id','time').pivot('category').agg(collect_list('value')).\
select('id','time',col('door')[0].alias('door'),expr('''aggregate(speed, cast(0.0 as double), (acc, x) -> acc + x, acc -> acc/size(speed))''').alias('speed')).show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 2| 13|null| 45.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 10|null| 30.0|
| 1| 2|open| 75.0|
+---+----+----+-----+
I have made this algorithm, but with higher numbers looks like that doesn't work or its very slow, it will run in a cluster of big data(cloudera), so i think that i have to put the function into pyspark, any tip how improve it please
import pandas as pd import itertools as itts
number_list = [10953, 10423, 10053]
def reducer(nums): def ranges(n): print(n) return range(n, -1, -1)
num_list = list(map(ranges, nums)) return list(itts.product(*num_list))
data=pd.DataFrame(reducer(number_list)) print(data)
You can use crossJoin with DataFrame:
Here we have a simple example trying to compute the cross-product of three arrays,
i.e. [1,0], [2,1,0], [3,2,1,0]. Their cross-product should have 2*3*4 = 24 elements.
The code below shows how to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1,),(0,)], ['v1'])
df2 = spark.createDataFrame([(2,), (1,),(0,)], ['v2'])
df3 = spark.createDataFrame([(3,), (2,),(1,),(0,)], ['v3'])
df1.show()
df2.show()
df3.show()
+---+
| v1|
+---+
| 1|
| 0|
+---+
+---+
| v2|
+---+
| 2|
| 1|
| 0|
+---+
+---+
| v3|
+---+
| 3|
| 2|
| 1|
| 0|
+---+
df = df1.crossJoin(df2).crossJoin(df3)
print('----------- Total rows: ', df.count())
df.show(30)
----------- Total rows: 24
+---+---+---+
| v1| v2| v3|
+---+---+---+
| 1| 2| 3|
| 1| 2| 2|
| 1| 2| 1|
| 1| 2| 0|
| 1| 1| 3|
| 1| 1| 2|
| 1| 1| 1|
| 1| 1| 0|
| 1| 0| 3|
| 1| 0| 2|
| 1| 0| 1|
| 1| 0| 0|
| 0| 2| 3|
| 0| 2| 2|
| 0| 2| 1|
| 0| 2| 0|
| 0| 1| 3|
| 0| 1| 2|
| 0| 1| 1|
| 0| 1| 0|
| 0| 0| 3|
| 0| 0| 2|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Your computation is pretty big:
(10953+1)*(10423+1)*(10053+1)=1148010922784, about 1 trillion rows. I would suggest increase the numbers slowly, spark is not as fast as you think when it involves table joins.
Also, try use broadcast on all your initial DataFrames, i.e. df1, df2, df3. See if it helps.
In my graph I need to detect vertices that do not have inbound relations. Using the example below, "a" is the only node that is not being related by the anyone.
a --> b
b --> c
c --> d
c --> b
I would really appreciate any examples to detect "a" type nodes in my graph.
Thanks
unfortunately the approach is not as simple because the graph.degress, graph.inDegrees, graph.outDegrees functions are not returning vertices with 0 edges.
(see documentation for Scala which holds true for Python too https://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.GraphFrame)
so the following code will always return a empty dataframe
g=Graph(vertices,edges)
# check for start points
g.inDegrees.filter("inDegree==0").show()
+---+--------+
| id|inDegree|
+---+--------+
+---+--------+
# or check for end points
g.outDegrees.filter("outDegree==0").show()
+---+---------+
| id|outDegree|
+---+---------+
+---+---------+
# or check for any vertices that are alone without edge
g.degrees.filter("degree==0").show()
+---+------+
| id|degree|
+---+------+
+---+------+
what works is a left, right or full join of the inDegree and outDegree result and filter on the NULL values of the respective column
the join will provide you a merged columns with NULL values on the start and end positions
g.inDegrees.join(g.outDegrees,on="id",how="full").show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a3| 1| 1|
| a4| 1| null|
| c7| 1| 1|
| b2| 1| 2|
| c9| 3| 1|
| c5| 1| 1|
| c1| null| 1|
| c6| 1| 1|
| a2| 1| 1|
| b3| 1| 1|
| b1| null| 1|
| c8| 3| null|
| a1| null| 1|
| c4| 1| 4|
| c3| 1| 1|
| b4| 1| 1|
| c2| 1| 3|
|c10| 1| null|
| b5| 2| 1|
+---+--------+---------+
now you can filter on what search
my_in_Degrees=g.inDegrees
my_out_Degrees=g.outDegrees
# get starting vertices (no more childs)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_in_Degrees.inDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| c1| null| 1|
| b1| null| 1|
| a1| null| 1|
+---+--------+---------+
# get ending vertices (no more parents)
my_in_Degrees.join(my_out_Degrees,on="id",how="full").filter(my_out_Degrees.outDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a4| 1| null|
|c10| 1| null|
+---+--------+---------+
I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done like below.
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to do some change to the code I want to populate the column value after I divide the cat column with the value in the data frame for that id.
I tried something like below but didn't get the correct result
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ df.val).show()
How can I get what I want?
edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+
Aggregation would need an aggregation function, a simple column would not be identified
Since val column contains same value for each group of id column, you can use first inbuilt function as
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+
I have a json file which I import using the following code:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
The result is a dataframe similar to this:
+---+---+
| A| B|
+---+---+
| 1| 3|
| 2| 1|
| 2| 3|
| 1| 2|
| 3| 1|
| 1| 2|
| 2| 1|
| 1| 3|
| 1| 2|
+---+---+
My task is using PySpark to reduce the data to only the most frequent combinations of two columns (A and B)
So the wanted output is this
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
You can do that with a combination of groupBy and limit:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
df.groupBy("A","B")
.count()
.sort("count",ascending = False)
.limit(2)
.show()
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+