Divide aggregate value using values from data frame in PySpark - apache-spark

I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done it like below.
import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to change the code so that each pivoted count is divided by the val value in the data frame for that id.
I tried something like below, but didn't get the correct result:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / df.val).show()
How can I get what I want?
Edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+

Aggregation needs an aggregate function; a plain column reference will not be accepted there.
Since the val column contains the same value for every row within an id group, you can use the first inbuilt function:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+
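If you prefer to keep the pivot itself untouched, here is an equivalent sketch (still assuming val is constant per id, as above): pivot the raw counts first, then join the per-id val back in and divide. The category column list is hard-coded for this example.
import pyspark.sql.functions as F

# pivot the plain counts per (id, cat), exactly as in the original code
counts = df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat'))

# val is constant per id, so first() is enough to recover it
vals = df.groupBy('id').agg(F.first('val').alias('val'))

result = counts.join(vals, 'id')
for c in ['pc', 'phones', 'security']:   # hard-coded category columns
    result = result.withColumn(c, F.col(c) / F.col('val'))
result.drop('val').show()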

Related

aggregate function in lit() of pyspark along with withColumn

I have a column Quantity in a dataframe. I want to add a new column to this dataframe where each record has min("Quantity"). I am trying to use lit() in PySpark, something like below:
df.withColumn("min_quant", lit(min(col("Quantity")))).show()
This results in the error below:
grouping expressions sequence is empty, and `InvoiceNo` is not an aggregate function.
Wrap (min(`Quantity`) AS `min_quant`) in windowing function(s) or wrap
This is working:
df.withColumn("min_quant", lit(2)).show().
But, in place of 2 here, I want min(Quantity). Am I missing something?
Please try using a window function, since min() needs an aggregation:
from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", F.min("Quantity").over(windowSpec)).show()
Sample Result:
+---------+----+--------+---------+
|InvoiceNo|name|Quantity|min_quant|
+---------+----+--------+---------+
| 1| ABC| 19| 1|
| 1| ABC| 1| 1|
| 1| ABC| 8| 1|
| 1| ABC| 389| 1|
| 1| ABC| 196| 1|
| 2| CBD| 10| 1|
| 2| CBD| 946| 1|
| 3| XYZ| 3| 1|
+---------+----+--------+---------+
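If what you actually need is the overall minimum of Quantity across all rows, another sketch is to aggregate once and attach the single-row result with a crossJoin; on this sample data it gives the same values as the window above.
from pyspark.sql import functions as F

# compute the one-row global minimum, then attach it to every row
min_df = df.agg(F.min("Quantity").alias("min_quant"))
df.crossJoin(min_df).show()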

Joining Two Dataframes: Lost tasks

I have time series data in a PySpark DataFrame. Each of my signals (value column) should be assigned a unique id. However the id values are imprecise and need to be extended to both sides. The original DataFrame looks like this:
df_start
+------+----+-------+
| time | id | value |
+------+----+-------+
| 1| 0| 1.0|
| 2| 1| 2.0|
| 3| 1| 2.0|
| 4| 0| 1.0|
| 5| 0| 0.0|
| 6| 0| 1.0|
| 7| 2| 2.0|
| 8| 2| 3.0|
| 9| 2| 2.0|
| 10| 0| 1.0|
| 11| 0| 0.0|
+------+----+-------+
The desired output is:
df_desired
+------+----+-------+
| time | id | value |
+------+----+-------+
| 1| 1| 1.0|
| 2| 1| 2.0|
| 3| 1| 2.0|
| 4| 1| 1.0|
| 6| 2| 1.0|
| 7| 2| 2.0|
| 8| 2| 3.0|
| 9| 2| 2.0|
| 10| 2| 1.0|
| 11| 2| 1.0|
+------+----+-------+
So there are two things happening here:
The id column is not precise enough: each id starts logging a time steps too late (here 1 and 1), and ends b time steps too early (here 1 and 2). I therefore have to replace some zeros with their respective id.
After 'padding' the entries in the id column, I remove all remaining rows with id=0. (Here only the row with time=5.)
Luckily I know, for each id, what the relative logging time delay is. Currently, I convert this to the absolute, correct logging times in
df_join
+----+-------+-------+
| id | min_t | max_t |
+----+-------+-------+
| 1| 1| 4|
| 2| 6| 11|
+----+-------+-------+
which I use to then 'filter' the original data using a join
df_desired = df_join.join(df_start,
    df_start.time.between(df_join.min_t, df_join.max_t)
)
which results in the desired output.
In reality df_join has at least 400 000 rows and df_start has about 10 billion rows, of which we keep most.
When I run this on our cluster, I am at some point getting warnings like Lost task, ExecutorLostFailure, Container marked as failed, Exit code: 134.
I suspect the executors are running out of memory, however I have not found any solution.
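Since df_join is tiny compared to df_start, one hedged sketch is to broadcast it explicitly, so the non-equi join runs as a broadcast nested loop join rather than a full cartesian shuffle; whether this actually resolves the executor failures is an assumption, not a verified result.
from pyspark.sql.functions import broadcast

# same join as above, but explicitly broadcasting the small df_join
df_desired = broadcast(df_join).join(
    df_start,
    df_start.time.between(df_join.min_t, df_join.max_t)
)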

GraphFrames detect exclusive outbound relations

In my graph I need to detect vertices that do not have inbound relations. Using the example below, "a" is the only node that is not referenced by any other node.
a --> b
b --> c
c --> d
c --> b
I would really appreciate any examples to detect "a" type nodes in my graph.
Thanks
Unfortunately the approach is not that simple, because the graph.degrees, graph.inDegrees and graph.outDegrees functions do not return vertices with 0 edges
(see the Scala documentation, which holds true for Python too: https://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.GraphFrame).
So the following code will always return an empty dataframe:
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)
# check for start points
g.inDegrees.filter("inDegree==0").show()
+---+--------+
| id|inDegree|
+---+--------+
+---+--------+
# or check for end points
g.outDegrees.filter("outDegree==0").show()
+---+---------+
| id|outDegree|
+---+---------+
+---+---------+
# or check for any vertices that are alone without edge
g.degrees.filter("degree==0").show()
+---+------+
| id|degree|
+---+------+
+---+------+
What works is a left, right or full join of the inDegrees and outDegrees results, filtering on the NULL values of the respective column.
The join provides merged columns with NULL values marking the start and end vertices:
g.inDegrees.join(g.outDegrees,on="id",how="full").show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a3| 1| 1|
| a4| 1| null|
| c7| 1| 1|
| b2| 1| 2|
| c9| 3| 1|
| c5| 1| 1|
| c1| null| 1|
| c6| 1| 1|
| a2| 1| 1|
| b3| 1| 1|
| b1| null| 1|
| c8| 3| null|
| a1| null| 1|
| c4| 1| 4|
| c3| 1| 1|
| b4| 1| 1|
| c2| 1| 3|
|c10| 1| null|
| b5| 2| 1|
+---+--------+---------+
Now you can filter for whatever you are searching for:
my_in_Degrees = g.inDegrees
my_out_Degrees = g.outDegrees
# get starting vertices (no inbound edges, i.e. no parents)
my_in_Degrees.join(my_out_Degrees, on="id", how="full").filter(my_in_Degrees.inDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| c1| null| 1|
| b1| null| 1|
| a1| null| 1|
+---+--------+---------+
# get ending vertices (no outbound edges, i.e. no children)
my_in_Degrees.join(my_out_Degrees, on="id", how="full").filter(my_out_Degrees.outDegree.isNull()).show()
+---+--------+---------+
| id|inDegree|outDegree|
+---+--------+---------+
| b6| 1| null|
| a4| 1| null|
|c10| 1| null|
+---+--------+---------+
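As a further hedged sketch, the same "no inbound relations" check can be done without the degree views at all, by anti-joining the vertex list against the distinct edge destinations; unlike inDegrees, this also keeps completely isolated vertices.
# vertices that never appear as an edge destination have no inbound relations
no_inbound = g.vertices.join(
    g.edges.select(g.edges.dst.alias("id")).distinct(),
    on="id",
    how="left_anti"
)
no_inbound.show()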

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all the three data frames. I have done it like below.
import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I want to simplify my code.
1) I want to create phones_df, pc_df and security_df in a better way, because I am repeating almost the same code for each of these data frames and want to reduce that.
2) I want to simplify the join statements into a single statement.
How can I do this? Could anyone explain?
Here is one way: use when/otherwise to map the device column to a category, and then pivot it into the desired output:
import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+
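If more categories are expected later, a small sketch (the categories dict is just illustrative) builds the same 'cat' column from a mapping instead of hand-nesting when/otherwise:
import pyspark.sql.functions as F

categories = {'phones': phone_list, 'pc': pc_list, 'security': security_list}

cat_col = None
for name, devices in categories.items():
    # chain one WHEN clause per category; unmatched devices stay null, as before
    if cat_col is None:
        cat_col = F.when(F.col('device').isin(devices), name)
    else:
        cat_col = cat_col.when(F.col('device').isin(devices), name)

df.withColumn('cat', cat_col).groupBy('id').pivot('cat').agg(F.count('cat')).show()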

Reducing a dataframe to the most frequent combinations of two columns

I have a json file which I import using the following code:
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
The result is a dataframe similar to this:
+---+---+
| A| B|
+---+---+
| 1| 3|
| 2| 1|
| 2| 3|
| 1| 2|
| 3| 1|
| 1| 2|
| 2| 1|
| 1| 3|
| 1| 2|
+---+---+
My task is to use PySpark to reduce the data to only the most frequent combinations of the two columns A and B.
So the desired output is this:
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
You can do that with a combination of groupBy, count, sort and limit:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")

(df.groupBy("A", "B")
    .count()
    .sort("count", ascending=False)
    .limit(2)
    .show())
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
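A hedged variant for the case where counts can tie: rank the counts with a window and keep everything in the top 2 positions instead of a hard limit(2); the cut-off of 2 simply mirrors the example above.
from pyspark.sql import Window
from pyspark.sql import functions as F

counts = df.groupBy("A", "B").count()
w = Window.orderBy(F.desc("count"))

# dense_rank keeps all tied combinations at the same rank, unlike limit()
(counts.withColumn("rank", F.dense_rank().over(w))
    .filter(F.col("rank") <= 2)
    .drop("rank")
    .show())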
