Finding the Delta of a Column within Pyspark Interactive Shell - apache-spark

I have this DataFrame:
DataFrame[visitors: int, beach: string, Date: date]
With the following data:
+----------+-----------+--------+
|date      |beach      |visitors|
+----------+-----------+--------+
|2020-03-02|Bondi Beach|205     |
|2020-03-02|Nissi Beach|218     |
|2020-03-03|Bar Beach  |201     |
|2020-03-04|Navagio    |102     |
|2020-03-04|Champangne |233     |
|2020-03-05|Lighthouse |500     |
|2020-03-06|Mazo       |318     |
+----------+-----------+--------+
I'm looking to compute the delta between each row's visitors value and the next row's, ordered by date.
Expected output:
+----------+-----------+--------+-----+
|date      |beach      |visitors|Delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach|205     |-13  | (205-218)
|2020-03-02|Nissi Beach|218     |17   | (218-201)
|2020-03-03|Bar Beach  |201     |99   | (201-102)
|2020-03-04|Navagio    |102     |-131 | (102-233)
|2020-03-04|Champangne |233     |-267 | (233-500)
|2020-03-05|Lighthouse |500     |182  | (500-318)
|2020-03-06|Mazo       |318     |318  | (318-0)
+----------+-----------+--------+-----+

You can use the lead window function for this. Since lead returns null for the last row, I'm using coalesce to replace that null with the visitors value.
from pyspark.sql.window import Window
from pyspark.sql.functions import *

w = Window().orderBy("date")
df.withColumn("delta", col("visitors") - lead("visitors").over(w)) \
  .withColumn("delta", coalesce("delta", "visitors")) \
  .show()
+----------+-----------+--------+-----+
| date| beach|visitors|delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach| 205| -13|
|2020-03-02|Nissi Beach| 218| 17|
|2020-03-03| Bar Beach| 201| 99|
|2020-03-04| Navagio| 102| -131|
|2020-03-04| Champangne| 233| -267|
|2020-03-05| Lighthouse| 500| 182|
|2020-03-06| Mazo| 318| 318|
+----------+-----------+--------+-----+
Note: I'm ordering only by the date field. It would be better to have another column, such as an id, to include in the orderBy clause so that the order is deterministic (a sketch of that is shown below). Also, using a window without partitions can have a performance impact.
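As a minimal sketch of that suggestion (assuming the DataFrame is named df and the new column is only needed as a tiebreaker, not as a meaningful id), you could add a surrogate id and include it in the ordering:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, coalesce, monotonically_increasing_id

# Surrogate id, used only to break ties between rows sharing the same date
df_with_id = df.withColumn("row_id", monotonically_increasing_id())

w = Window.orderBy("date", "row_id")
result = (
    df_with_id
    .withColumn("delta", col("visitors") - lead("visitors").over(w))
    .withColumn("delta", coalesce("delta", "visitors"))
    .drop("row_id")
)
result.show()
Note that monotonically_increasing_id() is not consecutive and reflects the current partition layout, so it is a pragmatic tiebreaker rather than a true row number.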

Related

populate master table from daily table for updated and new inserted records

I have two tables with a few records each.
name is the column on which I can apply the join condition.
Table A (master table):
#+-------------+----------+---------------------------+---------+
#| name        | Value    | date                      |city     |
#+-------------+----------+---------------------------+---------+
#| RHDM        | 123      | 2-07-2020 12:00:55:842    |New York |
#| Rohit       | 345      | 1-05-2021 11:50:55:222    |Berlin   |
#| kerry       | 785      | 3-04-2020 11:60:55:840    |Landon   |
#+-------------+----------+---------------------------+---------+
I have another table with almost the same columns, but the Value and date columns change daily.
Table B:
#+-------------+----------+---------------------------+---------+
#| name        | Value    | date                      |city     |
#+-------------+----------+---------------------------+---------+
#| Rohit       | 350      | 12-07-2021 12:00:55:842   |Berlin   |  <- value and date changed
#| Bob         | 985      | 23-04-2020 10:00:55:842   |India    |  <- new record
#| kerry       | 785      | 13-04-2020 12:00:55:842   |Landon   |  <- only date changed
#+-------------+----------+---------------------------+---------+
I need the output as Table 3, which has to contain all records from Table A plus the updated and newly inserted records from Table B: if the Value or date column changed, the row has to be taken from Table B.
#+-------------+----------+----------------------------+---------+
#| name        | Value    | date                       |City     |
#+-------------+----------+----------------------------+---------+
#| RHDM        | 123      | 2-07-2020 12:00:55:842     |New York |
#| Rohit       | 350      | 12-07-2021 12:00:55:842    |Berlin   |
#| kerry       | 785      | 13-04-2020 12:00:55:842    |Landon   |
#| Bob         | 985      | 23-04-2020 10:00:55:842    |India    |
#+-------------+----------+----------------------------+---------+
In pandas I would have done it by creating two DataFrames, dfA and dfB, and then:
result = pd.merge(dfA, dfB, on=['name'], how='outer', indicator=True)
and applying further logic. Can anyone suggest how to do it in PySpark or Spark SQL?
Simply do a join:
from pyspark.sql import functions as F

df3 = dfA.join(
    dfB,
    on="name",
    how="full",  # or "outer", same thing
).select(
    F.col("name"),
    F.coalesce(dfB["Value"], dfA["Value"]).alias("Value"),
    F.coalesce(dfB["date"], dfA["date"]).alias("date"),
)
df3.show()
+-----+-----+----------+
| name|Value| date|
+-----+-----+----------+
|kerry| 785|13-04-2020|
| Bob| 985|23-04-2020|
| RHDM| 123| 2-07-2020|
|Rohit| 350|12-07-2021|
+-----+-----+----------+
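Since the expected output also keeps the city column, the same coalesce pattern extends to it. A sketch, assuming both dfA and dfB carry a city column as in the tables above:
from pyspark.sql import functions as F

df3 = dfA.join(dfB, on="name", how="full").select(
    F.col("name"),
    F.coalesce(dfB["Value"], dfA["Value"]).alias("Value"),
    F.coalesce(dfB["date"], dfA["date"]).alias("date"),
    F.coalesce(dfB["city"], dfA["city"]).alias("city"),  # keep the city as well
)
df3.show()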

pyspark pivot without aggregation

I am looking to essentially pivot without requiring an aggregation at the end, so the dataframe stays intact and I don't create a grouped object.
As an example, I have this:
+--------+----------+------+---+
| country| code     |Value |ids|
+--------+----------+------+---+
| Mexico |food_1_3  |apple | 1 |
| Mexico |food_1_3  |banana| 2 |
| Canada |beverage_2|milk  | 1 |
| Mexico |beverage_2|water | 2 |
+--------+----------+------+---+
Need this:
+--------+----+---------+-----------+
| country| id |food_1_3 | beverage_2|
+--------+----+---------+-----------+
| Mexico | 1  |apple    |           |
| Mexico | 2  |banana   |water      |
| Canada | 1  |         |milk       |
+--------+----+---------+-----------+
I have tried
(df.groupby(df.country, df.id).pivot("code").agg(first('Value').alias('Value')))
but that essentially just keeps a top 1 value. In my real case I have 20 columns, some with just integers and others with strings... so sums, counts, collect_list: none of those aggs have worked out...
That's because your 'id' is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F

pivoted = (
    df.groupby(df.country, df.ids, F.monotonically_increasing_id().alias('index'))
    .pivot("code")
    .agg(F.first('Value').alias('Value'))
    .drop('index')
)
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
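For reference, the sample data above can be recreated like this if you want to try the snippet (a sketch assuming a SparkSession named spark, as in the pyspark shell):
df = spark.createDataFrame(
    [("Mexico", "food_1_3", "apple", 1),
     ("Mexico", "food_1_3", "banana", 2),
     ("Canada", "beverage_2", "milk", 1),
     ("Mexico", "beverage_2", "water", 2)],
    ["country", "code", "Value", "ids"],
)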

filter dataframe by multiple columns after exploding

My df contains product names and corresponding information. Relevant here are the product name and the countries it is sold to:
+--------------------+-------------------------+
|        Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
|                null|     [Belgium,United K...|
|     #5 pecan/almond|                [Belgium]|
| #8 mango/strawberry|                [Belgium]|
|& Sully A Mild Th...|         [Belgium,France]|
|&quot;70CL Liqueu...|         [Belgium,France]|
|&quot;Gingembre&q...|                [Belgium]|
|&quot;Les Schtrou...|         [Belgium,France]|
|&quot;Sho-key&quo...|                [Belgium]|
|&quot;mini Chupa ...|         [Belgium,France]|
|      'S Lands beste|                [Belgium]|
|'T vlierbos confi...|                [Belgium]|
|(H)eat me - Spagh...|                [Belgium]|
|       -cheese flips|                [Belgium]|
|     .soupe cerfeuil|                [Belgium]|
|1 1/2 Minutes Bas...|     [Belgium,Luxembourg]|
|   1/2 Reblochon AOP|                [Belgium]|
|  1/2 nous de jambon|                [Belgium]|
|1/2 tarte cerise ...|                [Belgium]|
|10 Original Knack...|     [Belgium,France,S...|
|    10 pains au lait|         [Belgium,France]|
+--------------------+-------------------------+
sample input data:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
Since I want to explore which countries the products are sold to besides Belgium, I split the country column to show every country individually, using the code below:
# create df with grouped products
countriesDF = productsDF \
    .select("Product_name", "Countries_en") \
    .groupBy("Product_name") \
    .agg(F.collect_set("Countries_en").cast("string").alias("Countries")) \
    .orderBy("Product_name")

# split df to show the countries the product is sold to in a separate column
countriesDF = countriesDF \
    .where(col("Countries") != "null") \
    .select(
        "Product_name",
        F.split("Countries", ",").alias("Countries"),
        F.posexplode(F.split("Countries", ",")).alias("pos", "val")
    ) \
    .drop("val") \
    .select(
        "Product_name",
        F.concat(F.lit("Countries"), F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    ) \
    .groupBy("Product_name").pivot("name").agg(F.first("val")) \
    .show()
However, this table now has over 400 columns for countries alone, which is not presentable. So my questions are:
am I doing the splitting / exploding correctly?
can I split the df so that I get the countries as column names (e.g. 'France' instead of 'Countries1' etc.), counting the number of times the product is sold in that country?
Some sample data:
val sampledf = Seq(
  ("p1", "BELGIUM,GERMANY"),
  ("p1", "BELGIUM,ITALY"),
  ("p1", "GERMANY"),
  ("p2", "BELGIUM")
).toDF("Product_name", "Countries_en")
Transform to the required df:
val df = sampledf
  .withColumn("country_list", split(col("Countries_en"), ","))
  .select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
If you need only counts per country:
val countDF = df.groupBy("Product_name", "country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|BELGIUM| 2|
| p1|GERMANY| 1|
| p2|BELGIUM| 1|
+------------+-------+-----+
Except Belgium:
countDF.filter(col("country") =!= "BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|GERMANY| 1|
+------------+-------+-----+
And if you really want countries as columns:
countDF.groupBy("Product_name").pivot("country").agg(first("count"))
+------------+-------+-------+
|Product_name|BELGIUM|GERMANY|
+------------+-------+-------+
| p2| 1| null|
| p1| 2| 1|
+------------+-------+-------+
And you can .drop("BELGIUM") if you want to exclude that column. (A PySpark rendering of these steps is sketched below.)
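Since the question is about PySpark, here is a rough PySpark equivalent of the Scala steps above (a sketch, assuming productsDF has the Product_name and Countries_en columns shown earlier):
from pyspark.sql import functions as F

df = (
    productsDF
    .withColumn("country_list", F.split(F.col("Countries_en"), ","))
    .select("Product_name", F.explode("country_list").alias("country"))
)

countDF = df.groupBy("Product_name", "country").count()

# counts per product, excluding Belgium
countDF.filter(F.col("country") != "Belgium").show()

# or countries as columns
countDF.groupBy("Product_name").pivot("country").agg(F.first("count")).show()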
Final code used:
# create df where countries are split off
df = productsDF \
    .withColumn("country_list", split(col("Countries_en"), ",")) \
    .select(col("Product_name"), explode(col("country_list")).alias("Country"))

# create count and filter out Country Belgium; the product name filter can be changed as needed
countDF = df.groupBy("Product_name", "Country").count() \
    .filter(col("Country") != "Belgium") \
    .filter(col("Product_name") == 'Café')
countDF.show()

Difference between explode and explode_outer

What is the difference between explode and explode_outer? The documentation for both functions is the same and also the examples for both functions are identical:
SELECT explode(array(10, 20));
10
20
and
SELECT explode_outer(array(10, 20));
10
20
The Spark source suggests that there is a difference between the two functions
expression[Explode]("explode"),
expressionGeneratorOuter[Explode]("explode_outer")
but what is the effect of expressionGeneratorOuter compared to expression?
explode creates a row for each element in the array or map column and skips rows where the array or map is null or empty, whereas explode_outer keeps such rows and returns null for them.
For example, for the following dataframe-
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
explode gives the following output-
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
Whereas explode_outer gives the following output-
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
SELECT explode(col1) FROM VALUES (array(10, 20)), (null)
returns
+---+
|col|
+---+
| 10|
| 20|
+---+
while
SELECT explode_outer(col1) FROM VALUES (array(10, 20)), (null)
returns
+----+
| col|
+----+
| 10|
| 20|
|null|
+----+
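For completeness, a small DataFrame-API sketch of the same behaviour (assuming a SparkSession named spark and the Luke/Lucy data above):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Luke", ["baseball", "soccer"]), (2, "Lucy", None)],
    ["id", "name", "likes"],
)

# explode drops Lucy, because her likes array is null
df.select("id", "name", F.explode("likes").alias("like")).show()

# explode_outer keeps Lucy, with a null like
df.select("id", "name", F.explode_outer("likes").alias("like")).show()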

Divide spark dataframe into chunks using row values as separators

In my PySpark code I have a DataFrame populated with data coming from a sensor and each single row has timestamp, event_description and event_value.
Each sensor event is composed of measurements, each defined by an id and a value. The only guarantee I have is that all the "phases" related to a single event are included between two EV_SEP rows (and are unsorted within the block).
Inside each event "block" there is an event label which is the value associated to EV_CODE.
+-------------------------+------------+-------------+
| timestamp | event_id | event_value |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- |
+-------------------------+------------+-------------+
I would like to create a new column containing that label, so that I know that all the events are associated to that label:
+-------------------------+----------+-------------+------------+
| timestamp | event_id | event_value | event_code |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
With pandas I can easily get the indexes of the EV_SEP rows, split the table into blocks, take the EV_CODE from each block and create an event_code column with such value.
A possible solution would be:
Sort the DataFrame according to timestamp
Convert the DataFrame to an RDD and call zipWithIndex
Get the indexes of the rows containing EV_SEP
Calculate the block ranges (start_index, end_index)
Process single "chunks" (filtering on indexes) to extract EV_CODE
Finally, create the wanted column
Is there any better way to solve this problem?
from pyspark.sql import functions as f
from pyspark.sql.window import Window  # Window is used below
Sample data:
df.show()
+-----------------------+--------+-----------+
|timestamp |event_id|event_value|
+-----------------------+--------+-----------+
|2017-01-01 00:00:12.540|EV_SEP |null |
|2017-01-01 00:00:14.201|EV_2 |10 |
|2017-01-01 00:00:13.331|EV_1 |11 |
|2017-01-01 00:00:15.203|EV_CODE |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |
|2017-01-01 00:00:20.201|EV_2 |10 |
|2017-01-01 00:00:24.203|EV_CODE |DEF |
|2017-01-01 00:00:31.670|EV_SEP |null |
+-----------------------+--------+-----------+
Add index:
df_idx = df.filter(df['event_id'] == 'EV_SEP') \
    .withColumn('idx', f.row_number().over(Window.partitionBy().orderBy(df['timestamp'])))
df_block = df.filter(df['event_id'] != 'EV_SEP').withColumn('idx', f.lit(0))
'Spread' index:
df = df_idx.union(df_block).withColumn(
    'idx',
    f.max('idx').over(
        Window.partitionBy()
        .orderBy('timestamp')
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
).cache()
Add EV_CODE:
df_code = df.filter(df['event_id'] == 'EV_CODE').withColumnRenamed('event_value', 'event_code')
df = df.join(df_code, on=[df['idx'] == df_code['idx']]) \
    .select(df['timestamp'], df['event_id'], df['event_value'], df_code['event_code'])
Finally:
+-----------------------+--------+-----------+----------+
|timestamp |event_id|event_value|event_code|
+-----------------------+--------+-----------+----------+
|2017-01-01 00:00:12.540|EV_SEP |null |ABC |
|2017-01-01 00:00:13.331|EV_1 |11 |ABC |
|2017-01-01 00:00:14.201|EV_2 |10 |ABC |
|2017-01-01 00:00:15.203|EV_CODE |ABC |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |DEF |
|2017-01-01 00:00:20.201|EV_2 |10 |DEF |
|2017-01-01 00:00:24.203|EV_CODE |DEF |DEF |
+-----------------------+--------+-----------+----------+
Creating a new Hadoop InputFormat would be a more computationally efficient way to accomplish your goal here (although is arguably the same or more gymnastics in terms of code). You can specify alternative Hadoop input formats using sc.hadoopFile in the Python API, but you must take care of conversion from the Java format to Python. You can then specify the format. The converters available in PySpark are relatively few but this reference proposes using the Avro converter as an example. You might also simply find it convenient to let your custom Hadoop input format output text which you then additionally parse in Python to avoid the issue of implementing a converter.
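As a rough illustration of that wiring from the Python side (only sc.hadoopFile itself is real PySpark API; the input format class name and path are hypothetical placeholders for something you would implement in Java or Scala):
# Hypothetical custom input format that emits one whole EV_SEP-delimited block per record
rdd = sc.hadoopFile(
    "hdfs:///data/sensor_events",                           # assumed path
    inputFormatClass="com.example.EventBlockInputFormat",   # hypothetical class
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
)
# Each value is the raw text of one event block; parse it further in Python
blocks = rdd.values()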
Once you have that in place, you would create a special input format (in Java or Scala using the Hadoop API's) to treat the special sequences of rows having EV_SEP as record delimiters instead of the newline character. You could do this quite simply by collecting rows as they are read in an accumulator (just a simple ArrayList could do as a proof-of-concept) and then emitting the accumulated list of records when you find two EV_SEP rows in a row.
I would point out that using TextInputFormat as a basis for such a design might be tempting, but that input format splits files arbitrarily at newline characters, so you would need custom logic to properly support splitting the files. Alternatively, you can avoid the problem by simply not implementing file splitting, which is a simple change to the input format (for example, making it report files as non-splittable).
If you do need to split files, the basic idea is:
Pick a split offset by evenly dividing the file into parts
Seek to the offset
Seek back character-by-character from the offset to where the delimiter sequence is found (in this case, two rows in a row with type EV_SEP).
Detecting these sequences for the edge case around file splitting would be a challenge. I would suggest establishing the largest byte-width of rows and reading sliding-window chunks of an appropriate width (basically 2x the size of the rows) backwards from your starting point, then matching against those windows using a precompiled Java regex and Matcher. This is similar to how Sequence Files find their sync marks, but uses a regex to detect the sequence instead of strict equality.
As a side note, given some of the other background you mention here, I would be concerned that sorting the DataFrame by timestamp could mix together events that happen in the same time period across different files.
