How to aggregate timestamp data in Spark to smaller time frame - apache-spark

I'm working on a project using New York taxi data. The data contain records for pickup location (PULocationID), and the timestamp (tpep_pickup_datetime) for that particular pick-up record.
I want to aggregate the data to be hourly for each location. The aggregation should have an hourly count of pick-ups per location.

Your question is missing a few details, but based on what you described, here are two possible ways to do the hourly aggregation.
Using date_trunc
from pyspark.sql import functions as F
df = df.groupBy(
    F.date_trunc('hour', 'tpep_pickup_datetime').alias('hour'),
    'PULocationID',
).count()
df.show()
# +-------------------+------------+-----+
# |               hour|PULocationID|count|
# +-------------------+------------+-----+
# |2020-01-01 00:00:00|         238|    1|
# |2020-01-01 02:00:00|         238|    2|
# |2020-01-01 02:00:00|         193|    1|
# |2020-01-01 01:00:00|         238|    2|
# |2020-01-01 00:00:00|           7|    1|
# +-------------------+------------+-----+
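The same pattern scales to other granularities by changing the truncation unit passed to date_trunc, e.g. 'day' or 'week' (a quick sketch, reusing the original un-aggregated df):
from pyspark.sql import functions as F
# Daily pick-up counts per location instead of hourly
daily = df.groupBy(
    F.date_trunc('day', 'tpep_pickup_datetime').alias('day'),
    'PULocationID',
).count()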
Using window
from pyspark.sql import functions as F
df = df.groupBy(
    F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
    'PULocationID',
).count()
df.show(truncate=0)
# +------------------------------------------+------------+-----+
# |hour                                      |PULocationID|count|
# +------------------------------------------+------------+-----+
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|238         |2    |
# |[2020-01-01 01:00:00, 2020-01-01 02:00:00]|238         |2    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|238         |1    |
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|193         |1    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|7           |1    |
# +------------------------------------------+------------+-----+
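The hour column produced by window is a struct with start and end fields; if you prefer a plain hourly timestamp like the one date_trunc gives, you can pull out the window start afterwards (a small follow-up sketch on the aggregated df above):
from pyspark.sql import functions as F
# Replace the window struct with its start timestamp
hourly = df.withColumn('hour', F.col('hour.start'))
hourly.show()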

Related

Overwrite the rows containing NULL in Dataframe by another Dataframe in PySpark

I have two dataframes, df and df1. In df some columns contain NULL, while df1 has non-null values for those columns. I just need to overwrite the rows where the NULLs exist.
The df is below:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St| null| null|
| 42949672965|Comfort Inn Delan...| US| Deland|400 E Internation...| 29.054737| -81.297208|
| 60129542147|Ubaa Old Crawford...| US| Des Plaines| 5460 N River Rd| null| null|
The df1 is below:
+-------------+--------------------+-------+------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St|39.6286685|-106.0451009|
| 60129542147|Ubaa Old Crawford...| US| Des Plaines| 5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
I want this result:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...| US| Deland|400 E Internation...| 29.054737| -81.297208|
...
...
You can left join them on id and then use coalesce to pick the first non-null lat/lon.
df1
+-----------+---------+----------+
| id| lat| lon|
+-----------+---------+----------+
|42949672960| null| null|
|42949672965|29.054737|-81.297208|
|60129542147| null| null|
+-----------+---------+----------+
df2
+-----------+----------+------------+
| id| lat| lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+
Join them together
from pyspark.sql import functions as F
(df1
    .join(df2, on=['id'], how='left')
    .select(
        F.col('id'),
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon'),
    )
    .show()
)
# +-----------+----------+------------+
# | id| lat| lon|
# +-----------+----------+------------+
# |42949672965| 29.054737| -81.297208|
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+
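If the real dataframes have many more columns than just lat/lon, the same idea can be applied programmatically to every non-key column (a sketch under the assumption that both frames share the same schema and id is the only join key):
from pyspark.sql import functions as F
key = 'id'
# Coalesce every shared non-key column, preferring df1's value (the frame
# with the NULL gaps) and falling back to df2's correction where df1 is null.
merged = df1.alias('a').join(df2.alias('b'), on=key, how='left').select(
    key,
    *[F.coalesce(F.col(f'a.{c}'), F.col(f'b.{c}')).alias(c)
      for c in df1.columns if c != key]
)
merged.show()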

Transpose each record into multiple columns in pyspark dataframe

I am looking to transpose each record into multiple columns in pyspark dataframe.
This is my dataframe:
+--------+-------------+--------------+------------+------+
|level_1 |level_2 |level_3 |level_4 |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D Group|Investments |ORB |ECM |1 |
|E Group|Investment |Origination |Execution |2 |
+--------+-------------+--------------+------------+------+
Required dataframe is:
+--------+---------------+------+
|level |name |UNQ_ID|
+--------+---------------+------+
|level_1 |D Group |1 |
|level_1 |E Group |2 |
|level_2 |Investments |1 |
|level_2 |Investment |2 |
|level_3 |ORB |1 |
|level_3 |Origination |2 |
|level_4 |ECM |1 |
|level_4 |Execution |2 |
+--------+---------------+------+
An easier way is to use the stack function:
import pyspark.sql.functions as f
output_df = df.selectExpr(
    'stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)',
    'UNQ_ID'
)
output_df.show()
# +-------+-----------+------+
# |  level|       name|UNQ_ID|
# +-------+-----------+------+
# |level_1|    D Group|     1|
# |level_2|Investments|     1|
# |level_3|        ORB|     1|
# |level_4|        ECM|     1|
# |level_1|    E Group|     2|
# |level_2| Investment|     2|
# |level_3|Origination|     2|
# |level_4|  Execution|     2|
# +-------+-----------+------+
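If you are on Spark 3.4 or later, the built-in DataFrame.unpivot method expresses the same reshape without hand-writing the stack expression (a sketch, assuming the df from the question):
# Requires Spark 3.4+, where DataFrame.unpivot is available
output_df = df.unpivot(
    ids=['UNQ_ID'],
    values=['level_1', 'level_2', 'level_3', 'level_4'],
    variableColumnName='level',
    valueColumnName='name',
).select('level', 'name', 'UNQ_ID')
output_df.show()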

How to get updated or new records by comparing two dataframe in pyspark

I have two dataframes like this:
df2.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 11| 500|
|Liza| 20| 900|
+----+-------+------+
df3.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
| Cal| 70| 888|
+----+-------+------+
Here df2 represents the existing database records and df3 represents new or updated records (any column may change) that need to be inserted into or updated in the db. For example, for NAME=PPan the new balance is 10 as per df3, so the entire row for NAME=PPan has to be replaced in df2; for NAME=Cal a new row has to be added; and NAME=Liza stays untouched, like this:
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
|Liza| 20| 900|
| Cal| 70| 888|
+----+-------+------+
How can I achieve this use case?
First, join both dataframes with a full join so that unmatched (new) rows are kept; then, to update the matched records, use select with the coalesce function:
joined_df = df2.alias('rec').join(df3.alias('upd'), on='NAME', how='full')
# +----+-------+------+-------+------+
# |NAME|BALANCE|SALARY|BALANCE|SALARY|
# +----+-------+------+-------+------+
# |Cal |null |null |70 |888 |
# |Liza|20 |900 |null |null |
# |PPan|11 |500 |10 |700 |
# +----+-------+------+-------+------+
output_df = joined_df.selectExpr(
    'NAME',
    'COALESCE(upd.BALANCE, rec.BALANCE) BALANCE',
    'COALESCE(upd.SALARY, rec.SALARY) SALARY'
)
output_df.sort('BALANCE').show(truncate=False)
# +----+-------+------+
# |NAME|BALANCE|SALARY|
# +----+-------+------+
# |PPan|10     |700   |
# |Liza|20     |900   |
# |Cal |70     |888   |
# +----+-------+------+
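If you prefer SQL, the same full join plus coalesce can be written over temporary views (a sketch assuming an active spark session):
# Same logic expressed in Spark SQL over temporary views
df2.createOrReplaceTempView('rec')
df3.createOrReplaceTempView('upd')
output_df = spark.sql("""
    SELECT COALESCE(upd.NAME, rec.NAME)       AS NAME,
           COALESCE(upd.BALANCE, rec.BALANCE) AS BALANCE,
           COALESCE(upd.SALARY, rec.SALARY)   AS SALARY
    FROM rec
    FULL OUTER JOIN upd ON rec.NAME = upd.NAME
""")
output_df.show()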

How to reset index and find specific id?

I have an id column for each person (rows with the same id belong to one person). I want to do the following:
The id column is not a simple numbering; it is a 10-digit value. How can I reset the ids to integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi
How can I get the data corresponding to id=2?
In the above example:
id col1
2 yes
2 No
2 why
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F
spark = SparkSession.builder.getOrCreate()
data = [
    ('12a4', 'summer'),
    ('12a4', 'goest'),
    ('3b', 'yes'),
    ('3b', 'No'),
    ('3b', 'why'),
    ('4t', 'Hi'),
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()
# +----+------+
# | id| col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# | 3b| yes|
# | 3b| No|
# | 3b| why|
# | 4t| Hi|
# +----+------+
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))
df.show()
# +----+------+
# | id|new_id|
# +----+------+
# |12a4| 1|
# | 3b| 2|
# | 4t| 3|
# +----+------+
df = df.join(df1, 'id', 'full')
df.show()
# +----+------+------+
# | id|new_id| col1|
# +----+------+------+
# |12a4| 1|summer|
# |12a4| 1| goest|
# | 4t| 3| Hi|
# | 3b| 2| yes|
# | 3b| 2| No|
# | 3b| 2| why|
# +----+------+------+
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()
# +---+------+
# | id| col1|
# +---+------+
# | 1|summer|
# | 1| goest|
# | 3| Hi|
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+------+
df = df.filter(F.col('id') == 2)
df.show()
# +---+----+
# | id|col1|
# +---+----+
# | 2| yes|
# | 2| No|
# | 2| why|
# +---+----+
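A shorter alternative, if the data is small enough for a single unpartitioned window, is to compute the integer id directly with dense_rank and skip the distinct-plus-join round trip (a sketch on the same df1):
from pyspark.sql import Window, functions as F
# dense_rank assigns the same consecutive integer to every row sharing an id
w = Window.orderBy('id')
df = (df1
    .withColumn('new_id', F.dense_rank().over(w))
    .drop('id')
    .withColumnRenamed('new_id', 'id'))
df.filter(F.col('id') == 2).show()  # yields the same three id=2 rows as above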

How to join two data frames in Apache Spark and merge keys into one column?

I have two following Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+
How can I join them so that the output is:
user_id  total_sale  personalized_target
a        1100        NA
b        2100        1000
c        3300        2000
d        4400        3000
e        NA          4000
I have tried almost all the join types, but it seems a single join cannot produce the desired output.
Any PySpark, Spark SQL, or HiveContext approach would help.
You can use the equi-join syntax in Scala:
val output = sales_df.join(target_df,Seq("user_id"),joinType="outer")
You should check whether it works in Python:
output = sales_df.join(target_df,['user_id'],"outer")
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+
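If you literally want the string NA in place of null, as in the desired output, one option is to cast the numeric columns to string and fill the gaps (a follow-up sketch on the joined result above):
from pyspark.sql import functions as F
joined = sales.join(target, 'user_id', 'outer')
result = joined.select(
    'user_id',
    F.col('total_sale').cast('string').alias('total_sale'),
    F.col('personalized_target').cast('string').alias('personalized_target'),
).fillna('NA')
result.show()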
