I would like to compare two DataFrames and flag each record according to the three conditions below.
If the record matches in both, 'SAME' should appear in a new column FLAG.
If the record does not match and comes from df1 (for example No. 66), 'DF1' should appear in the FLAG column.
If the record does not match and comes from df2 (for example No. 77), 'DF2' should appear in the FLAG column.
The whole record must be considered and verified, i.e. this is a record-wise comparison.
I also need to run this check on millions of records using PySpark.
df1:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
df2:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3000,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Xom,5000,mex,IT,2/11/2019
77,XYZ,5000,mex,IT,2/11/2019
Expected Output:
No,Name,Sal,Address,Dept,Join_Date,FLAG
11,Sam,1000,ind,IT,2/11/2019,SAME
22,Tom,2000,usa,HR,2/11/2019,SAME
33,Kom,3500,uk,IT,2/11/2019,DF1
33,Kom,3000,uk,IT,2/11/2019,DF2
44,Nom,4000,can,HR,2/11/2019,SAME
55,Vom,5000,mex,IT,2/11/2019,DF1
55,Xom,5000,mex,IT,2/11/2019,DF2
66,XYZ,5000,mex,IT,2/11/2019,DF1
77,XYZ,5000,mex,IT,2/11/2019,DF2
I loaded the input data as shown below, but I have no idea how to proceed.
df1 = pd.read_csv("D:\\inputs\\file1.csv")
df2 = pd.read_csv("D:\\inputs\\file2.csv")
Any help is appreciated. Thanks.
# Requisite packages to import
import sys
from pyspark.sql.functions import lit, count, col, when
from pyspark.sql.window import Window
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
Use a window function to count identical rows: if the count is greater than 1, set the FLAG column to SAME, otherwise keep its existing value. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
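The union-and-count logic above can be checked on a small sample in plain Python before running it on a cluster. The sketch below is not Spark code: `flag_by_count` is a hypothetical helper and the sample rows are shortened to three columns. It also illustrates one caveat: counting rows after the union assumes neither DataFrame contains the same record twice, otherwise a record duplicated inside df1 alone would also be flagged SAME.

```python
from collections import Counter

def flag_by_count(rows1, rows2):
    """Mimic union + count-per-record + dropDuplicates on lists of tuples."""
    tagged = [(r, "DF1") for r in rows1] + [(r, "DF2") for r in rows2]
    counts = Counter(r for r, _ in tagged)
    # A record seen more than once after the union is flagged SAME.
    flagged = {(r, "SAME" if counts[r] > 1 else src) for r, src in tagged}
    return sorted(flagged)

rows1 = [(11, "Sam", 1000), (33, "Kom", 3500), (66, "XYZ", 5000)]
rows2 = [(11, "Sam", 1000), (33, "Kom", 3000), (77, "XYZ", 5000)]
result = flag_by_count(rows1, rows2)
for row, flag in result:
    print(row, flag)

# Caveat: a record duplicated within df1 only is still counted twice.
dup_result = flag_by_count([(1, "a", 1), (1, "a", 1)], [])
print(dup_result)
```

If duplicates within one source are possible, de-duplicating each DataFrame before the union would avoid the false SAME flag.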
I think you can solve your problem with the creation of temporary columns to indicate the source and a join. Then you only have to check for the conditions, i.e. if both sources are present or if only one source is there and which one.
Consider the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3500,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Vom',5000,'mex','IT','2/11/2019'),\
(66,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
df2= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3000,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Xom',5000,'mex','IT','2/11/2019'),\
(77,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
#creation of your example dataframes
df1 = df1.withColumn("Source1", lit("DF1"))
df2 = df2.withColumn("Source2", lit("DF2"))
#temporary columns to refer the origin later
# Full join on all columns; Source1/Source2 are only set if the record exists in that DataFrame.
(df1.join(df2, ["No","Name","Sal","Address","Dept","Join_Date"], "full")
    # SAME if the record appears in both DataFrames
    .withColumn("FLAG", when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")
        # otherwise flag the single DataFrame it came from
        .otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2")))
    # remove the temporary columns and show the result
    .drop("Source1", "Source2")
    .show())
Output:
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
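The decision made after the full join boils down to set membership: a record present on both sides is SAME, otherwise it carries the name of the only side that has it. A minimal plain-Python sketch of that rule, assuming records are hashable tuples (`flag_by_join` is a hypothetical name, and the sample is shortened to three columns):

```python
def flag_by_join(rows1, rows2):
    """Mimic a full outer join on all columns using set membership."""
    set1, set2 = set(rows1), set(rows2)
    flags = {}
    for row in set1 | set2:
        if row in set1 and row in set2:
            flags[row] = "SAME"   # both source markers would be non-null
        elif row in set1:
            flags[row] = "DF1"    # only Source1 would be non-null
        else:
            flags[row] = "DF2"    # only Source2 would be non-null
    return flags

rows1 = [(11, "Sam", 1000), (55, "Vom", 5000), (66, "XYZ", 5000)]
rows2 = [(11, "Sam", 1000), (55, "Xom", 5000), (77, "XYZ", 5000)]
flags = flag_by_join(rows1, rows2)
for row in sorted(flags):
    print(row, flags[row])
```

Unlike this sketch, a Spark join on plain equality does not match a null in a join column against another null, so records containing nulls would come out as DF1 and DF2 rather than SAME.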
Imagine a Spark Dataframe consisting of value observations from variables. Each observation has a specific timestamp and those timestamps are not the same between different variables. This is because the timestamp is generated when the value of a variable changed and is recorded.
#Variable Time Value
#852-YF-007 2016-05-10 00:00:00 0
#852-YF-007 2016-05-09 23:59:00 0
#852-YF-007 2016-05-09 23:58:00 0
Problem: I would like to put all variables onto the same frequency (for instance 10 minutes) using forward-fill. To visualize this, I copied a page from the book "Python for Data Analysis". Question: How can this be done efficiently on a Spark DataFrame?
A Spark DataFrame is simply not a good choice for an operation like this one. In general, SQL primitives won't be expressive enough, and the PySpark DataFrame API doesn't provide the low-level access required to implement it.
Re-sampling, however, can be easily expressed using epoch / timestamp arithmetic. With data like this:
from pyspark.sql.functions import col, max as max_, min as min_
df = (spark
.createDataFrame([
("2012-06-13", 0.694), ("2012-06-20", -2.669), ("2012-06-27", 0.245)],
["ts", "val"])
.withColumn("ts", col("ts").cast("date").cast("timestamp")))
we can re-sample input:
day = 60 * 60 * 24
epoch = (col("ts").cast("bigint") / day).cast("bigint") * day
with_epoch = df.withColumn("epoch", epoch)
min_epoch, max_epoch = with_epoch.select(min_("epoch"), max_("epoch")).first()
and join with reference:
# Reference range
ref = spark.range(
min_epoch, max_epoch + 1, day
).toDF("epoch")
(ref
.join(with_epoch, "epoch", "left")
.orderBy("epoch")
.withColumn("ts_resampled", col("epoch").cast("timestamp"))
.show(15, False))
## +----------+---------------------+------+---------------------+
## |epoch |ts |val |ts_resampled |
## +----------+---------------------+------+---------------------+
## |1339459200|2012-06-13 00:00:00.0|0.694 |2012-06-12 02:00:00.0|
## |1339545600|null |null |2012-06-13 02:00:00.0|
## |1339632000|null |null |2012-06-14 02:00:00.0|
## |1339718400|null |null |2012-06-15 02:00:00.0|
## |1339804800|null |null |2012-06-16 02:00:00.0|
## |1339891200|null |null |2012-06-17 02:00:00.0|
## |1339977600|null |null |2012-06-18 02:00:00.0|
## |1340064000|2012-06-20 00:00:00.0|-2.669|2012-06-19 02:00:00.0|
## |1340150400|null |null |2012-06-20 02:00:00.0|
## |1340236800|null |null |2012-06-21 02:00:00.0|
## |1340323200|null |null |2012-06-22 02:00:00.0|
## |1340409600|null |null |2012-06-23 02:00:00.0|
## |1340496000|null |null |2012-06-24 02:00:00.0|
## |1340582400|null |null |2012-06-25 02:00:00.0|
## |1340668800|2012-06-27 00:00:00.0|0.245 |2012-06-26 02:00:00.0|
## +----------+---------------------+------+---------------------+
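The bucketing arithmetic used for the `epoch` column can be reproduced in plain Python. The sketch below assumes the session time zone is UTC+2, which the 02:00 values in `ts_resampled` suggest but the output does not state; `floor_to_bucket` is a hypothetical helper name. It shows why a local midnight lands in the bucket that starts at the previous UTC midnight.

```python
from datetime import datetime, timedelta, timezone

DAY = 60 * 60 * 24

def floor_to_bucket(epoch_seconds, bucket=DAY):
    """Round an epoch timestamp down to the start of its (UTC-aligned) bucket."""
    return (epoch_seconds // bucket) * bucket

# Assumption: the session time zone is UTC+2.
local = timezone(timedelta(hours=2))
ts = datetime(2012, 6, 13, tzinfo=local)   # 2012-06-13 00:00 local time
epoch = int(ts.timestamp())
bucket_start = floor_to_bucket(epoch)
print(epoch, bucket_start)
print(datetime.fromtimestamp(bucket_start, tz=local))
```

For the 10-minute frequency asked about in the question, the same function would be called with `bucket=600`.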
In Spark >= 3.1 replace
col("epoch").cast("timestamp")
with
from pyspark.sql.functions import timestamp_seconds
timestamp_seconds("epoch")
Using low-level APIs it is possible to fill data like this, as I've shown in my answer to "Spark / Scala: forward fill with last observation". Using RDDs we could also avoid shuffling the data twice (once for the join, once for reordering).
But there is a much more important problem here. Spark performs optimally when a problem can be reduced to element-wise or partition-wise computations. While forward fill is a case where this is possible, as far as I am aware it is typically not the case with commonly used time series models, and if an operation requires sequential access then Spark won't provide any benefit at all.
So if you work with series which are large enough to require a distributed data structure, you'll probably want to aggregate them into some object that can be easily handled by a single machine and then use your favorite non-distributed tool to handle the rest.
If you work with multiple time series where each can be handled in memory then there is of course sparkts, but I know you're already aware of that.
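Once a series has been aggregated down to something that fits on a single machine, forward fill is a simple sequential pass. A minimal plain-Python sketch, assuming the re-sampled grid is a list in which missing slots are `None` (`forward_fill` is a hypothetical helper, not part of any Spark API):

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)   # stays None until the first observation
    return filled

series = [0.694, None, None, -2.669, None, 0.245]
filled = forward_fill(series)
print(filled)
```

The dependence of each output on the previous one is exactly the sequential access pattern discussed above.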
I once answered a similar question; it's a bit of a hack, but the idea makes sense in your case. Map every value onto a list, then flatten the list vertically.
From: Inserting records in a spark dataframe:
You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType
a=sc.parallelize([[670098928, 50],[670098930, 53], [670098934, 55]])\
.toDF(['timestamp','price'])
# list(...) is needed on Python 3, where range() does not return a list
f=func.udf(lambda x:list(range(x,x+5)),ArrayType(IntegerType()))
a.withColumn('timestamp',f(a.timestamp))\
.withColumn('timestamp',func.explode(func.col('timestamp')))\
.groupBy('timestamp')\
.agg(func.max(func.col('price')))\
.show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
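To see what the explode-and-aggregate pipeline computes, here is a plain-Python imitation of it (not Spark code; `expand_and_max` is a hypothetical name). It also shows a limitation worth keeping in mind: where expanded ranges overlap, `max` picks the largest price, which equals the most recent one only while prices are non-decreasing, as they are in the example above.

```python
from collections import defaultdict

def expand_and_max(rows, width=5):
    """Mimic the udf + explode + groupBy(max) steps on (timestamp, price) pairs."""
    grouped = defaultdict(list)
    for ts, price in rows:
        for t in range(ts, ts + width):   # same range the udf builds
            grouped[t].append(price)
    return {t: max(prices) for t, prices in sorted(grouped.items())}

rising = expand_and_max([(670098928, 50), (670098930, 53), (670098934, 55)])
for t, price in rising.items():
    print(t, price)

# With a falling price, the overlap keeps the older, larger value.
falling = expand_and_max([(100, 50), (102, 40)])
print(falling)
```

In the falling case, timestamp 102 keeps the price 50 even though 40 was observed at that time, so this hack is not a general forward fill.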
This is an old post, though I recently had to solve this with Spark 3.2. Here's the solution I came up with to both up-sample and down-sample the time-series to obtain exactly one data-point per object and per time period.
Assume the following input data, which we want to re-sample per day. Some variables have several data points per day, and some have no data for several days:
from pyspark.sql.types import StructType, StringType, ArrayType, DoubleType, TimestampType
from pyspark.sql.functions import udf, date_trunc, row_number, desc, coalesce, datediff, lead, explode, col, lit
from pyspark.sql import Window, Row
from datetime import datetime, timedelta
df = spark.createDataFrame([
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-01T03:34:23.000"), value=1.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-01T04:34:23.000"), value=10.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-01T05:34:23.000"), value=100.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T01:34:23.000"), value=2.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=3.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T02:34:23.000"), value=200.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=200.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T10:34:23.000"), value=40.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T12:34:23.000"), value=42.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T14:34:23.000"), value=46.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-05T14:34:23.000"), value=6.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-07T09:34:23.000"), value=7.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-07T08:34:23.000"), value=70.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-07T05:34:23.000"), value=700.),
])
I first need this simple udf which essentially just builds a sequence of timestamps:
@udf(ArrayType(TimestampType()))
def pad_time(count: int, start_time: datetime):
    # Return `count` consecutive daily timestamps starting at start_time
    if count is None:
        return []
    else:
        return [start_time + timedelta(days=c) for c in range(count)]
Down-sampling can be done with a simple groupBy or partitionBy, keeping max 1 value per variable each day (I chose partitionBy in the example below).
Up-sampling with a "fill-forward" strategy can be done by measuring the size of a time gap between 2 successive rows, and then using this information to call the udf above.
resampled = (
    df
    # down-sampling by keeping the last value of each variable each day.
    .withColumn("record_day", date_trunc("DAY", "record_ts"))
    .withColumn("row_num",
        row_number().over(
            Window.partitionBy("variable", "record_day").orderBy(desc("record_ts"))
        ))
    .where("row_num == 1")
    # up-sampling part 1: count the number of days to be filled (or 1 for the very last value)
    .withColumn("gap",
        coalesce(
            datediff(
                lead("record_day").over(Window.partitionBy("variable").orderBy("record_day")),
                "record_day"),
            lit(1))
    )
    .select(
        # up-sampling part 2: pad the time axis as dictated by "gap"; the other two fields are repeated
        explode(pad_time("gap", "record_day")).alias("record_day"),
        "variable",
        "value",
    )
    .orderBy("record_day", "variable")
)
resampled.show(30)
The result looks like this:
+-------------------+--------+-----+
| record_day|variable|value|
+-------------------+--------+-----+
|2021-10-01 00:00:00| A| 1.0|
|2021-10-01 00:00:00| B| 10.0|
|2021-10-01 00:00:00| C|100.0|
|2021-10-02 00:00:00| A| 3.0|
|2021-10-02 00:00:00| B| 10.0|
|2021-10-02 00:00:00| C|200.0|
|2021-10-03 00:00:00| A| 3.0|
|2021-10-03 00:00:00| B| 10.0|
|2021-10-03 00:00:00| C|200.0|
|2021-10-04 00:00:00| A| 3.0|
|2021-10-04 00:00:00| B| 46.0|
|2021-10-04 00:00:00| C|200.0|
|2021-10-05 00:00:00| A| 6.0|
|2021-10-05 00:00:00| B| 46.0|
|2021-10-05 00:00:00| C|200.0|
|2021-10-06 00:00:00| A| 6.0|
|2021-10-06 00:00:00| B| 46.0|
|2021-10-06 00:00:00| C|200.0|
|2021-10-07 00:00:00| A| 7.0|
|2021-10-07 00:00:00| B| 70.0|
|2021-10-07 00:00:00| C|700.0|
+-------------------+--------+-----+
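The same two steps (keep the last value per day, then repeat it across the gap to the next observed day) can be traced for a single variable in plain Python. This is only an illustrative sketch with a hypothetical helper `resample_daily`, using the observations of variable "A" from the input above:

```python
from datetime import datetime, timedelta

def resample_daily(observations):
    """Keep the last value per day, then repeat it until the next observed day."""
    last_per_day = {}
    for ts, value in sorted(observations):
        last_per_day[ts.date()] = value   # later timestamps overwrite earlier ones
    days = sorted(last_per_day)
    out = []
    for i, day in enumerate(days):
        # gap = days until the next observation, or 1 for the final one
        gap = (days[i + 1] - day).days if i + 1 < len(days) else 1
        out.extend((day + timedelta(days=c), last_per_day[day]) for c in range(gap))
    return out

series_a = [
    (datetime(2021, 10, 1, 3, 34, 23), 1.0),
    (datetime(2021, 10, 2, 1, 34, 23), 2.0),
    (datetime(2021, 10, 2, 5, 34, 23), 3.0),
    (datetime(2021, 10, 5, 14, 34, 23), 6.0),
    (datetime(2021, 10, 7, 9, 34, 23), 7.0),
]
resampled_a = resample_daily(series_a)
for day, value in resampled_a:
    print(day, value)
```

The values produced for "A" line up with the rows for that variable in the table above.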
Since Spark 2.4, you can use the built-in sequence function with a window to generate all the timestamps between one date of change and the next, and then use explode to flatten those timestamps.
If we start with the following dataframe df:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
and apply the following code:
from pyspark.sql import Window
from pyspark.sql import functions as F
next_start_time = F.lead('time').over(Window.partitionBy('variable').orderBy('time'))
end_time = F.when(next_start_time.isNull(),
F.col('time')
).otherwise(
F.date_sub(next_start_time, 1)
)
result = df.withColumn('start', F.col('time')) \
.withColumn('stop', end_time) \
.withColumn('time', F.explode(F.sequence(
F.col('start'), F.col('stop'), F.expr("INTERVAL 1 DAY"))
)) \
.drop('start', 'stop')
You get the following result dataframe:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-14 00:00:00|0.694283 |
|852-YF-007|2012-06-15 00:00:00|0.694283 |
|852-YF-007|2012-06-16 00:00:00|0.694283 |
|852-YF-007|2012-06-17 00:00:00|0.694283 |
|852-YF-007|2012-06-18 00:00:00|0.694283 |
|852-YF-007|2012-06-19 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-21 00:00:00|-2.669195|
|852-YF-007|2012-06-22 00:00:00|-2.669195|
|852-YF-007|2012-06-23 00:00:00|-2.669195|
|852-YF-007|2012-06-24 00:00:00|-2.669195|
|852-YF-007|2012-06-25 00:00:00|-2.669195|
|852-YF-007|2012-06-26 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
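The start/stop logic behind the `sequence` call can be followed in plain Python for one variable. The sketch below is an illustration, not Spark code: `expand_until_next` is a hypothetical helper, and stopping one `step` before the next change is my generalisation of the `date_sub(next_start_time, 1)` used above, which is equivalent only for a one-day step.

```python
from datetime import datetime, timedelta

def expand_until_next(rows, step=timedelta(days=1)):
    """rows: (time, value) pairs for one variable, sorted by time."""
    out = []
    for i, (start, value) in enumerate(rows):
        # stop one step before the next change; the last row covers only itself
        stop = rows[i + 1][0] - step if i + 1 < len(rows) else start
        t = start
        while t <= stop:
            out.append((t, value))
            t += step
    return out

rows = [
    (datetime(2012, 6, 13), 0.694283),
    (datetime(2012, 6, 20), -2.669195),
    (datetime(2012, 6, 27), 0.245842),
]
expanded = expand_until_next(rows)
for t, value in expanded:
    print(t, value)
```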