Inserting records in a spark dataframe

Inserting records in a spark dataframe - apache-spark

I have a dataframe in pyspark. Here is what it looks like,
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098930| 53 |
|670098934| 55 |
+---------+---------+
I want to fill in the gaps in timestamp with the previous state, so that I can get a perfect set to calculate time weighted averages. Here is what the output should be like -
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098929| 50 |
|670098930| 53 |
|670098931| 53 |
|670098932| 53 |
|670098933| 53 |
|670098934| 55 |
+---------+---------+
Eventually, I want to persist this new dataframe on disk and visualize my analysis.
How do I do this in pyspark? (For simplicity sake, I have just kept 2 columns. My actual dataframe has 89 columns with ~670 million records before filling the gaps.)

You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType
a=sc.parallelize([[670098928, 50],[670098930, 53], [670098934, 55]])\
.toDF(['timestamp','price'])
f=func.udf(lambda x:range(x,x+5),ArrayType(IntegerType()))
a.withColumn('timestamp',f(a.timestamp))\
.withColumn('timestamp',func.explode(func.col('timestamp')))\
.groupBy('timestamp')\
.agg(func.max(func.col('price')))\
.show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+

Related

Spark DataFrame Get Difference between values of two rows

I have calculated the average temperature for two cities grouped by seasons, but I'm having trouble in with getting the difference between the avg(TemperatureF) for City A vs City B. Here is an example of what my Spark Scala DataFrame looks like:
City
Season
avg(TemperatureF)
A
Fall
52
A
Spring
50
A
Summer
72
A
Winter
25
B
Fall
49
B
Spring
44
B
Summer
69
B
Winter
22

You may use the pivot function as follows:
df.groupBy('Season').pivot('City').agg(f.first('avg')) \
.withColumn('diff', f.expr('A - B')) \
.show()
+------+---+---+----+
|Season| A| B|diff|
+------+---+---+----+
|Spring| 50| 44| 6.0|
|Summer| 72| 69| 3.0|
| Fall| 52| 49| 3.0|
|Winter| 25| 22| 3.0|
+------+---+---+----+

How to get updated or new records by comparing two dataframe in pyspark

I have two dataframes like this:
df2.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 11| 500|
|Liza| 20| 900|
+----+-------+------+
df3.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
| Cal| 70| 888|
+----+-------+------+
df2 here, represents existing database records and df3 represents new records/updated records(any column) which need to be inserted/updated into db.For ex: NAME=PPan the new balance is 10 as per df3. so For NAME=PPan entire row has to be replaced in df2 and for NAME=Cal, a new row has to be added and for name=Liza will be untouched like this:
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
|Liza| 20| 900|
| Cal| 70| 888|
+----+-------+------+
How can I achieve this use case?

First you need to join both dataframes using full method to keep unmatched rows (new) and to updating the matched records I do prefer to use select with coalesce function:
joined_df = df2.alias('rec').join(df3.alias('upd'), on='NAME', how='full')
# +----+-------+------+-------+------+
# |NAME|BALANCE|SALARY|BALANCE|SALARY|
# +----+-------+------+-------+------+
# |Cal |null |null |70 |888 |
# |Liza|20 |900 |null |null |
# |PPan|11 |500 |10 |700 |
# +----+-------+------+-------+------+
output_df = joined_df.selectExpr(
'NAME',
'COALESCE(upd.BALANCE, rec.BALANCE) BALANCE',
'COALESCE(upd.SALARY, rec.SALARY) SALARY'
)
output_df.sort('BALANCE').show(truncate=False)
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan|10 |700 |
|Liza|20 |900 |
|Cal |70 |888 |
+----+-------+------+

Case Based scenerio in pyspark

The DataFrame in pyspark looks like below.
model,DAYS
MarutiDesire,15
MarutiErtiga,30
Suzukicelerio,45
I10lxi,60
Verna,55
Output i am trying to get like
Output : I am trying to get the output as
when days less than 30 than economical,
between 30 and 60 than average,
and when greater than 60 than Low Profit
Code i tried but giving incorrect output.
dataset1.selectExpr("*", "CASE WHEN DAYS <=30 THEN 'ECONOMICAL' WHEN DAYS>30 AND LESS THEN 60 THEN 'AVERAGE' ELSE 'LOWPROFIT' END REASON").show()
kindly share your suggestion. is there any better way to do this in pyspark.

>>> from pyspark.sql.functions import *
>>> df.show()
+-------------+----+
| model|DAYS|
+-------------+----+
| MarutiDesire| 15|
| MarutiErtiga| 30|
|Suzukicelerio| 45|
| I10lxi| 60|
| Verna| 55|
+-------------+----+
>>> df.withColumn("REMARKS", when(col("DAYS") < 30, lit("ECONOMICAL")).when((col("DAYS") >= 30) & (col("DAYS") < 60), lit("AVERAGE")).otherwise(lit("LOWPROFIT"))).show()
+-------------+----+----------+
| model|DAYS| REMARKS|
+-------------+----+----------+
| MarutiDesire| 15|ECONOMICAL|
| MarutiErtiga| 30| AVERAGE|
|Suzukicelerio| 45| AVERAGE|
| I10lxi| 60| LOWPROFIT|
| Verna| 55| AVERAGE|
+-------------+----+----------+

How to compare records from PySpark data frames

I would like to compare 2 data frames and I want to pull out the records based on below 3 conditions.
If the record is matching, 'SAME' should come in a new column FLAG.
If the record not matching, if it is from df1 (suppose No.66), 'DF1' should come in FLAG column.
If the record not matching, if it is from df2 (suppose No.77), 'DF2' should come in FLAG column.
Here whole RECORD need to consider and verify. Record wise comparison.
Also i need to check like this for millions of records using PySpark code.
df1:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
df2:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3000,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Xom,5000,mex,IT,2/11/2019
77,XYZ,5000,mex,IT,2/11/2019
Expected Output:
No,Name,Sal,Address,Dept,Join_Date,FLAG
11,Sam,1000,ind,IT,2/11/2019,SAME
22,Tom,2000,usa,HR,2/11/2019,SAME
33,Kom,3500,uk,IT,2/11/2019,DF1
33,Kom,3000,uk,IT,2/11/2019,DF2
44,Nom,4000,can,HR,2/11/2019,SAME
55,Vom,5000,mex,IT,2/11/2019,DF1
55,Xom,5000,mex,IT,2/11/2019,DF2
66,XYZ,5000,mex,IT,2/11/2019,DF1
77,XYZ,5000,mex,IT,2/11/2019,DF2
I loaded input data like below, but not getting idea on how to proceed.
df1 = pd.read_csv("D:\\inputs\\file1.csv")
df2 = pd.read_csv("D:\\inputs\\file2.csv")
Any help is appreciated. Thanks.

# Requisite packages to import
import sys
from pyspark.sql.functions import lit, count, col, when
from pyspark.sql.window import Window
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
Use window function to check if the count of same rows is more than 1 and if it indeed is, then mark column FLAG as SAME, else keep it the way it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+

I think you can solve your problem with the creation of temporary columns to indicate the source and a join. Then you only have to check for the conditions, i.e. if both sources are present or if only one source is there and which one.
Consider the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3500,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Vom',5000,'mex','IT','2/11/2019'),\
(66,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
df2= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3000,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Xom',5000,'mex','IT','2/11/2019'),\
(77,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
#creation of your example dataframes
df1 = df1.withColumn("Source1", lit("DF1"))
df2 = df2.withColumn("Source2", lit("DF2"))
#temporary columns to refer the origin later
df1.join(df2, ["No","Name","Sal","Address","Dept","Join_Date"],"full")\
#full join on all columns, but source is only set if record appears in original dataframe\
.withColumn("FLAG",when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")\
#condition if record appears in both dataframes\
.otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2")))\
#condition if record appears in one dataframe\
.drop("Source1","Source2").show() #remove temporary columns and show result
Output:
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+

PySpark: how to resample frequencies

Imagine a Spark Dataframe consisting of value observations from variables. Each observation has a specific timestamp and those timestamps are not the same between different variables. This is because the timestamp is generated when the value of a variable changed and is recorded.
#Variable Time Value
#852-YF-007 2016-05-10 00:00:00 0
#852-YF-007 2016-05-09 23:59:00 0
#852-YF-007 2016-05-09 23:58:00 0
Problem I would like to put all variables into the same frequency (for instance 10min) using forward-fill. To visualize this, I copied a page from the Book "Python for Data Analysis". Question: How to do that on a Spark Dataframe in an efficient way?

Question: How to do that on a Spark Dataframe in an efficient way?
Spark DataFrame is simply not a good choice for an operation like this one. In general SQL primitives won't be expressive enough and PySpark DataFrame doesn't provide low level access required to implement it.
While re-sampling can be easily represented using epoch / timestamp arithmetics. With data like this:
from pyspark.sql.functions import col, max as max_, min as min_
df = (spark
.createDataFrame([
("2012-06-13", 0.694), ("2012-06-20", -2.669), ("2012-06-27", 0.245)],
["ts", "val"])
.withColumn("ts", col("ts").cast("date").cast("timestamp")))
we can re-sample input:
day = 60 * 60 * 24
epoch = (col("ts").cast("bigint") / day).cast("bigint") * day
with_epoch = df.withColumn("epoch", epoch)
min_epoch, max_epoch = with_epoch.select(min_("epoch"), max_("epoch")).first()
and join with reference:
# Reference range
ref = spark.range(
min_epoch, max_epoch + 1, day
).toDF("epoch")
(ref
.join(with_epoch, "epoch", "left")
.orderBy("epoch")
.withColumn("ts_resampled", col("epoch").cast("timestamp"))
.show(15, False))
## +----------+---------------------+------+---------------------+
## |epoch |ts |val |ts_resampled |
## +----------+---------------------+------+---------------------+
## |1339459200|2012-06-13 00:00:00.0|0.694 |2012-06-12 02:00:00.0|
## |1339545600|null |null |2012-06-13 02:00:00.0|
## |1339632000|null |null |2012-06-14 02:00:00.0|
## |1339718400|null |null |2012-06-15 02:00:00.0|
## |1339804800|null |null |2012-06-16 02:00:00.0|
## |1339891200|null |null |2012-06-17 02:00:00.0|
## |1339977600|null |null |2012-06-18 02:00:00.0|
## |1340064000|2012-06-20 00:00:00.0|-2.669|2012-06-19 02:00:00.0|
## |1340150400|null |null |2012-06-20 02:00:00.0|
## |1340236800|null |null |2012-06-21 02:00:00.0|
## |1340323200|null |null |2012-06-22 02:00:00.0|
## |1340409600|null |null |2012-06-23 02:00:00.0|
## |1340496000|null |null |2012-06-24 02:00:00.0|
## |1340582400|null |null |2012-06-25 02:00:00.0|
## |1340668800|2012-06-27 00:00:00.0|0.245 |2012-06-26 02:00:00.0|
## +----------+---------------------+------+---------------------+
In Spark >= 3.1 replace
col("epoch").cast("timestamp")
with
from pyspark.sql.functions import timestamp_seconds
timestamp_seconds("epoch")
Using low level APIs it is possible to fill data like this as I've shown in my answer to Spark / Scala: forward fill with last observation. Using RDDs we could also avoid shuffling data twice (once for join, once for reordering).
But there is much more important problem here. Spark performs optimally when problem can be reduced to element wise or partition wise computations. While forward fill is the case when it is possible, as far as I am aware this is typically not the case with commonly used time series models and if some operation requires a sequential access then Spark won't provide any benefits at all.
So if you work with series which are large enough to require distributed data structure you'll probably want to aggregate it to some object that can be easily handled by a single machine and then use your favorite non-distributed tool to handle the rest.
If you work with multiple time series where each can be handled in memory then there is of course sparkts, but I know you're already aware of that.

I once answered a similar question, it'a bit of a hack but the idea makes sense in your case. Map every value on to a list, then flatten the list vertically.
From: Inserting records in a spark dataframe:
You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType
a=sc.parallelize([[670098928, 50],[670098930, 53], [670098934, 55]])\
.toDF(['timestamp','price'])
f=func.udf(lambda x:range(x,x+5),ArrayType(IntegerType()))
a.withColumn('timestamp',f(a.timestamp))\
.withColumn('timestamp',func.explode(func.col('timestamp')))\
.groupBy('timestamp')\
.agg(func.max(func.col('price')))\
.show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+

This is an old post, though I recently had to solve this with Spark 3.2. Here's the solution I came up with to both up-sample and down-sample the time-series to obtain exactly one data-point per object and per time period.
Assuming the following input data that we want to re-sample per day. Some variable have several data points per day, some have no data during several days:
from pyspark.sql.types import StructType, StringType, ArrayType, DoubleType, TimestampType
from pyspark.sql.functions import udf, date_trunc, row_number, desc, coalesce, datediff, lead, explode, col, lit
from pyspark.sql import Window, Row
from datetime import datetime, timedelta
df = spark.createDataFrame([
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-01T03:34:23.000"), value=1.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-01T04:34:23.000"), value=10.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-01T05:34:23.000"), value=100.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T01:34:23.000"), value=2.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=3.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T02:34:23.000"), value=200.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=200.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T10:34:23.000"), value=40.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T12:34:23.000"), value=42.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T14:34:23.000"), value=46.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-05T14:34:23.000"), value=6.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-07T09:34:23.000"), value=7.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-07T08:34:23.000"), value=70.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-07T05:34:23.000"), value=700.),
])
I first need this simple udf which essentially just builds a sequence of timestamps:
#udf(ArrayType(TimestampType()))
def pad_time(count: int, start_time: datetime):
if repeated_count is None:
return []
else:
return [start_time + timedelta(days=c) for c in range(count)]
Down-sampling can be done with a simple groupBy or partitionBy, keeping max 1 value per variable each day (I chose partitionBy in the example below).
Up-sampling with a "fill-forward" strategy can be done by measuring the size of a time gap between 2 successive rows, and then using this information to call the udf above.
df
# down-sampling by keeping the last value of each variable each day.
.withColumn("record_day", date_trunc("DAY", "record_ts"))
.withColumn("row_num",
row_number().over(
Window.partitionBy("variable", "record_day").orderBy(desc("record_ts"))
))
.where("row_num == 1")
# up-sampling part 1: counts the number of days to be filled (or 1 for the very last value)
.withColumn("gap",
coalesce(
datediff(
lead("record_day").over(Window.partitionBy("variable").orderBy("record_day")),
"record_day"),
lit(1))
)
.select(
# up-sampling part 2: just, pad the time axis as dictated by "gap", and the other two fields will be repeated
explode(pad_time("gap", "record_day")).alias("record_day"),
"variable",
"value",
)
.orderBy("record_day", "variable")
The results looks like that:
+-------------------+--------+-----+
| record_day|variable|value|
+-------------------+--------+-----+
|2021-10-01 00:00:00| A| 1.0|
|2021-10-01 00:00:00| B| 10.0|
|2021-10-01 00:00:00| C|100.0|
|2021-10-02 00:00:00| A| 3.0|
|2021-10-02 00:00:00| B| 10.0|
|2021-10-02 00:00:00| C|200.0|
|2021-10-03 00:00:00| A| 3.0|
|2021-10-03 00:00:00| B| 10.0|
|2021-10-03 00:00:00| C|200.0|
|2021-10-04 00:00:00| A| 3.0|
|2021-10-04 00:00:00| B| 46.0|
|2021-10-04 00:00:00| C|200.0|
|2021-10-05 00:00:00| A| 6.0|
|2021-10-05 00:00:00| B| 46.0|
|2021-10-05 00:00:00| C|200.0|
|2021-10-06 00:00:00| A| 6.0|
|2021-10-06 00:00:00| B| 46.0|
|2021-10-06 00:00:00| C|200.0|
|2021-10-07 00:00:00| A| 7.0|
|2021-10-07 00:00:00| B| 70.0|
|2021-10-07 00:00:00| C|700.0|
+-------------------+--------+-----+

Since Spark 2.4, you can use sequence built-in function with a window to generate all the timestamps between date of change and next date of change, and then use explode to flatten those timestamps.
If we start with the following dataframe df:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
when we use the following code:
from pyspark.sql import Window
from pyspark.sql import functions as F
next_start_time = F.lead('time').over(Window.partitionBy('variable').orderBy('time'))
end_time = F.when(next_start_time.isNull(),
F.col('time')
).otherwise(
F.date_sub(next_start_time, 1)
)
result = df.withColumn('start', F.col('time')) \
.withColumn('stop', end_time) \
.withColumn('time', F.explode(F.sequence(
F.col('start'), F.col('stop'), F.expr("INTERVAL 1 DAY"))
)) \
.drop('start', 'stop')
You get the following result dataframe:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-14 00:00:00|0.694283 |
|852-YF-007|2012-06-15 00:00:00|0.694283 |
|852-YF-007|2012-06-16 00:00:00|0.694283 |
|852-YF-007|2012-06-17 00:00:00|0.694283 |
|852-YF-007|2012-06-18 00:00:00|0.694283 |
|852-YF-007|2012-06-19 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-21 00:00:00|-2.669195|
|852-YF-007|2012-06-22 00:00:00|-2.669195|
|852-YF-007|2012-06-23 00:00:00|-2.669195|
|852-YF-007|2012-06-24 00:00:00|-2.669195|
|852-YF-007|2012-06-25 00:00:00|-2.669195|
|852-YF-007|2012-06-26 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Inserting records in a spark dataframe - apache-spark

Related

Spark DataFrame Get Difference between values of two rows

How to get updated or new records by comparing two dataframe in pyspark

Case Based scenerio in pyspark

How to compare records from PySpark data frames

PySpark: how to resample frequencies

Categories

Resources