How to use PySpark to do some calculations on a CSV file?

I am working on the CSV file below using PySpark (on Databricks), but I am not sure how to get the total duration of the scan events. Assume only one scan is running at a time.
+---+--------------------------+-------+-----+
|   |timestamp                 |event  |value|
+---+--------------------------+-------+-----+
|1  |2020-11-17_19:15:33.438102|scan   |start|
|2  |2020-11-17_19:18:33.433002|scan   |end  |
|3  |2020-11-17_20:05:21.538125|scan   |start|
|4  |2020-11-17_20:13:08.528102|scan   |end  |
|5  |2020-11-17_21:23:19.635104|pending|start|
|6  |2020-11-17_21:33:26.572123|pending|end  |
|7  |2020-11-17_22:05:29.738105|pending|start|
+---+--------------------------+-------+-----+
.........
Below are some of my thoughts (in pandas-style pseudocode):
# first get scan start time
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']
# then get scan end time
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']
# the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values
# total duration
total_duration_ns = each_duration.sum()
But I am not sure how to do the calculation in PySpark.
First, do we need to create a schema that pre-defines the 'timestamp' column as a timestamp type? (Assume all the columns (timestamp, event, value) are read as strings.)
Also, assume we have many (1000+) similar CSV files stored in Databricks; how can we write reusable code that processes all of them and eventually builds one table storing the total scan_duration for each file?
Can someone please share with me some code in PySpark?
Thank you so much

This code will compute for each row the difference between the current timestamp and the timestamp in the previous row.
I'm creating a dataframe for reproducibility.
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import *
from pyspark.sql.functions import regexp_replace, col, lag
import pandas as pd
spark = SparkSession.builder.appName("DataFrame").getOrCreate()
data = pd.DataFrame(
    {
        "timestamp": ["2020-11-17_19:15:33.438102", "2020-11-17_19:18:33.433002",
                      "2020-11-17_20:05:21.538125", "2020-11-17_20:13:08.528102"],
        "event": ["scan", "scan", "scan", "scan"],
        "value": ["start", "end", "start", "end"]
    }
)
df = spark.createDataFrame(data)
df.show()
# +--------------------+-----+-----+
# | timestamp|event|value|
# +--------------------+-----+-----+
# |2020-11-17_19:15:...| scan|start|
# |2020-11-17_19:18:...| scan| end|
# |2020-11-17_20:05:...| scan|start|
# |2020-11-17_20:13:...| scan| end|
# +--------------------+-----+-----+
Convert "timestamp" column to TimestampType() to be able to compute differences:
df=df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," "))
df.show(truncate=False)
# +——————————-------------———+---——+—---—+
# |timestamp |event|value|
# +————————————-------------—+---——+—---—+
# |2020-11-17 19:15:33.438102|scan |start|
# |2020-11-17 19:18:33.433002|scan |end |
# |2020-11-17 20:05:21.538125|scan |start|
# |2020-11-17 20:13:08.528102|scan |end |
# +——————————-------------———+---——+---——+
df = df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," ").cast(TimestampType()))
df.dtypes
# [('timestamp', 'timestamp'), ('event', 'string'), ('value', 'string')]
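As an aside, instead of regexp_replace plus cast you could try parsing the raw string directly with to_timestamp and an explicit pattern. This is just a sketch; the exact fractional-second pattern may depend on your Spark version:
from pyspark.sql.functions import to_timestamp, col

# Treat the "_" as a literal character in the datetime pattern (assumes Spark 3.x patterns).
df = df.withColumn("timestamp",
                   to_timestamp(col("timestamp"), "yyyy-MM-dd_HH:mm:ss.SSSSSS"))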
Use the pyspark.sql.functions.lag function, which returns the value of the previous row (offset=1 by default).
See also How to calculate the difference between rows in PySpark? or Applying a Window function to calculate differences in pySpark.
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.orderBy('timestamp')).cast("long")).show(truncate=False)
# WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Using a Window without a partition gives a warning.
It is better to partition the dataframe for the Window operation; here I partitioned by event type:
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.partitionBy("event").orderBy('timestamp')).cast("long")).show(truncate=False)
# +--------------------------+-----+-----+------------+
# |timestamp                 |event|value|lag_previous|
# +--------------------------+-----+-----+------------+
# |2020-11-17 19:15:33.438102|scan |start|null        |
# |2020-11-17 19:18:33.433002|scan |end  |180         |
# |2020-11-17 20:05:21.538125|scan |start|2808        |
# |2020-11-17 20:13:08.528102|scan |end  |467         |
# +--------------------------+-----+-----+------------+
From this table you can keep only the rows with value "end" and sum their lag_previous values to get the total scan duration.
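For example, a sketch of that final step, reusing the lag logic above (the duration_s and total_scan_s names are just illustrative):
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum as spark_sum

w = Window.partitionBy("event").orderBy("timestamp")
scan_durations = (df
    .withColumn("duration_s",
                col("timestamp").cast("long") - lag("timestamp").over(w).cast("long"))
    .filter((col("event") == "scan") & (col("value") == "end")))

# Sum the per-scan durations (in seconds) into a single total.
scan_durations.agg(spark_sum("duration_s").alias("total_scan_s")).show()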

Related

Date from week date format: 2022-W02-1 (ISO 8601) [duplicate]

Having a date, I create a column with ISO 8601 week date format:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])
df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),
        F.expr('weekday(date_col) + 1')
    )
)
df.show()
# +----------+-------------+
# | date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18| 2019-W12-1|
# |2019-12-30| 2020-W01-1|
# |2022-01-03| 2022-W01-1|
# |2022-01-10| 2022-W02-1|
# +----------+-------------+
Using Spark 3, how to get back the date, given ISO 8601 week date?
I tried the following, but it is both incorrect and uses the LEGACY configuration, which I don't like.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# | date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18| 2019-W12-1| 2019-03-18|
# |2019-12-30| 2020-W01-1| 2019-12-30|
# |2022-01-03| 2022-W01-1| 2021-12-27|
# |2022-01-10| 2022-W02-1| 2022-01-03|
# +----------+-------------+-------------+
I am aware that I could create a udf, which works:
import datetime

@F.udf('date')
def iso_to_date(iso_date):
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u')

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
But I am looking for a more efficient option. The ideal option should not use LEGACY configuration and be translatable to SQL or Scala (no inefficient udf).
In PySpark, I have found an option nicer than a plain udf. It uses pandas_udf, which is vectorized (more efficient):
import pandas as pd

@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    return pd.to_datetime(iso_date, format='%G-W%V-%u')

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# | date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18| 2019-W12-1| 2019-03-18|
# |2019-12-30| 2020-W01-1| 2019-12-30|
# |2022-01-03| 2022-W01-1| 2022-01-03|
# |2022-01-10| 2022-W02-1| 2022-01-10|
# +----------+-------------+-------------+
It works in Spark 3 without the LEGACY configuration. So it's acceptable.
However, there is room for improvement, as this option is not transferable to SQL or Scala.

(KNN) per-row computation using an outer DataFrame in PySpark

Question
My data structure is like this:
train_info (over 30,000 rows):
----------
odt: string (unique)
holiday_type: string
od_label: string
array: array<double> (variable length, depending on odt and holiday_type)
useful_index: array<int> (same length as the arrays)
...(other unimportant cols)
label_data (over 40,000 rows):
----------
holiday_type: string
od_label: string
l_origin_array: array<double> (variable length)
...(other unimportant cols)
My expected result is like this (same number of rows as train_info):
--------------
odt: string
holiday_label: string
od_label: string
prediction: int
My solution is like this:
if __name__ == '__main__':
    loop_item = train_info.collect()
    result = knn_for_loop(spark, loop_item, train_info.schema, label_data)
    # ----- do something -------

def knn_for_loop(spark, predict_list, schema, label_data):
    result = list()
    for i in predict_list:
        # turn this Row into a DataFrame and join it on the label data
        # to pick, for this row, the matching label data arrays
        predict_df = spark.sparkContext.parallelize([i]).toDF(schema) \
            .join(label_data, on=['holiday_type', "od_label"], how='left') \
            .withColumn("l_array",
                        UDFuncs.value_from_array_by_index(f.col('l_origin_array'), f.col("useful_index"))) \
            .toPandas()
        # pandas execution
        train_x = predict_df.l_array.values
        train_y = predict_df.label.values
        test_x = predict_df.array.values[0]
        test_y = KNN(train_x, train_y, test_x)
        result.append((i['odt'], i['holiday_type'], i['od_label'], test_y))
    return result
It works, but it is really slow; I estimate each row needs about 18 s.
In R I can do this easily with the do function:
train_info %>% group_by(odt) %>% do(., knn_loop, label_data)
Some things I have tried:
I tried to join them up front and query the result when I compute, but the data is too large to run (the two DataFrames have 400 million rows after the join, take up 180 GB of disk space on Hive, and query really slowly).
I tried to use pandas_udf, but it only allows a single pd.DataFrame parameter and is slow.
I tried to use a UDF, but a UDF can't receive a DataFrame object.
I tried to use the spark-knn package, but it fails with an error; maybe my offline installation is wrong.
thanks for your help.

How to dynamically know if pySpark DF has a null/empty value for given columns?

I have to check whether incoming data has any null, "" or " " values. The columns to check are not fixed; I am reading from a config that stores, for each file, the column names and their permissible nullability.
+----------+------------------+--------------------------------------------+
| FileName | Nullable | Columns |
+----------+------------------+--------------------------------------------+
| Sales | Address2,Phone2 | OrderID,Address1,Address2,Phone1,Phone2 |
| Invoice | Bank,OfcAddress | InvoiceNo,InvoiceID,Amount,Bank,OfcAddress |
+----------+------------------+--------------------------------------------+
So for each file I have to see which fields must not contain nulls, and on that basis either process the file or error it out. Is there any pythonic way to do this?
The table structure you’re showing makes me believe you have read the file containing these job details as a Spark DataFrame. You probably shouldn’t, as it’s very likely not big data. If you have it as a Spark DataFrame, collect it to the driver, so that you can create separate Spark jobs for each file.
Then, each job is fairly straightforward: you have a certain file location from which you must read. That info is captured by the FileName, I presume. Now, I will also presume the file format for each of these files is identical. If not, you’ll have to add meta data indicating the file format. For now, I assume it’s CSV.
Next, you must determine the subset of columns that needs to be checked for the presence of nulls. That’s easy: given that you have a list of all columns in the DataFrame (which could’ve been derived from the DataFrame generated by the previous step (the loading)) and a list of all columns that can contain nulls, the list of columns that can’t contain nulls is simply the difference between these two.
Finally, you aggregate over the DataFrame the number of nulls within each of these columns. As this is a DataFrame aggregate, there's only one row in the result set, so you can take head to bring it to the driver. Cast it to a dict for easier access to the attributes.
I’ve added a function, summarize_positive_counts, that returns the columns where there was at least one null record found, thereby invalidating the claim in the original table.
df.show(truncate=False)
# +--------+---------------+------------------------------------------+
# |FileName|Nullable |Columns |
# +--------+---------------+------------------------------------------+
# |Sales |Address2,Phone2|OrderID,Address1,Address2,Phone1,Phone2 |
# |Invoice |Bank,OfcAddress|InvoiceNo,InvoiceID,Amount,Bank,OfcAddress|
# +--------+---------------+------------------------------------------+
jobs = df.collect()  # bring it to the driver, to create separate Spark jobs from it
from pyspark.sql.functions import col, sum as spark_sum

def report_null_counts(frame, job):
    cols_to_verify_not_null = (set(job.Columns.split(","))
                               .difference(job.Nullable.split(",")))
    null_counts = frame.agg(*(spark_sum(col(_).isNull().cast("int")).alias(_)
                              for _ in cols_to_verify_not_null))
    return null_counts.head().asDict()

def summarize_positive_counts(filename, null_counts):
    return {filename: [colname for colname, nbr_of_nulls in null_counts.items()
                       if nbr_of_nulls > 0]}
for job in jobs:  # embarrassingly parallelizable
    frame = spark.read.csv(job.FileName, header=True)
    null_counts = report_null_counts(frame, job)
    print(summarize_positive_counts(job.FileName, null_counts))
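The question also asks about "" and " " values. Here is a minimal sketch of how the same aggregation could be extended to count blank or whitespace-only strings as well (treating trimmed-empty strings as missing is my assumption, not part of the original answer); it can be dropped in as a replacement for report_null_counts in the loop above:
from pyspark.sql.functions import col, trim, sum as spark_sum

def report_missing_counts(frame, job):
    # Count rows per non-nullable column that are null OR contain only whitespace.
    cols_to_verify = set(job.Columns.split(",")).difference(job.Nullable.split(","))
    missing_counts = frame.agg(*(
        spark_sum((col(c).isNull() | (trim(col(c)) == "")).cast("int")).alias(c)
        for c in cols_to_verify))
    return missing_counts.head().asDict()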

Poor performance on Window Lag function for large Spark dataframes [duplicate]

My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe.
For example, I have:
>>> df.show()
+-----+----------+
|index| col1|
+-----+----------+
| 0.0|0.58734024|
| 1.0|0.67304325|
| 2.0|0.85154736|
| 3.0| 0.5449719|
+-----+----------+
If I choose to calculate these using "Window" functions, then I can do that like so:
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index| col1| diffs_col1|
+-----+----------+-----------+
| 0.0|0.58734024|0.085703015|
| 1.0|0.67304325| 0.17850411|
| 2.0|0.85154736|-0.30657548|
| 3.0| 0.5449719| null|
+-----+----------+-----------+
Question: I explicitly partitioned the dataframe in a single partition. What is the performance impact of this and, if there is, why is that so and how could I avoid it? Because when I do not specify a partition, I get the following warning:
16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
In practice, the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one.
The difference is only in the total number of partitions created. Let's illustrate that with an example using a simple dataset with 10 partitions and 1000 records:
df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))
If you define frame without partition by clause
w_unpart = Window.orderBy(f.col("index").asc())
and use it with lag
df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)
there will be only one partition in total:
df_lag_unpart.rdd.glom().map(len).collect()
[1000]
Compare that to a frame definition with a dummy partitioning key (simplified a bit compared to your code):
w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())
which will use a number of partitions equal to spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)
df_lag_part.rdd.glom().count()
11
with only one non-empty partition:
df_lag_part.rdd.glom().filter(lambda x: x).count()
1
Unfortunately there is no universal solution which can be used to address this problem in PySpark; it is just an inherent consequence of the implementation combined with the distributed processing model.
Since the index column is sequential, you can generate an artificial partitioning key with a fixed number of records per block:
rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))
df_with_block = df.withColumn(
    "block", (f.col("index") / rec_per_block).cast("int")
)
and use it to define frame specification:
w_with_block = Window.partitionBy("block").orderBy("index")
df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)
This will use expected number of partitions:
df_lag_with_block.rdd.glom().count()
11
with roughly uniform data distribution (we cannot avoid hash collisions):
df_lag_with_block.rdd.glom().map(len).collect()
[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]
but with a number of gaps on the block boundaries:
df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
12
Since boundaries are easy to compute:
from itertools import chain
boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers.
    # This could be generalized to any monotonically increasing
    # id by taking min and max per block.
    (idx - 1, idx) for idx in
    df_lag_with_block.groupBy("block").min("index")
        .drop("block").rdd.flatMap(lambda x: x)
        .collect()))[2:]  # The first boundary doesn't carry useful info.
you can always select:
missing = df_with_block.where(f.col("index").isin(boundary_idxs))
and fill these separately:
# We use window without partitions here. Since number of records
# will be small this won't be a performance issue
# but will generate "Moving all data to a single partition" warning
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))
and join:
combined = (df_lag_with_block
    .join(missing_with_lag, ["index"], "leftouter")
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))
to get the desired result, which we can verify against the single-partition version:
mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
