Pyspark - Applying custom function on structured streaming - apache-spark

I have 4 columns: ['clientTimestamp', 'sensor_id', 'activity', 'incidents']. I consume data from a Kafka stream, preprocess it, and aggregate it in a window.
If I group by and use ".count()", the stream works very well, writing each window with its count to the console.
This works:
from pyspark.sql.functions import col, window

df = df.withWatermark("clientTimestamp", "1 minutes") \
    .groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type')).count()
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
But the real goal is to find the total time for which critical activity was live.
That is, for each sensor_type I group the data by window, get the list of critical activities, and compute the total time for which all the critical activity lasted (the code is below). But I am not sure I am using the UDF the right way, because the method below does not work. Can anyone provide an example of applying a custom function to each window group and writing the output to the console?
This does not work:
@f.pandas_udf(schemahh, f.PandasUDFType.GROUPED_MAP)
def calculate_time(pdf):
    pdf = pdf.reset_index(drop=True)
    total_time = 0
    index_list = pdf.index[pdf['activity'] == 'critical'].to_list()
    for ind in index_list:
        start = pdf.loc[ind]['clientTimestamp']
        end = pdf.loc[ind + 1]['clientTimestamp']
        diff = start - end
        time_n_mins = round(diff.seconds / 60, 2)
        total_time = total_time + time_n_mins
    largest_session_time = total_time
    new_pdf = pd.DataFrame(columns=['sensor_type', 'largest_session_time'])
    new_pdf.loc[0] = [pdf.loc[0]['sensor_type'], largest_session_time]
    return new_pdf

df = df.withWatermark("clientTimestamp", "1 minutes") \
    .groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col('sensor_type'), col('activity')) \
    .apply(calculate_time)
query = df.writeStream.outputMode("append").format('console').start()
query.awaitTermination()
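For reference, a minimal sketch of what I think the grouped-map pandas UDF should look like: the output schema spelled out, the decorator applied, the duration computed as end minus start, and the grouping done only by window and sensor_type so that each group still contains the 'activity' column. The schema and column types are my assumptions, and I am not certain .apply() on a grouped streaming DataFrame is supported on every Spark version, so treat this as a sketch rather than a verified fix.
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.functions import col, window

# Output schema of the grouped-map UDF (assumed from new_pdf's columns above).
schemahh = "sensor_type string, largest_session_time double"

@f.pandas_udf(schemahh, f.PandasUDFType.GROUPED_MAP)
def calculate_time(pdf):
    # Each call receives one group (one window x sensor_type); sort it by time first.
    pdf = pdf.sort_values("clientTimestamp").reset_index(drop=True)
    total_time = 0.0
    critical_idx = pdf.index[pdf["activity"] == "critical"].to_list()
    for ind in critical_idx:
        if ind + 1 >= len(pdf):
            continue  # no following row to mark the end of this critical activity
        start = pdf.loc[ind, "clientTimestamp"]
        end = pdf.loc[ind + 1, "clientTimestamp"]
        total_time += round((end - start).total_seconds() / 60, 2)  # minutes
    return pd.DataFrame({"sensor_type": [pdf.loc[0, "sensor_type"]],
                         "largest_session_time": [total_time]})

result = (df.withWatermark("clientTimestamp", "1 minutes")
            .groupby(window(df.clientTimestamp, "1 minutes", "1 minutes"), col("sensor_type"))
            .apply(calculate_time))
query = result.writeStream.outputMode("append").format("console").start()
query.awaitTermination()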

Related

Single Task Taking Long Time in PySpark

I am running a PySpark application where I read several Parquet files into Spark dataframes and create temporary views on them to use in my SQL query. I have about 18 views: some are ~1 TB, a few are several GBs, and the rest are smaller. I join all of these and run my business logic to get the desired outcome. My code takes an extremely long time to run (>3 hours) for this data. Looking at the Spark History Server, I can see one task that seems to be the culprit: its duration, data spilled to memory and disk, and shuffle read/write are all way higher than the median. This indicates data skew. So I even applied salting to my large dataframes before creating the temp views, but there is still no difference in execution time. I checked the number of partitions and it is already 792 (the maximum I can have with my current Glue config). I have also enabled adaptive query execution and adaptive skew-join handling.
My original dataset is extremely large, with the biggest table at ~40 TB holding 2.5 years of data. I am trying to do a one-time historical load and was unsuccessful running over the entire dataset. Through trial and error, I reduced this to processing 1 TB of data at a time (for the largest table), which still takes 3+ hours. This is not a scalable approach, so I am looking for input on how to optimize it.
Below are my app details:
Number of workers = 792
Spark config:
spark= (SparkSession
.builder
.appName("scmCaseAlertDatamartFullLoad")
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
.config("spark.sql.adaptive.enabled","true")
.config("spark.sql.broadcastTimeout","900")
.config("spark.sql.adaptive.skewJoin.enabled","true")
.getOrCreate()
)
Code (only the key relevant methods are included; the starting point is loadSCMCaseAlertTable()):
def getIncomingMatchesFullData(self):
select_query_incoming_matches_full_data = """
SELECT DISTINCT alrt.caseid AS case_id,
alrt.alertid AS alert_id,
alrt.accountid AS account_id,
sc.created_time AS case_created_time,
sc.last_updated_time AS case_last_updated_time,
alrt.srccreatedtime AS alert_created_time,
aud.last_updated_by AS case_last_updated_by,
sc.closed_time AS case_last_close_time,
lcs.status AS case_status,
lcst.state AS case_state,
lcra.responsive_action,
sc.assigned_to AS case_assigned_to,
cr1.team_name AS case_assigned_to_team,
sc.resolved_by AS case_resolved_by,
cr2.team_name AS case_resolved_by_team,
aud.last_annotation AS case_last_annotation,
ca.name AS case_approver,
alrt.screeningdecision AS screening_decision,
ap.accountpool AS division,
lcd.decision AS case_current_decision,
CASE
WHEN sm.grylaclientid LIKE '%AddressService%' THEN 'Address Service'
WHEN sm.grylaclientid LIKE '%GrylaOrderProcessingService%' THEN 'Retail Checkout Service'
WHEN sm.grylaclientid = 'urn:cdo:GrylaBatchScreeningAAA:AWS:Default' THEN 'Batch Screening'
WHEN sm.grylaclientid = 'urn:cdo:OfficerJennyBindle:AWS:Default' THEN 'API'
ELSE 'Other'
END AS channel,
ap.businesstype AS business_type,
ap.businessname AS business_name,
ap.marketplaceid AS ap_marketplace_id,
ap.region AS ap_region,
ap.memberid AS ap_member_id,
ap.secondaryaccountpool AS secondary_account_pool,
sm.action AS client_action,
acl.added_by,
acl.lnb_id AS accept_list_lnb_id,
acl.created_time AS accept_list_created_time,
acl.source_case_id AS accept_list_source_case_id,
acs.status AS accept_list_status,
ap.street1 AS ap_line_1,
ap.street2 AS ap_line_2,
ap.street3 AS ap_line_3,
ap.city AS ap_city,
ap.state AS ap_state,
ap.postalcode AS ap_postal_code,
ap.country AS ap_country,
ap.fullname AS ap_full_name,
ap.email AS ap_email,
sm.screening_match_id AS dp_screening_match_id,
CASE
WHEN sm.matchtype = 'name_only_matching_details' THEN 'Name Only'
WHEN sm.matchtype = 'address_only_matching_details' THEN 'Address Only'
WHEN sm.matchtype = 'address_matching_details' THEN 'Address'
WHEN sm.matchtype = 'scr_matching_details' THEN 'SCR'
WHEN sm.matchtype = 'hotkey_matching_details' THEN 'HotKey'
END AS match_type,
sm.matchaction AS match_action,
alrt.batchfilename AS batch_file_id,
REGEXP_REPLACE(dp.name, '\\n|\\r|\\t', ' ') AS dp_matched_add_full_name,
dp.street AS dp_line1,
'' AS dp_line2,
dp.city AS dp_city,
dp.state AS dp_state,
dp.postalcode AS dp_postal_code,
dp.country AS dp_country,
dp.matchedplaces AS scr_value,
dp.hotkeyvalues AS hotkey_value,
sm.acceptlistid AS suppressed_by_accept_list_id,
sm.suppresseddedupe AS is_deduped,
sm.matchhash AS hash,
sm.matchdecision AS match_decision,
ap.addressid AS amazon_address_id,
ap.dateofbirth AS date_of_birth,
sm.grylaclientid AS gryla_client_id,
cr1.name AS case_assigned_to_role,
cr2.name AS case_resolved_by_role,
alrt.screeningengine AS screening_engine,
sm.srccreatedtime AS match_created_time,
sm.srclastupdatedtime AS match_updated_time,
to_date(sm.srclastupdatedtime,"yyyy-MM-dd") AS match_updated_date,
sm.match_updated_time_msec,
sm.suppressedby AS match_suppressed_by
FROM
cm_screening_match sm
JOIN
cm_screening_match_redshift smr ON sm.screening_match_id = smr.screening_match_id
LEFT JOIN
cm_case_alert alrt ON sm.screening_match_id = alrt.screening_match_id
LEFT JOIN
cm_amazon_party ap ON sm.screening_match_id = ap.screening_match_id
LEFT JOIN
cm_denied_party dp ON sm.screening_match_id = dp.screening_match_id
LEFT JOIN
cm_spectre_case sc ON alrt.caseid = sc.case_id
LEFT JOIN
cm_lookup_case_status lcs ON sc.status_id = lcs.status_id
LEFT JOIN
cm_lookup_case_state lcst ON sc.state_id = lcst.state_id
LEFT JOIN
cm_lookup_case_decision lcd ON sc.decision_id = lcd.decision_id
LEFT JOIN
cm_lookup_case_responsive_action lcra ON sc.responsive_action_id = lcra.responsive_action_id
LEFT JOIN
cm_user cu1 ON sc.assigned_to = cu1.alias
LEFT JOIN
cm_role cr1 ON cu1.current_role_id = cr1.role_id
LEFT JOIN
cm_user cu2 ON sc.resolved_by = cu2.alias
LEFT JOIN
cm_role cr2 ON cu2.current_role_id = cr2.role_id
LEFT JOIN
cm_accept_list acl ON acl.screening_match_id = sm.screening_match_id
LEFT JOIN
cm_lookup_accept_list_status acs ON acs.status_id = acl.status_id
LEFT JOIN
(
SELECT case_id,
last_value(username) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_updated_by,
last_value(description) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_annotation
FROM cm_spectre_case_audit
) aud ON sc.case_id = aud.case_id
LEFT JOIN
cm_approver ca ON sc.approver_id = ca.approver_id
"""
print(select_query_incoming_matches_full_data)
incomingMatchesFullDF = self.spark.sql(select_query_incoming_matches_full_data)
return incomingMatchesFullDF
def getBaseTables(self,matchtime_lower_threshold,matchtime_upper_threshold,cursor):
print('Fetching datalake data for matches created after: {}' .format(matchtime_lower_threshold))
matchDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_match)
matchDF = matchDF.select("screening_match_id","grylaclientid","action","matchtype","matchaction","acceptlistid","suppresseddedupe","matchhash","matchdecision","srccreatedtime","srclastupdatedtime","suppressedby","lastupdatedtime")
#.withColumn("screentime",to_timestamp("screentime")) \
matchDF = matchDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#matchDF = matchDF.repartition(2400,"screening_match_id")
matchDF = self.getLatestRecord(matchDF)
matchDF = matchDF.withColumn("salt", rand())
matchDF = matchDF.repartition("salt")
matchDF.createOrReplaceTempView("cm_screening_match")
print("Total from matchDF:",matchDF.count())
print("Number of paritions in matchDF: " ,matchDF.rdd.getNumPartitions())
alertDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_alert)
alertDF = alertDF.select("screening_match_id","caseid","alertid","accountid","srccreatedtime","screeningdecision","batchfilename","screeningengine","lastupdatedtime")
alertDF = alertDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#alertDF = alertDF.repartition(2400,"screening_match_id")
alertDF = self.getLatestRecord(alertDF)
alertDF = alertDF.withColumn("salt", rand())
alertDF = alertDF.repartition("salt")
alertDF.createOrReplaceTempView("cm_case_alert")
print("Total from alertDF:",alertDF.count())
print("Number of paritions in alertDF: " ,alertDF.rdd.getNumPartitions())
apDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_ap)
apDF = apDF.select("screening_match_id","accountpool","businesstype","businessname","marketplaceid","region","memberid","secondaryaccountpool","street1","street2","street3","city","state","postalcode","country","fullname","email","addressid","dateofbirth","lastupdatedtime")
apDF = apDF.withColumn("dateofbirth",to_date("dateofbirth","yyyy-MM-dd")) \
.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())) \
.drop("lastupdatedtime")
#apDF = apDF.repartition(2400,"screening_match_id")
apDF = self.getLatestRecord(apDF)
apDF = apDF.withColumn("salt", rand())
apDF = apDF.repartition("salt")
apDF.createOrReplaceTempView("cm_amazon_party")
print("Total from apDF:",apDF.count())
print("Number of paritions in apDF: " ,apDF.rdd.getNumPartitions())
dpDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_dp)
dpDF = dpDF.select("screening_match_id","name","street","city","state","postalcode","country","matchedplaces","hotkeyvalues","lastupdatedtime")
dpDF = dpDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#dpDF = dpDF.repartition(2400,"screening_match_id")
dpDF = self.getLatestRecord(dpDF)
dpDF = dpDF.withColumn("salt", rand())
dpDF = dpDF.repartition("salt")
dpDF.createOrReplaceTempView("cm_denied_party")
print("Total from dpDF:",dpDF.count())
print("Number of paritions in dpDF: " ,dpDF.rdd.getNumPartitions())
print('Fetching data from Redshift Base tables...')
self.getRedshiftData(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
caseAuditDF = self.spark.read.parquet(self.data_input_case_audit)
caseAuditDF.createOrReplaceTempView("cm_spectre_case_audit")
caseDF = self.spark.read.parquet(self.data_input_case)
caseDF.createOrReplaceTempView("cm_spectre_case")
caseStatusDF = self.spark.read.parquet(self.data_input_case_status)
caseStatusDF.createOrReplaceTempView("cm_lookup_case_status")
caseStateDF = self.spark.read.parquet(self.data_input_case_state)
caseStateDF.createOrReplaceTempView("cm_lookup_case_state")
caseDecisionDF = self.spark.read.parquet(self.data_input_case_decision)
caseDecisionDF.createOrReplaceTempView("cm_lookup_case_decision")
caseRespActDF = self.spark.read.parquet(self.data_input_case_responsive_action)
caseRespActDF.createOrReplaceTempView("cm_lookup_case_responsive_action")
userDF = self.spark.read.parquet(self.data_input_user)
userDF.createOrReplaceTempView("cm_user")
userSnapshotDF = self.spark.read.parquet(self.data_input_user_snapshot)
userSnapshotDF.createOrReplaceTempView("v_cm_user_snapshot")
roleDF = self.spark.read.parquet(self.data_input_role)
roleDF.createOrReplaceTempView("cm_role")
skillDF = self.spark.read.parquet(self.data_input_skill)
skillDF.createOrReplaceTempView("cm_skill")
lookupSkillDF = self.spark.read.parquet(self.data_input_lookup_skills)
lookupSkillDF.createOrReplaceTempView("cm_lookup_skills")
skillTypeDF = self.spark.read.parquet(self.data_input_skill_type)
skillTypeDF.createOrReplaceTempView("cm_skill_type")
acceptListDF = self.spark.read.parquet(self.data_input_accept_list)
acceptListDF.createOrReplaceTempView("cm_accept_list")
lookupAcceptListStatusDF = self.spark.read.parquet(self.data_input_lookup_accept_list_status)
lookupAcceptListStatusDF.createOrReplaceTempView("cm_lookup_accept_list_status")
approverDF = self.spark.read.parquet(self.data_input_approver)
approverDF.createOrReplaceTempView("cm_approver")
screeningMatchDF_temp = self.spark.read.parquet(self.data_input_screening_match_redshift)
screeningMatchLookupDF_temp = self.spark.read.parquet(self.data_input_lookup_screening_match_redshift)
screeningMatchLookupDF_temp_new = screeningMatchLookupDF_temp.withColumnRenamed("screening_match_id","lookupdf_screening_match_id")
"""
The screening_match_id in the datalake table is a mix of alphanumeric match IDs (the ones in cm_lookup_screening_match_id in Redshift) and numeric ones (the ones in cm_screening_match in Redshift), hence we combine the match IDs from both Redshift tables.
There are also matches which were created in the past but updated recently. Since the updated date is only present in cm_screening_match and not in cm_lookup_screening_match_id, we would only have the numeric match IDs for those, and when we join to the datalake table we would not find them because they are stored there in alphanumeric form.
Hence we read the entire cm_lookup_screening_match_id table and join it with cm_screening_match to enrich cm_screening_match with the alphanumeric match ID. Finally we filter cm_lookup_screening_match_id to only the newly created matches and combine them with the matches from the enriched version of cm_screening_match.
"""
screeningMatchDF_enriched = screeningMatchDF_temp.join(screeningMatchLookupDF_temp_new,screeningMatchDF_temp.screening_match_id == screeningMatchLookupDF_temp_new.lookupdf_screening_match_id,"left")
screeningMatchDF_enriched = screeningMatchDF_enriched.withColumn("screening_match_id",col("screening_match_id").cast(StringType()))
screeningMatchDF = screeningMatchDF_enriched.select(col("screening_match_id")).union(screeningMatchDF_enriched.select(col("match_event_id")))
screeningMatchLookupDF = screeningMatchLookupDF_temp_new.filter("created_time > '{}'" .format(matchtime_lower_threshold)).select(col("match_event_id"))
screeningMatchRedshiftDF = screeningMatchDF.union(screeningMatchLookupDF)
#screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition(792,"screening_match_id")
screeningMatchRedshiftDF = screeningMatchRedshiftDF.withColumn("salt", rand())
screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition("salt")
screeningMatchRedshiftDF.createOrReplaceTempView("cm_screening_match_redshift")
print("Total from screeningMatchRedshiftDF:",screeningMatchRedshiftDF.count())
def loadSCMCaseAlertTable(self):
print('Getting the thresholds for data to be loaded')
matchtime_lower_threshold = self.getLowerThreshold('scm_case_alert_data')
print('Match time lower threshold is: {}' .format(matchtime_lower_threshold))
matchtime_upper_threshold = self.default_upper_threshold
print('Match time upper threshold is: {}' .format(matchtime_upper_threshold))
print("Getting the required base tables")
con = self.get_redshift_connection()
cursor = con.cursor()
self.getBaseTables(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
print("Getting the enriched dataset for incoming matches (the ones to be inserted or updated)")
incomingMatchesFullDF = self.getIncomingMatchesFullData()
print("Total records in incomingMatchesFullDF: ", incomingMatchesFullDF.count())
print("Copying the incoming data to temp work dir")
print("Clearing work directory: {}" .format(self.work_scad_path))
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_scad)
print("Writing data to work dir: {}" .format(self.work_scad_path))
#.coalesce(1) \
incomingMatchesFullDF.write \
.partitionBy("match_updated_date") \
.mode("overwrite") \
.parquet(self.work_scad_path + self.work_dir_partitioned_table_scad)
print("Data copied to work dir")
print("Reading data from work dir in a temporary dataframe")
incomingMatchesFullDF_copy = self.spark.read.parquet(self.work_scad_path + "scm_case_alert_data_work.parquet/")
if self.update_mode == 'overwrite':
print("Datamart update mode is overwrite. New data will replace existing data.")
print("Publishing to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
elif self.update_mode == 'upsert':
print("Datamart update mode is upsert. New data will be loaded and existing data will be updated.")
print("Checking for cases updated between {} and {}" .format(matchtime_lower_threshold,matchtime_upper_threshold))
updatedCasesDF = self.getUpdatedCases(matchtime_lower_threshold,matchtime_upper_threshold)
updatedCasesDF.createOrReplaceTempView("updated_cases")
print("Getting updated case attributes")
updatedCaseAttributesDF = self.getUpdatedCaseAttributes()
print("Moving updated case data to temp work directory: {}".format(self.work_updated_cases_path))
print("Clearing work directory")
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_updated_cases)
try:
print("Writing data to work dir: {}" .format(self.work_updated_cases_path))
updatedCaseAttributesDF.coalesce(1) \
.write \
.mode("overwrite") \
.parquet(self.work_updated_cases_path + "updated_cases.parquet")
except Exception as e:
e = sys.exc_info()[0]
print("No data to write to work dir")
print("Starting the process to publish data to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
print('Updating metadata table')
matchtime_lower_threshold_new = incomingMatchesFullDF_copy.agg({'match_updated_time': 'max'}).collect()[0][0]
if matchtime_lower_threshold_new is not None:
matchtime_lower_threshold_new_formatted = matchtime_lower_threshold_new.strftime("%Y-%m-%d %H:%M:%S")
print("Latest match time lower threshold with new load: {}" .format(matchtime_lower_threshold_new_formatted))
self.updatePipelineMetadata('scm_case_alert_data','max_data_update_time',matchtime_lower_threshold_new_formatted)
else:
print("No new matches, leaving max_data_update_time for match as it is")
print("Metadata table up to date")
print("Committing the updates to Redshift and closing the connection")
con.commit() #Committing after the metadata table is updated to ensure the datamart data and threshold are aligned
cursor.close()
con.close()
Spark History Server Screenshot:
As you correctly suspected, you're having data skew issues. This is really apparent from your last screenshot: have a look at the shuffle read/write sizes! The thing you have to find out is: for which shuffle operation (it looks like a join) are you having this issue?
Only salting the large dataframes without knowing where your skew is won't solve the issue.
So, my proposed plan of action:
You can see that stage 112 in your picture is the problematic stage. Figure out which join operation it belongs to. In the SQL tab of the web UI you can find stage 112 and hover over it; that should give you enough info to figure out which shuffle/join key is skewed.
Once you know which key is skewed, look at the statistical distribution of that key using spark-shell or something similar, and figure out which value is overly common. This will help in making future decisions. A simple df.groupBy("problematicKey").count will already be really interesting.
Once you know that, you can go ahead and salt that specific key, as in the sketch below.
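To make the last two steps concrete, here is a minimal sketch of inspecting and then salting one skewed join key. The dataframe names (big_df, small_df), the key name problematicKey and the bucket count are hypothetical; the pattern is the usual "salt the big side, replicate the small side", not your exact pipeline.
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # tune to the severity of the skew

# Step 2: inspect the key distribution to confirm which value dominates.
big_df.groupBy("problematicKey").count().orderBy(F.desc("count")).show(20)

# Step 3: spread each key on the big (skewed) side over SALT_BUCKETS sub-keys...
big_salted = big_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the other side once per salt value so every sub-key still finds its match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_replicated = small_df.crossJoin(salts)

# Join on (key, salt) instead of the key alone, then drop the helper column.
joined = (big_salted.join(small_replicated, ["problematicKey", "salt"])
                    .drop("salt"))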
But you're absolutely on the right track! Keeping an eye on that Tasks page and the time it takes for each task is a great approach!
Hope this helps :)

Best practice to reduce the number of Spark jobs started. Use union?

With Spark, starting a job takes time.
For a complex workflow, it's possible to invoke a job in a loop.
But we pay for each 'start'.
def test_loop(spark):
    all_datas = []
    for i in ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']:
        all_datas.extend(spark.sql(f"""
            select * from data where id=='{i}'
            """).collect())  # Start a job
    return all_datas
Sometimes, it's possible to collapse the loop into one big job with 'union'.
def test_union(spark):
full_request = None
for i in ['CZ12905K01', 'CZ12809WRH' ,'CZ129086RP']:
q = f"""
select '{i}' ID, * from data where id=='{i}'
"""
partial_df = spark.sql(q)
if not full_request:
full_request = partial_df
else:
full_request = full_request.union(partial_df)
return full_request.collect() # Start a job
For clarity, my samples are elementary (I know I could use IN (...)). The real requests will be more complex.
Is it a good idea?
With the union approach, I can drastically reduce the number of jobs submitted, at the cost of a more complex job.
My tests show that:
It's possible to union more than 1000 requests for better performance.
For 950 requests with local[6]:
with 0 unions: 1h53m
with 10 unions: 20m01s
with 100 unions: 7m12s
with 200 unions: 6m02s
with 500 unions: 6m25s
Sometimes the union version has to "broadcast big data" or generates an "Out of memory".
My final approach: set a level_of_union.
Merge some requests, start the job, get the data,
then continue the loop with another batch.
def test_union(spark,level_of_union):
full_request = None
all_datas = []
todo=['CZ12905K01', 'CZ12809WRH' ,'CZ129086RP']
for idx,i in enumerate(todo):
q = f"""
select '{i}' ID,* from leh_be where leh_be_lot_id=='{i}'
"""
partial_df = spark.sql(q)
if not full_request:
full_request = partial_df
else:
full_request = full_request.union(partial_df)
if idx % level_of_union == level_of_union-1 or idx == len(todo)-1:
all_datas.extend(full_request.collect()) # Start a job
full_request=None
return all_datas
Run a test to tune the meta-parameter level_of_union (see the usage sketch below).
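For example, a quick (hypothetical) way to compare settings, assuming the table from the snippet above is registered as a temp view:
import time

# Compare a few batch sizes empirically; the candidate values are arbitrary.
for level in (10, 100, 200, 500):
    start = time.time()
    rows = test_union(spark, level_of_union=level)
    print(f"level_of_union={level}: {len(rows)} rows in {time.time() - start:.1f}s")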

How to use a PySpark window function on new, unprocessed data?

I have developed window functions on a PySpark DataFrame to calculate the total transaction amount made by a customer on a monthly basis, per transaction.
For example:
The input table has data:
And the window function processes the data and inserts it into the processed table.
Now, if I get new transactions today, I want to develop code that loads the last month of transactions into a Spark dataframe, runs the window function only for the new rows, and saves those into the processed table. Currently the window function processes all the rows, and I then have to manually avoid already-inserted records and insert only the new ones. This uses a lot of resources and memory, especially once the window grows to a year. (A sketch of the incremental idea follows the code below.)
import pyspark.sql.functions as F
from pyspark.sql.functions import col, when
from pyspark.sql.window import Window

#Function to apply window function
def cumulative_total_CR(df, from_column, to_column, window_function):
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'C', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df

def cumulative_total_DR(df, from_column, to_column, window_function):
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'D', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df

#Window Function: last 30 days per customer (timestamps cast to epoch seconds)
window_function_30_days = (Window.partitionBy("CUSNO")
                                 .orderBy(F.col("TxnDateTime").cast('long'))
                                 .rangeBetween(-30 * 86400, 0))

df = ...  # load data from Hive
#appending TxnDate and TxnTime into new column TxnDateTime with type casting as timestamp and format as 'yyyy-MM-dd HH:mm:ss.SSS'
df = cumulative_total_CR(df, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
df = cumulative_total_DR(df, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)
# ... save the data to disk for the new records
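One way I think this could be made incremental (a sketch only; processed_table and last_processed_ts are placeholder names for my processed table and its high-water mark): load only the last 30 days of history plus the new rows, run the same window functions over that slice, and keep only the rows newer than the watermark before appending.
from pyspark.sql import functions as F

# High-water mark: the latest TxnDateTime already present in the processed table (assumed name).
last_processed_ts = (spark.table("processed_table")
                          .agg(F.max("TxnDateTime"))
                          .collect()[0][0])

# Load only what the 30-day window needs as context, plus the new rows.
history_start = F.lit(last_processed_ts) - F.expr("INTERVAL 30 DAYS")
df_recent = df.where(F.col("TxnDateTime") >= history_start)

# Run the same window functions over this reduced slice.
df_recent = cumulative_total_CR(df_recent, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
df_recent = cumulative_total_DR(df_recent, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)

# Keep only the genuinely new rows; the older ones were only needed as window context.
new_rows = df_recent.where(F.col("TxnDateTime") > F.lit(last_processed_ts))
new_rows.write.mode("append").saveAsTable("processed_table")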

(Databricks Ad monetisation example) How do I find the latest match in a stream?

In the blog post "Introducing Stream-Stream Joins in Apache Spark 2.3" joining clicks with impressions based on their adId is discussed:
# Define watermarks
impressionsWithWatermark = impressions \
.selectExpr("adId AS impressionAdId", "impressionTime") \
.withWatermark("impressionTime", "10 seconds ") # max 10 seconds late
clicksWithWatermark = clicks \
.selectExpr("adId AS clickAdId", "clickTime") \
.withWatermark("clickTime", "20 seconds") # max 20 seconds late
# Inner join with time range conditions
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 minutes
"""
)
)
I'd like to know if it's possible to filter the resulting stream so that only the rows with the latest clickTime are included in each "query interval".
The query interval is the interval given in the query join condition:
clickTime >= impressionTime AND
clickTime <= impressionTime + interval 1 minutes
So I might get the following sequence
{type:impression, impressionAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 1}
{type:click, clickAdId:1, timestamp: 15}
And after t=60s or so Spark emits the following row in the dataframe:
{impressionTimestamp: 1, clickTimestamp: 15, clickAdId: 1, impressionAdId: 1}
I only posted Python code because that was what was in the article; answers with Java or Scala code are welcome too.
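One hedged sketch of what I have in mind: aggregate the joined stream by impression and take max(clickTime). Whether an aggregation chained after a stream-stream join actually runs as a streaming query depends on the Spark version's support for multiple stateful operators, so this is an assumption to verify rather than something from the article.
from pyspark.sql.functions import expr
from pyspark.sql.functions import max as max_

joined = impressionsWithWatermark.join(
    clicksWithWatermark,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 minutes
    """)
)

# Keep only the latest click observed for each impression within the join interval.
latestClicks = (joined
    .groupBy("impressionAdId", "impressionTime")
    .agg(max_("clickTime").alias("latestClickTime")))

query = (latestClicks.writeStream
    .outputMode("append")  # append relies on the upstream watermark on impressionTime
    .format("console")
    .start())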

How to implement a Spark SQL pagination query

Does anyone know how to do pagination in a Spark SQL query?
I need to use Spark SQL but don't know how to do pagination.
Tried:
select * from person limit 10, 10
It has been 6 years; I don't know if it was possible back then.
I would add a sequential id to the result and search for records between offset and offset + limit.
In pure Spark SQL it would look something like this, for offset 10 and limit 10:
WITH count_person AS (
SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
In PySpark it would be very similar:
import pyspark.sql.functions as F

offset = 10
limit = 10
df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where(F.col('_id').between(offset, offset + limit - 1))
It's flexible and fast enough even for a big volume of data.
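Note that monotonically_increasing_id() only guarantees increasing ids, not consecutive ones, so on multi-partition data the ranges above can return fewer rows than expected. A hedged alternative is row_number() over an explicit ordering (the ordering column 'id' below is an assumption):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

offset, limit = 10, 10

# row_number() yields consecutive 1-based numbers, at the cost of a global sort.
w = Window.orderBy("id")  # 'id' is an assumed ordering column
paged = (df.withColumn("_rn", F.row_number().over(w))
           .where(F.col("_rn").between(offset + 1, offset + limit))
           .drop("_rn"))
paged.show()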
karthik's answer will fail if there are duplicate rows in the dataframe: 'except' will remove all rows in df1 which are in df2.
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= 10 && i <= 20 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
There is no support for offset as of now in Spark SQL. One of the alternatives you can use for paging is the DataFrame except method.
Example: if you want to iterate with a paging limit of 10, you can do the following:
DataFrame df1;
long count = df.count();
int limit = 10;
while(count > 0){
df1 = df.limit(limit);
df1.show(); //will print 10, next 10, etc rows
df = df.except(df1);
count = count - limit;
}
If you want to do, say, LIMIT 50, 100 in the first go, you can do the following:
df1 = df.limit(50);
df2 = df.except(df1);
df2.limit(100); //required result
Hope this helps!
Please find below a useful PySpark (Python 3 and Spark 3) class named SparkPaging which abstracts the pagination mechanism:
https://gitlab.com/enahwe/public/lib/spark/sparkpaging
Here's the usage:
SparkPaging
Class for paging dataframes and datasets
Example
- Init example 1:
Approach by specifying a limit.
sp = SparkPaging(initData=df, limit=753)
- Init example 2:
Approach by specifying a number of pages (if there's a remainder, the number of pages will be incremented).
sp = SparkPaging(initData=df, pages=6)
- Init example 3:
Approach by specifying a limit.
sp = SparkPaging()
sp.init(initData=df, limit=753)
- Init example 4:
Approach by specifying a number of pages (if there's a remainder, the number of pages will be incremented).
sp = SparkPaging()
sp.init(initData=df, pages=6)
- Reset:
sp.reset()
- Iterate example:
print("- Total number of rows = " + str(sp.initDataCount))
print("- Limit = " + str(sp.limit))
print("- Number of pages = " + str(sp.pages))
print("- Number of rows in the last page = " + str(sp.numberOfRowsInLastPage))
while (sp.page < sp.pages-1):
df_page = sp.next()
nbrRows = df_page.count()
print(" Page " + str(sp.page) + '/' + str(sp.pages) + ": Number of rows = " + str(nbrRows))
- Output:
- Total number of rows = 4521
- Limit = 753
- Number of pages = 7
- Number of rows in the last page = 3
Page 0/7: Number of rows = 753
Page 1/7: Number of rows = 753
Page 2/7: Number of rows = 753
Page 3/7: Number of rows = 753
Page 4/7: Number of rows = 753
Page 5/7: Number of rows = 753
Page 6/7: Number of rows = 3
