Auto increment id in delta table while inserting

Auto increment id in delta table while inserting - apache-spark

I have a problem regarding merging csv files using pysparkSQL with delta table. I managed to create upsert function that update if matched and insert if not matched.
I want to add column ID to the final delta table and increment it each time we insert data. This column identify each row in our delta table. Is there any way to put that in place ?
def Merge(dict1, dict2):
res = {**dict1, **dict2}
return res
def create_default_values_dict(correspondance_df,marketplace):
dict_output = {}
for field in get_nan_keys_values(get_mapping_dict(correspondance_df, marketplace)):
dict_output[field] = 'null'
# We want to increment the id row each time we perform an insertion (TODO TODO TODO)
# if field == 'id':
# dict_output['id'] = col('id')+1
# else:
return dict_output
def create_matched_update_dict(mapping, products_table, updates_table):
output = {}
for k,v in mapping.items():
if k == 'source_name':
output['products.source_name'] = lit(v)
else:
output[products_table + '.' + k] = F.when(col(updates_table + '.' + v).isNull(), col(products_table + '.' + k)).when(col(updates_table + '.' + v).isNotNull(), col(updates_table + '.' + v))
return output
insert_dict = create_not_matched_insert_dict(mapping, 'products', 'updates')
default_dict = create_default_values_dict(correspondance_df_products, 'Cdiscount')
insert_values = Merge(insert_dict, default_dict)
update_values = create_matched_update_dict(mapping, 'products', 'updates')
delta_table_products.alias('products').merge(
updates_df_table.limit(20).alias('updates'),
"products.barcode_ean == updates.ean") \
.whenMatchedUpdate(set = update_values) \
.whenNotMatchedInsert(values = insert_values)\
.execute()
I tried to increment the column id in the function create_default_values_dict but it's seems to not working well, it doesn't auto increment by 1. Is there another way to solve this problem ? Thanks in advance :)

Databricks has IDENTITY columns for hosted Spark
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
[ ( [ START WITH start ] [ INCREMENT BY step ] ) ]
This works on Delta tables.
Example:
create table gen1 (
id long GENERATED ALWAYS AS IDENTITY
, t string
)
Requires Runtime version 10.4 or above.

Delta does not support auto-increment column types.
In general, Spark doesn't use auto-increment IDs, instead favoring monotonically increasing IDs. See functions.monotonically_increasing_id().
If you want to achieve auto-increment behavior you will have to use multiple Delta operations, e.g., query the max value + add it to a row_number() column computed via a window function + then write. This is problematic for two reasons:
Unless you introduce an external locking mechanism or some other way to ensure that no updates to the table happen in-between finding the max value and writing, you can end up with invalid data.
Using row_number() will reduce parallelism to 1, forcing all the data through a single core, which will be very slow with large data.
Bottom line, you really do not want to use auto-increment columns with Spark.
Hope this helps.

Related

Single Task Taking Long Time in PySpark

I am running a PySpark application where I am reading several Parquet files into Spark dataframes and created temporary views on them to use in my SQL query. So I have like 18 views where some are ~ 1TB, few in several GBs and some other smaller views. I am joining all of these and running my business logic to get the desired outcome. My code takes extremely long time to run (>3 hours) for this data. Looking at the Spark History Server, I can see there's one task that seems the culprit as the time taken, data spilled to memory and disk, shuffle read/write everything is way higher than the median. This indicates a data skew. So I even used salting on my large dataframes before creating the temp views. However there's still no difference in the execution time. I checked the number of partitions and it's already 792 (maximum I can have my current Glue config). I have also enabled adaptive query execution and adaptive skewJoin handling.
My original dataset was extremely huge largest table being ~40TB and has 2.5 years of data. I am trying to do a one time historical load and was unsuccessful on running over the entire data. With trial and error, I had to reduce this to processing 1TB of data at a time (for the largest table) which is still taking 3+ hours. This is not a scalable approach and hence I am looking for some inputs to optimize this.
Below are my app details:
Number of workers = 792
Spark config:
spark= (SparkSession
.builder
.appName("scmCaseAlertDatamartFullLoad")
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
.config("spark.sql.adaptive.enabled","true")
.config("spark.sql.broadcastTimeout","900")
.config("spark.sql.adaptive.skewJoin.enabled","true")
.getOrCreate()
)
Code (just included key relevant methods, starting point is loadSCMCseAlertData()):
def getIncomingMatchesFullData(self):
select_query_incoming_matches_full_data = """
SELECT DISTINCT alrt.caseid AS case_id,
alrt.alertid AS alert_id,
alrt.accountid AS account_id,
sc.created_time AS case_created_time,
sc.last_updated_time AS case_last_updated_time,
alrt.srccreatedtime AS alert_created_time,
aud.last_updated_by AS case_last_updated_by,
sc.closed_time AS case_last_close_time,
lcs.status AS case_status,
lcst.state AS case_state,
lcra.responsive_action,
sc.assigned_to AS case_assigned_to,
cr1.team_name AS case_assigned_to_team,
sc.resolved_by AS case_resolved_by,
cr2.team_name AS case_resolved_by_team,
aud.last_annotation AS case_last_annotation,
ca.name AS case_approver,
alrt.screeningdecision AS screening_decision,
ap.accountpool AS division,
lcd.decision AS case_current_decision,
CASE
WHEN sm.grylaclientid LIKE '%AddressService%' THEN 'Address Service'
WHEN sm.grylaclientid LIKE '%GrylaOrderProcessingService%' THEN 'Retail Checkout Service'
WHEN sm.grylaclientid = 'urn:cdo:GrylaBatchScreeningAAA:AWS:Default' THEN 'Batch Screening'
WHEN sm.grylaclientid = 'urn:cdo:OfficerJennyBindle:AWS:Default' THEN 'API'
ELSE 'Other'
END AS channel,
ap.businesstype AS business_type,
ap.businessname AS business_name,
ap.marketplaceid AS ap_marketplace_id,
ap.region AS ap_region,
ap.memberid AS ap_member_id,
ap.secondaryaccountpool AS secondary_account_pool,
sm.action AS client_action,
acl.added_by,
acl.lnb_id AS accept_list_lnb_id,
acl.created_time AS accept_list_created_time,
acl.source_case_id AS accept_list_source_case_id,
acs.status AS accept_list_status,
ap.street1 AS ap_line_1,
ap.street2 AS ap_line_2,
ap.street3 AS ap_line_3,
ap.city AS ap_city,
ap.state AS ap_state,
ap.postalcode AS ap_postal_code,
ap.country AS ap_country,
ap.fullname AS ap_full_name,
ap.email AS ap_email,
sm.screening_match_id AS dp_screening_match_id,
CASE
WHEN sm.matchtype = 'name_only_matching_details' THEN 'Name Only'
WHEN sm.matchtype = 'address_only_matching_details' THEN 'Address Only'
WHEN sm.matchtype = 'address_matching_details' THEN 'Address'
WHEN sm.matchtype = 'scr_matching_details' THEN 'SCR'
WHEN sm.matchtype = 'hotkey_matching_details' THEN 'HotKey'
END AS match_type,
sm.matchaction AS match_action,
alrt.batchfilename AS batch_file_id,
REGEXP_REPLACE(dp.name, '\\n|\\r|\\t', ' ') AS dp_matched_add_full_name,
dp.street AS dp_line1,
'' AS dp_line2,
dp.city AS dp_city,
dp.state AS dp_state,
dp.postalcode AS dp_postal_code,
dp.country AS dp_country,
dp.matchedplaces AS scr_value,
dp.hotkeyvalues AS hotkey_value,
sm.acceptlistid AS suppressed_by_accept_list_id,
sm.suppresseddedupe AS is_deduped,
sm.matchhash AS hash,
sm.matchdecision AS match_decision,
ap.addressid AS amazon_address_id,
ap.dateofbirth AS date_of_birth,
sm.grylaclientid AS gryla_client_id,
cr1.name AS case_assigned_to_role,
cr2.name AS case_resolved_by_role,
alrt.screeningengine AS screening_engine,
sm.srccreatedtime AS match_created_time,
sm.srclastupdatedtime AS match_updated_time,
to_date(sm.srclastupdatedtime,"yyyy-MM-dd") AS match_updated_date,
sm.match_updated_time_msec,
sm.suppressedby AS match_suppressed_by
FROM
cm_screening_match sm
JOIN
cm_screening_match_redshift smr ON sm.screening_match_id = smr.screening_match_id
LEFT JOIN
cm_case_alert alrt ON sm.screening_match_id = alrt.screening_match_id
LEFT JOIN
cm_amazon_party ap ON sm.screening_match_id = ap.screening_match_id
LEFT JOIN
cm_denied_party dp ON sm.screening_match_id = dp.screening_match_id
LEFT JOIN
cm_spectre_case sc ON alrt.caseid = sc.case_id
LEFT JOIN
cm_lookup_case_status lcs ON sc.status_id = lcs.status_id
LEFT JOIN
cm_lookup_case_state lcst ON sc.state_id = lcst.state_id
LEFT JOIN
cm_lookup_case_decision lcd ON sc.decision_id = lcd.decision_id
LEFT JOIN
cm_lookup_case_responsive_action lcra ON sc.responsive_action_id = lcra.responsive_action_id
LEFT JOIN
cm_user cu1 ON sc.assigned_to = cu1.alias
LEFT JOIN
cm_role cr1 ON cu1.current_role_id = cr1.role_id
LEFT JOIN
cm_user cu2 ON sc.resolved_by = cu2.alias
LEFT JOIN
cm_role cr2 ON cu2.current_role_id = cr2.role_id
LEFT JOIN
cm_accept_list acl ON acl.screening_match_id = sm.screening_match_id
LEFT JOIN
cm_lookup_accept_list_status acs ON acs.status_id = acl.status_id
LEFT JOIN
(
SELECT case_id,
last_value(username) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_updated_by,
last_value(description) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_annotation
FROM cm_spectre_case_audit
) aud ON sc.case_id = aud.case_id
LEFT JOIN
cm_approver ca ON sc.approver_id = ca.approver_id
"""
print(select_query_incoming_matches_full_data)
incomingMatchesFullDF = self.spark.sql(select_query_incoming_matches_full_data)
return incomingMatchesFullDF
def getBaseTables(self,matchtime_lower_threshold,matchtime_upper_threshold,cursor):
print('Fetching datalake data for matches created after: {}' .format(matchtime_lower_threshold))
matchDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_match)
matchDF = matchDF.select("screening_match_id","grylaclientid","action","matchtype","matchaction","acceptlistid","suppresseddedupe","matchhash","matchdecision","srccreatedtime","srclastupdatedtime","suppressedby","lastupdatedtime")
#.withColumn("screentime",to_timestamp("screentime")) \
matchDF = matchDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#matchDF = matchDF.repartition(2400,"screening_match_id")
matchDF = self.getLatestRecord(matchDF)
matchDF = matchDF.withColumn("salt", rand())
matchDF = matchDF.repartition("salt")
matchDF.createOrReplaceTempView("cm_screening_match")
print("Total from matchDF:",matchDF.count())
print("Number of paritions in matchDF: " ,matchDF.rdd.getNumPartitions())
alertDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_alert)
alertDF = alertDF.select("screening_match_id","caseid","alertid","accountid","srccreatedtime","screeningdecision","batchfilename","screeningengine","lastupdatedtime")
alertDF = alertDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#alertDF = alertDF.repartition(2400,"screening_match_id")
alertDF = self.getLatestRecord(alertDF)
alertDF = alertDF.withColumn("salt", rand())
alertDF = alertDF.repartition("salt")
alertDF.createOrReplaceTempView("cm_case_alert")
print("Total from alertDF:",alertDF.count())
print("Number of paritions in alertDF: " ,alertDF.rdd.getNumPartitions())
apDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_ap)
apDF = apDF.select("screening_match_id","accountpool","businesstype","businessname","marketplaceid","region","memberid","secondaryaccountpool","street1","street2","street3","city","state","postalcode","country","fullname","email","addressid","dateofbirth","lastupdatedtime")
apDF = apDF.withColumn("dateofbirth",to_date("dateofbirth","yyyy-MM-dd")) \
.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())) \
.drop("lastupdatedtime")
#apDF = apDF.repartition(2400,"screening_match_id")
apDF = self.getLatestRecord(apDF)
apDF = apDF.withColumn("salt", rand())
apDF = apDF.repartition("salt")
apDF.createOrReplaceTempView("cm_amazon_party")
print("Total from apDF:",apDF.count())
print("Number of paritions in apDF: " ,apDF.rdd.getNumPartitions())
dpDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_dp)
dpDF = dpDF.select("screening_match_id","name","street","city","state","postalcode","country","matchedplaces","hotkeyvalues","lastupdatedtime")
dpDF = dpDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#dpDF = dpDF.repartition(2400,"screening_match_id")
dpDF = self.getLatestRecord(dpDF)
dpDF = dpDF.withColumn("salt", rand())
dpDF = dpDF.repartition("salt")
dpDF.createOrReplaceTempView("cm_denied_party")
print("Total from dpDF:",dpDF.count())
print("Number of paritions in dpDF: " ,dpDF.rdd.getNumPartitions())
print('Fetching data from Redshift Base tables...')
self.getRedshiftData(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
caseAuditDF = self.spark.read.parquet(self.data_input_case_audit)
caseAuditDF.createOrReplaceTempView("cm_spectre_case_audit")
caseDF = self.spark.read.parquet(self.data_input_case)
caseDF.createOrReplaceTempView("cm_spectre_case")
caseStatusDF = self.spark.read.parquet(self.data_input_case_status)
caseStatusDF.createOrReplaceTempView("cm_lookup_case_status")
caseStateDF = self.spark.read.parquet(self.data_input_case_state)
caseStateDF.createOrReplaceTempView("cm_lookup_case_state")
caseDecisionDF = self.spark.read.parquet(self.data_input_case_decision)
caseDecisionDF.createOrReplaceTempView("cm_lookup_case_decision")
caseRespActDF = self.spark.read.parquet(self.data_input_case_responsive_action)
caseRespActDF.createOrReplaceTempView("cm_lookup_case_responsive_action")
userDF = self.spark.read.parquet(self.data_input_user)
userDF.createOrReplaceTempView("cm_user")
userSnapshotDF = self.spark.read.parquet(self.data_input_user_snapshot)
userSnapshotDF.createOrReplaceTempView("v_cm_user_snapshot")
roleDF = self.spark.read.parquet(self.data_input_role)
roleDF.createOrReplaceTempView("cm_role")
skillDF = self.spark.read.parquet(self.data_input_skill)
skillDF.createOrReplaceTempView("cm_skill")
lookupSkillDF = self.spark.read.parquet(self.data_input_lookup_skills)
lookupSkillDF.createOrReplaceTempView("cm_lookup_skills")
skillTypeDF = self.spark.read.parquet(self.data_input_skill_type)
skillTypeDF.createOrReplaceTempView("cm_skill_type")
acceptListDF = self.spark.read.parquet(self.data_input_accept_list)
acceptListDF.createOrReplaceTempView("cm_accept_list")
lookupAcceptListStatusDF = self.spark.read.parquet(self.data_input_lookup_accept_list_status)
lookupAcceptListStatusDF.createOrReplaceTempView("cm_lookup_accept_list_status")
approverDF = self.spark.read.parquet(self.data_input_approver)
approverDF.createOrReplaceTempView("cm_approver")
screeningMatchDF_temp = self.spark.read.parquet(self.data_input_screening_match_redshift)
screeningMatchLookupDF_temp = self.spark.read.parquet(self.data_input_lookup_screening_match_redshift)
screeningMatchLookupDF_temp_new = screeningMatchLookupDF_temp.withColumnRenamed("screening_match_id","lookupdf_screening_match_id")
"""
The screening_match_id in datalake table is a mix of alphanumeric match IDs (the ones in cm_lookup_screening_match_id in Redshift) and numeric (the ones in cm_screening_match in Redshift). Hence we combine the match IDs from both the Redshift tables. Also, there are matches which were created in the past but updated recently. Since updated date is only present in cm_screening_match and not in cm_lookup_screening_match_id, we will only have the numeric match Ids. When we join this to datalake table, we won't be able to find these matches as they are present in the alphanumeric form in datalake. Hence what we do is read the entire table of cm_lookup_screening_match_id and join it with cm_screening_match to enrich cm_screening_match with the alphanumeric match Id. Finally we filter cm_lookup_screening_match_id only for newly created matches and combine with the matches from enriched version of cm_screening_match.
"""
screeningMatchDF_enriched = screeningMatchDF_temp.join(screeningMatchLookupDF_temp_new,screeningMatchDF_temp.screening_match_id == screeningMatchLookupDF_temp_new.lookupdf_screening_match_id,"left")
screeningMatchDF_enriched = screeningMatchDF_enriched.withColumn("screening_match_id",col("screening_match_id").cast(StringType()))
screeningMatchDF = screeningMatchDF_enriched.select(col("screening_match_id")).union(screeningMatchDF_enriched.select(col("match_event_id")))
screeningMatchLookupDF = screeningMatchLookupDF_temp_new.filter("created_time > '{}'" .format(matchtime_lower_threshold)).select(col("match_event_id"))
screeningMatchRedshiftDF = screeningMatchDF.union(screeningMatchLookupDF)
#screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition(792,"screening_match_id")
screeningMatchRedshiftDF = screeningMatchRedshiftDF.withColumn("salt", rand())
screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition("salt")
screeningMatchRedshiftDF.createOrReplaceTempView("cm_screening_match_redshift")
print("Total from screeningMatchRedshiftDF:",screeningMatchRedshiftDF.count())
def loadSCMCaseAlertTable(self):
print('Getting the thresholds for data to be loaded')
matchtime_lower_threshold = self.getLowerThreshold('scm_case_alert_data')
print('Match time lower threshold is: {}' .format(matchtime_lower_threshold))
matchtime_upper_threshold = self.default_upper_threshold
print('Match time upper threshold is: {}' .format(matchtime_upper_threshold))
print("Getting the required base tables")
con = self.get_redshift_connection()
cursor = con.cursor()
self.getBaseTables(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
print("Getting the enriched dataset for incoming matches (the ones to be inserted or updated)")
incomingMatchesFullDF = self.getIncomingMatchesFullData()
print("Total records in incomingMatchesFullDF: ", incomingMatchesFullDF.count())
print("Copying the incoming data to temp work dir")
print("Clearing work directory: {}" .format(self.work_scad_path))
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_scad)
print("Writing data to work dir: {}" .format(self.work_scad_path))
#.coalesce(1) \
incomingMatchesFullDF.write \
.partitionBy("match_updated_date") \
.mode("overwrite") \
.parquet(self.work_scad_path + self.work_dir_partitioned_table_scad)
print("Data copied to work dir")
print("Reading data from work dir in a temporary dataframe")
incomingMatchesFullDF_copy = self.spark.read.parquet(self.work_scad_path + "scm_case_alert_data_work.parquet/")
if self.update_mode == 'overwrite':
print("Datamart update mode is overwrite. New data will replace existing data.")
print("Publishing to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
elif self.update_mode == 'upsert':
print("Datamart update mode is upsert. New data will be loaded and existing data will be updated.")
print("Checking for cases updated between {} and {}" .format(matchtime_lower_threshold,matchtime_upper_threshold))
updatedCasesDF = self.getUpdatedCases(matchtime_lower_threshold,matchtime_upper_threshold)
updatedCasesDF.createOrReplaceTempView("updated_cases")
print("Getting updated case attributes")
updatedCaseAttributesDF = self.getUpdatedCaseAttributes()
print("Moving updated case data to temp work directory: {}".format(self.work_updated_cases_path))
print("Clearing work directory")
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_updated_cases)
try:
print("Writing data to work dir: {}" .format(self.work_updated_cases_path))
updatedCaseAttributesDF.coalesce(1) \
.write \
.mode("overwrite") \
.parquet(self.work_updated_cases_path + "updated_cases.parquet")
except Exception as e:
e = sys.exc_info()[0]
print("No data to write to work dir")
print("Starting the process to publish data to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
print('Updating metadata table')
matchtime_lower_threshold_new = incomingMatchesFullDF_copy.agg({'match_updated_time': 'max'}).collect()[0][0]
if matchtime_lower_threshold_new is not None:
matchtime_lower_threshold_new_formatted = matchtime_lower_threshold_new.strftime("%Y-%m-%d %H:%M:%S")
print("Latest match time lower threshold with new load: {}" .format(matchtime_lower_threshold_new_formatted))
self.updatePipelineMetadata('scm_case_alert_data','max_data_update_time',matchtime_lower_threshold_new_formatted)
else:
print("No new matches, leaving max_data_update_time for match as it is")
print("Metadata table up to date")
print("Committing the updates to Redshift and closing the connection")
con.commit() #Committing after the metadata table is updated to ensure the datamart data and threshold are aligned
cursor.close()
con.close()
Spark History Server Screenshot:

As you have correctly felt, you're having data skew issues. This is really apparent from your last screenshot. Have a look at the shuffle read/write sizes! The thing that you have to find out is: for which shuffle operation (looks like a join) are you having this issue?
Only salting the large dataframes without knowing where your skew is wont solve the issue.
So, my proposed plan of action:
You see that stage 112 from your picture is the problematic stage. Figure out which join operation this is about. In the SQL tab of the web-ui you can find that stage 112 and hover over it. That should give you enough info to figure out which shuffle/join key is skewed.
Once you know which key is skewed, understand the statistical contents of your key using spark-shell or something like that. Figure out which value is overly common. This will help in making future decisions. A simple df.groupBy("problematicKey").count will already be really interesting.
Once you know that, you can go ahead and salt that specific key.
But you're absolutely on the right track! Keeping an eye on that Tasks page and the time it takes for each task is a great approach!
Hope this helps :)

How to make my identity column consecutive on delta table in Azure Databricks?

I am trying to create a delta table with a consecutive identity column. The goal is for our clients to see if there is some data they did not receive from us.
It looks like the generated identity column is not consecutive. Which makes the "INCREMENT BY 1" quite misleading.
store_visitor_type_name = ["apple","peach","banana","mango","ananas"]
card_type_name = ["door","desk","light","coach","sink"]
store_visitor_type_desc = ["monday","tuesday","wednesday","thursday","friday"]
colnames = ["column2","column3","column4"]
data_frame = spark.createDataFrame(zip(store_visitor_type_name,card_type_name,store_visitor_type_desc),colnames)
data_frame.createOrReplaceTempView('vw_increment')
data_frame.display()
%sql
CREATE or REPLACE TABLE TEST(
`column1SK` BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
,`column2` STRING
,`column3` STRING
,`column4` STRING
,`inserted_timestamp` TIMESTAMP
,`modified_timestamp` TIMESTAMP
)
USING delta
LOCATION '/mnt/Marketing/Sales';
MERGE INTO TEST as target
USING vw_increment as source
ON target.`column2` = source.`column2`
WHEN MATCHED
AND (target.`column3` <> source.`column3`
OR target.`column4` <> source.`column4`)
THEN
UPDATE SET
`column2` = source.`column2`
,`modified_timestamp` = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (
`column2`
,`column3`
,`column4`
,`modified_timestamp`
,`inserted_timestamp`
) VALUES (
source.`column2`
,source.`column3`
,source.`column4`
,current_timestamp()
,current_timestamp()
)
I'm getting the following results. You can see this is not sequential.What is also very confusing is that it is not starting at 1, while explicitely mentionned in the query.
I can see in the documentation (https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters) :
The automatically assigned values start with start and increment by
step. Assigned values are unique but are not guaranteed to be
contiguous. Both parameters are optional, and the default value is 1.
step cannot be 0.
Is there a workaround to make this identity column consecutive ?
I guess I could have another column and do a ROW_NUMBER operation after the MERGE, but it looks expensive.

You can utilize Pyspark to achieve the requirement instead of using row_number() function.
I have read the TEST table as a spark dataframe and converted it to pandas on spark dataframe. In pandas dataframe, using reset_index(), I have created a new index column.
Then I have converted it back to spark dataframe. I have added 1 to the index column values since the index starts with 0.
df = spark.sql("select * from test")
pdf = df.to_pandas_on_spark()
#to create new index column.
pdf.reset_index(inplace=True)
final_df = pdf.to_spark()
#Since index starts from 0, I have added 1 to it.
final_df.withColumn('index',final_df['index']+1).show()

Out of memory while checking string columns and saving error values to Databricks

I need to do a double quotes check in a dataframe. So I am iterating through all the columns for this check but takes lot of time. I am using Azure Databricks for this.
for column in columns_list:
column_name = "`" + column + "`"
df_reject = source_data.withColumn("flag_quotes",when(source_data[column_name].rlike("[\"\"]"),lit("Yes")).otherwise(lit("No")))
df_quo_rejected_df = df_reject.filter(col("flag_quotes") == "Yes")
df_quo_rejected_df = df_quo_rejected_df.withColumn('Error', lit(err))
df_quo_rejected_df.coalesce(1).write.mode("append").option("header","true")\
.option("delimiter",delimiter)\
.format("com.databricks.spark.csv")\
.save(filelocwrite)
I have got around 500 columns with 40 million records. I tried union the dataframes every iteration but the operation does OOM after sometime. So I save the dataframe and append it every iteration. Please help me with a way to optimize the running time.

Instead of looping through columns you can try checking their values using exists.
from pyspark.sql import functions as F
columns_list = [f"`{c}`" for c in columns_list]
df_reject = source_data.filter(F.exists(F.array(*columns_list), lambda x: x.rlike("[\"\"]")))
df_cols_add = df_reject.select('*', F.lit('Yes').alias('flag_quotes'), F.lit(err).alias('Error'))

Spark window function and taking first and last values per column per partition (aggregation over window)

Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
w = Window().partitionBy(id).rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
key: F.first(key, True).over(w).alias(key)
for key in some_dict.keys()
if (some_dict[key] == 'test1')
}
test2 = {
key: F.last(key, True).over(w).alias(k)
for k in some_dict.keys()
if (some_dict[k] == 'test2')
}
Assume that I have some_dict with values either as test1 or test2 and based on the value, I either take the first or last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.value())
df.select(*cols).groupBy('id').agg(*cols) # Doesnt work
The above clearly doesn't work. Any ideas?
Goal here is : I have 5 unique IDs and 25 rows with each ID having 5 rows. I want to reduce it to 5 rows from 25.

Let assume you dataframe name df which contains duplicate use below method
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number.over(window)).filter("row_id = 1")
final.show(10,False)
change the order by condition in case there is specific criteria so that particular record will be on top of partition

How single partition batch in Cassandra function for multiple column update?

We have multiple update queries in a single partition of a single columnfamily. Like below
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', weight = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', height = ? where userid = 100;
username, url will be always same and are mandatory fields. But depending on the information given there will be extra columns.
As this is a single partition operation and we need atomicity + isolation. We will execute this in a batch.
As per Doc
A BATCH statement combines multiple data modification language (DML) statements (INSERT, UPDATE, DELETE) into a single logical operation, and sets a client-supplied timestamp for all columns written by the statements in the batch.
Now as we are updating columns(username, url) with same value in multiple statement, will C* combines it as a single statement before executing it like
update t1 set username = 'abc', url = 'www.something.com', age = ?, weight = ?, height = ? where userid = 100;
or same value will be upsert?
Another question is that, as they all have the same timestamp how C* resolves that conflict. Will C* compare every column (username, url) value.
As they all have the same timestamp C* resolves the conflict by choosing the largest value for the cells. Atomic Batch in Cassandra
Or should we add queries in batch like below. In this case we have to check username, url has already been added in statement.
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set weight = ? where userid = 100;
update t1 set height = ? where userid = 100;
In short what will be the best way to do it.

For your first question(will C* combines it as a single statement?) answer is yes.
A single partition batch is applied as a single row mutation.
check this link for details:
https://issues.apache.org/jira/browse/CASSANDRA-6737
For your second question(Will C* compare every column (username, url) value?) answer is also yes.
As given in the answer of your provided link "Conflict is resolved by choosing the largest value for the cells"
So, you can write queries in batch in either way(given in your question).
As it will ultimately converted to a single write internally.

You are using Single partition batch so everything goes into a single partition.So all of your update will be merge and applied with a single RowMutation.
And so your update will be applied with no batch log, atomic, isolated

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Auto increment id in delta table while inserting - apache-spark

Related

Single Task Taking Long Time in PySpark

How to make my identity column consecutive on delta table in Azure Databricks?

Out of memory while checking string columns and saving error values to Databricks

Spark window function and taking first and last values per column per partition (aggregation over window)

How single partition batch in Cassandra function for multiple column update?

Categories

Resources