AWS Glue / Spark - AnalysisException Cannot resolve column name after Filter -> Left Join - apache-spark

I am trying to run filter and left-join operations on some CSV files in AWS Glue 2.0 with PySpark. Sometimes, if the filter removes every row or the input CSV is empty, my job crashes with:
AnalysisException: 'Cannot resolve column name "col_a" among ();'
- I have seen this exception in a number of other questions, but I think my issue is that the header information is lost when all the rows are removed. Is this a DynamicFrame feature? (I could not find anything about it in the AWS Glue docs.)
- I realise that I could apply the filters after all the joins, but I want to avoid that because it seems more expensive, and because ideally the job should also not crash if the input is an empty CSV.
- Any suggestions greatly appreciated :)
Here is a mock of the PySpark code (please note that in the real job I would like to chain together many joins, transforms and filters):
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_1", transformation_ctx = "DataSource0")
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "table_2", transformation_ctx = "DataSource1")
Table_1_Renames = ApplyMapping.apply(frame = DataSource0, mappings = [("col_a", "string", "col_a", "string"), ("col_b", "string", "col_xyz", "string")], transformation_ctx = "Transform0")
Table_2_Renames = ApplyMapping.apply(frame = DataSource1, mappings = [("col0", "string", "col0_renamed", "string"), ("col1", "string", "col1_renamed", "string")], transformation_ctx = "Transform1")
Table_1_Filter = Filter.apply(frame = Table_1_Renames, f = lambda row : (bool(re.match("KeepValue", row["col_b"]))), transformation_ctx = "Table_1_Filter")
Table_1_Filter_DF = Table_1_Filter.toDF()
Table_2_Renames_DF = Table_2_Renames.toDF()
#If the original data was empty, or the filter removes all rows of the data, we get:
#AnalysisException: 'Cannot resolve column name "col_a" among ();'
LeftJoin_1 = DynamicFrame.fromDF(Table_1_Filter_DF.join(Table_2_Renames_DF, (Table_1_Filter_DF['col_a'] == Table_2_Renames_DF['col0_renamed']), "left"), glueContext, "LeftJoin_1")
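One possible workaround, sketched under the assumption that the cause is a DynamicFrame inferring its schema from the records it holds (so zero records convert to a DataFrame with no columns): check the converted DataFrame and fall back to an empty DataFrame with an explicit schema. The helper name with_fallback_schema and the DDL schema string are illustrative and not part of the Glue API; the only Spark calls used are DataFrame.columns and SparkSession.createDataFrame.

```python
def with_fallback_schema(df, spark, ddl_schema):
    """Return df unchanged, or an empty DataFrame with ddl_schema if df has no columns.

    df         -- a Spark DataFrame, e.g. the result of DynamicFrame.toDF()
    spark      -- the SparkSession (in a Glue job: glueContext.spark_session)
    ddl_schema -- expected schema as a DDL string, e.g. "col_a string, col_xyz string"
    """
    if len(df.columns) == 0:
        # No rows survived, so no schema was inferred: build an empty frame
        # that still carries the expected column names and types.
        return spark.createDataFrame([], ddl_schema)
    return df


# Hypothetical usage inside the job from the question:
# Table_1_Filter_DF = with_fallback_schema(
#     Table_1_Filter.toDF(), glueContext.spark_session,
#     "col_a string, col_xyz string")
```

With the columns present, the left join can resolve col_a even when there are no rows, and it simply returns an empty result.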

Related

Single Task Taking Long Time in PySpark

I am running a PySpark application that reads several Parquet files into Spark DataFrames and creates temporary views on them for use in a SQL query. There are about 18 views: some are ~1 TB, a few are several GB, and the rest are smaller. I join all of them and apply my business logic to get the desired output. The job takes extremely long (>3 hours) on this data. In the Spark History Server I can see one task that looks like the culprit: its duration, the data spilled to memory and disk, and its shuffle read/write are all far above the median, which indicates data skew. I therefore salted my large DataFrames before creating the temp views, but the execution time did not change. I checked the number of partitions and it is already 792 (the maximum my current Glue config allows). I have also enabled adaptive query execution and adaptive skew-join handling.
My original dataset is huge: the largest table is ~40 TB and holds 2.5 years of data. I am trying to do a one-time historical load and could not get it to run over the entire data. By trial and error I had to reduce it to processing 1 TB at a time (for the largest table), which still takes 3+ hours. This approach does not scale, so I am looking for input on how to optimize it.
Below are my app details:
Number of workers = 792
Spark config:
spark= (SparkSession
.builder
.appName("scmCaseAlertDatamartFullLoad")
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
.config("spark.sql.adaptive.enabled","true")
.config("spark.sql.broadcastTimeout","900")
.config("spark.sql.adaptive.skewJoin.enabled","true")
.getOrCreate()
)
Code (only the key relevant methods are included; the starting point is loadSCMCaseAlertTable()):
def getIncomingMatchesFullData(self):
    select_query_incoming_matches_full_data = """
SELECT DISTINCT alrt.caseid AS case_id,
alrt.alertid AS alert_id,
alrt.accountid AS account_id,
sc.created_time AS case_created_time,
sc.last_updated_time AS case_last_updated_time,
alrt.srccreatedtime AS alert_created_time,
aud.last_updated_by AS case_last_updated_by,
sc.closed_time AS case_last_close_time,
lcs.status AS case_status,
lcst.state AS case_state,
lcra.responsive_action,
sc.assigned_to AS case_assigned_to,
cr1.team_name AS case_assigned_to_team,
sc.resolved_by AS case_resolved_by,
cr2.team_name AS case_resolved_by_team,
aud.last_annotation AS case_last_annotation,
ca.name AS case_approver,
alrt.screeningdecision AS screening_decision,
ap.accountpool AS division,
lcd.decision AS case_current_decision,
CASE
WHEN sm.grylaclientid LIKE '%AddressService%' THEN 'Address Service'
WHEN sm.grylaclientid LIKE '%GrylaOrderProcessingService%' THEN 'Retail Checkout Service'
WHEN sm.grylaclientid = 'urn:cdo:GrylaBatchScreeningAAA:AWS:Default' THEN 'Batch Screening'
WHEN sm.grylaclientid = 'urn:cdo:OfficerJennyBindle:AWS:Default' THEN 'API'
ELSE 'Other'
END AS channel,
ap.businesstype AS business_type,
ap.businessname AS business_name,
ap.marketplaceid AS ap_marketplace_id,
ap.region AS ap_region,
ap.memberid AS ap_member_id,
ap.secondaryaccountpool AS secondary_account_pool,
sm.action AS client_action,
acl.added_by,
acl.lnb_id AS accept_list_lnb_id,
acl.created_time AS accept_list_created_time,
acl.source_case_id AS accept_list_source_case_id,
acs.status AS accept_list_status,
ap.street1 AS ap_line_1,
ap.street2 AS ap_line_2,
ap.street3 AS ap_line_3,
ap.city AS ap_city,
ap.state AS ap_state,
ap.postalcode AS ap_postal_code,
ap.country AS ap_country,
ap.fullname AS ap_full_name,
ap.email AS ap_email,
sm.screening_match_id AS dp_screening_match_id,
CASE
WHEN sm.matchtype = 'name_only_matching_details' THEN 'Name Only'
WHEN sm.matchtype = 'address_only_matching_details' THEN 'Address Only'
WHEN sm.matchtype = 'address_matching_details' THEN 'Address'
WHEN sm.matchtype = 'scr_matching_details' THEN 'SCR'
WHEN sm.matchtype = 'hotkey_matching_details' THEN 'HotKey'
END AS match_type,
sm.matchaction AS match_action,
alrt.batchfilename AS batch_file_id,
REGEXP_REPLACE(dp.name, '\\n|\\r|\\t', ' ') AS dp_matched_add_full_name,
dp.street AS dp_line1,
'' AS dp_line2,
dp.city AS dp_city,
dp.state AS dp_state,
dp.postalcode AS dp_postal_code,
dp.country AS dp_country,
dp.matchedplaces AS scr_value,
dp.hotkeyvalues AS hotkey_value,
sm.acceptlistid AS suppressed_by_accept_list_id,
sm.suppresseddedupe AS is_deduped,
sm.matchhash AS hash,
sm.matchdecision AS match_decision,
ap.addressid AS amazon_address_id,
ap.dateofbirth AS date_of_birth,
sm.grylaclientid AS gryla_client_id,
cr1.name AS case_assigned_to_role,
cr2.name AS case_resolved_by_role,
alrt.screeningengine AS screening_engine,
sm.srccreatedtime AS match_created_time,
sm.srclastupdatedtime AS match_updated_time,
to_date(sm.srclastupdatedtime,"yyyy-MM-dd") AS match_updated_date,
sm.match_updated_time_msec,
sm.suppressedby AS match_suppressed_by
FROM
cm_screening_match sm
JOIN
cm_screening_match_redshift smr ON sm.screening_match_id = smr.screening_match_id
LEFT JOIN
cm_case_alert alrt ON sm.screening_match_id = alrt.screening_match_id
LEFT JOIN
cm_amazon_party ap ON sm.screening_match_id = ap.screening_match_id
LEFT JOIN
cm_denied_party dp ON sm.screening_match_id = dp.screening_match_id
LEFT JOIN
cm_spectre_case sc ON alrt.caseid = sc.case_id
LEFT JOIN
cm_lookup_case_status lcs ON sc.status_id = lcs.status_id
LEFT JOIN
cm_lookup_case_state lcst ON sc.state_id = lcst.state_id
LEFT JOIN
cm_lookup_case_decision lcd ON sc.decision_id = lcd.decision_id
LEFT JOIN
cm_lookup_case_responsive_action lcra ON sc.responsive_action_id = lcra.responsive_action_id
LEFT JOIN
cm_user cu1 ON sc.assigned_to = cu1.alias
LEFT JOIN
cm_role cr1 ON cu1.current_role_id = cr1.role_id
LEFT JOIN
cm_user cu2 ON sc.resolved_by = cu2.alias
LEFT JOIN
cm_role cr2 ON cu2.current_role_id = cr2.role_id
LEFT JOIN
cm_accept_list acl ON acl.screening_match_id = sm.screening_match_id
LEFT JOIN
cm_lookup_accept_list_status acs ON acs.status_id = acl.status_id
LEFT JOIN
(
SELECT case_id,
last_value(username) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_updated_by,
last_value(description) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_annotation
FROM cm_spectre_case_audit
) aud ON sc.case_id = aud.case_id
LEFT JOIN
cm_approver ca ON sc.approver_id = ca.approver_id
"""
    print(select_query_incoming_matches_full_data)
    incomingMatchesFullDF = self.spark.sql(select_query_incoming_matches_full_data)
    return incomingMatchesFullDF
def getBaseTables(self,matchtime_lower_threshold,matchtime_upper_threshold,cursor):
    print('Fetching datalake data for matches created after: {}' .format(matchtime_lower_threshold))
    matchDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_match)
    matchDF = matchDF.select("screening_match_id","grylaclientid","action","matchtype","matchaction","acceptlistid","suppresseddedupe","matchhash","matchdecision","srccreatedtime","srclastupdatedtime","suppressedby","lastupdatedtime")
    #.withColumn("screentime",to_timestamp("screentime")) \
    matchDF = matchDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
    #matchDF = matchDF.repartition(2400,"screening_match_id")
    matchDF = self.getLatestRecord(matchDF)
    matchDF = matchDF.withColumn("salt", rand())
    matchDF = matchDF.repartition("salt")
    matchDF.createOrReplaceTempView("cm_screening_match")
    print("Total from matchDF:",matchDF.count())
    print("Number of partitions in matchDF: " ,matchDF.rdd.getNumPartitions())
    alertDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_alert)
    alertDF = alertDF.select("screening_match_id","caseid","alertid","accountid","srccreatedtime","screeningdecision","batchfilename","screeningengine","lastupdatedtime")
    alertDF = alertDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
    #alertDF = alertDF.repartition(2400,"screening_match_id")
    alertDF = self.getLatestRecord(alertDF)
    alertDF = alertDF.withColumn("salt", rand())
    alertDF = alertDF.repartition("salt")
    alertDF.createOrReplaceTempView("cm_case_alert")
    print("Total from alertDF:",alertDF.count())
    print("Number of partitions in alertDF: " ,alertDF.rdd.getNumPartitions())
    apDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_ap)
    apDF = apDF.select("screening_match_id","accountpool","businesstype","businessname","marketplaceid","region","memberid","secondaryaccountpool","street1","street2","street3","city","state","postalcode","country","fullname","email","addressid","dateofbirth","lastupdatedtime")
    apDF = apDF.withColumn("dateofbirth",to_date("dateofbirth","yyyy-MM-dd")) \
        .withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())) \
        .drop("lastupdatedtime")
    #apDF = apDF.repartition(2400,"screening_match_id")
    apDF = self.getLatestRecord(apDF)
    apDF = apDF.withColumn("salt", rand())
    apDF = apDF.repartition("salt")
    apDF.createOrReplaceTempView("cm_amazon_party")
    print("Total from apDF:",apDF.count())
    print("Number of partitions in apDF: " ,apDF.rdd.getNumPartitions())
    dpDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_dp)
    dpDF = dpDF.select("screening_match_id","name","street","city","state","postalcode","country","matchedplaces","hotkeyvalues","lastupdatedtime")
    dpDF = dpDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
    #dpDF = dpDF.repartition(2400,"screening_match_id")
    dpDF = self.getLatestRecord(dpDF)
    dpDF = dpDF.withColumn("salt", rand())
    dpDF = dpDF.repartition("salt")
    dpDF.createOrReplaceTempView("cm_denied_party")
    print("Total from dpDF:",dpDF.count())
    print("Number of partitions in dpDF: " ,dpDF.rdd.getNumPartitions())
    print('Fetching data from Redshift Base tables...')
    self.getRedshiftData(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
    caseAuditDF = self.spark.read.parquet(self.data_input_case_audit)
    caseAuditDF.createOrReplaceTempView("cm_spectre_case_audit")
    caseDF = self.spark.read.parquet(self.data_input_case)
    caseDF.createOrReplaceTempView("cm_spectre_case")
    caseStatusDF = self.spark.read.parquet(self.data_input_case_status)
    caseStatusDF.createOrReplaceTempView("cm_lookup_case_status")
    caseStateDF = self.spark.read.parquet(self.data_input_case_state)
    caseStateDF.createOrReplaceTempView("cm_lookup_case_state")
    caseDecisionDF = self.spark.read.parquet(self.data_input_case_decision)
    caseDecisionDF.createOrReplaceTempView("cm_lookup_case_decision")
    caseRespActDF = self.spark.read.parquet(self.data_input_case_responsive_action)
    caseRespActDF.createOrReplaceTempView("cm_lookup_case_responsive_action")
    userDF = self.spark.read.parquet(self.data_input_user)
    userDF.createOrReplaceTempView("cm_user")
    userSnapshotDF = self.spark.read.parquet(self.data_input_user_snapshot)
    userSnapshotDF.createOrReplaceTempView("v_cm_user_snapshot")
    roleDF = self.spark.read.parquet(self.data_input_role)
    roleDF.createOrReplaceTempView("cm_role")
    skillDF = self.spark.read.parquet(self.data_input_skill)
    skillDF.createOrReplaceTempView("cm_skill")
    lookupSkillDF = self.spark.read.parquet(self.data_input_lookup_skills)
    lookupSkillDF.createOrReplaceTempView("cm_lookup_skills")
    skillTypeDF = self.spark.read.parquet(self.data_input_skill_type)
    skillTypeDF.createOrReplaceTempView("cm_skill_type")
    acceptListDF = self.spark.read.parquet(self.data_input_accept_list)
    acceptListDF.createOrReplaceTempView("cm_accept_list")
    lookupAcceptListStatusDF = self.spark.read.parquet(self.data_input_lookup_accept_list_status)
    lookupAcceptListStatusDF.createOrReplaceTempView("cm_lookup_accept_list_status")
    approverDF = self.spark.read.parquet(self.data_input_approver)
    approverDF.createOrReplaceTempView("cm_approver")
    screeningMatchDF_temp = self.spark.read.parquet(self.data_input_screening_match_redshift)
    screeningMatchLookupDF_temp = self.spark.read.parquet(self.data_input_lookup_screening_match_redshift)
    screeningMatchLookupDF_temp_new = screeningMatchLookupDF_temp.withColumnRenamed("screening_match_id","lookupdf_screening_match_id")
    """
    The screening_match_id in datalake table is a mix of alphanumeric match IDs (the ones in cm_lookup_screening_match_id in Redshift) and numeric (the ones in cm_screening_match in Redshift). Hence we combine the match IDs from both the Redshift tables. Also, there are matches which were created in the past but updated recently. Since updated date is only present in cm_screening_match and not in cm_lookup_screening_match_id, we will only have the numeric match Ids. When we join this to datalake table, we won't be able to find these matches as they are present in the alphanumeric form in datalake. Hence what we do is read the entire table of cm_lookup_screening_match_id and join it with cm_screening_match to enrich cm_screening_match with the alphanumeric match Id. Finally we filter cm_lookup_screening_match_id only for newly created matches and combine with the matches from enriched version of cm_screening_match.
    """
    screeningMatchDF_enriched = screeningMatchDF_temp.join(screeningMatchLookupDF_temp_new,screeningMatchDF_temp.screening_match_id == screeningMatchLookupDF_temp_new.lookupdf_screening_match_id,"left")
    screeningMatchDF_enriched = screeningMatchDF_enriched.withColumn("screening_match_id",col("screening_match_id").cast(StringType()))
    screeningMatchDF = screeningMatchDF_enriched.select(col("screening_match_id")).union(screeningMatchDF_enriched.select(col("match_event_id")))
    screeningMatchLookupDF = screeningMatchLookupDF_temp_new.filter("created_time > '{}'" .format(matchtime_lower_threshold)).select(col("match_event_id"))
    screeningMatchRedshiftDF = screeningMatchDF.union(screeningMatchLookupDF)
    #screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition(792,"screening_match_id")
    screeningMatchRedshiftDF = screeningMatchRedshiftDF.withColumn("salt", rand())
    screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition("salt")
    screeningMatchRedshiftDF.createOrReplaceTempView("cm_screening_match_redshift")
    print("Total from screeningMatchRedshiftDF:",screeningMatchRedshiftDF.count())
def loadSCMCaseAlertTable(self):
    print('Getting the thresholds for data to be loaded')
    matchtime_lower_threshold = self.getLowerThreshold('scm_case_alert_data')
    print('Match time lower threshold is: {}' .format(matchtime_lower_threshold))
    matchtime_upper_threshold = self.default_upper_threshold
    print('Match time upper threshold is: {}' .format(matchtime_upper_threshold))
    print("Getting the required base tables")
    con = self.get_redshift_connection()
    cursor = con.cursor()
    self.getBaseTables(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
    print("Getting the enriched dataset for incoming matches (the ones to be inserted or updated)")
    incomingMatchesFullDF = self.getIncomingMatchesFullData()
    print("Total records in incomingMatchesFullDF: ", incomingMatchesFullDF.count())
    print("Copying the incoming data to temp work dir")
    print("Clearing work directory: {}" .format(self.work_scad_path))
    self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_scad)
    print("Writing data to work dir: {}" .format(self.work_scad_path))
    #.coalesce(1) \
    incomingMatchesFullDF.write \
        .partitionBy("match_updated_date") \
        .mode("overwrite") \
        .parquet(self.work_scad_path + self.work_dir_partitioned_table_scad)
    print("Data copied to work dir")
    print("Reading data from work dir in a temporary dataframe")
    incomingMatchesFullDF_copy = self.spark.read.parquet(self.work_scad_path + "scm_case_alert_data_work.parquet/")
    if self.update_mode == 'overwrite':
        print("Datamart update mode is overwrite. New data will replace existing data.")
        print("Publishing to Redshift")
        self.publishToRedshift(con,cursor)
        print("Publishing to Redshift complete")
    elif self.update_mode == 'upsert':
        print("Datamart update mode is upsert. New data will be loaded and existing data will be updated.")
        print("Checking for cases updated between {} and {}" .format(matchtime_lower_threshold,matchtime_upper_threshold))
        updatedCasesDF = self.getUpdatedCases(matchtime_lower_threshold,matchtime_upper_threshold)
        updatedCasesDF.createOrReplaceTempView("updated_cases")
        print("Getting updated case attributes")
        updatedCaseAttributesDF = self.getUpdatedCaseAttributes()
        print("Moving updated case data to temp work directory: {}".format(self.work_updated_cases_path))
        print("Clearing work directory")
        self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_updated_cases)
        try:
            print("Writing data to work dir: {}" .format(self.work_updated_cases_path))
            updatedCaseAttributesDF.coalesce(1) \
                .write \
                .mode("overwrite") \
                .parquet(self.work_updated_cases_path + "updated_cases.parquet")
        except Exception as e:
            e = sys.exc_info()[0]
            print("No data to write to work dir")
        print("Starting the process to publish data to Redshift")
        self.publishToRedshift(con,cursor)
        print("Publishing to Redshift complete")
    print('Updating metadata table')
    matchtime_lower_threshold_new = incomingMatchesFullDF_copy.agg({'match_updated_time': 'max'}).collect()[0][0]
    if matchtime_lower_threshold_new is not None:
        matchtime_lower_threshold_new_formatted = matchtime_lower_threshold_new.strftime("%Y-%m-%d %H:%M:%S")
        print("Latest match time lower threshold with new load: {}" .format(matchtime_lower_threshold_new_formatted))
        self.updatePipelineMetadata('scm_case_alert_data','max_data_update_time',matchtime_lower_threshold_new_formatted)
    else:
        print("No new matches, leaving max_data_update_time for match as it is")
    print("Metadata table up to date")
    print("Committing the updates to Redshift and closing the connection")
    con.commit() #Committing after the metadata table is updated to ensure the datamart data and threshold are aligned
    cursor.close()
    con.close()
Spark History Server Screenshot:
As you correctly suspected, you have a data skew problem. This is really apparent from your last screenshot: have a look at the shuffle read/write sizes! What you have to find out is which shuffle operation (it looks like a join) is causing it.
Salting only the large dataframes, without knowing where your skew is, won't solve the issue.
So, my proposed plan of action:
1. Your picture shows that stage 112 is the problematic stage. Figure out which join operation it belongs to. In the SQL tab of the web UI you can find stage 112 and hover over it. That should give you enough info to figure out which shuffle/join key is skewed.
2. Once you know which key is skewed, look at the distribution of its values using spark-shell or something similar, and figure out which value is overly common. This will help in making future decisions. A simple df.groupBy("problematicKey").count() will already be really interesting.
3. Once you know that, you can go ahead and salt that specific key.
But you're absolutely on the right track! Keeping an eye on that Tasks page and the time each task takes is a great approach!
Hope this helps :)
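To make the last two steps concrete, here is a minimal plain-Python sketch (no Spark involved) of what salting one skewed join key means: the salt becomes part of the join key on the skewed side, and the other side is replicated once per salt value so every row still finds its match. The data, the key names and SALT_BUCKETS are made up for illustration.

```python
from collections import Counter

SALT_BUCKETS = 4  # assumed number of salt values; tune to the observed skew

# Toy stand-ins for the two join sides: (join_key, row_id) rows and key -> value.
big_side = [("hot", i) for i in range(1000)] + [(f"k{i}", i) for i in range(100)]
small_side = {"hot": "H", **{f"k{i}": f"v{i}" for i in range(100)}}

# Step 1: same idea as df.groupBy("problematicKey").count(): find the heavy key.
key_counts = Counter(key for key, _ in big_side)
top_key, top_count = key_counts.most_common(1)[0]

# Step 2: salt the skewed side so that the join key becomes (key, salt).
salted_big = [((key, row_id % SALT_BUCKETS), row_id) for key, row_id in big_side]

# Step 3: replicate the other side once per salt value so every salted key matches.
salted_small = {
    (key, salt): value
    for key, value in small_side.items()
    for salt in range(SALT_BUCKETS)
}

# The join now runs on (key, salt); the largest group one task must handle shrinks.
joined = [(key, row_id, salted_small[(key, salt)]) for (key, salt), row_id in salted_big]
largest_group = max(Counter(k for k, _ in salted_big).values())

print(top_key, top_count, largest_group, len(joined))
```

Adding a random column and repartitioning on it spreads rows evenly, but it does not change the join key, so the join still shuffles every row of the hot key to the same task.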

An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily

I am on Glue 3.0, which supports Spark 3.1 and Python 3. I am trying to run a MERGE INTO target USING source statement in Spark SQL to upsert into a table, but I get the following error:
An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily.
I am not using a Delta table. The target is read directly from PostgreSQL (Aurora) with the Spark DataFrame reader, and the source is another DataFrame read from a Parquet file with the Spark DataFrame reader.
I have tried changing the Glue version, but it did not help. When I search for answers online, I only find links about Iceberg and Delta tables. Is my approach to the problem correct? Please share your input.
The code is provided as below:
def changeDataCapture(inputDf, currDf, spark):
    inputDf.createOrReplaceTempView('inputDf')
    currDf.createOrReplaceTempView('currDf')
    currDf = spark.sql("""
        MERGE INTO currDf USING inputDf
        ON currDf.REG_NB = inputDf.registerNumber
        AND currDf.ANN_RTN_DT = inputDf.annual_return_date
        WHEN MATCHED
        THEN UPDATE SET
            currDf.LAST_SEEN_DT = inputDf.LAST_SEEN_DT,
            currDf.TO_DB_DT = inputDf.TO_DB_DT,
            currDf.TO_DB_TM = inputDf.TO_DB_TM,
            currDf.BATCH_ID = inputDf.BATCH_ID,
            currDf.DATA_PROC_ID = inputDf.DATA_PROC_ID,
            currDf.FIRST_SEEN_DT = CASE
                WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
                AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
                AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
                AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
                AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
                AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
                THEN currDf.FIRST_SEEN_DT
                ELSE inputDf.FIRST_SEEN_DT
            END,
            currDf.SUPERSEDED_DT = CASE
                WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
                AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
                AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
                AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
                AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
                AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
                THEN currDf.SUPERSEDED_DT
                ELSE inputDf.SUPERSEDED_DT
            END
        WHEN NOT MATCHED
        THEN INSERT
            (REG_NB, ANN_RTN_DT, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT, BATCH_ID,
            DATA_PROC_ID, CO_REG_DEBT, HLDR_LIST_CD, HLDR_LIST_DT, HLDR_LEGAL_STAT,
            HLDR_REFRESH_CD, HLDR_SUPRESS_IN, BULK_LIST_ID, DOC_TYPE_CD)
        VALUES
            (registerNumber, annual_return_date, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT,
            BATCH_ID, DATA_PROC_ID, registered_indebtedness, holder_list_indicator,
            holder_list_date, holder_legal_status, holder_refresh_flag, HLDR_SUPRESS_IN,
            Bulk_List_In, DOC_TYPE_CD)
    """)
    return currDf
Thanks

AWS Glue performance when write

After performing joins and aggregations, I want the output written as one file per partition, partitioned on some column.
With repartition(1) the job takes 1 hour; if I remove repartition(1), each partition contains multiple files and the job takes 30 minutes (see the example below).
Is there a way to write the data into one file?
...
...
df = df.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")
Is there any other way to increase the write performance? Does changing the format help? And how can I keep parallelism while producing one output file?
S3 storage example
**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001
**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032
Instead of df.repartition(1), use df.repartition("choice"):
df = df.repartition("choice")
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")
If the goal is a single file, use coalesce instead of repartition; it avoids a full data shuffle.
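To see why repartitioning by the partition column gives one file per choice= directory while still writing in parallel: a partitioned write produces, roughly, one file for every (in-memory partition, partition-key value) pair that holds data. The plain-Python sketch below only counts those pairs; the row counts are made up, and key % n merely stands in for Spark's hash partitioner.

```python
from collections import defaultdict

def files_per_directory(rows, assign_partition):
    """Count output files per choice= directory.

    rows             -- iterable of (row_id, choice) tuples
    assign_partition -- function mapping a row to an in-memory partition id
    """
    writers = defaultdict(set)  # choice value -> partitions holding rows for it
    for row in rows:
        writers[row[1]].add(assign_partition(row))
    return {choice: len(parts) for choice, parts in writers.items()}

NUM_PARTITIONS = 32  # illustrative
rows = [(i, i % 5) for i in range(10_000)]  # 5 distinct choice values

# Default layout: the rows of each choice are scattered over many partitions.
scattered = files_per_directory(rows, lambda r: r[0] % NUM_PARTITIONS)
# repartition("choice"): all rows with the same choice land in one partition.
by_choice = files_per_directory(rows, lambda r: hash(r[1]) % NUM_PARTITIONS)
# repartition(1): a single partition (and a single task) writes everything.
single = files_per_directory(rows, lambda r: 0)

print(scattered, by_choice, single)
```

Both by_choice and single end up with one file per directory, but in the by_choice layout different directories can be written by different tasks at the same time.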

How to Include the Value of Partitioned Column in a Spark data frame or Spark SQL Temp Table in AWS Glue?

I am using python 3, Glue 1.0 for this code.
I have partitioned data in S3. The data is partitioned by the year, month, day and extra_field_name columns.
When I load the data into a data frame, I get all the columns in its schema except the partition columns.
Here is the code and output:
glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True, 'groupFiles': 'inPartition'}, format = "parquet").toDF().registerTempTable(final_arguement_list["read_table_" + str(i+1)])
The path_list variable contains a list of path strings that need to be loaded into a data frame.
I am printing schema using the below command
glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": path_list, "recurse" : True}, format = "parquet").toDF().printSchema()
The schema that I get in the CloudWatch logs does not contain any of the partition columns.
Please note that I have already tried loading the data with paths truncated at the year, month, day and extra_field_name levels separately, but I still get only the columns that are present in the Parquet files themselves.
I was able to do this with an additional step of having a crawler crawl the directory on S3, and then use the table from Glue Catalog as the source for Glue ETL.
Once you have a crawler over the location s3://path/to/source/data/, automatically year,month and day will be treated as partition columns. And then you could try the following in your Glue ETL script.
data_dyf = glueContext.create_dynamic_frame.from_catalog(
database = db_name,
table_name = tbl_name,
push_down_predicate="(year=='2018' and month=='05')"
)
You can find more details here
As a workaround, I created duplicate columns in the data frame itself, named year_2, month_2, day_2 and extra_field_name_2, as copies of year, month, day and extra_field_name.
During the data ingestion phase, I partition the data frame on year, month, day and extra_field_name and store it in S3, which retains the values of year_2, month_2, day_2 and extra_field_name_2 in the Parquet files themselves.
While performing data manipulation, I am loading the data in a dynamic frame by providing the list of paths in the following manner:
['s3://path/to/source/data/year=2018/month=1/day=4/', 's3://path/to/source/data/year=2018/month=1/day=5/', 's3://path/to/source/data/year=2018/month=1/day=6/']
This gives me year_2, month_2, day_2 and extra_field_name_2 in the dynamic frame that I can further use for data manipulation.
Try passing the basePath to the connection_options argument:
glueContext.create_dynamic_frame_from_options(
connection_type = "s3",
connection_options = {
"paths": path_list,
"recurse" : True,
"basePath": "s3://path/to/source/data/"
},
format = "parquet").toDF().printSchema()
This way, partition discovery will discover the partitions that are above your paths. According to the documentation, these options will be passed to the Spark SQL DataSource.
Edit: given that your experiment shows it doesn't work, have you considered passing the top-level directory and filtering from there for the dates of interest? The reader will only read the relevant Hive partitions, as the filter gets "pushed down" to the file system.
from pyspark.sql.functions import col

(glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://path/to/source/data/"],
        "recurse" : True,
    },
    format = "parquet")
 .toDF()
 .filter(
    (col("year") == 2018)
    & (col("month") == 1)
    & (col("day").between(4, 6))
 ).printSchema())
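For context, Hive-style partition values live in the object path (year=2018/month=1/...) rather than inside the Parquet files, which is why they are missing unless the reader performs partition discovery. As a last-resort sketch, the values can be recovered from each path in plain Python. The helper name partition_values is hypothetical, and attaching the values to the rows (for example by loading each path separately and adding literal columns) is left to the job.

```python
from urllib.parse import urlparse

def partition_values(path):
    """Extract Hive-style key=value segments from an S3 path as a dict of strings."""
    segments = urlparse(path).path.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)

paths = [
    "s3://path/to/source/data/year=2018/month=1/day=4/",
    "s3://path/to/source/data/year=2018/month=1/day=5/",
]
for p in paths:
    print(p, partition_values(p))
```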

Get BigQuery table schema using google.cloud

I can for example get BigQuery data into local python with:
import os
from google.cloud import bigquery
project_id = "example-project"
dataset_id = "example_dataset"
table_id = "table_id"
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
bq = bigquery.Client()
query = "SELECT * FROM {}.{} LIMIT 5".format(dataset_id, table_id)
resp = bq.run_sync_query(query)
resp.run()
data_list = resp.rows
The result:
print(data_list)
>>> [('BEDD', '1',), ('A75', '1',), ('CE3F', '1',), ('0D8C', '1',), ('3E9C', '1',)]
How do I then go and get the schema for this table? Such that, for example
headings = ('heading1', 'heading2')
# or
schema_dict = {'fields': [{'name': 'heading1', 'type': 'STRING'}, {'name': 'heading2', 'type': 'STRING'}]}
You can use the schema attribute of your resp variable.
After running the query you can retrieve it:
schema = resp.schema
schema will be a list containing the definition of each column in your query.
As an example, let's say this is your query:
query = "select '1' as fv, STRUCT<i INT64, j INT64> (1, 2) t from `dataset.table` limit 1"
The schema will be a list containing 2 entries:
[<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6e50>,
<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6b10>]
Each object in schema has the attributes field_type, fields, mode and name, so if you run:
schema[0].field_type, schema[0].mode, schema[0].name
The result is "STRING", "NULLABLE", "fv".
As the second column is a record, then if you run:
schema[1].field_type, schema[1].mode, schema[1].name, schema[1].fields
The result is:
"RECORD", "NULLABLE", "t", [google schema 1, google schema 2]
Where google schema 1 contains the definition for the inner fields within the record.
As far as I know, there's no way of getting a dictionary as you showed in your question, which means you'll have to loop over the entries in schema and build it yourself. It should be simple though. Not sure if this is working as I haven't fully tested it but it might give you an idea on how to do it:
def extract_schema(schema_resp):
    l = []
    for schema_obj in schema_resp:
        r = {}
        r['name'] = schema_obj.name
        r['type'] = schema_obj.field_type
        r['mode'] = schema_obj.mode
        if schema_obj.fields:
            r['fields'] = extract_schema(schema_obj.fields)
        l.append(r)
    return l
So you'd just have to run schema = extract_schema(resp.schema) and (hopefully) you'll be good to go.
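For the flat case in the question, the two shapes asked for (the headings tuple and the {'fields': [...]} dict) can be built directly from the name and field_type attributes. A minimal sketch follows; the function name is made up, and the SimpleNamespace objects are stand-ins for the entries of resp.schema so that it runs without BigQuery access.

```python
from types import SimpleNamespace

def headings_and_schema_dict(schema):
    """Build the tuple of column names and the {'fields': [...]} dict from the question.

    schema -- list of objects exposing .name and .field_type,
              such as the SchemaField entries in resp.schema
    """
    headings = tuple(field.name for field in schema)
    schema_dict = {"fields": [{"name": f.name, "type": f.field_type} for f in schema]}
    return headings, schema_dict

# Stand-in objects so the sketch runs without BigQuery access.
fake_schema = [
    SimpleNamespace(name="heading1", field_type="STRING"),
    SimpleNamespace(name="heading2", field_type="STRING"),
]
headings, schema_dict = headings_and_schema_dict(fake_schema)
print(headings)
print(schema_dict)
```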
