Spark doesn't read the file properly

Spark doesn't read the file properly - apache-spark

I run Flume to ingest Twitter data into HDFS (in JSON format) and run Spark to read that file.
But somehow, it doesn't return the correct result: it seems the content of the file is not updated.
Here's my Flume configuration:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords
TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
After that I check the output with hdfs dfs -cat and it returns more than 1000 rows, meaning that the data was successfully inserted.
But in Spark that's not the case
spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()
only has 6 rows.
Did I miss something here?

I'm not entirely sure of why you specified the latter part of the path as the condition expression of filter.
I believe to correctly read your file you can just write:
spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()

Related

Single Task Taking Long Time in PySpark

I am running a PySpark application where I am reading several Parquet files into Spark dataframes and created temporary views on them to use in my SQL query. So I have like 18 views where some are ~ 1TB, few in several GBs and some other smaller views. I am joining all of these and running my business logic to get the desired outcome. My code takes extremely long time to run (>3 hours) for this data. Looking at the Spark History Server, I can see there's one task that seems the culprit as the time taken, data spilled to memory and disk, shuffle read/write everything is way higher than the median. This indicates a data skew. So I even used salting on my large dataframes before creating the temp views. However there's still no difference in the execution time. I checked the number of partitions and it's already 792 (maximum I can have my current Glue config). I have also enabled adaptive query execution and adaptive skewJoin handling.
My original dataset was extremely huge largest table being ~40TB and has 2.5 years of data. I am trying to do a one time historical load and was unsuccessful on running over the entire data. With trial and error, I had to reduce this to processing 1TB of data at a time (for the largest table) which is still taking 3+ hours. This is not a scalable approach and hence I am looking for some inputs to optimize this.
Below are my app details:
Number of workers = 792
Spark config:
spark= (SparkSession
.builder
.appName("scmCaseAlertDatamartFullLoad")
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
.config("spark.sql.adaptive.enabled","true")
.config("spark.sql.broadcastTimeout","900")
.config("spark.sql.adaptive.skewJoin.enabled","true")
.getOrCreate()
)
Code (just included key relevant methods, starting point is loadSCMCseAlertData()):
def getIncomingMatchesFullData(self):
select_query_incoming_matches_full_data = """
SELECT DISTINCT alrt.caseid AS case_id,
alrt.alertid AS alert_id,
alrt.accountid AS account_id,
sc.created_time AS case_created_time,
sc.last_updated_time AS case_last_updated_time,
alrt.srccreatedtime AS alert_created_time,
aud.last_updated_by AS case_last_updated_by,
sc.closed_time AS case_last_close_time,
lcs.status AS case_status,
lcst.state AS case_state,
lcra.responsive_action,
sc.assigned_to AS case_assigned_to,
cr1.team_name AS case_assigned_to_team,
sc.resolved_by AS case_resolved_by,
cr2.team_name AS case_resolved_by_team,
aud.last_annotation AS case_last_annotation,
ca.name AS case_approver,
alrt.screeningdecision AS screening_decision,
ap.accountpool AS division,
lcd.decision AS case_current_decision,
CASE
WHEN sm.grylaclientid LIKE '%AddressService%' THEN 'Address Service'
WHEN sm.grylaclientid LIKE '%GrylaOrderProcessingService%' THEN 'Retail Checkout Service'
WHEN sm.grylaclientid = 'urn:cdo:GrylaBatchScreeningAAA:AWS:Default' THEN 'Batch Screening'
WHEN sm.grylaclientid = 'urn:cdo:OfficerJennyBindle:AWS:Default' THEN 'API'
ELSE 'Other'
END AS channel,
ap.businesstype AS business_type,
ap.businessname AS business_name,
ap.marketplaceid AS ap_marketplace_id,
ap.region AS ap_region,
ap.memberid AS ap_member_id,
ap.secondaryaccountpool AS secondary_account_pool,
sm.action AS client_action,
acl.added_by,
acl.lnb_id AS accept_list_lnb_id,
acl.created_time AS accept_list_created_time,
acl.source_case_id AS accept_list_source_case_id,
acs.status AS accept_list_status,
ap.street1 AS ap_line_1,
ap.street2 AS ap_line_2,
ap.street3 AS ap_line_3,
ap.city AS ap_city,
ap.state AS ap_state,
ap.postalcode AS ap_postal_code,
ap.country AS ap_country,
ap.fullname AS ap_full_name,
ap.email AS ap_email,
sm.screening_match_id AS dp_screening_match_id,
CASE
WHEN sm.matchtype = 'name_only_matching_details' THEN 'Name Only'
WHEN sm.matchtype = 'address_only_matching_details' THEN 'Address Only'
WHEN sm.matchtype = 'address_matching_details' THEN 'Address'
WHEN sm.matchtype = 'scr_matching_details' THEN 'SCR'
WHEN sm.matchtype = 'hotkey_matching_details' THEN 'HotKey'
END AS match_type,
sm.matchaction AS match_action,
alrt.batchfilename AS batch_file_id,
REGEXP_REPLACE(dp.name, '\\n|\\r|\\t', ' ') AS dp_matched_add_full_name,
dp.street AS dp_line1,
'' AS dp_line2,
dp.city AS dp_city,
dp.state AS dp_state,
dp.postalcode AS dp_postal_code,
dp.country AS dp_country,
dp.matchedplaces AS scr_value,
dp.hotkeyvalues AS hotkey_value,
sm.acceptlistid AS suppressed_by_accept_list_id,
sm.suppresseddedupe AS is_deduped,
sm.matchhash AS hash,
sm.matchdecision AS match_decision,
ap.addressid AS amazon_address_id,
ap.dateofbirth AS date_of_birth,
sm.grylaclientid AS gryla_client_id,
cr1.name AS case_assigned_to_role,
cr2.name AS case_resolved_by_role,
alrt.screeningengine AS screening_engine,
sm.srccreatedtime AS match_created_time,
sm.srclastupdatedtime AS match_updated_time,
to_date(sm.srclastupdatedtime,"yyyy-MM-dd") AS match_updated_date,
sm.match_updated_time_msec,
sm.suppressedby AS match_suppressed_by
FROM
cm_screening_match sm
JOIN
cm_screening_match_redshift smr ON sm.screening_match_id = smr.screening_match_id
LEFT JOIN
cm_case_alert alrt ON sm.screening_match_id = alrt.screening_match_id
LEFT JOIN
cm_amazon_party ap ON sm.screening_match_id = ap.screening_match_id
LEFT JOIN
cm_denied_party dp ON sm.screening_match_id = dp.screening_match_id
LEFT JOIN
cm_spectre_case sc ON alrt.caseid = sc.case_id
LEFT JOIN
cm_lookup_case_status lcs ON sc.status_id = lcs.status_id
LEFT JOIN
cm_lookup_case_state lcst ON sc.state_id = lcst.state_id
LEFT JOIN
cm_lookup_case_decision lcd ON sc.decision_id = lcd.decision_id
LEFT JOIN
cm_lookup_case_responsive_action lcra ON sc.responsive_action_id = lcra.responsive_action_id
LEFT JOIN
cm_user cu1 ON sc.assigned_to = cu1.alias
LEFT JOIN
cm_role cr1 ON cu1.current_role_id = cr1.role_id
LEFT JOIN
cm_user cu2 ON sc.resolved_by = cu2.alias
LEFT JOIN
cm_role cr2 ON cu2.current_role_id = cr2.role_id
LEFT JOIN
cm_accept_list acl ON acl.screening_match_id = sm.screening_match_id
LEFT JOIN
cm_lookup_accept_list_status acs ON acs.status_id = acl.status_id
LEFT JOIN
(
SELECT case_id,
last_value(username) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_updated_by,
last_value(description) OVER (PARTITION BY case_id ORDER BY created_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_annotation
FROM cm_spectre_case_audit
) aud ON sc.case_id = aud.case_id
LEFT JOIN
cm_approver ca ON sc.approver_id = ca.approver_id
"""
print(select_query_incoming_matches_full_data)
incomingMatchesFullDF = self.spark.sql(select_query_incoming_matches_full_data)
return incomingMatchesFullDF
def getBaseTables(self,matchtime_lower_threshold,matchtime_upper_threshold,cursor):
print('Fetching datalake data for matches created after: {}' .format(matchtime_lower_threshold))
matchDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_match)
matchDF = matchDF.select("screening_match_id","grylaclientid","action","matchtype","matchaction","acceptlistid","suppresseddedupe","matchhash","matchdecision","srccreatedtime","srclastupdatedtime","suppressedby","lastupdatedtime")
#.withColumn("screentime",to_timestamp("screentime")) \
matchDF = matchDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#matchDF = matchDF.repartition(2400,"screening_match_id")
matchDF = self.getLatestRecord(matchDF)
matchDF = matchDF.withColumn("salt", rand())
matchDF = matchDF.repartition("salt")
matchDF.createOrReplaceTempView("cm_screening_match")
print("Total from matchDF:",matchDF.count())
print("Number of paritions in matchDF: " ,matchDF.rdd.getNumPartitions())
alertDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_alert)
alertDF = alertDF.select("screening_match_id","caseid","alertid","accountid","srccreatedtime","screeningdecision","batchfilename","screeningengine","lastupdatedtime")
alertDF = alertDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#alertDF = alertDF.repartition(2400,"screening_match_id")
alertDF = self.getLatestRecord(alertDF)
alertDF = alertDF.withColumn("salt", rand())
alertDF = alertDF.repartition("salt")
alertDF.createOrReplaceTempView("cm_case_alert")
print("Total from alertDF:",alertDF.count())
print("Number of paritions in alertDF: " ,alertDF.rdd.getNumPartitions())
apDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_ap)
apDF = apDF.select("screening_match_id","accountpool","businesstype","businessname","marketplaceid","region","memberid","secondaryaccountpool","street1","street2","street3","city","state","postalcode","country","fullname","email","addressid","dateofbirth","lastupdatedtime")
apDF = apDF.withColumn("dateofbirth",to_date("dateofbirth","yyyy-MM-dd")) \
.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())) \
.drop("lastupdatedtime")
#apDF = apDF.repartition(2400,"screening_match_id")
apDF = self.getLatestRecord(apDF)
apDF = apDF.withColumn("salt", rand())
apDF = apDF.repartition("salt")
apDF.createOrReplaceTempView("cm_amazon_party")
print("Total from apDF:",apDF.count())
print("Number of paritions in apDF: " ,apDF.rdd.getNumPartitions())
dpDF = self.getDatalakeData(matchtime_lower_threshold,matchtime_upper_threshold,self.data_input_dp)
dpDF = dpDF.select("screening_match_id","name","street","city","state","postalcode","country","matchedplaces","hotkeyvalues","lastupdatedtime")
dpDF = dpDF.withColumn("match_updated_time_msec",col("lastupdatedtime").cast(LongType())).drop("lastupdatedtime")
#dpDF = dpDF.repartition(2400,"screening_match_id")
dpDF = self.getLatestRecord(dpDF)
dpDF = dpDF.withColumn("salt", rand())
dpDF = dpDF.repartition("salt")
dpDF.createOrReplaceTempView("cm_denied_party")
print("Total from dpDF:",dpDF.count())
print("Number of paritions in dpDF: " ,dpDF.rdd.getNumPartitions())
print('Fetching data from Redshift Base tables...')
self.getRedshiftData(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
caseAuditDF = self.spark.read.parquet(self.data_input_case_audit)
caseAuditDF.createOrReplaceTempView("cm_spectre_case_audit")
caseDF = self.spark.read.parquet(self.data_input_case)
caseDF.createOrReplaceTempView("cm_spectre_case")
caseStatusDF = self.spark.read.parquet(self.data_input_case_status)
caseStatusDF.createOrReplaceTempView("cm_lookup_case_status")
caseStateDF = self.spark.read.parquet(self.data_input_case_state)
caseStateDF.createOrReplaceTempView("cm_lookup_case_state")
caseDecisionDF = self.spark.read.parquet(self.data_input_case_decision)
caseDecisionDF.createOrReplaceTempView("cm_lookup_case_decision")
caseRespActDF = self.spark.read.parquet(self.data_input_case_responsive_action)
caseRespActDF.createOrReplaceTempView("cm_lookup_case_responsive_action")
userDF = self.spark.read.parquet(self.data_input_user)
userDF.createOrReplaceTempView("cm_user")
userSnapshotDF = self.spark.read.parquet(self.data_input_user_snapshot)
userSnapshotDF.createOrReplaceTempView("v_cm_user_snapshot")
roleDF = self.spark.read.parquet(self.data_input_role)
roleDF.createOrReplaceTempView("cm_role")
skillDF = self.spark.read.parquet(self.data_input_skill)
skillDF.createOrReplaceTempView("cm_skill")
lookupSkillDF = self.spark.read.parquet(self.data_input_lookup_skills)
lookupSkillDF.createOrReplaceTempView("cm_lookup_skills")
skillTypeDF = self.spark.read.parquet(self.data_input_skill_type)
skillTypeDF.createOrReplaceTempView("cm_skill_type")
acceptListDF = self.spark.read.parquet(self.data_input_accept_list)
acceptListDF.createOrReplaceTempView("cm_accept_list")
lookupAcceptListStatusDF = self.spark.read.parquet(self.data_input_lookup_accept_list_status)
lookupAcceptListStatusDF.createOrReplaceTempView("cm_lookup_accept_list_status")
approverDF = self.spark.read.parquet(self.data_input_approver)
approverDF.createOrReplaceTempView("cm_approver")
screeningMatchDF_temp = self.spark.read.parquet(self.data_input_screening_match_redshift)
screeningMatchLookupDF_temp = self.spark.read.parquet(self.data_input_lookup_screening_match_redshift)
screeningMatchLookupDF_temp_new = screeningMatchLookupDF_temp.withColumnRenamed("screening_match_id","lookupdf_screening_match_id")
"""
The screening_match_id in datalake table is a mix of alphanumeric match IDs (the ones in cm_lookup_screening_match_id in Redshift) and numeric (the ones in cm_screening_match in Redshift). Hence we combine the match IDs from both the Redshift tables. Also, there are matches which were created in the past but updated recently. Since updated date is only present in cm_screening_match and not in cm_lookup_screening_match_id, we will only have the numeric match Ids. When we join this to datalake table, we won't be able to find these matches as they are present in the alphanumeric form in datalake. Hence what we do is read the entire table of cm_lookup_screening_match_id and join it with cm_screening_match to enrich cm_screening_match with the alphanumeric match Id. Finally we filter cm_lookup_screening_match_id only for newly created matches and combine with the matches from enriched version of cm_screening_match.
"""
screeningMatchDF_enriched = screeningMatchDF_temp.join(screeningMatchLookupDF_temp_new,screeningMatchDF_temp.screening_match_id == screeningMatchLookupDF_temp_new.lookupdf_screening_match_id,"left")
screeningMatchDF_enriched = screeningMatchDF_enriched.withColumn("screening_match_id",col("screening_match_id").cast(StringType()))
screeningMatchDF = screeningMatchDF_enriched.select(col("screening_match_id")).union(screeningMatchDF_enriched.select(col("match_event_id")))
screeningMatchLookupDF = screeningMatchLookupDF_temp_new.filter("created_time > '{}'" .format(matchtime_lower_threshold)).select(col("match_event_id"))
screeningMatchRedshiftDF = screeningMatchDF.union(screeningMatchLookupDF)
#screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition(792,"screening_match_id")
screeningMatchRedshiftDF = screeningMatchRedshiftDF.withColumn("salt", rand())
screeningMatchRedshiftDF = screeningMatchRedshiftDF.repartition("salt")
screeningMatchRedshiftDF.createOrReplaceTempView("cm_screening_match_redshift")
print("Total from screeningMatchRedshiftDF:",screeningMatchRedshiftDF.count())
def loadSCMCaseAlertTable(self):
print('Getting the thresholds for data to be loaded')
matchtime_lower_threshold = self.getLowerThreshold('scm_case_alert_data')
print('Match time lower threshold is: {}' .format(matchtime_lower_threshold))
matchtime_upper_threshold = self.default_upper_threshold
print('Match time upper threshold is: {}' .format(matchtime_upper_threshold))
print("Getting the required base tables")
con = self.get_redshift_connection()
cursor = con.cursor()
self.getBaseTables(matchtime_lower_threshold,matchtime_upper_threshold,cursor)
print("Getting the enriched dataset for incoming matches (the ones to be inserted or updated)")
incomingMatchesFullDF = self.getIncomingMatchesFullData()
print("Total records in incomingMatchesFullDF: ", incomingMatchesFullDF.count())
print("Copying the incoming data to temp work dir")
print("Clearing work directory: {}" .format(self.work_scad_path))
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_scad)
print("Writing data to work dir: {}" .format(self.work_scad_path))
#.coalesce(1) \
incomingMatchesFullDF.write \
.partitionBy("match_updated_date") \
.mode("overwrite") \
.parquet(self.work_scad_path + self.work_dir_partitioned_table_scad)
print("Data copied to work dir")
print("Reading data from work dir in a temporary dataframe")
incomingMatchesFullDF_copy = self.spark.read.parquet(self.work_scad_path + "scm_case_alert_data_work.parquet/")
if self.update_mode == 'overwrite':
print("Datamart update mode is overwrite. New data will replace existing data.")
print("Publishing to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
elif self.update_mode == 'upsert':
print("Datamart update mode is upsert. New data will be loaded and existing data will be updated.")
print("Checking for cases updated between {} and {}" .format(matchtime_lower_threshold,matchtime_upper_threshold))
updatedCasesDF = self.getUpdatedCases(matchtime_lower_threshold,matchtime_upper_threshold)
updatedCasesDF.createOrReplaceTempView("updated_cases")
print("Getting updated case attributes")
updatedCaseAttributesDF = self.getUpdatedCaseAttributes()
print("Moving updated case data to temp work directory: {}".format(self.work_updated_cases_path))
print("Clearing work directory")
self.deleteAllObjectsFromS3Prefix(self.dest_bucket,self.dest_work_prefix_updated_cases)
try:
print("Writing data to work dir: {}" .format(self.work_updated_cases_path))
updatedCaseAttributesDF.coalesce(1) \
.write \
.mode("overwrite") \
.parquet(self.work_updated_cases_path + "updated_cases.parquet")
except Exception as e:
e = sys.exc_info()[0]
print("No data to write to work dir")
print("Starting the process to publish data to Redshift")
self.publishToRedshift(con,cursor)
print("Publishing to Redshift complete")
print('Updating metadata table')
matchtime_lower_threshold_new = incomingMatchesFullDF_copy.agg({'match_updated_time': 'max'}).collect()[0][0]
if matchtime_lower_threshold_new is not None:
matchtime_lower_threshold_new_formatted = matchtime_lower_threshold_new.strftime("%Y-%m-%d %H:%M:%S")
print("Latest match time lower threshold with new load: {}" .format(matchtime_lower_threshold_new_formatted))
self.updatePipelineMetadata('scm_case_alert_data','max_data_update_time',matchtime_lower_threshold_new_formatted)
else:
print("No new matches, leaving max_data_update_time for match as it is")
print("Metadata table up to date")
print("Committing the updates to Redshift and closing the connection")
con.commit() #Committing after the metadata table is updated to ensure the datamart data and threshold are aligned
cursor.close()
con.close()
Spark History Server Screenshot:

As you have correctly felt, you're having data skew issues. This is really apparent from your last screenshot. Have a look at the shuffle read/write sizes! The thing that you have to find out is: for which shuffle operation (looks like a join) are you having this issue?
Only salting the large dataframes without knowing where your skew is wont solve the issue.
So, my proposed plan of action:
You see that stage 112 from your picture is the problematic stage. Figure out which join operation this is about. In the SQL tab of the web-ui you can find that stage 112 and hover over it. That should give you enough info to figure out which shuffle/join key is skewed.
Once you know which key is skewed, understand the statistical contents of your key using spark-shell or something like that. Figure out which value is overly common. This will help in making future decisions. A simple df.groupBy("problematicKey").count will already be really interesting.
Once you know that, you can go ahead and salt that specific key.
But you're absolutely on the right track! Keeping an eye on that Tasks page and the time it takes for each task is a great approach!
Hope this helps :)

An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily

I use Glue 3.0 - Supports Spark 3.1 and Python 3 from an infrastructure perspective. I am trying to do MERGE INTO target USING source operation in spark sql for a table UPSERT operation. However, I am getting the below error for the same:
An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily.
I am not using any Delta Table, I read directly from a postgreSQL - AuroraDb using spark dataframe reader which is my target. The source here is another dataframe read from parquet file using spark dataframe reader.
I have tried changing the Glue Version but it did not help. When I looked for answers in internet I get links to Iceberg and DeltaTable. Is my approach to the problem is correct. Please share you inputs.
The code is provided as below:
def changeDataCapture(inputDf, currDf, spark):
inputDf.createOrReplaceTempView('inputDf')
currDf.createOrReplaceTempView('currDf')
currDf = spark.sql("""
MERGE INTO currDf USING inputDf
ON currDf.REG_NB = inputDf.registerNumber
AND currDf.ANN_RTN_DT = inputDf.annual_return_date
WHEN MATCHED
THEN UPDATE SET
currDf.LAST_SEEN_DT = inputDf.LAST_SEEN_DT,
currDf.TO_DB_DT = inputDf.TO_DB_DT,
currDf.TO_DB_TM = inputDf.TO_DB_TM,
currDf.BATCH_ID = inputDf.BATCH_ID,
currDf.DATA_PROC_ID = inputDf.DATA_PROC_ID,
currDf.FIRST_SEEN_DT = CASE
WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
THEN currDf.FIRST_SEEN_DT
ELSE inputDf.FIRST_SEEN_DT
END,
currDf.SUPERSEDED_DT = CASE
WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
THEN currDf.SUPERSEDED_DT
ELSE inputDf.SUPERSEDED_DT
END
WHEN NOT MATCHED
THEN INSERT
(REG_NB, ANN_RTN_DT, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT, BATCH_ID,
DATA_PROC_ID, CO_REG_DEBT, HLDR_LIST_CD, HLDR_LIST_DT, HLDR_LEGAL_STAT,
HLDR_REFRESH_CD, HLDR_SUPRESS_IN, BULK_LIST_ID, DOC_TYPE_CD)
VALUES
(registerNumber, annual_return_date, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT,
BATCH_ID, DATA_PROC_ID, registered_indebtedness, holder_list_indicator,
holder_list_date, holder_legal_status, holder_refresh_flag, HLDR_SUPRESS_IN,
Bulk_List_In, DOC_TYPE_CD)
""")
return currDf
Thanks

Saving data from SDSS query in Python

I got the data from SDSS by the following query
query = """SELECT TOP 20000
p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z, s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1
ORDER BY s.specobjid ASC
"""
data = SDSS.query_sql(query).to_pandas()
psfMag_g psfMag_u psfMag_r psfMag_i psfMag_z class
19.08541 17.41787 16.59118 16.03507 15.68560 GALAXY
There are 20000 data.
I want to save in csv format. How to save the data?

LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems

I'm developing code in SQL spark reading tables in Hive ( HDFS ) .
The problem is that when I load my code in the shell of spark, recursively me the following message :
"WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems."
I run the code that is:
val query_fare_details = sql("""
SELECT *
FROM fare_details
WHERE fardet_cd_carrier = 'LA'
AND fardet_cd_origin_city = 'SCL'
AND fardet_cd_dest_city = 'MIA'
AND fardet_cd_fare_basis = 'NNE0F0O1'
""")
query_fare_details.registerTempTable("query_fare_details")
val matchFAR1 = sql("""
SELECT *
FROM query_fare_details f
JOIN fare_rules r ON f.fardet_cd_carrier = r.farrul_cd_carrier
AND f.fardet_num_rule_tariff = r.farrul_num_rule_tariff
AND f.fardet_cd_fare_rule_bigint = r.farrul_cd_fare_rule_bigint
AND f.fardet_cd_fare_basis = r.farrul_cd_fare_basis
LIMIT 10""")
matchFAR1.show(5)
Any idea what goes wrong?

you can safely ignore this warning. This is not an error
Refer [https://issues.apache.org/jira/browse/SPARK-3057][1]

Percona Xtradb cluster crashing

We have a Percona Xtradb cluster with 5 nodes and an arbitrator. One of our Php developers ran a bad query on the cluster, crashing all the nodes. After the crash, we could not collect any error log to tell us what really went wrong as the entire cluster crashed without performing any logging.
I have always thought that when a single query is executed on the cluster, it is processed by only one of the nodes in the cluster. So if the query is bad (to the point of killing a db server), it should only crash the one node thats processing it, leaving the cluster running with the remaining 4 nodes.
This behavior has puzzled us and we would like to understand what is really going on especially that this is the second time this is happening. Why would a query running on the cluster while processed by one of the nodes would cause other nodes in the cluster to crash in case of some issue while being processed?
Below is our my.cnf config:
#
# Default values.
[mysqld_safe]
flush_caches
numa_interleave
#
#
[mysqld]
back_log = 65535
binlog_format = ROW
character_set_server = utf8
collation_server = utf8_general_ci
datadir = /var/lib/mysql
default_storage_engine = InnoDB
expand_fast_index_creation = 1
expire_logs_days = 7
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 16
innodb_buffer_pool_populate = 1
innodb_buffer_pool_size = 32G # XXX 64GB RAM, 80%
innodb_data_file_path = ibdata1:64M;ibdata2:64M:autoextend
innodb_file_format = Barracuda
innodb_file_per_table
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 1600
innodb_large_prefix
innodb_locks_unsafe_for_binlog = 1
innodb_log_file_size = 64M
innodb_print_all_deadlocks = 1
innodb_read_io_threads = 64
innodb_stats_on_metadata = FALSE
innodb_support_xa = FALSE
innodb_write_io_threads = 64
log-bin = mysqld-bin
log-queries-not-using-indexes
log-slave-updates
long_query_time = 1
max_allowed_packet = 64M
max_connect_errors = 4294967295
max_connections = 4096
min_examined_row_limit = 1000
port = 3306
relay-log-recovery = TRUE
skip-name-resolve
slow_query_log = 1
slow_query_log_timestamp_always = 1
table_open_cache = 4096
thread_cache = 1024
tmpdir = /db/tmp
transaction_isolation = REPEATABLE-READ
updatable_views_with_limit = 0
user = mysql
wait_timeout = 60
#
# Galera Variable config
wsrep_cluster_address = gcomm://ip_1, ip_2, ip_3,ip_4,ip_4,ip_5
wsrep_cluster_name = cluster_db
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=4G"
wsrep_slave_threads = 32
wsrep_sst_auth = "user:password"
wsrep_sst_donor = "db1"
#wsrep_sst_method = xtrabackup_throttle
wsrep_sst_method = xtrabackup-v2
#
# XXX You *MUST* change!
server-id = 1

Can you post the query? SELECT queries only execute on a single node but all write queries will execute everywhere. What's in your error log?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Spark doesn't read the file properly - apache-spark

I'm not entirely sure of why you specified the latter part of the path as the condition expression of filter. I believe to correctly read your file you can just write: spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()

Related

Single Task Taking Long Time in PySpark

An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily

Saving data from SDSS query in Python

LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems

Percona Xtradb cluster crashing

Categories

Resources