How to include partitioned column in pyspark dataframe read method - apache-spark

I am writing Avro file-based from a parquet file. I have read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST")
.load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads data successfully, but as expected OP_CARRIER column is not available in dfAvro dataframe as it is a partition column of the first job. Now my requirement is to include OP_CARRIER field also in 2nd dataframe i.e. in dfAvro. Could somebody help me with this?
I am referring documentation from the spark document, but I am not able to locate the relevant information. Any pointer will be very helpful.

You replicate the same column value with a different alias.
dfParquetRePartitioned.withColumn("OP_CARRIER_1", lit(df.OP_CARRIER)) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This would give you what you wanted. But with a different alias.
Or you can also do it during reading. If location is dynamic then you can easily append the column.
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static its way more easy to add it during the data read.

Related

Can i exclude the column used for partitioning when writing to parquet?

i need to create parquet files, reading from jdbc. The table is quite big and all columns are varchars. So i created a new column with a random int to make partitioning.
so my read jdbc looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").parquet("parquetfile.parquet").partitionBy('random_number')
The generated parquet also contains the 'random_number' column, but i only made that column for partitioning, is there a way to exclude that column to the writing of the parquet files?
Thanks for any help, i'm new to spark :)
I'm expecting to exclude the random_number column, but lack the knowledge if this is possible if i need the column for partitioning
So do you want to repartition in memory using a column but not writing it, you can just use .repartition(col("random_number")) before writing droping the column then write your data:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
.repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

Handling Duplicates in Databricks autoloader

I am new to this Databricks Autoloader, we have a requirement where we need to process the data from AWS s3 to delta table via Databricks autoloader. I was testing this autoloader so I came across duplicate issue that is if i upload a file with name say emp_09282021.csv having same data as emp_09272021.csv then it is not detecting any duplicate it is simply inserting them so if I had 5 rows in emp_09272021.csv file now it will become 10 rows as I upload emp_09282021.csv file.
below is the code that i tried:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.format("delta") \
.option("mergeSchema", "true") \
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start("s3://some-s3-path/spark_stream_processing/target/")
any guidance please to handle this?
It's not the task of the autoloader to detect duplicates, it provides you the possibility to ingest data, but you need to handle duplicates yourself. There are several approaches to that:
Use built-in dropDuplicates function. It's recommended to use it with watermarking to avoid creating a huge state, but you need to have some column that will be used as event time, and it should be part of dropDuplicate list (see docs for more details):
streamingDf \
.withWatermark("eventTime", "10 seconds") \
.dropDuplicates("col1", "eventTime")
Use Delta's merge capability - you just need to insert data that isn't in the Delta table, but you need to use foreachBatch for that. Something like this (please note that table should already exist, or you need to add a handling of non-existent table):
from delta.tables import *
def drop_duplicates(df, epoch):
table = DeltaTable.forPath(spark,
"s3://some-s3-path/spark_stream_processing/target/")
dname = "destination"
uname = "updates"
dup_columns = ["col1", "col2"]
merge_condition = " AND ".join([f"{dname}.{col} = {uname}.{col}"
for col in dup_columns])
table.alias(dname).merge(df.alias(uname), merge_condition)\
.whenNotMatchedInsertAll().execute()
# ....
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.foreachBatch(drop_duplicates)\
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.

Pass additional arguments to foreachBatch in pyspark

I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for table name, but I'm not sure how to pass the table name argument.
The example here is pretty helpful, but in the python example the table name is hardcoded, and it looks like in the scala example they're referencing a global variable(?) I would like to pass the name of the table into the function.
The function given in the python example at the link above is:
def writeToSQLWarehose(df, epochId):
df.write \
.format("com.databricks.spark.sqldw") \
.mode('overwrite') \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forward_spark_azure_storage_credentials", "true") \
.option("dbtable", "my_table_in_dw_copy") \
.option("tempdir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
I'd like to use something like this:
def writeToSQLWarehose(df, epochId, tableName):
df.write \
.format("com.databricks.spark.sqldw") \
.mode('overwrite') \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forward_spark_azure_storage_credentials", "true") \
.option("dbtable", tableName) \
.option("tempdir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.save()
But I'm not sure how to pass the additional argument through foreachBatch.
Something like this should work.
streamingDF.writeStream.foreachBatch(lambda df,epochId: writeToSQLWarehose(df, epochId,tableName )).start()
Samellas' solution does not work if you need to run multiple streams. The foreachBatch function gets serialised and sent to Spark worker. The parameter seems to be still a shared variable within the worker and may change during the execution.
My solution is to add parameter as a literate column in the batch dataframe (passing a silver data lake table path to the merge operation):
.withColumn("dl_tablePath", func.lit(silverPath))
.writeStream.format("delta")
.foreachBatch(insertIfNotExisting)
In the batch function insertIfNotExisting, I pick up the parameter and drop the parameter column:
def insertIfNotExisting(batchDf, batchId):
tablePath = batchDf.select("dl_tablePath").limit(1).collect()[0][0]
realDf = batchDf.drop("dl_tablePath")

Spark 2.3.1 AWS EMR not returning data for some columns yet works in Athena/Presto and Spectrum

I am using PySpark on Spark 2.3.1 on AWS EMR (Python 2.7.14)
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 100) \
.enableHiveSupport() \
.getOrCreate()
spark.sql('select `message.country` from datalake.leads_notification where `message.country` is not null').show(10)
This returns no data, 0 rows found.
Every value for each row in above table is returned Null.
Data is stored in PARQUET.
When I ran same SQL query on AWS Athena/Presto or on AWs Redshift Spectrum then I get all column data returned correctly (most column values are not null).
This is the Athena SQL and Redshift SQL query that returns correct data:
select "message.country" from datalake.leads_notification where "message.country" is not null limit 10;
I use AWS Glue catalog in all cases.
The column above is NOT partitioned but the table is partitioned on other columns. I tried to use repair table, it did not help.
i.e. MSCK REPAIR TABLE datalake.leads_notification
i tried Schema Merge = True like so:
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
No difference, still every value of one column is nulls even though some are not null.
This column was added as the last column to the table so most data is indeed null but some rows are not null. The column is listed at last on the column list in catalog, sitting just above the partitioned columns.
Nevertheless Athena/Presto retrieves all non-null values OK and so does Redshift Spectrum too but alas EMR Spark 2.3.1 PySpark shows all values for this column as "null". All other columns in Spark are retrieved correctly.
Can anyone help me to debug this problem please?
Hive Schema is hard to cut and paste here due to output format.
***CREATE TABLE datalake.leads_notification(
message.environment.siteorigin string,
dcpheader.dcploaddateutc string,
message.id int,
message.country string,
message.financepackage.id string,
message.financepackage.version string)
PARTITIONED BY (
partition_year_utc string,
partition_month_utc string,
partition_day_utc string,
job_run_guid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://blahblah/leads_notification/leads_notification/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='weekly_datalake_crawler',
'averageRecordSize'='3136',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='897025',
'sizeKey'='1573529662',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='4',
'spark.sql.sources.schema.numParts'='3',
'spark.sql.sources.schema.partCol.0'='partition_year_utc',
'spark.sql.sources.schema.partCol.1'='partition_month_utc',
'spark.sql.sources.schema.partCol.2'='partition_day_utc',
'spark.sql.sources.schema.partCol.3'='job_run_guid',
'typeOfData'='file')***
Last 3 columns all have the same problems in Spark:
message.country string,
message.financepackage.id string,
message.financepackage.version string
All return OK in Athena/Presto and Redshift Spectrum using same catalog.
I apologize for my editing.
thank you
do step 5 schema inspection:
http://www.openkb.info/2015/02/how-to-build-and-use-parquet-tools-to.html
my bet is these new column names in parquet definition are either upper case (while other column names are lower case) or new column names in parquet definition are either lower case (while other column names are upper case)
see Spark issues reading parquet files
https://medium.com/#an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("spark.sql.hive.convertMetastoreParquet", "false") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
This is the solution: note the
.config("spark.sql.hive.convertMetastoreParquet", "false")
The schema columns are all in lower case and the schema was created by AWS Glue, not by my custom code so I dont really know what caused the problem so using the above is probably the safe default setting when schema creation is not directly under your control. This is a major trap, IMHO, so I hope this will help someone else in future.
Thanks to tooptoop4 who pointed out the article:
https://medium.com/#an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0

How to specify column data type when writing Spark DataFrame to Oracle

I want to write a Spark DataFrame to an Oracle table by using Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:#servername:sid"
mydf.write \
.mode("overwrite") \
.option("truncate", "true") \
.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.OracleDriver") \
.option("createTableColumnTypes", "desc clob, price double") \
.option("user", "Steven") \
.option("password", "123456") \
.option("dbtable", "table1").save()
What I want is to specify the desc column to clob type and the price column to double precision type. But Spark show me that the clob type is not supported. The length of desc string is about 30K. I really need your help. Thanks
As per this note specifies that there are some data types that are not supported. If the target table is already created with CLOB data type then createTableColumnTypes may be redundant. You can check if writing to a CLOB column is possible with spark jdbc if table is already created.
Create your table in mysql with your required schema , now use mode='append' and save records .
mode='append' only insert records without modify table schema.

Resources