How can I achieve the following PySpark behavior using the replaceWhere clause - apache-spark

sourceSystem = df_sourceSystem.select(col('SourceSystem')).collect()[0][0]
print(sourceSystem)
df_inactive = df_existing.join(df, 'CustomerKey', 'left_anti')
df = df.union(df_inactive)
return df, sourceSystem
spark.read.format("delta").load(load_path).show()
df.display()
df.write.format('delta') \
.mode('overwrite') \
.partitionBy('SourceSystem') \
.option('maxRecordsPerFile', 5000000) \
.option("replaceWhere", "SourceSystem = 'sourceSystem'") \
.save(f'abfss://lakedev@abclakedev01.dfs.core.windows.net/xyz/Presented/functionabc/schema_version= 5/')
df.display()
The error is the following:
AnalysisException: Data written out does not match replaceWhere 'SourceSystem = 'A''. CHECK
constraint EXPRESSION(('SourceSystem = A)) (`SourceSystem` = 'A') violated by row with values:
SourceSystem : B
My understanding is that there are multiple source systems in df, but replaceWhere is not able to select the particular one. How can I resolve this?
I tried with:
.option("replaceWhere", f"SourceSystem = '{sourceSystem }'") \
but it did not work.
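Based on the error, the DataFrame being written still contains rows whose SourceSystem differs from the predicate value (the union with df_inactive can re-introduce rows from other source systems). A minimal sketch of one way to make the data and the predicate agree, assuming sourceSystem holds a single value such as 'A' and that load_path points at the same Delta table as the abfss path above:

# Hypothetical sketch: keep only the rows that belong to the partition being
# replaced, and build the predicate from the variable (no literal quotes around
# the variable name).
df_to_write = df.filter(df['SourceSystem'] == sourceSystem)
predicate = f"SourceSystem = '{sourceSystem}'"

df_to_write.write.format('delta') \
    .mode('overwrite') \
    .partitionBy('SourceSystem') \
    .option('maxRecordsPerFile', 5000000) \
    .option('replaceWhere', predicate) \
    .save(load_path)  # assumed to be the same Delta path as the abfss location above

With the filter in place, every written row satisfies the replaceWhere condition, which is exactly what the AnalysisException is checking.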

Related

Difference between using spark SQL and SQL

My main question concerns performance. Looking at the code below:
query = """
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
"""
df = spark.read.format("jdbc") \
.option("url", "connectionString") \
.option("user", user) \
.option("password", password) \
.option("numPartitions", 10) \
.option("partitionColumn", "Id") \
.option("lowerBound", lowerBound) \
.option("upperBound", upperBound) \
.option("dbtable", query) \
.load()
As far as I understand, this query will be sent to the database, which processes it and returns the results to Spark.
Now considering the code below:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
).select(
f.col("Id"),
f.col("Name")
).where("Id = <> 1").orderBy(f.col("Id"))
I know that Spark will load the entire table into memory and then execute the filters on the DataFrame.
Finally, the last code snippet:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
)
df.createOrReplaceTempView("Customers")  # register the table so spark_session.sql can reference it
final_df = spark_session.sql("""
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
""")
I have 3 questions:
Among the three snippets, which one is the most correct? I always use the second approach; is this correct?
What is the difference between using spark.sql and using the commands directly on the dataframe, as in the second code snippet?
What is the ideal number of rows for me to start using Spark? Is it worth using for queries that return less than 1 million rows?
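One way to see how the variants differ is to inspect the physical plan and look for pushed-down filters. A minimal sketch, assuming a placeholder SQL Server connection string and the dbo.Customers table from the question:

# Hypothetical sketch: check what Spark pushes down to the JDBC source.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://server;databaseName=db") \
    .option("dbtable", "dbo.Customers") \
    .option("user", "user") \
    .option("password", "password") \
    .load()

filtered = df.select("Id", "Name").where("Id <> 1")

# The scan node of the plan lists PushedFilters when a filter is delegated
# to the database instead of being applied in Spark.
filtered.explain(True)

Running the same check on the other variants shows whether the WHERE clause runs in the database or in Spark.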

Issue with converting spark-sql which has an Oracle BLOB column into java byte[]

I would appreciate it if anybody could point me to a code snippet for
converting a Spark SQL result that has an Oracle BLOB column into a Java byte[]. Here is what I have, but I am getting an error.
Dataset<tableX> dataset = sparkSession.read()
.format("jdbc")
.option("url", "jdbc:oracle:thin:#(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = xx )(PORT = 1234))(CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = xy)))")
.option("dbtable", "(select lob_id , blob_data from tableX ) test1")
.option("user", "user1")
.option("password", "pass1")
.option("driver", "oracle.jdbc.driver.OracleDriver")
.load();
//dataset.show();
dataset.foreach((ForeachFunction<tableX>) row -> {
byte blobData[] = row.getAs("blob_data");
});
Thank you.
DataFrameReader.load returns an untyped DataFrame (aka Dataset<Row>). Calling foreach with a ForeachFunction<tableX> on it will therefore produce a compiler error.
Option 1: stick to the untyped dataframe and use ForeachFunction<Row>:
Dataset<Row> dataframe = sparkSession.read() ... .load();
dataframe.foreach((ForeachFunction<Row>) row -> {
byte blobData[] = (byte[])row.getAs("blob_data");
System.out.println(Arrays.toString(blobData));
});
Option 2: after reading the dataframe transform it into a typed dataset and use ForeachFunction<tableX>. A class tableX should exist and contain the fields of the original query as members:
Dataset<tableX> dataset = dataframe.as(Encoders.bean(tableX.class));
dataset.foreach((ForeachFunction<tableX>) row -> {
byte blobData[] = row.getBlob_data();
System.out.println(Arrays.toString(blobData));
});
Option 2 assumes that the class tableX has a getter for the field blob_data named getBlob_data().
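For comparison, a minimal PySpark sketch of the same read, assuming Spark's Oracle dialect maps the BLOB column to BinaryType (the host, port, service name and credentials are placeholders); on the Python side each value arrives as a bytearray:

# Hypothetical PySpark sketch: the BLOB column is read as BinaryType.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service_name") \
    .option("dbtable", "(select lob_id, blob_data from tableX) test1") \
    .option("user", "user1") \
    .option("password", "pass1") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

df.printSchema()  # blob_data should appear as binary

for row in df.limit(5).collect():
    blob_bytes = bytes(row["blob_data"])  # bytearray -> bytes
    print(row["lob_id"], len(blob_bytes))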

PySpark pyspark.sql.DataFrameReader.jdbc() doesn't accept a datetime-type upperBound argument as the documentation says

I found the documentation for the jdbc function in PySpark 3.0.1 at
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader, and it says:
column – the name of a column of numeric, date, or timestamp type that
will be used for partitioning.
I thought it accepts a datetime column to partition the query.
So I tried this on EMR-6.2.0 (PySpark 3.0.1):
sql_conn_params = get_spark_conn_params() # my function
sql_conn_params['column'] ='EVENT_CAPTURED'
sql_conn_params['numPartitions'] = 8
# sql_conn_params['upperBound'] = datetime.strptime('2016-01-01', '%Y-%m-%d') # another trial
# sql_conn_params['lowerBound'] = datetime.strptime('2016-01-10', '%Y-%m-%d')
sql_conn_params['upperBound'] = '2016-01-01 00:00:00'
sql_conn_params['lowerBound'] = '2016-01-10 00:00:00'
df = (spark.read.jdbc(
table=tablize(sql),
**sql_conn_params
))
df.show()
I got this error:
invalid literal for int() with base 10: '2016-01-01 00:00:00'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 625, in jdbc
return self._df(self._jreader.jdbc(url, table, column, int(lowerBound), int(upperBound),
ValueError: invalid literal for int() with base 10: '2016-01-01 00:00:00'
I looked at the source code here
https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L865
and found that it doesn't support datetime types the way the documentation says.
My question is:
The code shows that PySpark's jdbc() doesn't support a datetime-type partition column, so why does the documentation say it does?
Thanks,
Yan
It is supported.
The issue here is that the spark.read.jdbc method currently only accepts integer upper/lower bounds (it casts them with int(), as the traceback shows).
But you can use the load method and DataFrameReader.option to specify upperBound and lowerBound for other column types such as date/timestamp:
df = spark.read.format("jdbc") \
.option("url", "jdbc:mysql://server/db") \
.option("dbtable", "table_name") \
.option("user", "user") \
.option("password", "xxxx") \
.option("partitionColumn", "EVENT_CAPTURED") \
.option("lowerBound", "2016-01-01 00:00:00") \
.option("upperBound", "2016-01-10 00:00:00") \
.option("numPartitions", "8") \
.load()
Or by passing a dict of options:
df = spark.read.format("jdbc") \
    .options(**sql_conn_params) \
    .load()
You can see all available options and examples here: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
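Applied to the original example, the dict only needs the option names that the format("jdbc") reader expects (partitionColumn instead of column). A minimal sketch, assuming the URL is a placeholder and tablize(sql) is the same helper as in the question:

# Hypothetical sketch: same dict-of-options approach, with the key names
# that spark.read.format("jdbc") understands.
sql_conn_params = {
    "url": "jdbc:mysql://server/db",        # placeholder connection string
    "dbtable": tablize(sql),                 # same helper as in the question
    "partitionColumn": "EVENT_CAPTURED",     # note: 'partitionColumn', not 'column'
    "lowerBound": "2016-01-01 00:00:00",
    "upperBound": "2016-01-10 00:00:00",
    "numPartitions": "8",
}

df = spark.read.format("jdbc") \
    .options(**sql_conn_params) \
    .load()
df.show()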

Spark dynamic partitioning: SchemaColumnConvertNotSupportedException on read

Question
Is there any way to store data with different (not compatible) schemas in different partitions?
The issue
I use PySpark v2.4.5, the Parquet format, and dynamic partitioning with the following hierarchy: BASE_PATH/COUNTRY=US/TYPE=sms/YEAR=2020/MONTH=04/DAY=10/. Unfortunately, it can't be changed.
I get SchemaColumnConvertNotSupportedException on read. That happens because the schema differs between types (i.e. between sms and mms). It looks like Spark is trying to merge the two schemas on read under the hood.
To be more precise, I can read data for F.col('TYPE') == 'sms', because the mms schema can be converted to the sms schema. But when I filter by F.col('TYPE') == 'mms', Spark fails.
Code
# Works, because Spark doesn't try to merge schemas
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH + '/COUNTRY_CODE=US/TYPE=mms/YEAR=2020/MONTH=04/DAY=07/HOUR=00') \
.show()
# Doesn't work, because Spark tries to merge the schemas for TYPE=sms and TYPE=mms; the mms data can't be converted to the merged schema.
# Types are correct; from explain(), Spark treats the date partitions as integers.
# Predicate pushdown isn't used for some reason; there is no PushedFilters entry in the explained plan.
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()
In case it is useful for someone: it is possible to keep data with different schemas in different partitions. To stop Spark from inferring the schema for Parquet, specify the schema explicitly:
spark_session \
.read \
.schema(some_schema) \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()
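The snippet assumes a some_schema variable that is not shown. A minimal sketch of how it could be built, with placeholder field names standing in for the actual mms columns:

# Hypothetical sketch: an explicit schema so Spark skips inference and merging.
# The data fields below are placeholders; the partition columns are filled in
# from the directory names.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

some_schema = StructType([
    StructField('MESSAGE_ID', StringType(), True),   # placeholder data column
    StructField('SENT_AT', TimestampType(), True),   # placeholder data column
    StructField('COUNTRY', StringType(), True),      # partition column
    StructField('TYPE', StringType(), True),         # partition column
    StructField('YEAR', IntegerType(), True),        # partition column
    StructField('MONTH', IntegerType(), True),       # partition column
    StructField('DAY', IntegerType(), True),         # partition column
])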

pyspark sql with where clause throws column does not exist error

I am using PySpark to load a CSV into Redshift. I want to query how many rows got added.
I create a new column using the withColumn function:
csvdata=df.withColumn("file_uploaded", lit("test"))
I see that this column gets created and I can query it using psql. But when I try to query using the PySpark SQL context, I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o77.showString.
: java.sql.SQLException: [Amazon](500310) Invalid operation: column "test" does not exist in billingreports;
Interestingly, I am able to query the other columns, just not the new column I added.
Appreciate any pointers on how to resolve this issue.
Complete code:
df=spark.read.option("header","true").csv('/mnt/spark/redshift/umcompress/' +
filename)
csvdata=df.withColumn("fileuploaded", lit("test"))
countorig=csvdata.count()
## This executes without error
csvdata.write \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("dbtable", dbname) \
.option("tempformat", "CSV") \
.option("tempdir", "s3://" + s3_bucket + "/temp") \
.mode("append") \
.option("aws_iam_role", iam_role).save()
select="select count(*) from " + dbname + " where fileuploaded='test'"
## Error occurs
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("query", select) \
.option("tempdir", "s3://" + s3_bucket + "/test") \
.option("aws_iam_role", iam_role) \
.load()
newcount = df.count()
Thanks for responding.
The DataFrame does have the new column, called file_uploaded.
Here is the query:
select="select count(*) from billingreports where file_uploaded='test'"
I have printed the schema:
|-- file_uploaded: string (nullable = true)
df.show() shows that the new column is added.
I just want to add a predetermined string to this column as the value.
Your DataFrame csvdata will have a new column named file_uploaded, with the default value "test" in all rows of df. The error shows that the query is trying to access a column named test, which does not exist in the dataframe billingreports, hence the error. Print the schema before querying the column with billingreports.dtypes, or better, take a sample of your dataframe with billingreports.show() and check that the column has the correct name and values.
It would be better if you shared the query that resulted in this exception, as the exception is thrown for the dataframe billingreports.
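A minimal sketch of the kind of check suggested above, assuming the target table is the dbname from the question and the intended column name is file_uploaded (as printed in the schema):

# Hypothetical sketch: confirm the exact column name before building the query.
print(csvdata.dtypes)   # shows the real name, e.g. ('file_uploaded', 'string')
csvdata.select("file_uploaded").show(5)

# Build the query against the name reported by the schema, not a guess.
select = "select count(*) from " + dbname + " where file_uploaded = 'test'"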
