pyspark sql with where clause throws column does not exist error - apache-spark

I am using PySpark to load a CSV into Redshift, and I want to query how many rows were added.
I create a new column using the withColumn function:
csvdata=df.withColumn("file_uploaded", lit("test"))
I can see that this column gets created, and I can query it using psql. But when I try to query it through the PySpark SQL context, I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o77.showString.
: java.sql.SQLException: [Amazon](500310) Invalid operation: column "test" does not exist in billingreports;
Interestingly, I am able to query the other columns, just not the new column I added.
I would appreciate any pointers on how to resolve this issue.
Complete code:
df=spark.read.option("header","true").csv('/mnt/spark/redshift/umcompress/' +
filename)
csvdata=df.withColumn("fileuploaded", lit("test"))
countorig=csvdata.count()
## This executes without error
csvdata.write \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("dbtable", dbname) \
.option("tempformat", "CSV") \
.option("tempdir", "s3://" + s3_bucket + "/temp") \
.mode("append") \
.option("aws_iam_role", iam_role).save()
select="select count(*) from " + dbname + " where fileuploaded='test'"
## Error occurs
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("query", select) \
.option("tempdir", "s3://" + s3_bucket + "/test") \
.option("aws_iam_role", iam_role) \
.load()
newcounnt=df.count()
Thanks for responding.
Dataframe does have the new column called file_uploaded
Here is the query:
select="select count(*) from billingreports where file_uploaded='test'"
I have printed the schema
|-- file_uploaded: string (nullable = true)
df.show() shows that the new column is added.
I just want to add a predetermined string as the value of this column.

Your DataFrame csvdata will have a new column named file_uploaded, with the value "test" in every row. The error shows that the query is trying to access a column named test, which does not exist in billingreports. Print the schema before querying the column with billingreports.dtypes, or better, take a sample with billingreports.show() and check that the column has the correct name and values.
It would help if you shared the query that caused this exception, since the exception is thrown for billingreports.
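As a rough illustration of the check suggested above, the sketch below compares the column names a query refers to against the names the DataFrame actually exposes. The helper name `missing_columns` and the sample column list are hypothetical; in practice the list would come from `df.columns`. The comparison is case-insensitive, which is an assumption that matches Spark's default behaviour.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`.

    The comparison ignores case (an assumption, see the note above).
    """
    present = {name.lower() for name in available}
    return [name for name in required if name.lower() not in present]


# Hypothetical column list, standing in for `df.columns`.
columns = ["account_id", "usage_date", "fileuploaded"]

print(missing_columns(columns, ["file_uploaded"]))  # ['file_uploaded']
print(missing_columns(columns, ["fileuploaded"]))   # []
```

Running a check like this before building the query would surface a mismatch such as fileuploaded versus file_uploaded, both of which appear in the question.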

Related

ClickHouse housepower driver with spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7). I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the yandex driver this is impossible, because JDBC doesn't support arrays, right? ;)
So I wanted to run the Spark session with the housepower jar, clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added array handling. Correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession\
.builder\
.enableHiveSupport()\
.appName('custom-events-test')\
.config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
.getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
.format("jdbc") \
.option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
.option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
.option("query", query) \
.option("user", ch_user) \
.option("password", ch_pass) \
.load()
But there I get an error:
[image: housepower_error]
But when I run the above query with the yandex driver (ru.yandex.clickhouse.ClickHouseDriver), everything works fine, even with the housepower jar loaded.
This error also appears when I want to import column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
Also, how can I insert a DataFrame with an Array column into a ClickHouse table that has an Array(String) column using Spark JDBC? I looked everywhere but couldn't find a solution...
Thank you in advance!

Difference between using spark SQL and SQL

My main question is about performance. Looking at the code below:
query = """
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
"""
df = spark.read.format("jdbc") \
.option("url", "connectionString") \
.option("user", user) \
.option("password", password) \
.option("numPartitions", 10) \
.option("partitionColumn", "Id") \
.option("lowerBound", lowerBound) \
.option("upperBound", upperBound) \
.option("dbtable", query) \
.load()
As far as I understand, this query will be sent to the DB, which processes it and returns the result to Spark.
Now considering the code below:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
).select(
f.col("Id"),
f.col("Name")
).where("Id <> 1").orderBy(f.col("Id"))
I know that spark will load the entire table into memory and then execute the filters on the dataframe.
Finally, the last code snippet:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
)
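# Register the DataFrame so that spark.sql can resolve the name Customers
df.createOrReplaceTempView("Customers")
final_df = spark.sql("""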
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
""")
I have 3 questions:
Among the 3 snippets, which one is the most correct? I always use the second approach; is that right?
What is the difference between using spark.sql and calling the DataFrame methods directly, as in the second code snippet?
What is the ideal number of rows at which to start using Spark? Is it worth using for queries that return fewer than 1 million rows?
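For the first snippet, one detail may be worth illustrating: the `dbtable` option expects something that is valid in a FROM clause, so a full SELECT is usually passed as a parenthesized subquery with an alias. The Spark documentation also states that `query` and `partitionColumn` cannot be combined, which is why `dbtable` is used here. The sketch below only builds the option values; the helper names, the alias `src`, the placeholder URL and the bound values are assumptions for illustration. The ORDER BY is left out because SQL Server generally rejects it inside a derived table without TOP.

```python
def as_dbtable_subquery(query, alias="src"):
    """Wrap a SELECT so it can be passed to the JDBC `dbtable` option."""
    body = query.strip().rstrip(";").strip()
    return f"({body}) AS {alias}"


def partitioned_read_options(url, query, column, lower, upper, partitions):
    """Collect the options of a partitioned JDBC read in one dict."""
    return {
        "url": url,
        "dbtable": as_dbtable_subquery(query),
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(partitions),
    }


opts = partitioned_read_options(
    "jdbc:sqlserver://<host>;databaseName=<db>",  # placeholder URL
    "SELECT Name, Id FROM Customers WHERE Id <> 1;",
    "Id", 1, 100000, 10,  # hypothetical bounds and partition count
)
print(opts["dbtable"])  # (SELECT Name, Id FROM Customers WHERE Id <> 1) AS src

# With a live session the dict could be passed on like this (not run here):
# df = spark.read.format("jdbc").options(**opts).load()
```

Calling `df.explain()` on the result of any of the three approaches shows which filters were actually pushed down to the database, which is probably the most direct way to compare them.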

Spark write Dataframe to SQL Server Table with Insert Identity On

I have a Spark DataFrame that I want to push to an SQL table on a remote server. The table has an Id column that is set as an identity column. The DataFrame I want to push also has an Id column, and I want to use those Ids in the SQL table without removing the identity option from the column.
I write the dataframe like this:
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But I get the following response:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 41, 10.1.0.4, executor 0): java.sql.BatchUpdateException: Cannot insert explicit value for identity column in table 'Table' when IDENTITY_INSERT is set to OFF.
I have tried to add a query to the writing like:
query = f"SET IDENTITY_INSERT Table ON;"
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("query", query) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But that just throws an error:
IllegalArgumentException: Both 'dbtable' and 'query' can not be specified at the same time.
Or if I try to run a read with the query first:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'SET'.
This must be because it only supports SELECT statements.
Is it possible to do in Spark, or would I need to use a different connector and combine the setting of the insert identity on, together with regular insert into statements?
I would prefer a solution that allowed me to keep writing through the Spark context. But I am open to other solutions.
One way to work around this issue is the following:
Save your dataframe as a temporary table in your database.
Set identity insert to ON.
Insert into your real table the content of your temporary table.
Set identity insert to OFF.
Drop your temporary table.
Here's a pseudo code example:
tablename = "MyTable"
tmp_tablename = tablename+"tmp"
df.write.format("jdbc").options(..., dbtable=tmp_tablename).save()
columns = ','.join(df.columns)
query = f"""
SET IDENTITY_INSERT {tablename} ON;
INSERT INTO {tablename} ({columns})
SELECT {columns} FROM {tmp_tablename};
SET IDENTITY_INSERT {tablename} OFF;
DROP TABLE {tmp_tablename};
"""
execute(query) # You can use Cursor from pyodbc for example to execute raw SQL queries
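The script-building part of the pseudocode above can be sketched as a small runnable helper. This is only an illustration under a few assumptions: the function names are made up, identifiers are quoted with SQL Server square brackets, and table names are unqualified (a name like dbo.MyTable would need its parts quoted separately).

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, escaping closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def identity_insert_script(table, staging, columns):
    """Build the batch that copies `staging` into `table` with explicit Ids.

    Assumes unqualified table names (no schema prefix).
    """
    cols = ", ".join(quote_ident(c) for c in columns)
    target, source = quote_ident(table), quote_ident(staging)
    return "\n".join([
        f"SET IDENTITY_INSERT {target} ON;",
        f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {source};",
        f"SET IDENTITY_INSERT {target} OFF;",
        f"DROP TABLE {source};",
    ])


# Hypothetical names; the column list would normally be df.columns.
script = identity_insert_script("MyTable", "MyTabletmp", ["Id", "Name"])
print(script)

# With an open pyodbc connection `conn` the batch could be run like this
# (not executed here):
# cursor = conn.cursor()
# cursor.execute(script)
# conn.commit()
```

Listing the columns explicitly matters here, since SQL Server requires a column list on the INSERT when IDENTITY_INSERT is ON.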

How to specify column data type when writing Spark DataFrame to Oracle

I want to write a Spark DataFrame to an Oracle table by using Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:@servername:sid"
mydf.write \
.mode("overwrite") \
.option("truncate", "true") \
.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.OracleDriver") \
.option("createTableColumnTypes", "desc clob, price double") \
.option("user", "Steven") \
.option("password", "123456") \
.option("dbtable", "table1").save()
What I want is to set the desc column to the clob type and the price column to the double precision type. But Spark tells me that the clob type is not supported. The desc strings are about 30K characters long. I really need your help. Thanks.
According to this note, some data types are not supported. If the target table has already been created with a CLOB column, createTableColumnTypes may be redundant. You can check whether writing to a CLOB column works with Spark JDBC when the table already exists.
Create the table in the target database with the required schema, then use mode='append' to save the records.
mode='append' only inserts records and does not modify the table schema.

Running custom Apache Phoenix SQL query in PySpark

Could someone provide a PySpark example of how to run a custom Apache Phoenix SQL query and store the result in an RDD or DataFrame? Note: I am looking for a custom query, not for reading an entire table into an RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "<TABLENAME>") \
.option("zkUrl", "<hostname>:<port>") \
.load()
I want to know what the equivalent is for a custom SQL query, something like:
sqlResult = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("sql", "select * from <TABLENAME> where <CONDITION>") \
.option("zkUrl", "<HOSTNAME>:<PORT>") \
.load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
However, note that if the SQL statement contains column aliases, .show() throws an exception (it works if you use .select() to select the columns that are not aliased). This is possibly a bug in Phoenix.
You need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
With Spark 2, I had no problem with the .show() function, and I did not need .select() to print all values of the DataFrame coming from Phoenix.
So make sure your SQL query is wrapped in parentheses. See my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
.option("useUnicode", "true")
.option("continueBatchOnError", "true")
.option("dbtable", sql)
.load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
