NOT NULL not working in PySpark when using the Impala JDBC driver - apache-spark

I am new to PySpark. I am using the Impala JDBC driver ImpalaJDBC41.jar. In my PySpark code, I use the below:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:impala://<instance>:21051") \
    .option("query", "select dst_val,node_name,trunc(starttime,'SS') as starttime from def.tbl_dst where node_name is not null and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')") \
    .option("user", "") \
    .option("password", "") \
    .load()
But the above does not work: the "node_name is not null" filter is not applied, and trunc(starttime,'SS') is not working either. Any help would be appreciated.
sample input data :
dst_val,node_name,starttime
BCD098,,2021-03-26 15:42:06.890000000
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
Expected output :
dst_val,node_name,starttime
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
For debugging, I am trying to print the DataFrame with df.show(), but it still shows the record with the null node_name. The datatype of node_name is "STRING".

Can you please try this?
select dst_val,node_name,cast( from_timestamp(starttime,'SSS') as bigint) as starttime from def.tbl_dst where (node_name is not null or node_name<>'' ) and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')
I think node_name has an empty string in it, and the above SQL (I added or node_name<>'') will take care of it.
Now, if you have some non-printable character, then we may have to check accordingly.
EDIT: Since not null is working in Impala itself, I think this may be a Spark issue.
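If the pushed-down query still returns blank-looking rows, a minimal sketch of re-applying the filter on the Spark side (assuming the df loaded above) can help rule out empty strings, whitespace, or non-printable characters in node_name:
from pyspark.sql import functions as F

# Keep only rows where node_name is non-null and non-blank after trimming;
# this catches empty strings and plain whitespace that "is not null" lets through.
clean_df = df.filter(
    F.col("node_name").isNotNull() & (F.trim(F.col("node_name")) != "")
)
clean_df.show(truncate=False)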

Related

How to filter data using spark.read in place?

I am trying to read data in Delta format from ADLS, and I want to read some portion of that data using a filter in place. The same approach worked for me when reading the JDBC format:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
So I tried to read Delta in a similar way using a query, but it reads the whole table.
return spark.read \
    .format("delta") \
    .option("query", query) \
    .load(path)
How could I solve this issue without reading the full DataFrame and then filtering it?
Thanks in advance!
Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, you have to read all of them into memory and filter afterwards.
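A minimal sketch of the usual pattern, assuming the same path, createdate, and modifieddate variables as in the question: express the predicate as a DataFrame filter instead of a "query" option (which the question observes is not applied by the Delta source); a .filter() can still benefit from data skipping on partition or statistics columns.
from pyspark.sql import functions as F

# createdate / modifieddate are assumed to be timestamp strings, as in the question.
df = (
    spark.read.format("delta")
    .load(path)  # no "query" option here; filter on the DataFrame instead
    .filter(
        (F.col("createdate") < F.to_timestamp(F.lit(createdate))) |
        (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate)))
    )
)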

ClickHouse housepower driver with spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the Yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling of Arrays - correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName('custom-events-test') \
    .config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar") \
    .getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("query", query) \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .load()
But there I get an error:
[error screenshot: housepower_error]
BUT when I run the above query with the Yandex driver ru.yandex.clickhouse.ClickHouseDriver, everything works fine (even with the housepower jar on the classpath).
This error also appears when I want to import a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And can you tell me how to insert a DF with an Array column into a ClickHouse table with an Array(String) column using Spark JDBC? I was looking everywhere but couldn't find a solution...
Thank you in advance!

Spark JDBC read API: Determining the number of partitions dynamically for a column of type datetime

I'm trying to read a table from an RDS MySQL instance using PySpark. It's a huge table, hence I want to parallelize the read operation by making use of the partitioning concept. The table doesn't have a numeric column to find the number of partitions. Instead, it has a timestamp column (i.e. datetime type).
I found the lower and upper bounds by retrieving the min and max values of the timestamp column. However, I'm not sure if there's a standard formula to find out the number of partitions dynamically. Here is what I'm doing currently (hardcoding the value for the numPartitions parameter):
select_sql = "SELECT {} FROM {}".format(columns, table)
partition_info = {'partition_column': 'col1',
'lower_bound': '<result of min(col1)>',
'upper_bound': '<result of max(col1)>',
'num_partitions': '10'}
read_df = spark.read.format("jdbc") \
.option("driver", driver) \
.option("url", url) \
.option("dbtable", select_sql) \
.option("user", user) \
.option("password", password) \
.option("useSSL", False) \
.option("partitionColumn", partition_info['partition_column']) \
.option("lowerBound", partition_info['lower_bound'])) \
.option("upperBound", partition_info['upper_bound'])) \
.option("numPartitions", partition_info['num_partitions']) \
.load()
Please suggest a solution/approach that works. Thanks
How to set numPartitions depends on your cluster's definition. There are no right, wrong, or automatic settings here. As long as you understand the logic behind partitionColumn, lowerBound, upperBound, and numPartitions, and do a fair amount of benchmarking, you can decide on the right number; one way of deriving the bounds dynamically is sketched after the linked questions below.
Pyspark - df.cache().count() taking forever to run
What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
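A minimal sketch of deriving the bounds and a partition count dynamically from the timestamp column, as discussed in the answer above. The min/max query, the table and column names, and the "one partition per hour of data" heuristic are assumptions for illustration, not a standard formula:
# Fetch min/max of the timestamp column with a single-row JDBC query
# (hypothetical table/column names: my_table, col1).
bounds = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("query", "SELECT MIN(col1) AS lo, MAX(col1) AS hi FROM my_table") \
    .option("user", user) \
    .option("password", password) \
    .load() \
    .collect()[0]

# Heuristic: roughly one partition per hour of data, capped to keep tasks reasonably sized.
hours = max(1, int((bounds["hi"] - bounds["lo"]).total_seconds() // 3600))
num_partitions = min(hours, 64)

read_df = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", "my_table") \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", "col1") \
    .option("lowerBound", str(bounds["lo"])) \
    .option("upperBound", str(bounds["hi"])) \
    .option("numPartitions", num_partitions) \
    .load()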

How to specify column data type when writing Spark DataFrame to Oracle

I want to write a Spark DataFrame to an Oracle table using the Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:#servername:sid"
mydf.write \
    .mode("overwrite") \
    .option("truncate", "true") \
    .format("jdbc") \
    .option("url", url) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("createTableColumnTypes", "desc clob, price double") \
    .option("user", "Steven") \
    .option("password", "123456") \
    .option("dbtable", "table1") \
    .save()
What I want is to map the desc column to the CLOB type and the price column to the DOUBLE PRECISION type, but Spark tells me that the clob type is not supported. The length of the desc string is about 30K. I really need your help. Thanks
As this note specifies, there are some data types that are not supported. If the target table is already created with the CLOB data type, then createTableColumnTypes may be redundant. You can check whether writing to a CLOB column is possible with Spark JDBC when the table already exists.
Create your table in the database with your required schema, then use mode='append' and save the records.
mode='append' only inserts records without modifying the table schema.
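A minimal sketch of that approach, assuming table1 has already been created in Oracle with the desired column types (e.g. desc as CLOB); the URL and credentials are taken from the question:
# Append into the pre-created Oracle table; the existing column types (including CLOB)
# are kept, so createTableColumnTypes is not needed here.
mydf.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", url) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("user", "Steven") \
    .option("password", "123456") \
    .option("dbtable", "table1") \
    .save()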

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example using PySpark of how to run a custom Apache Phoenix SQL query and store the result of that query in an RDD or DF? Note: I am looking for a custom query, not an entire table read into an RDD.
From the Phoenix documentation, to load an entire table I can use this:
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
I want to know what the corresponding equivalent is for using a custom SQL query:
sqlResult = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("sql", "select * from <TABLENAME> where <CONDITION>") \
    .option("zkUrl", "<HOSTNAME>:<PORT>") \
    .load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc') \
    .options(driver="org.apache.phoenix.jdbc.PhoenixDriver",
             url='jdbc:phoenix:<HOSTNAME>:<PORT>',
             dbtable=sql) \
    .load()
df.show()
However, it should be noted that if there are column aliases in the SQL statement, then the .show() call will throw an exception (it will work if you use .select() to select the columns that are not aliased, as sketched below); this is a possible bug in Phoenix.
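A minimal sketch of that workaround, using a hypothetical aliased variant of the query above:
# COL2 is aliased, so .show() on the full frame may hit the Phoenix alias bug mentioned above;
# selecting only the non-aliased column still works.
sql = '(select COL1, COL2 as RENAMED from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc') \
    .options(driver="org.apache.phoenix.jdbc.PhoenixDriver",
             url='jdbc:phoenix:<HOSTNAME>:<PORT>',
             dbtable=sql) \
    .load()
df.select("COL1").show()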
Here you need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
With Spark 2, I didn't have a problem with the .show() function, and I did not need the .select() function to print all values of the DataFrame coming from Phoenix.
So, make sure that your SQL query is inside parentheses; see my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
  .option("useUnicode", "true")
  .option("continueBatchOnError", "true")
  .option("dbtable", sql)
  .load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
