ClickHouse housepower driver with spark - apache-spark

I'm new to Stack Overflow and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the Yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling for Arrays - correct me if I'm wrong.
And I want to read a query from ClickHouse via JDBC.
spark = SparkSession\
.builder\
.enableHiveSupport()\
.appName(f'custom-events-test')\
.config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
.getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
.format("jdbc") \
.option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
.option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
.option("query", query) \
.option("user", ch_user) \
.option("password", ch_pass) \
.load()
But there I get an error:
(screenshot of the error from the housepower driver)
BUT when I run the above query with the Yandex driver ru.yandex.clickhouse.ClickHouseDriver,
everything works fine (even with the housepower jar on the classpath).
The error also appears when I try to read a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And can you tell me how to insert a DF with an Array column into a ClickHouse table that also has an Array(String) column, using Spark JDBC? I was looking everywhere but couldn't find a solution...
Thank you in advance!
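For reference, this is roughly the write I am trying to get working - a sketch only: the target table name and column layout are made up, and whether the housepower driver actually maps Spark's ArrayType to ClickHouse Array(String) is exactly what I am unsure about:
# Sketch: hypothetical ClickHouse table "audience.custom_events_flat" with Array(String) columns.
df_out = spark.createDataFrame(
    [("2022-07-22", "XXX", ["k1", "k2"], ["v1", "v2"])],
    ["date", "partnerID", "keys", "values"])
df_out.write \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("dbtable", "audience.custom_events_flat") \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .mode("append") \
    .save()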

Related

How to filter data using spark.read in place?

I am trying to read data in Delta format from ADLS. I want to read only a portion of that data, using a filter applied in place. The same approach worked for me when reading the JDBC format:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to read the Delta data in a similar way, using a query, but it reads the whole table:
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How could I solve this without reading the full DataFrame and then filtering it?
Thanks in advance!
Spark uses a functionality called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying parquet files, you have to read all of them into memory and filter afterwards.
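A minimal sketch of that approach, assuming the same createdate/modifieddate columns exist in the Delta table (the timestamp format string is an assumption): load the path and filter the resulting DataFrame, letting Delta's file statistics skip data where they can.
from pyspark.sql import functions as F
# Sketch: filter after load; there is no "query" option for the delta source,
# but file-level statistics can still skip some files for this predicate.
df = spark.read.format("delta").load(path) \
    .where((F.col("createdate") < F.to_timestamp(F.lit(createdate), "yyyy-MM-dd HH:mm:ss")) |
           (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate), "yyyy-MM-dd HH:mm:ss")))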

not null not working in Pyspark when using impala jdbc driver

I am new to PySpark. I am using the Impala JDBC driver ImpalaJDBC41.jar. In my PySpark code, I use the below:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:impala://<instance>:21051") \
.option("query", "select dst_val,node_name,trunc(starttime,'SS') as starttime from def.tbl_dst where node_name is not null and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')") \
.option("user", "") \
.option("password", "") \
.load()
But the above does not work: the "node_name is not null" filter has no effect, and trunc(starttime,'SS') is not applied either. Any help would be appreciated.
Sample input data:
dst_val,node_name,starttime
BCD098,,2021-03-26 15:42:06.890000000
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
Expected output:
dst_val,node_name,starttime
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
For debugging, I am trying to print with df.show(), but to no avail: it is still showing the record with the null node_name. The datatype of node_name is "STRING".
Can you please use this?
select dst_val, node_name, cast(from_timestamp(starttime,'SSS') as bigint) as starttime from def.tbl_dst where (node_name is not null and node_name <> '') and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')
I think node_name has an empty string in it, and the SQL above (I added the node_name <> '' check) will take care of it.
Now, if you have some non-printable character, then we may have to check accordingly.
EDIT: Since not null works in Impala itself, I think this may be a Spark issue.
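If it does turn out to be a Spark-side problem with the query option, one thing worth trying (a sketch, untested here) is the older dbtable subquery form with the same Impala SQL, which is sometimes handled differently than query:
# Sketch: same filters, passed as a parenthesized subquery via "dbtable" instead of "query".
sql = """(select dst_val, node_name, trunc(starttime,'SS') as starttime
          from def.tbl_dst
          where node_name is not null and node_name <> ''
            and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH')
            and trunc(starttime,'HH') < trunc(now(),'HH')) t"""
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:impala://<instance>:21051") \
    .option("dbtable", sql) \
    .option("user", "") \
    .option("password", "") \
    .load()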

Spark 2.3.1 AWS EMR not returning data for some columns yet works in Athena/Presto and Spectrum

I am using PySpark on Spark 2.3.1 on AWS EMR (Python 2.7.14)
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 100) \
.enableHiveSupport() \
.getOrCreate()
spark.sql('select `message.country` from datalake.leads_notification where `message.country` is not null').show(10)
This returns no data: 0 rows found.
Every value for each row of this column is returned as null.
The data is stored in PARQUET.
When I run the same SQL query on AWS Athena/Presto or on AWS Redshift Spectrum, I get all column data returned correctly (most column values are not null).
This is the Athena SQL and Redshift SQL query that returns correct data:
select "message.country" from datalake.leads_notification where "message.country" is not null limit 10;
I use AWS Glue catalog in all cases.
The column above is NOT partitioned, but the table is partitioned on other columns. I tried to repair the table; it did not help,
i.e. MSCK REPAIR TABLE datalake.leads_notification
I tried schema merge = true like so:
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
No difference: every value of this one column is still null, even though some values are not null in the data.
This column was added as the last column to the table, so most data is indeed null, but some rows are not. The column is listed last in the column list in the catalog, sitting just above the partitioned columns.
Nevertheless, Athena/Presto retrieves all non-null values OK, and so does Redshift Spectrum, but alas EMR Spark 2.3.1 PySpark shows all values for this column as "null". All other columns in Spark are retrieved correctly.
Can anyone help me debug this problem, please?
The Hive schema is hard to cut and paste here due to the output format.
CREATE TABLE datalake.leads_notification(
message.environment.siteorigin string,
dcpheader.dcploaddateutc string,
message.id int,
message.country string,
message.financepackage.id string,
message.financepackage.version string)
PARTITIONED BY (
partition_year_utc string,
partition_month_utc string,
partition_day_utc string,
job_run_guid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://blahblah/leads_notification/leads_notification/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='weekly_datalake_crawler',
'averageRecordSize'='3136',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='897025',
'sizeKey'='1573529662',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='4',
'spark.sql.sources.schema.numParts'='3',
'spark.sql.sources.schema.partCol.0'='partition_year_utc',
'spark.sql.sources.schema.partCol.1'='partition_month_utc',
'spark.sql.sources.schema.partCol.2'='partition_day_utc',
'spark.sql.sources.schema.partCol.3'='job_run_guid',
'typeOfData'='file')
The last 3 columns all have the same problem in Spark:
message.country string,
message.financepackage.id string,
message.financepackage.version string
All return OK in Athena/Presto and Redshift Spectrum using the same catalog.
I apologize for my editing.
Thank you.
Do the step 5 schema inspection from:
http://www.openkb.info/2015/02/how-to-build-and-use-parquet-tools-to.html
My bet is that these new column names in the parquet definition are either upper case (while the other column names are lower case) or lower case (while the other column names are upper case).
See Spark issues reading parquet files:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
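If parquet-tools is not handy, a rough equivalent of that schema inspection can be done from PySpark itself (a sketch; the S3 path is the table LOCATION from the DDL above) - compare the physical field names, including their case, with what the catalog reports:
# Sketch: read the parquet files directly, bypassing the Glue/Hive metastore,
# and compare the physical column names with the catalog's column names.
raw = spark.read.parquet("s3://blahblah/leads_notification/leads_notification/")
raw.printSchema()
print(spark.table("datalake.leads_notification").columns)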
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("spark.sql.hive.convertMetastoreParquet", "false") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
This is the solution: note the
.config("spark.sql.hive.convertMetastoreParquet", "false")
The schema columns are all in lower case, and the schema was created by AWS Glue, not by my custom code, so I don't really know what caused the problem. Using the setting above is probably the safe default when schema creation is not directly under your control. This is a major trap, IMHO, so I hope it will help someone else in the future.
Thanks to tooptoop4 who pointed out the article:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0

How to specify column data type when writing Spark DataFrame to Oracle

I want to write a Spark DataFrame to an Oracle table by using Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:#servername:sid"
mydf.write \
.mode("overwrite") \
.option("truncate", "true") \
.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.OracleDriver") \
.option("createTableColumnTypes", "desc clob, price double") \
.option("user", "Steven") \
.option("password", "123456") \
.option("dbtable", "table1").save()
What I want is to map the desc column to the CLOB type and the price column to the double precision type. But Spark tells me that the CLOB type is not supported. The length of the desc string is about 30K characters. I really need your help. Thanks.
This note specifies that there are some data types that are not supported. If the target table is already created with a CLOB data type, then createTableColumnTypes may be redundant. You can check whether writing to a CLOB column is possible with Spark JDBC when the table is already created.
Create your table in Oracle with your required schema, then use mode='append' and save the records.
mode='append' only inserts records without modifying the table schema.
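A sketch of that approach against the code in the question, assuming table1 has already been created in Oracle with desc as CLOB and price as a double-precision type: drop createTableColumnTypes and let append reuse the existing column types.
# Sketch: table1 is pre-created in Oracle (e.g. desc CLOB, price BINARY_DOUBLE),
# so no createTableColumnTypes is needed; append only inserts rows.
mydf.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", url) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("user", "Steven") \
    .option("password", "123456") \
    .option("dbtable", "table1") \
    .save()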

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example, using PySpark, of how to run a custom Apache Phoenix SQL query and store the result of that query in an RDD or DataFrame? Note: I am looking for a custom query, not an entire table to be read into an RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "<TABLENAME>") \
.option("zkUrl", "<hostname>:<port>") \
.load()
I want to know what the corresponding equivalent is for using custom SQL:
sqlResult = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("sql", "select * from <TABLENAME> where <CONDITION>") \
.option("zkUrl", "<HOSTNAME>:<PORT>") \
.load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
However, it should be noted that if there are column aliases in the SQL statement, then the .show() statement will throw an exception (it will work if you use .select() to select the columns that are not aliased); this is a possible bug in Phoenix.
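For example, if the subquery does alias a column, a sketch of the workaround described above looks like this:
# Sketch: COL1 is aliased, so df.show() may hit the issue described above,
# but selecting the non-aliased column still works.
sql = '(select COL1 as C1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
    .options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.select("COL2").show()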
Here you need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
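Note that this only works once the Phoenix table is visible to Spark SQL; a sketch combining it with the table load shown in the question (assuming Spark 2.x for createOrReplaceTempView):
# Sketch: load the Phoenix table, register it as a temp view, then run the custom SQL on the view.
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
table.createOrReplaceTempView("<TABLENAME>")
dataframe = sqlContext.sql("select * from <TABLENAME> where <CONDITION>")
dataframe.show()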
With Spark 2, I didn't have a problem with the .show() function, and I did not use the .select() function to print all values of the DataFrame coming from Phoenix.
So, make sure that your SQL query is inside parentheses; look at my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
.option("useUnicode", "true")
.option("continueBatchOnError", "true")
.option("dbtable", sql)
.load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
