Snowflake interprets boolean values in parquet as NULL? - apache-spark

Parquet Entry Example (All entries have is_active_entity as true)
{
"is_active_entity": true,
"is_removed": false
}
Query that demonstrates all values are taken as NULL
select $1:IS_ACTIVE_ENTITY::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:IS_ACTIVE_ENTITY::boolean;
Output has only one group for NULL
$1:IS_ACTIVE_ENTITY::BOOLEAN COUNT(*)
NULL 4930277
I don't know where I am going wrong: Spark writes the correct schema to Parquet, as evident from the example, but Snowflake reads the values as NULL.
How do I fix this?

The column names in your file are quoted identifiers. As a consequence, "is_active_entity" is not the same as "IS_ACTIVE_ENTITY".
Please try this query:
select $1:is_active_entity::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:is_active_entity::boolean;
More info: https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html#:~:text=The%20identifier%20is%20case%2Dsensitive.

Related

Spark partition filter is skipped when table is used in where condition, why?

Maybe someone has observed this behavior and knows why Spark takes this route.
I wanted to read only a few partitions from a partitioned table.
SELECT *
FROM my_table
WHERE snapshot_date IN('2023-01-06', '2023-01-07')
results in (part of) the physical plan:
-- Location: PreparedDeltaFileIndex [dbfs:/...]
-- PartitionFilters: [cast(snapshot_date#282634 as string) IN (2023-01-06,2023-01-07)]
It is very fast, ~1 s; in the execution plan I can see that the provided dates are used as partition filters.
But if I provide the filter predicate in the form of a one-column table, Spark does a full table scan and it takes 100x longer.
SELECT *
FROM
my_table
WHERE snapshot_date IN (
SELECT snapshot_date
FROM (VALUES('2023-01-06'), ('2023-01-07')) T(snapshot_date)
)
-- plan
Location: PreparedDeltaFileIndex [dbfs:/...]
PartitionFilters: []
ReadSchema: ...
I was unable to find any query hint that would force Spark to push down this predicate.
One can easily write a for loop in Python that wraps the logic of reading the table with the desired dates and reads them one by one, but I'm not sure that is possible in SQL.
Is there any option/switch I have missed?
I don't think pushing down this kind of predicate is supported by Spark's HiveMetaStore client today.
So in the first case, the HiveShim.convertFilters(...) method will transform
WHERE snapshot_date IN ('2023-01-06', '2023-01-07')
into a filtering predicate understood by HMS as
snapshot_date="2023-01-06" or snapshot_date="2023-01-07"
but in the second, sub-select, case the condition will be skipped altogether.
/**
* Converts catalyst expression to the format that Hive's getPartitionsByFilter() expects, i.e.
* a string that represents partition predicates like "str_key=\"value\" and int_key=1 ...".
*
* Unsupported predicates are skipped.
*/
def convertFilters(table: Table, filters: Seq[Expression]): String = {
lazy val dateFormatter = DateFormatter()
:
:
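For completeness, here is a minimal PySpark sketch (an addition, not from the original answer) of the per-partition loop workaround mentioned in the question; it assumes a SparkSession named spark and reuses the table and column names from the example above:
from functools import reduce
from pyspark.sql import DataFrame, functions as F

dates = ["2023-01-06", "2023-01-07"]
# Read each partition with a literal equality filter, which does get pushed down,
# then union the per-date DataFrames back into a single result.
parts = [spark.table("my_table").where(F.col("snapshot_date") == d) for d in dates]
df = reduce(DataFrame.unionByName, parts)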

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query but the datetime format
is not right. So I tried this approach
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push down the query to the database and get back only a few result rows, you can provide a query instead of the 'table' and receive just the result as a DataFrame. This way we can leverage the database engine to process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause. Note that an alias must be provided in the query.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
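Applied to the question's timestamp filter, a minimal sketch could look like the following; the table name tablename, column ts, the alias t_alias, and the literal last_datetime value are assumptions, while data_base_url and properties are taken from the question:
# Embed the timestamp comparison directly in the pushdown query so Postgres evaluates it.
last_datetime = "2021-01-01 00:00:00"  # assumed format; adjust to your data
pushdown_query = "(select * from tablename where ts > timestamp '{}') t_alias".format(last_datetime)
df_new_data = spark.read.jdbc(url=data_base_url, table=pushdown_query, properties=properties)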

Replacing blanks with Null in PySpark

I am working on a Hive table on Hadoop and doing Data wrangling with PySpark. I read the dataset:
dt = sqlContext.sql('select * from db.table1')
df.select("var1").printSchema()
|-- var1: string (nullable = true)
I have some empty values in the dataset that Spark seems unable to recognize! I can easily find the NULL values with
df.where(F.isNull(F.col("var1"))).count()
10163101
but when I use
df.where(F.col("var1")=='').count()
it gives me zero, yet when I check in SQL I have 6908 empty values.
Here are SQL queries and their results:
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1=''
6908
And
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1 is null
10163101
The counts for the SQL table and the PySpark DataFrame are the same:
df.count()
10171109
and
SELECT count(*)
FROM [Y].[dbo].[table1]
10171109
And when I try to find blanks by using length or size, I get an error:
dt.where(F.size(F.col("var1")) == 0).count()
AnalysisException: "cannot resolve 'size(var1)' due to data type
mismatch: argument 1 requires (array or map) type, however, 'var1'
is of string type.;"
How should I address this issue? My Spark version is '1.6.3'
Thanks
I tried regexp and finally was able to find those blanks!!
dtnew = dt.withColumn('test', F.regexp_replace(F.col('var1'), r'\s+|,', ''))
dtnew.where(F.col('test')=='').count()
6908
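As a follow-up sketch (an addition, not part of the original answer): F.length accepts string columns, unlike F.size, and blanks can then be turned into proper NULLs with when(); var1 and dt are the names from the question:
from pyspark.sql import functions as F

# Count blank or whitespace-only strings; length() works on string columns.
dt.where(F.length(F.trim(F.col('var1'))) == 0).count()

# Keep non-blank values and let blanks fall through to NULL, so isNull() now counts them too.
dt_clean = dt.withColumn('var1', F.when(F.trim(F.col('var1')) != '', F.col('var1')))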

Getting ValidationFailureSemanticException on 'INSERT OVERWRITE'

I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content of 'mytable' into a Hive table (which is partitioned) using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) -- (1): Note: I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate -- (2): Note: I have a field named 'TestDate'; both (1) and (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because of the matching names, i.e. testdate (the partition in Hive) and TestDate (the field in the temp table 'mytable').
Whereas if my partition name testdate is different from the field name (i.e. TestDate), the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) -- Note: here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark... but I would like a second opinion. Am I missing something here?
@DuduMarkovitz @dhee; apologies for the late response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase (in the CREATE statement), which seems to have been the reason for the exception. Now I have created the table using DDL where the field names are in lower case, and this has resolved my issue.
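For illustration only, a hedged sketch of the kind of lower-case DDL described above; the column names and types are assumptions based on the question's Field1/Field2/TestDate:
# All column and partition names are lower case, so the partition column in
# INSERT OVERWRITE ... PARTITION(testdate) matches the table definition exactly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS myhivedb.myhivetable (
        field1 STRING,
        field2 STRING
    )
    PARTITIONED BY (testdate STRING)
    STORED AS PARQUET
""")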

Spark SQL: how to count duplicate values

I have a 100-million-row table and I would like to know how many unique values I have in the CTAC column.
I tried:
SELECT COUNT(*)
FROM ( SELECT CTAC
FROM my_table
GROUP BY CTAC
HAVING COUNT(*) > 1)
but this gives me an error :
sql.AnalysisException : cannot recognize input near '<EOF>' in subquery source
Can we do a subquery in Spark? If so, how?
Which query should I try to solve my question ?
Try it differently:
println(dataFrame.select("CTAC").distinct.count)
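As a side note (an assumption, not part of the original answer): the parser error above is most likely caused by the missing alias on the derived table; the same query parses once the subquery is aliased, and the distinct count has a direct PySpark equivalent:
# Duplicate-value query with the derived table aliased as t, which the Hive parser requires.
spark.sql("""
    SELECT COUNT(*)
    FROM (SELECT CTAC FROM my_table GROUP BY CTAC HAVING COUNT(*) > 1) t
""").show()

# Distinct count of CTAC, equivalent to the Scala one-liner above.
print(spark.table("my_table").select("CTAC").distinct().count())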
