Spark SQL ignoring dynamic partition filter value - apache-spark

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition, which holds date values. My goal is to reference the max value of this column as a filter. The values look like 2019-01-01 for January 1st, 2019.
In the query below, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the one for the resulting max value.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, if I hardcode the value, Spark reads only the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why does Spark not treat both methods the same way? I made sure that running select max(mypartitioncolumn) from myothertable returns exactly the same value as in my hardcoded method (and the same data type).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the schema in both the source table and the value are string types, and I also tried casting the value to a string, e.g. cast((select max(mypartitioncolumn) from myothertable) as string); it doesn't make any difference.

Workaround by changing configuration
sql("set spark.sql.hive.convertMetastoreParquet = false")
Spark docs
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Related

Avoiding use of SELECT in WHERE

I have an input file on HDFS in CSV format with the following columns: date, time, public_ip
Using this I need to filter out data from quite a big table (~100M rows daily). The table has the following structure (roughly):
CREATE TABLE big_table (
`user_id` int,
`ip` string,
`timestamp_from` timestamp,
`timestamp_to` timestamp)
PARTITIONED BY (`PARTITION_DATE` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat';
I need to read the CSV data and then filter big_table, checking which user_ids have been using the desired IP address in the selected period.
I tried using Spark SQL with different joins, without much luck. Whatever I do, Spark is simply not "smart" enough to limit the big table to the matching partitions. I also tried WHERE PARTITION_DATE IN (SELECT DISTINCT date FROM csv_file), but this was also quite slow.
The CSV should contain up to 20 or so different days.
Here's my solution - I ended up picking the distinct days and passing them in as a string:
spark.sql("select date from csv_file group by date").createOrReplaceTempView("csv_file_uniq_date")
val partitions = spark.sql("select * from csv_file_uniq_date").collect.map(r => s"'${r.get(0)}'").mkString(",")
spark.sql("select user_id, timestamp_from, timestamp_to from big_table where partition_date in (" + partitions + ") group by user_id, timestamp_from, timestamp_to").write.csv("output.csv")
Now, this does the job - I cut the tasks from hundreds of thousands to thousands, but I feel quite unhappy with the implementation. Could someone point me in the right direction? How can I avoid pulling the partition values to the driver as a comma-separated string?
Using spark 2.2
Cheers!
What you expect is called Dynamic Partition Pruning, with which Spark is smart enough to resolve the partitions to scan directly from the join condition.
This feature is available from Spark 3.0 onwards.
It is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled and is enabled by default; see the Spark SQL performance tuning documentation for more details.
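For illustration, a hedged sketch of how the same filter could be expressed on Spark 3.0+ (PySpark here), reusing the table names from the question; whether pruning actually kicks in depends on the plan, e.g. the distinct-dates side being small enough to broadcast:
# Dynamic partition pruning is enabled by default on 3.0+; set here only for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

spark.sql("""
    select b.user_id, b.timestamp_from, b.timestamp_to
    from big_table b
    join (select distinct date from csv_file) d on b.partition_date = d.date
    group by b.user_id, b.timestamp_from, b.timestamp_to
""").write.csv("output.csv")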

Creating PySpark data frame without any alterations in column names

I am creating tables using Spark SQL with the CTAS command below.
CREATE TABLE TBL2
STORED AS ORC
LOCATION "dbfs:/loc"
TBLPROPERTIES("orc.compress" = "SNAPPY")
AS
SELECT Col1
, ColNext2
, ColNext3
, ...
FROM TBL1
After that, I am reading the files under the newly created location (TBL2) using the PySpark code below. However, the resulting data frame has all column names in lowercase only, whereas I expect camel case, as written by the CTAS above.
df = spark.read.format('ORC') \
.option('inferSchema',True) \
.option('header',True) \
.load('dbfs:/loc')
df.show()
Actual output:
col1 colnext2 colnext3 ...
Expected Output:
Col1 ColNext2 ColNext3 ...
In version 2.3 and earlier, when reading from a Parquet data source table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether spark.sql.caseSensitive is set to true or false. Since 2.4, when spark.sql.caseSensitive is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. This change also applies to Parquet Hive tables when spark.sql.hive.convertMetastoreParquet is set to true.
source
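If the immediate goal is just to get the camel-case names back, one pragmatic workaround (not part of the quoted migration note) is to rename the columns after loading; the lowercase/camel-case pairs below are the ones shown in the question:
df = spark.read.orc('dbfs:/loc')
# Restore the camel-case names used in the CTAS statement.
for lower_name, wanted_name in [('col1', 'Col1'), ('colnext2', 'ColNext2'), ('colnext3', 'ColNext3')]:
    df = df.withColumnRenamed(lower_name, wanted_name)
df.show()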

Unable to coerce to a formatted date - Cassandra timestamp type

I have values stored in a timestamp column of a Cassandra table, in the format
2018-10-27 11:36:37.950000+0000 (a GMT date).
I get Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long) when I run the query below to get data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How can I get the query working if the data is already stored in the table (in the format 2018-10-27 11:36:37.950000+0000), and also perform range (>= or <=) operations on the create_date column?
I tried create_date='2018-10-27 11:36:37.95Z' and create_date='2018-10-27 11:36:37.95' too.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. Using cqlsh to run query on cassandra table.
In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra works with millisecond precision - try removing the last three digits: .950 instead of .950000. Timestamps are stored inside Cassandra as 64-bit numbers and are formatted when printing results using the format specified by the datetimeformat option of cqlshrc. Dates without an explicit timezone require that a default timezone be specified in cqlshrc.
Regarding your question about filtering the data - such a query will work only for small amounts of data; on bigger data sets it will most probably time out, as it needs to scan all data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside a single partition.
If you want to perform such queries, then the Spark Cassandra Connector may be the better choice, as it can efficiently select the required data, and you can then perform sorting, etc. It will, however, require considerably more resources.
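For example, a rough PySpark sketch of reading the table through the connector and applying a range filter; the host, the keyspace name, and having the connector on the classpath are all assumptions here:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
    .appName("cassandra-range-query")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate())

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="test_table")
    .load())

# The range filter and any sorting run in Spark; the connector pushes down what it can.
df.filter(col("create_date") >= "2018-10-27").orderBy("create_date").show()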
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.
This works for me:
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss");
var query = $"SET updatedat = '{datetime}' WHERE ...

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query, or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The data is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like this :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
The table I'm inserting into is stored as ORC (the line above should probably have .format('orc')).
I found that if I specify a column as varchar(xx), the input string will be cut off to xx characters.
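If data quality is the main worry, a small sketch of one way to make the truncation explicit on the Spark side before the insert (this is just an option, not something required by Hive; the 100-character limit mirrors the varchar(100) example above):
from pyspark.sql.functions import substring

# Explicitly truncate to 100 characters before inserting into the varchar(100) column.
df = temp_table.select(substring(temp_table.day.cast('string'), 1, 100).alias('day'))
The resulting df can then be written exactly as in the Edit above.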
Thx

SPARK-HIVE-key differences between Hive and Parquet from the perspective of table schema processing

I am new to Spark and Hive. I do not understand the statement
"Hive considers all columns nullable, while nullability in Parquet is significant"
Could anyone explain the statement with an example? Thank you.
In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. it may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error). Nullable is the default.
Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.
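For instance, a short sketch of requesting schema merging when reading Parquet files written with different versions of the schema (the path is hypothetical):
# Reconcile the schemas of all Parquet files under the path; columns missing
# from older files come back as null for those rows.
df = spark.read.option("mergeSchema", "true").parquet("/data/events_parquet")
df.printSchema()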
Hive HQL syntax does not support the standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control on its data files!
Imagine a Hive partitioned table with 2 partitions...
- one partition uses TextFile format and contains CSV dumps from different sources, some showing all expected columns, some missing the last 2 columns because they use an older definition
- the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but the older Parquet files are also missing the last 2 columns, because they were created using the older table definition
For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.
Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.
