Spark-Hive partitioning - apache-spark

The Hive table was created with 4 buckets:
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following line in the Spark code inserts data into this table:
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in spark-defaults.conf the default parallelism is set to 128:
spark.default.parallelism=128
The problem is that when the inserts happen, the Hive table ends up with 128 files instead of the 4 buckets.
spark.default.parallelism cannot be reduced to 4, as that leads to a very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?

As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables through spark-sql. When creating a bucketed table there should be a CLUSTERED BY clause on one of the columns from the table schema, and I don't see that in the CREATE TABLE statement you posted. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy(numBuckets, colName)
call when using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
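For reference, a minimal sketch of what such a bucketed write could look like, assuming cells is picked as the clustering column (the original DDL does not name one) and that the table is written with saveAsTable, since bucketBy is not supported together with insertInto:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-write-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Stand-in for the hourlies DataFrame from the question; the source table name is hypothetical.
val hourlies = spark.table("staging_hourlies")

hourlies.write
  .mode("append")
  .partitionBy("traffic_date_hour")  // Hive partition column from the question
  .bucketBy(4, "cells")              // 4 buckets, clustered by the assumed column cells
  .sortBy("cells")                   // optional: keep rows sorted within each bucket
  .saveAsTable("hourly_suspect")     // bucketBy works with saveAsTable, not with insertInto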

Related

Partitions order when reading Iceberg table by Spark

I have a large partitioned Iceberg table ordered by some columns. Now I want to scan through some filtered parts of that table using Spark and toLocalIterator(), preserving the order.
When my filter condition selects data from a single partition, everything is OK: the rows are ordered as expected.
The problem happens when there are multiple partitions in the result: they come back to me in random order. Of course I can add ORDER BY to my SELECT statement, but that triggers an expensive sort, which is completely unnecessary if I could only specify the order of the partitions explicitly.
The question is: how do I tell Spark to use that order (or some other one)? Or, more broadly: how do I take advantage of the ordering columns in the Iceberg schema?
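For context, a minimal sketch of the access pattern being described; the table name, filter column, and values are placeholders, not from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-scan-sketch")
  .getOrCreate()

// Hypothetical Iceberg table and filter; only some partitions match the predicate.
val filtered = spark.table("db.events")
  .where("event_date >= '2024-01-01' AND event_date < '2024-02-01'")

// toLocalIterator() streams the result to the driver one Spark partition at a
// time; rows keep their order within each partition, but the order in which the
// partitions arrive is exactly what the question is about.
val it = filtered.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // process row
}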

Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column

I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no PARTITIONS. SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. It is causing issues downstream, because when we add a partition and try to MSCK REPAIR TABLE my_db.my_table, it says you can't run that on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we do not have the luxury of a schema, just the files underneath.
Why is Spark intermittently unable to determine the partition columns from a Parquet layout of this shape?
Unfortunately, with Hive you need to specify the schema, even though Parquet obviously carries it itself.
You need to add a PARTITIONED BY clause to the DDL, then use ALTER TABLE statements to add each partition separately with its location.
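A rough sketch of that approach through spark.sql; the non-partition column (col1 STRING) is a placeholder because the real schema is not shown in the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-ddl-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Declare the table as partitioned; the data column list here is assumed.
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_db.my_table (col1 STRING, part_date STRING)
  USING PARQUET
  PARTITIONED BY (part_date)
  LOCATION '/path/to/partitioned/parq'
""")

// Register each partition explicitly with its location...
spark.sql("""
  ALTER TABLE my_db.my_table ADD IF NOT EXISTS
  PARTITION (part_date = '2021_01_01_01_01_01')
  LOCATION '/path/to/partitioned/parq/part_date=2021_01_01_01_01_01'
""")

// ...or, once the table is declared as partitioned, let Spark discover them all.
spark.sql("MSCK REPAIR TABLE my_db.my_table")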

Write large data set around 100 GB having just one partition to hive using spark

I am trying to write a large dataset to a partitioned Hive table (partitioned by date) using Spark. The dataset results in just one date, so just one partition. It is taking a long time to write to the table, and it also causes a shuffle while writing. My code does not contain any joins; it has just some map functions, a filter, and a union. How can I efficiently write this kind of data to a Hive table?
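For reference, a rough sketch of the kind of job being described; the source tables, columns, and target name are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("single-date-write-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Map/filter/union pipeline with placeholder sources and columns.
val a = spark.table("staging_a")
  .filter(col("status") === "active")
  .select(col("id"), col("value"), col("event_date"))
val b = spark.table("staging_b")
  .select(col("id"), col("value"), col("event_date"))
val combined = a.union(b)

// All rows share a single event_date, so the write lands in one Hive partition.
combined.write
  .mode("append")
  .format("orc")
  .partitionBy("event_date")
  .saveAsTable("db.daily_table")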

Data is not getting written in sorted format on target oracle table through SPARK

I have a table in Hive with the below schema:
emp_id:int
emp_name:string
I have created a DataFrame from the above Hive table:
df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()
After the above code runs, I can see that the data is sorted properly on emp_id.
I am writing the data to an Oracle table with the code below:
df.write.jdbc(url=url, table='target_table', properties=properties, mode="overwrite")
However, the data in the target Oracle table does not come out sorted on emp_id. As per my understanding, this happens because multiple executor processes run at the same time on the different data partitions, the sort applied through the query only holds within each partition, and when multiple processes write to Oracle at the same time the ordering of the result table is distorted.
I further tried repartitioning the data down to a single partition (which is not an ideal solution), and after writing the data to Oracle the sorting worked properly.
Is there any way to write sorted data to an RDBMS from Spark?
TL;DR When working with relational systems you should never depend on the insert order. Spark is not really relevant here.
Relational databases, including Oracle, don't guarantee any intrinsic order of the stored data. The exact order of the stored records is an implementation detail and can change during the lifetime of the data.
The sole exception in Oracle are Index Organized Tables where:
data for an index-organized table is stored in a B-tree index structure in a primary key sorted manner.
This of course requires a primary key which can reliably determine order.
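For illustration only, a hedged sketch of creating such an index-organized target table over plain JDBC before the Spark write; the column types, JDBC URL, and credentials are assumptions, with only the column names taken from the question:

import java.sql.DriverManager

// Placeholder Oracle connection details.
val url = "jdbc:oracle:thin:@//db-host:1521/service"
val conn = DriverManager.getConnection(url, "user", "password")
try {
  // ORGANIZATION INDEX makes Oracle store the rows in primary-key (emp_id) order.
  conn.createStatement().execute(
    """CREATE TABLE target_table (
      |  emp_id   NUMBER PRIMARY KEY,
      |  emp_name VARCHAR2(100)
      |) ORGANIZATION INDEX""".stripMargin)
} finally {
  conn.close()
}

// The Spark side would then append into the existing table rather than overwrite it,
// e.g. df.write.mode("append").jdbc(url, "target_table", connectionProperties)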

Spark and cassandra, range query on clustering key

I have a Cassandra table with the following structure:
CREATE TABLE table (
key int,
time timestamp,
measure float,
primary key (key, time)
);
I need to create a Spark job which will read the data from the previous table within a specified start and end timestamp, do some processing, and flush the results back to Cassandra.
So the spark-cassandra-connector will have to do a range query on the clustering column of the Cassandra table.
Are there any performance differences if I do:
sc.cassandraTable(keyspace, table).
as(caseClassObject).
filter(a => a.time.after(startTime) && a.time.before(endTime)).....
so what I am doing is loading all of the data into Spark and applying the filtering there,
OR if I do this:
sc.cassandraTable(keyspace, table).
where(s"time > $startTime and time < $endTime")......
which filters all the data in Cassandra and then loads smaller subset to Spark.
The selectivity of a range query will be around 1%
It is not possible to include the partition key in the query.
Which of these two solutions is preferred?
sc.cassandraTable(keyspace, table).where(s"time > $startTime and time < $endTime")
will be MUCH faster. You are basically doing only a fraction of the total work (if you only pull 5% of the data, roughly 5% of the work) compared with the full grab in the first approach, to get the same data.
In the first case you are:
1. Reading all of the data from Cassandra.
2. Serializing every object and then moving it to Spark.
3. Then finally filtering everything.
In the second case you are:
1. Reading only the data you actually want from C*.
2. Serializing only this tiny subset.
3. There is no step 3.
As an additional comment, you can also put your case class type right in the call:
sc.cassandraTable[CaseClassObject](keyspace, table)
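Putting both points together, a minimal sketch with the spark-cassandra-connector; the keyspace, table, case class, and timestamps are placeholders:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder case class mirroring the CQL schema from the question.
case class Measurement(key: Int, time: java.util.Date, measure: Float)

val conf = new SparkConf()
  .setAppName("range-query-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val startTime = java.sql.Timestamp.valueOf("2024-01-01 00:00:00")
val endTime   = java.sql.Timestamp.valueOf("2024-01-02 00:00:00")

// The range predicate on the clustering column is pushed down to Cassandra,
// so only the matching ~1% of rows is read and serialized into Spark.
val rows = sc.cassandraTable[Measurement]("my_keyspace", "my_table")
  .where("time > ? and time < ?", startTime, endTime)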
