Avoid shuffling when inserting into a sorted Iceberg table - apache-spark

I have an Iceberg table created with
CREATE TABLE catalog.db.table (a int, b int) USING iceberg
Then I apply some sort order on it
ALTER TABLE catalog.db.table WRITE ORDERED BY (a, b)
After invoking the last command, SHOW TBLPROPERTIES catalog.db.table starts showing the write.distribution-mode: range property:
|sort-order |a ASC NULLS FIRST, b ASC NULLS FIRST|
|write.distribution-mode|range |
Now I'm writing data into the table:
df = spark.createDataFrame([(i, i*4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("datalakelocal.ixanezis.table").append()
I thought this should create a single task in Spark, which would sort all the data in the dataframe (in fact it is already sorted at creation time) and then insert it into the table as a single file.
Unfortunately, at write time Spark decides to repartition the data, which causes a shuffle. I believe this happens because of write.distribution-mode: range, which was set automatically.
== Physical Plan ==
AppendData (6)
+- * Sort (5)
   +- Exchange (4)   # :(
      +- * Project (3)
         +- Coalesce (2)
            +- * Scan ExistingRDD (1)
Is there a way to insert new data but also avoid unwanted shuffling?

According to the Apache Iceberg docs, WRITE ORDERED BY does the following:
Iceberg tables can be configured with a sort order that is used to automatically sort data that is written to the table in some engines. For example, MERGE INTO in Spark will use the table ordering.
Now, you create and write your table with the following:
df = spark.createDataFrame([(i, i*4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("datalakelocal.ixanezis.table").append()
Sorting a dataframe requires a shuffle operation. You've used sortWithinPartitions, which does sort your data, but only within each partition. So it does not perform the full sort that your Iceberg table's sort order requires.
Hence, you still need a full shuffle to produce the complete sort.
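If the goal is to keep the single pre-sorted partition and skip the range exchange, one option (a sketch only, assuming Iceberg's Spark SQL extensions are enabled and the table names match the ones above) is to request only a locally ordered write, or to override write.distribution-mode back to none before appending:

# Sketch, not a verified fix.
# Option 1: keep a sort order but ask for a locally ordered write,
# which sorts within tasks and does not require a range exchange.
spark.sql("ALTER TABLE catalog.db.table WRITE LOCALLY ORDERED BY (a, b)")

# Option 2: keep WRITE ORDERED BY but override the distribution mode.
spark.sql("ALTER TABLE catalog.db.table SET TBLPROPERTIES ('write.distribution-mode' = 'none')")

df = (spark.createDataFrame([(i, i * 4) for i in range(100000)], ["a", "b"])
      .coalesce(1)
      .sortWithinPartitions("a", "b"))
df.writeTo("catalog.db.table").append()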

Related

Partition order when reading an Iceberg table with Spark

I have a large partitioned Iceberg table ordered by some columns. Now I want to scan through some filtered parts of that table using Spark and toLocalIterator(), preserving the order.
When my filter condition returns data from a single partition everything is OK - the rows are ordered as expected.
The problem happens when there are multiple partitions in the result - they come back to me in random order. Of course I can add ORDER BY to my select statement, but that triggers an expensive sort, which is unnecessary if I could just specify the order of the partitions explicitly.
The question is: how do I tell Spark to use that order (or some other order)? Or more broadly: how can I take advantage of the ordering columns in the Iceberg schema?
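For reference, a minimal PySpark sketch of the access pattern described above (the table and column names are made up for illustration): reading a filtered slice and iterating locally, where only the within-partition order is preserved:

# Illustrative sketch only; "catalog.db.big_table" and "event_dt" are hypothetical names.
df = spark.table("catalog.db.big_table").where("event_dt >= '2023-01-01'")

# Rows come back partition by partition; across partitions the order is not
# guaranteed unless an explicit ORDER BY (and its sort) is added.
for row in df.toLocalIterator():
    handle(row)   # hypothetical downstream processing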

Performance-related question on Spark + Cassandra (Java code)

I am using Cassandra as my dumping ground, on which I have multiple jobs running to process the data and update different systems. Below are the job-related filters:
Job 1: filter data based on active_flag, update_date_time and expiry_time, and process the filtered data.
Job 2: filter data based on update_date_time and process the data.
Job 3: filter data based on created_date_time and active flag.
The DB columns on which the WHERE condition would run are (one or many columns in one query):
Active -> yes/no
created_date -> timestamp
expiry_time -> timestamp
updated_date -> timestamp
My questions on these conditions:
How should I form my Cassandra primary key? I don't see any way to achieve uniqueness here (an id is present, but that's not required for processing the data).
Do I even need the primary key if I do the filtering in Spark code using a table scan?
This is for processing millions of records.
Answering your question - you need to have a primary key, even if it consists only of the partition key :-)
A more detailed answer really depends on how often these jobs run, how much data there is overall, how many nodes are in the cluster, what hardware is used, etc. Usually we try to push as much filtering to Cassandra as possible, so it returns only the relevant data, not everything. This filtering is most effective on the first clustering column. For example, if I want to process only newly created entries, I can use a table with the following structure:
create table test.test (
    pk int,
    tm timestamp,
    c2 int,
    v1 int,
    v2 int,
    primary key(pk, tm, c2));
and then I can fetch only newly created entries by using:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
Or I can fetch entries in the given time period:
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
The effect of the filter pushdown can be checked by executing explain on the dataframe and looking at the PushedFilters section - conditions marked with * will be executed on the Cassandra side.
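For example (a PySpark sketch, assuming the same test.test table and that the Spark Cassandra connector is available), the plan can be inspected like this:

# Read test.test through the Cassandra connector and check filter pushdown.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="test", keyspace="test")
      .load())

filtered = df.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
filtered.explain()   # the condition should appear under PushedFilters, marked with *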
But it's not always possible to design tables to match all queries, so you'll need to design the primary key for the jobs that are executed most often. In your case, update_date_time could be a good candidate, but if you make it a clustering column, then you'll need to take care when updating it - you'll need to perform the change as a batch, something like this:
begin batch
delete from table where pk = ... and update_date_time = old_timestamp;
insert into table (pk, update_date_time, ...) values (..., new_timestamp, ...);
apply batch;
or something like this.

How does Spark filter pushdown work with Cassandra table non-partition keys?

I have a table in Cassandra where date is not part of the partition key but is part of the clustering key. While reading the table in Spark I apply a date filter and it is pushed down. I want to understand how the pushdown works, because through CQL we cannot query directly on a clustering key. Is the data being filtered somewhere?
Implementation in Java:
transactions.filter(transactions.col("timestamp").gt(timestamp)) //column timestamp is of type timestamp
and the physical plan coming out as
== Physical Plan ==
*Project [customer_user_id#67 AS customerUserId#111, cast(timestamp#66 as date) AS date#112, city#70]
+- *Filter (isnotnull(timestamp#66) && isnotnull(city#70))
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#571db8b4 [customer_user_id#67,timestamp#66,city#70] PushedFilters: [IsNotNull(timestamp), *GreaterThan(timestamp,2018-08-13 00:00:00.0), IsNotNull(city)], ReadSchema: struct<customerUserId:int,date:date,city:string>
Also, for the timestamp this worked fine, but if the column is of type date then it was not pushing down the filter, even when the date was part of the partition key. I had to write it as transactions.filter("date >= cast('" + timestamp + "' as date)") to make it work. (The column date is of type date.)
When you don't have a condition on the partition key, the Spark Cassandra connector uses token ranges to perform an effective scan in parallel. So if you have a condition on some clustering column clasCol (greater-than, as in your example), the connector will generate a query like the following (pseudo-code, not real code - you can find the real CQL queries if you enable debug logging):
SELECT col1, col2, ... FROM ks.table WHERE
token(pk) > :startRange AND token(pk) <= :endRange
AND clasCol > :your-value ALLOW FILTERING;
Then Cassandra will perform effective range scans for multiple partitions on the same node. You can look at the connector's code if you want more details.
Regarding date - it requires looking deeper into the code, but maybe it's just a missing type conversion or something like that - you can check what queries were generated for both cases.
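A minimal PySpark sketch of the workaround quoted above for a date-typed column (the keyspace, table and column names are assumptions, not taken from a real schema): build the predicate with an explicit cast to date so the comparison can be pushed down, then verify with explain:

from datetime import date

# Hypothetical table with a date-typed column named "date".
transactions = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(table="transactions", keyspace="ks")
                .load())

d = date(2018, 8, 13)
pushed = transactions.filter("date >= cast('{}' as date)".format(d))
pushed.explain()   # check PushedFilters to confirm the comparison was pushed down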

Pruning partitions based on other columns

Let's consider an ORC table in Hive with a partition on a dt_month column, where each partition contains all rows for the days of that month (txn_dt).
Partition pruning works when I introduce a where clause directly on dt_month, like below.
df = spark.table("table")
df.where("dt_month = '2018-01-01'")
But is there a way for me to gather statistics at the partition level and prune partitions while filtering on txn_dt (the column that dt_month is derived from), since there is a transitive relationship between it and the partition column?
df = spark.table("table")
df.where("txn_dt = '2018-01-01'")
Can we make this query read only the 2018-01-01 partition and then use the ORC index, instead of running through the whole table and relying on ORC indexes alone?
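Spark cannot know on its own that dt_month is derived from txn_dt, so a common workaround (a sketch only; how dt_month is populated is an assumption here) is to add the redundant partition predicate explicitly. That gives partition pruning on dt_month, while the txn_dt predicate is still pushed down to the ORC reader within the surviving partition:

# Sketch: state both predicates explicitly.
df = spark.table("table")
pruned = df.where("dt_month = '2018-01-01' AND txn_dt = '2018-01-01'")
pruned.explain()   # dt_month should show up as a partition filter,
                   # txn_dt as a pushed-down data filter for ORC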

Spark-Hive partitioning

The Hive table was created with 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following line of Spark code inserts data into this table:
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen into the Hive table, it ends up with 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very, very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables using spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema; I don't see that in the specified CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could use the
.bucketBy(numBuckets, colName)
method of the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
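A minimal sketch of that approach in PySpark (the bucketing column "cells" is an assumption, since the real clustering column isn't shown in the question); note that bucketBy works with saveAsTable rather than insertInto, and produces Spark-style buckets rather than Hive-compatible ones:

# Sketch only: adjust the bucketing column to the table's real clustering column.
(hourlies.write
    .partitionBy("traffic_date_hour")
    .bucketBy(4, "cells")
    .sortBy("cells")
    .format("orc")
    .mode("append")
    .saveAsTable("hourlysuspect"))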
