How to update the values of partitioned columns in Delta? - delta-lake

I wanted to know whether it is possible to update the values of the column on which the Delta table is partitioned.
The table is partitioned on a particular column, and now I want to update the value of that column. Can I do that?
(found on slack)

Use the replaceWhere option.
Quoting the official documentation at Replace table schema:
By default, overwriting the data in a table does not overwrite the schema. When overwriting a table using mode("overwrite") without replaceWhere, you may still want to overwrite the schema of the data being written. You replace the schema and partitioning of the table by setting the overwriteSchema option to true.
Quoting the article Selectively updating Delta partitions with replaceWhere:
Delta makes it easy to update certain disk partitions with the replaceWhere option.
replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions.
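A minimal PySpark sketch of the idea (the /delta/events path, the date partition column, and the specific values are all hypothetical, and an existing spark session is assumed): read the partitions affected by the change, rewrite the partition-column value, and overwrite only those partitions with replaceWhere.
from pyspark.sql import functions as F

path = "/delta/events"  # hypothetical table path

# Read every partition the rewrite touches: the one whose value changes and the
# one that receives the rows, so existing data in the target partition is kept.
df = (spark.read.format("delta").load(path)
        .where(F.col("date").isin("2021-01-01", "2021-01-02"))
        .withColumn("date",
                    F.when(F.col("date") == "2021-01-01", F.lit("2021-01-02"))
                     .otherwise(F.col("date"))))

# replaceWhere overwrites only the partitions matching the predicate;
# every row being written must satisfy it.
(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "date IN ('2021-01-01', '2021-01-02')")
   .save(path))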

Related

Can sort() and cache() combined in Spark increase filter speed like creating an index column in SQL?

We know that in SQL an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in Spark? Say I have a big table T containing a column C that I want to filter on, and I want to filter tens of thousands of id sets on column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets against the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown and in general give a boost in performance.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown.
Even if you aren't sorting the data, you may want to look at storing it with DISTRIBUTE BY or CLUSTER BY, which is very similar to repartitioning by a column. Again, this only boosts performance if you query the data along the same lines you distributed it.
If you intend to re-query often, then yes, you should cache the data, but in general there are no indexes. (There are file formats that boost performance for specific query needs, e.g. row-based vs. columnar.)
You should also look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance without indexes; a sketch of these ideas follows below.
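A rough PySpark sketch of the advice above (the output path and the filter value are made up; df is assumed to be the DataFrame for table T with filter column C): distribute and sort the data by the filter column before writing, enable Adaptive Query Execution, and cache the result if it will be re-queried often.
# Enable Adaptive Query Execution (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Co-locate and sort rows by the filter column C, roughly what
# DISTRIBUTE BY C SORT BY C (i.e. CLUSTER BY C) does in SQL, so each
# file carries tight min/max statistics for C.
(df.repartition("C")
   .sortWithinPartitions("C")
   .write
   .mode("overwrite")
   .parquet("/data/t_by_c"))

# Filters on C can now prune files/row groups via predicate pushdown;
# cache the result if the same data will be filtered repeatedly.
t = spark.read.parquet("/data/t_by_c").cache()
t.where("C = 42").show()   # 42 is a hypothetical id value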
If you are working with Hive (note that Hive has its own version of partitions): depending on how you will query the data, you may also want to look at partitioning:
[Hive] Partitioning is mainly helpful when we need to filter our data based on specific column values. When we partition tables, subdirectories are created under the table’s data directory for each unique value of a partition column. Therefore, when we filter the data based on a specific column, Hive does not need to scan the whole table; it rather goes to the appropriate partition, which improves the performance of the query. Similarly, if the table is partitioned on multiple columns, nested subdirectories are created based on the order of partition columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of accessing data is different from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table was used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table wasn't partitioned. (It's a tool in your tool shed.)
Another Hive-specific tip: the other thing you want to think about is keeping your table stats up to date. The cost-based optimizer uses those statistics to plan your queries, so make sure to keep them current (re-run after roughly 30% of your data has changed):
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS
  [FOR COLUMNS]      -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]   -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];
-- (Note: fully qualified table names are supported since Hive 1.2.0, see HIVE-10007.)
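For example, statements along these lines could be run periodically from Hive or, as sketched here, via spark.sql against a Hive-backed table (the sales table and its product_id column are hypothetical):
# Refresh table-level and column-level statistics for the optimizer.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS product_id")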

Databricks query performance when filtering on a column correlated to the partition-column

Setting: Delta Lake, Databricks SQL compute used by Power BI.
I am wondering about the following scenario: we have a column timestamp and a derived column date (the date of timestamp), and we choose to partition by date. When we query, we use timestamp in the filter, not date.
My understanding is that Databricks a priori won't connect the timestamp and the date, and so seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even if it's on a column we don't explicitly use in the query.
Is this correct?
What is the performance cost (roughly) of having to filter away files in this way vs. using the partitioning directly?
Will Databricks have all the min/max information in memory, or does it have to go out and look at the files for each query?
Yes, Databricks will take implicit advantage of this partitioning through data skipping, because there will be min/max statistics associated with specific data files. The min/max information will be loaded into memory from the transaction log, but it will still need to decide which files to hit on every query. Because everything is in memory, the overhead shouldn't be very big until you have hundreds of thousands of files.
One thing you may consider is using a generated column instead of an explicit date column. Declare it as date GENERATED ALWAYS AS (CAST(timestampColumn AS DATE)) and partition by it. The advantage is that when you query on timestampColumn, Delta should do partition filtering on the date column automatically.
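A sketch of that suggestion (the events table and its payload column are hypothetical): Delta derives date from timestampColumn and can rewrite a filter on timestampColumn into a partition filter on date.
spark.sql("""
    CREATE TABLE events (
        timestampColumn TIMESTAMP,
        payload STRING,
        date DATE GENERATED ALWAYS AS (CAST(timestampColumn AS DATE))
    )
    USING DELTA
    PARTITIONED BY (date)
""")

# A query that filters only on timestampColumn should still prune date partitions.
spark.sql("SELECT * FROM events WHERE timestampColumn >= '2023-05-01'").explain()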

Spark parquet partitioning which removes the partition column

If I am using df.write.partitionBy(col1).parquet(path), the partition column is removed from the written data.
How do I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files, so that when you read the files it adds it back as a column and displays it to you in table format. If you check the schema of the table or the schema of the DataFrame, you would still see it as a column.
Also, since you are partitioning your data, you presumably know how the table is queried most frequently and decided to partition based on that, so that your reads become faster and more efficient.
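A quick illustration of that behaviour (the path and column name are made up, and df/spark are assumed to exist):
# Write partitioned by col1; the values move into the directory names.
df.write.mode("overwrite").partitionBy("col1").parquet("/tmp/parted")

# Reading the partitioned directory back: Spark reconstructs col1 from the
# directory names (col1=<value>/...) and adds it to the schema again.
spark.read.parquet("/tmp/parted").printSchema()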

Query in Cassandra that will sort the whole table by a specific field

I have a table like this
CREATE TABLE my_table(
category text,
name text,
PRIMARY KEY((category), name)
) WITH CLUSTERING ORDER BY (name ASC);
I want to write a query that will sort by name through the entire table, not just each partition.
Is that possible? What would be the "Cassandra way" of writing that query?
I've read other answers on Stack Overflow, and some examples created a single partition with one id (bucket) as the primary key, but I don't want that because I want my data spread across the nodes by category.
Cassandra doesn't support sorting across partitions; it only supports sorting within partitions.
So what you could do is query each category separately and it would return the sorted names for each partition. Then you could do a merge of those sorted results in your client (which is much faster than a full sort).
Another way would be to use Spark to read the table into an RDD and sort it inside Spark.
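A hedged sketch of the "merge in the client" approach, assuming the DataStax Python driver and a known list of categories (the contact point, keyspace, and category values are made up): each per-partition result already arrives sorted by the clustering column name, so a k-way merge keeps the combined stream sorted.
import heapq
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")   # hypothetical keyspace
categories = ["books", "games", "music"]                  # hypothetical partition keys

def names_in(category):
    # Rows within one partition come back ordered by the clustering column `name`.
    rows = session.execute(
        "SELECT name FROM my_table WHERE category = %s", (category,))
    for row in rows:
        yield row.name

# heapq.merge lazily merges the already-sorted per-partition streams.
for name in heapq.merge(*(names_in(c) for c in categories)):
    print(name)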
Always model Cassandra tables around your access patterns (relational databases and Cassandra fill different needs).
Up to Cassandra 2.x, one had to model a new column family (table) for each access pattern. So if your access pattern needs a specific column to be sorted, model a table with that column in the partition/clustering key. The code will then have to insert into both the master table and the projection table. Note that, depending on your business logic, this may be difficult to synchronise if there are concurrent updates, especially if an update has to be performed after a read of the projections.
With Cassandra 3.x there are now materialized views, which give you a similar feature but are handled internally by Cassandra. I'm not sure they fit your problem, as I didn't play much with 3.x, but they may be worth investigating.
More on materialized views can be found on the DataStax blog.

How do wide-column databases achieve dynamic columns in the storage layout?

As is known, in relational databases, when adding a new column, data must be reallocated (ALTER TABLE without locking the table?) to keep each row contiguous on disk.
I would like to understand how this is achieved in wide-column stores such as Cassandra, which are sparse and can handle lots of dynamic column insertions (http://www.datastax.com/dev/blog/thrift-to-cql3, Dynamic Column Family).
Thanks!
In Cassandra, adding a column is a bit like adding a row in a relational database. You can even delete a column for a specific row:
delete first_name from user where user_id='abcd';
In CQL, ALTER TABLE doesn't modify all rows; in short, it just modifies the schema dictionary that describes tables (look at the tables prefixed with schema_ in the system keyspace). This changes only CQL parsing (the new column is now recognized) and interpretation (the meaning of select * from user changes).
When you drop a column, the data no longer appears in query results, yet it is still present in the SSTables. The data will be removed (and space freed) during a future compaction (like tombstones).
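To illustrate (reusing the user table from the delete example above; the use of the DataStax Python driver, the contact point, and the keyspace are assumptions): schema changes return almost immediately because only schema metadata is touched, while dropped data lingers in SSTables until compaction.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")    # hypothetical keyspace

# Metadata-only change: no existing row is rewritten, regardless of table size.
session.execute("ALTER TABLE user ADD middle_name text")

# Dropping the column again also only updates the schema dictionary; the cell
# data stays in the SSTables until a future compaction removes it.
session.execute("ALTER TABLE user DROP middle_name")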

Resources