Z order column in a Databricks table

I am working on creating a notebook which end users can run by providing the table name as input, to get an efficient sample query back (one that utilises the partition key and Z order column). I can get the partition column with DESCRIBE TABLE or spark.catalog, but I am not able to find a way to get the Z order column from the table metadata. Is there one?
The code for getting the partition column is given below.
columns = spark.catalog.listColumns(tableName=tablename, dbName=dbname)
partition_columns_details = list(filter(lambda c: c.isPartition, columns))
partition_columns = [c.name for c in partition_columns_details]

Maybe first an important thing to know about the difference between partitioning and Z ordering: once you partition a table on a column, that partitioning remains after each transaction. If you do an insert, update, optimize, ..., the table will still be partitioned on that column.
This is not the case for Z ordering. If you Z order a table during an OPTIMIZE and afterwards do an insert, update, ..., it is possible that the data is no longer well clustered.
That being said, here is an example to find which column(s) the last Z ordering was done on:
from pyspark.sql import functions as F

df = spark.sql(f'DESCRIBE HISTORY {table_name}')
(df.filter(F.col('operation') == 'OPTIMIZE')        # keep only OPTIMIZE operations
   .orderBy(F.desc('timestamp'))                    # most recent first
   .select('operationParameters.zOrderBy')          # the Z order columns, stored as a JSON string
   .collect()[0].zOrderBy)
You can probably expand the code above with some extra information, e.g. were there many other transactions done after the Z ordering? Note that not all transactions 'destroy' the Z ordering result. VACUUM, for example, will be registered in the history but does not impact Z ordering. After a small INSERT you will probably also still benefit from the Z ordering that was done before.
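Putting the two lookups together, here is a minimal sketch of the kind of helper such a notebook could expose. It assumes a Databricks notebook where spark is already defined; the function names, the <value> placeholders, and the database/table names in the final print are illustrative, not part of any API.
import json
from pyspark.sql import functions as F

def get_partition_columns(dbname, tablename):
    cols = spark.catalog.listColumns(tableName=tablename, dbName=dbname)
    return [c.name for c in cols if c.isPartition]

def get_zorder_columns(table_name):
    """Columns of the most recent OPTIMIZE ... ZORDER BY, or [] if the table was never optimized."""
    history = spark.sql(f"DESCRIBE HISTORY {table_name}")
    rows = (history.filter(F.col("operation") == "OPTIMIZE")
                   .orderBy(F.desc("timestamp"))
                   .select("operationParameters.zOrderBy")
                   .collect())
    return json.loads(rows[0].zOrderBy) if rows else []   # zOrderBy is a JSON string, e.g. '["col_a"]'

def build_sample_query(dbname, tablename):
    """Suggest a sample query that filters on the partition and Z order columns."""
    filters = get_partition_columns(dbname, tablename) + get_zorder_columns(f"{dbname}.{tablename}")
    where = " AND ".join(f"{c} = <value>" for c in filters) or "1 = 1"
    return f"SELECT * FROM {dbname}.{tablename} WHERE {where} LIMIT 100"

print(build_sample_query("my_db", "my_table"))   # hypothetical database and table names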

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and see how much it differs from the previous solution.
I tried this:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This tells me how much the tables differ (t1 is my solution, t2 is the original data). If a lot of data is different, I want to check where it differs.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This is not good for me because it takes about 1-2 hours (both tables have 30 columns and 30 million rows of data).
Do you have an idea how to calculate this more quickly?
EXCEPT is a kind of join on all columns at the same time. Does your data have a primary key? It could even be composite, comprising multiple columns, but that is still much better than taking all 30 columns into account.
Once you figure out the primary key you can do a FULL OUTER JOIN (see the sketch after this list) and:
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
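Here is a minimal sketch of that approach. It assumes a single primary-key column named id (illustrative) and that t1 and t2 from the question can be read as DataFrames via spark.table:
from pyspark.sql import functions as F

t1 = spark.table('t1').alias('a')
t2 = spark.table('t2').alias('b')

joined = t1.join(t2, F.col('a.id') == F.col('b.id'), 'full_outer')

only_in_t1 = joined.filter(F.col('b.id').isNull()).count()   # rows with no match on the right
only_in_t2 = joined.filter(F.col('a.id').isNull()).count()   # rows with no match on the left
# Comparing the remaining columns of the matching rows can then be done in a single pass,
# as in the sketch after the next answer.
print(only_in_t1, only_in_t2)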
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your dataset has a key / index, otherwise there is no ordering in either dataframe and you can't reliably compare the difference. Unless you have very limited resources, just do the join once to combine the two dataframes instead of running multiple except queries.
Check your data partitioning: Even if you use point 1 / the method you're using now, make sure the data is evenly distributed across an optimal number of partitions. Most of the time, data skew is one of the critical factors that lowers performance. If your key is a string, use repartition. If you're using a sequence number, use repartitionByRange.
Use a when-otherwise pair to check the difference: once you join the two dataframes, you can use a when-otherwise condition to compare the difference, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). That way you can compute all the differences within one action instead of multiple actions (see the sketch below for all columns at once).
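A self-contained sketch of that single-action comparison across every non-key column. It again assumes a key column called id and that t1 and t2 can be read via spark.table; pyspark.sql.functions is imported as func to match the answer:
from pyspark.sql import functions as func

t1 = spark.table('t1').alias('df1')
t2 = spark.table('t2').alias('df2')

joined = t1.join(t2, func.col('df1.id') == func.col('df2.id'), 'inner')

value_cols = [c for c in spark.table('t1').columns if c != 'id']
diff_counts = joined.select([
    # eqNullSafe treats two NULLs as equal, so a NULL-vs-value change still counts as a difference
    func.sum(func.when(~func.col(f'df1.{c}').eqNullSafe(func.col(f'df2.{c}')), 1).otherwise(0)).alias(c)
    for c in value_cols
]).collect()[0].asDict()   # one action: a dict of {column: number of differing rows}
print(diff_counts)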

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys. So given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS Before, t2.Col12 AS After
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this. Essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of the column names that changed across all records. It doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases: https://docs.databricks.com/delta/delta-change-data-feed.html
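If enabling CDF is not an option, a plain PySpark variant of the question's own self-join can also produce the list of columns that ever changed. A rough sketch, using the table and key names from the question; everything else is illustrative:
from pyspark.sql import functions as F

keys = ['Key1', 'Key2', 'Key3']
base = spark.table('table1')
t1 = base.alias('t1')
t2 = base.alias('t2')

# Self-join every historical instance of a record with every other instance of the same record.
cond = [F.col(f't1.{k}') == F.col(f't2.{k}') for k in keys]
joined = t1.join(t2, cond, 'inner')

value_cols = [c for c in base.columns if c not in keys]
counts = joined.select([
    F.sum(F.when(~F.col(f't1.{c}').eqNullSafe(F.col(f't2.{c}')), 1).otherwise(0)).alias(c)
    for c in value_cols
]).collect()[0].asDict()

changed_columns = [c for c, n in counts.items() if n]   # columns that differ in at least one pair
print(changed_columns)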

Query a Cassandra table according to fields that are not part of the partition key

I have a Cassandra table where a few columns are defined as the cluster key, but I need to also be able to filter on data in the other columns.
So let's say my table consists of the columns A, B, C, D, E, F.
Columns A, B are the cluster key, but I need the WHERE clause to include values in E or F, or both E and F,
so something like
SELECT * FROM My_Table WHERE A='x' AND B='y' AND E='t' AND F='g'
Cassandra will only allow this with the ALLOW FILTERING option which of course is not good.
What are my options?
It isn't easy to answer your question since you haven't posted the table schema.
If the columns E and F are not clustering keys, one option for you is to index the columns. However, indexes have their own pros and cons depending on the data types and/or data you are storing.
For more info, see When to use an index in Cassandra. Cheers!
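For illustration, a rough sketch of the secondary-index option using the Python driver. The contact point, keyspace name, and index names are hypothetical, and whether the combined E-and-F filter still needs ALLOW FILTERING depends on the actual schema:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # hypothetical contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

# Index the non-key columns first (weigh the pros and cons for your data before doing this).
session.execute('CREATE INDEX IF NOT EXISTS my_table_e_idx ON My_Table (E)')
session.execute('CREATE INDEX IF NOT EXISTS my_table_f_idx ON My_Table (F)')

# The query from the question, restricted on A/B plus one indexed column.
rows = session.execute(
    "SELECT * FROM My_Table WHERE A = %s AND B = %s AND E = %s",
    ('x', 'y', 't'),
)
for row in rows:
    print(row)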

How do you eliminate data skew when joining large tables in pyspark?

Table A has ~150M rows while Table B has about 60. In Table A, column_1 can and often does contain a large number of NULLs. This causes the data to become badly skewed and one executor ends up doing all of the work after the LEFT JOIN.
I've read several posts on a solution but I've been unable to wrap my head around the different approaches that span several different versions of Spark.
What operation do I need to take on Table A, and what operation do I need to take on Table B, to eliminate the skewed partitioning that occurs as a result of the LEFT JOIN?
I'm using Spark 2.3.0 and writing in Python. In the code snippet below, I'm attempting to derive a new column that's devoid of NULLs (which would be used to execute the join), but I'm not sure where to take it (and I have no idea what to do with Table B).
from pyspark.sql.functions import when, col, rand

# Replace NULL join keys with random values so those rows no longer pile up in one partition.
new_column1 = when(col('column_1').isNull(), rand()).otherwise(col('column_1'))
df1 = df1.withColumn('no_nulls_here', new_column1)
df1.persist().count()
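One common way to finish this approach (a sketch, not the only option): Table B usually needs no change for NULL-driven skew, because a NULL key can never match anything on the right side of a LEFT JOIN, so salting only the left side preserves the result. Here df2, the join_key name, and the assumption that Table B's join column is also called column_1 are all illustrative:
from pyspark.sql.functions import when, col, rand

# df1 = Table A, df2 = Table B (hypothetical DataFrame name), both assumed to join on column_1.
# rand() values are doubles in [0, 1), assumed not to collide with real key values.
df1 = df1.withColumn(
    'join_key',
    when(col('column_1').isNull(), rand()).otherwise(col('column_1'))
)
df2 = df2.withColumnRenamed('column_1', 'join_key')

# The formerly-NULL keys now spread across partitions and simply fail to match, as before.
result = df1.join(df2, on='join_key', how='left')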

Delete lots of rows from a very large Cassandra Table

I have a table Foo with 4 columns A, B, C, D. The partitioning key is A. The clustering key is B, C, D.
I want to scan the entire table and find all rows where D is in set (X, Y, Z).
Then I want to delete these rows but I don't want to "kill" Cassandra (because of compactions), I'd like these rows deleted with minimal disruption or risk.
How can I do this?
You have a big problem here. Indeed, you really can't find the rows without actually scanning all of your partitions. The real problem is that C* will only allow you to restrict your queries by the partition key, and then by your clustering keys in the order in which they appear in your PRIMARY KEY table declaration. So if your PK is like this:
PRIMARY KEY (A, B, C, D)
then you'd need to filter by A first, then by B, C, and only at the end by D.
That being said, for the part of finding your rows, if this is something you have to run only once, you could:
1. Scan your whole table and compare D in your application logic.
2. If you know the values of A, query every partition in parallel and then compare D in your application.
3. Attach a secondary index and try to exploit its speed.
Please note that, depending on how many nodes you have, option 3 is really not an option: secondary indexes don't scale.
If you need to perform such tasks multiple times, I'd suggest you create another table that satisfies this query, with something like PRIMARY KEY (D); you'd then just scan three partitions and that would be very fast.
About deleting your rows: I think there's no way to do it without triggering compactions; they are part of C* and you have to live with them. If you really can't tolerate tombstone creation and/or compactions, the only alternative is to not delete rows from a C* cluster, and that often means thinking about a new data model that won't need deletes.
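For the one-off case, here is a rough sketch of option 2 with the Python driver, assuming you already know the values of A. The contact point, keyspace, sample values, and the crude sleep-based throttle are placeholders for whatever pacing your cluster can tolerate:
import time
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # hypothetical contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

targets = {'X', 'Y', 'Z'}                 # the values of D to delete
known_a_values = ['a1', 'a2']             # hypothetical list of known partition keys

select_stmt = session.prepare('SELECT A, B, C, D FROM Foo WHERE A = ?')
delete_stmt = session.prepare('DELETE FROM Foo WHERE A = ? AND B = ? AND C = ? AND D = ?')

for a in known_a_values:
    for row in session.execute(select_stmt, (a,)):
        if row.d in targets:              # compare D in application logic
            session.execute(delete_stmt, (row.a, row.b, row.c, row.d))
            time.sleep(0.01)              # spread the tombstone writes out over time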
