Can I know z-order column? - databricks

Is there any way to know which column was used for z-ordering for a given table? I've tried multiple commands like describe and describe extended along with viewing the delta log. I found no information which column was used to perform the optimization.

You can find this information in the history of the table. Filter by operation = 'OPTIMIZE' then the operationParameters column will be a struct with fields predicate and zOrderBy (string encoding a JSON array of ZOrder columns)...
For example, when I optimized table by with ZOrder by rnd column

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys. So given that uniqueness, each record can have multiple instances in this table. Each representing a historical entry of a change(across one or more columns of that record) Now if I wanted to find out cases where a specific column value changed I can easily achieve that by doing something like this -->
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 as "Before", t2.Col12 as "After"
from table1 t1 inner join table t2 on t1.Key1= t2.Key1 and t1.Key2 = t2.Key2
and t1.Key3 = t2.Key3 where t1.Col12 != t2.Col12
However, these tables have a large amount of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this. Essentially a list of all columns that changed. I don't care about the actual value that changed. Just a list of column names that changed across all records. Doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these type of use cases. https://docs.databricks.com/delta/delta-change-data-feed.html

How do I create a measure in Power Pivot that pulls a value from another table?

I have two tables that use a unique concatenated column for their relationship. I simply want to make a measure that uses the values from C4 of Table1. I thought I could use a simple formula like =values(Table1[C4]) but I get an error of "A table of multiple values was supplied where a single value was expected."
Side note: I realize the concatenation is unnecessary here in this simple example, but it is necessary in the full data I am working with which is why I added it into this example.
Here's a simplified set of tables for what I am trying to do:
Table1
Table2
Relationships
First you should think. Do I really need a Calculated column? Can't this be calculated at runtime?
But if you still want to do it, you can use RELATED or RELATEDTABLE.
Keep in mind if you are pulling from RELATEDTABLE, returns many values. So you should apply some type of aggregation like SUMX or MAXX.
You can use context transition to retrieve the value.
= CALCULATE(MAX(Table1[C4])

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found below in the wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It said that 'different rows in the same column family do not have to share the same set of columns', but how to implement it? I have almost read all the documents in the offical site.
I can create table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
It's very like I did in the relational database. So how to create multiply rows with different columns?
I also found the offical document mentioned 'Flexible schema', how to understand it here?
Thanks very much in advance.
Column family is from the original design of Cassandra, when the data model looked like the Google BigTable or Apache HBase, and Thrift protocol was used for communication. But this required that schema was defined inside the application, and that makes access to data from many applications more problematic, as you need to update the schema inside all of them...
The CREATE TABLE and INSERT is a part of the Cassandra Query Language (CQL) that was introduced long time ago, and replaced Thrift-based implementation (Cassandra 4.0 completely removed the Thrift support). In CQL you need to have schema defined for a table, where you need to provide column name & type. If you really need to have dynamic columns, there are several approaches to that (I'll link answers that I already wrote over the time, so there won't duplicates):
If you have values of the same type, you can use one column as a name of the attribute/column, and another to store the value, like described here
if you have values of different types, you can also use one column as a name of attribute/column, and define multiple columns for values - one for each of the data types: int, text, ..., and you insert value into the corresponding columns only (described here)
you can use maps (described here) - it's similar to first or second, but mostly designed for very small number of "dynamic columns", plus have other limitations, like, you need to read the full map to fetch one value, etc.)

Avoid DISTINCTCOUNT in PowerPivot

Due to performance issues I need to remove a few distinct counts on my DAX. However, I have a particular scenario and I can't figure out how to do it.
As example, let's say one or more restaurants can be hired at one or more feasts and prepare one or more menus (see data below).
I want a PowerPivot table that shows in how many feasts each restaurant was present (see table below). I achieved this by using distinctcount.
Why not precalculating this on Power Query? The real data I have is a bit more complex (more ID columns) and in order to be able to pivot the data I would have to calculate thousands of possible combinations.
I tried adding to my model a Feast dimensional table (on the example this would only be 1 column of 2 rows). I was hoping to use that relationship to be able to make a straight count, but I haven't been able to come up with the right DAX to do so.
You could use COUNTROWS() combined with VALUES().
Specifically, COUNTROWS() will give you the count of rows in a table. That means COUNTROWS is expecting a table is input. Here's the magic part: VALUES() will return a table as results, and the table it returns are the distinct values in the table/column that you provide as the argument for VALUES().
I'm not sure if I'm explaining it well, so for the sample data you provided, the measure would look like this (assuming the table is named Table1):
Unique Feasts:=COUNTROWS(VALUES('Table1'[Feast Id]))
You can then create a pivot table from Powerpivot, and drag Restaurant Id into Rows, and drag the measure above into Values. Same result as DISTINCTCOUNT, but with less performance overhead (I think).

How to arrange data in Cassandra to get data in last in first out format

As we cannot sort data in Cassandra, I wanted to store data in such format that when I retrieve the data, I need to get data in ' last in first out format ' i.e if user enter comments when I retrieve data, I should first get very latest comment first and then older comments. I think it's something to do with comparator.
I have set following when configuring Cassandra:
assume posts comparator as utf8;
assume posts validator as utf8;
assume posts keys as utf8;
Please help - how should I create the column to arrange data in time format so that latest data is stored first?
Columns in a row are always sorted, and you can iterate over the columns in a row in reverse order. Given these two facs we could model the situation you're describing by storing comments in a column family called "comments" where the row key is the post ID, and the columns represent the comments to the corresponding post. The columns are timestamts (either ISO formatted dates, UNIX timestamps or time UUIDs) and the values are the comment text bodies.
If you would now get the columns for a row and specify that you wanted them in reverse order you would get what you want. How to specify reverse order depends on your driver, but it's usually just an option to the command that retrieves a row, or a column slice.
Another way, which is more hackish, would be to take the UNIX timestamp of a post, and subtract it from a large integer, like 2^31, and use that as column key. That way columns would sort in reverse order by default. It's not pretty and the above method is more elegant.
If you worry about using timestamps because there could be collisions where two comments are posted at exactly the same time, use Cassandra's time UUID type.
You need to organize your data such that the comparator is a timestamp. You store your data in natural order and specify reverse order in your slice query.

Resources