How to cube on two columns as if they were one? - apache-spark

I have the following attributes on which I am interested in performing aggregations (a regular count, for example):
'category', 'sub-category', age, city, education... (around 10 more)
I am interested in all possible combinations of the attributes in the group by, so using the DataFrame cube function could help me achieve that.
But here is a catch: sub-category does not make any sense without category, so in order to achieve this I need to combine rollup(category, sub-category) with cube(age, city, education...).
How to do this?
This is what I tried, where test is the name of my table:
val data = sqlContext.sql("select category,'sub-category',age from test group by cube(rollup(category,'sub-category'), age )")
and this is the error I get:
org.apache.spark.sql.AnalysisException: expression 'test.category' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

I think what you want is the struct or expr function to combine the two columns into one and use that to cube on.
With struct it'd be as follows:
df.rollup(struct("category", "sub-category") as "(cat,sub)")
With expr it's as simple as using "pure" SQL, i.e.
df.rollup(expr("(category, 'sub-category')") as "(cat,sub)")
But I'm just guessing...
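In that spirit, here is a minimal PySpark sketch of the struct idea (the Scala API is analogous). The table name test comes from the question, while the alias cat_sub and the choice of extra columns are illustrative, and I haven't verified the behaviour across all Spark versions:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("test")  # the question's table

# Wrap category and sub-category into one struct column so the pair is kept or
# dropped as a single unit in every grouping set, then cube it with the rest.
result = (df
    .cube(F.struct("category", "sub-category").alias("cat_sub"), "age", "city")
    .count())
result.show()
Note that the struct makes (category, sub-category) appear or disappear as one unit in the grouping sets; you won't get a level where category is kept but sub-category is rolled up, which is the part a true rollup(category, sub-category) would add.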

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this -->
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 as "Before", t2.Col12 as "After"
from table1 t1 inner join table1 t2 on t1.Key1 = t2.Key1 and t1.Key2 = t2.Key2
and t1.Key3 = t2.Key3 where t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed. I don't care about the actual value that changed, just a list of column names that changed across all records. It doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases. https://docs.databricks.com/delta/delta-change-data-feed.html
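If CDF is not an option, here is a rough PySpark sketch of the brute-force approach the question describes: self-join on the 3 keys, then loop over the non-key columns and keep the ones with at least one mismatch. The table name table1 and the key names are the question's placeholders, and this runs one small job per column:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
base = spark.table("table1")  # placeholder table name from the question

keys = ["Key1", "Key2", "Key3"]
non_keys = [c for c in base.columns if c not in keys]

# Rename the non-key columns on each side so the self-join has no ambiguous names.
before = base.select(keys + [F.col(c).alias(c + "_before") for c in non_keys])
after = base.select(keys + [F.col(c).alias(c + "_after") for c in non_keys])
joined = before.join(after, keys, "inner")

# A column counts as "changed" if any joined pair disagrees (null-safe comparison).
changed_columns = [
    c for c in non_keys
    if joined.filter(~F.col(c + "_before").eqNullSafe(F.col(c + "_after"))).limit(1).count() > 0
]
print(changed_columns)
A single-pass variant could aggregate one mismatch count per column instead, but the loop stays closer to the query in the question.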

How to get a list of columns that will give me a unique record in Pyspark Dataframe

My intention is to write a Python function that would take a PySpark DataFrame as input, and its output would be a list of columns (could be multiple lists) that give a unique record when combined together.
So, if you take a set of values for the columns in the list, you would always get just 1 record from the DataFrame.
Example:
Input Dataframe
Name Role id
--------------------
Tony Dev 130
Stark Qa 131
Steve Prod 132
Roger Dev 133
--------------------
Output:
Name,Role
Name,id
Name,id,Role
Why is the output what it is?
For any Name,Role combination I will always get just 1 record
And, for any Name, id combination I will always get just 1 record.
There are ways to define a function which will do exactly what you are asking for.
I will only show 1 possibility and it is a very naive solution. You can iterate through all the combinations of columns and check whether they form a unique entry in the table:
import itertools as it
def find_all_unique_columns_naive(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res
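For example, rebuilding the sample DataFrame from the question and calling it (a quick sanity check; spark is assumed to be an existing SparkSession):
df = spark.createDataFrame(
    [("Tony", "Dev", 130), ("Stark", "Qa", 131), ("Steve", "Prod", 132), ("Roger", "Dev", 133)],
    ["Name", "Role", "id"],
)
print(find_all_unique_columns_naive(df))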
With a result for your example being:
[('Name',), ('id',), ('Name', 'Role'), ('Name', 'id'), ('Role', 'id'),
('Name', 'Role', 'id')]
There is obviously a performance issue, since this function's runtime grows exponentially with the number of columns, i.e. O(2^N), meaning that even a table with just 20 columns is already going to take quite a long time.
There are, however, some obvious ways to speed this up. For example, if you already know that the column Name is unique, then any combination which includes that already known unique combination must be unique as well, so you can deduce by that fact alone that (Name, Role), (Name, id) and (Name, Role, id) are unique, which reduces the search space quite efficiently. The worst case scenario remains the same, however: if the table has no unique combination of columns, you will have to exhaust the entire search space to reach that conclusion.
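As a sketch of that pruning (untested, and note it returns only the minimal unique combinations rather than all of them), the same structure can simply skip any combination that is a superset of one already known to be unique:
import itertools as it

def find_minimal_unique_columns(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            # Any superset of a known unique combination is unique as well,
            # so it never needs to be checked against the data again.
            if any(set(prev).issubset(comb) for prev in res):
                continue
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res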
As a conclusion, I'd suggest that you think about why you want this function in the first place. There might be specific use-cases for small tables, I agree, just to save some time, but to be completely honest, this is not how one should treat a table. If a table exists, then there should be a purpose for it and a proper table design, i.e. a definition of how the data inside the table are really structured and updated, and that should be the starting point when looking for unique identifiers. Even though you will be able to find other unique identifiers now with this method, it very well might be the case that the table design will destroy them with the next update. I'd much rather suggest using the table's metadata and documentation, because then you can be sure that you are treating the table the way it was designed for and, in case the table has a lot of columns, it is actually faster.

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the column user_id (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant column)
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (Note: this is equivalent to dropDuplicates())
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the column user_id.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
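If the only number needed is the size of the intersection, I'd expect the same count to also be expressible with intersect, which already returns distinct rows (an untested one-liner):
common_count = dfA.select("user_id").intersect(dfB.select("user_id")).count()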

Select where multiple fields are not in subquery (excluding join)

I have a requirement to pull records that do not have history in an archive table. 2 fields of 1 record need to be checked for in the archive.
In a technical sense, my requirement is a left join where the right side is null (a.k.a. an excluding join), which in ABAP Open SQL is commonly implemented like this (for my scenario anyway):
Select * from xxxx //xxxx is the result of a multiple-table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use the results of a sub-query to check more than 1 field, or to use another form of excluding join logic on multiple fields using SQL (without involving the application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is the result of multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = xxxx~foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join (where the main to secondary table relation is 1-*) and it appeared to be relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirements: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payee (*-* relationship between payer and payee).
The many-to-many join exponentially increased the record count within the construct I labeled XXXX, which in turn produces a lot of unnecessary work. For instance: a single customer with 3 payers and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter-looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both NOT IN and NOT EXISTS work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have a many-to-many join in the main query, NOT EXISTS ( select .. from table where id = a.id or id = b.id or ... ) appears to be the best.
The moment your exclusion criteria involve a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with one well-written sub-query).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on to expand it, and I learned that a proper excluding join works; I just failed horribly at implementing it the first time.
select headers~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use results of a sub-query to check more than 1 field or use another form of excluding join logic on multiple fields
No, it is not possible to check two columns in a subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar subquery.
Scalar is the keyword here, i.e. it should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
  FROM saplane AS plane
  WHERE seatsmax < @wa-seatsmax AND
        seatsmax >= ALL ( SELECT seatsocc
                            FROM sflight
                            WHERE carrid = @wa-carrid AND
                                  connid = @wa-connid )
however you say that these two fields should be checked against different tables
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
P.S. FOR ALL ENTRIES does not support negation logic; you cannot just use some sort of NOT IN FOR ALL ENTRIES. It won't be that easy.

Dynamics AX Nested Query

Maybe I'm missing something simple, but is there a way to write a nested query in AX? I tried some syntax I thought would work, but with no luck.
The following standard SQL statement would accomplish what I'm trying to do, but I need to do this in AX, not SQL.
SELECT table1.column1A, table1.column1B,
(SELECT Top 1 column2B FROM table2
WHERE table1.column1A = table2.column2A
ORDER BY table2.column1A)
AS lookupResult
FROM table1
My problem is that table1 has a one-to-many relationship with table2, and since AX doesn't have a DISTINCT function that I'm aware of, I receive many copies of each record when using a JOIN statement.
Thanks
Nested queries are not supported in AX.
One way to bypass the missing distinct is to use group by (assuming max value of column2B is interesting):
while select column1A, column1B from table1
    group by column1A, column1B
    join maxof(column2B) from table2
        where table2.column2A == table1.column1A
{
    ...
}
Another method would be to use a display method on table1 in the form or report.
display ColumnB column2B()
{
    return (select maxof(column2B) from table2
        where table2.column2A == this.column1A).column2B;
}
The performance is inferior to the first solution, but it may be acceptable.
As mentioned in the previous reply, group-by is the closest you can get to a distinct function. If you need a simpler query for some reason, or if you need a table or query object to use as a datasource on a form or report, you may entertain the idea of creating a view in the AOT which contains the group-by. You can then easily join to that view on a query object, form datasource, etc.
AX 2012 has support for computed columns in views; you can use the SysComputedColumn class to build the query you want.
