Power Query - Join multiple tables - Excel

I know this is something that should be done at the data-model level as a star schema, but in this specific case it is not possible.
I have 1 dimension table and 4 transaction tables. All of them share one unique column (order number).
Is there any way to get one huge table without merging them step by step and expanding each one? Is there a way to do it all in one step?
In the future this will be done in the warehouse, but for now this will serve as a proof of concept for several months.
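For context on that last point, the eventual warehouse version is a single multi-join statement rather than a chain of merge-and-expand steps. A minimal SQL sketch, with hypothetical names (dim_orders, trans_1 .. trans_4, order_number) standing in for the real tables:

-- One dimension table left-joined to four transaction tables on the shared key.
-- Caveat: if an order has multiple rows in more than one transaction table,
-- the result multiplies those rows out - which is what "one huge table" implies.
SELECT d.*, t1.*, t2.*, t3.*, t4.*
FROM dim_orders d
LEFT JOIN trans_1 t1 ON t1.order_number = d.order_number
LEFT JOIN trans_2 t2 ON t2.order_number = d.order_number
LEFT JOIN trans_3 t3 ON t3.order_number = d.order_number
LEFT JOIN trans_4 t4 ON t4.order_number = d.order_number;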

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of every change that ever took place on each record. Each record is uniquely identified by 3 keys, so a given record can have multiple rows in this table, each representing a historical change (across one or more columns of that record). Now if I wanted to find cases where a specific column's value changed, I could easily achieve that with something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS `Before`, t2.Col12 AS `After`
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any column that changed in a self-join like this - essentially a list of all columns that changed across all records. I don't care about the actual values that changed, just the column names, and it doesn't even have to be per row. The 3 keys are always excluded, since they uniquely define a record.
Essentially I'm trying to find the columns that are susceptible to change, so that I can focus on them for another purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
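A minimal sketch of what that could look like in Databricks SQL, assuming CDF gets enabled on the table (the table name table1 and the starting version 0 are placeholders):

-- One-time setup: start recording the change data feed for this table.
ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read all changes since version 0; _change_type distinguishes
-- 'insert', 'delete', 'update_preimage' and 'update_postimage' rows.
SELECT *
FROM table_changes('table1', 0)
WHERE _change_type IN ('update_preimage', 'update_postimage');

Note that CDF only captures changes made after it is enabled, so for the history that already exists you would still have to enumerate each non-key column in a self-join like the one in the question.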

Power Query - Alternative for Join to filter Records

I have two tables:
Table 1: one row per order, with a status (Online / Offline)
Table 2: multiple rows per order
Now I would like to reduce the number of rows in the second table based on the status (Offline) from Table 1.
Is there any alternative to a right join? The first table is filtered on status 'Offline'.
We are talking about several million rows, which takes some time to join.
Any thoughts on this from your side?
Some thoughts:
1. Create a relationship between these two tables and filter to "Offline".
2. Create a join (Merge queries) in Power Query and only select the On/Off status column to expand. The import then needs more time, but you get a flat dataset in Power BI.
3. Create a new column in Power BI with DAX and use LOOKUPVALUE.
Without seeing the data, I think I would try the first one. If it's too slow, then I think the only way is the second, even if it takes some more time for importing.
The third one might be the slowest.
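For what it's worth, what all three options are emulating is a semi-join. A sketch of the target semantics in SQL, with made-up names (orders for Table 1, order_lines for Table 2):

-- Keep only the detail rows whose order is marked 'Offline' in the header table.
SELECT l.*
FROM order_lines l
WHERE l.order_number IN (
    SELECT o.order_number
    FROM orders o
    WHERE o.status = 'Offline'
);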

Joining two large tables which have large regions of no overlap

Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= cast(impressionTime as date) AND
    clickTime <= cast(impressionTime as date) + interval 1 day
  """)
)
Assume that both tables have trillions of rows covering 2 years of data. I think joining everything from both tables is unnecessary. What I want to do is create subsets, similar to this: create 365 * 2 * 2 smaller dataframes so that there is one dataframe per day per table for 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient, and I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables, add write.partitionBy(cast(impressionTime as date)) (and likewise cast(clickTime as date) for the other table) to the writers, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am joining records not from the same day but from the same hour, and there are very few records from 11pm to 1am? Does Spark know that it is most efficient to partition by day, or would something else be even more efficient?
First, to restate what I understood from your question: you have two tables with two years' worth of data, around a trillion records in each, and you want to join them efficiently for a given timeframe - for example a specific month of a specific year, or custom dates - reading only that much data and not all of it.
Now, to answer your question, you can do something like the below:
First of all, when you write the data to create the tables, partition them by a day column so that each day's data lands in a separate directory/partition for both tables. Spark won't do that by default; you have to decide on the partitioning based on your dataset.
Second, when you read the data and perform the join, it should not be done on the whole table. Read from the specific partitions only by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the filter (see the sketch below this answer).
Once you have filtered the data at read time and stored it in dataframes, join those dataframes on the key relationship. That is the most efficient and performant way of doing it as a first shot.
If it is still not fast enough, you can look at bucketing your data along with partitioning, but in most cases that is not required.
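A minimal Spark SQL sketch of the write-then-prune pattern described above; all table names, column names and dates are assumptions for illustration:

-- Write side: make the day column a partition column for each table.
CREATE TABLE impressions (
    impressionAdId STRING,
    impressionTime TIMESTAMP,
    impression_date DATE
)
USING PARQUET
PARTITIONED BY (impression_date);
-- (clicks would be created the same way, partitioned by a click_date column)

-- Read side: the filters on the partition columns let Spark prune down to a
-- month of directories instead of scanning two years of data before joining.
SELECT *
FROM impressions i
JOIN clicks c
  ON c.clickAdId = i.impressionAdId
 AND c.clickTime >= i.impression_date
 AND c.clickTime <= i.impression_date + INTERVAL 1 DAY
WHERE i.impression_date BETWEEN DATE'2024-01-01' AND DATE'2024-01-31'
  AND c.click_date BETWEEN DATE'2024-01-01' AND DATE'2024-02-01';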

Sparse matrix using column store on MemSQL

I am new to the columnstore DB family and some of the concepts are not yet completely clear to me. I want to use MemSQL to store a sparse matrix.
The table would look something like this:
CREATE TABLE matrix (
  r_id INT,
  c_id INT,
  cell_data VARCHAR(10),
  KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE
);
The queries:
Q1: SELECT c_id, cell_data FROM matrix WHERE r_id=<val>; -- i.e. a whole row
Q2: SELECT r_id, cell_data FROM matrix WHERE c_id=<val>; -- i.e. a whole column
Q3: SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>; -- i.e. one cell
Q4: UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
Q5: INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
Q1 and Q2 are about equally frequent, and Q3, Q4 and Q5 are about equally frequent among themselves; the two groups occur at roughly the same rate (i.e. Q1,2 : Q3,4,5 ≈ 1:1).
I do realize that inserting into a columnstore one row at a time creates a row segment group for each insert and thus degrades performance. I cannot batch the inserts. I also cannot use the in-memory rowstore (the matrix is too big).
I have three questions:
1. Does the issue with single-row inserts affect updates too, if only cell_data is changed (i.e. Q4)?
2. Would it be possible to have an in-memory row table in which I would do the INSERT (and UPDATE?) operations and periodically batch its contents into the column table? If so: how would I perform Q1 and Q2 when I need the most recent data (UNION ALL?), and is it possible to avoid executing Q3 against both tables (which would mean two round trips)?
3. I am concerned about the execution speed of Q1 and Q2. Is the clustered key optimal for those? I am not sure how the records would be stored with the table above.
1. Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2. Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too small a number of rows to make a good columnar segment) is kept in an in-memory rowstore form, and read queries essentially look at a UNION ALL of that data and the column-oriented data. We then batch up this data to write into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. Yes, to see the most recent data you would use UNION ALL, as sketched below.
2b. The data could be in either table, so you would have to query both (as for Q1 and Q2, UNION ALL works). This does not take two round trips, just one.
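A sketch of that explicit split, assuming a hypothetical side table matrix_recent that absorbs the single-row writes (in MemSQL, a table without a clustered columnstore key is a rowstore):

-- Rowstore table for fresh single-row INSERTs and UPDATEs.
CREATE TABLE matrix_recent (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    PRIMARY KEY (r_id, c_id)
);

-- Q1 including the most recent data: both tables in one statement, one round trip.
SELECT c_id, cell_data FROM matrix WHERE r_id = <val>
UNION ALL
SELECT c_id, cell_data FROM matrix_recent WHERE r_id = <val>;

Periodically you would flush the accumulated rows into the columnstore in one batch (INSERT INTO matrix SELECT ... FROM matrix_recent, then delete them from matrix_recent).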
3. You can order by either r_id or c_id first in the columnstore key - r_id in your current schema. This makes queries for a row efficient, but queries for a column are going to be very inefficient; they may have to scan basically the full table (depending on the patterns in your data). Unfortunately, columnstore tables do not support multiple keys, so there is no good way around this. One potential hacky solution is to maintain two copies of your table, one with key (r_id, c_id) and one with key (c_id, r_id) - this is essentially maintaining two indexes manually, as sketched below.
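The two-copy workaround in DDL (the second table name is made up). Every write must be applied to both tables; Q1 is served from matrix and Q2 from the column-first copy:

-- Same data, clustered by column first, so whole-column reads (Q2) are efficient.
CREATE TABLE matrix_by_col (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (`c_id`, `r_id`) USING CLUSTERED COLUMNSTORE
);

-- Q2 goes to the column-ordered copy:
SELECT r_id, cell_data FROM matrix_by_col WHERE c_id = <val>;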
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3, Q4 and Q5, i.e. 50% of the workload), which rowstore is much better suited for than columnstore (see http://docs.memsql.com/latest/concepts/columnstore/). Unfortunately, if the data doesn't fit in memory, there isn't really a good way around this other than perhaps adding more memory.

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one. I actually need to execute more than one query and return just one result set. I can't use the UNION or JOIN operators because I'm working with Cassandra (CQL).
Thanks in advance!
Frameworks like PlayOrm provide support for JOIN queries (INNER and LEFT JOINs) in Cassandra:
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If you're wanting to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you're needing to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - Cassandra doesn't cater for that kind of thing.
PlayOrm can do joins, but you may need to have PlayOrm partitioning on so you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically you instead join one partition with another partition, or a partition on the Account table with a partition on the Users table - i.e. make sure you still design for scale.
