How to do an asof timeseries join in J/Jd - j

Is there a simple/efficient way to do an asof join in Jd? For example, given tables A and B with columns time/measurement, query for A.time,A.measurement,B.measurement where B.measurement is the last observation such that B.time <= A.time.

Jd does not have asof join. This is something that we agree would be nice and is currently a low priority project. You might try to start a discussion on the Jsoftware database forum.
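In the meantime, the semantics are simple to express outside Jd. Here is a minimal sketch in Python (not Jd), assuming each table is a list of (time, measurement) pairs with B sorted by time: for each A row, binary-search for the last B.time <= A.time.

from bisect import bisect_right

def asof_join(a_rows, b_rows):
    # a_rows, b_rows: lists of (time, measurement); b_rows must be sorted by time.
    b_times = [t for t, _ in b_rows]
    out = []
    for a_time, a_meas in a_rows:
        i = bisect_right(b_times, a_time) - 1        # index of last B.time <= A.time
        b_meas = b_rows[i][1] if i >= 0 else None    # None when no B observation yet
        out.append((a_time, a_meas, b_meas))
    return out

# B observed at t=1 and t=4; A asks at t=2, 4, 5
print(asof_join([(2, 'a1'), (4, 'a2'), (5, 'a3')],
                [(1, 'b1'), (4, 'b2')]))
# [(2, 'a1', 'b1'), (4, 'a2', 'b2'), (5, 'a3', 'b2')]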

Related

Improve performance of a query using row_number() in Azure Synapse Analytics dedicated sql pool

The query below uses row_number(), and it introduces a shuffle move when executed.
SELECT
f.col1
,f.col2
,f.col3
,f.col4
,rowNum=row_number() OVER (PARTITION BY f.col2 ORDER BY f.col4 desc)
FROM #currentData e
left join dbo.targetTable f on
e.col2 =f.col2
Both the #currentData temporary table and targetTable are distributed on the col2 column.
I have also created an index on the columns used in the row_number() -- (col2 asc, col4 desc) -- but it didn't get rid of the shuffle move.
I have tried creating a covering index over all of the columns in the SELECT statement and in the row_number(), but that didn't resolve the issue either.
Both tables have an index on the join column (col2).
I have also made sure statistics are up to date on these two tables.
The query takes a long time to process due to the shuffle move. Is there any other way to improve its performance?
Appreciate the help!
I just found this out the hard way, unfortunately. I haven't had time to fully understand it, but I managed to cut the query time by about 90% by removing the ROW_NUMBER function.
To my understanding, ROW_NUMBER requires each node to see all of the data for a partition so it can compute the row number according to the ORDER BY clause. If the ORDER BY (or PARTITION BY) column comes from a big table, that means a lot of shuffling. Because we used row_number only as a primary-key generator I was able to get rid of it, but I assume the same thing happens with RANK etc.
By removing the row_number, the query plan actually does what it should: joining without data movement.
Interested to see if anyone has a solution or better explanation.
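I can't speak to the Synapse internals, but the same behaviour is easy to reproduce in Spark, where a window function partitioned by a column forces an exchange (shuffle) on that column before the ranking can be computed. A rough PySpark illustration (purely an analogy, not Synapse):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "c", 30)],
    ["col2", "col3", "col4"])

w = Window.partitionBy("col2").orderBy(F.col("col4").desc())
ranked = df.withColumn("rowNum", F.row_number().over(w))

# The physical plan shows an Exchange hashpartitioning(col2, ...) feeding the
# Window operator: all rows for a given col2 value must be co-located before
# row_number can be assigned.
ranked.explain()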

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I include the date in all of the join, group by and partition by clauses? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data, since I know I only want to read from a certain partition (or a year of partitions). But is it important to also reference the partition columns in windows and group bys for efficiency, or is that redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (plus any of the timestamp/customer ID fields I need to join on)? I definitely need STORE_ID_NUM, but I could forgo YEAR and DATE if they just add more columns to join on and make it less efficient. I don't know exactly how it works, so I wanted to check: by leaving them out of the join, maybe I am making it less efficient because I am not utilizing the partitions when doing the operations? Thank you!
The key with Delta is to choose the partition columns very well; this can take some trial and error. If you want to optimize response performance, a technique I learned is to choose a filter column with low cardinality (if the problem is time series it will be the date; if it is a report over all clients it may be convenient to choose the city). Remember that with Delta each partition column represents a level of the file structure, whose cardinality is the number of directories.
In your case partitioning by YEAR is good, but I would also add MONTH given the number of records; that would help somewhat with Spark's dynamic pruning.
Another thing you can try is a BROADCAST JOIN if one table is very small compared to the other.
Broadcast Hash Join en Spark (ES)
Join Strategy Hints for SQL Queries
The following link explains how partition pruning helps in MERGE operations:
How to improve performance of Delta Lake MERGE INTO queries using partition pruning
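For the join in the question, here is a hedged sketch in the same spark.sql style, reusing the question's table and column names together with the broadcast suggestion above (whether this is optimal depends on your data): keep the partition columns in the join keys and broadcast the small side.

from pyspark.sql.functions import broadcast

dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, CUSTNUM FROM STORE_DATA_SECONDS WHERE YEAR = 2020 AND DATE = '2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM FROM STORE_DATA_HRS WHERE YEAR = 2020 AND DATE = '2020-11-12'")

# Joining on the partition columns as well as STORE_ID_NUM keeps the partition
# filter visible to the optimizer; broadcast() avoids shuffling the big table.
joined = dfBig.join(
    broadcast(dfSmall),
    on=["YEAR", "DATE", "STORE_ID_NUM"],
    how="inner")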

How to use ORDER BY and GROUP BY together in u-sql

I have a U-SQL query which fetches data from 3 tables, and this query already has a GROUP BY. I want to fetch only the top 10 rows, so I have to use FETCH.
#data= SELECT C.id,C.Name,C.Address,ph.phoneLabel,ph.phone
FROM person AS C
INNER JOIN
phone AS ph
ON ph.id == C.id
GROUP BY id
ORDER BY id ASC
FETCH 100 ROWS;
Please provide me some samples.
Thanks in Advance!
I am not an expert or anything, but a few days ago I executed a query which uses both a GROUP BY and an ORDER BY clause. Here's how it looks:
SELECT DISTINCT savedposters.*, comments.rating, comments.posterid
FROM savedposters
INNER JOIN comments ON savedposters.id = comments.posterid
WHERE savedposters.display = 1
GROUP BY comments.posterid
HAVING avg(comments.rating) >= 4 AND count(comments.rating) >= 2
ORDER BY avg(comments.rating) DESC
What is your exact goal? There is no relationship between ORDER BY and GROUP BY. In your query you have a GROUP BY but no aggregation, so the GROUP BY is not needed; as written, the query would also fail. If you're looking to limit the output to 10 rows, see the first example at Output Statement (U-SQL).

Spark sql top n per group

How can I get the top-n (let's say top 10 or top 3) per group in spark-sql?
http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a tutorial for general SQL. However, Spark does not implement subqueries in the WHERE clause.
You can use the window function feature that was added in Spark 1.4.
Suppose that we have a productRevenue table with product, category and revenue columns.
The answer to "What are the best-selling and the second best-selling products in every category?" is as follows:
SELECT product,category,revenue FROM
(SELECT product,category,revenue,dense_rank()
OVER (PARTITION BY category ORDER BY revenue DESC) as rank
FROM productRevenue) tmp
WHERE rank <= 2
This will give you the desired result.
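If you prefer the DataFrame API over a SQL string, the equivalent can be sketched like this (assuming productRevenue is already loaded as a DataFrame with those three columns):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank products by revenue within each category, then keep the top two.
w = Window.partitionBy("category").orderBy(F.col("revenue").desc())
top2 = (productRevenue
        .withColumn("rank", F.dense_rank().over(w))
        .where(F.col("rank") <= 2)
        .select("product", "category", "revenue"))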

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL).
Thanks in advance!
Frameworks like PlayOrm provide support for JOIN queries (INNER and LEFT JOINs) in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If you want to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you need to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - Cassandra doesn't cater for that kind of thing.
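If you do end up merging client-side, here is a rough sketch of what that looks like with the Python driver (the keyspace name and the value column are made up for illustration; testCF and the row keys are from the example above):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # hypothetical keyspace

# Two separate CQL queries with differing WHERE clauses...
rows_a = session.execute("SELECT key, value FROM testCF WHERE key = 'rowKeyA'")
rows_b = session.execute("SELECT key, value FROM testCF WHERE key = 'rowKeyZ'")

# ...merged into a single result set in application code.
merged = list(rows_a) + list(rows_b)
for row in merged:
    print(row.key, row.value)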
PlayOrm can do joins, but you may need to have PlayOrm partitioning turned on so you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically you instead join one partition with another partition, for example a partition of the Account table joined with a partition of the Users table. In other words, make sure you still design for scale.
