Joining two result sets into one - cassandra

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL)
Thanks in advance !

Framework like Playorm provide support for JOIN (INNER and LEFT JOINs)queries in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java

If your wanting to query multiple rows within the same column family you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If your needing to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - cassandra doesn't cater for that kind of thing.

PlayOrm can do joins, but you may need to have PlayOrm partitioning on so you still scale. (ie. you dont' want to join 1 billion rows with 1 billion rows). Typically instead you do a join of one partition with another partition or a partition on the Account table joining a partition on the Users table. ie. make sure you design for scale still.

Related

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
domain_name text,
status int,
last_scanned_date long
PRIMARY KEY(text,last_scanned_date)
)
My requirement is to get all the domains which are not scanned in the last 24 hours. I wrote the following query, but this query is not efficient as Cassandra is trying to fetch entire dataset because of ALLOW FILTERING
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT name from domain;
2nd query:
Use IN operator to query domains which are not scanned i nlast 24 hours
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your structure table definition. Currently, you are selecting domain name as your partition key while you can not have more than 2 billion records in single Cassandra partition.
I would suggest you should use your time as part of your partition key. If you are not going to receive more than 2 billion requests per day. Try to use day since epoch as the partition key. You can do composite partition keys but they won't be helpful for your query.
While querying you have to scan at max two partitions with an additional filter in a query or in your application filtering out results which do not belong to a
the range you have specified.
Go over following concepts before finalizing your design.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can effectively perform range queries only inside one partition. The same is for use of the aggregations, such as DISTINCT. So in your case you'll need to have only one partition that will contain all data. But that's is bad design.
You may try to split this big partition into smaller ones, by using TLDs as separate partition keys, and perform fetching in parallel from every partition - but this also will lead to imbalance, as some TLDs will have more sites than other.
Another issue with your schema is that you have last_scanned_date as clustering column, and this means that when you update last_scanned_date, you're effectively insert a new row into database - you'll need to explicitly remove row for previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will always fetch old rows that you already scanned.
Partially your problem with your current design could be solved by using the Spark that is able to perform effective scanning of full table via token range scan + range scan for every individual row - this will return only data in given time range. Or if you don't want to use Spark, you can perform token range scan in your code, something like this.

What is the best practice of groupby in Spark SQL?

I have a Spark SQL that groupbys multiple columns. I was wondering if the order of the columns matter to the query performance.
Does placing the column with more distinct values earlier help? I assume the groupby is based on some hash/shuffle algorithm. If the first groupby can distribute data to small subsets that can be hold in one machine, the later groupbys can be done locally. Is this true?
What is the best practice of groupby?
group by, as you assumed, uses hash function on columns to decide which set of group by keys would end up in which partition.
You can use distribute by to tell spark which columns to use - https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation on the data (like placing more distinct values earlier), note that if have 2 group by statements in your query, you end up with 2 shuffles. And the result of the first one is obviously quite big (as it's not the final aggregation). So I would try to have as little group by statements as possible.

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it helped for DataFrames since the Spark core and spark SQL engines optimize differently.
Let's say table 1 has 6mil records and we're joining to table 2 which has 600mil records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table 3 = table1.join(table2, 'key2', 'left_outer'),
Is it worth it to filter down table2's 600mil records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide performance boost if you significantly reduce number of records to be shuffled. To do that:
It has to be highly selective.
Size of the key column is << size of all columns.
The first one is obvious. If there is reduction you search for nothing.
The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, same a join. So the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have 100 fold difference in the number of records, it might be better to consider broadcast join.

Spark SQL window function causes skew in data distribution

The performance of this Spark SQL query is bad due to skew data distribution:
select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as totalRevenue
from records c
I see in SparkUI that single task stack and the cluster fail if I increase the scanned range.
I am using Yarn at AWS EMR, with Spark 2.2.0
How can I overcome this issue?
Thanks
I can only recommend several approaches to alleviate your condition for investigation. I would actually try two approaches that don’t treat the skew first:
Try increasing the executor memory per the message. On YARN you may additionally need to increase the maximum container memory as well. The default on Spark IIRC is 2gb and its not uncommon to need to increase it.
Try switching to memory_and_disk or disk_only persistence levels. I believe this should work for your query although it can be hard to eyeball the full query plan
The reason for this is that at least to my eye your data is fundamentally skewed. You’re setting yourself up for maintenance difficulties if you start reshaping the data to address the skew in specific ways to the current shape of the data because the shape of the data may change over time. In my opinion at least you want to preserve the most straightforward implementation of your query for as long as you can, and only optimize skew issues programmatically if you hit problems with SLA violations, etc.
If those don’t work then you can try to address the skew directly. A simple approach for this is to create a third column that is populated by a random number for the column values that are known to be problematic. Do one pass of your summing operation with this in place, using it as a key, then a second pass with the extra random column removed. Alternatively you can do two queries and concatenate them: one with the random number for skewed data (which must still be handled in two passes) and another unaltered query for the non problematic data.
Edit - compute partial sums through two frames
The fundamentally useful observation here is that addition is commutative and associative. My original proposal based on random numbers won't work but this will. Basically, you want to compute the partial sum of the frame you want in several parts. The easiest way to do this is probably as a set of ranges (two used here for simplicity):
create temporary table partial_revenue_1 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 118 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table partial_revenue_2 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 117 PRECEDING and 1 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table combined_partials as select * from
partial_reveneue_1 union all select * from partial_revenue_2
select sum(partialTotalRevenue), first(c.some_col) ... from
combined_partials c group by cid, pid, code
Notice you need to use the first aggregate function to cull the duplicate fields that you will have from the earlier select * operations on the records table. Don't worry, this will be fine since both values came from the same table.

Performing join operation on two huge tables using apache spark

I have 2 tables in my database. Each table is having 100 million rows.
Is there a way to join these 2 tables and extract data using apache spark in fastest way ?
I would say the most efficient way would be to use DataFrames and call join, followed by any other criteria. The benefit is that certain filters or selections will be pushed as far down as possible to cut down your network load...only the data that is needed will be pulled.
Without more information, that is the best suggestion I can give.

Resources