Spark JDBC UpperBound - apache-spark

jdbc(String url,
String table,
String columnName,
long lowerBound,
long upperBound,
int numPartitions,
java.util.Properties connectionProperties)
Hello,
I want to import few table from Oracle to hdfs using spark jdbc connectivity. To ensure parallelism, I want to choose the correct upperBound for each table. I am planning put row_number as my partition column and count of the table as the upperBound. Is there a better way to chose upperBound?, since I have to connect to the table in the first time to get the count. Please help

Generally the better way to use partitioning in Spark JDBC:
Choose a numeric or date type column.
Set upper bound as the maximum value of the col
Set lower bound as the minimum value of the col
(if there is skew then there are other ways to handle, generally this is good)
Obviously the above requires some querying and handling
Keep the mapping of table: partition column (probably an external store)
Query and fetch min, max
Another tip for skipping query: if you can find a Date based column , you can probably use upper = today's date and lower = 2000's date. But again it subject's to your content. the values might not hold true.
From your question I believe you are looking for something generic which you can easily apply for all tables. I understand that's the desired state , but it was as straight forward as using row_number in DB to do so, Spark would have done that already by default.
Such functions may technically work, but will be definitely be slower than the above steps as well put extra load on your database.

Related

Spark SQL window function causes skew in data distribution

The performance of this Spark SQL query is bad due to skew data distribution:
select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as totalRevenue
from records c
I see in SparkUI that single task stack and the cluster fail if I increase the scanned range.
I am using Yarn at AWS EMR, with Spark 2.2.0
How can I overcome this issue?
Thanks
I can only recommend several approaches to alleviate your condition for investigation. I would actually try two approaches that don’t treat the skew first:
Try increasing the executor memory per the message. On YARN you may additionally need to increase the maximum container memory as well. The default on Spark IIRC is 2gb and its not uncommon to need to increase it.
Try switching to memory_and_disk or disk_only persistence levels. I believe this should work for your query although it can be hard to eyeball the full query plan
The reason for this is that at least to my eye your data is fundamentally skewed. You’re setting yourself up for maintenance difficulties if you start reshaping the data to address the skew in specific ways to the current shape of the data because the shape of the data may change over time. In my opinion at least you want to preserve the most straightforward implementation of your query for as long as you can, and only optimize skew issues programmatically if you hit problems with SLA violations, etc.
If those don’t work then you can try to address the skew directly. A simple approach for this is to create a third column that is populated by a random number for the column values that are known to be problematic. Do one pass of your summing operation with this in place, using it as a key, then a second pass with the extra random column removed. Alternatively you can do two queries and concatenate them: one with the random number for skewed data (which must still be handled in two passes) and another unaltered query for the non problematic data.
Edit - compute partial sums through two frames
The fundamentally useful observation here is that addition is commutative and associative. My original proposal based on random numbers won't work but this will. Basically, you want to compute the partial sum of the frame you want in several parts. The easiest way to do this is probably as a set of ranges (two used here for simplicity):
create temporary table partial_revenue_1 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 118 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table partial_revenue_2 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 117 PRECEDING and 1 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table combined_partials as select * from
partial_reveneue_1 union all select * from partial_revenue_2
select sum(partialTotalRevenue), first(c.some_col) ... from
combined_partials c group by cid, pid, code
Notice you need to use the first aggregate function to cull the duplicate fields that you will have from the earlier select * operations on the records table. Don't worry, this will be fine since both values came from the same table.

Cassandra Allow filtering

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So the allow filtering will not be a bad choice on this query.
In case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").
As a solution, I could hint you to include the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert it with the correct new value. But if you have Cassandra 3.0+ all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which cassandra have to use ALLOW FILTERING) and the size of data its being fetched from.
In your case cassandra do not need filtering upto :
By the range column(start) on the 1's result
As you mentioned. But after that, it will rely on filtering to search data, which you are allowing in query itself.
Now, keep following in mind
If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.
So ensure this first. If it works in you favour, use FILTERING.
Otherwise, it would be wise to add secondary index on 'action'.
PS : There is some minor edit.

Is there a way to filter a counter column in cassandra?

I have been unable to decipher on how to proceed with a use case....
I want to keep count of some items, and query the data such that
counter_value < threshold value
Now in cassandra, indexes cannot be made on counters, that is something that is a problem, is there a workaround modelling which can be done to accomplish something similar??
thanks
You have partially answered your own question, saying what you want to query. So lets say first model the data the way you will query it later.
If you want to query through counter value, it cannot be a counter type. As it doesn't complies the two conditions needed to query the data
Cannot be part of index
Cannot be part of the partition key
Counters are the most efficient way to do fast writes in Cassandra for a counter use of case. But unfortunately they cannot be part of where clause, because of above two restrictions.
So if you want to solve the problem using Cassandra, change the type to a long in Cassandra, make it the clustering key or make an index over that column. In any case this will slower your writes and increase the latency of every operation of updating counter value, as you will be using the anti parttern of read-before-write.
I would recommend to use the index.
Last but not least, I would consider using a SQL database for this problem.
Depending on what you're trying to return as a result, you might be able to do something with a user defined aggregate function. You can put arbitrary code in the user defined function to filter based on the value of the counter.
See some examples here and here.
Other approaches would be to filter the returned rows on the client side, or to load the data into Spark and filter the rows in Spark.

how to do a query with cassandradb counter table

i have a table in Cassandradb as mentioned below:
CREATE TABLE remaining (owner varchar,buddy varchar,remain counter,primary key(owner,buddy));
generally i do some inc/dec operations on REMAIN field ,using cql like below:
update remaining set remain=remain + 1 where owner='userA' and buddy='userB';
update remaining set remain=remain + 1 where owner='userA' and buddy='userC';
....
and now i need to find out all buddies for userA which it's REMAIN field greater then 1. when i using:
select buddy,remain from remaining where owner='userA' and remain > 0;
gives me an error:
No indexed columns present in by-columns clause with Equal operator
how to do this in a cassandradb way?
The short answer to this is that you cannot do queries with conditionals on counter columns in Cassandra.
The reason behind this is that all Cassandra queries need to be modeled around the primary key of that particular table. Counter columns are not allowed as parts of the primary key of a table (their changing values would cause constant reorganization of the dat on disk). Counter columns are more used for tracking the state of a known piece of data, for example number of times a photo has been up-voted. This could be quickly recalled as long as we knew which photo we were interested in. To actually sort photos by numbers of votes you would need to perform an analytics style query using spark or Hadoop.

get_range_slices and CQL query handling, need for ALLOW FILTERING

I have a following CQL table (a bit simplified for clarity):
CREATE TABLE test_table (
user uuid,
app_id ascii,
domain_id ascii,
props map<ascii,blob>,
PRIMARY KEY ((user), app_id, domain_id)
)
The idea is that this table would contain many users (i.e. rows, say, dozens of millions). For each user there would be a few domains of interest and there would be a few apps per domain. And for each user/domain/app there would be a small set of properties.
I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:
SELECT props FROM test_table WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
I was assuming that this query would be efficient because I specify the range of the row keys and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".
I have limited experience with Cassandra and I assumed that this sort of query would map into get_range_slices() call, which accepts the slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems either I misunderstand how this sort of query is handled or maybe I misunderstand about the efficiency of get_range_slices() call.
To be more specific, my questions are:
- if this data model does make sense for the kind of query I have in mind
- if this query is expected to be efficient
- if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING
My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped from the result.
--- update ----
Thank for all the comments. I have been doing more research on this and there is still something that I do not fully understand.
In the given structure what I am trying to get is like a rectangular area from my data set (assuming that all rows have the same columns). Where top and the bottom of the rectangle is determined by the token range (range) and the left/right sides are defined as column range (slice). So, this should naturally transform into get_range_slices request. My understanding (correct me if I am wrong) that the reason why CQL requires me to put ALLOW FILTERING clause is because there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows if it will have to skip every second row or first million rows before finding one that fits my criteria (in the given range) - this is what causes the unpredictable latency and possibly even timeout. Am I right? I have tried to write a test that does the same kind of query but using low-level Astyanax API (over the same table, I had to read the data generated with CQL, it turned out to be quite simple) and this test does work - except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course I had to implement some kind of simple paging based on the starting token and limit to fetch the data in small chunks.
Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:
Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user
For the "column partition no"...I am not sure if it is really needed. I assume that if I go with this model I will have relatively small number of rows (X=1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such a row in one request? It should created some memory pressure for Cassandra, I am sure. So maybe reading them in groups (say, Y=10..100) would be better?
I realize that what I am trying to do is not what Cassandra does well - reading "all" or large subset of CF data in chunks that can be pre-calculated (like token range or partition keys) for parallel fetching from different hosts. But I am trying to find a pattern that is the most efficient for such a use case.
By the way, the query like "select * from ... where TOKEN(user)>X and TOKEN(user)
Short answer
This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of query, it will work, however it will scan a lot of data:
SELECT props FROM test_table
WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;
Longer explanation
In your example primary key consists of two parts: user is used as partition key, and <app_id, domain_id> form remaining part. Rows for different users are distributed across the cluster, each node responsible for specific range of token ring.
Rows on a single node are sorted by the hash of partition key (token(user) in your example). Different rows for single user are stored on a single node, sorted by <app_id, domain_id> tuple.
So, the primary key forms a tree-like structure. Partition key adds one level of hierarchy, and each remaining field of a primary key adds another one. By default, Cassandra processes only the queries that return all rows from the continuos range of the tree (or several ranges if you use key in (...) construct). If Cassandra should filter out some rows, ALLOW FILTERING must be specified.
Example queries that don't require ALLOW FILTERING:
SELECT * FROM test_table
WHERE user = 'user1';
//OK, returns all rows for a single partition key
SELECT * FROM test_table
WHERE TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
//OK, returns all rows for a continuos range of the token ring
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id='myapp1';
//OK, the rows for specific user/app combination
//are stored together, sorted by domain_id field
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz';
//OK, since rows for a single user are sorted by app
Example queries that do require ALLOW FILTERING:
SELECT props FROM test_table
WHERE app_id='myapp1';
//Must scan all the cluster for rows,
//but return only those with specific app_id
SELECT props FROM test_table
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids),
//but return only those having specific domain
SELECT props FROM test_table
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain
What to do?
In Cassandra it's not possible to create a secondary index on the part of the primary key. There are few options, each having its pros and cons:
Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data in two tables. It will allow you to query necessary data for a specific app_id or <app_id, domain_id> combination. If you need to query specific domain and all apps - third table is necessary. This approach is called materialized views
Use some sort of parallel processing (hadoop, spark, etc) to perform necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from a single pair. If the result for other pairs might be cached for later use, it will probably save some time.
Just use ALLOW FILTERING if query performance is acceptable for your needs. Dozens of millions partition keys is probably not too much for Cassandra.
Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, and is therefore not stored in sorted order. Doing this kind of range query would therefore require a full scan.
If you want to do this query, you should store some known value as a sentinel for your row key, such that you can query for equality rather than range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.

Resources