Partitioning in Spark while connecting to RDBMS - apache-spark

Say I have an RDBMS table with 10,000 records and a column (pk_key) whose values are a sequence running from 1 to 10,000. I am planning to read it via Spark.
I am planning to split it into 10 partitions.
So in the DataFrameReader jdbc method, my columnName will be "pk_key" and numPartitions will be 10.
What should the lowerBound and upperBound be?
PS: My actual record count is much higher; I just need to understand how this works.

Do you have a natural key? It may be non-unique. It's hard to hard-code lowerBound and upperBound as Long values, since they will be different on different days.
One thing you can do is run two queries:
select min(pk_key) from table;
select max(pk_key) from table;
via a normal JDBC connection. The first query returns the lowerBound, the second one the upperBound.
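For the concrete numbers in the question (pk_key running from 1 to 10,000, 10 partitions) the bounds are simply the minimum and maximum of pk_key, i.e. 1 and 10,000. A minimal sketch, assuming a hypothetical PostgreSQL URL, table name and credentials:

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioning").getOrCreate()

// Hypothetical connection details -- replace with your own.
val url = "jdbc:postgresql://dbhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "reader")
props.setProperty("password", "secret")

val df = spark.read.jdbc(
  url,
  "my_table",   // table name (assumption)
  "pk_key",     // partition column
  1L,           // lowerBound = min(pk_key)
  10000L,       // upperBound = max(pk_key)
  10,           // numPartitions
  props)

Note that the bounds only control how Spark splits the column range into strides (here roughly 1,000 ids per partition); rows with pk_key outside the bounds are still read, they simply all land in the first or last partition.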

Related

Spark JDBC UpperBound

jdbc(String url,
String table,
String columnName,
long lowerBound,
long upperBound,
int numPartitions,
java.util.Properties connectionProperties)
Hello,
I want to import a few tables from Oracle to HDFS using Spark JDBC connectivity. To ensure parallelism, I want to choose the correct upperBound for each table. I am planning to use row_number as my partition column and the count of the table as the upperBound. Is there a better way to choose the upperBound, since I have to connect to the table first just to get the count? Please help.
Generally the better way to use partitioning in Spark JDBC:
Choose a numeric or date type column.
Set upperBound to the maximum value of the column.
Set lowerBound to the minimum value of the column.
(If there is skew there are other ways to handle it, but generally this works well.)
Obviously the above requires some querying and handling:
Keep a mapping of table to partition column (probably in an external store).
Query and fetch the min and max (a sketch follows after this answer).
Another tip for skipping the query: if you can find a date-based column, you can probably use upper = today's date and lower = a date around the year 2000. But again this is subject to your content; the values might not hold true.
From your question I believe you are looking for something generic that you can easily apply to all tables. I understand that is the desired state, but if it were as straightforward as using row_number in the DB, Spark would have done that by default already.
Such functions may technically work, but they will definitely be slower than the steps above, and they put extra load on your database.
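A minimal sketch of the min/max approach, assuming a hypothetical Oracle URL, table name, partition column and credentials; the two bounds are fetched over a plain JDBC connection and then fed to spark.read.jdbc:

import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-min-max").getOrCreate()

// Hypothetical connection details -- replace with your own.
val url   = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
val table = "SALES"
val col   = "SALE_ID"
val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// Fetch the bounds with a single lightweight query over plain JDBC.
val conn = DriverManager.getConnection(url, props)
val rs = conn.createStatement()
  .executeQuery(s"SELECT MIN($col), MAX($col) FROM $table")
rs.next()
val lowerBound = rs.getLong(1)
val upperBound = rs.getLong(2)
conn.close()

// Read the table in parallel, partitioned on the fetched bounds.
val df = spark.read.jdbc(url, table, col, lowerBound, upperBound, 16, props)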

DSE (Cassandra) - Range search on int data type

I am a beginner with Cassandra. I created a table with the details below, and when I try to perform a range search using token, I get no results. Am I doing something wrong, or is it my understanding of the data model?
Query: select * from test where token(header)>=2 and token(header)<=4;
The token function calculates a token from the value based on the configured partitioner. The calculated value is the hash used to identify the node where the data is located; it is not the data itself.
Cassandra can perform range searches only on clustering columns (and only for some designs), and only inside a single partition. If you need a range query on an arbitrary column (including partition keys), there is DSE Search, which lets you index the table and perform different types of search, including range search... but take into account that it will be much slower than regular Cassandra queries.
In your situation, you can run 3 queries in parallel (to cover the values 2, 3 and 4), like this:
select * from test where header = value;
and then combine the results in your code (a sketch follows below).
I recommend taking the DS201 & DS220 courses at DataStax Academy to understand how Cassandra performs queries and how to model data to make them possible.
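A minimal sketch of that fan-out-and-combine idea, assuming the DataStax Java driver 4.x, a keyspace named ks configured on the session, and that header is an int column; for brevity the queries run one after another here, but each is a cheap single-partition lookup and could just as well be issued asynchronously:

import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.cql.Row
import scala.collection.mutable.ArrayBuffer

// Hypothetical session -- contact points and credentials come from application.conf here.
val session = CqlSession.builder().withKeyspace("ks").build()

// One prepared statement, bound once per header value we need.
val ps = session.prepare("SELECT * FROM test WHERE header = ?")

val combined = ArrayBuffer.empty[Row]
for (h <- Seq(2, 3, 4)) {
  // Each execution is a single-partition lookup, which Cassandra handles efficiently.
  val rs = session.execute(ps.bind(Int.box(h)))
  rs.forEach(row => combined += row)
}

combined.foreach(row => println(row.getInt("header")))
session.close()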

Spark sql limit in IN clause

I have a query in spark-sql with a lot of values in the IN clause:
select * from table where x in (<long list of values>)
When I run this query I get a TransportException from the MetastoreClient in Spark.
Column x is the partition column of the table. The hive metastore is on Oracle.
Is there a hard limit on how many values can be in the IN clause?
Or can I maybe set the timeout value higher to give the metastore more time to answer?
Yes, there is a limit: Oracle (which backs your Hive metastore) accepts at most 1,000 values inside a single IN list.
However, you can combine several IN clauses with the OR operator, slicing the list of values into chunks of 1,000.
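A minimal sketch of that slicing, assuming a hypothetical list of string values for the partition column x; the WHERE clause is built as x IN (...) OR x IN (...) in chunks of 1,000:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("in-clause-chunks").getOrCreate()

// Hypothetical list of partition values -- replace with your own.
val values: Seq[String] = (1 to 5000).map(i => s"part_$i")

// Slice into chunks of 1000 (Oracle's IN-list limit) and OR them together.
// Values are quoted here because x is assumed to be a string column.
val predicate = values
  .grouped(1000)
  .map(chunk => chunk.map(v => s"'$v'").mkString("x IN (", ", ", ")"))
  .mkString(" OR ")

val result = spark.sql(s"SELECT * FROM table WHERE $predicate")
result.show()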

Running partition specific query in Spark Dataframe

I am working on a Spark Streaming application where I partition the data by a certain ID in the data.
For eg: partition 0 -> contains all data with id 100
partition 1 -> contains all data with id 102
Next I want to execute a query on the whole dataframe for the final result. But my query is specific to each partition.
For eg: I need to run
select(col1 * 4) in case of partition 0
while
select(col1 * 10) in case of partition 1.
I have looked into the documentation but didn't find any clue. One solution I have is to create different RDDs/DataFrames for each id in the data. But that is not scalable in my case.
Any suggestion on how to run a query on a dataframe where the query can be specific to each partition?
Thanks
I think you should not couple your business logic to Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest adding an artificial column to your DataFrame that equals the id you partition by, and expressing the per-id logic against that column (a sketch follows below).
In any case, you can always do
df.rdd.mapPartitionsWithIndex { case (partId, iter: Iterator[Row]) => ... }
See also the docs.
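A minimal sketch of the column-based approach, assuming the id column is named id and using the multipliers from the question (4 for id 100, 10 for id 102); because the rule lives in the id column rather than in the physical partitioning, it stays valid after any repartition:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().appName("per-id-logic").getOrCreate()
import spark.implicits._

// Hypothetical input -- replace with your streaming DataFrame.
val df = Seq((100, 2.0), (102, 3.0)).toDF("id", "col1")

// Express the "per-partition" rule as an ordinary column expression on id.
val result = df.withColumn(
  "scaled",
  when(col("id") === 100, col("col1") * 4)
    .when(col("id") === 102, col("col1") * 10)
    .otherwise(col("col1")))

result.show()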

Cassandra range queries cql

I have to create a table which stores a large amount of data (around 400 columns and 5,000,000 to 40,000,000 rows). There is a counter "counter" which counts upwards from 1. Right now this is my primary key. The other columns are of int, float, and varchar types, repeating.
I need to do this for a database comparison, so I have to use Cassandra, even if there are other databases that could do better on this specific problem.
On this table I want to execute some range queries. The queries should be like:
SELECT counter, val1, val2, val3 FROM table WHERE counter > 1000 AND counter < 5000;
Also there will be other filter-parameters:
... AND val54 = 'OK';
I think this is a problem in Cassandra, because "counter" is the PK. I will try using the token() function, but I guess this will be slow.
Right now I am learning about data modelling in Cassandra, but I hope somebody with Cassandra experience has some hints for me, like how to organize the table so the queries are possible and fast. Perhaps just some topics I should learn about or links that will help me.
Have a nice day,
Friedrich
This sounds like a bad use case for Cassandra.
First, range queries are discouraged in Cassandra. This is because the range can't be resolved without visiting every node in the cluster.
Second, you can't mix a counter-type column with other column types: a given table can have either counter columns only or non-counter columns only.
As far as Cassandra data modeling goes, if you want to create a successful data model, build your partitions around the exact thing you're going to query.
