Cassandra data modelling - Select subset of rows in large table for batch Spark processing

I am working on a project in which we ingest large volumes of structured raw data into a Cassandra table and then transform it to a target schema using Spark.
Since our raw data table is getting pretty large, we would like to process it in batches. That means Spark has to look at our raw data table to identify the rows that have not yet been processed (by partition key) and then load that subset of rows into Spark.
Being new to Cassandra, I am now wondering how to implement this. Using Spark, I can quite efficiently load the raw data keys and compare them with the keys in the transformed table to identify the subset. But what is the best strategy to load that subset of rows from my raw data table?
Here is an example. If my schema looks like this:
CREATE TABLE raw_data (
    dataset_id text PRIMARY KEY,
    some_json_data text
);
...and if I have dataset_ids 1, 2, 3, 4, 5 in my table and know that I now need to process the rows with ids 4 and 5, how can I efficiently select those rows, knowing that the list of ids can be pretty long in practice?
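For illustration, one pattern commonly used with the DataStax spark-cassandra-connector for this kind of targeted load is joinWithCassandraTable, which issues per-key reads instead of a full table scan. A minimal Scala sketch, not necessarily the only answer to the question; the keyspace name my_keyspace is an assumption and sc is the SparkContext:

import com.datastax.spark.connector._

// RDD of partition keys still to be processed (e.g. the result of the key comparison);
// Tuple1 because raw_data has a single-column primary key.
val pendingIds = sc.parallelize(Seq(Tuple1("4"), Tuple1("5")))

// Pulls only the matching rows from Cassandra, one targeted read per key,
// instead of scanning the whole raw_data table.
val pendingRows = pendingIds.joinWithCassandraTable("my_keyspace", "raw_data")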

Related

HBase filter specific rows

I have a Java Spark (v2.4.7) job that currently reads the entire table from HBase.
My table has millions of rows, and reading the entire table is very expensive (memory-wise).
My process doesn't need all the data from the HBase table; how can I avoid reading rows with specific keys?
Currently, I read from HBase as follows:
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext().newAPIHadoopRDD(
        DataContext.getConfig(), TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
I saw the answer in this post, but I couldn't find how to filter out specific keys.
Any help?
Thanks!
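One well-known way to push the row selection down to HBase is to serialize a Scan, carrying whatever filter matches your key layout, into the configuration that TableInputFormat reads. A sketch of that idea in Scala (the same classes are used from Java); the PrefixFilter is only a placeholder, and DataContext.getConfig() is the helper from the question:

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

val conf = DataContext.getConfig()   // same configuration as in the question
val scan = new Scan()
// Placeholder: keep only rows whose key starts with "2021"; swap in the Filter
// that matches your keys (RowFilter, MultiRowRangeFilter, FilterList, ...).
scan.setFilter(new PrefixFilter(Bytes.toBytes("2021")))
// TableInputFormat picks the scan (and its filter) up from this property,
// so the filtering happens on the HBase region servers, not in Spark.
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val rdd = sparkSession.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])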

External table in Azure Synapse: very slow performance

I have a Parquet file and created a new external table, but the performance is very slow compared to a normal table in Synapse. Can you please let me know how to overcome this?
Very broad question, so I'll give a broad answer:
Use a normal table. It is hard to beat the performance of a "normal table" with external tables ("normal table" meaning a table created in a Dedicated SQL pool using CREATE TABLE). If you're querying data from one or more tables repeatedly and each query is different (group-by, join, selected columns), then external tables won't beat the performance of a "normal" table.
Understand and apply basic best practices:
Use parquet format, which you're doing.
Pick the right partition column and partition your data by storing partitions in different folders or file names.
If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.
Try to keep your CSV (if using csv) file size between 100 MB and 10 GB.
Use correct data types.
Manually create statistics for CSV files
Use CETAS to enhance query performance and joins
...and many more.
a) The first step is to partition your Parquet file using a relevant partition column, such as Year, Month, and Date.
b) Second, I recommend using a view rather than an external table. External tables don't support partition pruning and won't use the partition columns to eliminate unnecessary files during the read.
c) Ensure that data types are enforced, and that string types are sized appropriately.
d) If possible, convert your Parquet files to Delta format. Synapse is able to read partition columns from Delta without the need for the filepath() and filename() functions. External tables do not support Delta; only views do.
Note: external tables don't support Parquet partition columns.
SELECT *,
    CAST(fct.filepath(1) AS SMALLINT) AS SalesOrderPathYear,
    CAST(fct.filepath(2) AS TINYINT) AS SalesOrderPathMonth,
    CAST(fct.filepath(3) AS DATE) AS SalesOrderPathDate
FROM OPENROWSET
(
    BULK 'conformed/facts/factsales/*/*/*/*.parquet',
    DATA_SOURCE = 'ExternalDataSourceDataLake',
    FORMAT = 'Parquet'
)
WITH
(
    ColA VARCHAR(10),
    ColB INT,
    ColC ...
) AS fct
Ref: https://www.serverlesssql.com/certification/mastering-dp-500-exam-querying-partitioned-sources-in-azure-storage/

Spark parquet partitioning removes the partition column

If I am using df.write.partitionBy(col1).parquet(path),
the written data files no longer contain the partition column.
How can I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file with a partitioned directory structure, Spark will automatically add the partition column back to the DataFrame.
Actually, Spark does not remove the column; it uses that column to organize the files (one directory per value), so that when you read the files back it is added as a column again and displayed to you in table format. If you check the schema of the table or of the DataFrame, you will still see it as a column.
Also, you partition your data because you know how that table is queried most frequently, and based on that information you decided to partition it so that your reads become faster and more efficient.
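A quick way to see the behaviour described above (a Scala sketch; df, col1, and path stand in for your own DataFrame, column, and location, and spark is the SparkSession):

// Write partitioned by col1; the data files themselves no longer contain col1,
// it is encoded in the directory names (.../col1=<value>/...).
df.write.partitionBy("col1").parquet(path)

// Reading the partitioned directory back: Spark re-adds col1 from the paths.
val readBack = spark.read.parquet(path)
readBack.printSchema()   // col1 appears in the schema again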

Data is not getting written in sorted order to the target Oracle table through Spark

I have a table in Hive with the below schema:
emp_id:int
emp_name:string
I have created a DataFrame from the above Hive table:
df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()
After the above code runs, I see that the data is sorted properly on emp_id.
I am trying to write the data to an Oracle table with the below code:
df.write.jdbc(url=url, table='target_table', properties=properties, mode="overwrite")
As per my understanding, this is happening because multiple executor processes run at the same time, one per data partition; the sorting applied through the query only holds within each partition, and when multiple processes write data to Oracle at the same time the ordering of the result table is distorted.
I further tried repartitioning the data into just one partition (which is not an ideal solution), and after writing the data to Oracle the sorting worked properly.
Is there any way to write sorted data to an RDBMS from Spark?
TL;DR When working with relational systems you should never depend on the insert order. Spark is not really relevant here.
Relational databases, including Oracle, don't guarantee any intrinsic order of the stored data. The exact order of stored records is an implementation detail and can change during the lifetime of the data.
The sole exception in Oracle is Index-Organized Tables, where:
data for an index-organized table is stored in a B-tree index structure in a primary key sorted manner.
This of course requires a primary key which can reliably determine order.
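In practice that means the ordering belongs to whoever reads the data. A minimal Scala sketch of that idea, reusing the url and properties from the question (spark is the SparkSession): apply the sort when reading the table back instead of relying on insert order.

// Read the target table back and apply the ordering at query time.
val ordered = spark.read
  .jdbc(url, "target_table", properties)
  .orderBy("emp_id")

ordered.show()   // sorted by emp_id regardless of how the rows were inserted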

Spark and Cassandra, range query on clustering key

I have a Cassandra table with the following structure:
CREATE TABLE table (
    key int,
    time timestamp,
    measure float,
    PRIMARY KEY (key, time)
);
I need to create a Spark job which will read the data from the previous table, do some processing within a specified start and end timestamp, and flush the results back to Cassandra.
So my spark-cassandra-connector query will have to do a range query on the clustering column of the Cassandra table.
Are there any performance differences if I do:
sc.cassandraTable(keyspace, table).
  as(caseClassObject).
  filter(a => a.time.after(startTime) && a.time.before(endTime))...
so what I am doing is loading all the data into Spark and applying filtering
OR if I do this:
sc.cassandraTable(keyspace, table).
  where(s"time > $startTime AND time < $endTime")...
which filters the data in Cassandra and only loads the smaller subset into Spark.
The selectivity of the range query will be around 1%.
It is impossible to include the partition key in the query.
Which of these two solutions is preferred?
sc.cassandraTable(keyspace, table).where(s"time > $startTime AND time < $endTime")
Will be MUCH faster. In the first command you are basically doing the full grab to end up with the same data; with the where clause you do only a proportional share of that work (if you only pull 5% of the data, roughly 5% of the total work).
In the first case you are
Reading all of the data from Cassandra.
Serializing every object and then moving it to Spark.
Then finally filtering everything.
In the second case you are
Reading only the data you actually want from C*
Serializing only this tiny subset
There is no step 3
As an additional comment, you can also put your case class type right in the call:
sc.cassandraTable[CaseClassObject](keyspace, table)
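Putting the two points together, a minimal sketch of the suggested call:

sc.cassandraTable[CaseClassObject](keyspace, table)
  .where(s"time > $startTime AND time < $endTime")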
