I am trying to update a few fields of each row of a big MySQL table (close to 500 million rows). The table either has no primary key or has a string primary key (like a UUID). I don't have enough executor memory to read and hold the entire data at once. Can anyone please let me know what my options are for processing such a table?
Below is the schema
CREATE TABLE Persons (
  Personid varchar(255) NOT NULL,
  LastName varchar(255) NOT NULL,
  FirstName varchar(255) DEFAULT NULL,
  Email varchar(255) DEFAULT NULL,
  Age int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The Spark code looks like this:
SparkSession spark = SparkSession.builder().master("spark://localhost:7077").appName("KMASK").getOrCreate();
Dataset<Row> rawDataFrame = spark.read().format("jdbc").options(options).load(); // JDBC url/dbtable/user/password are set via options
rawDataFrame.createOrReplaceTempView("data");
// encrypt is a UDF
String sql = "select Personid, LastName, FirstName, encrypt(Email) as Email, Age from data";
Dataset<Row> newData = spark.sql(sql);
newData.write().mode(SaveMode.Overwrite).format("jdbc").options(options).save();
This table has around 150 million records and the data size is around 6 GB. My executor memory is just 2 GB. Can I process this table using Spark JDBC?
You can tune the Spark JDBC fetchsize option to reduce or increase how many records are fetched and processed at a time.
Partitioning the read can also help by spreading the work across executors and reducing per-executor memory pressure. Since Age is a numeric field, you can have Spark read the data in partitions bounded by Age: first determine the min and max age, then set the Spark JDBC options, notably:
partitionColumn: Age
lowerBound: the minimum age you identified
upperBound: the maximum age you identified
numPartitions: depends mostly on the number of cores and worker nodes; see the Spark JDBC documentation for more hints
You may also use the query option with a custom query to select and process only a subset of records that fits in memory at a time. Note that when using the query option you should not also set the dbtable option. A sketch combining these options follows.
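Putting those options together, a minimal sketch in Java (reusing the spark session from your code; the URL, credentials and Age bounds are placeholders, not values taken from your setup):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> rawDataFrame = spark.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")   // placeholder URL
    .option("dbtable", "Persons")
    .option("user", "user")                              // placeholder credentials
    .option("password", "password")
    .option("partitionColumn", "Age")                    // numeric column to split on
    .option("lowerBound", "1")                           // min(Age) you determined
    .option("upperBound", "100")                         // max(Age) you determined
    .option("numPartitions", "20")                       // tune to your cores/workers
    .option("fetchsize", "10000")                        // rows per JDBC round trip
    .load();

To process one slice at a time instead, replace the dbtable option with something like .option("query", "select Personid, Email from Persons where Age between 30 and 40"), and remember not to set dbtable and query together.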
Related
I am trying to read/write data from other databases using JDBC, just following the doc https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html.
But I found that Spark SQL does not work well with DEFAULT values or AUTO_INCREMENT columns:
CREATE TEMPORARY VIEW jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:postgresql:dbserver",
dbtable "schema.tablename",
user 'username',
password 'password'
)
INSERT INTO TABLE jdbcTable (id) values (1)
Here is my DDL
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`age` int(11) NULL DEFAULT 0,
PRIMARY KEY (`id`) USING BTREE
)
The error org.apache.spark.sql.AnalysisException: unknown requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).
Is there any way to support DEFAULT values or AUTO_INCREMENT? Thanks.
I have run into this same issue with DEFAULT and also COMPUTED columns. If you are using SQL Server you can consider an AFTER INSERT trigger; otherwise you may need to calculate the id on the INSERT side, for example as sketched below.
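For the MySQL table above, a hedged sketch of doing that with Spark (assuming a dataFrame that already holds the age column and a JDBC options map as in the docs; monotonically_increasing_id gives unique but non-consecutive values, so it only stands in for AUTO_INCREMENT if gaps are acceptable):

import static org.apache.spark.sql.functions.monotonically_increasing_id;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Spark's JDBC writer sends every target column, so supply id yourself instead of
// relying on AUTO_INCREMENT / DEFAULT on the database side.
Dataset<Row> withId = dataFrame
    .withColumn("id", monotonically_increasing_id().cast("int")); // target column is INT(11); watch for overflow
withId.write().mode(SaveMode.Append).format("jdbc").options(options).save();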
I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records that were inserted within a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which is used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called write time that can be retrieved with the writetime(column_name) function of CQL (warning: it doesn't work on collection columns, and returns null for UDTs!). But because we don't have nested queries in CQL, you will need to write a program to fetch the data, filter out entries by write time, and delete the entries whose write time is older than your threshold. (Note that the writetime value is in microseconds, not milliseconds as in CQL's timestamp type.)
The easiest way is to use the Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._   // cassandraTable, deleteFromCassandra, .writeTime

// the threshold must be in microseconds, because writetime() values are microseconds
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
  .select("prk1", "prk2", "reg_col".writeTime as "writetime")
  .filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
  keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any "regular" column of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns in select and in deleteFromCassandra is the same.
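If you prefer not to bring in Spark, roughly the same fetch, filter and delete loop can be written against the plain DataStax Java driver 3.x (a sketch only; contact point, keyspace and threshold are placeholders, and it assumes clientId is always set):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.time.Instant;

// threshold in MICROseconds, because writetime() values are microseconds
long thresholdMicros = Instant.parse("2020-01-01T00:00:00Z").toEpochMilli() * 1000L;

try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
     Session session = cluster.connect("mykeyspace")) {
    ResultSet rs = session.execute(
        "SELECT documentId, sequenceNo, writetime(clientId) AS wt FROM TestTable");
    for (Row row : rs) {                           // the driver pages through the table automatically
        if (row.getLong("wt") < thresholdMicros) {
            session.execute(
                "DELETE FROM TestTable WHERE documentId = ? AND sequenceNo = ?",
                row.getString("documentId"), row.getLong("sequenceNo"));
        }
    }
}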
I have a dev machine with Cassandra 3.9 and two tables: one has about 400,000 records, the other about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In Cassandra, running aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifying a partition key is bad practice. Instead, you can have another table that stores the max value of the x field for both tables.
However, you have to add some client-side logic to maintain this max value in the other table whenever you run INSERT or UPDATE statements (see the sketch after the table definitions below).
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
With this structure you can get the max value for a table:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
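And the client-side maintenance could look roughly like this (a sketch with the DataStax Java driver 3.x; note that this read, compare and write sequence is not safe under concurrent writers, where a conditional update such as ... IF max_value < ? would be more robust):

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// After every insert into t1, read the stored max and bump it if needed.
void insertIntoT1(Session session, String pk, int x) {
    session.execute("INSERT INTO t1 (pk, x) VALUES (?, ?)", pk, x);
    Row current = session.execute(
        "SELECT max_value FROM agg_table WHERE table_name = 't1'").one();
    if (current == null || current.getInt("max_value") < x) {
        session.execute(
            "UPDATE agg_table SET max_value = ? WHERE table_name = 't1'", x);
    }
}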
Hope this can help you.
I'm currently trying to model a column family that has two timestamps specifying whether an entry is valid (or 'active') at a given date (typically execution time).
This is no big issue with traditional SQL: with 64 GB of RAM and some indices, we do this quite often on our SQL Server.
However, in CQL I haven't managed to model this scenario and write valid queries for it.
My basic model is (I skipped the PK definition!)
create table myTable(
id uuid,
validFrom timeuuid,
validTo timeuuid,
someInformationalData varChar
);
Some explanations:
Because a validity date is not unique, I need a combined key; in my final application this is going to be a user-group reference (which would be an ideal partition key).
validFrom/validTo are designed to be optional, but I could deal with that by using boundary values (1970, 2038) for 'null' values passed through the persistence layer.
I tried various combinations of partition and clustering keys, but none of them resulted in valid CQL for a query like:
-- only active results
select *
from myTable
where validFrom < now()
  and validTo > now();
I'm quite new to the NoSQL/CQL world and am struggling a bit with converting some of our applications. I could do it in memory, but I'm afraid this could become a bottleneck at some point...
Not sure if this kind of 'I have no idea what I'm doing' yell is appropriate, but any kind of help would be appreciated. :)
Edit: Here's one of the approaches I've been messing around with:
drop table if exists myTable;
create table myTable(
id int,
datefrom timeuuid,
dateto timeuuid,
someColumns varChar,
primary key((id,datefrom),dateto)
);
create index if not exists my_idx on myTable(datefrom);
insert into myTable(id, datefrom,dateto,somecolumns)
values(0,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2020-01-01 00:00:00'),'test');
insert into myTable(id,datefrom,dateto,somecolumns)
values(1,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2012-01-01 00:00:00'),'test2');
select * from myTable where dateto > now() allow filtering;
-- invalid ("A column of a partition key can be restricted only if the preceding one is restricted by an Equal relation.")
select * from myTable where datefrom < now() and dateto > now() allow filtering;
The first query limits my results as expected (the row with validTo = 2012-01-01 is filtered out), but I wasn't able to work out a scheme that applies both limits in the WHERE clause.
If I understand your problem, what you are looking for is a way to run a range query based on the timestamp. Basically, to be able to do this, your model has to have the timestamp component as part of the clustering key:
create table myTable(
eventType text,
ts timestamp,
val text,
PRIMARY KEY (eventType, ts)
);
The above will allow you to run a query like: SELECT eventType, val FROM myTable WHERE eventType = 'your_event' AND ts >= 'start_ts' AND ts < 'end_ts'.
What you need to remember is that the clustering keys dictate the on-disk order, which is what makes it possible to run queries like the one above efficiently. You can read more details about this in the SELECT section of the CQL spec.
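For illustration, a minimal sketch with the DataStax Java driver 3.x that issues such a range query (contact point, keyspace, event name and time window are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Date;

try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
     Session session = cluster.connect("mykeyspace")) {
    Date from = new Date(System.currentTimeMillis() - 24L * 3600 * 1000); // last 24 hours
    Date to = new Date();
    // range scan on the clustering column ts, restricted to one partition (eventType)
    ResultSet rs = session.execute(
        "SELECT eventType, ts, val FROM myTable WHERE eventType = ? AND ts >= ? AND ts < ?",
        "your_event", from, to);
    for (Row row : rs) {
        System.out.println(row.getTimestamp("ts") + " -> " + row.getString("val"));
    }
}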
There is no such thing as NOW() in Cassandra like in other SQL databases; you have to explicitly specify today's date instead of NOW().
In the WHERE clause you can only use columns that you defined as part of the primary key or that have a secondary index.
I am a NoSQL n00b, just trying things out. I have the following keyspace with a single table in Cassandra 2.0.2:
CREATE KEYSPACE PersonDB WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
USE PersonDB;
CREATE TABLE Persons (
id int,
lastname text,
firstname text,
PRIMARY KEY (id)
)
I have close to 500 entries in the Persons table. I want to select a random row from the table. Is there an efficient way to do this in CQL? I am using Groovy to invoke the APIs exposed by DataStax.
If you want to get "any" row, you can just use LIMIT:
select * from persons LIMIT 1;
You would get the row whose partition key (id) hashes to the lowest token. It will not be random, and it depends on your partitioner, but you would get a row. A sketch of the token ordering, and of one common workaround for a pseudo-random pick, is below.
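As a rough sketch (assuming the DataStax Java driver 3.x and the default Murmur3Partitioner; contact point and keyspace are placeholders), you can look at the token ordering yourself, and one common workaround for a pseudo-random pick is to start the scan from a random token:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.concurrent.ThreadLocalRandom;

try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
     Session session = cluster.connect("persondb")) {
    // rows come back in token order, not in id order
    for (Row row : session.execute("SELECT token(id), id FROM Persons LIMIT 5")) {
        System.out.println(row.getLong(0) + " -> " + row.getInt("id"));
    }
    // pseudo-random pick: start from a random Murmur3 token and take the first row;
    // if the random token falls past the last row, nothing comes back and you retry
    long randomToken = ThreadLocalRandom.current().nextLong();
    Row randomRow = session.execute(
        "SELECT id, lastname, firstname FROM Persons WHERE token(id) >= ? LIMIT 1",
        randomToken).one();
    if (randomRow != null) {
        System.out.println(randomRow.getInt("id") + " " + randomRow.getString("lastname"));
    }
}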