Spark SQL : INSERT Statement with JDBC does not support default value - apache-spark

I am trying to read/write data from other databases using JDBC, just following the doc: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
But I found that Spark SQL does not work well with DEFAULT values or AUTO_INCREMENT columns.
CREATE TEMPORARY VIEW jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:postgresql:dbserver",
dbtable "schema.tablename",
user 'username',
password 'password'
);
INSERT INTO TABLE jdbcTable (id) VALUES (1)
Here is my DDL
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`age` int(11) NULL DEFAULT 0,
PRIMARY KEY (`id`) USING BTREE
)
The error is: org.apache.spark.sql.AnalysisException: unknown requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).
Is there any way to support DEFAULT values or AUTO_INCREMENT? Thanks.

I have discovered this same issue with columns that have DEFAULT values and also with COMPUTED columns. If you are using SQL Server you can consider an AFTER INSERT trigger; otherwise you may need to calculate the id on the INSERT side.
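As a minimal sketch of the latter workaround in Spark SQL, assuming the jdbcTable view defined above: supply a value for every column so the column-count check passes, and compute the id on the INSERT side instead of relying on AUTO_INCREMENT (the values shown are illustrative).
INSERT INTO TABLE jdbcTable VALUES (1, 0)
-- or derive the next id from the current maximum (not safe with concurrent writers):
INSERT INTO TABLE jdbcTable SELECT MAX(id) + 1, 0 FROM jdbcTable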

Related

Partitioned table inside a colocated database in YugabyteDB YSQL

[Question posted by a user on YugabyteDB Community Slack]
Is it possible to create a partitioned table in a colocated database?
When the database is created with colocated=true and I try to add a partitioned table like this:
create table test(id bigserial not null, PRIMARY KEY(id HASH)) PARTITION BY RANGE WITH (colocated = false);
I’m getting an error:
Query 1 ERROR: ERROR: syntax error at or near “WITH” LINE 3: PRIMARY KEY(id HASH)) PARTITION BY RANGE WITH (colocated=t…
Is it possible to do this or should I think about some other approach? I’m trying to do geo-partitioning and at the same time have some of the tables colocated.
The syntax is wrong: you need to specify which columns to PARTITION BY RANGE. For example, PARTITION BY RANGE (id) (but then why is it a hash primary key?)
You can't have a hash-partitioned table for colocation. In your case, since the table is partitioned, it should work (as long as you fix the syntax error), but not all partitions under it can be colocated.
Taking into account the above, you can have something like:
create table new (id bigserial not null, PRIMARY KEY (id ASC)) partition by range(id);
create table new_1 partition of new for values from (5) to (10) with (colocated = false);
create table new_2 partition of new for values from (20) to (30) with (colocated = true);
You can't shard by hash if you want to set colocated=true. It works fine with colocated=false:
create table new (id bigserial not null, value text) partition by range(id);
create table new_1 partition of new (primary key(id hash)) for values from (0) to (5) with (colocated = false);
create table new_2 partition of new (primary key(id hash)) for values from (5) to (10) with (colocated = false);

Spark JDBC - Read -> update -> write huge table without primary key

I am trying to update a few fields of each row of a big MySQL table (having close to 500 million rows). The table doesn't have any primary key (or it has a string primary key like a UUID). I don't have enough executor memory to read and hold the entire data at once. Can anyone please let me know what my options are for processing such tables?
Below is the schema
CREATE TABLE Persons (
Personid varchar(255) NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) DEFAULT NULL,
Email varchar(255) DEFAULT NULL,
Age int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Spark code is like
SparkSession spark = SparkSession.builder().master("spark://localhost:7077").appName("KMASK").getOrCreate();
// options holds the JDBC connection settings (url, dbtable, user, password), reused for the write below
Dataset<Row> rawDataFrame = spark.read().format("jdbc").options(options).load();
rawDataFrame.createOrReplaceTempView("data");
// encrypt is a UDF registered on this SparkSession beforehand
String sql = "select Personid, LastName, FirstName, encrypt(Email) as Email, Age from data";
Dataset<Row> newData = spark.sql(sql);
newData.write().mode(SaveMode.Overwrite).format("jdbc").options(options).save();
This table has around 150 million records and the data is around 6 GB. My executor memory is just 2 GB. Can I process this table using Spark JDBC?
Ideally you can tune the Spark JDBC fetchsize option to reduce/increase how many records are fetched and processed each time.
Partitioning the data can also help to reduce shuffles and additional overhead. Since you have Age as a numeric field, you can process the data in partitions determined by Age: first determine the min and max age, then use the Spark JDBC options.
Notably:
partitionColumn : Age
lowerBound : min age you identified
upperBound : max age you identified
numPartitions : depends largely on the number of cores and worker nodes; see the Spark JDBC data source options in the docs for more hints
You may also use custom queries with the query option to select and update only a subset of records that can fit in memory. N.B.: when using the query option you should not use the dbtable option.
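As a rough sketch of those options put together, reusing the CREATE TEMPORARY VIEW form from the Spark JDBC docs (the connection settings, bounds, and view name are illustrative placeholders for your environment):
CREATE TEMPORARY VIEW personsJdbc
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:mysql://dbserver:3306/mydb",
dbtable "Persons",
user 'username',
password 'password',
partitionColumn 'Age', -- numeric column used to split the read
lowerBound '1', -- min age you identified
upperBound '100', -- max age you identified
numPartitions '16', -- tune to cores and worker nodes
fetchsize '10000' -- rows fetched per round trip
)
partitionColumn, lowerBound, upperBound, and numPartitions must be set together; they only control how the read is split across tasks, not which rows are returned.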

Alter column in table to auto increment in Sybase ASE 16.0

I am using a Sybase ASE 16.0 database in which I am trying to alter a column in an existing USER table so that it auto-increments every time a row is added to the table. The column user_id is set as the primary key and NOT NULL.
I have gone through many Sybase tutorials and have tried many approaches, but to no avail. Here are some queries that I wrote to make this change:
ALTER TABLE USER (user_id smallint IDENTITY not null)
ALTER TABLE USER ALTER user_id smallint IDENTITY not null
ALTER TABLE USER MODIFY user_id smallint NOT NULL IDENTITY
ALTER TABLE USER MODIFY user_id smallint NOT NULL AUTO_INCREMENT
ALTER TABLE USER MODIFY user_id smallint NOT NULL AUTOINCREMENT
ALTER TABLE USER ALTER user_id smallint NOT NULL AUTOINCREMENT
ALTER TABLE USER user_id smallint AUTOINCREMENT
I expect a Sybase-compliant query that would alter the user_id column in the table to auto-increment by 1 when a new record is added.
From documentation:
Adds an IDENTITY column to a table. For each existing row in the table, Adaptive Server assigns a unique, sequential column value. The IDENTITY column must be numeric or an integer type, with a scale of zero. The precision determines the maximum value (10^5 - 1, or 99,999) that can be inserted into the column:
alter table sales_daily
add ord_num numeric (5,0) identity
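A hedged adaptation to the USER table from the question might look like the line below (the new column name and precision are illustrative; as far as I can tell, ASE does not let you turn an existing column into an IDENTITY column in place, so the usual route is to add a new IDENTITY column or rebuild the table):
alter table USER add user_seq numeric(5,0) identity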

How to use a static column in ScyllaDB and Cassandra?

I am new to ScyllaDB and Cassandra, and I am facing some issues querying data from the table. Following is the schema I have created:
CREATE TABLE usercontacts (
userID bigint, -- userID
contactID bigint, -- Contact ID lynkApp userID
contactDeviceToken text, -- Device Token
modifyDate timestamp static ,
PRIMARY KEY (contactID,userID)
);
CREATE MATERIALIZED VIEW usercontacts_by_contactid
AS SELECT userID, contactID, contactDeviceToken
FROM usercontacts
WHERE contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- these need to be not null, as they are the primary key columns in the main
-- table; same structure as the main table
PRIMARY KEY(userID, contactID);
CREATE MATERIALIZED VIEW usercontacts_by_modifyDate
AS SELECT userID, contactID, contactDeviceToken, modifyDate
FROM usercontacts
WHERE contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- these need to be not null, as they are the primary key columns in the main
-- table; same structure as the main table
PRIMARY KEY (modifyDate, contactID);
I want to create materialized views for the contacts table, usercontacts_by_userid and usercontacts_by_modifydate.
I need the following queries to work when I set modifydate (timestamp) static:
update usercontacts set modifydate="newdate" where contactid="contactid"
select * from usercontacts_by_modifydate where modifydate="modifydate"
delete from usercontacts where contactid="contactid"
It is not currently possible to create a materialized view that includes a static column, either as part of the primary key or just as a regular column.
Including a static column would require the whole base table (usercontacts) to be read when the static column is changed, so that the view rows could be re-calculated. This has a significant performance penalty.
Having the static column be the view's partition key means that there would only be one entry in the view for all the rows of a partition. However, secondary indexes do work in this case, and you can use one instead.
This is valid for both Scylla and Cassandra at the moment.
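A hedged sketch of that alternative on the schema above (the index name is made up and the timestamp value is a placeholder):
-- index the static column directly instead of creating a view on it
CREATE INDEX usercontacts_modifydate_idx ON usercontacts (modifyDate);
-- then query through the index
SELECT * FROM usercontacts WHERE modifyDate = '2017-01-01 00:00:00';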

Cassandra does not support DELETE on indexed columns

Say I have a Cassandra table xyz with the following schema:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid));
I create indexes on the columns fileid and sid:
CREATE INDEX file_index ON xyz (fileid);
CREATE INDEX sid_index ON xyz (sid);
I insert data :
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p120' , 1, 100);
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p120' , 1, 101);
INSERT INTO xyz (xyzid, name , fileid , sid ) VALUES ( now(), 'p122' , 2, 101);
I want to delete data using the indexed columns:
DELETE from xyz WHERE fileid=1 and sid=101;
Why do I get this error?
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY fileid found in where clause"
Is it mandatory to specify the primary key in the WHERE clause for delete queries?
Does Cassandra support deletes using secondary indexes?
What has to be done to delete data using secondary indexes?
Any suggestions would help.
I am using DataStax Community Cassandra 2.1.8, but I also want to know whether delete using indexed columns is supported by DataStax Community Cassandra 3.2.1.
Thanks
Let me try and answer your questions in order:
1) Yes, if you are going to use a where clause in a CQL statement then the partition key must be constrained with an equality operator in the where clause. Other than that, you are only allowed to filter on clustering columns specified in your primary key (unless you have a secondary index).
2) No, it does not. See this post for some more information, as it is essentially the same problem:
Why can cassandra "select" on secondary key, but not update using secondary key? (1.2.8+)
3) Why not add sid as a clustering column in your primary key? This would allow you to do the delete or query using both, as you have shown:
create table xyz(
xyzid uuid,
name text,
fileid int,
sid int,
PRIMARY KEY(xyzid, sid));
4) In general, using secondary indexes is considered an anti-pattern (a bit less so with SASI indexes in C* 3.4), so my question is: can you add these fields as clustering columns to your primary key? How are you querying these secondary indexes?
I suppose you can perform the delete in two steps:
1) Select data by the secondary index and get the primary key column values (xyzid) from the query result.
2) Perform the delete by those primary key values.
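A rough sketch of those two steps in CQL against the xyz table above (the bind marker stands in for each xyzid returned by the first query):
-- step 1: look up the primary key values via the indexed columns
SELECT xyzid FROM xyz WHERE fileid = 1 AND sid = 101 ALLOW FILTERING;
-- step 2: delete each matching row by its primary key (prepared statement, bind xyzid)
DELETE FROM xyz WHERE xyzid = ?;
With both indexed columns in the predicate, Cassandra uses one index and filters on the other, hence ALLOW FILTERING; querying by a single indexed column would not need it.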
