Spark DataFrame: save to SQL table with auto-increment column

I have the following table in the DB:
+----------------+------------+------+-----+---------+----------------+
| Field          | Type       | Null | Key | Default | Extra          |
+----------------+------------+------+-----+---------+----------------+
| id             | bigint(20) | NO   | PRI | NULL    | auto_increment |
| VERSION        | bigint(20) | NO   |     | NULL    |                |
| user_id        | bigint(20) | NO   | MUL | NULL    |                |
| measurement_id | bigint(20) | NO   | MUL | NULL    |                |
| day            | timestamp  | NO   |     | NULL    |                |
| hour           | tinyint(4) | NO   |     | NULL    |                |
| hour_timestamp | timestamp  | NO   |     | NULL    |                |
| value          | bigint(20) | NO   |     | NULL    |                |
+----------------+------------+------+-----+---------+----------------+
I'm trying to save a Spark DataFrame holding multiple rows with the following case class structure:
case class Record(val id: Int,
                  val VERSION: Int,
                  val user_id: Int,
                  val measurement_id: Int,
                  val day: Timestamp,
                  val hour: Int,
                  val hour_timestamp: Timestamp,
                  val value: Long)
When I try to save the DataFrame to MySQL through the JDBC driver using:
dataFrame.insertIntoJDBC(...)
I get a primary key violation error:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '1' for key 'PRIMARY'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I tried setting id=0 as the default value for all the rows, and I also tried removing the id field from the case class; neither worked.
Can anyone help?
Thanks,
Tomer

Found it.
I had a SQL <-> Java column type issue.
According to https://www.cis.upenn.edu/~bcpierce/courses/629/jdkdocs/guide/jdbc/getstart/mapping.doc.html, BIGINT SQL columns should be represented as Long in Java.
After I changed my case class to:
case class Record(val id: Long,
                  val VERSION: Long,
                  val user_id: Long,
                  val measurement_id: Long,
                  val day: Timestamp,
                  val hour: Int,
                  val hour_timestamp: Timestamp,
                  val value: Long)
and set id=0 for all the records in the DataFrame, it worked.
Thanks
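For reference, a minimal PySpark sketch of the same write using the modern DataFrameWriter API (insertIntoJDBC was removed in later Spark releases), assuming the same data loaded as a PySpark DataFrame named dataFrame. The URL, table name, and credentials are placeholders; since Spark 2.x+ inserts by column name, dropping the generated id column lets MySQL assign the auto-increment values itself:
# Sketch: append over JDBC without the auto_increment column.
(dataFrame
    .drop("id")                                # let MySQL fill in the key
    .write
    .mode("append")
    .jdbc(
        "jdbc:mysql://localhost:3306/mydb",    # placeholder URL
        "measurements",                        # placeholder table name
        properties={"user": "...", "password": "...",
                    "driver": "com.mysql.jdbc.Driver"},
    ))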

Related

Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING

I need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each statement, whereas Spark completes the same thing in a few milliseconds, so I am planning to use Spark since it is faster. One downside of using Spark to get DDL statements from Hive is that it treats CHAR and VARCHAR columns as STRING and doesn't preserve the length information that goes with those data types, whereas Beeline preserves both the data type and the length information. I am using Spark 2.4.1 and Beeline 2.1.1.
Given below are the sample create table command and its show create table output.
Beeline Output:
Spark-Shell:
I wanted to know if there is any configuration on the Spark side to preserve the data type and length information for CHAR and VARCHAR data types. If there are other ways to get the DDL from Hive quickly, I would be fine with that too.
This answer is based on:
Hive 3.1.1
Spark 3.1.1
Create a simple table in Hive in the test database:
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col2":"true","id":"true"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see that Spark shows the columns correctly, including the VARCHAR(30) length.
Now let's create the same table in Hive through Beeline:
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col2":"true","id":"true"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
Now check it in spark-shell again:
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
It shows up correctly. So, in summary, you get the column definitions in Spark exactly as you defined them in Hive.
Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline 2.1.1", refers to older versions of Spark and Hive, which may have had such issues.
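If you can upgrade, note that Spark 3.1 reworked CHAR/VARCHAR handling (SPARK-33480). A quick PySpark check, sketched under the assumption of Spark 3.1+ where the spark.sql.legacy.charVarcharAsString flag exists; leave it at its default of false to keep the length information:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ddl-check")
         .enableHiveSupport()
         .getOrCreate())

# Setting this to "true" would bring back the old behaviour of
# collapsing CHAR/VARCHAR to STRING; the default is "false".
spark.conf.set("spark.sql.legacy.charVarcharAsString", "false")
spark.sql("SHOW CREATE TABLE test.etc").show(truncate=False)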

Why is my django unittest failing a constraint?

I have this model:
from datetime import datetime

from django.db.models import (AutoField, CharField, CheckConstraint,
                              DateField, F, IntegerField, Model, Q)

class TestopiaEvent(Model):
    event_id = AutoField(primary_key=True)
    name = CharField(max_length=255)
    start_date = DateField()
    end_date = DateField()
    testers_required = IntegerField()

    class Meta:
        constraints = [
            CheckConstraint(
                check=Q(start_date__lte=F('end_date'),
                        start_date__gte=datetime.now().date()),
                name='correct_datetime'
            )
        ]
And this test:
from datetime import datetime, timedelta

from django.test import TestCase

from webserver.models import TestopiaEvent  # app name inferred from the table name

class TestopiaEventTestCase(TestCase):
    def setUp(self):
        self.default_values = {
            'name': 'Testopia 1',
            'start_date': datetime.now().date(),
            'end_date': datetime.now().date() + timedelta(days=1),
            'testers_required': 1
        }
        self.testopia_event = TestopiaEvent(**self.default_values)

    def test_save_with_valid_model_check_database(self):
        self.assertIsNone(self.testopia_event.save())
And it fails with this error:
django.db.utils.IntegrityError: new row for relation "webserver_testopiaevent" violates check constraint "correct_datetime"
DETAIL: Failing row contains (1, Testopia 1, 2020-07-24 00:00:00+00, 2020-07-25 00:00:00+00, 1).
I don't understand why it is failing: it should only fail if the start date is before today and/or the start date is after the end date, and neither is the case here.
What have I done wrong? Thanks
Edit: Here are the PostgreSQL constraints:
testopia=# \d+ webserver_testopiaevent
Table "public.webserver_testopiaevent"
 Column           | Type                   | Collation | Nullable | Default                                                   | Storage  | Stats target | Description
------------------+------------------------+-----------+----------+-----------------------------------------------------------+----------+--------------+-------------
 event_id         | integer                |           | not null | nextval('webserver_testopiaevent_event_id_seq'::regclass) | plain    |              |
 name             | character varying(255) |           | not null |                                                           | extended |              |
 start_date       | date                   |           | not null |                                                           | plain    |              |
 end_date         | date                   |           | not null |                                                           | plain    |              |
 testers_required | integer                |           | not null |                                                           | plain    |              |
Indexes:
"webserver_testopiaevent_pkey" PRIMARY KEY, btree (event_id)
Check constraints:
"correct_datetime" CHECK (start_date >= statement_timestamp() AND start_date <= end_date)
Access method: heap
Now() returns a timestamp, so the generated check compares my DateField against statement_timestamp(); by the time the row is inserted, that timestamp is already past midnight, so a start_date equal to today's date fails the start_date >= statement_timestamp() comparison.
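A possible fix, sketched under the assumption that a date-to-date comparison is what's wanted and that the backend is PostgreSQL (which accepts volatile functions in CHECK constraints): TruncDate(Now()) compares date to date and is evaluated by the database per statement rather than frozen at migration time, so a start_date equal to today passes.
from django.db.models import CheckConstraint, F, Q
from django.db.models.functions import Now, TruncDate

class Meta:
    constraints = [
        CheckConstraint(
            # Compare dates, not a date against a timestamp: midnight
            # "today" is otherwise always before statement_timestamp().
            check=Q(start_date__lte=F('end_date'))
                  & Q(start_date__gte=TruncDate(Now())),
            name='correct_datetime',
        ),
    ]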

How to concatenate spark dataframe columns using Spark sql in databricks

I have two columns called "FirstName" and "LastName" in my dataframe; how can I concatenate these two columns into one?
| Id | FirstName | LastName |
|  1 | A         | B        |
I want to make it like this:
| Id | FullName |
|  1 | AB       |
My query looks like this, but it raises an error:
val kgt=spark.sql("""
Select Id,FirstName+' '+ContactLastName AS FullName from tblAA """)
kgt.createOrReplaceTempView("NameTable")
Here we go with the Spark SQL solution:
spark.sql("select Id, CONCAT(FirstName,' ',LastName) as FullName from NameTable").show(false)
OR
spark.sql("select Id, FirstName || ' ' || LastName as FullName from NameTable").show(false)
Or the equivalent with the PySpark DataFrame API:
from pyspark.sql import functions as F
df = df.withColumn('FullName', F.concat(F.col('FirstName'), F.col('LastName')))
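One caveat: concat returns NULL as soon as any input column is NULL, while concat_ws skips NULL inputs, so the latter is often safer for optional name parts. A small sketch on the same assumed df:
from pyspark.sql import functions as F

# concat_ws joins with a separator and ignores NULLs, so a missing
# LastName still yields the FirstName instead of a NULL FullName.
df = df.withColumn('FullName', F.concat_ws(' ', F.col('FirstName'), F.col('LastName')))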

typeOrm unique row

I'm trying to create an entity using TypeORM in my NestJS app, and it's not working as I expected.
I have the following entity:
import { Entity, PrimaryGeneratedColumn, PrimaryColumn, CreateDateColumn } from 'typeorm'

@Entity('TableOne')
export class TableOneModel {
    @PrimaryGeneratedColumn()
    id: number

    @PrimaryColumn()
    tableTwoID: number

    @PrimaryColumn()
    tableThreeID: number

    @CreateDateColumn()
    createdAt?: Date
}
This code generates a migration that creates a table like the example below:
+--------------+-------------+------+-----+----------------------+-------+
| Field        | Type        | Null | Key | Default              | Extra |
+--------------+-------------+------+-----+----------------------+-------+
| id           | int(11)     | NO   |     | NULL                 |       |
| tableTwoID   | int(11)     | NO   |     | NULL                 |       |
| tableThreeID | int(11)     | NO   |     | NULL                 |       |
| createdAt    | datetime(6) | NO   |     | CURRENT_TIMESTAMP(6) |       |
+--------------+-------------+------+-----+----------------------+-------+
That's OK. The problem is that I want the table to allow only one row per combination of tableTwoID and tableThreeID. What should I use in the entity so that the generated table behaves as I expect?
It should not allow rows like the second one in the example below:
+----+------------+--------------+----------------------------+
| id | tableTwoID | tableThreeID | createdAt                  |
+----+------------+--------------+----------------------------+
| 1  | 1          | 1            | 2019-10-30 19:27:43.054844 |
| 2  | 1          | 1            | 2019-10-30 19:27:43.819174 | <- should not allow the insert of this row
+----+------------+--------------+----------------------------+
Try marking the columns as unique with TypeORM's class-level decorator, placed above the entity class:
@Unique(['tableTwoID', 'tableThreeID'])
This is currently expected behavior from TypeORM. According to the documentation, if you have multiple @PrimaryColumn() decorators you create a composite key. The combination of the composite key columns must be unique (in your example above, '1' + '1' + '1' = '111' vs '2' + '1' + '1' = '211'). If you want each column to be unique along with being part of the composite primary key, you should be able to do something like @PrimaryColumn({ unique: true }).

Create Cassandra CQL with IN and ORDER BY

I need a CQL query to get all rows from the table based on the set of the current user's friends (I'm using IN for that) and sort them by created date.
I've tried playing with the partition key and clustering key, but have no ideas.
Here is my Cassandra table:
CREATE TABLE chat.news_feed(
    id_news_feed uuid,
    id_user_sent uuid,
    first_name text,
    last_name text,
    security int,
    news_feed text,
    image blob,
    image_preview text,
    image_name text,
    image_length int,
    image_resolution text,
    is_image int,
    created_date timestamp,
    PRIMARY KEY ((id_news_feed, id_user_sent), created_date))
WITH CLUSTERING ORDER BY (created_date DESC) AND comment = 'List of all news feed by link id';
and here is my CQL (formed in Java):
SELECT JSON id_news_feed, first_name, last_name, id_user_sent, news_feed, image_name, image_preview, image_length, created_date, is_image, image_resolution FROM chat.news_feed WHERE id_user_sent in (b3306e3f-1f1d-4a87-8a64-e22d46148316,b3306e3f-1f1d-4a87-8a64-e22d46148316) ALLOW FILTERING;
I could not run it because there is no partition key in the WHERE part of the CQL.
Is there any way I could get all rows created by a set of users, ordered by created date (I tried creating the table in different ways, but with no results yet)?
Thank you!
Unlike in relational databases, here you will probably need to denormalize the tables. First of all, you cannot efficiently query everything from a single table, and Cassandra does not support joins natively. I suggest splitting your table into several.
Let's start with the friends: the current user id should be the partition key, and the friend id should go in as a clustering column.
CREATE TABLE chat.user_friends (
    user_id uuid,
    friend_id uuid,
    first_name text,
    last_name text,
    security int,
    PRIMARY KEY ((user_id), friend_id));
Now you can find the friends of each particular user by querying as follows:
SELECT * FROM chat.user_friends WHERE user_id = 'a001-...';
or
SELECT * FROM chat.user_friends WHERE user_id = 'a001-...' and friend_id in ('a121-...', 'a156-...', 'a344-...');
Next, let's take care of the news feed. Before putting the remaining columns into this table, think about the desired query against it: the news feed needs to be filtered by user ids with an IN listing and at the same time be sortable by time. So we put the created_date timestamp as the clustering key and the friend's user_id as the partitioning key. Note that the timestamps will be sorted per user_id, not globally (you can re-sort them on the client side; see the sketch at the end of this answer). What's really important is to keep news_feed_id out of the primary key. This column may still contain a uuid, which is unique, but only as long as we don't want to query this table for a particular news feed by id. For that purpose we'd require a separate table (denormalization of the data) or a materialized view (which I won't cover in this answer, but which is quite a nice solution for some kinds of denormalization, introduced in Cassandra 3.0).
Here is the updated table:
CREATE TABLE chat.news_feed(
    id_user_sent uuid,
    first_name text,
    last_name text,
    security int,
    id_news_feed uuid,
    news_feed text,
    image blob,
    image_preview text,
    image_name text,
    image_length int,
    image_resolution text,
    is_image int,
    created_date timestamp,
    PRIMARY KEY ((id_user_sent), created_date))
WITH CLUSTERING ORDER BY (created_date DESC) AND comment = 'List of all news feed by link id';
Some example dataset:
cqlsh:ks_test> select * from news_feed ;
id_user_sent | created_date | first_name | id_news_feed | image | image_length | image_name | image_preview | image_resolution | is_image | last_name | news_feed | security
--------------------------------------+---------------------------------+------------+--------------------------------------+-------+--------------+------------+---------------+------------------+----------+-----------+-----------+----------
01b9b9e8-519c-4578-b747-77c8d9c4636b | 2017-02-23 00:00:00.000000+0000 | null | fd25699c-78f1-4aee-913a-00263912fe18 | null | null | null | null | null | null | null | null | null
9bd23d16-3be3-4e27-9a47-075b92203006 | 2017-02-21 00:00:00.000000+0000 | null | e5d394d3-b67f-4def-8f1e-df781130ea22 | null | null | null | null | null | null | null | null | null
6e05257d-9278-4353-b580-711e62ade8d4 | 2017-02-25 00:00:00.000000+0000 | null | ec34c655-7251-4af8-9718-3475cad18b29 | null | null | null | null | null | null | null | null | null
6e05257d-9278-4353-b580-711e62ade8d4 | 2017-02-22 00:00:00.000000+0000 | null | 5342bbad-0b55-4f44-a2e9-9f285d16868f | null | null | null | null | null | null | null | null | null
6e05257d-9278-4353-b580-711e62ade8d4 | 2017-02-20 00:00:00.000000+0000 | null | beea0c24-f9d6-487c-a968-c9e088180e73 | null | null | null | null | null | null | null | null | null
63003200-91c0-47ba-9096-6ec1e35dc7a0 | 2017-02-21 00:00:00.000000+0000 | null | a0fba627-d6a7-463c-a00c-dd0472ad10c5 | null | null | null | null | null | null | null | null | null
And the filtered one:
cqlsh:ks_test> select * from news_feed where id_user_sent in (01b9b9e8-519c-4578-b747-77c8d9c4636b, 6e05257d-9278-4353-b580-711e62ade8d4) and created_date >= '2017-02-22';
id_user_sent | created_date | first_name | id_news_feed | image | image_length | image_name | image_preview | image_resolution | is_image | last_name | news_feed | security
--------------------------------------+---------------------------------+------------+--------------------------------------+-------+--------------+------------+---------------+------------------+----------+-----------+-----------+----------
01b9b9e8-519c-4578-b747-77c8d9c4636b | 2017-02-25 00:00:00.000000+0000 | null | 26dc0952-0636-438f-8a26-6a3fef4fb808 | null | null | null | null | null | null | null | null | null
01b9b9e8-519c-4578-b747-77c8d9c4636b | 2017-02-23 00:00:00.000000+0000 | null | fd25699c-78f1-4aee-913a-00263912fe18 | null | null | null | null | null | null | null | null | null
6e05257d-9278-4353-b580-711e62ade8d4 | 2017-02-25 00:00:00.000000+0000 | null | ec34c655-7251-4af8-9718-3475cad18b29 | null | null | null | null | null | null | null | null | null
6e05257d-9278-4353-b580-711e62ade8d4 | 2017-02-22 00:00:00.000000+0000 | null | 5342bbad-0b55-4f44-a2e9-9f285d16868f | null | null | null | null | null | null | null | null | null
P.S. As you might notice, we got rid of the ALLOW FILTERING clause. Don't use ALLOW FILTERING in any application, as it carries a significant performance penalty; it is only suitable for looking up small chunks of data scattered across different partitions.
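Since the rows come back sorted only within each partition, a global timeline needs a client-side merge, as mentioned above. A minimal sketch with the DataStax Python driver; the contact point is a placeholder, and the uuids are the ones from the sample dataset:
from datetime import datetime
from uuid import UUID
from cassandra.cluster import Cluster
from cassandra.query import ValueSequence

cluster = Cluster(['127.0.0.1'])      # placeholder contact point
session = cluster.connect('chat')

friend_ids = [UUID('01b9b9e8-519c-4578-b747-77c8d9c4636b'),
              UUID('6e05257d-9278-4353-b580-711e62ade8d4')]
rows = session.execute(
    'SELECT * FROM news_feed WHERE id_user_sent IN %s AND created_date >= %s',
    (ValueSequence(friend_ids), datetime(2017, 2, 22)),
)
# Cassandra sorts per partition only; merge into one global ordering here.
feed = sorted(rows, key=lambda r: r.created_date, reverse=True)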
