I am trying to use a Scala case class to map a Cassandra table.
Some of my column names happen to be reserved keywords in Scala. Is there an easy way to map them?
For example:
Cassandra Table
Create Table cars (
id_uuid uuid,
new boolean,
type text,
PRIMARY KEY ((id_uuid))
)
// This declaration will fail as "new" and "type" are reserved keywords
scala> case class Cars (idUuid : String, new : Boolean, type: String)
Try this:
case class Cars (idUuid:String, `new`:Boolean, `type`:String)
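Note that the backticks are needed again wherever the escaped field is referenced, not just in the declaration. A minimal sketch (the sample values are made up):
case class Cars(idUuid: String, `new`: Boolean, `type`: String)
// Backticks are required when reading the fields as well:
val car = Cars("1b4e28ba-2fa1-11d2-883f-0016d3cca427", true, "sedan")
println(car.`new`)   // true
println(car.`type`)  // sedan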
I ran into some trouble in Hudi when deleting rows that share the same record key via spark-sql.
For example, I created a table with the record key set to empno:
CREATE TABLE emp_duplicate_pk (
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int,
deptno int,
tx_date string
)
using hudi
options(
type='cow'
,primaryKey='empno'
,payloadClass='org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload'
,preCombineField='tx_date'
,hoodie.cleaner.commits.retained='10'
,hoodie.keep.min.commits='20'
,hoodie.keep.max.commits='30'
,hoodie.index.type='SIMPLE'
,hoodie.sql.insert.mode='non-strict'
,hoodie.combine.before.insert='false'
,hoodie.combine.before.upsert='false'
,hoodie.merge.allow.duplicate.on.inserts='true'
);
Then I inserted some test records.
It seems Hudi allows duplicate primary keys with these attributes:
hoodie.sql.insert.mode='non-strict'
hoodie.combine.before.insert='false'
hoodie.combine.before.upsert='false'
hoodie.merge.allow.duplicate.on.inserts='true'
insert into emp_duplicate_pk values
(7369,'SMITH','CLERK',7902,'1980-12-17',800,100,20,'2022-11-17'),
(7499,'ALLEN','SALESMAN',7698,'1981-02-20',1600,300,30,'2022-11-17'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-13'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-14');
insert into emp_duplicate_pk values
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-15'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-16'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-17');
All of the inserted rows could be queried (screenshot omitted).
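For reference, a quick way to confirm the duplicates from a spark-shell (a hedged sketch; it assumes a SparkSession named spark with the Hudi catalog already configured):
// Count rows per empno: with the options above, empno=5233 should show 5 rows.
spark.sql("select empno, count(*) as cnt from emp_duplicate_pk group by empno").show()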
Then I deleted one record:
delete from emp_duplicate_pk where tx_date='2022-11-16';
It seems the delete removed all of the empno=5233 rows; only two rows were left (7369 and 7499) (screenshot omitted).
How can I delete exactly the tx_date='2022-11-16' row and keep the other rows? Can anyone help?
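To make the mismatch concrete, only one physical row satisfies the delete predicate; checking it before the delete (same hedged spark-shell assumption as above) returns a single row, yet the delete removes every empno=5233 row, which suggests the delete is resolved by the record key (empno) rather than by the full predicate:
// Expected: one row (empno=5233, tx_date='2022-11-16').
spark.sql("select * from emp_duplicate_pk where tx_date = '2022-11-16'").show()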
I am using Hive, and the IDE is Hue. I am trying out different column combinations to choose my partition key(s).
The definition of my original table is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
The above definition works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I am getting the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partition keys were date_year and date_month, MSCK worked properly.
Table definition of the table I am getting the error for is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
After this, the following query returns no rows:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link as it describes exactly the error I am getting, but when I use the solution provided there:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records where date_day is null.
There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions: one for each distinct value of date_day, and one more for all the null values. Is there a way to do so (partitioning by a column that contains null values)?
If this can be achieved with Spark, I am open to using it as well.
This is part of a bigger problem of changing partition keys by recreating the table, as mentioned in this link in the answer to my other question.
Thank you for your help.
You seem to misunderstand how Hive's partitioning works.
Hive stores data into files on HDFS (or S3, or some other distributed folders).
If you create a non-partitioned Parquet table called my_schema.my_table, you will see its files stored in your distributed storage under a folder such as:
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The MSCK REPAIR TABLE command lets you automatically reload the partitions when you create an external table.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
The table will be created but empty, because Hive has not detected the partitions yet. When you run MSCK REPAIR TABLE my_schema.my_table, Hive recognizes that your partition column p_col matches the folder layout on S3 (/p_col=value1/) and registers the partitions.
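As a quick check after the repair (a hedged sketch; since the question mentions being open to Spark, this assumes a Hive-enabled SparkSession named spark, but the same two statements also work from Hive/Hue):
// Register the partitions Hive has not discovered yet, then list what it found.
spark.sql("MSCK REPAIR TABLE my_schema.my_table")
spark.sql("SHOW PARTITIONS my_schema.my_table").show(truncate = false)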
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match the partition column actually used on S3, which was p_col.
This error is expected: the folder layout on S3 still follows the old partitioning scheme, so there is nothing for the new scheme to pick up.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate your new table with dynamic partitioning
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
`timestamp`,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
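One caveat worth hedging: because date_day is the only (and fully dynamic) partition column, Hive typically requires nonstrict dynamic-partition mode for this insert, and rows whose date_day is NULL normally land in the default partition (__HIVE_DEFAULT_PARTITION__), which effectively gives you the separate "null" partition asked about. If you prefer to run the copy from Spark instead of Hive, a minimal sketch (it assumes a Hive-enabled SparkSession named spark):
// Enable dynamic partitioning, then run the same INSERT ... SELECT shown above.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION (date_day)
  SELECT col_id, result_section__col2, result_section_col3, result_section_col4,
         result_section_col5, result_section_col6__label, result_section_col7__label_id,
         result_section_text, result_section_unit, result_section_col,
         result_section_title, result_section_title_id, col13, `timestamp`,
         date_year, date_month, date_day
  FROM my_hive_db.my_table_1
""")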
I have a case class which represents partition key values.
case class UserKeys (bucket:Int,
email: String)
I create the query clauses as follows:
def conditions(id: UserKeys): List[Clause] = List(
QueryBuilder.eq("bucket", id.bucket), //TODOM - pick table description from config/env file.
QueryBuilder.eq("email", id.email)
)
And I use the query as follows:
val selectStmt =
select()
.from(tablename)
.where(QueryBuilder.eq(partitionKeyColumns(0), whereClauseList(0))).and(QueryBuilder.eq(partitionKeyColumns(1), whereClauseList(1)))
.limit(1)
I am getting the following error:
com.datastax.driver.core.exceptions.InvalidTypeException: Value 0 of type class com.datastax.driver.core.querybuilder.Clause$SimpleClause does not correspond to any CQL3 type
Question 1 - What am I doing wrong?
The query works in cqlsh.
The table I am querying is:
CREATE TABLE users (
bucket int,
email text,
firstname text,
lastname text,
authprovider text,
password text,
PRIMARY KEY ((bucket, email), firstname, lastname)
);
Question 2 - Is there a way to print the List which contains the query clauses? I tried it but I get this incomprehensible text.
List(com.datastax.driver.core.querybuilder.Clause$SimpleClause#2389b3ee, com.datastax.driver.core.querybuilder.Clause$SimpleClause#927f81)
My bad, I was using the query clauses incorrectly. Rather than
.where(QueryBuilder.eq(partitionKeyColumns(0), whereClauseList(0))).and(QueryBuilder.eq(partitionKeyColumns(1), whereClauseList(1)))
I needed to do
.where(whereClauseList(0)).and(whereClauseList(1))
because the List elements already contain the QueryBuilder.eq("bucket", id.bucket) part.
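Putting it together, the corrected statement looks roughly like this (a sketch; tablename and the conditions helper are the ones defined above, userKeys is assumed to be a UserKeys instance in scope, and select() is assumed to be statically imported from QueryBuilder as in the original snippet):
val whereClauseList = conditions(userKeys)
val selectStmt =
  select()
    .from(tablename)
    .where(whereClauseList(0))   // already QueryBuilder.eq("bucket", id.bucket)
    .and(whereClauseList(1))     // already QueryBuilder.eq("email", id.email)
    .limit(1)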
I'm modeling something like a one-to-many relationship with Cassandra CQL 3:
a service may have several responses, and each response belongs to one service.
I would like to know if there is a function that lets me select the clustering keys that sit inside the same partition key.
CREATE TABLE reponse (
service_id uuid,
service_nom text static,
service_timeout boolean static,
reponse_id uuid,
reponse_type boolean,
reponse_nom text,
reponse_valide boolean,
reponse_message text,
PRIMARY KEY((service_id), reponse_type, reponse_id )
);
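To make the ask concrete, here is a hedged sketch with the same QueryBuilder API used earlier in this thread (session and serviceId are assumed to exist); restricting on the partition key returns every clustering row of that partition, and clustering columns can then be narrowed in their declared order:
import com.datastax.driver.core.querybuilder.QueryBuilder

// All responses stored under one service_id (one partition):
val allForService = QueryBuilder.select().from("reponse")
  .where(QueryBuilder.eq("service_id", serviceId))

// Optionally narrow by the first clustering column as well:
val validOnly = QueryBuilder.select().from("reponse")
  .where(QueryBuilder.eq("service_id", serviceId))
  .and(QueryBuilder.eq("reponse_type", true))

session.execute(allForService)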
Thanks in advance!
I'm using Spark Cassandra Connector to update counters in this table:
CREATE TABLE IF NOT EXISTS analytics.minute_usage_stats (
metric_date timestamp,
user_id uuid,
metric_name text,
h1_m1 counter,
h1_m2 counter,
h1_m3 counter,
...
h24_m60 counter,
PRIMARY KEY ((metric_date), user_id, metric_name)
);
The code looks something like this:
class Metric(metricName: String, metricDate: Date, userId: String, ???)
metricEventStream
.map(event => {
// Parsing logic
new Metric(metricName, metricDate, userId, h1M1???)
})
.saveToCassandra("analytics", "minute_usage_stats")
I need to be able to update one counter column at a time and the columns could be different for each item in the map.
Looking through the documentation, there doesn't seem to be a way to do this without using the Cassandra driver itself and losing all the great functionality of the Spark connector.
Is there any way to specify column names per RDD row with the current Spark Cassandra Connector?
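For reference, the standard save path pins the column list once per RDD/DStream, which is exactly the limitation described above. A hedged sketch (it assumes the connector's streaming import and that the metric fields are already extracted by the parsing logic):
import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

metricEventStream
  .map(event => {
    // Parsing logic (as above); the increment value is paired with a fixed column.
    (metricDate, userId, metricName, 1L)
  })
  .saveToCassandra("analytics", "minute_usage_stats",
    SomeColumns("metric_date", "user_id", "metric_name", "h1_m1"))

// The column list ("h1_m1" here) is fixed for the whole stream, so it cannot
// vary per row, which is why this call does not cover the per-row case.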