Invalid Column name Error in DSE Analytics Spark - apache-spark

I have a table whose structure is roughly as follows:
CREATE TABLE keyspace_name.table_name (
id text PRIMARY KEY,
type text,
bool_yn boolean,
created_ts timestamp,
modified_ts timestamp
)
Recently I added a new column to the table:
alter table keyspace_name.table_name add first_name text;
When I query the new column in cqlsh, it returns results. For example:
select first_name from keyspace_name.table_name limit 10;
But if I run the same query in dse spark-sql, it gives me the following error:
Error in query: cannot resolve 'first_name' given input columns: [id, type, bool_yn, created_ts, modified_ts];
I don't know what's wrong in spark-sql. I've tried nodetool repair, but the problem still persists.
Any help would be appreciated. Thanks

If the table schema changes, the Spark metastore doesn't automatically pick up the change. Manually remove the old table from Spark SQL with a DROP TABLE command, then run SHOW TABLES; the table will be re-created automatically with the latest schema. This does not change the data in Cassandra.
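In the spark-sql shell, using the keyspace and table names from the question, that would look roughly like this:
DROP TABLE keyspace_name.table_name;   -- removes only the stale Spark SQL metastore entry; per the answer above, Cassandra data is untouched
SHOW TABLES;                           -- re-registers the table with the current Cassandra schema
SELECT first_name FROM keyspace_name.table_name LIMIT 10;   -- the new column now resolves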

Related

How to add new column into partition by clause in Hive External Table

I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). I have now modified the Spark code and added one extra column, 'country'. In earlier written data the country column will have null values, as it is newly added. Now I want to alter the 'partitioned by' clause to partitioned by (event_date date, country string). How can I achieve this? Thank you!
Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks Spark SQL language manual for the ALTER command.
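For example (the table name, partition value, and path here are made up for illustration):
ALTER TABLE my_events PARTITION (event_date='2019-08-06') SET LOCATION 'hdfs:///warehouse/my_events/event_date=2019-08-06';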

Automatically Updating a Hive View Daily

I have a requirement I want to meet. I need to Sqoop data over from a DB to Hive, and I am sqooping on a daily basis since this data is updated daily.
This data will be used as lookup data from a Spark consumer for enrichment. We want to keep a history of all the data we have received, but we don't need all of it for lookups, only the latest data (same day). I was thinking of creating a Hive view over the historical table that only shows records inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
Q: Is there a way to automate the view on a daily basis so that the
view query will always have the latest data?
No need to update/automate the process if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we
don't need all the data for lookup only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid scanning the full table (a full table scan) just to get the latest partition's data.
Option 1: Hive approach to query the data
If you want to adopt the Hive approach, you have to use a partitioned table in Hive with a partition column, for example partition_date (a sketch of such a table appears after the queries below):
select * from db.yourpartitionedTable where partition_date in
(select max(partition_date) from db.yourpartitionedTable)
or
select * from (select *,dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable ) myview
where myview.dt_rnk=1
This will always give the latest partition from the partitioned table (if today's date is present in the partition data it returns that day's data; otherwise it returns the data for the max partition_date).
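For context, a minimal sketch of the kind of date-partitioned Hive table these queries assume (the table and column names are placeholders):
CREATE TABLE db.yourpartitionedTable (
  user_id int,
  country string
)
PARTITIONED BY (partition_date string)
STORED AS orc;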
Option 2: Plain Spark approach to query the data
With the Spark show partitions command, i.e. spark.sql(s"show partitions $yourpartitionedtablename"), get the result into an array and sort it to find the latest partition date. Using that, you can query only the latest partition as lookup data from your Spark component.
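In Spark SQL terms the flow is roughly the following (the partition value in the final query is a placeholder for whatever the sort returns):
SHOW PARTITIONS db.yourpartitionedTable;
-- the driver collects and sorts these partition strings to find the newest one,
-- then queries only that partition, for example:
SELECT * FROM db.yourpartitionedTable WHERE partition_date = '2019-08-07';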
See my answer for an idea of how to get the latest partition date.
I prefer Option 2 since no Hive query is needed and there is no full table scan: we are only using the show partitions command, so there are no performance bottlenecks and it will be fast.
One more idea is querying with HiveMetastoreClient, or combining it with Option 2; see this, my answer, and the other one.
I am assuming that you are loading daily transaction records into your history table with some last modified date. Every time you insert or update a record in the history table, its last_modified_date column gets updated; it could be a date or a timestamp.
You can create a view in Hive that fetches the latest data using an analytical function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am inserting a few sample records; note that the same id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
creating a view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from ( select user_id, country, last_modified_date,
max(last_modified_date) over (partition by user_id) as max_modified
from db.test_data ) as sub
where last_modified_date = max_modified
;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
It shows us the rows with the max date only.
If you then insert another record with a newer last modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual

Get column type of a table using cql command

I am trying to get the column type of a table using a CQL command.
My table:
CREATE TABLE users (
id uuid,
name text);
Now I am trying to get the type of the name column. With the help of some select query, I want to get text as the output.
My use case is: I want to drop the name column only if its type is text.
What script should I try?
From CQL you can read this data from the system tables. In Cassandra 3.x, this information is located in the system_schema.columns table, which has the following schema:
CREATE TABLE system_schema.columns (
keyspace_name text,
table_name text,
column_name text,
clustering_order text,
column_name_bytes blob,
kind text,
position int,
type text,
PRIMARY KEY (keyspace_name, table_name, column_name)
) WITH CLUSTERING ORDER BY (table_name ASC, column_name ASC);
so you can use a query like this to retrieve the data:
select type from system_schema.columns where keyspace_name = 'your_ks'
and table_name = 'users' and column_name = 'name';
In Cassandra 2.x, the structure of the system tables is different, so you may need to adapt your query.
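For reference, a rough sketch of the equivalent lookup on Cassandra 2.x, where the table is system.schema_columns and the type comes back as a validator class name rather than a plain CQL type (verify the exact column names on your version):
select validator from system.schema_columns
where keyspace_name = 'your_ks' and columnfamily_name = 'users' and column_name = 'name';
-- e.g. returns 'org.apache.cassandra.db.marshal.UTF8Type' for a text column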
If you're accessing the cluster programmatically, the driver hides the differences between Cassandra versions, and you can use something like the Metadata class from the Java driver to get information about a table's structure and the types of its columns. But if you're doing schema changes programmatically, you must be careful to explicitly wait for schema agreement, as in the following example.
For example, to find all counter columns in a given keyspace:
select keyspace_name, table_name, column_name, type from system_schema.columns WHERE type='counter' AND keyspace_name='someName' LIMIT 100 ALLOW FILTERING;

Cassandra Data Model design for vnodes enabled cluster?

I have recently started working with Cassandra. We have a Cassandra cluster running DSE 4.0 with vnodes enabled, and we have tables like this.
Below is my first table:
CREATE TABLE customers (
customer_id int PRIMARY KEY,
last_modified_date timeuuid,
customer_value text
)
The read query pattern on the above table is as follows, since we need to get everything from it and load it into our application memory every x minutes:
select customer_id, customer_value from datakeyspace.customers;
We have a second table like this:
CREATE TABLE client_data (
client_name text PRIMARY KEY,
client_id text,
creation_date timestamp,
is_valid int,
last_modified_date timestamp
)
CREATE INDEX idx_is_valid_clnt_data ON client_data (is_valid);
Right now the above table has 500 records, and all of them have the "is_valid" column set to 1. The read query pattern is similar, since we need to get everything from this table and load it into our application memory every x minutes, so the query below returns all 500 records (everything has is_valid set to 1):
select client_name, client_id from datakeyspace.client_data where is_valid=1;
Since our cluster has vnodes enabled, the above query pattern is not efficient at all, and it takes a lot of time to get the data from Cassandra: around 50 seconds from the cqlsh client. We are reading from these tables with consistency level QUORUM.
Is there any possibility of improving our data model by using the wide rows concept or anything else?
Any suggestions will be greatly appreciated.

Brisk cassandra TimeUUIDType

I am using Brisk. Cassandra column families automatically map to Hive tables.
However, if the data type is timeuuid in the column family, it is unreadable in the Hive tables.
For example, I used the following command to create an external table in Hive to map a column family:
Hive > create external table A (rowkey string, column_name string, value string)
> STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
> WITH SERDEPROPERTIES (
> "cassandra.columns.mapping" = ":key,:column,:value");
If the column name is of TimeUUIDType in Cassandra, it becomes unreadable in the Hive table.
For example, a row in the Cassandra column family looks like:
RowKey: 2d36a254bb04272b120aaf79d70a3578
=> (column=29139210-b6dc-11df-8c64-f315e3a329d6, value={"event_id":101},timestamp=1283464254261)
where the column name is a TimeUUIDType.
In the Hive table, it looks like the following row:
2d36a254bb04272b120aaf79d70a3578 t��ߒ4��!�� {"event_id":101}
So the column name is unreadable in the Hive table.
This is a known issue with the automatic table mapping. For best results with a TimeUUIDType, turn the auto-mapping feature off in $brisk_home/resources/hive/hive-site.xml ("cassandra.autoCreateHiveSchema") and create the table in Hive manually.
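Once auto-creation is off, the manual definition can mirror the one shown in the question (the table name here is just an example):
create external table event_log (rowkey string, column_name string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");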
