I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). I then try to write the contents of 'mytable' into a partitioned Hive table using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) -- (1): I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate -- (2): I have a field named 'TestDate'; (1) and (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because the two names are the same, i.e. testdate (the partition in Hive) and TestDate (the field in the temp view 'mytable').
If the partition name differs from the field name (TestDate), however, the query executes successfully. For example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) -- note that the partition name here is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark, but I would like a second opinion. Am I missing something here?
@DuduMarkovitz @dhee: apologies for the late response. I was finally able to resolve the issue. Earlier I had created the table with camelCase field names in the CREATE statement, which seems to have been the reason for the exception. I have now created the table using DDL in which the field names are lower case, and this has resolved my issue.
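To keep the DataFrame side consistent with lower-case DDL, the column names could also be folded to lower case before the view is registered. The following is only a minimal sketch in plain Python: lowercase_columns is a hypothetical helper, not part of Spark, and I have not verified that this step alone avoids the exception.

```python
def lowercase_columns(columns):
    """Return the column names folded to lower case.

    Raises ValueError if two names become identical after folding,
    since such a pair cannot coexist in a case-insensitive schema.
    """
    seen = {}      # folded name -> original name
    result = []
    for name in columns:
        folded = name.lower()
        if folded in seen:
            raise ValueError(
                f"columns {seen[folded]!r} and {name!r} collide after lower-casing"
            )
        seen[folded] = name
        result.append(folded)
    return result


# Assumed usage with a PySpark DataFrame named df (not executed here):
#   df = df.toDF(*lowercase_columns(df.columns))
#   df.createOrReplaceTempView('mytable')
print(lowercase_columns(["Field1", "Field2", "TestDate"]))
```

The collision check is there because a name pair such as TestDate/testdate would silently turn into a duplicate column otherwise.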
I am trying to create a partitioned table in Hive on Spark and load it with data from another Hive table.
I am getting the following error while loading the data:
Error: org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException:
Partition spec {cardsuit=, cardcolor=, cardSuit=SPA, cardColor=BLA}
contains non-partition columns;
These are the commands I used:
create table if not exists hive_tutorial.hive_table(color string, suit string,value string) comment 'first hive table' row format delimited fields terminated by '|' stored as TEXTFILE;
LOAD DATA LOCAL INPATH 'file:///E:/Kapil/software-study/Big_Data_NoSql/hive/deckofcards.txt' OVERWRITE INTO TABLE hive_table; --data is correctly populated(52 rows)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create table if not exists hive_tutorial.hive_table_partitioned(color string, suit string,value int) comment 'first hive table' partitioned by (cardSuit string,cardColor string) row format delimited fields terminated by '|' stored as TEXTFILE;
INSERT INTO TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
-- alternatively, I tried
INSERT OVERWRITE TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
Sample of the data:
BLACK|SPADE|2
BLACK|SPADE|3
BLACK|SPADE|4
BLACK|SPADE|5
BLACK|SPADE|6
BLACK|SPADE|7
BLACK|SPADE|8
BLACK|SPADE|9
I am using Spark 2.2.0 and Java 1.8.0_31.
I have checked and tried the answers given in a similar thread, but they did not solve my problem:
SemanticException Partition spec {col=null} contains non-partition columns
Am I missing something here?
After carefully rereading the scripts above, I noticed that the value column is declared as int in the partitioned table, whereas it is a string in the original table. The error message was misleading!
I have many tables per keyspace, so I would like to filter the tables by name. I tried this query, but it does not give the result I want:
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name >= 'test_001_%';
The output shown is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
'test_002_metadata'
'test_002_time1'
'test_002_time2'
'test_002_time3'
What I really want is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
The other way out is to use the LIKE keyword by creating a secondary index on table_name. But I am a bit skeptical, since this might cause problems on a system table. Another concern: does a clustering column actually support a secondary index?
Create a SASI index with mode CONTAINS on the table_name column, after removing the previous index, and try the query as follows:
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name LIKE '%test_001_%';
The command to create a SASI index with mode contains is as follows:
CREATE CUSTOM INDEX ON system_schema.tables(table_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'case_sensitive': 'false', 'tokenization_normalize_uppercase': 'true', 'mode': 'CONTAINS'}
As for your second question, you cannot create a secondary index on anything that is part of the PRIMARY KEY.
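As a side note, the original query returned the test_002 tables because `>=` treats `%` as an ordinary character and sets no upper bound. Since table_name is a clustering column, a prefix match can also be written as a closed range without any index, for example table_name >= 'test_001_' AND table_name < 'test_001`'. The sketch below shows how such bounds can be derived; prefix_bounds is a hypothetical helper, and it assumes ASCII names so that Python's string ordering matches the byte ordering of the text type.

```python
def prefix_bounds(prefix):
    """Return (lower, upper) such that lower <= s < upper holds
    exactly for the strings s that start with prefix.

    Assumes a non-empty ASCII prefix whose last character is not the
    highest ASCII character.
    """
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Increment the last character to get the first string that sorts
    # after every string starting with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


lower, upper = prefix_bounds("test_001_")

# The same comparison applied client-side to the names from the question.
names = [
    "test_001_metadata", "test_001_time1", "test_001_time2",
    "test_001_time3", "test_001_time4",
    "test_002_metadata", "test_002_time1",
]
matching = [n for n in names if lower <= n < upper]
print(lower, upper, matching)
```

The two bounds can then be passed as the values of the `>=` and `<` restrictions on table_name.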
How can I insert JSON objects into a Cassandra table without creating the table first? Can Cassandra parse JSON into a table that has not been created? Or can I create a table with no columns and insert JSON?
Thanks.
Since Cassandra 2.2 you can insert JSON directly, but the table must still be created beforehand.
You need to create the table first; then you can insert data.
You can create a table like the one below:
CREATE TABLE json_data (
id timeuuid PRIMARY KEY,
data text
);
And you can insert the JSON as a string with the query below:
INSERT INTO json_data (id , data ) VALUES ( now(), '{"first_name" : "Ashraful", "last_name" : "Islam"}') ;
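From application code, the JSON text is usually produced by a serializer rather than typed by hand. Here is a minimal sketch in Python, assuming the json_data table above; json_insert_params is a hypothetical helper, and the `%s` placeholder style is the one I would expect a Python driver to accept, so treat the commented call as an assumption.

```python
import json

# Statement for the json_data table above; the JSON text is passed as a
# bound value instead of being pasted into the query string.
INSERT_CQL = "INSERT INTO json_data (id, data) VALUES (now(), %s)"


def json_insert_params(record):
    """Serialize a dict to JSON text and return it as a parameter tuple."""
    return (json.dumps(record),)


params = json_insert_params({"first_name": "Ashraful", "last_name": "Islam"})

# Assumed usage with an already connected driver session (not executed here):
#   session.execute(INSERT_CQL, params)
print(INSERT_CQL, params)
```

Binding the value avoids having to escape quotes that may appear inside the JSON document.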
I'd like to switch an existing system that imports data from CSV files into a PostgreSQL 9.5 database to a more efficient one.
I'd like to use the COPY statement because of its good performance. The problem is that I need to populate one field that is not in the CSV file.
Is there a way to have the COPY statement add a static field to all the inserted rows?
The perfect solution would have looked like this:
COPY data(field1, field2, field3='Account-005')
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER;
Do you know a way to have that field populated in every row?
My server runs node.js, so I'm open to any cost-efficient solution that completes the files using node before COPYing them.
Use a temp table to import into. Before inserting the new records into the actual table, this allows you to:
add/remove/update columns
add extra literal data
delete or ignore records (such as duplicates)
-- target table
CREATE TABLE data
( id SERIAL PRIMARY KEY
, batch_name varchar NOT NULL
, remote_key varchar NOT NULL
, payload varchar
, UNIQUE (batch_name, remote_key)
-- or::
-- , UNIQUE (remote_key)
);
-- temp table
CREATE TEMP TABLE temp_data
( remote_key varchar -- PRIMARY KEY
, payload varchar
);
COPY temp_data(remote_key,payload)
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER
;
-- The actual insert
-- (you could also filter out or handle duplicates here)
INSERT INTO data(batch_name, remote_key, payload)
SELECT 'Account-005', t.remote_key, t.payload
FROM temp_data t
;
BTW, it is possible to automate the above: put it into a function (or maybe a prepared statement) that takes the filename/literal as an argument.
Set a default for the column:
alter table data
alter column field3 set default 'Account-005';
Do not mention it in the COPY command:
COPY data(field1, field2) FROM...
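The question also mentions completing the files before COPYing them. That preprocessing step could look roughly like the sketch below, written in Python for illustration (the same idea carries over to node); add_constant_column and its parameters are hypothetical names, and the file is assumed to have a header row as in the question's COPY options.

```python
import csv
import io


def add_constant_column(src, dst, header_name, value):
    """Copy CSV rows from src to dst, appending one constant column.

    The first row is treated as the header. Returns the number of data
    rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to write
    writer.writerow(header + [header_name])
    count = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        writer.writerow(row + [value])
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file objects.
source = io.StringIO("field1,field2\na,b\nc,d\n")
target = io.StringIO()
rows_written = add_constant_column(source, target, "field3", "Account-005")
print(target.getvalue())
```

The completed file can then be loaded with a plain COPY that lists all three columns, at the cost of rewriting each file once.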
I'm using Cassandra 1.2.8. I have a column family like the one below:
CREATE TABLE word_probability (
word text,
category text,
probability double,
PRIMARY KEY (word,category)
);
When I use a query like this:
String query = "SELECT * FROM word_probability WHERE word='%s' AND category='%s';";
it works well, but for some words I get this error message:
name provided was not in the list of valid column labels
Everything looks OK, and I don't know why I get this error.
You're not doing anything wrong except mixing up CQL with SQL. CQL doesn't support % wildcards.