While trying to OVERWRITE a Hive table with specific columns from Spark (PySpark) using a DataFrame, I get the error below:
pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 36)\n\n== SQL ==\ninsert OVERWRITE table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n------------------------------------^^^\n"
Based on https://issues.apache.org/jira/browse/HIVE-9481, it looks like a column list is still not supported in INSERT OVERWRITE, but trying to run without the OVERWRITE keyword still gives me the same error:
sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")
Note: the above works fine when the specific column list is not specified and the columns between the source and the target table match.
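For reference, a minimal sketch of that working variant from PySpark (an illustration only; the column names and the "dataframe" temp view name are the placeholders used in the question):

# Sketch: align the DataFrame's columns with DB.TableName's schema,
# register it as a temp view, and insert without an explicit column list.
aligned_df = dataframe.select("Col1", "Col2", "Col3")
aligned_df.createOrReplaceTempView("dataframe")
sparkSession.sql(
    "insert OVERWRITE table DB.TableName select Col1, Col2, Col3 FROM dataframe"
)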
However, the same statement with a column list goes through fine when run from the Hive terminal:
INSERT INTO TABLE DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 from DB.TableName2;
Should any property or configuration be set, or passed through spark-submit?
Please do let me know if you need more data or information.
I have a table that looks like this:
select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4
I would like to get the top 3 absolute values for each record across the four columns, and then format the output as a dictionary or STRUCT like below:
select
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)] output
I would like this to be dynamic and avoid writing out the columns individually; there could be up to 100 columns, and they change.
For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI:
https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions
I have looked up some examples and would like something similar to the second answer here: How to get max value of column values in a record ? (BigQuery)
EDIT: the data is actually structured like this. If it can be worked with more easily, this would be a better option to work from:
select 'Alice' AS ID, STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT 'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions
Consider the query below.
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
Query results
Dynamic Query
I would like this to be dynamic and avoid writing out the columns individually
You need dynamic SQL for this. Referring to the answer from Mikhail that you linked in the post, you can write a dynamic query like the one below; the UNPIVOT column list is built at run time by JSON-serializing one sample row and extracting its field names with REGEXP_EXTRACT_ALL.
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
For the updated sample table:
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
I have a SQL statement such as:
INSERT INTO my_table (col1, col2, col3) VALUES (1,2,3)
I am using psycopg2 to insert data as follows:
cur.execute(
    sql.SQL("INSERT INTO {} VALUES (%s, %s, %s)").format(sql.Identifier('my_table')),
    [1, 2, 3]
)
I don't see how to specify column names in the insert statement, though. The above sql.SQL is "assuming" that 1, 2, 3 are in the order of col1, col2, and col3. For instance, when I want to insert only col3, how would I specify the column name with sql.SQL?
The execute call just runs the SQL code, so you can name the columns as in a standard PostgreSQL INSERT statement:
INSERT INTO TABLE_ABC (col_name_1, col_name_2, col_name_3) VALUES (1, 2, 3)
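If you want to keep composing the statement with psycopg2's sql module, a minimal sketch (using the table and column names from the question as placeholders) that names the columns dynamically could look like this:

from psycopg2 import sql

# Sketch: compose the column list with sql.Identifier and generate one
# placeholder per column, so inserting only col3 just means cols = ["col3"].
cols = ["col3"]
vals = [3]
query = sql.SQL("INSERT INTO {table} ({fields}) VALUES ({placeholders})").format(
    table=sql.Identifier("my_table"),
    fields=sql.SQL(", ").join(map(sql.Identifier, cols)),
    placeholders=sql.SQL(", ").join([sql.Placeholder()] * len(cols)),
)
cur.execute(query, vals)  # cur is the cursor already used in the question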
I have two tables, Table1 and Table2, with the structure below.
Table1: PkCol1, PkCol2, Col3, Col4, Col5
Table2: PkCol1, PkCol2, Col3, Col4, Col5
But I am getting the primary key information as input. For example, I receive it as PkCol1,PkCol2, and I may receive more primary key columns as input too.
How do I dynamically add my WHERE condition to Spark SQL?
Below is my code:
df.createOrReplaceTempView("Table1")
df2.createOrReplaceTempView("Table2")
val primaryKeyString = ar(1)
val df3 = spark.sql("Select * from table1 where "+primaryKeyString+" not in (select "+primaryKeyString+" from table2)").toDF()
If there is a better way to do it with DataFrames, let me know.
I am able to achieve my purpose by concatenating in Spark SQL as below:
val df3 = spark.sql("Select * from table1 where CONCAT("+primaryKeyString+") not in (select CONCAT("+primaryKeyString+") from table2)").toDF()
I am still trying to find out whether there is a better way to achieve this in Scala.
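As a DataFrame-only alternative, a left-anti join on the key columns avoids building SQL strings altogether. A sketch follows (shown in PySpark for brevity; the Scala DataFrame API uses the same method and argument names):

# Sketch: the key list stands in for the parsed input, e.g. ar(1).split(",").
primary_keys = ["PkCol1", "PkCol2"]

# Rows of df (Table1) whose key combination does not appear in df2 (Table2).
df3 = df.join(df2, on=primary_keys, how="left_anti")

Unlike the CONCAT-based NOT IN, the join treats the key columns as a composite key, so values such as ('a', 'bc') and ('ab', 'c') are not conflated.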
I'm new to Cassandra and I'm having trouble inserting some rows into a database; I get the error in the title.
I use Cassandra 1.0.8 and cqlsh to make changes in my database.
Below are the steps I take before I get the error:
CREATE A COLUMN FAMILY
CREATE TABLE test (
col1 int PRIMARY KEY,
col2 bigint,
col3 boolean,
col4 timestamp
);
INSERT SEVERAL ROWS WITHOUT SPECIFYING ALL OF THE COLUMNS OF THE TABLE
insert into test (col1, col2, col3) values (1, 100, true);
insert into test (col1, col2, col3) values (2, 200, false);
SELECT FOR CHECKING THAT ROWS HAVE BEEN INSERTED CORRECTLY
select * from test;
The result is the following:
INSERT A ROW SPECIFYING A VALUE FOR col4 (NOT SPECIFIED BEFORE)
insert into test (col1, col2, col3, col4) values (3, 100, true, '2011-02-03');
SELECT FOR CHECKING THAT ROW HAS BEEN INSERTED CORRECTLY
select * from test;
This SELECT is where the error appears. The result is the following:
SELECT EACH COLUMN OF THE TABLE SEPARATELY
select col1 from test;
select col2 from test;
select col3 from test;
select col4 from test;
It works fine and shows the right values.
So my question is: what's the problem with the first SELECT? What's wrong?
Thanks in advance!!
NOTE:
If I define col4 as an Integer rather than a timestamp, it works. However, I've tried inserting col4 in the normalized format yyyy-mm-dd HH:mm (I've tried '2011-02-03 01:05' and '2011-02-03 01:05:10'), but it doesn't work.
Cassandra 1.0.8 shipped with CQL2, and that's where your problem is coming from. I managed to recreate this in 1.0.8, but it works fine with 1.2.x, so my advice is to upgrade if you can.
In C* 1.2.10:
cqlsh> update db.user set date='2011-02-03 01:05' where user='JCTYpjJlM';
cqlsh> SELECT * from db.user ;
user | date | password
-----------+--------------------------+----------
xvkYQKerQ | null | 765
JCTYpjJlM | 2011-02-03 01:05:00+0200 | 391
@mol
Weird; try inserting col4 as an Integer (convert to milliseconds first) or use the normalized format yyyy-mm-dd HH:mm.
According to the doc here, you can omit the time and just input the date, but it seems that breaks something in your case.
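A minimal sketch of the milliseconds conversion (an illustration, not from the original exchange; UTC is assumed here):

from datetime import datetime, timezone

# Epoch milliseconds for 2011-02-03 01:05 UTC.
dt = datetime(2011, 2, 3, 1, 5, tzinfo=timezone.utc)
millis = int(dt.timestamp() * 1000)  # 1296695100000

# The value can then be used directly in the CQL statement, e.g.:
# insert into test (col1, col2, col3, col4) values (4, 100, true, 1296695100000);
print(millis)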
Consider two data frames, dataFrame1 and dataFrame2:
dataFrame1 has N columns (colmn1, ..., colmnN)
dataFrame2 has 3 columns (col1, col2, col3)
Can I write a statement like:
Select colmn1, colmn2, ..., colmnN, col1, col2 from dataFrame1, dataFrame2
using RMySQL?
Maybe you want the sqldf package instead.
Try this:
library("sqldf")
sqldf("select colmn1, colmn2, ..., colmnN, col1, col2 from dataFrame1, dataFrame2")
Of course, you must replace ... with the actual column names.