I want to insert only selected columns into a Hive table, but I am unable to do so. This is what I tried via Spark:
val df2 = spark.sql("SELECT Device_Version,date, SUM(size) as size FROM table1 WHERE date='2019-06-13' GROUP BY date, Device_Version")
df2.createOrReplaceTempView("tempTable")
spark.sql("Insert into table2 PARTITION (date,ID) (Device_Version) SELECT Device_Version, date, '1' AS ID FROM tempTable")
My aim is to insert only selected fields into table2. table2 has many other columns, which I want padded with null. I can do the padding myself as long as I can specify the column order; I do not want the default order to be used.
Something like ...
spark.sql("Insert into table2 PARTITION (date,cuboid_id) (Device_Version,OS) SELECT Device_Version, null as os, date, '10001' AS CUBOID_ID FROM tempTable")
Is there any way to do this? Any options are welcome.
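One possible workaround, sketched here as an assumption rather than a confirmed fix: since the SELECT list is matched to the target table by position, the statement can be generated so that every non-partition column appears in table order, with NULL for anything not supplied, followed by the partition columns. The helper name build_padded_insert and the column list for table2 (Device_Version, OS, size) are hypothetical; the real column list would have to come from the table definition.

```python
def build_padded_insert(target_table, target_columns, partition_columns,
                        source_view, provided):
    """Build an INSERT whose SELECT list follows the target table's column order.

    target_columns: non-partition columns of the target table, in table order.
    partition_columns: partition columns, in PARTITIONED BY order.
    provided: maps a target column name to a SQL expression over source_view.
    Columns missing from `provided` are padded with NULL.
    """
    known = set(target_columns) | set(partition_columns)
    unknown = [col for col in provided if col not in known]
    if unknown:
        raise ValueError(f"not columns of {target_table}: {unknown}")
    missing = [col for col in partition_columns if col not in provided]
    if missing:
        raise ValueError(f"partition columns need a value: {missing}")

    # Regular columns first, in table order, padded with NULL where needed.
    select_items = [f"{provided.get(col, 'NULL')} AS {col}" for col in target_columns]
    # Dynamic partition columns must come last, in PARTITIONED BY order.
    select_items += [f"{provided[col]} AS {col}" for col in partition_columns]

    return (
        f"INSERT INTO {target_table} PARTITION ({', '.join(partition_columns)}) "
        f"SELECT {', '.join(select_items)} FROM {source_view}"
    )


# Hypothetical layout of table2: three regular columns, two partition columns.
sql = build_padded_insert(
    "table2",
    ["Device_Version", "OS", "size"],
    ["date", "ID"],
    "tempTable",
    {"Device_Version": "Device_Version", "date": "date", "ID": "'1'"},
)
print(sql)
```

The resulting string could then be passed to spark.sql(...). This keeps the ordering explicit and in one place instead of relying on a column list after the PARTITION clause.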
Related
I have created a temporary table for a pyspark dataframe so that I can query it with spark.sql.
df.createGlobalTempView("new")
This is my query:
spark.sql("select distinct(city), count(distinct(city)) FROM global_temp.new GROUP BY 1 ORDER BY 2 desc").show()
When I run this query, the second column incorrectly comes back in ascending order. When I order by the first column instead, it works correctly and the results match my choice of asc or desc.
Is it not possible to order by a column that is the result of a calculation in a temp table? I can do so with a regular spark dataframe.
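One thing worth checking, offered as a guess rather than a confirmed diagnosis: when the query groups by city, count(distinct(city)) is computed inside each group, where there is only one distinct city, so every row gets the same count and sorting on it cannot produce a visible order. The sketch below uses Python's built-in sqlite3 as a stand-in for Spark SQL, with made-up city data, to compare it with count(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new (city TEXT)")
conn.executemany(
    "INSERT INTO new VALUES (?)",
    [("Paris",), ("Paris",), ("Paris",), ("Oslo",), ("Lima",), ("Lima",)],
)

# Same shape as the query above: distinct count of the grouping column itself.
distinct_counts = conn.execute(
    "SELECT city, COUNT(DISTINCT city) AS n FROM new GROUP BY city ORDER BY n DESC"
).fetchall()

# Counting rows per group gives values that can actually be ordered.
row_counts = conn.execute(
    "SELECT city, COUNT(*) AS n FROM new GROUP BY city ORDER BY n DESC, city"
).fetchall()

print(distinct_counts)
print(row_counts)
```

If the intent was the number of rows per city, count(*) (or a count of some other column) is probably the expression to order by.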
I'm trying to join two tables in Spark SQL. Each table has 50+ columns. Both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has a duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since both tables have so many columns, I do not want to type out all the other column names in the query above. (Apart from id, there are no other duplicated column names.)
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
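The difference between ON and USING can be seen in a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption made only so the example runs without a cluster), and the columns a and b are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl1 (id INTEGER, a TEXT);
    CREATE TABLE tbl2 (id INTEGER, b TEXT);
    INSERT INTO tbl1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO tbl2 VALUES (1, 'p'), (3, 'q');
    """
)


def column_names(query):
    """Return the result column names of a query."""
    cursor = conn.execute(query)
    return [description[0] for description in cursor.description]


# Joining with ON keeps the key column from both sides.
on_columns = column_names("SELECT * FROM tbl1 JOIN tbl2 ON tbl1.id = tbl2.id")

# Joining with USING returns the shared key column only once.
using_columns = column_names("SELECT * FROM tbl1 JOIN tbl2 USING (id)")

print(len(on_columns), len(using_columns))
```

With ON the result has four columns, with USING it has three, and id appears only once.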
Assuming you want to preserve the "duplicates", you can try using the internal row ID or an equivalent. This helped me in the past when I had to delete exactly one of two identical rows.
select *,ctid from table;
In PostgreSQL this also outputs the internal row identifier (ctid), so rows that were exactly identical before become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional
  )
joined should have only one id column. Tested with Spark 2.4.8
Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query in Hive and in spark-sql:
select count(*)
from TABLE_A a
left join TABLE_B b
on a.key = b.key
and b.date > '2021-01-01'
and date_add(last_day(add_months(a.create_date, -1)),1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
and cast(a.TIMESTAMP as date) < '2021-03-01'
But I get 1B rows as output in Hive and 1.01B in spark-sql.
From some initial analysis, all the extra rows in Spark seem to have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains and Spark can still read it, but the Hive metadata says the table doesn't exist, so Hive doesn't read it.
That would explain the difference.
I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") // here 'df' may contain data from multiple partitions, i.e. multiple values for P1 and P2
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of projection list.
I am getting the following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs upper). I tried both cases in my query, but it still does not work.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for.
E.g. the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table stmt:
`CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', //Spark is using Capital Names for Partitions; while hive is using lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')`
In the above, spark.sql.sources.schema.partCol.0 uses all uppercase, while the PARTITIONED BY clause uses all lowercase for the partition columns.
Based on the exception, and assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception would be to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once that succeeds, execute your insert overwrite statement. Hope this helps.
As I mentioned in the EDIT section, the issue was in fact the difference in partition column casing (lower vs upper) between Hive and Spark. I created the Hive table with all-uppercase partition columns, but Hive internally stored them in lowercase, while the Spark metadata kept them in uppercase as I had written them. Changing the create statement to use all-lowercase partition columns fixed the issue for subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement have the same casing.
PARTITIONED BY (
p1 int,
p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
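The casing check can be sketched in plain Python. This is only an illustration: the function name partition_casing_mismatches is hypothetical, and it assumes the partition column names and the table properties have already been read out of the table definition (for example by hand from the create statement) into a list and a dict.

```python
PART_COL_PREFIX = "spark.sql.sources.schema.partCol."


def partition_casing_mismatches(partitioned_by, table_properties):
    """Return (hive_name, spark_name) pairs whose spelling or casing differs.

    partitioned_by: partition column names as written in PARTITIONED BY.
    table_properties: the TBLPROPERTIES entries as a dict of strings.
    """
    # Collect the partCol.N entries and order them by their numeric suffix.
    indexed = sorted(
        (int(key[len(PART_COL_PREFIX):]), value)
        for key, value in table_properties.items()
        if key.startswith(PART_COL_PREFIX)
    )
    spark_names = [value for _, value in indexed]
    return [
        (hive_name, spark_name)
        for hive_name, spark_name in zip(partitioned_by, spark_names)
        if hive_name != spark_name
    ]


# Properties as they appeared in the failing table definition.
broken = {
    "spark.sql.sources.schema.numPartCols": "2",
    "spark.sql.sources.schema.partCol.0": "P1",
    "spark.sql.sources.schema.partCol.1": "P2",
}
# Properties after recreating the table with lowercase partition columns.
fixed = {
    "spark.sql.sources.schema.numPartCols": "2",
    "spark.sql.sources.schema.partCol.0": "p1",
    "spark.sql.sources.schema.partCol.1": "p2",
}

print(partition_casing_mismatches(["p1", "p2"], broken))
print(partition_casing_mismatches(["p1", "p2"], fixed))
```

An empty result means the two sets of names agree, which is the state the fixed create statement is aiming for.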
I am working with PySpark and need to write a query that reads data from a Hive table and returns a PySpark dataframe containing all the columns plus a row number.
This is what I tried:
SELECT *, ROW_NUMBER() OVER () as rcd_num FROM schema_name.table_name
This query works fine in hive, but when I run it from a pyspark script it throws the following error:
Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
Please suggest some solution.
Note: I do not wish to order the rows in any particular order, I just need row numbers for all the rows present in the table without any sorting or ordering.
Using spark 2.1
ROW_NUMBER() requires an ordering in Spark, so you can use the monotonically_increasing_id function instead. It gives every row in the table a unique, increasing ID, although the IDs are not guaranteed to be consecutive.
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("rcd_num", monotonically_increasing_id())
Or:
SELECT *, ROW_NUMBER() OVER (Order by (select NULL)) as rcd_num FROM schema_name.table_name
That is, you can set the window's ORDER BY to (select NULL).
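One caveat with the ID-based approach: Spark's documentation describes the generated IDs as unique and increasing but not consecutive, because the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below only mimics that layout to show what the values look like; sketch_monotonic_id is a hypothetical helper, not a Spark function.

```python
def sketch_monotonic_id(partition_index, row_in_partition):
    """Mimic the documented layout: partition ID in the high bits,
    record number within the partition in the low 33 bits."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each.
ids = [
    sketch_monotonic_id(partition, row)
    for partition in range(2)
    for row in range(3)
]
print(ids)
```

The IDs jump at each partition boundary, so they work as unique row markers but not as a gap-free 1..N numbering. If consecutive numbers are required, the window-based ROW_NUMBER() variant is the one to look at.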