Writing to delta table fails with "not enough data columns"? - databricks

I was trying to execute the Spark SQL code below in Databricks. It does an INSERT OVERWRITE into another table that has the same number of columns with the same names.
INSERT OVERWRITE TABLE cs_br_prov
SELECT NAMED_STRUCT('IND_ID',stg.IND_ID,'CUST_NBR',stg.CUST_NBR,'SRC_ID',stg.SRC_ID,
'SRC_SYS_CD',stg.SRC_SYS_CD,'OUTBOUND_ID',stg.OUTBOUND_ID,'OPP_ID',stg.OPP_ID,
'CAMPAIGN_CD',stg.CAMPAIGN_CD,'TREAT_KEY',stg.TREAT_KEY,'PROV_KEY',stg.PROV_KEY,
'INSERTDATE',stg.INSERTDATE,'UPDATEDATE',stg.UPDATEDATE,'CONTACT_KEY',stg.CONTACT_KEY) AS key,
stg.MEM_KEY,
stg.INDV_ID,
stg.MBR_ID,
stg.OPP_DT,
stg.SEG_ID,
stg.MODA,
stg.E_KEY,
stg.TREAT_RUNDATETIME
FROM cs_br_prov_stg stg
The error I am getting is:
AnalysisException: Cannot write to 'delta.`path`', not enough data columns;
target table has 20 column(s) but the inserted data has 9 column(s)

The reason is exactly what the exception says: the SELECT subquery produces a logical plan with only 9 columns, not the 20 the cs_br_prov table expects. The NAMED_STRUCT collapses 12 staging columns into a single struct column, so the projection ends up with 1 struct + 8 plain columns = 9.
Unless the table uses generated columns, the exception is expected behaviour.
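For illustration, a minimal PySpark sketch (table names taken from the question; the column check itself is an assumption, not part of the original post) for comparing the target schema with the SELECT output before running the INSERT OVERWRITE:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns the target Delta table expects (20 in the question).
target_cols = spark.table("cs_br_prov").columns

# Columns the SELECT subquery actually produces; paste the SELECT from the
# INSERT here. The NAMED_STRUCT collapses 12 staging columns into one struct
# column, which is why the plan ends up with 9 columns instead of 20.
select_cols = spark.sql("SELECT * FROM cs_br_prov_stg").columns

print(len(target_cols), "target columns vs", len(select_cols), "selected columns")

# INSERT OVERWRITE ... SELECT resolves columns by position, so the SELECT must
# produce exactly as many columns as the target table defines.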

Related

How to optimize a delete on table which doesn't have any primary key but has a column which has TimeStamp?

My process does an INSERT INTO a backup table 'B' from a table 'A' that gets updated daily (truncate and load) in Azure SQL DB.
A column 'TSP' (e.g. 2022-12-19T22:06:01.950994) is present in both tables, and the TSP is the same for all rows inserted on a given day.
Later in the day, I'm supposed to delete the older data.
Currently I use DELETE FROM 'B' WHERE TSP < 'today - 1 day' logic.
Is there a way to optimize this delete using an index or something similar?
SSMS suggested creating a non-clustered index on the TSP column.
I tested it, but there doesn't seem to be much difference.
If this was the data:
50mil TSP1
50mil TSP2
50mil TSP3
My expectation was that it would skip scanning the TSP2 and TSP3 rows and delete only the TSP1 rows, whereas without an index it would need to scan all 150 million rows.
The batched delete operation uses a view to simplify the execution plan; this is the Fast Ordered Delete pattern. It works by referencing the table only once, which in turn reduces the amount of I/O required.
Below are sample queries:
CREATE TABLE tableA
(
    id INT,
    TSP DATETIME DEFAULT GETDATE(),
    [Log] NVARCHAR(250)
);

-- Populate sample data
DECLARE @I INT = 1;
WHILE @I <= 1000
BEGIN
    INSERT INTO tableA VALUES (@I, GETDATE() - 1, CONCAT('Log message ', @I));
    SET @I = @I + 1;
END
Option 1: using a CTE
;WITH DeleteData
AS
(SELECT id, TSP, Log FROM tableA
WHERE CAST(tsp AS DATE) = CAST(GETDATE() AS DATE))
DELETE FROM DeleteData
Option 2: using a SQL view
CREATE VIEW VW_tableA AS (SELECT * FROM tableA WHERE CAST(tsp AS DATE) = CAST(GETDATE()-1 AS DATE))
delete from VW_tableA
Reference 1: An article by John Sansom on fast-sql-server-delete.
Reference 2: Similar SO thread.

PySpark.SQL Order By not working on count column when querying temp table

I have created a temporary table for a pyspark dataframe so that I can query it with spark.sql.
df.createGlobalTempView("new")
This is my query:
spark.sql("select distinct(city), count(distinct(city)) FROM global_temp.new GROUP BY 1 ORDER BY 2 desc").show()
When I run this query, the second column comes back in ascending order even though I specified desc. When I order by the first column, however, it works correctly and the results respect my choice of asc or desc.
Is it not possible to order by a column that is the result of a calculation in a temp table? I can do so with a regular spark dataframe.
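Note that count(distinct(city)) grouped by city is always 1, so ordering by that column effectively orders by a constant. A minimal DataFrame-API sketch of what the query likely intends (a row count per city, sorted descending); 'df' and the 'city' column are taken from the question:
from pyspark.sql import functions as F

# Count rows per city and sort by that count in descending order.
(df.groupBy("city")
   .agg(F.count("*").alias("cnt"))
   .orderBy(F.col("cnt").desc())
   .show())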

Pyspark: Row count is not matching to the count of records appended

I am trying to identify and insert only the delta records into the target Hive table from a PySpark program. I am using a left anti join on the ID columns, and it identifies the new records successfully. But I notice that the total number of delta records is not the same as the difference between the table record count before the load and after the load.
delta_df = src_df.join(tgt_df, src_df.JOIN_HASH == tgt_df.JOIN_HASH,how='leftanti')\
.select(src_df.columns).drop("JOIN_HASH")
delta_df.count() #giving out correct delta count
delta_df.write.mode("append").format("hive").option("compression","snappy").saveAsTable(hivetable)
But delta_df.count() does not match (count(*) from hivetable after writing the data) minus (count(*) from hivetable before writing the data); the difference is always higher than the delta count.
I have a unique timestamp column for each load in the source, and to my surprise the count of records in the target for the current load (grouping by the unique timestamp) is less than the delta count.
I am not able to identify the issue here. Do I have to write the df.write in some other way?
It was a problem with the line delimiter. When the table is created with spark.write, no line.delim is specified in the SERDEPROPERTIES, and column values containing * were getting split into multiple rows.
After I added the SERDEPROPERTIES below, the data is stored correctly.
'line.delim'='\n'
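A sketch of one way to apply the property from PySpark (the answer does not show the exact DDL; ALTER TABLE ... SET SERDEPROPERTIES is standard Hive DDL, and 'hivetable' is the table-name variable from the question):
# Set an explicit line delimiter on the existing Hive table's SerDe so that
# embedded characters in column values no longer split records on read.
# The doubled backslash passes the literal \n through to the SQL parser.
spark.sql(f"ALTER TABLE {hivetable} SET SERDEPROPERTIES ('line.delim' = '\\n')")

# Inspect the table definition to confirm the property was applied.
spark.sql(f"SHOW CREATE TABLE {hivetable}").show(truncate=False)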

spark-hive - Upsert into dynamic partition hive table throws an error - Partition spec contains non-partition columns

I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") // here 'df' may contain data for multiple partitions, i.e. multiple values of P1 and P2
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that the partition columns P1 and P2 are at the end of the projection list
I am getting following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs upper). I tried both cases in my query but it still doesn't work.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for.
For example, the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table stmt:
CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', //Spark is using Capital Names for Partitions; while hive is using lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')
In the above, spark.sql.sources.schema.partCol.0 uses uppercase while the PARTITIONED BY clause uses lowercase for the partition columns.
Based on the exception, and assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception is to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once that succeeds, execute your insert overwrite statement. Hope this helps.
As I mentioned in the EDIT section, the issue was in fact the difference in partition-column casing (lower vs upper) between Hive and Spark. I created the Hive table with uppercase partition columns, but Hive internally stored them as lowercase, while the Spark metadata kept them uppercase as I intended. Fixing the create statement to use all-lowercase partition columns fixed the issue for subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement use the same casing:
PARTITIONED BY (
p1 int,
p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
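A condensed sketch of the working flow (a sketch only; names and paths follow the question, and the configs are set through spark.sql here for brevity):
# Enable dynamic partitioning on the Hive side, as in the question.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable")

# With the table recreated using lowercase partition columns (p1, p2), the
# dynamic-partition insert works; partition columns stay last in the projection.
spark.sql("""
    INSERT OVERWRITE TABLE mytable PARTITION (p1, p2)
    SELECT c1, c2, c3, p1, p2 FROM updateTable
""")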

Hive query for row number

I am working with PySpark and need to write a query that reads data from a Hive table and returns a PySpark dataframe containing all the columns plus a row number.
This is what I tried :
SELECT *, ROW_NUMBER() OVER () as rcd_num FROM schema_name.table_name
This query works fine in hive, but when I run it from a pyspark script it throws the following error:
Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
Please suggest some solution.
Note: I do not need the rows in any particular order; I just need row numbers for all the rows present in the table without any sorting or ordering.
Using Spark 2.1.
ROW_NUMBER() requires an ordering, so you can instead use the monotonically_increasing_id function, which gives every row a unique, monotonically increasing ID (not necessarily consecutive) without any sort:
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("rcd_num", monotonically_increasing_id())
OR
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rcd_num FROM schema_name.table_name
i.e. you can order the window by a constant subquery such as (SELECT NULL).
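If strictly consecutive numbers are needed, a common pattern (a sketch, not part of the original answer) is to generate the ID first and then apply row_number over it:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Read the Hive table (assumes a SparkSession with Hive support named 'spark').
df = spark.table("schema_name.table_name")

# monotonically_increasing_id is unique and increasing but not consecutive;
# ordering a window over it yields consecutive numbers without a real sort key.
# Note: an unpartitioned window pulls all rows into a single partition.
df = (df.withColumn("_mid", monotonically_increasing_id())
        .withColumn("rcd_num", row_number().over(Window.orderBy("_mid")))
        .drop("_mid"))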
