Insert Overwrite in Databricks overwriting complete data in table? - apache-spark

I have two tables: one with 50K records and the other with 2.5K records, and I want to update table one with these 2.5K records. Currently I do this with an INSERT OVERWRITE statement on a Spark MapR cluster, and I want to do the same in Azure Databricks, where I created the two tables, read the data from on-prem servers into Azure, and then ran the INSERT OVERWRITE statement. But when I do this, my previous/history data is completely replaced with the new data.
In the MapR cluster:
src_df_name.write.mode("overwrite").format("hive").saveAsTable(s"cs_hen_mbr_stg") //stage table with 2.5K records.
spark.sql(s"INSERT OVERWRITE TABLE cs_hen_mbr_hist " +
s"SELECT NAMED_STRUCT('INID',stg.INID,'SEG_NBR',stg.SEG_NBR,'SRC_ID',stg.SRC_ID, "+
s"'SYS_RULE',stg.SYS_RULE,'MSG_ID',stg.MSG_ID, " +
s"'TRE_KEY',stg.TRE_KEY,'PRO_KEY',stg.PRO_KEY, " +
s"'INS_DATE',stg.INS_DATE,'UPDATE_DATE',stg.UPDATE_DATE,'STATUS_KEY',stg.STATUS_KEY) AS key, "+
s"stg.MEM_KEY,stg.INDV_ID,stg.MBR_ID,stg.SEGO_ID,stg.EMAIL, " +
s"from cs_hen_mbr_stg stg" )
By doing the above on the MapR cluster I was able to update the values, but when I try the same in Azure Databricks my history data gets lost.
In Databricks:
val VW_HISTORY_MAIN=spark.read.format("parquet").option("header","true").load(s"${SourcePath}/VW_HISTORY")
VW_HISTORY_MAIN.write.mode("overwrite").format("hive").saveAsTable(s"demo.cs_hen_mbr_stg") // writing this to a table in Databricks
spark.sql(s"INSERT OVERWRITE TABLE cs_hen_mbr_hist " +
s"SELECT NAMED_STRUCT('INID',stg.INID,'SEG_NBR',stg.SEG_NBR,'SRC_ID',stg.SRC_ID, "+
s"'SYS_RULE',stg.SYS_RULE,'MSG_ID',stg.MSG_ID, " +
s"'TRE_KEY',stg.TRE_KEY,'PRO_KEY',stg.PRO_KEY, " +
s"'INS_DATE',stg.INS_DATE,'UPDATE_DATE',stg.UPDATE_DATE,'STATUS_KEY',stg.STATUS_KEY) AS key, "+
s"stg.MEM_KEY,stg.INDV_ID,stg.MBR_ID,stg.SEGO_ID,stg.EMAIL, " +
s"from cs_hen_mbr_stg stg" )
Why is it not working in Databricks?
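For reference, here is a minimal sketch of how the two insert modes behave, assuming a plain unpartitioned table; hist_demo and its columns are illustrative names, not the tables from the question.
// Sketch only: demo table, not the question's tables.
spark.sql("CREATE TABLE IF NOT EXISTS hist_demo (id INT, val STRING) USING parquet")
spark.sql("INSERT INTO hist_demo VALUES (1, 'old')")
// INSERT OVERWRITE on an unpartitioned table replaces every existing row:
spark.sql("INSERT OVERWRITE TABLE hist_demo VALUES (2, 'new')")
spark.sql("SELECT * FROM hist_demo").show()   // only (2, 'new') remains
// INSERT INTO appends instead and keeps the earlier rows:
spark.sql("INSERT INTO hist_demo VALUES (3, 'newer')")
spark.sql("SELECT * FROM hist_demo").show()   // (2, 'new') and (3, 'newer')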

Related

Azure Synapse Serverless SQL Pool Error: Incorrect syntax near 'DISTRIBUTION'

The following code on an Azure Synapse serverless SQL pool gives this error:
Incorrect syntax near 'DISTRIBUTION'.
SELECT CM.EntityName,
--Before the first column of each table, construct a DROP TABLE statement if already exist
CASE WHEN CM.OrdinalPosition = 1
THEN
'DROP EXTERNAL TABLE MyTable' + '.' +
QUOTENAME(@EnrichedViewSchema) + '.' + CM.EntityName + '
CREATE TABLE MyTable' + '.' +
QUOTENAME(@EnrichedViewSchema) + '.' + CM.EntityName + '
WITH
(
DISTRIBUTION = ROUND_ROBIN
);
AS
SELECT DISTINCT '
ELSE ' ,'
END
Can someone look at the code and let me know where I might be going wrong?
Azure Synapse SQL Server Pool Error: Incorrect syntax near 'DISTRIBUTION'
CREATE TABLE MyTable' + '.' +
QUOTENAME(@EnrichedViewSchema) + '.' +
CM.EntityName + '
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
A serverless SQL pool is used to query over the data lake, and we cannot create regular tables in it; only external tables and temporary tables can be created in a serverless SQL pool.
Also, DISTRIBUTION is applicable only to dedicated SQL pool tables.
Therefore, the above SQL script is not possible.
Reference: Design tables using Synapse SQL - Azure Synapse Analytics | Microsoft Learn
There is an additional semicolon before AS in your script.
Wrong:
CREATE TABLE XXX WITH(DISTRIBUTION=ROUND_ROBIN); AS SELECT
Correct:
CREATE TABLE XXX WITH(DISTRIBUTION=ROUND_ROBIN) AS SELECT

How do you set up a Synapse Serverless SQL external table over partitioned data?

I have setup a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it will only read the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function like this, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without the filter on that column.
CREATE VIEW MyView
As
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
Select * from MyView where country_region='Afghanistan'
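As a cross-check on the Spark side, here is a small sketch of the same partition-pruning idea (my own addition, reusing the question's placeholder path); it filters on the partition column and inspects the physical plan:
import org.apache.spark.sql.functions.col
// Sketch only: read the partitioned output and filter on the partition column;
// Spark should prune to the matching country_region directory.
val covid = spark.read.parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
val afghanistan = covid.filter(col("country_region") === "Afghanistan")
// The PartitionFilters entry in the physical plan confirms partition elimination.
afghanistan.explain()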

Cannot view newly created Delta table via DeltaTable.forPath

I created a table like this and inserted some data
spark.sql(s"create table if not exists test " +
"(key String," +
"name String," +
"address String," +
"inserted_at TIMESTAMP) " +
s" using delta LOCATION 's3://test/user/'")
I can view the table via
spark.table("test").show()
But when I do
DeltaTable.forPath(spark,"s3://test/user/" ).toDF.show(false)
I cannot see the data. But when I try this method
DeltaTable.isDeltaTable("s3://test/user/")
it returns true. Can anyone please explain what I am missing?
Further, when I want to do a merge operation, I am getting this error.
[error] !
[error] java.lang.UnsupportedOperationException: null (DeltaTable.scala:639)
[error] io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:639)
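For reference, a minimal sketch of the two lookup paths being compared, assuming Delta Lake is on the classpath; the local /tmp path and the sample row are made up for illustration and replace the S3 location from the question.
import io.delta.tables.DeltaTable

spark.sql(s"create table if not exists test " +
  "(key String," +
  "name String," +
  "address String," +
  "inserted_at TIMESTAMP) " +
  s" using delta LOCATION '/tmp/delta/test_user'")  // hypothetical local path
spark.sql("insert into test values ('k1', 'alice', 'street 1', cast('2024-01-01 00:00:00' as timestamp))")

// When the location holds a _delta_log directory, both lookups return the same rows:
spark.table("test").show()
DeltaTable.forPath(spark, "/tmp/delta/test_user").toDF.show(false)

// Catalog-based lookup is an alternative to the path-based one:
DeltaTable.forName(spark, "test").toDF.show(false)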

Kundera Cassandra Delete a row based on Indexed column

How do I delete rows in Cassandra based on an indexed column?
Tried:
upload_id is added as an index in the table.
Delete from table where upload_id = '"+uploadId+"'"
But this gives me an error: "NON PRIMARY KEY found in where clause".
So instead I run
String selectQuery = "Select hashkey from table where upload_id='" + uploadId + "'";
entityManager.createNativeQuery(selectQuery).getResultList();
and then delete all the elements in the list using a for loop.
Kundera changes this query by appending LIMIT 100 ALLOW FILTERING.
I found a similar question at Kundera for Cassandra - Deleting record by row key, but that was asked in 2012, and there have been a lot of changes to Cassandra and Kundera since then.
Kundera by default uses LIMIT 100. You can use query.setMaxResults(<integer>) to modify the limit accordingly and then run the loop.
Example:
Query findQuery = entityManager.createQuery("Select p from PersonCassandra p where p.age = 10", PersonCassandra.class);
findQuery.setMaxResults(5000);
List<PersonCassandra> allPersons = findQuery.getResultList();

How to create the table (column family) for file input and output

how to read file from cassandra and write file to cassandra?
From the link above, I got the way to read and write a file with Cassandra, but I don't know the structure of the table.
The code below is the way to insert a file into Cassandra. I guess that filename and file are two columns (is "filename" the primary key?). What type should file have when we write CREATE TABLE (...,...,...)? Is "blob" OK?
ByteBuffer fileByteBuffer = ByteBuffer.wrap( readFileToByteArray( filename ) );
Statement insertFile = QueryBuilder.insertInto( "files" ).value( "filename", filename ).value( "file", fileByteBuffer );
session.execute( insertFile );
I create the column family:
"CREATE TABLE " + columnfamily + " ("
+ PK+ " varchar PRIMARY KEY,"
+ " password varchar ,"
+ " file blob"
+ ");";
I do the query:
Statement insertFile = QueryBuilder.insertInto(keyspace + "."+columnfamily)
.value(PK, "LDCR.lua").value("file", fileByteBuffer).value("password", "654321");
session.execute(insertFile);
but it says:
com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured columnfamily
