I'm using the following script to output the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and writing the results to it, the script writes the results under an auto-generated file name like part-0000-tid:
The code is as follows:
example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput3/myresults.json")
Can someone let me know how to save the results as a single file, and have that file overwritten each time the script runs?
Thanks
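Spark always writes a directory of part files, so there is no built-in way to name the output file directly. One common workaround is to coalesce to a single partition, write with overwrite mode into a temporary folder, and then rename the lone part file with the Hadoop FileSystem API. The sketch below reuses the paths from the question, but the temporary folder name and the rename logic are assumptions, not a tested answer:
tmp_dir = "adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput3/_tmp"
final_file = "adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput3/myresults.json"

# Overwrite so each run replaces the previous output instead of appending
example1.coalesce(1).write.mode("overwrite").json(tmp_dir)

# Rename the single part-* file to the desired name via the Hadoop FileSystem API
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(tmp_dir).getFileSystem(spark._jsc.hadoopConfiguration())
part_file = [f.getPath() for f in fs.listStatus(Path(tmp_dir)) if f.getPath().getName().startswith("part-")][0]
fs.delete(Path(final_file), True)   # drop any previous myresults.json
fs.rename(part_file, Path(final_file))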
So I'm trying to load Avro files into DLT and create pipelines and so forth.
As a simple DataFrame in Databricks, I can read and unpack the Avro files using the json / rdd.map / lambda functions. From there I can create a temp view, run a SQL query, and select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
# selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate that process using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", which would then allow me to write queries against it.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer; it just does not seem to apply the functions that make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working: when I try to select a column, it is not recognised.
# unpacked data
@dlt.view(name="eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data
# trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help will be much appreciated. Thank you.
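For what it is worth, one pattern that sometimes helps in DLT is to avoid the RDD round trip inside the view and instead parse the Body column with from_json against an explicit schema, so the resulting columns are resolvable. The sketch below is only a guess at the shape of the data: the table name comes from the question, while the schema fields and view bodies are assumptions, not a verified fix:
import dlt
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema covering only the fields referenced in the question
event_schema = StructType([
    StructField("data", StructType([
        StructField("field", StructType([
            StructField("test1", StringType()),
            StructField("test2", StringType()),
            StructField("fieldgrp", StructType([
                StructField("city", StringType()),
            ])),
        ])),
    ])),
])

@dlt.view(name="eventdata_v")
def eventdata_v():
    # Parse the Avro Body payload as JSON with an explicit schema instead of
    # round-tripping through an RDD, which DLT pipelines generally do not support
    return (
        dlt.read("bronze_file_list")
        .select(from_json(col("Body").cast("string"), event_schema).alias("parsed"))
        .select("parsed.*")
    )

@dlt.view(name="eventdata2_v")
def eventdata2_v():
    return dlt.read("eventdata_v").select(col("data.field.test1").alias("col1"))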
I have set up a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function, as shown below, and then filter on it; that achieves partition elimination. You can confirm it by comparing the bytes read with and without a filter on that column.
CREATE VIEW MyView
As
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
SELECT * FROM MyView WHERE country_region = 'Afghanistan'
I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried; signals1 is the master data in blob storage, and signals2 is the processed data to be validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I have to validate the VAL from signals2 against the VAL in signals1.
If the VAL in signals2 exists in signals1, then we should write to output.
If the VAL in signals2 is not in signals1, then that document should be ignored (not written to output).
I tried with JOIN and WHERE clauses, but it is not working as expected.
Any leads on how to achieve this using JOIN or WHERE?
If your Signal1 data is the reference input and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, and I assumed that your reference data (Signal1) is in the format:
[
{
"Val":"123",
"Data":"temp"
},
{
"Val":"321",
"Data":"humidity"
}
]
And, for example, your Signal2 (the streaming input) is:
{
"Val":"123",
"SIG":"k8s23kk",
"ID":"1234589"
}
Have a look at this query and data samples to see if it can guide you towards the solution.
Side note: you cannot use this join if Signal1 is also streaming data. Joins between two streams require time windowing; without it, the join is not possible.
I need to get the list of tables used in a stored procedure. However, in Azure SQL Data Warehouse, sp_depends is not supported.
The alternative I thought of is to get the stored procedure code from INFORMATION_SCHEMA.ROUTINES and then run a script to extract the [schema].[tablename] references from the definition, but the issue there is storing the whole procedure in a variable: VARCHAR(MAX) has a limit of 8000 characters to store, and if my proc exceeds that limit I won't be able to get the complete table list.
Try using sys.sql_expression_dependencies. The following query may help you:
SELECT ReferencingObjectType = o1.type,
ReferencingObject = SCHEMA_NAME(o1.schema_id)+'.'+o1.name,
ReferencedObject = SCHEMA_NAME(o2.schema_id)+'.'+ed.referenced_entity_name,
ReferencedObjectType = o2.type
FROM sys.sql_expression_dependencies ed
INNER JOIN sys.objects o1
ON ed.referencing_id = o1.object_id
INNER JOIN sys.objects o2
ON ed.referenced_id = o2.object_id
WHERE o1.type in ('P','TR','V', 'TF')
ORDER BY ReferencingObjectType, ReferencingObject
Is it possible to do something like this in PySpark: loop through each value in a list and read the JSON files?
The goal here is to get the app-name from the directory path into the table as a column value and use it as the partition column when writing the data.
S3 location that has the JSON files: "s3a://abc/processing/test/raghu/date/app-name/"
for abc in test:
    path = "s3a://abc/processing/test/raghu/*/" + abc + "/*"
    push = sqlContext.read.json(path)
    push.registerTempTable("push")
    final = sqlContext.sql("SELECT unbase64(body.payload) AS payload, '" + abc + "' AS app_name FROM push")
    final.write.mode("append").partitionBy("app_name").parquet("/data/test/dev/raghu/SPARK-Test/")
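If the loop itself is not essential, an alternative is to read every app folder in one pass and derive the app name from the file path with input_file_name(), then partition on it when writing. This is only a sketch: the wildcard depth, the regular expression, and the app_name column name mirror the paths in the question and are assumptions:
from pyspark.sql.functions import input_file_name, regexp_extract, unbase64

# Read all date/app-name folders at once; the path layout is assumed from the question
df = sqlContext.read.json("s3a://abc/processing/test/raghu/*/*/*")

# Derive app_name from a path like .../raghu/<date>/<app-name>/<file>.json
df = df.withColumn("app_name", regexp_extract(input_file_name(), r"raghu/[^/]+/([^/]+)/", 1))

(df.select(unbase64("body.payload").alias("payload"), "app_name")
   .write.mode("overwrite")
   .partitionBy("app_name")
   .parquet("/data/test/dev/raghu/SPARK-Test/"))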