Multiple parameters in IN clause of Spark SQL from parameter file - apache-spark

I am trying to run a Spark query that creates a curated table from a source table, based upon values in a parameter file.
properties_file.properties contains the below key-value pair:
substatus,allow,deny
The Spark query is:
//Code to load property file in parseConf
spark.sql(s"""insert into curated.table from source.table where
substatus='${parseConf.substatus}'""")
The above works with a single value in substatus. But what should I do if I need to pass multiple values from the parameter file into an IN clause, as below?
spark.sql(s"""insert into curated.table from source.table where substatus in '${parseConf.substatus}'""")

To resolve my problem, I updated my property file as:
substatus,'allow'-'deny'
Then, in the Scala code, I implemented the below logic:
val subStatus=(parseConf.substatus).replace('-',',')
spark.sql(s"""insert into curated.table from source.table where substatus in ('${subStatus}')""")
The above strategy helped break the values in the string into multiple parameters for the IN clause.
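An alternative sketch, assuming the property file keeps the original plain comma-separated form (substatus,allow,deny), is to quote each value in code so the properties file needs no SQL quoting:
val statuses = parseConf.substatus.split(",").map(_.trim)   // e.g. Array("allow", "deny")
val inList = statuses.map(v => s"'$v'").mkString(", ")      // -> 'allow', 'deny'
spark.sql(s"""insert into curated.table select * from source.table where substatus in ($inList)""")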

The equals (=) operator expects a single value, whereas reading the value straight from the parameter file passes it as one string. You need to split the values and then use an IN clause in place of equals (=).

Related

ADF - passing parameters within string to SQL Lookup

I'm writing a pipeline where I fetch SQL queries from a metadata database in a Lookup activity, with the hope of executing them later in the pipeline. Imagine a string stored in a database:
"SELECT * FROM #{pipeline().parameters.SchemaName}.#{pipeline().parameters.TableName}"
My hope was that when passing this string to another Lookup activity, it would pick up the necessary parameters. However, it is passed to the activity as-is, without parameter substitution, and I'm getting errors as a result. Is there any clean fix for this, or am I trying to implement something not supported by ADF natively?
I found that a workaround is wrapping the string in a series of replace() statements, but I'm hoping something simpler exists.
Can you try the below expression in the Dynamic Content text box:
#concat('SELECT * FROM ',pipeline().parameters.SchemaName,'.',pipeline().parameters.TableName)

Using Presto's Coalesce function with a row on AWS Athena

I am using AWS Web Application Firewall (WAF). The logs are written to an S3 Bucket and queried by AWS Athena.
Some log fields are not simple data types but complex JSON types, for example "rulegrouplist". It contains a JSON array of complex types, and the array might have 0 or more elements.
So I am using Presto's try() function to convert errors to NULLs, and trying to use the coalesce() function to put a dash in their place (keeping null values causes problems when using GROUP BY).
try() is working fine but coalesce() is causing a type mismatch problem.
The function call below:
coalesce(try(waf.rulegrouplist[1].terminatingrule),'-')
causes this error:
All COALESCE operands must be the same type: row(ruleid varchar,action varchar,rulematchdetails varchar)
How can I convert "-" to a row or what else can I use that will count as a row?
Apparently you can create a row and cast it to a typed row.
This worked...
coalesce(try(waf.rulegrouplist[1].terminatingrule),CAST(row('null','null','null') as row(ruleid varchar,action varchar,rulematchdetails varchar)))

how to pass parameter value to saveAsTable() option in Spark

I need to parameterize the table name in the saveAsTable() option in Spark.
Can anyone please suggest a solution?
I tried saveAsTable("$tablename"), but it didn't work and throws an error.
If I understood your question correctly, use string interpolation, i.e. prepend s to the "$tablename" string.
saveAsTable(s"$tablename")

pySpark dataframe filter method

I use Databricks Runtime 6.3 with PySpark. I have a dataframe df_1 in which SalesVolume is an integer but AveragePrice is a string.
When I execute the below code, it runs and I get the correct output.
display(df_1.filter('SalesVolume>10000 and AveragePrice>70000'))
But the below code ends in an error: "py4j.Py4JException: Method and([class java.lang.Integer]) does not exist"
display(df_1.filter(df_1['SalesVolume']>10000 & df_1['AveragePrice']>7000))
Why does the first one work but not the second one?
You have to wrap your conditions in parentheses:
display(df_1.filter((df_1['SalesVolume']>10000) & (df_1['AveragePrice']>7000)))
filter() accepts either SQL-like syntax or DataFrame-like syntax. The first works because it is valid SQL-like syntax; the second fails because Python's & operator binds more tightly than >, so without parentheses the condition is parsed as 10000 & df_1['AveragePrice'], which produces the Py4J error.

How to pass content of file as a pipeline parameter

I have a pipeline that accepts an array as a parameter.
Currently, an array has been hardcoded as the default value.
Is it possible to make this dynamic? There is a file called Array.txt in Azure Blob Storage which is updated frequently. How can I extract the content of Array.txt and pass it as the parameter value to the pipeline?
I tried using a Lookup activity but received the error 'Object cannot be passed, pipeline is expecting an Array'.
Please make sure the data in Array.txt is in an array-compatible format, then use a Lookup activity to extract the data and pass #array(activity('Lookup1').output.value) to the subsequent activity. Remember to use the #array() function to convert the data into an array.
