How to pass a parameter value to saveAsTable() in Spark - apache-spark

I need to parameterize the table name passed to saveAsTable() in Spark.
Can anyone please suggest how?
I tried saveAsTable("$tablename"), but it didn't work and throws an error.

If I understood your question correctly, use string interpolation, i.e. prepend s to the "$tablename" string:
saveAsTable(s"$tablename")

Related

How to use timestamp type parameters set in pipeline with timestamp type in data flow

I can't send the question due to some mysterious error, so I'll share a screenshot of the question.
Can anyone help me solve this?
I have reproduced the above and got the same error when the Expression checkbox is checked.
Uncheck the Expression checkbox in the data flow pipeline assignment and pass the value as a plain string. Then it won't give the error.
It will take the Dataflow parameter like this.
Also, along with the datetime string, pass the format to the toTimestamp() function to avoid null values.
This is my sample input data:
sample filter condition:
toTimestamp(start_date,'yyyy-MM-dd\'T\'HH:mm:ss')
Filtered Result:

Why do I get a naming convention error in PySpark when the name is correct?

I'm trying to groupBy a column called saleId and then get the sum of a column called totalAmount, with the code below:
df = df.groupBy('saleId').agg({"totalAmount": "sum"})
But I get the following error:
Attribute sum(totalAmount) contains an invalid character among
,;{}()\n\t=. Please use an alias to rename it
I'm assuming there's something wrong with the way I'm using groupBy, because I get other errors even when I try the following code instead of the above one:
df = df.groupBy('saleId').sum('totalAmount')
What's the problem with my code?
OK, I figured out what went wrong.
The code I used in my question returns sum(totalAmount) as the name of the resulting column, which, as you can see, includes parentheses.
This can be avoided by using:
df = df.groupBy('saleId').agg({"totalAmount": "sum"}).withColumnRenamed('sum(totalAmount)', 'totalAmount')
or
df = df.groupBy('saleId').agg(F.sum('totalAmount').alias('totalAmount'))
(the latter assumes import pyspark.sql.functions as F)

Multiple parameters in IN clause of Spark SQL from parameter file

I am trying to run a Spark query that creates a curated table from a source table based on values in a parameter file.
properties_file.properties contains below key values:
substatus,allow,deny
The Spark query is:
//Code to load property file in parseConf
spark.sql(s"""insert into curated.table from source.table where
substatus='${parseConf.substatus}'""")
The above works with a single value in substatus. But can someone help with what I should do if I need to use substatus with multiple values from the parameter file, as below?
spark.sql(s"""insert into curated.table from source.table where substatus in '${parseConf.substatus}'""")
To resolve my problem, I updated my property file as:
substatus,'allow'-'deny'
Then in scala code, I implemented below logic:
val subStatus = parseConf.substatus.replace('-', ',')
spark.sql(s"""insert into curated.table select * from source.table where substatus in (${subStatus})""")
The above strategy helped break the values in the string into multiple parameters for the IN clause.
The equals operator (=) expects a single value, whereas reading the value directly from the parameter file passes it in as one string. You need to break the values apart and use an IN clause in place of equals (=).
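An alternative sketch of the same idea, assuming the property is kept as a plain comma-separated value and names like rawStatuses are illustrative rather than from the original code: split the string and quote each entry before building the IN list.
// Hypothetical value read from the properties file, e.g. "allow,deny";
// assumes an active SparkSession named spark, as in the question.
val rawStatuses = "allow,deny"
// Quote each value individually so the generated SQL reads: in ('allow', 'deny')
val inList = rawStatuses.split(",").map(v => s"'${v.trim}'").mkString(", ")
spark.sql(
  s"""insert into curated.table
     |select * from source.table
     |where substatus in ($inList)""".stripMargin)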

How To Update Field of a User-Defined Type (UDT) with Spring Data Cassandra CassandraTemplate

I have a Cassandra table which includes a user-defined type. Using CassandraTemplate from Spring Data Cassandra, I want to update a single field of that UDT. This doesn't seem possible.
I have tried this:
database.update(query(
where("party_id").is(partyId)).and(where("relationship_id").is(relationshipId)),
update("address.address_line_1", "this field was updated"),
Address.class);
This throws:
Query error after 3 ms: UPDATE current_addresses_by_party SET "address.address_line_1"=? WHERE party_id=? AND relationship_id=?;com.datastax.driver.core.exceptions.InvalidQueryException: Undefined column name "address.address_line_1"
Running the CQL given in the error output without the quotes works. I don't know if there's a way to get Spring to execute this statement without putting the column name in quotes.
In a fit of optimism I also tried using the syntax for map types:
database.update(query(
where("party_id").is(partyId)).and(where("relationship_id").is(relationshipId)),
Update.empty().set("address").atKey("address_line_1").to("this field was updated"),
Address.class)
This resulted in the error you would expect: the field is not a map.
Is there a way to do what I want with CassandraTemplate without resorting to direct CQL? If CassandraTemplate lacks this feature, it would be great if the devs added it.
I was surprised that I couldn't find anyone else wanting to do this. Maybe I'm doing something completely wrong? I'm fairly new to Cassandra.

PySpark DataFrame filter method

I use Databricks Runtime 6.3 with PySpark. I have a dataframe df_1 in which SalesVolume is an integer but AveragePrice is a string.
When I execute the code below, it runs and I get the correct output.
display(df_1.filter('SalesVolume>10000 and AveragePrice>70000'))
But the code below ends up in an error: "py4j.Py4JException: Method and([class java.lang.Integer]) does not exist"
display(df_1.filter(df_1['SalesVolume']>10000 & df_1['AveragePrice']>7000))
Why does the first one work but not the second one?
You have to wrap your conditions in ():
display(df_1.filter((df_1['SalesVolume']>10000) & (df_1['AveragePrice']>7000)))
filter() accepts either SQL-like syntax or DataFrame column syntax. The first one works because it's a valid SQL-like expression; the second uses column expressions, where Python's & operator binds more tightly than the comparisons, so each condition has to be wrapped in parentheses.
