PySpark DataFrame filter method - Databricks

I use Databricks Runtime 6.3 with PySpark. I have a DataFrame df_1 in which SalesVolume is an integer but AveragePrice is a string.
When I execute the code below, it runs and I get the correct output:
display(df_1.filter('SalesVolume>10000 and AveragePrice>70000'))
But the code below ends in an error: "py4j.Py4JException: Method and([class java.lang.Integer]) does not exist"
display(df_1.filter(df_1['SalesVolume']>10000 & df_1['AveragePrice']>7000))
Why does the first one work but not the second one?

You have to wrap each condition in parentheses:
display(df_1.filter((df_1['SalesVolume']>10000) & (df_1['AveragePrice']>7000)))
filter accepts either SQL-like syntax or DataFrame column expressions. The first call works because it is a valid SQL-like expression; the second fails because Python's & binds more tightly than >, so the unparenthesized version is evaluated as df_1['SalesVolume'] > (10000 & df_1['AveragePrice']) > 7000, which triggers the Py4J error.
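A minimal sketch of the fix (the toy data, the sample values, and the optional cast are assumptions, not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy df_1 with the same column names as in the question; the rows are invented.
df_1 = spark.createDataFrame(
    [(15000, "82000"), (8000, "91000")],
    ["SalesVolume", "AveragePrice"],
)

# Each comparison is wrapped in parentheses so & combines two Columns, not an int and a Column.
df_1.filter((df_1["SalesVolume"] > 10000) & (df_1["AveragePrice"] > 70000)).show()

# Since AveragePrice is a string, an explicit cast keeps the comparison numeric
# instead of relying on implicit coercion.
df_1.filter(
    (F.col("SalesVolume") > 10000) & (F.col("AveragePrice").cast("int") > 70000)
).show()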

Related

Why do I get a naming convention error in PySpark when the name is correct?

I'm trying to group by a column called saleId and then get the sum of a column called totalAmount with the code below:
df = df.groupBy('saleId').agg({"totalAmount": "sum"})
But I get the following error:
Attribute sum(totalAmount) contains an invalid character among
,;{}()\n\t=. Please use an alias to rename it
I'm assuming there's something wrong with the way I'm using groupBy, because I also get errors when I try the following code instead of the one above:
df = df.groupBy('saleId').sum('totalAmount')
What's the problem with my code?
OK, I figured out what went wrong.
The code I used in my question returns the whole expression sum(totalAmount) as the name of the resulting column, which, as you can see, includes parentheses.
This can be avoided by using:
df= df.groupBy('saleId').agg({"totalAmount": "sum"}).withColumnRenamed('sum(totalAmount)', 'totalAmount')
or, with pyspark.sql.functions imported as F:
df.groupBy('saleId').agg(F.sum('totalAmount').alias('totalAmount'))
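A minimal runnable sketch of both approaches, assuming a toy DataFrame with the same column names (the sample rows are invented):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real df; column names mirror the question.
df = spark.createDataFrame(
    [(1, 100.0), (1, 50.0), (2, 75.0)],
    ["saleId", "totalAmount"],
)

# Dict-style aggregation names the result "sum(totalAmount)", so rename it afterwards.
renamed = (
    df.groupBy("saleId")
    .agg({"totalAmount": "sum"})
    .withColumnRenamed("sum(totalAmount)", "totalAmount")
)

# F.sum with .alias names the column up front.
aliased = df.groupBy("saleId").agg(F.sum("totalAmount").alias("totalAmount"))

renamed.show()
aliased.show()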

Using Presto's Coalesce function with a row on AWS Athena

I am using AWS Web Application Firewall (WAF). The logs are written to an S3 bucket and queried with AWS Athena.
Some log fields are not simple data types but complex JSON types, for example "rulegrouplist", which contains a JSON array of complex types that might have zero or more elements.
So I am using Presto's try() function to convert errors to NULLs, and the coalesce() function to put a dash in their place. (Keeping NULL values causes problems when using GROUP BY.)
try() is working fine but coalesce() is causing a type mismatch problem.
The function call below:
coalesce(try(waf.rulegrouplist[1].terminatingrule),'-')
causes this error:
All COALESCE operands must be the same type: row(ruleid varchar,action varchar,rulematchdetails varchar)
How can I convert "-" to a row or what else can I use that will count as a row?
Apparently you can build a row literal and CAST it to the required row type.
This worked:
coalesce(try(waf.rulegrouplist[1].terminatingrule),CAST(row('null','null','null') as row(ruleid varchar,action varchar,rulematchdetails varchar)))

Multiple parameters in IN clause of Spark SQL from parameter file

I am trying to run a Spark query that creates a curated table from a source table based on values in a parameter file.
properties_file.properties contains the key and values below:
substatus,allow,deny
The Spark query is:
//Code to load property file in parseConf
spark.sql(s"""insert into curated.table select * from source.table where
substatus='${parseConf.substatus}'""")
The above works with a single value in substatus. But what should I do if I need to pass multiple values from the parameter file into an IN clause on substatus, as below?
spark.sql(s"""insert into curated.table select * from source.table where substatus in '${parseConf.substatus}'""")
To resolve my problem, I updated my property file as:
substatus,'allow'-'deny'
Then, in the Scala code, I implemented the logic below:
val subStatus = (parseConf.substatus).replace('-', ',')
spark.sql(s"""insert into curated.table select * from source.table where substatus in (${subStatus})""")
This strategy breaks the single string value into multiple parameters for the IN clause.
The equality operator (=) expects a single value, and reading the value straight from the parameter file passes it in as one string. You need to split the values and then use an IN clause in place of =.
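A rough PySpark rendering of the same technique, for the thread's main language (the parameter value, table names, and variable names here are illustrative, not from the thread):

# Value as read from the properties file, using the '-' separator described above.
sub_status_param = "'allow'-'deny'"

# Turn the separator into commas so the string can be dropped into an IN (...) list.
sub_status = sub_status_param.replace("-", ",")   # -> "'allow','deny'"

query = (
    "insert into curated.table select * from source.table "
    f"where substatus in ({sub_status})"
)
print(query)
# spark.sql(query)  # would run against an existing SparkSession and tables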

np.std change ddof within groupby

I was comparing a manual standard deviation function I wrote against NumPy's built-in one.
There was a slight difference in the returned values.
I looked it up and numpy uses ddof=0 by default.
I am trying to figure out how to pass that within a groupby and I am failing.
My groupby is simply this: grouped = houses.groupby('Yr Sold').agg({'SalePrice': np.std})
If I use np.std(ddof=1), it errors out saying I am missing the required positional argument 'a'.
I looked that up and I see what it is, but it seems to me that 'a' is my 'SalePrice' column.
I have tried a few different ways but every single thing I try results in a syntax error.
Using the groupby syntax above, how do I pass the ddof=1 parameter to adjust numpy's default behavior?
I figured out how to solve my problem, just not by directly using the syntax above.
std_dev_dict = {}
for id, group in houses.groupby('Yr Sold'):
    std_dev_dict[id] = np.std(group['SalePrice'], ddof=1)
print(std_dev_dict)
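For completeness, ddof=1 can also be passed inside the original groupby syntax via a lambda, or you can use pandas' own std, which defaults to ddof=1. A small sketch with invented sample data standing in for houses:

import numpy as np
import pandas as pd

# Toy frame; column names mirror the question, values are made up.
houses = pd.DataFrame({
    "Yr Sold": [2006, 2006, 2007, 2007, 2007],
    "SalePrice": [200000, 250000, 180000, 210000, 195000],
})

# Pass ddof=1 inside the agg via a lambda...
grouped = houses.groupby("Yr Sold").agg({"SalePrice": lambda x: np.std(x, ddof=1)})

# ...or use pandas' built-in std, which already uses ddof=1 by default.
grouped_pd = houses.groupby("Yr Sold")["SalePrice"].std()

print(grouped)
print(grouped_pd)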

Spark SQL like returns no results spark version 1.5.1/1.5.2 using sqlContext select [duplicate]

I am using Spark 1.3.0 and Spark Avro 1.0.0.
I am working from the example on the repository page. The following code works well:
val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")
But what if I need to check whether the doctor string contains a substring? Since we are writing the expression inside a string, what do I do to express a "contains"?
You can use contains (this works with an arbitrary sequence):
df.filter($"foo".contains("bar"))
like (SQL LIKE with SQL simple regular expressions, with _ matching an arbitrary character and % matching an arbitrary sequence):
df.filter($"foo".like("bar"))
or rlike (like with Java regular expressions):
df.filter($"foo".rlike("bar"))
depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
In PySpark / Spark SQL syntax:
where column_n like 'xyz%'
might not work.
Use:
where column_n RLIKE '^xyz'
This works perfectly fine.
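A quick PySpark sketch of the three options (the toy DataFrame and its values are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data; the `doctor` column name mirrors the thread, the values are invented.
df = spark.createDataFrame([("doctor who",), ("doctor strange",), ("nurse",)], ["doctor"])

df.filter(F.col("doctor").contains("who")).show()    # literal substring
df.filter(F.col("doctor").like("doctor%")).show()    # SQL LIKE pattern
df.filter(F.col("doctor").rlike("^doctor")).show()   # Java regular expression

# The same checks also work as SQL strings inside filter:
df.filter("doctor RLIKE '^doctor'").show()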
