Understanding execution order in UDFs on pyspark dataframes

Understanding execution order in UDFs on pyspark dataframes - apache-spark

I was reading up on pyspark UDF when I came across the following snippet:
No guarantee Name is not null will execute first.
If convertUDF(Name) like '%John%' execute first then
you will get runtime error
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
"where Name is not null and convertUDF(Name) like '%John%'") \
.show(truncate=False)
Also, I could write the same code in the dataframe API as well
df_filter = df.filter(df.Name.isNotNull())
df_filter = df.filter(df.Name.contains("John"))
df_filter.select(col(Seqno),convertUDF(df_filtered.Name))
Does the issue of ambiguity in the order of execution of filter show up here in the dataframe API as well? i.e. Could it be that the df.filter(df.Name.isNotNull()) line is not executed before the next df.filter(df.Name.contains("John")) line? What does this ambiguity have to do with UDF being there? Is the order of execution of various filters guarenteed (with or without UDF in the query execution plan) and what is the interplay? For example: Is the filter order guaranteed in the following syntax df.filter(bool1).filter(bool2). What about df.filter(bool1).filter(bool2).select(UDF(col1))?

Related

Parametrize and loop KQL queries in JupyterLab

My question is how to assign variables within a loop in KQL magic command in Jupyter lab. I refer to Microsoft's document on this subject and will base my question on the code given here:
https://learn.microsoft.com/en-us/azure/data-explorer/kqlmagic
1. First query below
%%kql
StormEvents
| summarize max(DamageProperty) by State
| order by max_DamageProperty desc
| limit 10
2. Second: Convert the resultant query to a dataframe and assign a variable to 'statefilter'
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0].State
statefilter
3. This is where I would like to modify the above query and let statefilter have multiple variables (i.e. consist of different states):
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0:3].State
statefilter
4. And finally I would like to run my kql query within a for loop for each of the variables within statefilter. This below syntax may not be correct but it can give an example for what I am looking for:
dfs = [] # an empty list to store dataframes
for state in statefilters:
%%kql
let _state = state;
StormEvents
| where State in (_state)
| do some operations here for that specific state
df = _kql_raw_result_.to_dataframe()
dfs.append(df) # store the df specific to state in the list
The reason why I am not querying all the desired states within the KQL query is to prevent resulting in really large query outcomes being assigned to dataframes. This is not for this sample StormEvents table which has a reasonable size but for my research data which consists of many sites and is really big. Therefore I would like to be able to run a KQL query/analysis for each site within a for loop and assign each site's query results to a dataframe. Please let me know if this is possible or perhaps there may other logical ways to do this within KQL...

There are few ways to do it.
The simplest is to refractor your %%kql cell magic to a %kql line magic.
Line magic can be embedded in python cell.
Other option is to: from Kqlmagic import kql
The Kqlmagic kql method, accept as a string a kql cell or line.
You can call kql from python.
Third way is to call the kql magic via the ipython method:
ip.run_cell_magic('kql', {your kql magic cell text})
You can call it from python.

Example of using the single line magic mentioned by Michael and a return statement that converted the result to JSON. Without the conversion to JSON I wasn't getting anything back.
def testKQL():
%kql DatabaseName | take 10000
return _kql_raw_result_.to_dataframe().to_json(orient='records')

Writing spark.sql dataframe result to parquet file

I enabled the following spark.sql session:
# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to produce see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simple version of this query to the same path. The problem appears by adding functions such as count(), year() and etc.
What is the problem? and how can I save the results to hdfs?

It is giving error due to '(' present in column 'year(CAST(plt_date AS DATE))' :
Use to rename :
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Upvote if works
Refer : Rename Spark Column

PySpark throwing ParseException for syntactical correct Hive Query

I got a DDL query that works fine within beeline, but when I try to run the same query within a sparkSession it throws a parse Exception.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris","thrift://localhsost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
Pyspark Exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How Can I make the hiveQL function work within pySpark?
The problem seems to be that the query is executed like a SparkSQL-Query and not like a HiveQL-Query, even though I got enableHiveSupport activated for the sparkSession.

Spark SQL queries use SparkSQL by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well-documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.

can't resolve ... given input columns

I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.
The data is like:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...
I then read it with (in Pyspark):
flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")
Then I try to run the following command:
flightData2015.select(max("count")).take(1)
I get the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
+- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"
I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?
NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:
flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)
I can also do the following and it works fine:
flightData2015.show()

Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.
When python evaluates max("count") in your code it returns the letter 'u', which is the maximum value in the collection of letters that make up the string.
print(max("count"))
#'u'
Try this instead:
import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
eventhough this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.

From Spark DataFrame documentation
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field can be a reserved word, try with:
records.filter(records['field_i'] == 3)

What I did was to upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera Quick Start CDH-5.4.0 and the second filtering feature works. Although I still can't explain why 1.3.0 has problems on that.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Understanding execution order in UDFs on pyspark dataframes - apache-spark

Related

Parametrize and loop KQL queries in JupyterLab

Writing spark.sql dataframe result to parquet file

PySpark throwing ParseException for syntactical correct Hive Query

can't resolve ... given input columns

Existing column can't be found by DataFrame#filter in PySpark

Categories

Resources