Azure Apache Spark groupby clause throws an error - azure

I am following this section of a tutorial on Apache Spark from Azure team. But when I try to use BroupBy function of DataFrame, I get the following error:
Error:
NameError: name 'TripDistanceMiles' is not defined
Question: What may be a cause of the error in the following code, and how can it be fixed?
NOTE: I know how to group by the following results using Spark SQL as it is shown in a later section of the same tutorial. But I am interested in using the Groupby clause on the DataFrame
Details:
a) Following code correctly displays 100 rows with column headers PassengerCount and TripDistanceMiles:
%%pyspark
df = spark.read.load('abfss://testcontainer4synapse#adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
display(df.select("PassengerCount","TripDistanceMiles").limit(100))
b) But the following code does not group by the records and throws error shown above:
%%pyspark
df = spark.read.load('abfss://testcontainer4synaps#adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
df = df.select("PassengerCount","TripDistanceMiles").limit(100)
display(df.groupBy("PassengerCount").sum(TripDistanceMiles).limit(100))

Try putting the TripDistanceMiles into double quotes. Like
display(df.groupBy("PassengerCount").sum("TripDistanceMiles").limit(100))

Related

Databricks accessing DataFrame in SQL

I'm learning Databricks and got stuck on the simplest step.
I'd like to utilize my DataFrame from DB's SQL ecosystem
Here are my steps:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
display(df)
Everything is fine, df is displayed. Then submitting:
df.createOrReplaceGlobalTempView("covid")
Finally:
%sql
show tables
No results are displayed. When trying:
display(spark.sql('SELECT * FROM covid LIMIT 10'))
Getting the error:
[TABLE_OR_VIEW_NOT_FOUND] The table or view `covid` cannot be found
When executing:
df.createGlobalTempView("covid")
Again, I'm getting a message covid already exists.
How to access my df from sql ecosystem, please?
In a Databricks notebook, if you're looking to utilize SQL to query your dataframe loaded in python,
you can do so in the following way (using your example data):
Setup df in python
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
setup your global view
df.createGlobalTempView("covid")
Then a simply query in SQL will be equivalent to display() function
%sql
SELECT * FROM global_temp.covid
If you want to avoid using global_temp prefix, use df.createTempView

Understanding execution order in UDFs on pyspark dataframes

I was reading up on pyspark UDF when I came across the following snippet:
No guarantee Name is not null will execute first.
If convertUDF(Name) like '%John%' execute first then
you will get runtime error
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
"where Name is not null and convertUDF(Name) like '%John%'") \
.show(truncate=False)
Also, I could write the same code in the dataframe API as well
df_filter = df.filter(df.Name.isNotNull())
df_filter = df.filter(df.Name.contains("John"))
df_filter.select(col(Seqno),convertUDF(df_filtered.Name))
Does the issue of ambiguity in the order of execution of filter show up here in the dataframe API as well? i.e. Could it be that the df.filter(df.Name.isNotNull()) line is not executed before the next df.filter(df.Name.contains("John")) line? What does this ambiguity have to do with UDF being there? Is the order of execution of various filters guarenteed (with or without UDF in the query execution plan) and what is the interplay? For example: Is the filter order guaranteed in the following syntax df.filter(bool1).filter(bool2). What about df.filter(bool1).filter(bool2).select(UDF(col1))?

Loading Data from Azure Synapse Database into a DataFrame with Notebook

I am attempting to load data from Azure Synapse DW into a dataframe as shown in the image.
However, I'm getting the following error:
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Traceback (most recent call last):
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Any thoughts on what I'm doing wrong?
That particular method has changed its name to synapsesql (as per the notes here) and is Scala only currently as I understand it. The correct syntax would therefore be:
%%spark
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
It is possible to share the Scala dataframe with Python via the createOrReplaceTempView method, but I'm not sure how efficient that is. Mixing and matching is described here. So for your example you could mix and match Scala and Python like this:
Cell 1
%%spark
// Get table from dedicated SQL pool and assign it to a dataframe with Scala
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
// Save the dataframe as a temp view so it's accessible from PySpark
df.createOrReplaceTempView("someTable")
Cell 2
%%pyspark
## Scala dataframe is now accessible from PySpark
df = spark.sql("select * from someTable")
## !!TODO do some work in PySpark
## ...
The above linked example shows how to write the dataframe back to the dedicated SQL pool too if required.
This is a good article for importing / export data with Synpase notebooks and the limitation is described in the Constraints section:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export#constraints

Writing spark.sql dataframe result to parquet file

I enabled the following spark.sql session:
# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to produce see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simple version of this query to the same path. The problem appears by adding functions such as count(), year() and etc.
What is the problem? and how can I save the results to hdfs?
It is giving error due to '(' present in column 'year(CAST(plt_date AS DATE))' :
Use to rename :
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Upvote if works
Refer : Rename Spark Column

can't resolve ... given input columns

I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.
The data is like:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...
I then read it with (in Pyspark):
flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")
Then I try to run the following command:
flightData2015.select(max("count")).take(1)
I get the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
+- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"
I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?
NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:
flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)
I can also do the following and it works fine:
flightData2015.show()
Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.
When python evaluates max("count") in your code it returns the letter 'u', which is the maximum value in the collection of letters that make up the string.
print(max("count"))
#'u'
Try this instead:
import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()

Resources