can't resolve ... given input columns - apache-spark

I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.
The data is like:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...
I then read it with (in Pyspark):
flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")
Then I try to run the following command:
flightData2015.select(max("count")).take(1)
I get the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
+- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"
I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?
NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:
flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)
I can also do the following and it works fine:
flightData2015.show()

Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.
When Python evaluates max("count") in your code, it treats the string as an iterable of characters and returns 'u', the lexicographically largest letter in "count". That is also why the SQL version works: there, max is parsed and resolved by Spark SQL itself rather than by Python.
print(max("count"))
#'u'
Try this instead:
import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()
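If you prefer to avoid any chance of shadowing Python's built-in max, the same aggregation can also be written with agg() (a minimal equivalent sketch using the DataFrame from the question):
import pyspark.sql.functions as f
# f.max is the Spark SQL aggregate function, not Python's built-in max
flightData2015.agg(f.max("count")).show()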

Related

Understanding execution order in UDFs on pyspark dataframes

I was reading up on pyspark UDF when I came across the following snippet:
No guarantee Name is not null will execute first.
If convertUDF(Name) like '%John%' execute first then
you will get runtime error
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
"where Name is not null and convertUDF(Name) like '%John%'") \
.show(truncate=False)
I could also write the same code in the DataFrame API:
df_filter = df.filter(df.Name.isNotNull())
df_filter = df_filter.filter(df_filter.Name.contains("John"))
df_filter.select(col("Seqno"), convertUDF(df_filter.Name))
Does this ambiguity in the order of filter execution show up in the DataFrame API as well? That is, could the df.filter(df.Name.isNotNull()) line not be executed before the following .filter(...contains("John")) line? What does this ambiguity have to do with the presence of a UDF? Is the order of execution of the various filters guaranteed (with or without a UDF in the query plan), and what is the interplay? For example, is the filter order guaranteed in df.filter(bool1).filter(bool2)? What about df.filter(bool1).filter(bool2).select(UDF(col1))?
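Not from the original thread, but a common way to sidestep the ordering question is to make the UDF itself null-safe, so it no longer matters whether the null check runs before or after the UDF is evaluated. A minimal sketch, assuming a DataFrame df with columns Seqno and Name and a placeholder transformation inside convertUDF:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF that tolerates nulls, so filter/UDF ordering cannot cause a runtime error
@F.udf(returnType=StringType())
def convertUDF(name):
    if name is None:
        return None
    return name.upper()  # placeholder transformation

df_safe = (df
    .filter(F.col("Name").isNotNull() & F.col("Name").contains("John"))
    .select(F.col("Seqno"), convertUDF(F.col("Name")).alias("Name")))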

Converting a Case Transform in ADF Mapping DataFlow

I am currently building a Data Flow in ADF in which I am converting the query below, which currently lives in another ETL tool called BigDecission. The query looks like this:
SELECT
Asset_ID,
MAX(CASE WHEN meter = 'LTPC' THEN reading_date ELSE NULL END) AS LTPC_Date,
MAX(CASE WHEN meter = 'LTPC' THEN page_Count ELSE NULL END) AS LTPC
FROM
mv_latest_asset_read
GROUP BY
Asset_ID
While converting this piece in the ADF Data Flow I used an AGGREGATE transform and grouped by "ASSET_ID".
In the AGGREGATES tab I am deriving the columns "LTPC_DATE" and "LTPC" with the code below:
LTPC_DATE ----> max(case(METER=='LTPC',READING_DATE))
LTPC ----> max(case(METER=='LTPC',PAGE_COUNT))
But in the output I am getting null values, which shouldn't be the case. Can anyone identify the right way to do it?
I followed the same approach to reproduce the above and I am getting the proper result.
Please check the below:
My source data:
Here I have added 2 additional columns using a derived column transformation and given them sample values.
Group By and aggregate:
Used max(case(condition, expression)) here.
Result in Data preview:
Try checking your projection in the source. Also, write this to a sink file and check whether it gives the correct result.
If it still gives the same, you can try maxIf(condition, expression) as suggested by #Mark Kromer MSFT.
The above also gives the same result for me.
If your source is a database, you can try the query option in the source of the data flow and supply the above query.
After importing the projection, you can see the desired result in the Data preview.
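For comparison only (this is not part of the ADF answer), the same conditional aggregation can be sketched in PySpark, assuming a DataFrame df with the columns used in the query above:
from pyspark.sql import functions as F

# MAX(CASE WHEN ...) becomes max(when(...)); non-matching rows yield NULL, which max() ignores
result = (df.groupBy("Asset_ID")
    .agg(F.max(F.when(F.col("meter") == "LTPC", F.col("reading_date"))).alias("LTPC_Date"),
         F.max(F.when(F.col("meter") == "LTPC", F.col("page_Count"))).alias("LTPC")))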

Azure Apache Spark groupby clause throws an error

I am following this section of a tutorial on Apache Spark from the Azure team. But when I try to use the groupBy function of the DataFrame, I get the following error:
Error:
NameError: name 'TripDistanceMiles' is not defined
Question: What may be a cause of the error in the following code, and how can it be fixed?
NOTE: I know how to group these results using Spark SQL, as shown in a later section of the same tutorial, but I am interested in using groupBy on the DataFrame.
Details:
a) The following code correctly displays 100 rows with the column headers PassengerCount and TripDistanceMiles:
%%pyspark
df = spark.read.load('abfss://testcontainer4synapse@adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
display(df.select("PassengerCount","TripDistanceMiles").limit(100))
b) But the following code does not group the records and throws the error shown above:
%%pyspark
df = spark.read.load('abfss://testcontainer4synapse@adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
df = df.select("PassengerCount","TripDistanceMiles").limit(100)
display(df.groupBy("PassengerCount").sum(TripDistanceMiles).limit(100))
Try putting TripDistanceMiles in double quotes, like:
display(df.groupBy("PassengerCount").sum("TripDistanceMiles").limit(100))

Writing spark.sql dataframe result to parquet file

I created the following Spark session with Hive support:
# creating a Spark session with Hive support
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simpler version of this query to the same path. The problem appears when I add functions such as count(), year(), etc.
What is the problem, and how can I save the results to hdfs?
It is giving an error because of the '(' present in the column name 'year(CAST(plt_date AS DATE))'.
Rename the column with an alias:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Upvote if it works.
Refer to: Rename Spark Column
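Putting it together, a minimal sketch that aliases every aggregate column and then writes the result as parquet (the output path is hypothetical):
result = spark.sql("""
    SELECT year(plt_date)  AS Year,
           month(plt_date) AS Month,
           count(build)    AS B_Count,
           count(product)  AS P_Count
    FROM first_table
    FULL OUTER JOIN second_table ON key1 = CONCAT('SS', key_2)
    GROUP BY year(plt_date), month(plt_date)
""")

# Every column now has a plain alias, so parquet's column-name restrictions
# (no spaces or parentheses) are satisfied.
result.write.mode("overwrite").parquet("hdfs:///tmp/flights_summary")  # hypothetical path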

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
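As a side note (not part of the original answer), pyspark.sql.functions.col gives an equivalent, name-clash-proof way to reference the column:
from pyspark.sql import functions as F

# col() looks the column up by name, avoiding attribute/reserved-word clashes
records.filter(F.col("field_i") == 3)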
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera Quickstart CDH-5.4.0, and the second filtering approach now works. I still can't explain why 1.3.0 had problems with it.
