How to use variables in PySpark functions like months_between

I am a newbie in PySpark and am having difficulty using a Python variable in PySpark functions. Spark treats the variable as a column name and throws an exception.
var_date_to = '2020-06-01'
months_between(col("date_to"), var_date_to)
Exception Thrown:
pyspark.sql.utils.AnalysisException: "cannot resolve '2020-06-01' given input columns: [......
I tried formatting the input string, but I get the same exception:
months_between(col("date_to"),'{0}'.format(var_date_to))
Please help

You have to convert it to a Column first with lit:
months_between(col("date_to"), lit(var_date_to))
and it will work.
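For context, a minimal runnable sketch of that fix (the sample DataFrame and the alias are my own additions):
# lit() wraps the Python string so Spark treats it as a literal value rather
# than trying to resolve it as a column name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, months_between

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2020-09-01",)], ["date_to"])

var_date_to = '2020-06-01'
df.select(months_between(col("date_to"), lit(var_date_to)).alias("months")).show()
# +------+
# |months|
# +------+
# |   3.0|
# +------+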

Related

AzureML TabularDatasetFactory.from_parquet_files() error handling column types

I'm reading in a folder of parquet files using azureml's TabularDatasetFactory method:
dataset = TabularDatasetFactory.from_parquet_files(path=[(datastore_instance, "path/to/files/*.parquet")])
but I'm running into an issue: one of the columns is typed 'List' in the parquet files, and it seems TabularDatasetFactory.from_parquet_files() can't handle that type.
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetType
Failed Step: xxxxxx
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
No conversion exists for column: '[REDACTED COLUMN NAME]', from Parquet SchemaType: 'List' to DataPrep ValueKind
So I'm wondering if there's a way to tell TabularDatasetFactory.from_parquet_files() specifically which columns to pull in, or a way to tell it to fall back to object/string for any unsupported column types. Or maybe there's a workaround: first reading the files in as a FileDataset, then selecting which columns to use?
I do see the set_column_types parameter, but I don't know the columns until I read the files into a dataset, since I'm using datasets to explore what data is available in the folder paths in the first place.
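If the column names were known up front, set_column_types would be the natural thing to try. A hedged sketch (the column name is a placeholder, and whether DataType.to_string() can coerce a Parquet 'List' column is not verified here):
# Hypothetical: force the problematic column to string via set_column_types.
from azureml.data.dataset_factory import TabularDatasetFactory, DataType

dataset = TabularDatasetFactory.from_parquet_files(
    path=[(datastore_instance, "path/to/files/*.parquet")],
    set_column_types={"my_list_column": DataType.to_string()},
)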

Azure Apache Spark groupby clause throws an error

I am following this section of a tutorial on Apache Spark from the Azure team. But when I try to use the groupBy function of the DataFrame, I get the following error:
Error:
NameError: name 'TripDistanceMiles' is not defined
Question: What may be the cause of the error in the following code, and how can it be fixed?
NOTE: I know how to group the results using Spark SQL, as shown in a later section of the same tutorial. But I am interested in using the groupBy clause on the DataFrame.
Details:
a) The following code correctly displays 100 rows with column headers PassengerCount and TripDistanceMiles:
%%pyspark
df = spark.read.load('abfss://testcontainer4synapse@adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
display(df.select("PassengerCount","TripDistanceMiles").limit(100))
b) But the following code does not group the records and throws the error shown above:
%%pyspark
df = spark.read.load('abfss://testcontainer4synapse@adlsgen2synspsetest.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
df = df.select("PassengerCount","TripDistanceMiles").limit(100)
display(df.groupBy("PassengerCount").sum(TripDistanceMiles).limit(100))
Try putting TripDistanceMiles in double quotes, like:
display(df.groupBy("PassengerCount").sum("TripDistanceMiles").limit(100))
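The NameError comes from Python itself: without quotes, TripDistanceMiles is evaluated as an (undefined) Python variable before Spark ever sees it. An equivalent, slightly more explicit sketch using agg (the alias name is my own choice):
from pyspark.sql import functions as F

display(
    df.groupBy("PassengerCount")
      .agg(F.sum("TripDistanceMiles").alias("TotalTripDistanceMiles"))
      .limit(100)
)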

Read multiple text file in single dataframe

I'm trying to read multiple text files into a single DataFrame in PySpark and then call show(), but I get an error on the second file path.
BUYERS10_m1 = spark.read.text(Buyers_F1_path,Buyers_F2_path)
BUYERS10_m1.show()
Py4JJavaError: An error occurred while calling o245.showString.
: java.lang.IllegalArgumentException: For input string: "s3a://testing/Buyers/File2.TXT"
Does anyone have any idea why I'm getting this error and how to resolve it?
The following should work:
spark.read.text("s3a://testing/Buyers/File{1,2}.TXT")
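The original call fails because the second positional parameter of DataFrameReader.text() is not another path (in recent PySpark versions it is the wholetext flag), so the second string cannot be interpreted. Passing the paths as a list also works; a minimal sketch assuming the two path variables from the question:
# text() accepts a list of paths as its first argument, so neither path is
# mistaken for the wholetext flag.
BUYERS10_m1 = spark.read.text([Buyers_F1_path, Buyers_F2_path])
BUYERS10_m1.show()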

Error while overwriting Cassandra table from PySpark

I am attempting to overwrite data in Cassandra with a PySpark DataFrame, and I get this error: keyword can't be an expression
I am able to append the data with:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable").mode("append").save()
However, overwriting throws an error:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable", confirm.truncate="true").mode("overwrite").save()
Error: keyword can't be an expression
I found the solution:
df.write.format("org.apache.spark.sql.cassandra") \
    .mode("overwrite").option("confirm.truncate", "true") \
    .options(keyspace="ks", table="testtable") \
    .save()
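The root cause is Python syntax: confirm.truncate contains a dot, so it cannot be used as a keyword-argument name, which is exactly what "keyword can't be an expression" means. Besides moving it into .option() as above, an equivalent sketch passes everything by unpacking a dict into .options():
# Option names containing dots can be supplied as dict keys, since .options()
# accepts arbitrary keyword arguments.
df.write.format("org.apache.spark.sql.cassandra") \
    .mode("overwrite") \
    .options(**{"keyspace": "ks", "table": "testtable", "confirm.truncate": "true"}) \
    .save()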

can't resolve ... given input columns

I'm going through the Spark: The Definitive Guide book from O'Reilly and I'm running into an error when I try to do a simple DataFrame operation.
The data looks like:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
...
I then read it with (in PySpark):
flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("./data/flight-data/csv/2015-summary.csv")
Then I try to run the following command:
flightData2015.select(max("count")).take(1)
I get the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`u`' given input columns: [DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count];;
'Project ['u]
+- AnalysisBarrier
+- Relation[DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] csv"
I don't know where "u" is even coming from, since it's not in my code and it isn't in the data file header either. I read another suggestion that this could be caused by spaces in the header, but that's not applicable here. Any idea what to try?
NOTE: The strange thing is, the same thing works when I use SQL instead of the DataFrame transformations. This works:
flightData2015.createOrReplaceTempView("flight_data_2015")
spark.sql("SELECT max(count) from flight_data_2015").take(1)
I can also do the following and it works fine:
flightData2015.show()
Your issue is that you are calling the built-in max function, not pyspark.sql.functions.max.
When Python evaluates max("count") in your code, it iterates over the characters of the string and returns 'u', the maximum letter in "count".
print(max("count"))
# 'u'
Try this instead:
import pyspark.sql.functions as f
flightData2015.select(f.max("count")).show()
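If you'd rather not import the functions module, a selectExpr sketch reaches Spark's own max directly and sidesteps the shadowing entirely:
# selectExpr parses "max(count)" as a SQL expression, so Python's built-in max
# never enters the picture.
flightData2015.selectExpr("max(count)").take(1)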
