PySpark dataframe error: java.lang.ArrayIndexOutOfBoundsException - apache-spark

Consider this parquet file. I ran the following code on it:
nis = spark.read.parquet(PATH)
nis.show()
nis.filter('mixed is null').show()
Output is:
+-------+---------+-----+
|nonulls|onlynulls|mixed|
+-------+---------+-----+
| value1| null| one|
| value2| null| null|
| value3| null| two|
+-------+---------+-----+
[...]
Py4JJavaError: An error occurred while calling o201.showString.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 25) (10.250.33.73 executor 32): java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
The error is raised for every column: each time I filter by 'is null' or 'is not null' it fails, except when the result of the filter would be an empty dataframe.
What causes this problem? Is PySpark unable to filter nulls?
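As a purely diagnostic sketch (not a confirmed fix), one way to narrow this down is to check whether the Parquet scan itself, rather than the null filter, is at fault: compare the schema Spark infers with what you expect, and re-run the filter with the vectorized Parquet reader turned off (spark.sql.parquet.enableVectorizedReader is a standard Spark option).
# Diagnostic only -- not a confirmed fix for the error above.
nis = spark.read.parquet(PATH)
nis.printSchema()  # does the inferred schema match the file's columns?

# If the error disappears with the vectorized reader disabled,
# the problem lies in the columnar Parquet scan path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet(PATH).filter("mixed is null").show()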

Related

Spark SQL on AWS Glue: pyspark.sql.utils.AnalysisException

I am using Spark SQL in an AWS Glue script to transform some data in S3.
Here is the script logic:
Data format: CSV
Programming language: Python
1) Pull the data from S3 via the Glue Catalog into a Glue DynamicFrame
2) Extract a Spark DataFrame from the DynamicFrame using toDF()
3) Register the Spark DataFrame as a Spark SQL table with createOrReplaceTempView()
4) Use a SQL query to transform the data (this is where I am having issues)
5) Convert the final DataFrame back to a Glue DynamicFrame
6) Store the final DataFrame in S3 using glueContext.write_dynamic_frame.from_options()
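A rough sketch of that workflow, assuming a standard Glue job setup (the database, table, and S3 path names are placeholders):
# Sketch of the workflow above; database/table/path names are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# 1-2) Pull from the Glue Catalog and extract a Spark DataFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="sales")
sdf_sales = dyf.toDF()

# 3-4) Register a temp view and transform with Spark SQL
sdf_sales.createOrReplaceTempView("sales")
result = spark.sql("select customer_name, sum(demand_amt) as total_spent from sales group by customer_name")

# 5-6) Convert back to a DynamicFrame and write to S3
out_dyf = DynamicFrame.fromDF(result, glueContext, "out_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=out_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)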
Problem
When I use a comparison in SQL, such as a WHERE clause with >
or
(case when <some_column> > <some_int> then 1 else 0 end) as <some_newcol>
I get the following error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(sales.`cxvalue` >
100000)' due to data type mismatch: differing types in '(sales.`cxvalue` >
100000)' (struct<int:int,string:string> and int).; line 1 pos 35;\n'Project
['demand_amt]\n+- 'Filter (cxvalue#4 > 100000)\n +- SubqueryAlias sales\n +-
LogicalRDD [sales_id#0, customer_name#1, customer_loc#2, demand_amt#3L,
cxvalue#4]\n"
pyspark.sql.utils.AnalysisException: u"cannot resolve '(sales.`cxvalue` =
100000)' due to data type mismatch: differing types in '(sales.`cxvalue` =
100000)' (struct<int:int,string:string> and int).; line 1 pos 33;\n'Project
[customer_name#1, CASE WHEN (cxvalue#4 = 100000) THEN demand_amt#3 ELSE 0 END AS
small#12, CASE WHEN cxvalue#4 IN (200000,300000,400000) THEN demand_amt#3 ELSE 0
END AS medium#13]\n+- SubqueryAlias sales\n +- LogicalRDD [sales_id#0,
customer_name#1, customer_loc#2, demand_amt#3, cxvalue#4]\n"
This tells me it is treating the column as both numeric and string, and that the issue is specific to Spark rather than AWS. SUM() with
GROUP BY works fine; only comparisons fail.
I have tried the following steps:
1) Tried to change the column type using the Spark method - Failed
df = df.withColumn(<column>, df[<column>].cast(DoubleType()))  # df is a Spark DataFrame
Glue does not allow changing the data type of a Spark DataFrame column this way.
2) Used Glue's resolveChoice method as explained in https://github.com/aws-samples/aws-gluesamples/blob/master/examples/resolve_choice.md . The resolveChoice method worked - but the SQL failed with the same error
3) Used cast(<column> as <data_type>) in the SQL query - Failed
4) Spun up a Spark cluster on Google Cloud (just to rule out anything AWS-related) and used plain Spark with the same logic as above - Failed with the same error
5) On the same Spark cluster and the same data set, used the same logic but enforced a schema using StructType and StructField while creating a new Spark DataFrame - Passed (see the sketch below)
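For reference, a sketch of the schema enforcement from step 5, using the column names from the sample data below (the types and the input path are assumptions):
# Sketch of step 5: enforce the schema explicitly when creating the DataFrame.
# Types and the S3 path are assumptions based on the sample data shown below.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

schema = StructType([
    StructField("sales_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_loc", StringType(), True),
    StructField("demand_amt", LongType(), True),
    StructField("cxvalue", LongType(), True),
])

sdf_sales = spark.read.csv("s3://my-bucket/sales/", header=True, schema=schema)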
Here is the Sample Data
+--------+-------------+------------+----------+-------+
|sales_id|customer_name|customer_loc|demand_amt|cxvalue|
+--------+-------------+------------+----------+-------+
|       1|          ABC|   Denver CO|      1200| 300000|
|       2|          BCD|   Boston MA|       212| 120000|
|       3|          CDE|  Phoenix AZ|       332| 100000|
|       4|          BCD|   Boston MA|       211| 120000|
|       5|          DEF| Portland OR|      2121|1000000|
|       6|          CDE|  Phoenix AZ|        32| 100000|
|       7|          ABC|   Denver CO|      3227| 300000|
|       8|          DEF| Portland OR|      2121|1000000|
|       9|          BCD|   Boston MA|        21| 120000|
|      10|          ABC|   Denver CO|      1200| 300000|
+--------+-------------+------------+----------+-------+
These are sample code and queries where things fail
sdf_sales.createOrReplaceTempView("sales")
tbl1="sales"
sql2="""select customer_name, (case when cxvalue < 100000 then 1 else 0 end) as small,
(case when cxvalue in (200000, 300000, 400000) then demand_amt else 0 end) as medium
from {0}
""".format(tbl1)
sql4="select demand_amt from {0} where cxvalue>100000".format(tbl1)
However, queries like the following work fine and the Glue job succeeds:
sql3="""select customer_name, sum(demand_amt) as total_spent from {0} GROUP BY customer_name""".format(tbl1)
Challenge:
I wish Glue somehow allowed me to change the Spark DataFrame schema. Any suggestion will be appreciated.
Update: AWS Glue's resolveChoice fixed the issue.
The programming logic error was treating the Spark frame as mutable.
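Since the lesson was that the frame is not mutable, a sketch of what the working fix presumably looks like: reassign the result of resolveChoice (and of every other transformation) before registering the view. The cast:long choice is an assumption about the intended type.
# resolveChoice returns a NEW DynamicFrame -- reassign it instead of
# expecting an in-place change. "cast:long" is an assumed target type.
dyf = dyf.resolveChoice(specs=[("cxvalue", "cast:long")])

sdf_sales = dyf.toDF()
sdf_sales.createOrReplaceTempView("sales")
spark.sql("select demand_amt from sales where cxvalue > 100000").show()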

Spark: how to remove unnecessary characters in df column values

I have a df like this:
+----+---+
| _c0|_c1|
+----+---+
|('a'| 2)|
|('b'| 4)|
|('c'| 6)|
+----+---+
I want it like below; how can I do that?
+----+---+
| _c0|_c1|
+----+---+
| a | 2 |
| b | 4 |
| c | 6 |
+----+---+
If I try this, I get an error:
df1.select(regexp_replace('_c0', "('", "c")).show()
An error occurred while calling o789.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 71.0 failed 1 times, most recent failure: Lost task
1.0 in stage 71.0 (TID 184, localhost, executor driver): java.util.regex.PatternSyntaxException: Unclosed group near index 2
Like the other user said, it is necessary to escape special characters such as brackets with a backslash. Here you can find a list of regex special characters. The following code uses two different approaches for your problem. With regexp_extract we extract the single character between (' and ' in column _c0. With regexp_replace we replace ) in the second column. You could of course use only the regexp_replace function with the regex "[()']" to achieve what you wanted; I just want to show you two different ways to tackle the problem.
from pyspark.sql import functions as F

columns = ['_c0', '_c1']
vals = [("('a'", "2)"), ("('b'", "4)"), ("('c'", "6)")]
df = spark.createDataFrame(vals, columns)

# Extract the letter between (' and ' in _c0; strip the trailing ) from _c1.
df = df.select(F.regexp_extract('_c0', r"\('(\w)'", 1).alias('_c0'),
               F.regexp_replace('_c1', r"\)", '').alias('_c1'))
df.show()
Output:
+---+---+
|_c0|_c1|
+---+---+
| a| 2|
| b| 4|
| c| 6|
+---+---+
You should escape the brackets:
df1.select(regexp_replace('_c0', "\\('", "c")).show()
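As noted in the first answer, a single regexp_replace per column with the character class "[()']" also does the job; a sketch using the df built above:
# One character-class regex strips the parentheses and quotes from both columns.
from pyspark.sql import functions as F

cleaned = df.select(
    F.regexp_replace("_c0", r"[()']", "").alias("_c0"),
    F.regexp_replace("_c1", r"[()']", "").alias("_c1"),
)
cleaned.show()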

Adding a Arraylist value to a new column in Spark Dataframe using Pyspark [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 5 years ago.
I want to add a new column to my existing dataframe. Below is my dataframe:
+---+---+-----+
| x1| x2| x3|
+---+---+-----+
| 1| a| 23.0|
| 3| B|-23.0|
+---+---+-----+
I am able to add a constant column with df = df.withColumn("x4", lit(0)), which gives this:
+---+---+-----+---+
| x1| x2| x3| x4|
+---+---+-----+---+
| 1| a| 23.0| 0|
| 3| B|-23.0| 0|
+---+---+-----+---+
but I want to add an array (list) to my df.
Suppose [0,0,0,0] is the array to add; after adding it, my df will look like this:
+---+---+-----+---------+
| x1| x2| x3| x4|
+---+---+-----+---------+
| 1| a| 23.0|[0,0,0,0]|
| 3| B|-23.0|[0,0,0,0]|
+---+---+-----+---------+
I tried this:
array_list = [0,0,0,0]
df = df.withColumn("x4", lit(array_list))
But it gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [0, 0, 0, 0, 0, 0]
Do anybody know how to do this?
Based on your comment:
My array is variable and I have to add it in multiple places with different values. This approach is fine for adding the same value, or for adding one or two arrays; it will not suit adding huge data.
I believe this is an XY problem. If you want a scalable solution (1000 rows is not huge, to be honest), then use another dataframe and join. For example, if you want to join on x1:
arrays = spark.createDataFrame([
    (1, [0.0, 0.0, 0.0]), (3, [0.0, 0.0, 0.0])
], ("x1", "x4"))

df.join(arrays, ["x1"])
Add a more complex join condition depending on your requirements.
To solve your immediate problem, see How to add a constant column in a Spark DataFrame? - all elements of the array should be columns:
from pyspark.sql.functions import array, lit
array(lit(0.0), lit(0.0), lit(0.0))
# Column<b'array(0.0, 0.0, 0.0)'>
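Putting that together for the original [0,0,0,0] example, a sketch that builds the array column by wrapping each element of the Python list in lit():
# Build an array column from a Python list by turning each element into a literal.
from pyspark.sql.functions import array, lit

array_list = [0, 0, 0, 0]
df = df.withColumn("x4", array(*[lit(x) for x in array_list]))
df.show()  # every row now carries [0, 0, 0, 0] in x4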

What exactly does .select() do?

I ran into a surprising behavior when using .select():
>>> my_df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 5|
| 2| 4| 6|
+---+---+---+
>>> a_c = s_df.select(col("a"), col("c")) # removing column b
>>> a_c.show()
+---+---+
| a| c|
+---+---+
| 1| 5|
| 2| 6|
+---+---+
>>> a_c.filter(col("b") == 3).show() # I can still filter on "b"!
+---+---+
| a| c|
+---+---+
| 1| 5|
+---+---+
This behavior got me wondering... Are the following points correct?
DataFrames are just views; a simple DataFrame is a view of itself. In my case a_c is just a view into my_df.
When I created a_c no new data was created; a_c is just pointing at the same data that my_df points to.
If there is additional information that is relevant, please add!
This is happening because of the lazy nature of Spark. It is "smart" enough to push the filter down so that it happens at a lower level, before the select*. Since this all happens within the same stage of execution, the column can still be resolved. In fact, you can see this in explain:
== Physical Plan ==
*Project [a#0, c#2]
+- *Filter (b#1 = 3) <---Filter before Project
+- LocalTableScan [A#0, B#1, C#2]
You can force a shuffle and a new stage, though, and then see your filter fail; it is even caught at analysis time. Here's an example:
a_c.groupBy("a", "c").count().filter(col("b") == 3)
*There is also projection pruning, which pushes the column selection down to the data source if Spark realizes it doesn't need a column at any point. However, I believe the filter would cause it to "need" the column and not prune it... but I didn't test that.
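A sketch that reproduces the plan above in PySpark, using the values from the question:
# Rebuild the question's example and inspect the physical plan; the Filter on b
# sits below the Project even though b was "dropped" by the select.
from pyspark.sql.functions import col

my_df = spark.createDataFrame([(1, 3, 5), (2, 4, 6)], ["a", "b", "c"])
a_c = my_df.select(col("a"), col("c"))
a_c.filter(col("b") == 3).explain()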
Let us start with some basics about Spark's internals; this will make it easier to understand.
RDD: underlying Spark core is a data structure called the RDD, which is lazily evaluated. By lazy evaluation we mean that the RDD computation happens only when an action is called (like count on an RDD or show on a Dataset).
A Dataset or DataFrame (which is a Dataset[Row]) also uses RDDs at its core.
This means every transformation (like filter) is realized only when an action is triggered (show).
So, to your point:
"When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing."
Correct: no data has been realized yet. We have to realize it to bring it into memory. Your filter works on the initial dataframe.
The only way to make your a_c.filter(col("b") == 3).show() throw a runtime exception is to cache your intermediate dataframe using dataframe.cache.
E.g.:
a_c = s_df.select(col("a"), col("c")).cache()
a_c.filter(col("b") == 3).show()
Then Spark will throw "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name.

Calculate quantile on grouped data in spark Dataframe

I have the following Spark dataframe :
+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+
My desired output would be something like:
agent_id   95_quantile
a          whatever the 95th quantile is for agent a payments
b          whatever the 95th quantile is for agent b payments
For each agent_id group I need to calculate the 0.95 quantile, so I take the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount', 0.95)
but I get the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes.
I am using Spark 2.0.0.
One solution would be to use percentile_approx :
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id| approxQuantile|
# +--------+-----------------+
# | a|8239.999999999998|
# | b|7449.999999999998|
# +--------+-----------------+
Note 1 : This solution was tested with spark 1.6.2 and requires a HiveContext.
Note 2 : approxQuantile isn't available in Spark < 2.0 for pyspark.
Note 3 : percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than the second argument value, this gives an exact percentile value.
EDIT : From Spark 2+, HiveContext is not required.
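For completeness, a sketch of the same aggregation through the DataFrame API (assuming percentile_approx is available as a SQL function in your Spark version, per the edit above):
# Same percentile_approx aggregation, expressed via groupBy/agg and expr.
from pyspark.sql import functions as F

quantiles = test_df.groupBy("agent_id").agg(
    F.expr("percentile_approx(payment_amount, 0.95)").alias("approx_quantile_95")
)
quantiles.show()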
