Spark SQL not recognizing \d+ - apache-spark

I am trying to use the regexp_extract function to get the last three digits in the string ABCDF1_123 with:
regexp_extract('ABCDF1_123', 'ABCDF1_(\d+)', 1)
and it does not capture the group. If I change the function call to:
regexp_extract('ABCDF1_123', 'ABCDF1_([0-9]+)', 1)
it works. Can anyone give me some insight into why? I am also grabbing the data from a Postgres database using a JDBC connection.

I ran regexp_extract with both patterns and they give the same output, as shown below:
from pyspark.sql import Row
from pyspark.sql.functions import lit, when, col, regexp_extract
l = ['ABCDF1_123']
rdd = sc.parallelize(l)
sample = rdd.map(lambda x: Row(name=x))
sample_df = sqlContext.createDataFrame(sample)
not_working = r'ABCDF1_(\d+)'
working = r'ABCDF1_([0-9]+)'
sample_df.select(regexp_extract('name', not_working, 1).alias('not_working'),
                 regexp_extract('name', working, 1).alias('working')).show(10)
+-----------+-------+
|not_working|working|
+-----------+-------+
| 123| 123|
+-----------+-------+
Is this what you are looking for?
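Both patterns also behave the same way through the DataFrame API in the test above. One situation where the \d form does fail, and an assumption worth checking given the JDBC setup, is when the pattern is embedded in a SQL string: in Spark 2.x (with the default spark.sql.parser.escapedStringLiterals=false) the backslash in a SQL string literal is itself an escape character, so '\d+' reaches the regex engine as 'd+'. A minimal sketch of the difference, reusing sample_df from above and assuming Spark 2.x:
# Hypothetical illustration (not from the original post): the same extraction
# written as Spark SQL. The single backslash is consumed by the SQL string
# literal, so the group never matches; doubling it restores the digit class.
sample_df.createOrReplaceTempView("sample")
spark.sql(r"""
    SELECT regexp_extract(name, 'ABCDF1_(\d+)', 1)    AS single_backslash,
           regexp_extract(name, 'ABCDF1_(\\d+)', 1)   AS double_backslash,
           regexp_extract(name, 'ABCDF1_([0-9]+)', 1) AS char_class
    FROM sample
""").show()
# single_backslash comes back empty; the other two return 123.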

Related

PySpark UDF issues when referencing outside of function

I am facing the issue that I get the error
TypeError: cannot pickle '_thread.RLock' object
when I try to apply the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),
('Mike','Williams','M',77),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
def find_n_people_with_higher_age(x):
    return df_2.filter(df_2['age'] >= x).count()
find_n_people_with_higher_age_udf = udf(find_n_people_with_higher_age, IntegerType())
df_1.select(find_n_people_with_higher_age_udf(col('category_id')))
Here's a good article on Python UDFs.
I used it as a reference because I suspected you were running into a serialization issue. I'm quoting the entire paragraph to add context, but really it's the serialization that's the issue.
Performance Considerations
It’s important to understand the performance implications of Apache
Spark’s UDF features. Python UDFs for example (such as our CTOF
function) result in data being serialized between the executor JVM and
the Python interpreter running the UDF logic – this significantly
reduces performance as compared to UDF implementations in Java or
Scala. Potential solutions to alleviate this serialization bottleneck
include:
If you consider what you are asking, maybe you'll see why this isn't working. You are asking for all of the data in your DataFrame (data_2) to be shipped (serialized) to an executor, which would then serialize it again and ship it to Python to be interpreted. DataFrames don't serialize, so that's your issue; but even if they did, you would be sending an entire DataFrame to every executor. Your sample data isn't a problem here, but with trillions of records it would blow up the JVM.
What you're asking is doable; I just need to figure out how to do it. Likely a window or a group by would do the trick.
Adding some additional data to make the example more interesting:
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
# add more data to make it more interesting.
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),('Gia','Rose','F',34),
('Mike','Williams','M',77), ('John','Williams','M',77), ('Bill','Williams','F',79),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
# dataframe to help fill in missing ages
ref = spark.range( 1, 110, 1).toDF("numbers").withColumn("count", lit(0)).withColumn("rolling_Count", lit(0))
countAges = df_2.groupby("age").count()
# this actually gives you the short list of ages (only the ages present in df_2)
rollingCounts = countAges.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#fill in missing ages and remove duplicates
filled = rollingCounts.union(ref).groupBy("age").agg(sum("count").alias("count"))
#add a rolling count across all ages
allAgeCounts = filled.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#do inner join because we've filled in all ages.
df_1.join(allAgeCounts, df_1.age == allAgeCounts.age, "inner").show()
+---------+--------+------+---+---+-----+-------------+
|firstname|lastname|gender|age|age|count|rolling_Count|
+---------+--------+------+---+---+-----+-------------+
| Anna| Rose| F| 41| 41| 0| 3|
| Robert|Williams| M| 62| 62| 0| 3|
| James| Smith| M| 30| 30| 0| 5|
+---------+--------+------+---+---+-----+-------------+
I wouldn't normally want to use a window over an entire table, but here the data it iterates over is at most 110 rows, so this is reasonable.
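For what it's worth, a window-free alternative is a left non-equi join followed by a group-by. This is only a sketch against the df_1/df_2 sample data above, not something I have benchmarked:
# Count, for each row of df_1, how many rows of df_2 have an age >= to it.
# The left join keeps df_1 rows with no match, and count() ignores the nulls.
from pyspark.sql.functions import col, count
result = (
    df_1.alias("a")
        .join(df_2.alias("b"), col("b.age") >= col("a.age"), "left")
        .groupBy(col("a.firstname"), col("a.lastname"), col("a.gender"), col("a.age"))
        .agg(count(col("b.age")).alias("n_people_with_higher_age"))
)
result.show()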

pyspark equivalent of postgres regexp_substr fails to extract value

I'm trying to adapt some Postgres SQL code I have to PySpark SQL. In the Postgres SQL I'm using the regexp_substr function to parse out ' .5G' if it shows up in a string in the productname column (I've included example code below). On the PySpark side I'm trying to use the regexp_extract function, but it's only returning null. I've compared the output of the regexp_replace function in Postgres with the PySpark one, and it returns the same value, so the issue must be in the regexp_extract function. I've created a sample input DataFrame, along with the PySpark code I'm currently running, below. Can someone please tell me what I'm doing wrong and suggest how to fix it? Thank you.
postgres:
select
regexp_substr(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]'))), ' .5G') as A
from df
output:
' .5G'
code:
# creating dummy data
df = sc.parallelize([('LEMON MERINGUE .5G CAKE SUGAR', )]).toDF(["productname"])
# turning dataframe into view
df.createOrReplaceTempView("df")
# example query trying to extract ' .5G'
testquery=("""select
regexp_extract('('+trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]','')))+')', ' .5G',1) as A
from df a
""")
# creating dataframe with extracted value in column
test_df=spark.sql(testquery)
test_df.show(truncate=False)
output:
+----+
|A |
+----+
|null|
+----+
You need to wrap ' .5G' in parentheses, not wrap the column in parentheses.
testquery = """
select
regexp_extract(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]',''))), '( .5G)', 1) as A
from df a
"""
test_df = spark.sql(testquery)
test_df.show(truncate=False)
+----+
|A |
+----+
| .5G|
+----+
Also note that you cannot concatenate strings with +; use concat for that purpose, as sketched below.
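For illustration only, a hedged sketch of that concat usage, assuming the same temp view df created above (the wrapping parentheses are not needed for the extraction itself):
# Hypothetical example: wrapping the cleaned product name in parentheses with
# concat instead of '+'. The column name wrapped_name is made up for this sketch.
wrapped = spark.sql("""
    select concat('(',
                  trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]', ''))),
                  ')') as wrapped_name
    from df a
""")
wrapped.show(truncate=False)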

How to reverse and combine string columns in a spark dataframe?

I am using PySpark version 2.4 and I am trying to write a UDF which should take the values of column id1 and column id2 together and return the reversed string.
For example, my data looks like:
+---+---+
|id1|id2|
+---+---+
| a|one|
| b|two|
+---+---+
the corresponding code is:
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
The returned value should be
+---+---+----+
|id1|id2| val|
+---+---+----+
| a|one|enoa|
| b|two|owtb|
+---+---+----+
My code is:
@udf(string)
def reverse_value(value):
    return value[::-1]
df.withColumn('val', reverse_value(lit('id1' + 'id2')))
My errors are:
TypeError: Invalid argument, not a string or column: <function
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.
Should be:
from pyspark.sql.functions import col, concat
df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))
Explanation:
lit is a literal while you want to refer to individual columns (col).
Columns have to be concatenated using concat function (Concatenate columns in Apache Spark DataFrame)
Additionally, it is not clear whether the argument of udf is correct. It should be either:
from pyspark.sql.functions import udf
@udf
def reverse_value(value):
    ...
or
#udf("string")
def reverse_value(value):
...
or
from pyspark.sql.types import StringType
@udf(StringType())
def reverse_value(value):
    ...
Additionally, the stack trace suggests that you have some other problems in your code, not reproducible with the snippet you've shared - reverse_value seems to return a function.
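Putting the pieces together, a minimal runnable sketch (same sample data as above, string return type assumed; this is the fix applied end to end, not the original poster's exact code):
from pyspark.sql.functions import udf, col, concat
@udf("string")
def reverse_value(value):
    # reverse the concatenated string
    return value[::-1]
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
df.withColumn('val', reverse_value(concat(col('id1'), col('id2')))).show()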
The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.
You will achieve much better performance by using pyspark.sql.functions.reverse:
from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling the describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
I considered a toy dataset I had, containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map{
case r : Row => (r.getAs[String]("summary"),r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
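For those on the Python API, roughly the same extraction can be sketched as follows (assuming a rawData DataFrame like the one above; this is a translation of the Scala snippet, not tested against the original data):
# Drop to the RDD of Rows, keep the 'stddev' row, and collect the value of
# the second column, mirroring the Scala snippet above.
stddev_tobacco = (
    rawData.describe().rdd
    .map(lambda r: (r['summary'], r[1]))
    .filter(lambda t: t[0] == 'stddev')
    .map(lambda t: t[1])
    .collect()
)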
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
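Along the same lines, a minimal sketch that pulls both the mean and the standard deviation in one pass without describe() (count is the numeric column from the question's output):
# selectExpr with aggregate functions; head() returns a single Row holding
# both statistics.
row = myDataFrame.selectExpr(
    "avg(count) as mean",
    "stddev(count) as stddev"
).head()
mean_value, stddev_value = row["mean"], row["stddev"]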
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0

Averaging over window function leads to StackOverflowError

I am trying to determine the average timespan between dates in a DataFrame column by using a window function. Materializing the DataFrame, however, throws a Java exception.
Consider the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions
from datetime import datetime
sc = SparkContext()
sq = HiveContext(sc)
data = [
[datetime(2014,1,1)],
[datetime(2014,2,1)],
[datetime(2014,3,1)],
[datetime(2014,3,6)],
[datetime(2014,8,23)],
[datetime(2014,10,1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')
w = Window.orderBy(ts)
diff = functions.datediff(
    ts,
    functions.lag(ts, count=1).over(w)
)
avg_diff = functions.avg(diff)
While df.select(diff.alias('diff')).show() correctly renders as
+----+
|diff|
+----+
|null|
| 31|
| 28|
| 5|
| 170|
| 39|
+----+
doing df.select(avg_diff).show() gives a java.lang.StackOverflowError.
Am I wrong to assume that this should work? And if so, what am I doing wrong and what could I do instead?
I am using the Python API on Spark 1.6.
When I do df2 = df.select(diff.alias('diff')) and then do
df2.select(functions.avg('diff'))
there's no error. Unfortunately that is not an option in my current setup.
It looks like a bug in Catalyst, but chaining methods should work just fine:
df.select(diff.alias('diff')).agg(functions.avg('diff'))
Nevertheless, I would be careful here. Window functions shouldn't be used to perform global operations (without a PARTITION BY clause): they move all data to a single partition and perform a sequential scan. Using RDDs could be a better choice here; a window-free aggregation is also sketched below.
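If you want to avoid the global window entirely, note that the consecutive differences telescope: their sum is max(ts) - min(ts), so their average is that span divided by (row count - 1). A minimal sketch with the df defined above (same result, no window):
# Window-free aggregation: (max - min) / (n - 1) equals the average of the
# consecutive date differences computed above.
avg_diff_df = df.select(
    (functions.datediff(functions.max('ts'), functions.min('ts'))
     / (functions.count('ts') - 1)).alias('avg_diff')
)
avg_diff_df.show()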

Resources