pyspark equivalent of postgres regexp_substr fails to extract value

I'm trying to adapt some Postgres SQL code I have to PySpark SQL. In the Postgres SQL I use the regexp_substr function to parse out ' .5G' if it shows up in a string in the productname column (example code below). On the PySpark side I'm trying to use the regexp_extract function, but it only returns null. I've compared the output of the regexp_replace step in Postgres to the PySpark version and it returns the same value, so the issue must be in the regexp_extract call. I've included a sample input dataframe along with the PySpark code I'm currently running below. Can someone tell me what I'm doing wrong and suggest how to fix it? Thank you.
postgres:
select
regexp_substr(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]', ''))), ' .5G') as A
from df
output:
' .5G'
code:
# creating dummy data
df = sc.parallelize([('LEMON MERINGUE .5G CAKE SUGAR', )]).toDF(["productname"])
# turning dataframe into view
df.createOrReplaceTempView("df")
# example query trying to extract ' .5G'
testquery=("""select
regexp_extract('('+trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]','')))+')', ' .5G',1) as A
from df a
""")
# creating dataframe with extracted value in column
test_df=spark.sql(testquery)
test_df.show(truncate=False)
output:
+----+
|A |
+----+
|null|
+----+

You need to wrap ' .5G' in parentheses (to form a capture group), not wrap the column in parentheses.
testquery = """
select
regexp_extract(trim(upper(regexp_replace(a.productname, '[,/#!$%^&*;:{}=_`~()-]',''))), '( .5G)', 1) as A
from df a
"""
test_df = spark.sql(testquery)
test_df.show(truncate=False)
+----+
|A |
+----+
| .5G|
+----+
Also note that you cannot concatenate strings with +; use concat for that purpose.
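If you do want the value wrapped in literal parentheses, a minimal sketch using concat (column and table names reused from the question, so treat it as illustrative only):
spark.sql("""
select concat('(', trim(upper(regexp_replace(productname, '[,/#!$%^&*;:{}=_`~()-]', ''))), ')') as A
from df
""").show(truncate=False)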

Related

Processing a list of json strings in Spark Streaming

I'm trying to transform the input I get with Spark Streaming in order to create a dataframe out of it. Basically I receive a list of JSON strings from which I want to extract the data.
Note: I reduced the JSON strings to just the coord objects, which should be sufficient for the general concept.
The input I get:
["{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}", "{\"coord\":{"lon\":10.1169,\"lat\":53.6522}}", "{\"coord\":...."]
The dataframe I want to create in order to save it to a database:
+----------+----------+
|lon |lat |
+----------+----------+
| 10.0217| 53.5281|
| 10.1169| 53.6522|
| ... | ... |
+----------+----------+
So far I have managed to replace the escaped quotes, which leaves me with an array of strings.
I tried to flatten the array:
result = df \
.selectExpr("Cast(value AS STRING) as json") \
.withColumn("json", f.regexp_replace('json', '\\\\"', '"')) \
.withColumn("json", f.flatten(f.col("json"))) \
.select("json")
Error:
pyspark.sql.utils.AnalysisException: cannot resolve 'flatten(json)'
due to data type mismatch: The argument should be an array of arrays,
but 'json' is of string type.;;
Then I tried to load the array with json.loads, but I was not able to call this function from Spark streaming.
So how do I extract the data from this input?
With the array provided
arr = [
"{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}",
"{\"coord\":{\"lon\":10.1169,\"lat\":53.6522}}",
]
You can get the desired result with the following code
from pyspark.sql import functions

# df is assumed to hold one JSON object per row in a column named "value",
# with the escaped quotes already cleaned up; the lookbehinds pull the
# number that follows lon": and lat":
df = (df.withColumn("lon", functions.regexp_extract("value", r'(?<=lon":)[0-9]+\.[0-9]+', 0))
        .withColumn("lat", functions.regexp_extract("value", r'(?<=lat":)[0-9]+\.[0-9]+', 0)))
df = df.select(df["lon"], df["lat"])
df.show()
+-------+-------+
| lon| lat|
+-------+-------+
|10.0217|53.5281|
|10.1169|53.6522|
+-------+-------+
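As a hedged alternative (not part of the original answer): once the escaped quotes have been cleaned up, each JSON string can also be parsed with from_json and an explicit schema rather than regular expressions. A minimal sketch, assuming one JSON object per row in a column named value:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, DoubleType

# schema for the reduced payload: {"coord": {"lon": ..., "lat": ...}}
schema = StructType([
    StructField("coord", StructType([
        StructField("lon", DoubleType()),
        StructField("lat", DoubleType()),
    ]))
])

result = (df
    .withColumn("json", f.regexp_replace(f.col("value").cast("string"), '\\\\"', '"'))
    .withColumn("parsed", f.from_json("json", schema))
    .select(f.col("parsed.coord.lon").alias("lon"),
            f.col("parsed.coord.lat").alias("lat")))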

Problem in reading string NULL values from BigQuery

Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue I am facing is that null string values are not read properly by Spark from BQ. It reads the null string values, but in the CSV it writes them as empty strings wrapped in double quotes (i.e. like this "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
.option('table', <bq_dataset> + <bq_table>) \
.load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql(
'SELECT * FROM bqdf')
# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/' , mode='overwrite', sep= '|')
I have tried the emptyValue='' and nullValue options with df.write.csv() while writing to CSV, but it doesn't work.
I need a solution for this problem; if anyone else has faced this issue and found a fix, I'd appreciate the help. Thanks!
I was able to reproduce your case and found a solution that worked with a sample table I created in BigQuery. The sample data has two rows: the name 'Robert' with age 25, and a null name with age 23.
According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:
nullValue – sets the string representation of a null value. If None is
set, it uses the default value, empty string.
Which is what you are looking for. I then implemented the nullValue option as shown below.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Read the data from BigQuery as a Spark Dataframe.
data = spark.read.format("bigquery").option(
    "table", "dataset.table").load()
# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")
# Select required data into another df
data_view2 = spark.sql(
    'SELECT * FROM data_view')
# Write to GCS with nullValue='' so nulls are not written as ""
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')
data_view2.show()
Notice that I have used data_view2.show() to print out the view in order to check if it was correctly read. The output was:
+------+---+
|name |age|
+------+---+
|Robert| 25|
|null | 23|
+------+---+
Therefore, the null value was read correctly. In addition, I also checked the .csv file:
name,age
Robert,25
,23
As you can see, the null value is written correctly and not as an empty string between double quotes. Finally, as a last check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted accurately.
Note: I ran the PySpark job from the Dataproc jobs console on a previously created Dataproc cluster. The cluster was in the same location as the BigQuery dataset.
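For reference, a hedged sketch of the write call with both CSV options set explicitly (placeholder path; the emptyValue option requires Spark 2.4+). The two options control different things: nullValue is the string written for null cells, while emptyValue is the string written for genuinely empty strings.
data_view2.write.csv(
    'gs://bucket/folder',   # placeholder output path
    header=True,
    sep='|',
    mode='overwrite',
    nullValue='',           # string written for null cells
    emptyValue='""'         # string written for empty-string cells
)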

Spark SQL not recognizing \d+

I am trying to use the regexp_extract function to get the last three digits in the string ABCDF1_123 with:
regexp_extract('ABCDF1_123', 'ABCDF1_(\d+)', 1)
and it does not capture the group. If I change the function call to:
regexp_extract('ABCDF1_123', 'ABCDF1_([0-9]+)', 1)
it works. Can anyone give me some insight into why? I am also grabbing the data from a Postgres database using a JDBC connection.
I ran regexp_extract with both patterns and they give the same output, as shown below:
from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract

# build a one-row DataFrame holding the sample string
l = ['ABCDF1_123']
rdd = sc.parallelize(l)
sample = rdd.map(lambda x: Row(name=x))
sample_df = sqlContext.createDataFrame(sample)

not_working = r'ABCDF1_(\d+)'
working = r'ABCDF1_([0-9]+)'
sample_df.select(regexp_extract('name', not_working, 1).alias('not_working'),
                 regexp_extract('name', working, 1).alias('working')).show(10)
+-----------+-------+
|not_working|working|
+-----------+-------+
| 123| 123|
+-----------+-------+
Is this what you are looking for?
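One hedged guess about the original difference, since both patterns clearly work through the DataFrame API: if the pattern is passed inside a Spark SQL string literal (for example in a spark.sql(...) query against the JDBC-loaded table), the default parser in Spark 2.0+ consumes the backslash, so '\d+' reaches the regex engine as 'd+'. Doubling the backslash (or setting spark.sql.parser.escapedStringLiterals=true) avoids that:
# in raw Spark SQL the backslash has to be doubled
spark.sql(r"select regexp_extract('ABCDF1_123', 'ABCDF1_(\\d+)', 1) as digits").show()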

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would like now to get the standard deviation and the mean from statsDF, but when I am trying to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
I tried this on a toy dataset I had containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map{
case r : Row => (r.getAs[String]("summary"),r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr(), which also runs through the optimizer, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0
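If only the mean and standard deviation are needed, they can also be computed directly with agg, skipping describe() and its string-typed output entirely. A minimal PySpark sketch (the column name 'uniform' is reused from the example above):
from pyspark.sql import functions as F

stats = df.agg(F.mean('uniform').alias('mean'),
               F.stddev('uniform').alias('stddev'))
mean_val, stddev_val = stats.first()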

Averaging over window function leads to StackOverflowError

I am trying to determine the average timespan between dates in a DataFrame column by using a window function. Materializing the DataFrame, however, throws a Java exception.
Consider the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions
from datetime import datetime
sc = SparkContext()
sq = HiveContext(sc)
data = [
[datetime(2014,1,1)],
[datetime(2014,2,1)],
[datetime(2014,3,1)],
[datetime(2014,3,6)],
[datetime(2014,8,23)],
[datetime(2014,10,1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')
w = Window.orderBy(ts)
diff = functions.datediff(
ts,
functions.lag(ts, count=1).over(w)
)
avg_diff = functions.avg(diff)
While df.select(diff.alias('diff')).show() correctly renders as
+----+
|diff|
+----+
|null|
| 31|
| 28|
| 5|
| 170|
| 39|
+----+
doing df.select(avg_diff).show() gives a java.lang.StackOverflowError.
Am I wrong to assume that this should work? And if so, what am I doing wrong and what could I do instead?
I am using the Python API on Spark 1.6
When I do df2 = df.select(diff.alias('diff')) and then do
df2.select(functions.avg('diff'))
there's no error. Unfortunately that is not an option in my current setup.
It looks like a bug in Catalyst. Chaining the methods should work just fine:
df.select(diff.alias('diff')).agg(functions.avg('diff'))
Nevertheless I would be careful here. Window functions shouldn't be used to perform global (without PARTITION BY clause) operations. These move all data to a single partition and perform a sequential scan. Using RDDs could be a better choice here.
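A further hedged alternative, not from the original answer: because consecutive differences telescope, their average is simply datediff(max(ts), min(ts)) / (count - 1), which needs neither a window nor an RDD. A sketch with the example data (273 days over 5 gaps gives 54.6):
from pyspark.sql import functions

# the gaps sum to max(ts) - min(ts), so the mean gap is span / (n - 1)
stats = df.agg(functions.max('ts').alias('max_ts'),
               functions.min('ts').alias('min_ts'),
               functions.count('ts').alias('n'))
stats.select(
    (functions.datediff('max_ts', 'min_ts') / (functions.col('n') - 1)).alias('avg_diff')
).show()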
