PySpark 3.1.2 - where are functions such as day, date, month - apache-spark

The Spark Built-in Functions documentation lists functions such as day, date, and month. However, they are not available in PySpark. Why is this?
from pyspark.sql.functions import (
    day,
    date,
    month
)
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Input In [162], in <module>
----> 1 from pyspark.sql.functions import (
2 day,
3 date,
4 month
5 )
ImportError: cannot import name 'day' from 'pyspark.sql.functions' (/opt/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/sql/functions.py)
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_312
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

Replace day with dayofmonth and date with to_date; month is available under the same name.
from pyspark.sql.functions import dayofmonth, to_date, month, col

dateDF = spark.createDataFrame(["06-03-2009", "07-24-2009"], "string")
dateDF.select(
    col("value"),
    to_date(col("value"), "MM-dd-yyyy")).show()
+----------+--------------------------+
| value|to_date(value, MM-dd-yyyy)|
+----------+--------------------------+
|06-03-2009| 2009-06-03|
|07-24-2009| 2009-07-24|
+----------+--------------------------+
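For completeness, a short sketch (reusing the same dateDF) that applies dayofmonth and month to the parsed date:
from pyspark.sql.functions import dayofmonth, month, to_date, col

parsed = dateDF.withColumn("date", to_date(col("value"), "MM-dd-yyyy"))
parsed.select(
    col("date"),
    dayofmonth(col("date")).alias("day"),   # 3 and 24 for the two sample rows
    month(col("date")).alias("month")       # 6 and 7 for the two sample rows
).show()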

Related

Apache Spark with Delta Lake: DataFrame.show() is not responding

I created a Spark cluster environment and I am facing a freezing issue when I try to show a DataFrame:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName('Contabilidade > Conta Cosif') \
    .config("spark.jars", "/home/dir/op-cast-lramos/.ivy2/jars/io.delta_delta-core_2.12-2.2.0.jar,../drivers/zstd-jni-1.5.2-1.jar") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.debug.maxToStringFields", 1000) \
    .master('spark://server:7077')

spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.format("delta").save("/datalake/workspace/storage/dag002")
df = spark.read.format("delta").load("/datalake/workspace/storage/dag002")
df.show()  # ==> at this point the code freezes...
My environment:
Red Hat Linux 4.18.0-425.3.1.el8.x86_64
Python 3.7.11
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 12
delta lake 2.12-2.2.0
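One way to narrow this down (a sketch only, not a confirmed fix: it assumes the hang comes from the cluster rather than from Delta) is to rerun the same write/read in local mode; if it completes there, the spark://server:7077 cluster is likely not granting executors to the application:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Same session as above, but in local mode so no external executors are required.
builder = SparkSession.builder \
    .appName('Contabilidade > Conta Cosif') \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .master('local[*]')

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
spark.read.format("delta").load("/tmp/delta_smoke_test").show()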

Unable to use pyspark udf

I am trying to format the string in one of the columns using a PySpark UDF.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()

session.udf.register("format_str", format_column)
df.select("id", format_column(df.artists)).show(truncate=False)

# schema = t.StructType(
#     [
#         t.StructField("artists", t.ArrayType(t.StringType()), True),
#         t.StructField("id", t.StringType(), True)
#     ]
# )
df.show(truncate=False)
The UDF is still not complete, but with this error I am not able to move further. When I run the above code I get the error below:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
df.select("id",format_column(df.artists)).show(truncate=False)
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what is wrong with the code.
You get this error because you are calling the Python function format_column instead of the registered UDF format_str.
You should be using:
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, the way you registered the UDF, you can only use it in Spark SQL, not with the DataFrame API. If you want to use it within the DataFrame API you should define the UDF like this:
from pyspark.sql.types import StringType

format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using annotation syntax:
@F.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()

df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.
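For instance, a minimal sketch with the built-in upper, which avoids the Python UDF overhead entirely:
df.select("id", F.upper(df.artists).alias("artists_upper")).show(truncate=False)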
Well, I see that you are using a predefined Spark function in the definition of a UDF, which is acceptable since you said you are starting with some examples. Your error means that there is no method called upper for a column; however, you can correct that error using this definition:
@f.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
For example:
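# Hypothetical usage of the decorated UDF, assuming the same df as above:
df.select("id", format_column(df.artists)).show(truncate=False)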

Format float to currency using PySpark and Babel

I'd like to convert a float to a currency using Babel and PySpark.
Sample data:
amount currency
2129.9 RON
1700 EUR
1268 GBP
741.2 USD
142.08091153 EUR
4.7E7 USD
0 GBP
I tried:
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), F.col('currency'),locale='be_BE'))
or
df = df.withColumn(F.col('amount'), format_currency(F.col('amount'), 'EUR',locale='be_BE'))
They both give me an error:
To use Python libraries with Spark dataframes, you need to use a UDF:
from babel.numbers import format_currency
import pyspark.sql.functions as F

format_currency_udf = F.udf(lambda a, c: format_currency(a, c))
df2 = df.withColumn(
    'amount',
    format_currency_udf('amount', 'currency')
)
df2.show()
+----------------+--------+
| amount|currency|
+----------------+--------+
| RON2,129.90| RON|
| €1,700.00| EUR|
| £1,268.00| GBP|
| US$741.20| USD|
| €142.08| EUR|
|US$47,000,000.00| USD|
+----------------+--------+
There seems to be a problem in pre-processing the amount column of your dataframe. From the error it is evident that the value, after conversion to string, is not purely numeric (which it has to be according to this table) and has some additional characters as well. You can check this column to find and remove the unnecessary characters. As an example:
>>> import decimal
>>> value = '10.0'
>>> value = decimal.Decimal(str(value))
>>> value
Decimal('10.0')
>>> value = '10.0e'
>>> value = decimal.Decimal(str(value))
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
value = decimal.Decimal(str(value))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>] # as '10.0e' is not just numeric
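A minimal sketch of that kind of clean-up before the value reaches the UDF (assuming the stray characters are anything outside a plain numeric literal):
import pyspark.sql.functions as F

# Hypothetical clean-up: keep only digits, sign, decimal point and exponent marker.
df_clean = df.withColumn(
    'amount',
    F.regexp_replace(F.col('amount'), r'[^0-9eE.+-]', '').cast('double')
)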

Pyspark 2.4.3, Read Avro format message from Kafka - Pyspark Structured streaming

I am trying to read Avro messages from Kafka using PySpark 2.4.3. Based on the Stack Overflow link below, I am able to convert into Avro format (to_avro) and that code works as expected, but from_avro is not working and I get the issue below. Are there any other modules that support reading Avro messages streamed from Kafka? This is a Cloudera distribution environment.
Please advise.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Environment Details :
Spark :
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.1.1.2.6.1.0-129
/_/
Using Python version 3.6.1 (default, Jul 24 2019 04:52:09)
Pyspark :
pyspark 2.4.3
Spark_submit :
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
to_avro
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def from_avro(col, jsonFormatSchema):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

def to_avro(col):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
    return Column(f(_to_java_column(col)))
from pyspark.sql.functions import col, struct

avro_type_struct = """
{
  "type": "record",
  "name": "struct",
  "fields": [
    {"name": "col1", "type": "long"},
    {"name": "col2", "type": "string"}
  ]
}"""

df = spark.range(10).select(struct(
    col("id"),
    col("id").cast("string").alias("id2")
).alias("struct"))
avro_struct_df = df.select(to_avro(col("struct")).alias("avro"))
avro_struct_df.show(3)
+----------+
| avro|
+----------+
|[00 02 30]|
|[02 02 31]|
|[04 02 32]|
+----------+
only showing top 3 rows
from_avro:
avro_struct_df.select(from_avro("avro", avro_type_struct)).show(3)
Error Message :
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/dataframe.py", line 993, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.select.
: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:66)
at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
Spark 2.4.0 supports the to_avro and from_avro functions, but only for Scala and Java, so your approach should be fine as long as you use the appropriate Spark version and spark-avro package.
There is an alternative that I prefer when using Spark Structured Streaming to consume Kafka messages: a UDF with the fastavro Python library. fastavro is relatively fast as it uses a C extension. I have been using it in production for months without any issues.
As shown in the code snippet below, the main Kafka message is carried in the value column of kafka_df. For demonstration purposes, I use a simple Avro schema with 2 columns, col1 and col2. The deserialize_avro UDF returns a tuple matching the fields described in the Avro schema. The stream is then written out to the console for debugging purposes.
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
from pyspark.sql.types import *
import io
import fastavro

def deserialize_avro(serialized_msg):
    bytes_io = io.BytesIO(serialized_msg)
    bytes_io.seek(0)
    avro_schema = {
        "type": "record",
        "name": "struct",
        "fields": [
            {"name": "col1", "type": "long"},
            {"name": "col2", "type": "string"}
        ]
    }
    deserialized_msg = fastavro.schemaless_reader(bytes_io, avro_schema)
    return (deserialized_msg["col1"],
            deserialized_msg["col2"])

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("consume kafka message") \
        .getOrCreate()

    kafka_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka01-broker:9092") \
        .option("subscribe", "topic_name") \
        .option("stopGracefullyOnShutdown", "true") \
        .load()

    df_schema = StructType([
        StructField("col1", LongType(), True),
        StructField("col2", StringType(), True)
    ])

    avro_deserialize_udf = psf.udf(deserialize_avro, returnType=df_schema)
    parsed_df = kafka_df.withColumn("avro", avro_deserialize_udf(psf.col("value"))).select("avro.*")

    query = parsed_df.writeStream.format("console").option("truncate", "true").start()
    query.awaitTermination()
Your Spark version is actually 2.1.1, therefore you cannot use the 2.4.3 version of the spark-avro package that ships with Spark.
You'll need to use the one from Databricks.
Are there any other modules that support reading avro messages streamed from Kafka?
You could use a plain Kafka Python library instead of Spark.
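A minimal sketch of that route (assuming the kafka-python package, the same schemaless Avro payload, and the broker/topic names from the streaming example above):
import io
import fastavro
from kafka import KafkaConsumer  # pip install kafka-python

avro_schema = {
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}

consumer = KafkaConsumer("topic_name", bootstrap_servers="kafka01-broker:9092")
for msg in consumer:
    record = fastavro.schemaless_reader(io.BytesIO(msg.value), avro_schema)
    print(record["col1"], record["col2"])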

Spark: cast bytearray to bigint

Trying to cast a Kafka key (binary/bytearray) to long/bigint using PySpark and Spark SQL results in a data type mismatch: cannot cast binary to bigint.
Environment details:
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.0.cloudera2
/_/
Using Python version 3.6.8 (default, Dec 30 2018 01:22:34)
SparkSession available as 'spark'.
Test case:
from pyspark.sql.types import StructType, StructField, BinaryType
df1_schema = StructType([StructField("key", BinaryType())])
df1_value = [[bytearray([0, 6, 199, 95, 77, 184, 55, 169])]]
df1 = spark.createDataFrame(df1_value,schema=df1_schema)
df1.printSchema()
#root
# |-- key: binary (nullable = true)
df1.show(truncate=False)
#+-------------------------+
#|key |
#+-------------------------+
#|[00 06 C7 5F 4D B8 37 A9]|
#+-------------------------+
df1.selectExpr('cast(key as bigint)').show(truncate=False)
Error:
(...) File "/app/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;
(...)
pyspark.sql.utils.AnalysisException: "cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;\n'Project [unresolvedalias(cast(key#0 as bigint), None)]\n+- AnalysisBarrier\n +- LogicalRDD [key#0], false\n"
But my expected result would be 1908062000002985, e.g.:
dfpd = df1.toPandas()
int.from_bytes(dfpd['key'].values[0], byteorder='big')
#1908062000002985
Use pyspark.sql.functions.hex and pyspark.sql.functions.conv:
from pyspark.sql.functions import col, conv, hex
df1.withColumn("num", conv(hex(col("key")), 16, 10).cast("bigint")).show(truncate=False)
#+-------------------------+----------------+
#|key |num |
#+-------------------------+----------------+
#|[00 06 C7 5F 4D B8 37 A9]|1908062000002985|
#+-------------------------+----------------+
The cast("bigint") is only required if you want the result to be a long because conv returns a StringType().
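If you prefer to stay closer to the int.from_bytes approach from the question, a sketch using a Python UDF (slower than the built-in hex/conv route, but it avoids the string round-trip):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Hypothetical UDF mirroring the pandas/int.from_bytes snippet above.
bytes_to_long = udf(lambda b: int.from_bytes(b, byteorder='big') if b is not None else None, LongType())
df1.withColumn("num", bytes_to_long(col("key"))).show(truncate=False)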
