Apache Spark with Delta Lake: DataFrame.show() is not responding - apache-spark

I have set up a Spark cluster environment and I am facing a freezing issue when I try to show a DataFrame:
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName('Contabilidade > Conta Cosif') \
    .config("spark.jars", "/home/dir/op-cast-lramos/.ivy2/jars/io.delta_delta-core_2.12-2.2.0.jar,../drivers/zstd-jni-1.5.2-1.jar") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.debug.maxToStringFields", 1000) \
    .master('spark://server:7077')

spark = configure_spark_with_delta_pip(builder).getOrCreate()

data = spark.range(0, 5)
data.write.format("delta").save("/datalake/workspace/storage/dag002")

df = spark.read.format("delta").load("/datalake/workspace/storage/dag002")
df.show()  # <-- this is where the call freezes
My environment:
Red Hat Linux 4.18.0-425.3.1.el8.x86_64
Python 3.7.11
Spark version 3.3.1 (Scala 2.12.15, OpenJDK 64-Bit Server VM, 12)
Delta Lake: delta-core_2.12-2.2.0
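One hedged way to narrow this down is to rerun the same snippet against a local master; if df.show() returns there, the hang most likely comes from the standalone cluster (for example no executors registered with the application, or workers unable to reach the driver) rather than from Delta Lake itself. A minimal sketch, assuming the delta-spark pip package is installed:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Same Delta configuration, but a local master instead of spark://server:7077.
builder = SparkSession.builder \
    .appName('delta-local-check') \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .master('local[*]')

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.read.format("delta").load("/datalake/workspace/storage/dag002").show()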

Related

PySpark 3.1.2 - where are functions such as day, date, month

The Spark Built-in Functions documentation lists functions such as day, date, and month.
However, they are not available in PySpark. Why is this?
from pyspark.sql.functions import (
    day,
    date,
    month
)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [162], in <module>
----> 1 from pyspark.sql.functions import (
      2     day,
      3     date,
      4     month
      5 )

ImportError: cannot import name 'day' from 'pyspark.sql.functions' (/opt/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/sql/functions.py)
$ spark-submit --version
Welcome to Spark version 3.1.2
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_312
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.
Replace day with dayofmonth and date with to_date; month stays the same.
from pyspark.sql.functions import dayofmonth, to_date, month, col

dateDF = spark.createDataFrame(["06-03-2009", "07-24-2009"], "string")
dateDF.select(
    col("value"),
    to_date(col("value"), "MM-dd-yyyy")).show()
+----------+--------------------------+
| value|to_date(value, MM-dd-yyyy)|
+----------+--------------------------+
|06-03-2009| 2009-06-03|
|07-24-2009| 2009-07-24|
+----------+--------------------------+
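For the day and month parts of the question, a short sketch (assuming the same dateDF) using dayofmonth and month on the parsed date:

from pyspark.sql.functions import dayofmonth, month, to_date, col

# Parse the string first, then extract the day-of-month and month numbers.
parsed = dateDF.select(to_date(col("value"), "MM-dd-yyyy").alias("d"))
parsed.select(
    col("d"),
    dayofmonth(col("d")).alias("day"),
    month(col("d")).alias("month")
).show()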

Read data from Kafka and print to console with Spark Structured Streaming in Python

I have kafka_2.13-2.7.0 on Ubuntu 20.04. I run the Kafka server and ZooKeeper, then create a topic and send a text file to it via nc -lk 9999. The topic is full of data. I also have spark-3.0.1-bin-hadoop2.7 on my system. I want to use the Kafka topic as a source for Spark Structured Streaming with Python. My code is like this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("APP") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sparktest") \
    .option("startingOffsets", "earliest") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
df.printSchema()
I run the above code via spark-submit with this command:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 /home/spark/PycharmProjects/testSparkStream/KafkaToSpark.py
The code runs without any exception and I receive this output, as shown on the Spark site:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
My question is this: the Kafka topic is full of data, but running the code produces no data at all in the output. Could you please tell me what is wrong here?
The code as it is will not print out any data; it only prints the schema once.
You can follow the instructions given in the general Structured Streaming Guide and the Structured Streaming + Kafka Integration Guide to see how to print data to the console. Remember that reading data in Spark is a lazy operation and nothing happens without an action (typically a writeStream operation).
If you complete the code as below, you should see the selected data (key and value) printed to the console:
spark = SparkSession \
    .builder \
    .appName("APP") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sparktest") \
    .option("startingOffsets", "earliest") \
    .load()

query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .start()

query.awaitTermination()

PySpark 2.4.3, read Avro format messages from Kafka - PySpark Structured Streaming

I am trying to read Avro messages from Kafka using PySpark 2.4.3. Based on the Stack Overflow link below, I am able to convert into Avro format (to_avro) and that code works as expected, but from_avro is not working and I get the issue below. Are there any other modules that support reading Avro messages streamed from Kafka? This is a Cloudera distribution environment.
Please advise.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Environment details:
Spark: version 2.1.1.2.6.1.0-129
Using Python version 3.6.1 (default, Jul 24 2019 04:52:09)
PySpark: 2.4.3
spark-submit:
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
to_avro:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def from_avro(col, jsonFormatSchema):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

def to_avro(col):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
    return Column(f(_to_java_column(col)))
from pyspark.sql.functions import col, struct

avro_type_struct = """
{
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}"""

df = spark.range(10).select(struct(
    col("id"),
    col("id").cast("string").alias("id2")
).alias("struct"))
avro_struct_df = df.select(to_avro(col("struct")).alias("avro"))
avro_struct_df.show(3)
avro_struct_df.show(3)
+----------+
| avro|
+----------+
|[00 02 30]|
|[02 02 31]|
|[04 02 32]|
+----------+
only showing top 3 rows
from_avro:
avro_struct_df.select(from_avro("avro", avro_type_struct)).show(3)
Error Message :
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/dataframe.py", line 993, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.select.
: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:66)
at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
Spark 2.4.0 supports the to_avro and from_avro functions, but only for Scala and Java, so your approach should be fine as long as you use a matching Spark version and spark-avro package.
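A hedged side note: from Spark 3.0 onward PySpark exposes from_avro and to_avro natively in pyspark.sql.avro.functions (the spark-avro package still has to be supplied via --packages), so on newer clusters the JVM wrapper above is not needed. A minimal sketch reusing avro_struct_df and avro_type_struct from the question:

# Spark 3.0+ only; assumes the matching spark-avro package is on the classpath.
from pyspark.sql.avro.functions import from_avro

decoded_df = avro_struct_df.select(from_avro("avro", avro_type_struct).alias("struct"))
decoded_df.show(3)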
An alternative that I prefer when using Spark Structured Streaming to consume Kafka messages is a UDF with the fastavro Python library. fastavro is relatively fast, as it uses a C extension, and I have used it in production for months without any issues.
As shown in the code snippet below, the main Kafka message is carried in the value column of kafka_df. For demonstration purposes, I use a simple Avro schema with two columns, col1 and col2. The deserialize_avro UDF returns a tuple matching the fields described in the Avro schema. The stream is then written out to the console for debugging purposes.
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
from pyspark.sql.types import *
import io
import fastavro

def deserialize_avro(serialized_msg):
    bytes_io = io.BytesIO(serialized_msg)
    bytes_io.seek(0)
    avro_schema = {
        "type": "record",
        "name": "struct",
        "fields": [
            {"name": "col1", "type": "long"},
            {"name": "col2", "type": "string"}
        ]
    }
    deserialized_msg = fastavro.schemaless_reader(bytes_io, avro_schema)
    return (deserialized_msg["col1"],
            deserialized_msg["col2"])

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("consume kafka message") \
        .getOrCreate()

    kafka_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka01-broker:9092") \
        .option("subscribe", "topic_name") \
        .option("stopGracefullyOnShutdown", "true") \
        .load()

    df_schema = StructType([
        StructField("col1", LongType(), True),
        StructField("col2", StringType(), True)
    ])

    avro_deserialize_udf = psf.udf(deserialize_avro, returnType=df_schema)
    parsed_df = kafka_df.withColumn("avro", avro_deserialize_udf(psf.col("value"))).select("avro.*")

    query = parsed_df.writeStream.format("console").option("truncate", "true").start()
    query.awaitTermination()
Your Spark version is actually 2.1.1, so you cannot use the 2.4.3 version of the spark-avro package that is included in Spark.
You'll need to use the one from Databricks.
Are there any other modules that support reading avro messages streamed from Kafka?
You could use a plain Kafka Python library instead of Spark.
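A minimal sketch of that route, assuming the kafka-python package, a local broker, and the same two-column schema used above:

import io
import fastavro
from kafka import KafkaConsumer  # assumes the kafka-python package is installed

avro_schema = {
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}

# Plain consumer loop, no Spark involved: each message value is decoded with fastavro.
consumer = KafkaConsumer("topic_name", bootstrap_servers="localhost:9092")
for msg in consumer:
    record = fastavro.schemaless_reader(io.BytesIO(msg.value), avro_schema)
    print(record["col1"], record["col2"])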

Spark: cast bytearray to bigint

Trying to cast a Kafka key (binary/bytearray) to long/bigint using PySpark and Spark SQL results in: data type mismatch: cannot cast binary to bigint.
Environment details:
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Welcome to Spark version 2.3.0.cloudera2
Using Python version 3.6.8 (default, Dec 30 2018 01:22:34)
SparkSession available as 'spark'.
Test case:
from pyspark.sql.types import StructType, StructField, BinaryType
df1_schema = StructType([StructField("key", BinaryType())])
df1_value = [[bytearray([0, 6, 199, 95, 77, 184, 55, 169])]]
df1 = spark.createDataFrame(df1_value,schema=df1_schema)
df1.printSchema()
#root
# |-- key: binary (nullable = true)
df1.show(truncate=False)
#+-------------------------+
#|key |
#+-------------------------+
#|[00 06 C7 5F 4D B8 37 A9]|
#+-------------------------+
df1.selectExpr('cast(key as bigint)').show(truncate=False)
Error:
(...) File "/app/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;
(...)
pyspark.sql.utils.AnalysisException: "cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;\n'Project [unresolvedalias(cast(key#0 as bigint), None)]\n+- AnalysisBarrier\n +- LogicalRDD [key#0], false\n"
But my expected result would be 1908062000002985, e.g.:
dfpd = df1.toPandas()
int.from_bytes(dfpd['key'].values[0], byteorder='big')
#1908062000002985
Use pyspark.sql.functions.hex and pyspark.sql.functions.conv:
from pyspark.sql.functions import col, conv, hex
df1.withColumn("num", conv(hex(col("key")), 16, 10).cast("bigint")).show(truncate=False)
#+-------------------------+----------------+
#|key |num |
#+-------------------------+----------------+
#|[00 06 C7 5F 4D B8 37 A9]|1908062000002985|
#+-------------------------+----------------+
The cast("bigint") is only required if you want the result to be a long because conv returns a StringType().

Spark Structured Streaming window issue

I have a problem regarding windows in Spark Structured Streaming. I want to group the data I am receiving continuously from a Kafka source into a sliding window and count the number of records. The issue is that writeStream emits the window DataFrame each time data arrives and updates the count of the current window.
I'm using the following code to create the window:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, from_json, struct, to_json, window
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema of the topic to be consumed
jsonSchema = StructType([
    StructField("State", StringType(), True),
    StructField("Value", StringType(), True),
    StructField("SourceTimestamp", StringType(), True),
    StructField("Tag", StringType(), True)
])

spark = SparkSession \
    .builder \
    .appName("StructuredStreaming") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
    .load() \
    .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
    .select("data.*")

df = df.withColumn("time", current_timestamp())

Window = df \
    .withColumn("window", window("time", "4 seconds", "1 seconds")).groupBy("window").count() \
    .withColumn("time", current_timestamp())

# Write back to Kafka
query = Window.select(to_json(struct("count", "window", "time")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .outputMode("update") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .start()
The windows are not sorted and are updated each time the count changes. How can we wait for the end of the window and stream the final count only once? Instead of this output:
{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}
I would like this:
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
The expected output would wait for the window to be closed, based on a comparison between the window end timestamp and the current time.
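One common way to get this behaviour, sketched under the assumption that the df defined above is reused and that processing-time semantics are acceptable: add a watermark on the time column and write with append output mode, so each window is emitted exactly once after the watermark passes its end.

from pyspark.sql.functions import window, to_json, struct

# Hypothetical variant of the aggregation above: the watermark tells Spark when a
# window can be considered closed, and append mode emits each closed window once.
windowed = df \
    .withWatermark("time", "5 seconds") \
    .groupBy(window("time", "4 seconds", "1 seconds")).count()

# Note: changing the query usually requires a fresh checkpoint directory.
query = windowed.select(to_json(struct("count", "window")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .outputMode("append") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .start()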
