Trying to cast a Kafka key (binary/bytearray) to long/bigint using PySpark and Spark SQL results in a data type mismatch: cannot cast binary to bigint
Environment details:
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0.cloudera2
/_/
Using Python version 3.6.8 (default, Dec 30 2018 01:22:34)
SparkSession available as 'spark'.
Test case:
from pyspark.sql.types import StructType, StructField, BinaryType
df1_schema = StructType([StructField("key", BinaryType())])
df1_value = [[bytearray([0, 6, 199, 95, 77, 184, 55, 169])]]
df1 = spark.createDataFrame(df1_value,schema=df1_schema)
df1.printSchema()
#root
# |-- key: binary (nullable = true)
df1.show(truncate=False)
#+-------------------------+
#|key |
#+-------------------------+
#|[00 06 C7 5F 4D B8 37 A9]|
#+-------------------------+
df1.selectExpr('cast(key as bigint)').show(truncate=False)
Error:
(...) File "/app/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;
(...)
pyspark.sql.utils.AnalysisException: "cannot resolve 'CAST(`key` AS BIGINT)' due to data type mismatch: cannot cast binary to bigint; line 1 pos 0;\n'Project [unresolvedalias(cast(key#0 as bigint), None)]\n+- AnalysisBarrier\n +- LogicalRDD [key#0], false\n"
But my expected result would be 1908062000002985, as computed with pandas:
dfpd = df1.toPandas()
int.from_bytes(dfpd['key'].values[0], byteorder='big')
#1908062000002985
Use pyspark.sql.functions.hex and pyspark.sql.functions.conv:
from pyspark.sql.functions import col, conv, hex
df1.withColumn("num", conv(hex(col("key")), 16, 10).cast("bigint")).show(truncate=False)
#+-------------------------+----------------+
#|key |num |
#+-------------------------+----------------+
#|[00 06 C7 5F 4D B8 37 A9]|1908062000002985|
#+-------------------------+----------------+
The cast("bigint") is only required if you want the result to be a long, since conv returns a StringType().
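If you prefer to stay in SQL, the same hex/conv combination also works inside selectExpr (a minimal sketch, not part of the original answer, reusing the df1 defined above):
df1.selectExpr("cast(conv(hex(key), 16, 10) as bigint) AS num").show(truncate=False)
#+----------------+
#|num             |
#+----------------+
#|1908062000002985|
#+----------------+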
Related
I created a Spark cluster environment and I am facing a freezing issue when I try to show a DataFrame:
builder = SparkSession.builder \
.appName('Contabilidade > Conta Cosif') \
.config("spark.jars", "/home/dir/op-cast-lramos/.ivy2/jars/io.delta_delta-core_2.12-2.2.0.jar,../drivers/zstd-jni-1.5.2-1.jar") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
.config("spark.sql.debug.maxToStringFields", 1000)\
.master('spark://server:7077')
spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.format("delta").save("/datalake/workspace/storage/dag002")
df= spark.read.format("delta").load("/datalake/workspace/storage/dag002")
df.show() ==> this is the part of the code where I am facing the freeze...
My environment:
Red Hat Linux 4.18.0-425.3.1.el8.x86_64
Python 3.7.11
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 12
Delta Lake 2.2.0 (Scala 2.12)
update:
I think I finally understood the issue: since Databricks initializes the Spark session itself, the RayDP Spark session never really gets set up.
So, is there a way to make the RayDP context available in the Spark session?
pre-update:
I am currently running Spark on Databricks and have set up Ray on it (head node only).
It seems to work; however, when I try to transfer data from Spark to Ray datasets, I run into an issue:
TypeError Traceback (most recent call last)
<command-2445755691838> in <module>
5 memory_per_executor = "500M"
6 # spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)
----> 7 dataset = ray.data.from_spark(df)
/databricks/python/lib/python3.7/site-packages/ray/data/read_api.py in from_spark(df, parallelism)
1046 import raydp
1047
-> 1048 return raydp.spark.spark_dataframe_to_ray_dataset(df, parallelism)
1049
1050
/databricks/python/lib/python3.7/site-packages/raydp/spark/dataset.py in spark_dataframe_to_ray_dataset(df, parallelism, _use_owner)
176 if parallelism != num_part:
177 df = df.repartition(parallelism)
--> 178 blocks, _ = _save_spark_df_to_object_store(df, False, _use_owner)
179 return from_arrow_refs(blocks)
180
/databricks/python/lib/python3.7/site-packages/raydp/spark/dataset.py in _save_spark_df_to_object_store(df, use_batch, _use_owner)
150 jvm = df.sql_ctx.sparkSession.sparkContext._jvm
151 jdf = df._jdf
--> 152 object_store_writer = jvm.org.apache.spark.sql.raydp.ObjectStoreWriter(jdf)
153 obj_holder_name = df.sql_ctx.sparkSession.sparkContext.appName + RAYDP_OBJ_HOLDER_SUFFIX
154 if _use_owner is True:
TypeError: 'JavaPackage' object is not callable
with the vanilla code:
# loading the data
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format("csv") \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load("dbfs:/databricks-datasets/nyctaxi/tripdata/green")
import ray
import sys
# disable stdout
sys.stdout.fileno = lambda: False
# connect to ray cluster on a single instance
ray.init()
ray.cluster_resources()
import raydp
dataset = ray.data.from_spark(df)
What am I doing wrong?
pyspark 3.0.1
ray 2.0.0
The Spark documentation on Built-in Functions lists functions such as day, date, and month. However, they are not available in PySpark. Why is this?
from pyspark.sql.functions import (
day,
date,
month
)
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Input In [162], in <module>
----> 1 from pyspark.sql.functions import (
2 day,
3 date,
4 month
5 )
ImportError: cannot import name 'day' from 'pyspark.sql.functions' (/opt/spark/spark-3.1.2/python/lib/pyspark.zip/pyspark/sql/functions.py)
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_312
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.
Replace day with dayofmonth and date with to_date; month is available under the same name.
from pyspark.sql.functions import dayofmonth, to_date, month, col
dateDF = spark.createDataFrame(["06-03-2009", "07-24-2009"], "string")
dateDF.select(
col("value"),
to_date(col("value"),"MM-dd-yyyy")).show()
+----------+--------------------------+
| value|to_date(value, MM-dd-yyyy)|
+----------+--------------------------+
|06-03-2009| 2009-06-03|
|07-24-2009| 2009-07-24|
+----------+--------------------------+
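For completeness, here is a short sketch (building on the dateDF example above, with the same imports) showing dayofmonth and month applied to the parsed date; the column aliases are only for readability:
dateDF.select(
    col("value"),
    dayofmonth(to_date(col("value"), "MM-dd-yyyy")).alias("day"),
    month(to_date(col("value"), "MM-dd-yyyy")).alias("month")
).show()
+----------+---+-----+
|     value|day|month|
+----------+---+-----+
|06-03-2009|  3|    6|
|07-24-2009| 24|    7|
+----------+---+-----+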
I'm trying to insert a hex string into a Cassandra table with a blob datatype column. The Cassandra table structure is as follows:
CREATE TABLE mob.sample (
id text PRIMARY KEY,
data blob
);
Here is my code:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.functions import udf
def hexstrtohexnum(hexstr):
ani = int(hexstr[2:],16)
return(ani)
# Create a DataFrame using SparkSession
spark = (SparkSession.builder
.appName('SampleLoader')
.appName('SparkCassandraApp')
.getOrCreate())
schema = StructType([StructField("id",StringType(),True),
StructField("data",StringType(),True)])
# Create a DataFrame
df = spark.createDataFrame([("key1", '0x546869732069732061206669727374207265636f7264'),
("key2", '0x546865207365636f6e64207265636f7264'),
("key3", '0x546865207468697264207265636f7264')],schema)
hexstr2hexnum = udf(lambda z: hexstrtohexnum(z),IntegerType())
spark.udf.register("hexstr2hexnum", hexstr2hexnum)
df.withColumn("data",hexstr2hexnum("data"))
df.write.format("org.apache.spark.sql.cassandra").options(keyspace='mob',table='sample').save(mode="append")
When I run the code above, it gives an error:
WARN 2020-09-03 19:41:57,902 org.apache.spark.scheduler.TaskSetManager: Lost task 3.0 in stage 17.0 (TID 441, 10.37.122.156, executor 2): com.datastax.spark.connector.types.TypeConversionException: Cannot convert object 0x546869732069732061206669727374207265636f7264 of type class java.lang.String to java.nio.ByteBuffer.
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:44)
at com.datastax.spark.connector.types.TypeConverter$ByteBufferConverter$$anonfun$convertPF$11.applyOrElse(TypeConverter.scala:258)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:42)
at com.datastax.spark.connector.types.TypeConverter$ByteBufferConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:255)
Here's the contents of the dataframe.
>>> df.show(3)
+----+--------------------+
| id| data|
+----+--------------------+
|key1|0x546869732069732...|
|key2|0x546865207365636...|
|key3|0x546865207468697...|
+----+--------------------+
Can someone tell me what's wrong with my code? Is there something I am missing?
Reading a test record, the blob column appears as BinaryType rather than StringType:
>>> table1 = spark.read.format("org.apache.spark.sql.cassandra").options(table="blobtest",keyspace="test").load()
>>> table1.show()
+----+--------------------+
| f1| f2|
+----+--------------------+
|1234|[54 68 69 73 20 6...|
+----+--------------------+
>>> print(table1.schema)
StructType(List(StructField(f1,StringType,false),StructField(f2,BinaryType,true)))
Change your schema to BinaryType and you should be able to write it:
>>> string = "This is a test."
>>> arr = bytearray(string, 'utf-8')
>>> schema = StructType([StructField("f1",StringType(),True),StructField("f2",BinaryType(),True)])
>>> df = spark.createDataFrame([("key3",arr)],schema)
>>> df.show()
+----+--------------------+
| f1| f2|
+----+--------------------+
|key3|[54 68 69 73 20 6...|
+----+--------------------+
>>> df.write.format("org.apache.spark.sql.cassandra").options(keyspace='test',table='blobtest2').save(mode="append")
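As a side note (not part of the original answer): if the column already holds a hex string like in the question, Spark's built-in unhex function can do the conversion to binary without a Python UDF. A sketch, assuming the df from the question and that the leading "0x" prefix needs to be stripped first:
from pyspark.sql.functions import col, regexp_replace, unhex
# Strip the "0x" prefix, then decode the hex string into bytes (BinaryType)
df_bin = df.withColumn("data", unhex(regexp_replace(col("data"), "^0x", "")))
df_bin.printSchema()  # data should now be binary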
I finally found a way to fix this problem.
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.functions import udf
import binascii
def hextobinary(hexstr):
a = binascii.unhexlify(hexstr[2:])
return(a)
# Create a DataFrame using SparkSession
spark = (SparkSession.builder
.appName('SampleLoader')
.appName('SparkCassandraApp')
.getOrCreate())
schema = StructType([StructField("id",StringType(),True),
StructField("data",StringType(),True)])
# Create a DataFrame
df = spark.createDataFrame([("key1", '0x546869732069732061206669727374207265636f7264'),
("key2", '0x546865207365636f6e64207265636f7264'),
("key3", '0x546865207468697264207265636f7264')],schema)
print(df)
tobinary = udf(lambda z: hextobinary(z),BinaryType())
spark.udf.register("tobinary", tobinary)
df1 = df.withColumn("data",tobinary("data"))
print(df1)
df1.write.format("org.apache.spark.sql.cassandra").options(keyspace='mob',table='sample').save(mode="append")
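To verify the write, you can read the table back the same way as the earlier test read; the data column should come back as BinaryType. A small sketch, assuming the same keyspace and table names as above:
check = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='mob', table='sample').load()
check.printSchema()  # data should show up as binary
check.show(truncate=False)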
I am trying to read Avro messages from Kafka using PySpark 2.4.3. Based on the Stack Overflow link below, I am able to convert into Avro format (to_avro) and that code works as expected, but from_avro is not working and fails with the issue below. Are there any other modules that support reading Avro messages streamed from Kafka? This is a Cloudera distribution environment.
Any suggestions would be appreciated.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Environment Details :
Spark :
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.1.2.6.1.0-129
/_/
Using Python version 3.6.1 (default, Jul 24 2019 04:52:09)
Pyspark :
pyspark 2.4.3
Spark_submit :
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
to_avro:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

# Python wrappers around the Scala to_avro/from_avro functions in the
# spark-avro package (they are not exposed to PySpark in 2.4.x).
def from_avro(col, jsonFormatSchema):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

def to_avro(col):
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
    return Column(f(_to_java_column(col)))
from pyspark.sql.functions import col, struct
avro_type_struct = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
df = spark.range(10).select(struct(
col("id"),
col("id").cast("string").alias("id2")
).alias("struct"))
avro_struct_df = df.select(to_avro(col("struct")).alias("avro"))
avro_struct_df.show(3)
+----------+
| avro|
+----------+
|[00 02 30]|
|[02 02 31]|
|[04 02 32]|
+----------+
only showing top 3 rows
from_avro:
avro_struct_df.select(from_avro("avro", avro_type_struct)).show(3)
Error Message :
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/dataframe.py", line 993, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.select.
: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
at org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:66)
at org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
Spark 2.4.0 supports the to_avro and from_avro functions, but only for Scala and Java, so your approach of calling them through the JVM should be fine as long as you use a matching Spark version and spark-avro package.
An alternative that I prefer when using Spark Structured Streaming to consume Kafka messages is a UDF with the fastavro Python library. fastavro is relatively fast since it uses a C extension, and I have used it in production for months without any issues.
As shown in the code snippet below, the main Kafka payload is carried in the value column of kafka_df. For demonstration purposes I use a simple Avro schema with two columns, col1 and col2. The deserialize_avro UDF returns a tuple whose elements correspond to the fields described in the Avro schema. The stream is then written out to the console for debugging purposes.
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
from pyspark.sql.types import *
import io
import fastavro
def deserialize_avro(serialized_msg):
bytes_io = io.BytesIO(serialized_msg)
bytes_io.seek(0)
avro_schema = {
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}
deserialized_msg = fastavro.schemaless_reader(bytes_io, avro_schema)
return ( deserialized_msg["col1"],
deserialized_msg["col2"]
)
if __name__=="__main__":
spark = SparkSession \
.builder \
.appName("consume kafka message") \
.getOrCreate()
kafka_df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka01-broker:9092") \
.option("subscribe", "topic_name") \
.option("stopGracefullyOnShutdown", "true") \
.load()
df_schema = StructType([
StructField("col1", LongType(), True),
StructField("col2", StringType(), True)
])
avro_deserialize_udf = psf.udf(deserialize_avro, returnType=df_schema)
parsed_df = kafka_df.withColumn("avro", avro_deserialize_udf(psf.col("value"))).select("avro.*")
query = parsed_df.writeStream.format("console").option("truncate", "true").start()
query.awaitTermination()
Your Spark version is actually 2.1.1 (see the banner above), so you cannot use the 2.4.3 version of the spark-avro package that ships with Spark 2.4.x.
You'll need to use the spark-avro package from Databricks instead.
Are there any other modules that support reading avro messages streamed from Kafka?
You could use a plain Kafka Python library instead of Spark.
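For example, a minimal sketch (not part of the original answer) using the kafka-python and fastavro libraries; the broker address, topic name, and Avro schema are placeholders taken from the snippets above:
import io
import fastavro
from kafka import KafkaConsumer

# Same demo schema as in the fastavro answer above
avro_schema = {
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}

consumer = KafkaConsumer("topic_name",
                         bootstrap_servers="kafka01-broker:9092")
for msg in consumer:
    # Decode each raw Kafka message body with the schemaless Avro reader
    record = fastavro.schemaless_reader(io.BytesIO(msg.value), avro_schema)
    print(record["col1"], record["col2"])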