How can I base64-encode a shadeproto.ByteString in Scala? - apache-spark

In my Spark repo I use the following sbt setting; in my understanding, every com.google.protobuf.ByteString reference is replaced by shadeproto.ByteString. My question is: how can I import the shadeproto.ByteString type and write a function that base64-encodes such a ByteString?
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll,
  ShadeRule.rename("scala.collection.compat.**" -> "scalacompat.@1").inAll,
  ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
)
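As for the base64 encoding itself, here is a minimal sketch. It assumes the shaded class keeps the standard protobuf ByteString API (toByteArray, copyFrom) and that you compile against an already-shaded jar; if you compile against the unshaded dependency, import com.google.protobuf.ByteString instead and let sbt-assembly rename it in the assembled jar. The helper names are illustrative:
import java.util.Base64
// shaded name; use com.google.protobuf.ByteString if the shading only happens at assembly time
import shadeproto.ByteString

// base64-encode the raw bytes held by the ByteString
def toBase64(bs: ByteString): String =
  Base64.getEncoder.encodeToString(bs.toByteArray)

// and back again, if needed
def fromBase64(s: String): ByteString =
  ByteString.copyFrom(Base64.getDecoder.decode(s))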

Related

Convert String expression to actual working instance expression

I am trying to convert an expression in Scala that is saved in a database as a String back into working code.
I have tried the reflection ToolBox, Groovy, etc., but I can't seem to achieve what I require.
Here's what I tried:
import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val toolbox = currentMirror.mkToolBox()
val code1 = q"""StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true))"""
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
I need to use the sType instance to pass a custom schema when reading a CSV file into a DataFrame, but it seems to fail.
Is there any way I can convert the string expression of the StructType into an actual StructType instance? Any help would be appreciated.
If StructType is from Spark and you just want to convert a String to a StructType, you don't need reflection. You can try this:
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}
import scala.util.Try
def fromString(raw: String): StructType =
  Try(DataType.fromJson(raw)).getOrElse(LegacyTypeStringParser.parse(raw)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing: $raw")
  }
val code1 =
"""StructType(Array(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true)))"""
fromString(code1) // res0: org.apache.spark.sql.types.StructType
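For the use case in the question, the parsed schema can then be passed to the CSV reader (the path is illustrative):
val customSchema = fromString(code1)
val df = spark.read.schema(customSchema).csv("/path/to/file.csv")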
The code is taken from the org.apache.spark.sql.types.StructType companion object in Spark. You cannot use it directly as it's in a private package. Moreover, it uses LegacyTypeStringParser, so I'm not sure if this is good enough for production code.
Your code inside quasiquotes needs to be valid Scala syntax, so you need to put quotes around strings. You'd also need to provide all the necessary imports. This works:
val toolbox = currentMirror.mkToolBox()
val code1 =
  q"""
    //we need to import all sql types
    import org.apache.spark.sql.types._
    StructType(
      //StructType needs a list
      List(
        //name arguments need to be in proper quotes
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("tstamp", TimestampType, true),
        StructField("date", DateType, true)
      )
    )
  """
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
println(sType)
But maybe instead of trying to recompile the code, you should consider other alternatives, such as serializing the struct type somehow (perhaps to JSON?).
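If you go the JSON route, Spark's DataType already has a JSON representation you can round-trip, which avoids reflection and string parsing entirely. A minimal sketch:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("tstamp", TimestampType, nullable = true),
  StructField("date", DateType, nullable = true)
))

// serialize to a JSON string that can be stored in a database...
val asJson: String = schema.json

// ...and restore it later with the public API
val restored = DataType.fromJson(asJson).asInstanceOf[StructType]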

Serialize an avro object to string in python

In Python 3.7, I want to encode an Avro object to a string.
I found examples converting to a byte array, but not to a string.
Code to convert to byte array:
def serialize(mapper, schema):
    bytes_writer = io.BytesIO()
    encoder = avro.io.BinaryEncoder(bytes_writer)
    writer1 = avro.io.DatumWriter(schema)
    writer1.write(mapper, encoder)
    return bytes_writer.getvalue()
mapper is a dictionary which will populate the avro object.
io provides StringIO, which I assume would need to be used instead of BytesIO, but then which encoder should be used with it? How do we serialize this?
If, for example, a is your Avro object, you can use Avro's a.to_json() method and then json.dumps(a).

How to use azure-sqldb-spark connector in pyspark

I want to write around 10 GB of data every day to an Azure SQL Server DB using PySpark. Currently I am using the JDBC driver, which takes hours making insert statements one by one.
I am planning to use azure-sqldb-spark connector which claims to turbo boost the write using bulk insert.
I went through the official doc: https://github.com/Azure/azure-sqldb-spark.
The library is written in Scala and basically requires the use of two Scala classes:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val bulkCopyConfig = Config(Map(
  "url" -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "user" -> "username",
  "password" -> "*********",
  "dbTable" -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout" -> "600"
))
df.bulkCopyToSqlDB(bulkCopyConfig)
Can it be used in PySpark like this (using sc._jvm)?
Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect= sc._jvm.com.microsoft.azure.sqldb.spark.connect._
//all config
df.connect.bulkCopyToSqlDB(bulkCopyConfig)
I am not an expert in Python. Can anybody help me with a complete snippet to get this done?
The Spark connector currently (as of March 2019) only supports the Scala API (as documented here).
So if you are working in a notebook, you could do all the preprocessing in Python, then register the DataFrame as a temp table, e.g.:
df.createOrReplaceTempView('testbulk')
and have to do the final step in Scala:
%scala
//configs...
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)

Spark decode and decompress gzip an embedded base 64 string

My Spark program reads a file that contains a gzip-compressed string that is base64-encoded. I have to decode and decompress it.
I used Spark's unbase64 to decode it and generated a byte array:
bytedf = df.withColumn("unbase", unbase64(col("value")))
Is there any method available in Spark that decompresses the resulting bytes?
I wrote a UDF:
def decompress(ip):
    bytecode = base64.b64decode(ip)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')
decompress = udf(decompress)
decompressedDF = df.withColumn("decompressed_XML",decompress("value"))
I had a similar case; in my case, I do this:
from pyspark.sql.functions import col,unbase64,udf
from gzip import decompress
bytedf=df1.withColumn("unbase",unbase64(col("payload")))
decompress_func = lambda x: decompress(x).decode('utf-8')
udf_decompress = udf(decompress_func)
df2 = bytedf.withColumn('unbase_decompress', udf_decompress('unbase'))
Spark example using base64:
import base64
.
.
#decode base 64 string using map operation or you may create udf.
df.map(lambda base64string: base64.b64decode(base64string), <string encoder>)
Read here for a detailed Python example.
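Since the parent question is about Scala, here is the same approach as a Scala sketch: unbase64 handles the decoding and a small UDF wrapping GZIPInputStream handles the decompression. Column names are assumed to match the question:
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.sql.functions.{col, udf, unbase64}
import scala.io.Source

// gunzip a byte array and return the decoded UTF-8 text
val gunzip = udf { (bytes: Array[Byte]) =>
  Source.fromInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)), "UTF-8").mkString
}

val decompressedDF = df.withColumn("decompressed_xml", gunzip(unbase64(col("value"))))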

Transforming PySpark RDD with Scala

TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though.
I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized.
My question would be: how to get Java strings out of the DStream object?
Here is the simplest Python code I came up with:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))
from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(ssc, ["IN"], {"metadata.broker.list": "localhost:9092"})
values = stream.map(lambda tuple: tuple[1])
ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)
ssc.start()
I'm running this code in PySpark, passing it the path to my JAR:
pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar
On the Scala side, I have:
package com.seigneurin
import org.apache.spark.streaming.api.java.JavaDStream
object MyPythonHelper {
  def doSomething(jdstream: JavaDStream[String]) = {
    val dstream = jdstream.dstream
    dstream.foreachRDD(rdd => {
      rdd.foreach(println)
    })
  }
}
Now, let's say I send some data into Kafka:
echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic IN
The println statement in the Scala code prints something that looks like:
[B@758aa4d9
I expected to get foo bar instead.
Now, if I replace the simple println statement in the Scala code with the following:
rdd.foreach(v => println(v.getClass.getCanonicalName))
I get:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
This suggests that the strings are actually passed as arrays of bytes.
If I simply try to convert this array of bytes into a string (I know I'm not even specifying the encoding):
def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
  val dstream = jdstream.dstream
  dstream.foreachRDD(rdd => {
    rdd.foreach(bytes => println(new String(bytes)))
  })
}
I get something that looks like (special characters might be stripped off):
�]qXfoo barqa.
This suggests the Python string was serialized (pickled?). How could I retrieve a proper Java string instead?
Long story short there is no supported way to do something like this. Don't try this in production. You've been warned.
In general, Spark doesn't use Py4j for anything other than some basic RPC calls on the driver, and it doesn't start a Py4j gateway on any other machine. When it is required (mostly for MLlib and some parts of SQL), Spark uses Pyrolite to serialize objects passed between the JVM and Python.
This part of the API is either private (Scala) or internal (Python) and as such not intended for general usage. Theoretically you can access it anyway, either per batch:
package dummy
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.sql.DataFrame
object PythonRDDHelper {
  def go(rdd: JavaRDD[Any]) = {
    rdd.rdd.collect {
      case s: String => s
    }.take(5).foreach(println)
  }
}
or for the complete stream:
object PythonDStreamHelper {
  def go(stream: JavaDStream[Any]) = {
    stream.dstream.transform(_.collect {
      case s: String => s
    }).print
  }
}
or exposing individual batches as DataFrames (probably the least evil option):
object PythonDataFrameHelper {
  def go(df: DataFrame) = {
    df.show
  }
}
and use these wrappers as follows:
from pyspark.streaming import StreamingContext
from pyspark.mllib.common import _to_java_object_rdd
from pyspark.rdd import RDD
ssc = StreamingContext(spark.sparkContext, 10)
spark.catalog.listTables()
q = ssc.queueStream([sc.parallelize(["foo", "bar"]) for _ in range(10)])
# Reserialize RDD as Java RDD<Object> and pass
# to Scala sink (only for output)
q.foreachRDD(lambda rdd: ssc._jvm.dummy.PythonRDDHelper.go(
    _to_java_object_rdd(rdd)
))

# Reserialize and convert to JavaDStream<Object>
# This is the only option which allows further transformations
# on DStream
ssc._jvm.dummy.PythonDStreamHelper.go(
    q.transform(lambda rdd: RDD(  # Reserialize but keep as Python RDD
        _to_java_object_rdd(rdd), ssc.sparkContext
    ))._jdstream
)

# Convert to DataFrame and pass to Scala sink.
# Arguably there are relatively few moving parts here.
q.foreachRDD(lambda rdd:
    ssc._jvm.dummy.PythonDataFrameHelper.go(
        rdd.map(lambda x: (x, )).toDF()._jdf
    )
)
ssc.start()
ssc.awaitTerminationOrTimeout(30)
ssc.stop()
This is not supported and untested, and as such it is rather useless for anything other than experiments with the Spark API.
