Spark sampling options in JSON reader ignored? - apache-spark

In the following two examples, the number of tasks run and the corresponding run time imply that the sampling options have no effect, since they are similar to jobs run without any sampling options:
val df = spark.read.option("samplingRatio", 0.001).json("s3a://test/*.json.bz2")
val df = spark.read.option("sampleSize", 100).json("s3a://test/*.json.bz2")
I know that explicit schemas are best for performance, but in convenience cases sampling is useful.
I'm new to Spark; am I using these options incorrectly? I attempted the same approach in PySpark, with the same results:
df = spark.read.options(samplingRatio=0.1).json("s3a://test/*.json.bz2")
df = spark.read.options(samplingRatio=None).json("s3a://test/*.json.bz2")

TL;DR None of the options you use will have a significant impact on the execution time:
sampleSize is not among the valid JSONOptions or JSONOptionsInRead, so it will be ignored.
samplingRatio is a valid option, but internally it uses PartitionwiseSampledRDD, so the process is linear in terms of the number of records. Therefore sampling can only reduce the inference cost, not the IO, which is likely the bottleneck here.
Setting samplingRatio to None is equivalent to no sampling. PySpark's OptionUtils simply discards None options, and samplingRatio defaults to 1.0.
You can try to sample the data explicitly. In Python:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

def infer_json_schema(path: str, sample_size: int, **kwargs: str) -> StructType:
    spark = SparkSession.builder.getOrCreate()
    # Read raw lines, keep only the first sample_size records, and flatten Rows to strings
    sample = spark.read.text(path).limit(sample_size).rdd.flatMap(lambda x: x)
    return spark.read.options(**kwargs).json(sample).schema
In Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

def inferJsonSchema(
    path: String, sampleSize: Int, options: Map[String, String]): StructType = {
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._  // needed for .as[String]
  // Read raw lines, keep only the first sampleSize records, and infer the schema from them
  val sample = spark.read.text(path).limit(sampleSize).as[String]
  spark.read.options(options).json(sample).schema
}
Please keep in mind that, to work well, the sample size should be at most equal to the expected size of a partition. Limits in Spark escalate quickly (see for example my answer to Spark count vs take and length), and you can easily end up scanning the whole input.
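As a usage sketch (the path, sample size, and options below are placeholders), the inferred schema can then be passed back to the reader so the full scan skips inference entirely:
// Infer the schema from a small sample, then reuse it for the full read
val schema = inferJsonSchema("s3a://test/*.json.bz2", 1000, Map.empty[String, String])
val df = spark.read.schema(schema).json("s3a://test/*.json.bz2")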

Related

Why is Pandas-API-on-Spark's apply on groups way slower than the PySpark API?

I'm getting strange performance results when comparing the two APIs in PySpark 3.2.1 that provide the ability to run a pandas UDF on the grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - a new flavor of apply introduced in Pandas-API-on-Spark, AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")

spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet('/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and, while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet('/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group could be done much more efficiently with no pandas involvement, but this is just a small minimal example. Any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown would be? Maybe I'm missing some context parameter that would make these execute in a similar time?

Impala vs SparkSQL: built-in function translation: fnv_hash

I am using fnv_hash in Impala to translate some string values into numbers. Now I am migrating to Spark SQL; is there a similar function in Spark SQL that I can use? An almost 1-to-1 function mapping a string value to a number should work. Thanks!
Unfortunately Spark doesn't provide a direct replacement. The built-in o.a.s.sql.functions.hash / pyspark.sql.functions.hash uses MurmurHash 3, which should have comparable properties for the same hash size, but Spark uses 32-bit hashes (compared to the 64-bit fnv_hash in Impala). If this is acceptable, just import hash and you're good to go:
from pyspark.sql.functions import hash as hash_
df = sc.parallelize([("foo", ), ("bar", )]).toDF(["foo"])
df.select(hash_("foo"))
DataFrame[hash(foo): int]
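For reference, the Scala side is analogous, using the built-in hash from org.apache.spark.sql.functions (a brief sketch, assuming a DataFrame df with a string column foo):
import org.apache.spark.sql.functions.{col, hash}

df.select(hash(col("foo")))
// org.apache.spark.sql.DataFrame = [hash(foo): int]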
If you need a larger hash you can take a look at XXH64. It is not directly exposed as a SQL function, but the Catalyst expression is public, so all you need is a simple wrapper. Roughly something like this:
package com.example.spark.sql

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.XxHash64

object functions {
  def xxhash64(cols: Column*): Column = new Column(
    new XxHash64(cols.map(_.expr))
  )
}
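On the Scala side, this wrapper could then be used like this (a brief usage sketch, assuming a DataFrame df with a string column foo):
import com.example.spark.sql.functions.xxhash64
import org.apache.spark.sql.functions.col

df.select(xxhash64(col("foo")))
// returns a 64-bit (bigint) hash column, as in the PySpark example below
A corresponding PySpark wrapper simply delegates to the same JVM function through Py4j: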
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def xxhash64(*cols):
    sc = SparkContext._active_spark_context
    jc = sc._jvm.com.example.spark.sql.functions.xxhash64(
        _to_seq(sc, cols, _to_java_column)
    )
    return Column(jc)
df.select(xxhash64("foo"))
DataFrame[xxHash(foo): bigint]
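As a side note, newer Spark releases (3.0+, as far as I am aware) expose xxhash64 directly in org.apache.spark.sql.functions and pyspark.sql.functions, so the custom wrapper above is only needed on older versions. A minimal sketch in Scala:
import org.apache.spark.sql.functions.{col, xxhash64}

df.select(xxhash64(col("foo")))
// a 64-bit (bigint) hash column, equivalent to the wrapper above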

Transforming PySpark RDD with Scala

TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though.
I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not converted into Java strings but, instead, are serialized.
My question would be: how to get Java strings out of the DStream object?
Here is the simplest Python code I came up with:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sparkContext=sc, batchDuration=int(1))
from pyspark.streaming.kafka import KafkaUtils
stream = KafkaUtils.createDirectStream(ssc, ["IN"], {"metadata.broker.list": "localhost:9092"})
values = stream.map(lambda tuple: tuple[1])
ssc._jvm.com.seigneurin.MyPythonHelper.doSomething(values._jdstream)
ssc.start()
I'm running this code in PySpark, passing it the path to my JAR:
pyspark --driver-class-path ~/path/to/my/lib-0.1.1-SNAPSHOT.jar
On the Scala side, I have:
package com.seigneurin

import org.apache.spark.streaming.api.java.JavaDStream

object MyPythonHelper {
  def doSomething(jdstream: JavaDStream[String]) = {
    val dstream = jdstream.dstream
    dstream.foreachRDD(rdd => {
      rdd.foreach(println)
    })
  }
}
Now, let's say I send some data into Kafka:
echo 'foo bar' | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic IN
The println statement in the Scala code prints something that looks like:
[B@758aa4d9
I expected to get foo bar instead.
Now, if I replace the simple println statement in the Scala code with the following:
rdd.foreach(v => println(v.getClass.getCanonicalName))
I get:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
This suggests that the strings are actually passed as arrays of bytes.
If I simply try to convert this array of bytes into a string (I know I'm not even specifying the encoding):
def doSomething(jdstream: JavaDStream[Array[Byte]]) = {
  val dstream = jdstream.dstream
  dstream.foreachRDD(rdd => {
    rdd.foreach(bytes => println(new String(bytes)))
  })
}
I get something that looks like (special characters might be stripped off):
�]qXfoo barqa.
This suggests the Python string was serialized (pickled?). How could I retrieve a proper Java string instead?
Long story short, there is no supported way to do something like this. Don't try this in production. You've been warned.
In general, Spark doesn't use Py4j for anything other than some basic RPC calls on the driver, and it doesn't start a Py4j gateway on any other machine. When it is required (mostly in MLlib and some parts of SQL), Spark uses Pyrolite to serialize objects passed between the JVM and Python.
This part of the API is either private (Scala) or internal (Python), and as such it is not intended for general usage. You can theoretically access it anyway, either per batch:
package dummy

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.sql.DataFrame

object PythonRDDHelper {
  def go(rdd: JavaRDD[Any]) = {
    rdd.rdd.collect {
      case s: String => s
    }.take(5).foreach(println)
  }
}
or for the complete stream:
object PythonDStreamHelper {
  def go(stream: JavaDStream[Any]) = {
    stream.dstream.transform(_.collect {
      case s: String => s
    }).print
  }
}
or exposing individual batches as DataFrames (probably the least evil option):
object PythonDataFrameHelper {
  def go(df: DataFrame) = {
    df.show
  }
}
and use these wrappers as follows:
from pyspark.streaming import StreamingContext
from pyspark.mllib.common import _to_java_object_rdd
from pyspark.rdd import RDD

ssc = StreamingContext(spark.sparkContext, 10)
spark.catalog.listTables()

q = ssc.queueStream([sc.parallelize(["foo", "bar"]) for _ in range(10)])

# Reserialize RDD as Java RDD<Object> and pass
# to Scala sink (only for output)
q.foreachRDD(lambda rdd: ssc._jvm.dummy.PythonRDDHelper.go(
    _to_java_object_rdd(rdd)
))

# Reserialize and convert to JavaDStream<Object>
# This is the only option which allows further transformations
# on the DStream
ssc._jvm.dummy.PythonDStreamHelper.go(
    q.transform(lambda rdd: RDD(  # Reserialize but keep as Python RDD
        _to_java_object_rdd(rdd), ssc.sparkContext
    ))._jdstream
)

# Convert to DataFrame and pass to Scala sink.
# Arguably there are relatively few moving parts here.
q.foreachRDD(lambda rdd:
    ssc._jvm.dummy.PythonDataFrameHelper.go(
        rdd.map(lambda x: (x, )).toDF()._jdf
    )
)

ssc.start()
ssc.awaitTerminationOrTimeout(30)
ssc.stop()
This is not supported and untested, and as such it is rather useless for anything other than experiments with the Spark API.

Need some inputs in feature extraction in Apache Spark

I am new to Apache Spark and we are trying to use the MLlib utility to do some analysis. I collated some code to convert my data into features and then apply a linear regression algorithm to it. I am facing some issues, so please help, and excuse me if it's a silly question.
My person data looks like
1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45
Just a simple example to get the code working
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

case class Person(rating: String, income: Double, age: Int)

val persondata = sc.textFile("D:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))

def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
  val maxIncome = people.map(_.income).max
  val maxAge = people.map(_.age).max
  people.map(p =>
    Vectors.dense(
      if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}

def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
  (0d to 1 by (1d / features.length)).zip(features).map(l => LabeledPoint(l._1, l._2))
--- It's working till here.
--- It breaks in the code below:
val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
       val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
                                                                            ^
Please advise
You seem to be going in roughly the right direction, but there are a few minor problems. First off, you are trying to reference a value (people) that you haven't defined. More generally, you seem to be writing your code to work with sequences, when instead you should modify it to work with RDDs (or DataFrames). You also seem to be using parallelize to try to parallelize your operation, but parallelize is a helper method for taking a local collection and making it available as a distributed RDD. I'd recommend looking at the programming guides or some additional documentation to get a better understanding of the Spark APIs. Best of luck with your adventures with Spark.
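For illustration, here is a rough sketch of how the feature preparation could instead be done directly on the persondata RDD defined in the question, keeping everything distributed; the evenly spaced 0-to-1 label mirrors the original prepareFeaturesWithLabels and is only a placeholder:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Normalization constants computed with RDD actions instead of Seq methods
val maxIncome = persondata.map(_.income).max()
val maxAge = persondata.map(_.age).max()
val n = persondata.count()

// Build feature vectors on the distributed RDD and attach an evenly spaced 0..1 label
val data: RDD[LabeledPoint] = persondata.map { p =>
  Vectors.dense(
    if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
    p.income / maxIncome,
    p.age.toDouble / maxAge)
}.zipWithIndex.map { case (v, i) => LabeledPoint(i.toDouble / n, v) }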

Performing operations only on subset of a RDD

I would like to perform some transformations only on a subset of an RDD (to make experimenting in the REPL faster).
Is it possible?
RDD has a take(num: Int): Array[T] method; I think I'd need something similar, but returning an RDD[T].
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
...
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
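In versions where the seed parameter has a default, the same sample can be taken as simply (a small sketch, same ~1% fraction as above):
// Seed defaults to a random value when omitted
val sample = data.sample(withReplacement = false, fraction = 0.01)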
RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).
If you want to get similarly sized RDDs regardless of the input size, you can truncate the items in each of your partitions; this way you can better control the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism.
An example from spark-shell:
import org.apache.spark.rdd.RDD

val numberOfPartitions = 1000

val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))

val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

millionRdd.count          // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions
Apparently it's possible to create an RDD subset by first using its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
This approach seems dodgy to me though. Is there a nicer way?
I always use the parallelize function of SparkContext to distribute from an Array[T], but it seems makeRDD does the same thing. Both of them are correct.
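A minimal sketch of that take-then-redistribute approach, using the data RDD from the first answer; note that take collects the subset to the driver, so this only makes sense for small subsets:
// Collect a small subset to the driver, then redistribute it as a new RDD
val subsetArray = data.take(1000)
val viaParallelize = sc.parallelize(subsetArray)
val viaMakeRDD = sc.makeRDD(subsetArray)  // effectively the same as parallelize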
