Spark DataFrame losing string data in yarn-client mode - apache-spark

By some reason if I'm adding new column, appending string to existing data/column or creating new DataFrame from code, it misinterpreting string data, so show() doesn't work properly, filters (such as withColumn, where, when, etc.) doesn't work ether.
Here is example code:
object MissingValue {
def hex(str: String): String = str.getBytes("UTF-8").map(f => Integer.toHexString((f&0xFF)).toUpperCase).mkString("-")
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("MissingValue")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val list = List((101,"ABC"),(102,"BCD"),(103,"CDE"))
val rdd = sc.parallelize(list).map(f => Row(f._1,f._2))
val schema = StructType(StructField("COL1",IntegerType,true)::StructField("COL2",StringType,true)::Nil)
val df = sqlContext.createDataFrame(rdd,schema)
df.show()
val str = df.first().getString(1)
println(s"${str} == ${hex(str)}")
sc.stop()
}
}
If I run it in local mode then everything works as expected:
+----+----+
|COL1|COL2|
+----+----+
| 101| ABC|
| 102| BCD|
| 103| CDE|
+----+----+
ABC == 41-42-43
But when I run the same code in yarn-client mode it produces:
+----+----+
|COL1|COL2|
+----+----+
| 101| ^E^#^#|
| 102| ^E^#^#|
| 103| ^E^#^#|
+----+----+
^E^#^# == 5-0-0
This problem exists only for string values, so first column (Integer) is fine.
Also if I'm creating rdd from the dataframe then everything is fine i.e. df.rdd.take(1).apply(0).getString(1)
I'm using Spark 1.5.0 from CDH 5.5.2
EDIT:
It seems that this happens when the difference between driver memory and executor memory is too high --driver-memory xxG --executor-memory yyG i.e. when I decreasing executor memory or increasing driver memory then the problem disappears.

This is a bug related to executor memory and Oops size:
https://issues.apache.org/jira/browse/SPARK-9725
https://issues.apache.org/jira/browse/SPARK-10914
https://issues.apache.org/jira/browse/SPARK-17706
It is fixed in Spark version 1.5.2

Related

Why few partitions are processed twice if mapPartitions is used with toDF()

I need to process partition per partition (long story).
Using mapPartitions is working fine when using RDDs. In the example, when using rdd.mapPartitions(mapper).collect() all work as expected.
But, when transforming to DataFrame, one partition is processed twice.
Why this is happening and how to avoid it?
Following, the output of the next simple example. We can read how the function is executed 3 times, when there are only two partitions. One of the partitions [Row(id=1), Row(id=2)] is processed two times.
It is courious that one of the executions is ignored, as we can see in the DataDrame resulted.
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=3), Row(id=4)]
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
> Mapper executions: 3
Simple example used:
from typing import Iterator
from pyspark import Row
from pyspark.sql import SparkSession
def gen_random_row(id: str):
return Row(id=id)
if __name__ == '__main__':
spark = SparkSession.builder.master("local[1]").appName("looking for the error").getOrCreate()
executions_counter = spark.sparkContext.accumulator(0)
rdd = spark.sparkContext.parallelize([
gen_random_row(1),
gen_random_row(2),
gen_random_row(3),
gen_random_row(4),
], 2)
def mapper(iterator: Iterator[Row]) -> Iterator[Row]:
executions_counter.add(1)
lst = list(iterator)
print(f"size: {len(lst)} > values: {lst}")
for r in lst:
yield r
# rdd.mapPartitions(mapper).collect()
rdd.mapPartitions(mapper).toDF().show()
print(f"> Mapper executions: {executions_counter.value}")
spark.stop()
The solution is passing the schema to toDF
Looks like Spark is processing one partition to infer the schema.
To solve it:
schema = StructType([StructField("id", IntegerType(), True)])
rdd.mapPartitions(mapper).toDF(schema).show()
With this code, every partition is processed one time.

How to deal with white space in column names to use spark coalesce function in expr method

I am working on spark coalesce functionality in my project.Code works fine on columns with no spaces but fails on spaced columns.
e1.csv
id,code,type,no root
1,,A,1
2,,,0
3,123,I,1
e2.csv
id,code,type,no root
1,456,A,1
2,789,A1,0
3,,C,0
logic code
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e2.csv");
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("coalesce(`a.no root`,`b.no root`) AS `a.no root`");
newDS.show();
What I have tried
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("""coalesce(`a.no root`,`b.no root`) AS `a.no root`""");
The espexted result would be like
no root
1
0
1
Using the following criteria
val newDS = df1.as("a").join(df2.as("b")).where("a.id==b.id").selectExpr("coalesce(a.`no root`,b.`no root`) AS `a.no root`")
will generate the expected output
+---------+
|a.no root|
+---------+
| 1|
| 0|
| 1|
+---------+

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. Example works fine with just by keeping person.type and removing person.1 in selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
I have solved this problem by using person.*
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+

Cached dataframe flushed after truncating table

Here are the steps:
scala> val df = sql("select * from table")
df: org.apache.spark.sql.DataFrame = [num: int]
scala> df.cache
res13: df.type = [num: int]
scala> df.collect
res14: Array[org.apache.spark.sql.Row] = Array([10], [10])
scala> df
res15: org.apache.spark.sql.DataFrame = [num: int]
scala> df.show
+---+
|num|
+---+
| 10|
| 10|
+---+
scala> sql("truncate table table")
res17: org.apache.spark.sql.DataFrame = []
scala> df.show
+---+
|num|
+---+
+---+
My question is why the df is flushed? My expectation is that it should be cached in the memory and truncate shouldn't erase the data.
Any idea will be much appreciated.
Thanks
The truncate table command removes the cached data then uncaches and empties the table. HERE is the source for truncate. If you follow that link to the source code for TruncateTableCommand, at the bottom of the case class you'll see the following for how the cache and table are handled when a table is truncated:
// After deleting the data, invalidate the table to make sure we don't keep around a stale
// file relation in the metastore cache.
spark.sessionState.refreshTable(tableName.unquotedString)
// Also try to drop the contents of the table from the columnar cache
try {
spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier))
} catch {
case NonFatal(e) =>
log.warn(s"Exception when attempting to uncache table $tableIdentWithDB", e)
}
if (table.stats.nonEmpty) {
// empty table after truncation
val newStats = CatalogStatistics(sizeInBytes = 0, rowCount = Some(0))
catalog.alterTableStats(tableName, Some(newStats))
}
Seq.empty[Row]
You should never depend on cache for correctness. Spark cache is performance optimization, and even with the most defensive StorageLevel (MEMORY_AND_DISK_SER_2) is not guaranteed to preserve data in case of worker failure, executor decommissioning or insufficient resources.
Code similar to the one used in your question might work in some conditions, but don't assume it is guaranteed or deterministic behavior.

How to get the number of elements in partition? [duplicate]

This question already has answers here:
Apache Spark: Get number of records per partition
(6 answers)
Closed 2 years ago.
Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition.
Something like this:
Rdd.partitions().get(index).size()
Except I don't see such an API for spark. Any ideas? workarounds?
Thanks
The following gives you a new RDD with elements that are the sizes of each partition:
rdd.mapPartitions(iter => Array(iter.size).iterator, true)
PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
Spark/scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions )
val l = a.glom().map(_.length).collect() # get length of each partition
print(l.min, l.max, l.sum/l.length, l.length) # check if skewed
The same is possible for a dataframe, not just for an RDD.
Just add DF.rdd.glom... into the code above.
Notice that glom() converts elements of each partition into a list, so it's memory-intensive. A less memory-intensive version (pyspark version only):
import statistics
def get_table_partition_distribution(table_name: str):
def get_partition_len (iterator):
yield sum(1 for _ in iterator)
l = spark.table(table_name).rdd.mapPartitions(get_partition_len, True).collect() # get length of each partition
num_partitions = len(l)
min_count = min(l)
max_count = max(l)
avg_count = sum(l)/num_partitions
stddev = statistics.stdev(l)
print(f"{table_name} each of {num_partitions} partition's counts: min={min_count:,} avg±stddev={avg_count:,.1f} ±{stddev:,.1f} max={max_count:,}")
get_table_partition_distribution('someTable')
outputs something like
someTable each of 1445 partition's counts:
min=1,201,201 avg±stddev=1,202,811.6 ±21,783.4 max=2,030,137
I know I'm little late here, but I have another approach to get number of elements in a partition by leveraging spark's inbuilt function. It works for spark version above 2.1.
Explanation:
We are going to create a sample dataframe (df), get the partition id, do a group by on partition id, and count each record.
Pyspark:
>>> from pyspark.sql.functions import spark_partition_id, count as _count
>>> df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
>>> df.rdd.getNumPartitions()
4
>>> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(_count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
| 0| 48|
| 1| 44|
| 2| 32|
| 3| 48|
+------------+----------+
Scala:
scala> val df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key: string, value: string ... 1 more field]
scala> df.rdd.getNumPartitions
res0: Int = 4
scala> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
| 0| 48|
| 1| 44|
| 2| 32|
| 3| 48|
+------------+----------+
pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.
rdd.mapPartitions(iter => Iterator(iter.size), true).collect()
P.S. Not sure if his answer is actually doing more work since Iterator.apply will likely convert its arguments into an array.

Resources