The documentation says that zipWithIndex triggers a Spark job and zipWithUniqueId does not. I am not sure I understand what that means. Could you help me understand the difference between the two?
The source of truth is the latest code:
/**
* Zips this RDD with its element indices. The ordering is first based on the partition index
* and then the ordering of items within each partition. So the first item in the first
* partition gets index 0, and the last item in the last partition receives the largest index.
*
* This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
* This method needs to trigger a spark job when this RDD contains more than one partitions.
*
* @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The index assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}
/**
* Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
* 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
* won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
*
* @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithUniqueId(): RDD[(T, Long)]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1396
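To make the difference concrete, here is a small illustration of my own (not from the docs), run on an RDD with 3 partitions; the exact ids depend on how parallelize slices the data:
val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"), 3)

// zipWithIndex: consecutive indices, but Spark must first run a job to count
// the elements in each partition so it knows where each partition's numbering starts.
rdd.zipWithIndex().collect()
// e.g. Array((a,0), (b,1), (c,2), (d,3), (e,4))

// zipWithUniqueId: ids k, n+k, 2*n+k, ... for partition k (n = 3 here);
// gaps are possible, but no job is needed because the formula only depends
// on the partition index, not on the partition sizes.
rdd.zipWithUniqueId().collect()
// e.g. Array((a,0), (b,1), (c,4), (d,2), (e,5))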
Maybe someone has observed this behavior and knows why Spark takes this route.
I wanted to read only a few partitions from a partitioned table.
SELECT *
FROM my_table
WHERE snapshot_date IN('2023-01-06', '2023-01-07')
results in (part of) the physical plan:
-- Location: PreparedDeltaFileIndex [dbfs:/...]
-- PartitionFilters: [cast(snapshot_date#282634 as string) IN (2023-01-06,2023-01-07)]
It is very fast, ~1s; in the execution plan I can see it is using the provided dates as partition filters.
If I instead provide the filter predicate in the form of a one-column table, it does a full table scan and takes 100x longer.
SELECT *
FROM
my_table
WHERE snapshot_date IN (
SELECT snapshot_date
FROM (VALUES('2023-01-06'), ('2023-01-07')) T(snapshot_date)
)
-- plan
Location: PreparedDeltaFileIndex [dbfs:/...]
PartitionFilters: []
ReadSchema: ...
I was unable to find any query hints that would force Spark to push down this predicate.
One can easily do a for loop in Python, wrapping the logic of reading the table with the desired dates and reading them one by one. But I'm not sure that is possible in SQL.
Is there any option/switch I have missed?
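For reference, here is a minimal Scala sketch of the loop-and-union workaround mentioned above (table and column names are taken from the query; spark is assumed to be the active session):
import spark.implicits._

val dates = Seq("2023-01-06", "2023-01-07")

// Each literal equality filter is pushed down as a partition filter;
// the per-date reads are then unioned.
val result = dates
  .map(d => spark.table("my_table").where($"snapshot_date" === d))
  .reduce(_ union _)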
I don't think pushing down this kind of predicate is something supported by Spark's HiveMetaStore client, today.
So in the first case, the HiveShim.convertFilters(...) method will transform
WHERE snapshot_date IN ('2023-01-06', '2023-01-07')
into a filtering predicate understood by HMS as
snapshot_date="2023-01-06" or snapshot_date="2023-01-07"
but in the second (sub-select) case, the condition will be skipped altogether.
/**
* Converts catalyst expression to the format that Hive's getPartitionsByFilter() expects, i.e.
* a string that represents partition predicates like "str_key=\"value\" and int_key=1 ...".
*
* Unsupported predicates are skipped.
*/
def convertFilters(table: Table, filters: Seq[Expression]): String = {
  lazy val dateFormatter = DateFormatter()
  :
  :
Is there any way to deal with RDDs that have only a single element (this can sometimes happen for what I am doing)? When that's the case, reduce stops working, as the operation requires two inputs.
I am working with key-value pairs such as:
(key1, 10),
(key2, 20),
And I want to aggregate their values, so the result should be:
30
But there are cases where the RDD only contains a single key-value pair, so reduce does not work there. Example:
(key1, 10)
This will return nothing.
If you do a .values() before doing reduce, it should work even if there is only 1 element in the RDD:
from operator import add
rdd = sc.parallelize([('key1', 10),])
rdd.values().reduce(add)
# 10
I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical SQL partition key, and a compound (string, string) sort key. I tried converting the four-character numeric values into integers, and then I combined them into a single integer. Neither of those moves resolved the problem. Even with a single integer sort key, I was getting a few bad values.
Typically, there are fewer than a hundred records in each partition, and a handful of non-NULL values for any field. It never seems to get the second to last value; it's always at least third to last.
I did try to replace the simple aggregate with a windowed aggregate with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two, but I didn't do enough runs to really compare the approaches and of course I need zero.
Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.
import org.apache.spark.sql.functions.{expr}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._
class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
  implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
  import spark.implicits._

  // TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
  "LAST" must "work with GROUP BY" in {
    val lastSchema = StructType(Seq(
      StructField("Pfield", IntegerType)   // partition field
      , StructField("Ofield", IntegerType) // order field
      , StructField("Vfield", StringType)  // value field
    ))

    val last: DataFrame = spark.createDataFrame(List[Row](
      Row(0, 1, "Pencil")
      , Row(5, 1, "Aardvark")
      , Row(10, 1, "Monastery")
      , Row(10, 2, "Remediation")
      , Row(15, 1, "Parcifal")
      , Row(20, 1, "Montenegro")
      , Row(20, 2, "Susquehana")
      , Row(20, 3, "Perfidy")
      , Row(20, 4, "Prosody")
    ).asJava
      , lastSchema
    ).repartition(expr("MOD(Pfield, 4)"))

    last.createOrReplaceTempView("last_group_test")

    // apply the unwindowed last
    val unwindowed: DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
    unwindowed.show(5)

    // apply a windowed last
    val windowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    windowed.show(5)

    // include the partitioning function in the window
    val excessivelyWindowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    excessivelyWindowed.show(5)

    assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
    assert(windowed.count() == 5)
    assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0) === "Prosody")
  }
}
So, all three datasets are the same, which is nice. But if I apply this logic to my actual needs -- which involve sixty-odd columns, almost all of which are LAST values -- I get bad values, roughly 4 per batch of 500,000 groups. If I run the dataset 30 times, I get 30 different sets of bad records.
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see it, but "aggregates simply don't work sometimes" can't be something they released with, right?
I was able to resolve the issue by applying the windowed aggregate to a dataset with the same sorting, sorted in a subquery.
SELECT LAST(VAL) FROM (SELECT * FROM TBL ORDER BY SRT) SRC GROUP BY PRT
was not sufficient, nor was
SELECT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM TBL
I had to do both:
SELECT DISTINCT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT * FROM TBL ORDER BY SRT) SRC
These datasets had been extracted from an Oracle 12.2 instance over JDBC. I also added SRT to the ORDER BY clause there, which had previously been just ORDER BY PRT.
Further -- and I think this may have been most significant -- I used the cacheTable API on the Spark catalog object after extracting the data. I had been doing
.repartition
.cache
.count
in order to load all the records with a relatively small number of data connections, but I suspect that was not enough to get all the data Spark-side before the aggregations took place.
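For what it's worth, a rough sketch of that extract-then-cache pattern (the connection details, names, and repartition count below are placeholders, not my actual code):
import java.util.Properties

val props = new Properties()
props.setProperty("user", "someUser")
props.setProperty("password", "somePassword")

// Sort at the source, including the sort key SRT, not just the partition key PRT.
val extracted = spark.read
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/SVC",
        "(SELECT * FROM TBL ORDER BY PRT, SRT) SRC",
        props)
  .repartition(200)

extracted.createOrReplaceTempView("extracted")
spark.catalog.cacheTable("extracted")   // cache through the catalog API
spark.table("extracted").count()        // force full materialization Spark-side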
What is the best way to partition the data by a field into a predefined partition count?
I am currently partitioning the data by specifying partitionCount=600. A count of 600 was found to give the best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count at 600. The data currently has around 2000 unique eventNames, and the number of rows per eventName is not uniform. Around 10 eventNames hold more than 50% of the data, causing data skew. Hence, if I do the partitioning like below, it's not very performant. The write takes 5x longer than without it.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
  "eventName": "name1",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "detailed1",
      "id": "1234",
      ...
      ...
    }
  }
}
Thanks!
This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then you simply create a column using that function and partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
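A minimal sketch of that idea (the column and variable names here are mine, not part of any API): each eventName is hashed into one of 600 buckets and the repartition is done on the bucket column.
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val numBuckets = 600
val bucketOf = udf { name: String =>
  ((name.hashCode % numBuckets) + numBuckets) % numBuckets // stable bucket in [0, 600)
}

val rawJson = sqlContext.read.json(filename)
  .withColumn("myPartitionCol", bucketOf($"eventName"))

rawJson
  .repartition(numBuckets, $"myPartitionCol")
  .write
  .partitionBy("eventName")
  .parquet(filenameParquet)
Note that a purely deterministic mapping like this still sends all rows of a heavy event name to the same bucket; spreading one heavy key across several partitions is what the random-range approach below addresses.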
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
  * x and y are the numbers in the tuple associated with a key.
  */
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
  private val r = new java.util.Random() // Scala Random is not serializable in 2.10

  def apply(key: A): Int = {
    val (start, end) = m(key)
    start + r.nextInt(end - start)
  }

  override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
/** Partitions data such that each unique key ends up in P(key) partitions.
  * Must be instantiated with a sequence of unique keys and their Ps.
  * Partition sizes can be highly skewed by the data, which is where the
  * multiples come in.
  *
  * @param keyMap maps key values to their partition multiples
  */
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
  private val rrm = new RandomRangeMap(
    keyMap.keys
      .zip(
        keyMap.values
          .scanLeft(0)(_ + _)
          .zip(keyMap.values)
          .map {
            case (start, count) => (start, start + count)
          }
      )
      .toMap
  )

  override val numPartitions =
    keyMap.values.sum

  override def getPartition(key: Any): Int =
    rrm(key)
}
object ByKeyPartitionerWithMultiples {
  /** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
    *
    * @param keyMap maps key values to their partition multiples
    */
  def udf(keyMap: Map[String, Int]) = {
    val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
    (key: String) => partitioner.getPartition(key)
  }
}
In your case, you would have to merge several event names into a single partition, which would require changes, but I hope the code above gives you an idea of how to approach the problem.
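For example, a hypothetical adaptation (all event names and multiples below are made up) might give each heavy event name several buckets and every other name a single bucket, applied to your rawJson DataFrame:
import org.apache.spark.sql.functions.{col, udf}

// Partition multiples for the ~10 heavy event names; everything else gets 1.
val heavy = Map("name1" -> 30, "name2" -> 20)
val allNames = rawJson.select("eventName").distinct().collect().map(_.getString(0))
val keyMap = allNames.map(n => n -> heavy.getOrElse(n, 1)).toMap

val bucketOf = udf(ByKeyPartitionerWithMultiples.udf(keyMap))
val spread = rawJson
  .withColumn("bucket", bucketOf(col("eventName")))
  .repartition(keyMap.values.sum, col("bucket"))
Keep in mind that repartition() hashes the bucket column rather than using the Partitioner directly, so the spread is approximate; the point is simply that a heavy event name no longer collapses into a single partition.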
One final observation: if the distribution of event names varies a lot in your data over time, you can perform a statistics-gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or the size of output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.
I have a df with the following schema:
ts: TimestampType
key: int
val: int
The df is sorted in ascending order of ts. Starting with row(0), I would like to group the dataframe within certain time intervals.
For example, if I say df.filter(row(0).ts + expr(INTERVAL 24 HOUR)).collect(), it should return all the rows within the 24 hr time window of row(0).
Is there a way to achieve the above within Spark DF context?
Generally speaking this is a relatively simple task. All you need is basic arithmetic on UNIX timestamps. First, let's cast all timestamps to numerics:
val dfNum = df.withColumn("ts", $"ts".cast("long"))
Next, let's find the minimum timestamp over all rows:
val offset = dfNum.agg(min($"ts")).first.getLong(0)
and use it to compute groups:
val aDay = lit(60 * 60 * 24)
val group = (($"ts" - lit(offset)) / aDay).cast("long")
val dfWithGroups = dfNum.withColumn("group", group)
Finally you can use it as a grouping column:
dfWithGroups.groupBy($"group").agg(min($"val"))
If you want meaningful intervals (interpretable as timestamps), just multiply the groups by aDay and add the offset back.
Obviously this won't handle complex cases like daylight saving time or leap seconds, but it should be good enough most of the time. If you need to handle any of this properly, you can use similar logic with Joda-Time in a UDF.
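Putting the steps together, here is a self-contained sketch using the ts/key/val schema from the question (the sample rows are made up):
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, min}

val spark = SparkSession.builder().appName("interval-groups").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (Timestamp.valueOf("2016-06-20 00:00:00"), 1, 10),
  (Timestamp.valueOf("2016-06-20 12:00:00"), 1, 5),
  (Timestamp.valueOf("2016-06-21 06:00:00"), 2, 7)
).toDF("ts", "key", "val")

val dfNum  = df.withColumn("ts", $"ts".cast("long"))     // seconds since epoch
val offset = dfNum.agg(min($"ts")).first.getLong(0)      // timestamp of the earliest row
val aDay   = lit(60 * 60 * 24)
val group  = (($"ts" - lit(offset)) / aDay).cast("long") // 24-hour bucket index

dfNum.withColumn("group", group)
  .groupBy($"group")
  .agg(min($"val").as("min_val"))
  .show()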