What is the difference between "predicate pushdown" and "projection pushdown"? - apache-spark

I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as:
… if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.
However, I have also seen the term "projection pushdown" in other documentation, such as here, which appears to describe the same thing, but I am not sure my understanding is correct.
Is there a specific difference between the two terms?

Predicate refers to the where/filter clause, which affects the number of rows returned.
Projection refers to the selected columns.
For example:
If your filters pass only 5% of the rows, only 5% of the table will be passed from storage to Spark instead of the full table.
If your projection selects only 3 columns out of 10, then fewer columns will be passed from storage to Spark, and if your storage is columnar (e.g. Parquet, not Avro) and the non-selected columns are not part of the filter, those columns won't even have to be read.
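As a rough sketch of what this looks like in Spark (the Parquet path and column names below are made up for illustration), both pushdowns show up in the physical plan:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical Parquet dataset; path and column names are assumptions.
    val events = spark.read.parquet("/data/events")

    val slim = events
      .filter(col("country") === "DE")   // predicate: limits the rows read
      .select("id", "country", "ts")     // projection: limits the columns read

    // The Parquet scan in the physical plan should list PushedFilters and a
    // reduced ReadSchema, i.e. both the filter and the column selection were
    // pushed down to the data source.
    slim.explain()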

In set and bag relational algebra alike, predicate pushdown eliminates tuples.
In bag relational algebra, projection pushdown eliminates attributes ("columns"), but in the case of column-based storage it doesn't matter much, because columns that are not used higher up aren't being carried to begin with. Even a row-based database may or may not benefit from projection pushdown (even SQL doesn't specify a physical access plan). Projection in bag RA is a very nominal operation that can be physically done at just the metadata level (flag some columns as inaccessible).
In set relational algebra, projection pushdown generally eliminates tuples as well, so this is where it has significance. Set RA projection is not an inexpensive operation, due to the need for deduplication. It's like a GROUP BY with no aggregated fields. Still, it's often worth doing the projection before a join, due to a possible vast decrease of tuple count.
Bag-algebra tools, e.g. SQL, also have ways to express set-RA projection, such as SELECT DISTINCT.
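To make the bag-vs-set distinction concrete, here is a small Spark sketch (the table and column names are invented); the distinct() call is what turns the cheap bag projection into a deduplicating set projection:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Hypothetical tables; names and schemas are illustrative only.
    val orders    = spark.table("orders")      // (customer_id, item, price, ...)
    val customers = spark.table("customers")   // (customer_id, name, ...)

    // Bag-RA projection: just drop columns (cheap, essentially metadata-level).
    val customerIds = orders.select("customer_id")

    // Set-RA projection: also deduplicate, like a GROUP BY with no aggregates.
    val distinctCustomerIds = customerIds.distinct()

    // Doing the set projection before the join can vastly reduce the tuple
    // count when each customer has many orders.
    val activeCustomers = customers.join(distinctCustomerIds, "customer_id")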
Neither predicate pushdown nor projection pushdown leads to more optimal execution in all cases. It depends on their selectivity and a lot of other things. Still, predicate pushdown in particular is a good heuristic, as joins tend to be the most expensive operations.
If the pushed-down projection has a sort index for the retained fields, or it needs to scan the table, there may be join algorithms with which the projection can be fused, avoiding a double reading of table/index structures.

Related

What is recommended - keeping empty lists/arrays versus Null in spark tables?

I have a large Spark table containing mixed data types: strings, arrays, maps.
The array and map columns are sparse in nature. Should I keep empty arrays as values for these columns or make them null?
Similarly, is it recommended to use empty strings "" for storing, or null?
What is good practice, and what are the advantages and disadvantages of each?
Generally speaking I would always try to use NULL values instead of empty strings or arrays. The main reason for me is how they are handled in Spark, e.g. when joining two data frames. NULL values are ignored in joins, but empty strings or lists are not. This can often result in very skewed data, which can heavily slow down your transformations. Some information about skewed data can be found here [external link].
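A minimal sketch of that join behaviour on toy data (local session):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // NULL join keys never match anything, so they fall out of an inner join.
    val left  = Seq(("a", 1), (null, 2)).toDF("key", "v")
    val right = Seq(("a", 10), (null, 20)).toDF("key", "w")
    left.join(right, "key").show()    // only the "a" row survives

    // Empty-string keys do match each other; if "" is common, many rows pile
    // up on a single key, which is exactly the skew problem described above.
    val leftE  = Seq(("a", 1), ("", 2)).toDF("key", "v")
    val rightE = Seq(("a", 10), ("", 20)).toDF("key", "w")
    leftE.join(rightE, "key").show()  // both keys survive, "" matches ""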
In addition, NULL values are also often ignored in functions like coalesce of columns [docs], count in aggregations [related question] or first(col, ignorenulls=True) [docs]. If you want to use the functions as they are intended, I would also recommend using NULL over empty string/list.
To sum up: using NULL instead of values like empty strings or lists allows you to benefit from more native Spark functionality, so I would recommend using NULL whenever possible.

reduce, reduceByKey, reduceGroups in Spark or Flink

reduce: takes the accumulated value and the next value to compute some aggregation.
reduceByKey: the same operation, but applied per specified key.
reduceGroups: applies the specified operation to grouped data.
I don't know how memory is managed for these operations. For example, how is the data consumed when using the reduce function (e.g. is all the data loaded into memory?)? I want to know how data is managed for reduce operations. I also want to know the difference between these operations with respect to data management.
Reduce is one of the cheapest operations in Spark, since the only thing it does is actually group similar data onto the same node. The only cost of a reduce operation is reading the tuple and deciding where it should be grouped.
This means that the plain reduce, in contrast to reduceByKey or reduceGroups, is more expensive, because Spark does not know how to make the grouping and searches for correlations among tuples.
Reduce can also ignore a tuple if it does not meet any requirement.
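For reference, a minimal sketch of the three operations on toy data (this only shows the API shape, not a claim about their relative cost):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // reduce: folds all elements into one value; each partition is combined
    // locally, then the partial results are merged on the driver.
    val total = sc.parallelize(Seq(1, 2, 3, 4, 5)).reduce(_ + _)   // 15

    // reduceByKey: combines values per key, pre-aggregating inside each
    // partition before shuffling the partial sums between nodes.
    val perKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .reduceByKey(_ + _)
      .collect()                                                   // (a,4), (b,2)

    // reduceGroups: the Dataset equivalent, applied to groups from groupByKey.
    val grouped = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()
      .groupByKey(_._1)
      .reduceGroups((x, y) => (x._1, x._2 + y._2))
      .collect()                                                   // (a,(a,4)), (b,(b,2))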

ArangoDB REGEX_TEST index acceleration?

Is there a way to use an index while performing REGEX_TEST() on a string field to retrieve documents in ArangoDB?
Also if there is any way to optimize this please let me know
There is no index acceleration available for the REGEX_TEST() AQL function, and it is unlikely to come in the future. Not because there is no interest from users and developers, but because it's not really possible to build any sort of index data structure that would allow speeding up regular expression evaluation.
Regular expressions as supported by ArangoDB allow for many different types of expressions, but because they can differ so much, there is almost no chance to have a suitable index. For equality comparisons there are hash indexes, which are probably the fastest kind of index. For range queries there are skiplist indexes, and there are of course quite a few more index types known in computer science, but I'm not aware of a single one that could speed up arbitrary regex.
If your expression allows it, maybe there is a chance to add a filter criterion before REGEX_TEST() which can utilize an index? This will mostly be limited to case-sensitive prefix matching, e.g. FILTER REGEX_TEST(doc.str, "a[a-z]*") could be extended to FILTER doc.str >= "a" AND doc.str < "b" AND REGEX_TEST(doc.str, "a[a-z]*"), allowing a skiplist index to be used so the regex is only evaluated on documents where str starts with "a". Or a simple regex like [fm]oo|bar could be rewritten to a set of equality comparisons: FILTER doc.str IN ["foo","moo","bar"]. Also have a look at ArangoSearch.

Raw sql with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-sql the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to denormalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.

Mind blown: RDD.zip() method

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
the order of elements in an RDD is a meaningless concept
the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets AttributeError: 'set' object has no attribute 'zip')
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like a.map(f).zip(a)?
EDIT 1:
Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
collect() is also okay - it just converts a set to a sequence which is perfectly legit.
EDIT 2: The reply says:
when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like map. That said, I find it a little easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
The mental model I use (and recommend) is that the elements of an RDD are ordered, but when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
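For example (toy data on a local SparkContext), deriving one RDD from another with a map keeps the element-by-element correspondence that zip relies on:

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

    val a = sc.parallelize(1 to 4, 2)
    val b = a.map(_ * 10)   // map preserves partitioning and per-partition order

    // zip requires the same number of partitions and the same number of
    // elements per partition in both RDDs; b was derived from a, so this holds.
    a.zip(b).collect()      // Array((1,10), (2,20), (3,30), (4,40))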
For those who want to be aware of partitions, I'd say that:
The partitions of an RDD have an order.
The elements within a partition have an order.
If you think of "concatenating" the partitions (say laying them "end to end" in order) using the order of elements within them, the overall ordering you end up with corresponds to the order of elements if you ignore partitions (see the glom() sketch after this list).
But again, if you compute one RDD from another, all bets about the order relationships of the two RDDs are off.
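A small sketch of that partition view (toy data; exactly how Spark splits the elements across the two partitions is up to it):

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

    val rdd = sc.parallelize(Seq(10, 20, 30, 40, 50), 2)

    // glom() materializes each partition as an array, in partition order.
    rdd.glom().collect()    // e.g. Array(Array(10, 20), Array(30, 40, 50))

    // Concatenating the partitions end to end reproduces the overall order
    // that collect() observes.
    rdd.collect()           // Array(10, 20, 30, 40, 50)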
Several members of the RDD class (I'm referring to the Scala API) strongly suggest an order concept (as does their documentation):
collect()
first()
partitions
take()
zipWithIndex()
as does Partition.index as well as SparkContext.parallelize() and SparkContext.makeRDD() (which both take a Seq[T]).
In my experience these ways of "observing" order give results that are consistent with each other, and the ones that translate back and forth between RDDs and ordered Scala collections behave as you would expect -- they preserve the overall order of elements. This is why I say that, in practice, RDDs have a meaningful order concept.
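For instance (again toy data), the order-observing methods listed above agree with each other:

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext

    val rdd = sc.parallelize(Seq("x", "y", "z"))

    rdd.collect()                  // Array(x, y, z) -- same order as the input Seq
    rdd.first()                    // x
    rdd.take(2)                    // Array(x, y)    -- a prefix of collect()
    rdd.zipWithIndex().collect()   // Array((x,0), (y,1), (z,2))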
Furthermore, while there are obviously many situations where computing an RDD from another must change the order, in my experience order tends to be preserved where it is possible/reasonable to do so. Operations that don't re-partition and don't fundamentally change the set of elements especially tend to preserve order.
But this brings me to your question about "contract", and indeed the documentation has a problem in this regard. I have not seen a single place where an operation's effect on element order is made clear. (The OrderedRDDFunctions class doesn't count, because it refers to an ordering based on the data, which may differ from the raw order of elements within the RDD. Likewise the RangePartitioner class.) I can see how this might lead you to conclude that there is no concept of element order, but the examples I've given above make that model unsatisfying to me.
