How to do a recursive self-join in Foundry Contour? - apache-spark

I have a dataset which represents objects in a hierarchy (there are no cycles). I want to analyse it in Contour and figure out for each object the list of top-level related objects.
Say, my object A depends on objects B and C. Object C in turn depends on objects D and E.
Now I want to figure out what are the "final" or highest level dependencies of A, and I expect the result to be B, D and E.

While the SQL world has dedicated constructs for hierarchical queries (look for CONNECT BY), the language underpinning Contour and Palantir Foundry overall (i.e. Apache Spark) has no built-in recursive construct.
So, whilst it is possible to perform recursive queries with custom functions, I strongly doubt it would be feasible to implement them in Contour.

Given that pyspark is used with Palantir, you can simulate this CONNECT BY / Recursive CTE using pyspark dataframes.
This excellent read https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9 shows you how.
Out of the box, Spark SQL has no such capability.
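To make the idea concrete, here is a minimal sketch of the iterative expansion that such a simulation performs. It is written in plain Python rather than PySpark so it runs without a cluster; with DataFrames, each pass of the loop would be one self-join of the current frontier against the edge table. The function name leaf_dependencies and the (parent, child) edge layout are assumptions made for illustration, not something from the question or the linked article.

```python
def leaf_dependencies(edges):
    """Map every parent to the sorted list of its final (leaf) dependencies.

    edges: iterable of (parent, child) pairs; assumed to contain no cycles.
    """
    edges = list(edges)
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    # The frontier holds (root, node) pairs still to be resolved.
    frontier = set(edges)
    leaves = set()
    while frontier:
        next_frontier = set()
        for root, node in frontier:
            if node in children:
                # One level of expansion: the equivalent of a self-join step.
                next_frontier.update((root, dep) for dep in children[node])
            else:
                # Nothing further below this node, so it is a final dependency.
                leaves.add((root, node))
        frontier = next_frontier

    result = {}
    for root, leaf in leaves:
        result.setdefault(root, set()).add(leaf)
    return {root: sorted(found) for root, found in result.items()}


# The hierarchy from the question: A -> B, C and C -> D, E.
edges = [("A", "B"), ("A", "C"), ("C", "D"), ("C", "E")]
result = leaf_dependencies(edges)
```

The loop stops once no pair can be expanded further, which is guaranteed here only because the hierarchy has no cycles. On real data the number of passes equals the depth of the hierarchy, so a deep tree means many joins.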

Related

Export spark feature transformation pipeline to a file

PMML, Mleap and PFA currently only support row-based transformations. None of them support frame-based transformations like aggregates, groupby or join. What is the recommended way to export a Spark pipeline consisting of these operations?
I see 2 options wrt Mleap:
1) Implement dataframe-based transformers and the SQLTransformer-Mleap equivalent. This solution seems to be conceptually the best (since you can always encapsulate such transformations in a pipeline element) but is also a lot of work. See https://github.com/combust/mleap/issues/126
2) Extend the DefaultMleapFrame with the operations you want to perform, and then apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with 2) and added implode, explode and join as methods to the DefaultMleapFrame and also a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just part of the workflow) then you need to look for other/combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (eg. a SQL database will be much better at it than any PMML or PFA runtime).

How to execute function at data source level (and in turn bypass Catalyst)?

Apache Spark™ provides a pluggable mechanism to integrate with external data sources using the DataSource APIs. These APIs allow Spark to read data from external data sources and also allow data that is analyzed in Spark to be written back out to the external data sources. The DataSource APIs also support filter pushdowns and column pruning that can significantly improve the performance of queries.
In addition to this, I want to know whether Apache Spark also provides an ability (or interface) for data sources that can execute functions (native or user-defined) natively.
We have a proprietary data source, and it can give results to functions like max(), min(), size() etc.
tl;dr No, that's not possible.
Spark SQL uses functions as a more developer-friendly interface to create Catalyst expressions that know what to generate when given an InternalRow (zero, one or more rows per what's available and whether the expression is a user-defined function or user-defined aggregate function, respectively).
DataSource does not interact with Column (or Catalyst expression in particular) or vice versa in any way. They are separate.
To get very low-level, you could review Max Catalyst expression yourself and learn what and when is generated at execution time.

how to make different sum over the same line in Spark

I have a Spark dataframe with some numeric columns.
I would like to perform several aggregation operations on these columns, creating a new column for each function, some of which may also be user defined.
The easy solution would be using the dataframe and withColumn. For instance, if I wanted to calculate the mean (by hand) and the function my_function on fields field_1 and field_2, I would do:
df = df.withColumn("mean", (df["field_1"] + df["field_2"]) / 2)
df = df.withColumn("foo", my_function(df["field_1"], df["field_2"]))
My doubt is about efficiency. Each of the 2 above functions scans the whole database while a smarter approach would calculate both results using one single scan.
Any hint on how to do that?
Thanks
Mauro
TL;DR You're trying to solve a problem which doesn't exist.
SQL transformations are lazy and declarative. A series of operations is converted into a logical execution plan, and then into a physical execution plan. At the first stage the Spark optimizer has the freedom to reorder, combine or even remove any part of the plan. You do, however, have to distinguish between two cases:
Python udf.
SQL expression.
The first requires separate conversion to Python RDD. It cannot be combined with native processing. The second one is processed natively using generated code.
Once you request the results, the physical plan is converted into stages and executed.
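The laziness described above can be illustrated with a toy model. This is a plain-Python sketch, not Spark's API: the LazyTable class, its with_column and collect methods, and the scans counter are all hypothetical names invented for this example. It only shows the principle that adding columns records expressions, and that the data is read once when results are requested.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table (hypothetical, not Spark)."""

    def __init__(self, rows, exprs=None):
        self._rows = rows
        self._exprs = exprs or []
        self.scans = 0  # how many times collect() has walked the data

    def with_column(self, name, fn):
        # Nothing is computed here; the expression is only added to the plan.
        return LazyTable(self._rows, self._exprs + [(name, fn)])

    def collect(self):
        # A single pass over the rows evaluates every recorded expression.
        self.scans += 1
        out = []
        for row in self._rows:
            row = dict(row)
            for name, fn in self._exprs:
                row[name] = fn(row)
            out.append(row)
        return out


rows = [{"field_1": 1.0, "field_2": 3.0}, {"field_1": 4.0, "field_2": 6.0}]
plan = (
    LazyTable(rows)
    .with_column("mean", lambda r: (r["field_1"] + r["field_2"]) / 2)
    .with_column("foo", lambda r: r["field_1"] * r["field_2"])
)
computed = plan.collect()
```

Both derived columns come out of the same pass. In real Spark the equivalent check is to look at the physical plan of the final DataFrame, keeping in mind the caveat above that a Python udf is handled separately from native SQL expressions.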

Apply a custom function to a spark dataframe group

I have a very big table of time series data that have these columns:
Timestamp
LicensePlate
UberRide#
Speed
Each collection of LicensePlate/UberRide data should be processed considering the whole set of data. In other words, I do not need to process the data row by row, but all rows grouped by (LicensePlate/UberRide) together.
I am planning to use Spark with the dataframe API, but I am confused about how I can perform a custom calculation over a grouped Spark dataframe.
What I need to do is:
1. Get all data
2. Group by some columns
3. For each Spark dataframe group, apply f(x) and return a custom object for each group
4. Get the results by applying g(x) and returning a single custom object
How can I do steps 3 and 4? Any hints over which spark API (dataframe, dataset, rdd, maybe pandas...) should I use?
The whole workflow can be seen below:
What you are looking for has existed since Spark 2.3: Pandas vectorized UDFs. They let you group a DataFrame and apply custom transformations with pandas, distributed over each group:
df.groupBy("groupColumn").apply(myCustomPandasTransformation)
It is very easy to use so I will just put a link to Databricks' presentation of pandas UDF.
However, I don't know of a similarly practical way to do grouped transformations in Scala yet, so any additional advice is welcome.
EDIT: in Scala, you can achieve the same thing since earlier versions of Spark, using Dataset's groupByKey + mapGroups/flatMapGroups.
While Spark provides some ways to integrate with Pandas it doesn't make Pandas distributed. So whatever you do with Pandas in Spark is simply local (either to driver or executor when used inside transformations) operation.
If you're looking for a distributed system with Pandas-like API you should take a look at dask.
You can define User Defined Aggregate functions or Aggregators to process grouped Datasets but this part of the API is directly accessible only in Scala. It is not that hard to write a Python wrapper when you create one.
RDD API provides a number of functions which can be used to perform operations in groups starting with low level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Which one is applicable in your case depends on the properties of the function you want to apply (is it associative and commutative, can it work on streams, does it expect specific order).
The most general but inefficient approach can be summarized as follows:
h(rdd.keyBy(f).groupByKey().mapValues(g).collect())
where f maps from value to key, g corresponds to per-group aggregation and h is a final merge. Most of the time you can do much better than that so it should be used only as the last resort.
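For readers who want to see the shape of that pattern without a cluster, here is a plain-Python sketch of the same f / g / h roles. It is not the RDD API: the helper key_by_group_map, the sample rows and the particular f, g and h are all assumptions chosen for illustration, loosely following the LicensePlate/UberRide columns from the question.

```python
from collections import defaultdict


def key_by_group_map(records, f, g, h):
    """Local analogue of h(rdd.keyBy(f).groupByKey().mapValues(g).collect())."""
    groups = defaultdict(list)
    for record in records:
        groups[f(record)].append(record)  # keyBy + groupByKey
    per_group = [(key, g(values)) for key, values in groups.items()]  # mapValues + collect
    return h(per_group)  # final merge on the collected results


# Hypothetical rows: (license_plate, ride_id, speed).
rows = [
    ("ABC", 1, 40.0),
    ("ABC", 1, 60.0),
    ("ABC", 2, 30.0),
    ("XYZ", 7, 55.0),
]


def f(row):
    # Key: one group per (license plate, ride).
    return (row[0], row[1])


def g(group):
    # Per-group aggregation: mean speed of the ride.
    return sum(r[2] for r in group) / len(group)


def h(pairs):
    # Final merge: the ride with the highest mean speed.
    return max(pairs, key=lambda kv: kv[1])


fastest = key_by_group_map(rows, f, g, h)
```

As in the RDD version, every group is fully materialised before g runs, which is exactly why this approach is general but expensive.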
Relatively complex logic can be expressed using DataFrames / Spark SQL and window functions.
See also Applying UDFs on GroupedData in PySpark (with functioning python example)

Implementing a Spark SQL UserDefinedAggregateFunction that performs multiple passes over a column

I've been experimenting with the UserDefinedAggregateFunction class to write aggregate functions for use in Spark SQL.
It works well for implementing single pass operations like sum(), avg() etc., but is there a trick you can use to perform multiple passes over a column?
For example, calculating variance using the naive approach, i.e. with a first pass calculating the column mean and then a second pass that uses this value to calculate the variance. I know that there are single-pass algorithms for doing this that give good approximations (as in fact implemented by Spark). I was just using this as an example of a two-pass operation.
It would be nice to be able to do the following,
spark.sql("SELECT product, MultiPassAgg(price) FROM products GROUP BY product")
I appreciate that I can do this kind of thing using Dataset / DataFrame operations in stages etc., but I was just looking for a clean approach as illustrated in the SQL above.
Any ideas or suggestions?
This should be possible, though the following suggestion could potentially use a large amount of memory if a large number of rows are involved in any given partition.
In the implementation of your UserDefinedAggregateFunction, set up the bufferSchema having a StructField that includes a DataType that is a collection (such as ArrayType) to act as an internal collection of inputs provided via update.
Then, in update you append each input to your collection, and in merge you combine all of the collections into a single collection. This allows you to have the full partition available for use in evaluate.
Finally, during evaluate you can operate across the entire collection of rows in any way you see fit.
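The update / merge / evaluate flow described above can be sketched in plain Python. This is not the UserDefinedAggregateFunction class itself (that API is Scala-side); the class name CollectingVarianceAgg and the two simulated partitions are assumptions for illustration. The method names simply mirror the ones mentioned in the answer.

```python
class CollectingVarianceAgg:
    """Toy aggregate whose buffer is the full list of inputs (hypothetical)."""

    def initialize(self):
        # Empty collection, playing the role of an ArrayType buffer field.
        return []

    def update(self, buffer, value):
        # Append each incoming value to the buffer.
        return buffer + [value]

    def merge(self, buffer1, buffer2):
        # Combine partial buffers coming from different partitions.
        return buffer1 + buffer2

    def evaluate(self, buffer):
        # All inputs are available here, so a two-pass computation is possible.
        if not buffer:
            return None
        mean = sum(buffer) / len(buffer)                            # first pass
        return sum((x - mean) ** 2 for x in buffer) / len(buffer)   # second pass


agg = CollectingVarianceAgg()

# Simulate two partitions that each hold part of one product's prices.
partition_a = agg.initialize()
for price in [1.0, 2.0]:
    partition_a = agg.update(partition_a, price)

partition_b = agg.initialize()
for price in [3.0, 4.0]:
    partition_b = agg.update(partition_b, price)

variance = agg.evaluate(agg.merge(partition_a, partition_b))
```

The result is the population variance of the four prices. The buffer grows with the number of input rows, which is the memory cost the answer warns about.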

Resources