Selected primitives are incompatible with Koalas EntitySets: time_since_previous, avg_time_between, trend, avg_time_between, trend - featuretools

I am using featuretools 0.20.0 and koalas 1.3.0.
# create feature matrix for all customers
feature_matrix_cust, feature_defs = ft.dfs(
    entityset=es4,
    target_entity="customers_ks",
    agg_primitives=["count", "avg_time_between", "num_unique", "trend"],
    where_primitives=["count", "avg_time_between", "num_unique", "trend"],
    trans_primitives=["time_since_previous"]
)
I got the error below:
ValueError: Selected primitives are incompatible with Koalas EntitySets: time_since_previous, avg_time_between, trend, avg_time_between, trend
Will featuretools support those primitives for Koalas in the near future? Is there any way to handle those primitives with the current featuretools and Koalas versions?

You may be able to write custom primitives that include Koalas functionality, although it is worth noting that Koalas does not currently support custom groupby aggregation functions. Improved support for distributed entitysets is ongoing; however, there is currently no timeline for when support for additional primitives will be added.
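As a rough illustration, a custom transform primitive that sticks to elementwise operations (which generally map well onto Koalas Series) could look something like the sketch below. The import paths, the variable_types module and the Library compatibility attribute follow the featuretools 0.x primitive API and may need adjusting for your exact version; the primitive itself (extracting the hour of a datetime column) is only an example.

from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Datetime, Numeric
from featuretools.utils.gen_utils import Library

class HourOfDay(TransformPrimitive):
    """Elementwise transform: extract the hour from a datetime column."""
    name = "hour_of_day"
    input_types = [Datetime]
    return_type = Numeric
    # declare which entityset backends this primitive claims to support
    compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

    def get_function(self):
        def hour(column):
            # .dt.hour is available on both pandas and Koalas Series
            return column.dt.hour
        return hour

The class can then be passed directly, e.g. trans_primitives=[HourOfDay], in the ft.dfs call above. Aggregation-style primitives are harder to port, since Koalas does not accept custom groupby aggregation functions.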

Related

How do I flatten a featuretools entity set to get wide input format?

I have an entity set with relationships defined. Is there a method to get a left-joined version of all the dataframes in the entities, given that the relationships are already defined?
I can merge the dataframes outside of Featuretools using pandas, but I would like to leverage the well-defined entityset.
This functionality does not currently exist in Featuretools. You can do it outside of Featuretools (with pandas).
Feel free to create an issue for this feature:
https://github.com/alteryx/featuretools/issues/new/choose
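If you do go the pandas route, a rough sketch of driving the merges from the relationships already stored on the entityset is shown below. The attribute names (es.relationships, parent_entity, child_variable, .df) follow the featuretools 0.x API and may differ in other versions; the entityset and entity names are hypothetical, and only direct parents of the target entity are joined in.

def flatten_entityset(es, target_entity):
    """Left-join each parent entity's columns onto the target entity's dataframe."""
    wide = es[target_entity].df.copy()
    for rel in es.relationships:
        if rel.child_entity.id != target_entity:
            continue
        parent_df = es[rel.parent_entity.id].df
        wide = wide.merge(
            parent_df,
            left_on=rel.child_variable.id,
            right_on=rel.parent_variable.id,
            how="left",
            suffixes=("", "_" + rel.parent_entity.id),
        )
    return wide

# e.g. one row per transaction, with the parent customer columns joined on
wide_df = flatten_entityset(es, "transactions")

Joining in the other direction (children onto a parent) works the same way but duplicates parent rows, so it is usually less useful as a wide input format.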

Disadvantages of Spark Dataset over DataFrame

I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDDs, DataFrames and Datasets and how they evolved. This is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._   // from your SparkSession; pre-imported in spark-shell

case class Birth(hospitalName: String, birthDate: Date)
val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// reverse() silently treats the Date column as a string and reverses it
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

Displaying rules of decision tree modelled in pyspark ml library

I am new to Spark. I have modeled a decision tree using the DataFrame-based API, i.e. pyspark.ml. I want to display the rules of the decision tree, similar to what toDebugString gives in the RDD-based API (spark.mllib).
I have read the documentation and could not find how to display the rules. Is there any other way?
Thank you.
As of Spark 2.0 both DecisionTreeClassificationModel and DecisionTreeRegressionModel provide toDebugString methods.
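A minimal pyspark sketch, assuming an existing SparkSession and a training DataFrame named train with "features" and "label" columns (those names and the hyperparameters are only illustrative):

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=3)
model = dt.fit(train)

# prints the fitted tree as nested if/else split rules,
# much like the old spark.mllib toDebugString output
print(model.toDebugString)

The same property is available on DecisionTreeRegressionModel.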

Saving Spark ML pipeline to a database

Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at the moment. The ML writer is not extensible and depends on Parquet files and a directory structure to represent models.
Technically speaking, you could extract the individual components and use the internal private API to recreate models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LogisticRegressionModel has it, so you can save your model to the desired path.
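For example, a sketch of the path-based save/load round trip in pyspark (2.0+ for the Python API); the training DataFrame is assumed to already exist, and anything beyond this, such as landing the model in Cassandra, means copying the written directory yourself:

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(maxIter=10)
model = lr.fit(training)   # `training` is an assumed DataFrame of labeled features

# MLWritable only writes to a path: a directory of Parquet files plus JSON metadata
model.save("/tmp/lr-model")
reloaded = LogisticRegressionModel.load("/tmp/lr-model")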
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline; even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation,
save the MLlib model using a method of your choice (Spark model or PMML).
There is also a temporary solution provided in Jira.

Hybrid recommender in spark

I am trying to build a hybrid recommender using prediction.io, which functions as a layer on top of Spark/MLlib under the hood.
I'm looking for a way to incorporate a boost based on tags in the ALS algorithm when doing a recommendation request.
Using content information to improve collaborative filtering seems like such a common approach, yet I cannot find any documentation on combining a collaborative algorithm (e.g. ALS) with a content-based measure.
Any examples or documentation on incorporating content similarity with collaborative filtering for either mllib (spark) or mahout (hadoop) would be greatly appreciated.
This PredictionIO template uses Mahout's Spark version of correlators, so it can make use of multiple actions to recommend to users or find similar items. It allows you to include multiple categorical, tag-like content fields to boost or filter recommendations.
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
The v0.2.0 branch also has date-range filtering, and popular-item backfill is in development.
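As a rough illustration of the tag-based boosting, a query to the deployed engine via the PredictionIO Python SDK might look roughly like the sketch below. The field structure ("fields" entries with "name", "values" and a "bias" that boosts when greater than 1 and filters when negative) is based on the Universal Recommender's documentation and should be verified against the template version you deploy; the URL, user id and tag values are made up.

import predictionio

engine = predictionio.EngineClient(url="http://localhost:8000")

# recommendations for a user, boosted toward items tagged "electronics"
result = engine.send_query({
    "user": "u-123",
    "num": 10,
    "fields": [
        {"name": "tags", "values": ["electronics"], "bias": 1.5}
    ]
})
print(result)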
