Enable/force featuretools to use 2 or more columns to group by - featuretools

How can I enable or force featuretools to create groupby features that use 2 or more columns as the group-by?
For example, I have columns x, y, z.
How do I set groupby_primitive_options (or something similar) to get a feature like func(func(x) groupby y, z)?

groupby_primitive_options does not currently support grouping by 2 or more columns, but you could write a custom primitive that does additional groupby operations inside the primitive function.
This may not directly apply to your situation, but in this user issue grouping by multiple columns was achieved by adding an intermediate table to the entityset.
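For illustration, here is a minimal sketch of such a custom transform primitive, assuming featuretools 1.x with Woodwork column schemas; the primitive name, the mean aggregation, and the x/y/z roles are placeholders for your own logic:
import pandas as pd
from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema

class MeanByTwoGroups(TransformPrimitive):
    """Hypothetical primitive: mean of x within each (y, z) group."""
    name = "mean_by_two_groups"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),   # x
        ColumnSchema(semantic_tags={"category"}),  # y (first group key)
        ColumnSchema(semantic_tags={"category"}),  # z (second group key)
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def mean_by_two_groups(x, y, z):
            # Do the multi-column groupby inside the primitive itself.
            frame = pd.DataFrame({"x": x.values, "y": y.values, "z": z.values})
            return frame.groupby(["y", "z"])["x"].transform("mean").values
        return mean_by_two_groups
You would then pass the class in trans_primitives when calling ft.dfs, e.g. ft.dfs(entityset=es, target_dataframe_name="data", trans_primitives=[MeanByTwoGroups]); the entityset and dataframe name here are assumptions about your setup.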

Related

Z order column in databricks table

I am working on creating a notebook which end users could run by providing the table name as input and get an efficient sample query (by utilising the partition key and Z order column). I can get the partition column with DESCRIBE TABLE or spark.catalog, but I am not able to find a way to get the Z order column from the table metadata.
The code for getting the partition column is given below.
columns = spark.catalog.listColumns(tableName=tablename, dbName=dbname)
partition_columns_details = list(filter(lambda c: c.isPartition, columns))
partition_columns = [c.name for c in partition_columns_details]
Maybe first an important thing to know about the difference between partitioning and Z ordering: once you partition a table on a column, that partitioning remains after each transaction. If you do an insert, update, optimize, ..., the table will still be partitioned on that column.
This is not the case for Z ordering. If you Z order a table during an OPTIMIZE and afterwards do an insert, update, ..., it is possible that the data is not well structured anymore.
That being said, here is an example to find which column(s) the last Z ordering was done on:
from pyspark.sql import functions as F

df = spark.sql(f'DESCRIBE HISTORY {table_name}')
zorder_by = (df.filter(F.col('operation') == 'OPTIMIZE')
             .orderBy(F.desc('timestamp'))
             .select('operationParameters.zOrderBy')
             .collect()[0].zOrderBy)
You can probably expand the code above with some extra information, e.g. were there many other transactions done after the Z ordering? Note that not all transactions 'destroy' the Z ordering result. VACUUM, for example, will be registered in the history but does not impact Z ordering. After a small INSERT you will also probably still benefit from the Z ordering that was done before.
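For example, a possible extension along those lines (a sketch; it reuses the table_name variable from above and only the standard columns that DESCRIBE HISTORY returns) that lists what was recorded after the most recent OPTIMIZE:
from pyspark.sql import functions as F

history = spark.sql(f'DESCRIBE HISTORY {table_name}')

# Timestamp of the most recent OPTIMIZE (the last time Z ordering could have run).
last_optimize_ts = (history.filter(F.col('operation') == 'OPTIMIZE')
                           .agg(F.max('timestamp'))
                           .collect()[0][0])

# Everything recorded after that OPTIMIZE; many writes here suggest the
# Z ordering may no longer be well preserved.
(history.filter(F.col('timestamp') > last_optimize_ts)
        .select('timestamp', 'operation')
        .orderBy('timestamp')
        .show(truncate=False))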

pyspark : Operate on column sets in parallel

If I were interested in working on groups of rows in Spark, I could, for example, add a column that denotes group membership and use a pandas groupby to operate on these rows in parallel (independently). Is there a way to do this for columns instead? I have a df with many feature columns and 1 target column, and I would like to run calculations on each set {column_i, target} for i = 1...p.
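For reference, the row-wise pattern described above (tag rows with a group column, then run an independent pandas function per group) looks roughly like this with PySpark's applyInPandas; the dataframe, column names, and per-group logic below are made-up placeholders:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a 'group' column marking group membership.
df = spark.createDataFrame(
    [(1, "a", 10.0), (1, "b", 20.0), (2, "a", 30.0), (2, "b", 40.0)],
    ["group", "key", "value"],
)

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary per-group computation; each group is processed independently.
    pdf["value_centered"] = pdf["value"] - pdf["value"].mean()
    return pdf

result = df.groupBy("group").applyInPandas(
    per_group,
    schema="group long, key string, value double, value_centered double",
)
result.show()
The column-wise version would presumably require reshaping the feature columns into rows first (e.g. one row per (column_i, target) pair) so the same groupBy machinery applies; that reshaping step is an assumption, not something stated in the question.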

What condition should I supply for a custom cross join in an Azure Data Factory dataflow?

In a dataflow, I have two datasets with one column each. Let's say dataset a with column a and dataset b with column b.
I want to cross join them, but when I select the custom cross join option it asks me to specify a condition. I don't understand what I need to supply here, I just want all the records from column a to be cross joined with all the records from column b. What should I put? I tried checking the official Microsoft documentation but there were no examples there.
The cross join in a join transformation of an Azure Data Factory dataflow requires a condition on which the join has to be applied. I have done the following to demonstrate how to do a cross join on the example that you have given.
I have two datasets (one column each). Dataset A has one column a with the following values.
Dataset B has column b with the following values.
I have used a join transformation to join both the sources. The dataflow join transformation prompts you to specify a cross join condition. If you don't have any condition and just want to apply a cross join to all the data from both datasets, you give the cross join condition the value true() (as you want to do in this case).
This applies a cross join of all the records of column a with all the records of column b.
This is how you can achieve your requirement. If you do have a condition, you can pass it to apply the cross join based on it instead of using true(). Refer to this official Microsoft documentation to understand more about joins.
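As a side note, if you want to sanity-check what the true() condition produces, the same all-pairs result expressed in PySpark (not ADF data flow syntax, just an illustration of the semantics) would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for dataset a (column a) and dataset b (column b).
df_a = spark.createDataFrame([("a1",), ("a2",), ("a3",)], ["a"])
df_b = spark.createDataFrame([("b1",), ("b2",)], ["b"])

# Every row of a paired with every row of b: 3 x 2 = 6 rows.
df_a.crossJoin(df_b).show()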

Apply custom function to groupby in vaex

I want to apply some custom logic to each individual group obtained by groupby. It is easy to do this in pandas. How can I apply a custom function to the groups created by groupby in vaex?
For example, suppose I want to find the min index and max index of each group and, based on that, do some operation on the rows present in that group.
Is this possible in vaex?
I think vaex intentionally doesn't support this right now; see for example this GitHub issue: https://github.com/vaexio/vaex/issues/752.
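That said, the min-index/max-index part of the example can be done with vaex's built-in aggregators, without a custom per-group function; a minimal sketch (the explicit idx column is an assumption, added because vaex rows don't carry a pandas-style index):
import numpy as np
import vaex

# Hypothetical data with an explicit row-index column and a group key.
df = vaex.from_arrays(
    idx=np.arange(6),
    g=np.array(["a", "a", "b", "b", "b", "c"]),
    value=np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
)

# Min and max row index per group via built-in aggregators.
bounds = df.groupby(by="g", agg={
    "idx_min": vaex.agg.min("idx"),
    "idx_max": vaex.agg.max("idx"),
})
print(bounds)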

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the column user_id (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant column).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the column user_id.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
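Putting the three steps together (variable names taken from the steps above; broadcast comes from pyspark.sql.functions), the final count would look something like:
from pyspark.sql.functions import broadcast

dfAuseridDist.join(broadcast(dfBuseridDist), ["user_id"], how="inner").count()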
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner') \
    .select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id"]) \
    .join(broadcast(dfB.select("user_id").dropDuplicates(["user_id"])), ['user_id'], how='inner') \
    .select('user_id').count()
OR the version with distinct should work as well:
dfA.select("user_id").distinct() \
    .join(broadcast(dfB.select("user_id").distinct()), ['user_id'], how='inner') \
    .select('user_id').count()
