Avoid duplication of date column for child entity - featuretools

I have a simple entity set, parent1 <- child -> parent2, and I need to use a cutoff dataframe. My target is parent1, and it is accessible at any prediction time. I want to specify a date column only for parent2 so that this time information can be joined to the child. It doesn't work this way, and I get data leakage in the first-level features built from the parent1-child relationship. The only thing I can do is duplicate the date column onto the child too. Is it possible to normalize the child while avoiding the date column?
Example. Imagine we have 3 entities: basic player information (parent1, with "player_name"), match information (parent2, with "country"), and their combination (child, with "n_hits" for one player in one specific match):
import featuretools as ft
import pandas as pd
players = pd.DataFrame({"player_id": [1, 2, 3],
                        "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})
es = ft.EntitySet()
es = es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})
es = es.add_relationship(ft.Relationship(es["players"]["player_id"],
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"],
                                         es["player_stats"]["match_id"]))
Here I want to use all the information available to me on January 15th, so only the data from the first match is legal to use, not the second.
cutoff_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})
fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df,
                      cutoff_time_in_index=True, agg_primitives=["mean"])
fm
I get:
player_name MEAN(player_stats.n_hits)
player_id time
1 2014-01-15 Oleg 30
2 2014-01-15 Kirill 30
3 2014-01-15 Max 50
The only way I know to set a proper match_date on player_stats is to join this information from matches:
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50],
    "match_date": pd.to_datetime(
        ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20'])  ## a result of the join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
And I get the expected result
player_name MEAN(player_stats.n_hits)
player_id time
1 2014-01-15 Oleg 20.0
2 2014-01-15 Kirill 30.0
3 2014-01-15 Max NaN

Featuretools is very conservative when it comes to the time index of an entity. We try not to infer a time index if it isn't provided. Therefore, you have to create the duplicate column as you suggest.
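For reference, a minimal sketch of that duplication, assuming the dataframes from the question: compute match_date on player_stats with a pandas merge instead of hard-coding the dates, then register the entity exactly as in the second snippet above (i.e. when building the EntitySet).

# pull match_date from matches into the child dataframe so it can serve as a time index
player_stats = player_stats.merge(matches[["match_id", "match_date"]],
                                  on="match_id", how="left")
# ...then, while building the EntitySet, register it with time_index="match_date"
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})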

Related

How to let pandas groupby add a count column for each group after applying list aggregations?

I have a pandas DataFrame:
df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following after reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).agg("count")
But all of them gave either incorrect results (option 3) or an error (options 1 and 2).
How to solve this?
We can try named aggregation:
d = {c: (c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
col_2 col_3 count
col_1
apple [1, 8] [56, 22] 2
banana [4, 8, 6] [4, 1, 5] 3
Just adding another approach to solve the problem:
x = df.groupby('col_1')
x.agg({'col_2': list, 'col_3': list}).join(x['col_2'].count().rename('count')).reset_index()
Output
col_1 col_2 col_3 count
0 apple [1, 8] [56, 22] 2
1 banana [4, 8, 6] [4, 1, 5] 3
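One detail worth noting across these answers: "size" counts every row in a group, while "count" only counts non-null values, so the two can differ when there is missing data. A tiny sketch with hypothetical data (not from the question):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({"col_1": ["apple", "apple"], "col_2": [1, np.nan]})
g = tmp.groupby("col_1")["col_2"]
print(g.size().iloc[0], g.count().iloc[0])  # size counts both rows (2), count skips the NaN (1)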

Python get index of class list on condition

I have the following code:
import numpy as np
np.random.seed(1)
class Order:
    def __init__(self, amount, orderId):
        self.orderId = orderId
        self.amount = amount

n = 5
amounts = np.random.randint(low=8, high=20, size=n)
orderIds = np.arange(n)
orders = [Order(amounts[i], orderIds[i]) for i in range(n)]
I would like to get the list of orderIds for which the cumulative sum of amount over the list sorted in descending order exceeds some threshold.
Example:
orderIds = [0, 1, 2, 3, 4]
amounts = [15, 18, 19, 16, 10]
descending_ordered_amounts = [19, 18, 16, 15, 10]
cumsum = [19, 37, 53, 68, 78]
threshold = 55
cumsum > threshold # [False, False, False, True, True]
Then I would like to get Ids = [0,4]
What would be the fastest way to get that please?
The main problem here is to sort the ids along with the values so you know which ones to return.
It can be done with something like this, using an intermediate list of (amount, orderId) tuples:
orderIds = [0, 1, 2, 3, 4]
amounts = [15, 18, 19, 16, 10]
des = [(x, y) for x, y in zip(amounts, orderIds)]
des.sort(key=lambda x: x[0], reverse=True)
sortedIds = np.array([x[1] for x in des])
threshold = 55
ids = np.cumsum([x[0] for x in des]) > threshold
print(sortedIds[ids])
That will print the ids that satisfy the requirement. I did not use a variable descending_ordered_amounts, since those values are stored in the first column of des, so I just used a list comprehension inside cumsum.
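If you are happy to work directly on the NumPy arrays from the question (an assumption; the Order objects are bypassed), a vectorized sketch using np.argsort avoids building the intermediate tuples:

import numpy as np

orderIds = np.array([0, 1, 2, 3, 4])
amounts = np.array([15, 18, 19, 16, 10])
threshold = 55

# indices that sort amounts in descending order
order = np.argsort(amounts)[::-1]
# flag positions where the running total exceeds the threshold
mask = np.cumsum(amounts[order]) > threshold
print(orderIds[order][mask])  # [0 4]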

Retrieving original data from PyTorch nn.Embedding

I'm passing a dataframe with 5 categories (e.g. car, bus, ...) into nn.Embedding.
When I look at embedding.parameters(), I can see that there are 5 embedding vectors, but how do I know which index corresponds to which original input (e.g. car, bus, ...)?
You can't, as tensors are unnamed (only dimensions can be named; see PyTorch's Named Tensors).
You have to keep the names in a separate data container, for example (4 categories here):
import pandas as pd
import torch
df = pd.DataFrame(
    {
        "bus": [1.0, 2, 3, 4, 5],
        "car": [6.0, 7, 8, 9, 10],
        "bike": [11.0, 12, 13, 14, 15],
        "train": [16.0, 17, 18, 19, 20],
    }
)
df_data = df.to_numpy().T
df_names = list(df)
embedding = torch.nn.Embedding(df_data.shape[0], df_data.shape[1])
embedding.weight.data = torch.from_numpy(df_data)
Now you can simply use it with any index you want:
index = 1
embedding(torch.tensor(index)), df_names[index]
This would give you (tensor([6., 7., 8., 9., 10.], dtype=torch.float64), "car"), i.e. the data and the respective column name.
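If you need the full mapping rather than a single lookup, a small sketch using the embedding and df_names defined above:

# map every original column name to its embedding vector
name_to_vector = {name: embedding(torch.tensor(i)) for i, name in enumerate(df_names)}
print(name_to_vector["car"])  # the "car" row: 6., 7., 8., 9., 10.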

delete duplicated rows based on conditions pandas

I want to delete rows in a dataframe if (x1, x2, x3) are the same across different rows, and save the ids of all deleted rows in a variable.
For example, with this data, I want to delete the second row:
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13], 'x3': [12, 12, 2, 22], 'x4': [24, 24,9, 12]}
df = pd.DataFrame(data=d)
# input data
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13],
     'x3': [12, 12, 2, 22], 'x4': [24, 24, 9, 12]}
df = pd.DataFrame(data=d)
# create a new column where the contents of the x1, x2 and x3 columns are merged
df['MergedColumn'] = df[df.columns[1:4]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
# remove duplicates based on the created column, then drop the created column
df1 = pd.DataFrame(df.drop_duplicates("MergedColumn", keep='first').drop(columns="MergedColumn"))
# print the output dataframe
print(df1)
# merge the two dataframes
df2 = pd.merge(df, df1, how='left', on='id')
# find rows with null values in the right table (rows that were removed)
df2 = df2[df2['x1_y'].isnull()]
# print the ids of the rows that were removed
print(df2['id'])
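A more direct alternative is a sketch using pandas' duplicated/drop_duplicates with a subset (applied to the original df from the question), which avoids the helper column and the merge:

# ids of rows that duplicate an earlier row on (x1, x2, x3)
removed_ids = df.loc[df.duplicated(subset=['x1', 'x2', 'x3'], keep='first'), 'id']
# dataframe with those rows dropped
df1 = df.drop_duplicates(subset=['x1', 'x2', 'x3'], keep='first')
print(removed_ids.tolist())  # ['i2']
print(df1)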

Incorrect ArrayType elements inside Pyspark pandas_udf

I am using Spark 2.3.0 and trying the pandas_udf user-defined functions within my Pyspark code. According to https://github.com/apache/spark/pull/20114, ArrayType is currently supported. My user-defined function is:
def transform(c):
    if not any(isinstance(x, (list, tuple, np.ndarray)) for x in c.values):
        nvalues = c.values
    else:
        nvalues = np.array(c.values.tolist())
    tvalues = some_external_function(nvalues)
    if not any(isinstance(y, (list, tuple, np.ndarray)) for y in tvalues):
        p = pd.Series(np.array(tvalues))
    else:
        p = pd.Series(list(tvalues))
    return p

transform = pandas_udf(transform, ArrayType(LongType()))
When I apply this function to a specific array column of a large Spark DataFrame, I notice that the first element of the pandas Series c has double the size of the others, and the last one has size 0:
0 [73, 10, 223, 46, 14, 73, 14, 5, 14, 21, 10, 2...
1 [223, 46, 14, 73, 14, 5, 14, 21, 30, 16]
2 [46, 14, 73, 14, 5, 14, 21, 30, 16, 15]
...
4695 []
Name: _70, Length: 4696, dtype: object
So the first array has 20 elements, the second 10 (which is the correct size), and the last one 0. Then, of course, c.values fails with ValueError: setting an array element with a sequence., since the arrays have different sizes.
When I try the same function on a column with arrays of strings, all sizes are correct, and the rest of the function's steps work as well.
Any idea what might be the issue? Possible bug?
UPDATED with example:
I am attaching a simple example, just printing the values inside the pandas_udf function.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("testing pandas_udf")\
        .getOrCreate()

    arr = []
    for i in range(100000):
        arr.append([2, 2, 2, 2, 2])

    df = spark.createDataFrame(arr, ArrayType(LongType()))

    def transform(c):
        print(c)
        print(c.values)
        return c

    transform = pandas_udf(transform, ArrayType(LongType()))

    df = df.withColumn('new_value', transform(col('value')))
    df.show()

    spark.stop()
If you check the executor's log, you will see output like:
0 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1 [2, 2, 2, 2, 2]
2 [2, 2, 2, 2, 2]
3 [2, 2, 2, 2, 2]
4 [2, 2, 2, 2, 2]
5 [2, 2, 2, 2, 2]
...
9996 [2, 2, 2, 2, 2]
9997 [2, 2, 2, 2, 2]
9998 []
9999 []
Name: _0, Length: 10000, dtype: object
SOLVED:
If you face the same issue, upgrade to Spark 2.3.1 and pyarrow 0.9.0.post1.
Yeah, it looks like there is a bug in Spark. My situation involves Spark 2.3.0 and PyArrow 0.13.0. The only remedy available to me is to convert the array into a string and then pass that to the pandas UDF.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def _identity(sample_array):
    # parse the comma-joined string back into a list of ints
    return sample_array.apply(lambda e: [int(i) for i in e.split(',')])

array_identity_udf = F.pandas_udf(_identity,
                                  returnType=ArrayType(IntegerType()),
                                  functionType=F.PandasUDFType.SCALAR)

test_df = (spark
           .table('test_table')
           .repartition(F.ceil(F.rand(seed=1234) * 100))
           .cache())

test1_df = (test_df
            .withColumn('array_test',
                        array_identity_udf(F.concat_ws(',', F.col('sample_array')))))
