I recently discovered an interesting behavior of Python where it doesn't generate unique object IDs after instantiating a couple of new list objects.
Let me demonstrate:
print('id([1, 2, 3])= ', id([1, 2, 3]))
a = [1, 2, 3]
print('a= ', a)
print('id(a)= ', id(a))
print('id([1, 2, 3])= ', id([1, 2, 3]))
print('id(a)= ', id(a))
The output in the terminal:
id([1, 2, 3])= 140117092252416
a= [1, 2, 3]
id(a)= 140117092252416
id([1, 2, 3])= 140117090393920
id(a)= 140117092252416
Despite calling [1, 2, 3] multiple times, there are two unique object IDs in the output. I got confused: shouldn't they all be the same?
Why should these objects have the same ID? They are different instances that just happen to have the same content. Small string and numeric literals can share the same ID, but that is an interning optimization that is only allowed because they are immutable.
If the two lists had the same ID, it would mean they were the same list instance. That would also mean that you could change a by doing this:
a = [1, 2, 3]
[1, 2, 3].append(4)
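What may add to the confusion is that the first id([1, 2, 3]) in the output matches id(a). That is just CPython reusing a freed memory address, not content-based identity. A minimal sketch (the reuse is an implementation detail, not a guarantee):
# The temporary list passed to id() is garbage collected right after the call,
# so the next allocation may land at the same address and report the same id()
print(id([1, 2, 3]))  # temporary list, discarded immediately
print(id([1, 2, 3]))  # a brand-new list that may reuse that address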
I have a pandas DataFrame:
df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).agg("count")
But all of these gave either incorrect results (option 3) or an error (options 1 and 2).
How can I solve this?
We can try named aggregation:
# each listed column is aggregated into a list per group
d = {c: (c, list) for c in ('col_2', 'col_3')}
# 'count' is built from any existing column using the 'size' aggregation
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
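Named aggregation (available since pandas 0.25) lets you declare each output column as a (source column, aggregation) pair, which is why 'count' here can simply reuse col_2 with the 'size' aggregation.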
Or we can calculate the size of each group separately, then join it with the dataframe that contains the columns aggregated as lists:
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
col_2 col_3 count
col_1
apple [1, 8] [56, 22] 2
banana [4, 8, 6] [4, 1, 5] 3
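If you want col_1 back as a regular column, as in the desired dictionary output, you can append .reset_index() to either expression.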
Just adding another performant approach to solve the problem:
x = df.groupby('col_1')
x.agg({'col_2': lambda s: list(s), 'col_3': lambda s: list(s)}).reset_index().join(
    x['col_2'].transform('count').rename('count'))
Output
col_1 col_2 col_3 count
0 apple [1, 8] [56, 22] 2
1 banana [4, 8, 6] [4, 1, 5] 3
I have a simple entity set, parent1 <- child -> parent2, and I need to use a cutoff dataframe. My target is parent1, and it is accessible at any prediction time. I want to specify a date column only on parent2 so that this time information can be joined to the child. It doesn't work this way, and I get data leakage in the first-level features built from the parent1-child entities. The only thing I can do is duplicate the date column onto the child as well. Is it possible to keep the child normalized and avoid the date column?
Example. Imagine we have three entities: boxing player information (parent1 with "name"), match information (parent2 with "country"), and their combination (child with "n_hits" for one specific match):
import featuretools as ft
import pandas as pd
players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})
es = ft.EntitySet()
es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})
es = es.add_relationship(ft.Relationship(es["players"]["player_id"],
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"],
                                         es["player_stats"]["match_id"]))
Here I want to use all the information available to me on January 15th, so only the information from the first match is legitimate, not from the second.
cutoff_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})
fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df,
                      cutoff_time_in_index=True, agg_primitives=["mean"])
fm
I got the following (note that player 1's mean of 30 averages the n_hits from both matches, including the second match played after the cutoff, which is exactly the leakage):
player_name MEAN(player_stats.n_hits)
player_id time
1 2014-01-15 Oleg 30
2 2014-01-15 Kirill 30
3 2014-01-15 Max 50
The only way I know to give player_stats a proper match_date is to join this information in from matches:
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50],
    "match_date": pd.to_datetime(
        ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20'])  ## a result of join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
And I get the expected result
player_name MEAN(player_stats.n_hits)
player_id time
1 2014-01-15 Oleg 20.0
2 2014-01-15 Kirill 30.0
3 2014-01-15 Max NaN
Featuretools is very conservative when it comes to the time index of an entity. We try not to infer a time index if it isn't provided. Therefore, you have to create the duplicate column as you suggest.
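If it helps, that duplicated column does not need to be typed in by hand; a minimal sketch (reusing the matches and player_stats frames defined in the question) is to merge it in:
# copy match_date from matches onto player_stats so it can serve as a time index
player_stats = player_stats.merge(matches[["match_id", "match_date"]],
                                  on="match_id", how="left")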
I am using the Spark FP-Growth algorithm. I have set minSupport and minConfidence to 0, so I should get all combinations.
from pyspark.ml.fpm import FPGrowth
df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.0, minConfidence=0.0)
model = fpGrowth.fit(df)
# Display generated association rules.
model.associationRules.show()
The first problem is that my consequents always contain only one element.
For example, [1] -> [5, 2] should appear in the output: the frequency of [1] is 3, the frequency of [5, 2] is 2, and the frequency of [5, 2, 1] is 2, so this rule should show up.
The Spark implementation is written so that it only ever returns one element in the consequent.
You can check this in the link below:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
// the consequent always contains only one element
itemSupport.get(consequent.head))
This is from the MLlib package (the ML package uses the MLlib implementation).
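You can also confirm the supports quoted in the question from the fitted model itself; a quick check using the model built above:
# list every frequent itemset with its frequency; [1], [5, 2] and [5, 2, 1]
# all appear, yet the generated rules still have single-item consequents only
model.freqItemsets.show()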
Cheers,
I am starting to learn machine learning on my own.
I have a set of input variables (categorical and continuous): job (retired, manager, technician, etc.), education (high school, unknown, bachelor, master, etc.), duration of contact, age, marital status, etc. The output variable is yes or no (did the client agree to purchase the new product?).
First of all I want to analyze the dataset, but I do not know how to find the correlation between the input and output variables for discrete input data in Python.
Should I also remove all the missing data ("unknown" values)?
Two things come to mind:
1. Look at how the features are correlated
2. Look at what the statistical distributions of purchased vs. not-purchased cases look like
Feature correlation
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'job': ['retired', 'retired', 'manager', 'manager', 'manager', 'technician', 'technician', None, None],
    'education': ['high', 'high', 'master', 'unknown', 'master', 'master', 'high', 'unknown', 'master'],
    'duration_of_contact': [3, 1, 5, 3, 1, 9, 8, 3, 1],
    'age': [50, 65, 30, 29, 38, 42, 25, 10, 10],
    'married': [1, 1, 0, 1, 0, 1, 0, 0, 0],
    'purchase': [0, 0, 1, 1, 1, 0, 0, 1, 1]
})
# corr() only covers the numeric columns; on recent pandas versions the
# non-numeric columns must be excluded explicitly with numeric_only=True
sns.heatmap(df.corr(numeric_only=True))
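For the categorical inputs themselves, one common option (a sketch, not the only way) is to one-hot encode them first so they can be correlated with the binary purchase target:
# one-hot encode the categorical columns and correlate each dummy with the target
encoded = pd.get_dummies(df[['job', 'education']], dummy_na=True, dtype=int)
encoded['purchase'] = df['purchase']
print(encoded.corr()['purchase'].sort_values(ascending=False))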
Statistical properties
You could look at the distributions of the two cases when purchase is True, and when purchase is False:
sns.boxplot(x="purchase", y="age", data=df)
See more plots you can use from seaborn: https://seaborn.pydata.org/tutorial/categorical.html
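A quick numeric complement to the plots, using the same toy dataframe, is to compare the per-class summary statistics of a continuous feature directly:
# summary statistics of age, split by the target
print(df.groupby('purchase')['age'].describe())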