Adding a column to a pandas DataFrame that enumerates rows based on unique values in another column [duplicate] - python-3.x

I feel like there is a better way than this:
import pandas as pd

df = pd.DataFrame(
    columns=" index c1 c2 v1 ".split(),
    data=[
        [0,  "A", "X", 3],
        [1,  "A", "X", 5],
        [2,  "A", "Y", 7],
        [3,  "A", "Y", 1],
        [4,  "B", "X", 3],
        [5,  "B", "X", 1],
        [6,  "B", "X", 3],
        [7,  "B", "Y", 1],
        [8,  "C", "X", 7],
        [9,  "C", "Y", 4],
        [10, "C", "Y", 1],
        [11, "C", "Y", 6],
    ],
).set_index("index", drop=True)

def callback(x):
    x['seq'] = range(1, x.shape[0] + 1)
    return x

df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
    c1 c2  v1  seq
0    A  X   3    1
1    A  X   5    2
2    A  Y   7    1
3    A  Y   1    2
4    B  X   3    1
5    B  X   1    2
6    B  X   3    3
7    B  Y   1    1
8    C  X   7    1
9    C  Y   4    1
10   C  Y   1    2
11   C  Y   6    3
Is there a way to do it that avoids the callback?

Use cumcount(); see the pandas groupby documentation.
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want the ordering to start at 1:
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
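To get the seq column the question asks for, this can be assigned straight back to the dataframe (a minimal follow-up sketch using the same df as above):
df['seq'] = df.groupby(['c1', 'c2']).cumcount() + 1
print(df)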

This might be useful if you want to build an ordered, arrow-separated sequence of items per group (note that it refers to a different example dataframe, with userID, date and ItemID columns):
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
It will create one '->'-joined item sequence per userID.
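Since the answer above does not show its data, here is a small self-contained sketch with a made-up dataframe (the userID, date and ItemID values are purely illustrative):
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "date":   ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"],
    "ItemID": ["a", "b", "c", "d", "e"],
})

df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate('->'.join).reset_index()
print(grp)
#    userID   ItemID
# 0       1     a->b
# 1       2  d->e->c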

If you have a dataframe similar to the one below and you want to add a seq column by building it from c1 or c2, i.e. keep a running count of similar values (or restart the count whenever a flag comes up) in other column(s), read on.
df = pd.DataFrame(
    columns=" c1 c2 seq".split(),
    data=[
        ["A",    1, 1],
        ["A1",   0, 2],
        ["A11",  0, 3],
        ["A111", 0, 4],
        ["B",    1, 1],
        ["B1",   0, 2],
        ["B111", 0, 3],
        ["C",    1, 1],
        ["C11",  0, 2],
    ],
)
then first find the group starters (str.contains() and eq() are used below, but any method that creates a boolean Series, such as lt(), ne() or isna(), works) and call cumsum() on the result to get a Series where each group has a unique identifying value. Then use that Series as the grouper in a groupby().cumcount() operation.
In summary, use code similar to the one below.
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
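A quick sanity check against the expected seq column already present in the example dataframe above (just a verification sketch):
computed = df.groupby(df['c2'].eq(1).cumsum()).cumcount().add(1)
print(computed.equals(df['seq']))  # True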

The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly, though generally without overwriting my df for this type of use case (e.g. as in Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()
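If you want the sequence to start at 1, as in the question's expected output, add 1 to the cumcount (same approach, just shifted):
df["seq"] = (
    df.sort_values(by=['c1', 'c2', 'v1'])
      .groupby(['c1', 'c2'])
      .cumcount() + 1
)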

Related

Python Pandas Dataframe: How to join 3 rows of data, separated by columns, then repeat this result for the 3 rows of data involved

Here's the pandas dataframe that I'm using to learn how to do this:
import pandas as pd
test_list = pd.DataFrame()
test_list["Item"] = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
test_list["Number"] = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"]
test_list["Combined Numbers"]= ""
Based on the dataframe above, I intend to combine up to 3 numbers at a time, separated by commas.
Following that, I intend to repeat the combined value for each of the test_list["Item"] and test_list["Number"] rows involved.
I've been scratching my head over this so far. The examples of groupby() I've seen combine information based on a given criterion, such as a duplicate value in a column. How can I work this out when I don't have such a column to group on?
Here's my intended goal:
Item  Number  Combined Numbers
A     1       1, 2, 3
B     2       1, 2, 3
C     3       1, 2, 3
D     4       4, 5, 6
E     5       4, 5, 6
F     6       4, 5, 6
G     7       7, 8, 9
H     8       7, 8, 9
I     9       7, 8, 9
J     10      10, 11
K     11      10, 11
Thank you
In your case, since you don't have a column with the groups, you can generate a range of the length of your dataframe (with np.arange) and floor-divide it by 3 (//3). Use groupby.transform to keep the original shape of your data and apply the join operation to the column.
test_list["Combined Numbers"] = (
test_list.groupby(np.arange(len(test_list))//3)
['Number'].transform(', '.join)
)
print(test_list)
# Item Number Combined Numbers
# 0 A 1 1, 2, 3
# 1 B 2 1, 2, 3
# 2 C 3 1, 2, 3
# 3 D 4 4, 5, 6
# 4 E 5 4, 5, 6
# 5 F 6 4, 5, 6
# 6 G 7 7, 8, 9
# 7 H 8 7, 8, 9
# 8 I 9 7, 8, 9
# 9 J 10 10, 11
# 10 K 11 10, 11
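If the Number column held actual integers rather than the string literals used in this question, the plain join would fail; a hedged variant that converts to strings first:
test_list["Combined Numbers"] = (
    test_list.groupby(np.arange(len(test_list)) // 3)
    ['Number'].transform(lambda s: ', '.join(s.astype(str)))
)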

How to normalize the entity having multiple values for the one feature in featuretools?

Below is an example:
buy_log_df = pd.DataFrame(
    [
        ["2020-01-02", 0, 1, 2, 2],
        ["2020-01-02", 1, 1, 1, 3],
        ["2020-01-02", 2, 2, 1, 1],
        ["2020-01-02", 3, 3, 3, 1],
    ],
    columns=['date', 'sale_id', 'customer_id', 'item_id', 'quantity'],
)
item_df = pd.DataFrame(
    [
        [1, 100],
        [2, 200],
        [3, 300],
    ],
    columns=['item_id', 'price'],
)
item_df2 = pd.DataFrame(
    [
        [1, '1 3 10'],
        [2, '1 3'],
        [3, '2 5'],
    ],
    columns=['item_id', 'tags'],
)
As you can see, each item in item_df2 has multiple tag values packed into one feature.
Here is what I've tried:
item_df2 = pd.concat([item_df2, item_df2['tags'].str.split(expand=True)], axis=1)
item_df2 = pd.melt(
    item_df2,
    id_vars=['item_id'],
    value_vars=[0, 1, 2],
    value_name="tags",
)
tag_log_df = item_df2[item_df2['tags'].notna()].drop("variable", axis=1).sort_values("item_id")
tag_log_df
tag_log_df
>>>
item_id tags
0 1 1
3 1 3
6 1 10
1 2 1
4 2 3
2 3 2
5 3 5
It looks like I can't normalize this item entity (from the buy_log entity) because it has duplicated item_ids in the table.
How can I handle this case when I design the entityset?
Thanks for the question. To handle multiple tag values, you can normalize the tags into a data frame before structuring the entity set.
buy_log_df
date sale_id customer_id item_id quantity
2020-01-02 0 1 2 2
2020-01-02 1 1 1 3
2020-01-02 2 2 1 1
2020-01-02 3 3 3 1
item_df
item_id price
1 100
2 200
3 300
tag_log_df
item_id tags
1 1
1 3
1 10
2 1
2 3
3 2
3 5
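One way to build tag_log_df from the original item_df2 above is str.split plus explode (a sketch, not the only option; Series.explode requires pandas >= 0.25):
tag_log_df = (
    item_df2.assign(tags=item_df2['tags'].str.split())
    .explode('tags')
    .reset_index(drop=True)
)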
With the normalized data, you can then structure the entity set.
import featuretools as ft

es = ft.EntitySet()
es.entity_from_dataframe(
    entity_id='buy_log',
    dataframe=buy_log_df,
    index='sale_id',
    time_index='date',
)
es.entity_from_dataframe(
    entity_id='item',
    dataframe=item_df,
    index='item_id',
)
es.entity_from_dataframe(
    entity_id='tag_log',
    dataframe=tag_log_df,
    index='tag_log_id',
    make_index=True,
)
parent = es['item']['item_id']
child = es['buy_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))
child = es['tag_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))
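From here, a possible next step is running Deep Feature Synthesis on the normalized entity set. This is only a sketch, assuming the same pre-1.0 featuretools API as above:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='item',  # build features describing each item
)
print(feature_matrix.head())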

How to select rows and columns that meet criteria from a list

Let's say I've got a pandas dataframe that looks like:
df1 = pd.DataFrame({"Item ID": ["A", "B", "C", "D", "E"],
                    "Value1": [1, 2, 3, 4, 0],
                    "Value2": [4, 5, 1, 8, 7],
                    "Value3": [3, 8, 1, 2, 0],
                    "Value4": [4, 5, 7, 9, 4]})
print(df1)
Item_ID Value1 Value2 Value3 Value4
0 A 1 4 3 4
1 B 2 5 8 5
2 C 3 1 1 7
3 D 4 8 2 9
4 E 0 7 0 4
Now I've got a second dataframe that looks like:
df2 = pd.DataFrame({"Item ID": ["A", "C", "D"], "Value5": [4, 5, 7]})
print(df2)
Item_ID Value5
0 A 4
1 C 5
2 D 7
What I want to do is find where the Item IDs match between my two dataframes, and then add the "Value5" column values to the intersection of those rows AND ONLY columns Value1 and Value2 from df1 (these columns could change every iteration, so they need to be held in a variable).
My output should show:
4 added to Row A, columns "Value1" and "Value2"
5 added to Row C, columns "Value1" and "Value2"
7 added to Row D, columns "Value1" and "Value2"
Item_ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4
Of course, my data is many thousands of rows long. I can do it with a for loop, but that takes way too long. I want to vectorize this in some way. Any ideas?
This is what I ended up doing, based on sammywemmy's suggestions:
# Take the column names and turn them into a list
names = df1.columns.tolist()
# Merge df1 and df2 based on 'Item_ID'
merged = df1.merge(df2, on='Item_ID', how='outer')
for i in range(len(names)):
    # using assign and **, we can bring in variable column names;
    # then add our Value5 column
    merged = merged.assign(**{names[i]: lambda x: x[names[i]] + x.Value5})
# Only keep the columns up to and including 'Value4'
df1 = merged.loc[:, :'Value4']
Try this:
#set 'Item ID' as the index
df1 = df1.set_index('Item ID')
df2 = df2.set_index('Item ID')
#create list of columns that you are interested in
list_of_cols = ['Value1','Value2']
#create two separate dataframes
#unselected will not contain the columns you want to add
unselected = df1.drop(list_of_cols,axis=1)
#this will contain the columns you wish to add
selected = df1.filter(list_of_cols)
#reindex df2 so it has the same indices as df1
#then convert to a series
#fill the null values with 0
A = df2.reindex(index=selected.index,fill_value=0).loc[:,'Value5']
#add the series A to selected
selected = selected.add(A,axis='index')
#combine selected and unselected into one dataframe
result = pd.concat([unselected,selected],axis=1)
#this part is extra, to get your dataframe back to its original column order
#the assumption here is that the columns are Value1, Value2, and so on (1 > 2 > 3)
#if your columns are not actually named Value1, Value2, etc.,
#then a different sorting has to be used
#alternatively, before the calculations you could create a mapping
#of the columns to numbers; that gives you a sorting mechanism
#to restore your dataframe after the calculations are complete
columns = sorted(result.columns,key = lambda x : x[-1])
#reindex back to the way it was
result = result.reindex(columns,axis='columns')
print(result)
Value1 Value2 Value3 Value4
Item ID
A 5 8 3 4
B 2 5 8 5
C 8 6 1 7
D 11 15 2 9
E 0 7 0 4
Alternative solution, using python's built-in dictionaries:
#create dictionaries
dict1 = (df1
         #create temporary column
         #and set as index
         .assign(temp=df1['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
dict2 = (df2
         .assign(temp=df2['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
list_of_cols = ['Value1','Value2']
intersected_keys = dict1.keys() & dict2.keys()
key_value_pair = [(key, col) for key in intersected_keys
                  for col in list_of_cols]
#check for keys that are in both dict1 and dict2
#loop through dict1 and add values from dict2
#can be optimized with a dict comprehension
#leaving as is for better clarity IMHO
for key, val in key_value_pair:
    dict1[key][val] = dict1[key][val] + dict2[key]['Value5']
#print(dict1)
{'A': {'Item ID': 'A', 'Value1': 5, 'Value2': 8, 'Value3': 3, 'Value4': 4},
'B': {'Item ID': 'B', 'Value1': 2, 'Value2': 5, 'Value3': 8, 'Value4': 5},
'C': {'Item ID': 'C', 'Value1': 8, 'Value2': 6, 'Value3': 1, 'Value4': 7},
'D': {'Item ID': 'D', 'Value1': 11, 'Value2': 15, 'Value3': 2, 'Value4': 9},
'E': {'Item ID': 'E', 'Value1': 0, 'Value2': 7, 'Value3': 0, 'Value4': 4}}
#create dataframe
pd.DataFrame.from_dict(dict1,orient='index').reset_index(drop=True)
Item ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4
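For completeness, a more compact vectorized sketch of the same idea, starting from the original df1 and df2 (assumption: the 'Item ID' values in df2 are unique; note that fillna turns the updated columns into floats):
s = df2.set_index('Item ID')['Value5']
list_of_cols = ['Value1', 'Value2']
df1[list_of_cols] = df1[list_of_cols].add(df1['Item ID'].map(s).fillna(0), axis=0)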

Remove all data in a DF by group based on a condition (pandas,python3)

I have a pandas DF like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
1 1 3
2 1 3
2 0 4
2 1 1
3 0 2
3 0 3
3 1 4
4 0 1
I want to remove all rows of a user's information after they have enrolled. Each user's chances to enroll are ordered in time. Expected output should look like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
2 1 3
3 0 2
3 0 3
3 1 4
Hoping someone could help me!
EDIT: An additional example, from the comments, that the correct answer should also handle (users who never enroll keep all their rows):
User Enrolled Time
4 0 1
4 0 2
4 0 3
5 0 1
I think what you're looking for is a groupby followed by an apply which does the correct logic for each user. For example:
df = pd.DataFrame([[1, 0, 12],
                   [1, 0, 1],
                   [1, 1, 2],
                   [1, 1, 3],
                   [2, 1, 3],
                   [2, 0, 4],
                   [2, 1, 1],
                   [3, 0, 2],
                   [3, 0, 3],
                   [3, 1, 4]],
                  columns=['User', 'Enrolled', 'Time'])

def filter_enrollment(df):
    enrolled = df[df.Enrolled == 1].index.min()
    return df[df.index <= enrolled]

result = df.groupby('User').apply(filter_enrollment).reset_index(drop=True)
The result is:
>>> print(result)
User Enrolled Time
0 1 0 12
1 1 0 1
2 1 1 2
3 2 1 3
4 3 0 2
5 3 0 3
6 3 1 4
Here I'm assuming your rows are in time order. If you want to explicitly filter by the Time column instead, just change index to Time in the filter function.
Edit: to handle the edited question (users who never enroll), you can change the filter function to something like this:
def filter_enrollment(df):
    enrolled = df[df.Enrolled == 1].index.min()
    if pd.isnull(enrolled):
        return df
    else:
        return df[df.index <= enrolled]
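If you'd rather avoid the apply entirely, here is a vectorized sketch of the same logic (same assumption that rows are already time-ordered within each user): keep a row only if no earlier row of that user has Enrolled == 1.
# number of enrollments strictly before the current row, within each user
prior_enrollments = df.groupby('User')['Enrolled'].cumsum() - df['Enrolled']
result = df[prior_enrollments.eq(0)].reset_index(drop=True)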

How do I filter to match or exclude certain fields in Octave?

Using Octave 3.0.5 on CentOS 5.8, I need to filter rows out of a larger matrix for various analyses.
For example, I have an array that looks like this:
A = { [ 0, 5, 32 ],
[ 0, 3, 2 ],
[ 1, 4, 13 ],
[ 1, 2, 32 ],
[ 2, 7, 99 ],
[ 2, 0, 42 ] };
Now I need to be able to extract all rows where the first value is equal to 1, or maybe where the second value is greater than 3, etc. I've tried reading the documentation and searching for examples, but I'm just not seeing it.
Thanks!
You can use cellfun to go through the cell array and get a logical index:
octave> cellfun (@(x) x(1) == 1 || x(2) > 3, A)
ans =
1
0
1
1
1
0
Using your example:
octave> A(cellfun (@(x) x(1) == 1 || x(2) > 3, A))
ans =
{
[1,1] =
0 5 32
[2,1] =
1 4 13
[3,1] =
1 2 32
[4,1] =
2 7 99
}
An alternative, which is likely to be much faster and simpler to read, is to ditch the cell array completely and use a matrix instead (as long as each cell in the cell array has the same size, a matrix makes a lot more sense, even if you need a multi-dimensional matrix):
octave> B = cell2mat (A);
octave> B(B(:,1) == 1 | B(:,2) > 3, :)
ans =
0 5 32
1 4 13
1 2 32
2 7 99
