How to create a separate df after applying groupby? - python-3.x

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is to:
For each Product, every Step must be grabbed and the order must not be changed, that is, if we look at Product 1, after Step 8, there is a 1 coming and that 1 must be after 8 only. So, the expected output for product 1 and product 3 should be of the order: 1, 3, 6, 8, 1, 4; for the product 2 it must be: 2, 4, 8, 3, 1.
Update:
Here, I only want one value of 6 for product 1 and 3, since in the main df both the 6 next to each other, but both the values of 1 must be present since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the below example: Product 1 and 3 have same Steps, so they must be grouped together)
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].

IIUC Create the Newkey by using cumsum and diff
df['Newkey']=df.groupby('Product').Step.apply(lambda x : x.diff().ne(0).cumsum())
df.drop_duplicates(['Product','Newkey'],inplace=True)
s=df.groupby('Product').Step.apply(tuple)
s.reset_index().groupby('Step').Product.apply(list)
Step
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object

groupby preservers the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, but not greatly performant, solution would be to apply(tuple), since they are hashable allowing you to group on them to see which Products are identical. form_seq will make it so that consecutive values only appear once in the list of steps before forming the tuple.
def form_seq(x):
x = x[x != x.shift()]
return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
s.groupby(s).groups
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or if you'd like a DataFrame:
s.reset_index().groupby('Step').Product.apply(list)
#Step
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
The values of that dictionary are the groupings of products that share the step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.

Another very similar way:
df_no_dups=df[df.shift()!=df].dropna(how='all').ffill()
df_no_dups_grouped=df_no_dups.groupby('Product')['Step'].apply(list)

Related

Print a groupby object for a specific group/groups only

I need to print the result of groupby object in Python for a specific group/groups only.
Below is the dataframe:
import pandas as pd
df = pd.DataFrame({'ID' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'Entry' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6]})
print("\n df = \n",df)
In order to group the dataferame by ID and print the result I used these codes:
grouped_by_unit = df.groupby(by="ID")
print("\n", grouped_by_unit.apply(print))
Can somebody please let me know below two things:
How can I print the data frame grouped by 'ID=1' only?
I need to get the below output:
Likewise, how can I print the data frame grouped by 'ID=1' and 'ID=4' together?
I need to get the below output:
You can iterate over the groups for example with for-loop:
grouped_by_unit = df.groupby(by="ID")
for id_, g in grouped_by_unit:
if id_ == 1 or id_ == 4:
print(g)
print()
Prints:
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
ID Entry
12 4 1
13 4 2
14 4 3
15 4 4
16 4 5
17 4 6
You can use get_group function:
df.groupby(by="ID").get_group(1)
which prints
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
You can use the same method to print the group for the key 4.

How to aggregate n previous rows as list in Pandas DataFrame?

As the title says:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
Having a dataframe with 10 values we want to aggregate say last 5 rows and put them as list into a new column:
>>> a new_col
0
0 1
1 2
2 3
3 4
4 5 [1,2,3,4,5]
5 6 [2,3,4,5,6]
6 7 [3,4,5,6,7]
7 8 [4,5,6,7,8]
8 9 [5,6,7,8,9]
9 10 [6,7,8,9,10]
How?
Due to how rolling windows are implemented, you won't be able to aggregate the results as you expect, but we still can reach your desired result by iterating each window and storing the values as a list of values:
>>> new_col_values = [
window.to_list() if len(window) == 5 else None
for window in df["column"].rolling(5)
]
>>> df["new_col"] = new_col_values
>>> df
column new_col
0 1 None
1 2 None
2 3 None
3 4 None
4 5 [1, 2, 3, 4, 5]
5 6 [2, 3, 4, 5, 6]
6 7 [3, 4, 5, 6, 7]
7 8 [4, 5, 6, 7, 8]
8 9 [5, 6, 7, 8, 9]
9 10 [6, 7, 8, 9, 10]

Best way to get the column filtered from a list in dataframe

Just would like to different approached to get a list matched/filtered against df.columns in a pandas.
Below snippet works perfectly but looking forward for other approaches around.
Even we can consider a function, excuse my brevity as just learning pandas.
# list of columns names to be matched & checked
>>> matchObj = ['equity01', 'equity02', 'equity1' 'equity2']
# DataFrame construct
>>> df = pd.DataFrame({'equity01': [1, 2, 3], 'equity02': [4, 5, 6], 'equity03': [7, 8, 9], 'equity04': [2, 3, 4], 'equity05': [5, 6, 7]})
>>> df
equity01 equity02 equity03 equity04 equity05
0 1 4 7 2 5
1 2 5 8 3 6
2 3 6 9 4 7
# One way to with list comprehension as follows..
>>> print(df[[col for col in matchObj if col in df.columns]])
equity01 equity02
0 1 4
1 2 5
2 3 6
Thank a mile in advanced for any suggestion and solutions around.
Yes, using pd.Index.intersection():
df[df.columns.intersection(matchObj)]
equity01 equity02
0 1 4
1 2 5
2 3 6
Using pd.Index.isin()
df.loc[:,df.columns.isin(matchObj)]
equity01 equity02
0 1 4
1 2 5
2 3 6

How to extract data using python from a text file

I have been having troubles with extracting reading/manipulating/extracting data from a txt file. In the text file it has a general header with various information that is setup something like this below just as an example:
~ECOLOGY
~LOCATION
LAT: 59
LONG: 23
~PARAMETERS
Area. 8
Distribution. 3
Diversity. 5
~DATA X Y CONF DECID PEREN
3 6 1 3 0
7 2 4 2 1
4 8 0 6 2
9 9 6 2 0
2 3 2 5 4
6 5 0 2 7
7 1 2 4 2
I want to be able to extract the headers of the columns and use the headers of the columns as an index or key since sometimes the types of column data can change between files and the amount of rows of data can fluctuate as well. I want to be able to read the data in each column so that pending on location I can sum or add columns such as show below and export it as a separate file:
~DATA X Y CONF DECID PEREN TOTAL
3 6 1 3 0 4
7 2 4 2 1 7
4 8 0 6 2 8
9 9 6 2 0 8
2 3 2 5 4 11
6 5 0 2 7 9
7 1 2 4 2 8
Any suggestions?
This is what I have so far:
E = open("ECOLOGY.txt", "r")
with open(path) as E:
for i, line in enumerate(E):
sep_lines = line.rsplit()
if "~DATA" in sep_lines:
key =(line.rsplit())
key.remove('~DATA')
for j, value in enumerate(key):
print (j,value)
print (key)
dict = {L: v for v, L in enumerate(key)}
print(dict)
Life would be much easier for you if you learned a smidgen of Pandas. But you can do it without.
with open('ttl.txt') as ttl:
for _ in range(10):
next(ttl)
first = True
for line in ttl:
line = line.rstrip()
if first:
first = False
labels = line.split()+['TOTAL']
fmt = 7*'{:<9s}'
print (fmt.format(*labels))
else:
numbers = [int(_) for _ in line.split()]
total = sum(numbers[-3:])
other_items = numbers + [total]
fmt = 6*'{:<9d}'
fmt = '{:<9s}'+fmt
print (fmt.format('', *other_items))
~DATA X Y CONF DECID PEREN TOTAL
3 6 1 3 0 4
7 2 4 2 1 7
4 8 0 6 2 8
9 9 6 2 0 8
2 3 2 5 4 11
6 5 0 2 7 9
7 1 2 4 2 8
next skips lines in the input file. You can use split() to split input lines on whitespace, the use formatting to put items back together as you want them.
This a very basic, frail, format depending solution. But I hope it can help you.
with open("test.txt") as f:
data_part_reached = False
for line in f:
if "~DATA" in line:
column = [[elem] for elem in line.split(" ") if elem not in (" ", "", "\n", "~DATA")]
data_part_reached = True
elif data_part_reached:
values = [int(elem) for elem in line.split(" ") if elem not in (" ", "", "\n")]
for i in range(len(columns)):
columns[i].append(values[i])
columns =
[['X', 3, 7, 4, 9, 2, 6, 7],
['Y', 6, 2, 8, 9, 3, 5, 1],
['CONF', 1, 4, 0, 6, 2, 0, 2],
['DECID', 3, 2, 6, 2, 5, 2, 4],
['PEREN', 0, 1, 2, 0, 4, 7, 2],
['TOTAL', 4, 7, 8, 8, 11, 9, 8]]
This will get you a list of lists where the first element of each list is the header and the rest are the values. I casted the values to int since you said you want to operate with them. You can turn this list into a dict where the key is the header and the list of values of each column are the value if you want, like this.
d = {}
for column in columns:
d[column.pop(0)] = column
d =
{'DECID': [3, 2, 6, 2, 5, 2, 4],
'PEREN': [0, 1, 2, 0, 4, 7, 2],
'CONF': [1, 4, 0, 6, 2, 0, 2],
'X': [3, 7, 4, 9, 2, 6, 7],
'TOTAL': [4, 7, 8, 8, 11, 9, 8],
'Y': [6, 2, 8, 9, 3, 5, 1]}
Create a empty dictionary to store all needed data.
Read from the file object as E and loop until you reach a line starting with ~DATA.
Then split the header items, append TOTAL and then break from the loop.
Create a list to store the remaining data.
Loop to split the data and then append the sum total.
The list will append each list of data.
Loop ends and then adds to list of lists to the dictionary.
dic = {}
with open("ECOLOGY.txt") as E:
for line in E:
if line[:5] == '~DATA':
dic['header'] = line.split()[1:] + ['TOTAL']
break
data = []
for line in E:
cols = line.split()
cols.append(sum([int(num) for num in cols[2:]]))
data.append(cols)
dic['data'] = data
The dictionary will be i.e. {'header': [...], 'data': [[...], ...]}
edit: Added missing dic declaration at the beginning of code.

pandas how to derived values for a new column base on another column

I have a dataframe that has a column that each value is a list, now I want to derive a new column which only considers list whose size is greater than 1, and assigns a unique integer to the corresponding row as id.
A sample dataframe is like,
document_no_list cluster_id
[1,2,3] 1
[4,5,6,7] 2
[8] nan
[9,10] 3
column cluster_id only considers the 1st, 2nd and 4th row, each of which has a size greater than 1, and assigns a unique integer id to its corresponding cell in the column.
I am wondering how to do that in pandas.
We can use np.random.choice for unique random values with .loc for assignment i.e
df = pd.DataFrame({'document_no_list' :[[1,2,3],[4,5,6,7],[8],[9,10]]})
x = df['document_no_list'].apply(len) > 1
df.loc[x,'Cluster'] = np.random.choice(range(len(df)),x.sum(),replace=False)
Output :
document_no_list Cluster
0 [1, 2, 3] 2.0
1 [4, 5, 6, 7] 1.0
2 [8] NaN
3 [9, 10] 3.0
If you want continuous numbers then you can use
df.loc[x,'Cluster'] = np.arange(x.sum())+1
document_no_list Cluster
0 [1, 2, 3] 1.0
1 [4, 5, 6, 7] 2.0
2 [8] NaN
3 [9, 10] 3.0
Hope it helps
Create a boolean column based on condition and apply cumsum() on rows with 1's
df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x)> 1).astype(int)
df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()
document_no_list cluster_id
0 [1, 2, 3] 1
1 [4, 5, 6, 7] 2
2 [8] 0
3 [9, 10] 3

Resources