value count for an attribute from the column when there are multiple values for the attribute - python-3.x

I am trying to count and visualize the Netflix dataset by the country column, but when I checked the dataset I found that some rows in that column contain multiple values for country in a single cell (several country names separated by commas).
The following is the code I use to count:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries.shape
So I want to count each country in those rows individually to get the correct per-country count.

You can split the country column on "," and then call .explode(). The next step is .groupby():
df = df['country'].apply(lambda x: x.split(',')).explode().to_frame()
print( df.groupby('country').agg('size') )
Prints:
country
Austria 1
Canada 1
Germany 1
India 2
United Kingdom 1
United States 1
dtype: int64
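As a minimal sketch of the same split-and-explode idea, the version below also strips the whitespace that a plain split on "," leaves behind when the values are written as "A, B". The sample rows are made up for illustration; the real dataset is only assumed to look similar.

```python
import pandas as pd

# Hypothetical sample rows, assumed to resemble the 'country' column
df = pd.DataFrame(
    {"country": ["United States, India", "India", None, "Canada,United Kingdom"]}
)

country_counts = (
    df["country"]
    .dropna()          # skip rows with no country at all
    .str.split(",")    # "A, B" -> ["A", " B"]
    .explode()         # one row per country
    .str.strip()       # " B" -> "B"
    .value_counts()    # counts, largest first
)
print(country_counts.head(10))
```

Because value_counts() already sorts in descending order, head(10) gives the top countries without a separate sort step.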

You can compile all possible values from your 'country' column, make a set out of it and create new columns for each.
Then you can iterate over the rows and set a column to 1 if it appears in that row's 'country':
import pandas as pd
df = pd.DataFrame({"country":["A,B,C","A,D,E,F","G"]})
print(df)
df[[*sorted(set(','.join(df["country"]).split(",")))]] = 0
for row in df.iterrows():
    row[1][ [*(row[1]["country"].split(","))]] = 1
print(df)
Output:
country A B C D E F G
0 A,B,C 1 1 1 None None None None
1 A,D,E,F 1 None None 1 1 1 None
2 G None None None None None None 1
If you'd rather have 0 instead of None, use df.fillna(0, inplace=True) to convert them:
# 0 instead of None
df.fillna(value=0, inplace=True)
print(df)
# print sums
for c in df.columns:
    if c == "country":
        continue
    print(f"{c} {df[c].sum()}")
Output:
country A B C D E F G
0 A,B,C 1 1 1 0 0 0 0
1 A,D,E,F 1 0 0 1 1 1 0
2 G 0 0 0 0 0 0 1
A 2
B 1
C 1
D 1
E 1
F 1
G 1
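A possibly shorter route to the same indicator columns, sketched here with the same toy data, is pandas' Series.str.get_dummies, which splits on a separator and builds one 0/1 column per distinct value. It assumes the values contain no stray spaces around the commas.

```python
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})

# One 0/1 indicator column per distinct value found after splitting on ","
indicators = df["country"].str.get_dummies(sep=",")

# Column sums are the per-value counts
counts = indicators.sum()
print(counts)
```

This avoids the explicit row loop, at the cost of materializing one column per distinct value.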

Related

Get all columns per id where the column is equal to a value

Say I have a pandas dataframe:
id A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 1
id4 0 0 0 0
I want to select, per id, all the columns whose value is equal to 1; this list will then become a new column in the dataframe.
Expected output:
id A B C D Result
id1 0 1 0 1 [B,D]
id2 1 0 0 1 [A,D]
id3 0 0 0 1 [D]
id4 0 0 0 0 []
I tried df.apply(lambda row: row[row == 1].index, axis=1), but the 'Result' output was not in the form specified above.
You can do what you are trying to do adding .tolist():
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
That said, storing lists as values inside a single column goes against the pandas approach of keeping data tabular (only one value per cell). It will probably be better to use nested lists instead of pandas for what you are trying to do.
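For reference, here is a self-contained sketch using the data from the question. It swaps the row-wise apply for a list comprehension over a boolean array, which is a variation on the answer above rather than the same code; the names value_cols and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]],
    index=["id1", "id2", "id3", "id4"],
    columns=["A", "B", "C", "D"],
)

value_cols = ["A", "B", "C", "D"]        # columns to inspect
is_one = df[value_cols].eq(1).to_numpy()  # boolean array, one row per id

# For each row, keep the labels of the columns where the value is 1
df["Result"] = [df[value_cols].columns[mask].tolist() for mask in is_one]
print(df)
```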
Setup
I used a different set of ones and zeros to highlight skipping an entire row.
df = pd.DataFrame(
[[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
['id1', 'id2', 'id3', 'id4'],
['A', 'B', 'C', 'D']
)
df
A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 0
id4 0 0 1 0
Not Your Grampa's Reverse Binarizer
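import numpy as np
n = len(df)
i, j = np.nonzero(df.to_numpy())
col_names = df.columns[j]
# minlength=n keeps a bin for trailing rows that contain no non-zero values
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
df.assign(Result=[a.tolist() for a in result])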
A B C D Result
id
id1 0 1 0 1 [B, D]
id2 1 0 0 1 [A, D]
id3 0 0 0 0 []
id4 0 0 1 0 [C]
Explanations
Ohh, the details!
np.nonzero on a 2-D array will return two arrays of equal length. The first array will have the 1st dimensional position of each element that is not zero. The second array will have the 2nd dimensional position of each element that is not zero. I'll call the first array i and the second array j.
In the figure below, I label the columns with what j they represent and correspondingly, I label the rows with what i they represent.
For each non-zero element of the dataframe, I place above the value a tuple with the (ith, jth) dimensional positions and in brackets the [kth] non-zero element in the dataframe.
# j → 0 1 2 3
# A B C D
# i
# ↓ (0,1)[0] (0,3)[1]
# 0 id1 0 1 0 1
#
# (1,0)[2] (1,3)[3]
# 1 id2 1 0 0 1
#
#
# 2 id3 0 0 0 0
#
# (3,2)[4]
# 3 id4 0 0 1 0
In the figure below, I show what i and j look like. I label each row with the same k in the brackets in the figure above
# i j
# ----
# 0 1 k=0
# 0 3 k=1
# 1 0 k=2
# 1 3 k=3
# 3 2 k=4
Now it's easy to slice the df.columns with the j array to get all the column labels in one big array.
# df.columns[j]
# B
# D
# A
# D
# C
The plan is to use np.split to chop up df.columns[j] into the sub arrays for each row. Turns out that the information is embedded in the array i. I'll use np.bincount to count how many non-zero elements are in each row. I'll need to tell np.bincount the minimum number of bins we are assuming to have. That minimum is the number of rows in the dataframe. We assign it to n with n = len(df)
# np.bincount(i, minlength=n)
# 2 ← Two non-zero elements in the first row
# 2 ← Two more in the second row
# 0 ← None in this row
# 1 ← And one more in the fourth row
However, if we take the cumulative sum of this array, we get the positions we need to split at.
# np.bincount(i, minlength=n).cumsum()
# 2
# 4
# 4 ← This repeated value results in an empty array for the 3rd row
# 5
Let's look at how this matches up with df.columns[j]. We see below that the column slice gets split exactly where we need.
# B D | A D | | C |   ← df.columns[j], with | marking each split point
#     2     4 4   5   ← np.bincount(i, minlength=n).cumsum()
One issue is that the 4 values in this array would split df.columns[j] into 5 sub-arrays. This isn't horrible, because the last sub-array will always be empty, so we drop the final split position with np.bincount(i, minlength=n).cumsum()[:-1]
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
# result
# [B, D]
# [A, D]
# []
# [C]
The only thing left to do is assign it to a new column and turn the individual sub-arrays into lists.
df.assign(Result=[a.tolist() for a in result])
# A B C D Result
# id
# id1 0 1 0 1 [B, D]
# id2 1 0 0 1 [A, D]
# id3 0 0 0 0 []
# id4 0 0 1 0 [C]
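The walkthrough above can be collected into one helper. This is only a sketch: the function name nonzero_column_lists is made up here, and it assumes a non-empty, purely numeric dataframe.

```python
import numpy as np
import pandas as pd

def nonzero_column_lists(frame):
    """For each row, list the column labels that hold a non-zero value."""
    i, j = np.nonzero(frame.to_numpy())
    # Non-zero cells per row; minlength keeps bins for trailing all-zero rows
    per_row = np.bincount(i, minlength=len(frame))
    # Cumulative counts are the split points; drop the last one so that
    # np.split does not add an extra empty chunk at the end
    positions = per_row.cumsum()[:-1]
    labels = frame.columns.to_numpy()[j]
    return [chunk.tolist() for chunk in np.split(labels, positions)]

df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    index=["id1", "id2", "id3", "id4"],
    columns=["A", "B", "C", "D"],
)
result = df.assign(Result=nonzero_column_lists(df))
print(result)
```

Passing minlength matters when the last rows are all zeros: without it, np.bincount returns fewer bins than there are rows and the number of sub-arrays no longer matches the number of rows.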

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting from the second row. I also want to use iterrows to do this. I tried the code below, but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
To match your requirement, you could try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
To stick to iterrows():
i = 0
df2['cumsum'] = 0
col = list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'] + df2.iloc[i, col]
    i += 1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2
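If iterrows is not a hard requirement, a vectorized sketch can produce the expected output from the question (NaN, 2, 3). It relies on index alignment during assignment, so the first row, which is left out of the assigned Series, becomes NaN.

```python
import pandas as pd

df2 = pd.DataFrame({"b": [1, 1, 1]})

running = df2["b"].cumsum()        # 1, 2, 3
# Assign only rows 1 onward; row 0 has no matching label and becomes NaN
df2["cumsum"] = running.iloc[1:]
print(df2)
```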

How to replace the values of 1's and 0's of various column into a single column of a data frame?

The 0's and 1's need to be translated to their corresponding column headers in Python.
How can I achieve this and get the column final_list?
If there is always exactly one 1 per row, use DataFrame.dot:
df = pd.DataFrame({'a':[0,1,0],
                   'b':[1,0,0],
                   'c':[0,0,1]})
df['Final'] = df.dot(df.columns)
print (df)
a b c Final
0 0 1 0 b
1 1 0 0 a
2 0 0 1 c
If multiple 1s per row are possible, append a separator to each column name and then strip the trailing one from the resulting Series with Series.str.rstrip:
df = pd.DataFrame({'a':[0,1,0],
                   'b':[1,1,0],
                   'c':[1,1,1]})
df['Final'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
a b c Final
0 0 1 1 b,c
1 1 1 1 a,b,c
2 0 0 1 c
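The dot trick depends on multiplying integers by strings. As a more explicit alternative (a different technique, sketched with the same data), the labels can be joined row by row; flag_cols is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 0], "b": [1, 1, 0], "c": [1, 1, 1]})

flag_cols = ["a", "b", "c"]  # the 0/1 columns to translate into labels

# For each row, join the names of the columns whose flag is 1
df["Final"] = [
    ",".join(col for col, flag in zip(flag_cols, row) if flag == 1)
    for row in df[flag_cols].to_numpy()
]
print(df)
```

Rows with no 1 at all end up as an empty string.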

How do I make a panda frames values across multiple columns, its columns

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id col1 col2 col3
12a a b d
22b d a b
33c c a b
I am trying to turn the values spread across multiple columns into columns of their own, so the output would look like this:
Desired output:
id a b c d
12a 1 1 0 1
22b 1 1 0 0
33c 1 1 1 0
I have tried adding in a value column where the value = 1 and using a pivot table
pandaDf['value'] = 1
column = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index = 'id', values = 'value', columns = column)
However, the resulting data frame is a multilevel index and the pandaDf.pivot() method does not allow multiple column values.
Please advise about how I could do this with an output of a single level index.
Thanks for taking the time to read this and I apologize if I have made any formatting errors in posting the question. I am still learning the proper stackoverflow syntax.
You can use One-Hot Encoding to solve this problem.
Here is one way to do this with pd.get_dummies, followed by splitting the column names into a MultiIndex and summing per level:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
id a c d b
0 12a 1 0 1 1
1 22b 1 0 1 1
2 33c 1 1 0 1
Using get_dummies
pd.get_dummies(df.set_index('id'),prefix='', prefix_sep='').sum(level=0,axis=1)
Out[81]:
a c d b
id
12a 1 0 1 1
22b 1 0 1 1
33c 1 1 0 1
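Both answers above call sum(level=...), an argument that was removed in pandas 2.0. As a sketch of a different route that should work on current versions, the frame can be reshaped to long form with melt and then counted with pd.crosstab. The column name "letter" is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["12a", "22b", "33c"],
        "col1": ["a", "d", "c"],
        "col2": ["b", "a", "a"],
        "col3": ["d", "b", "b"],
    }
)

# Long form: one row per (id, value) pair
long = df.melt(id_vars="id", value_name="letter")

# Count how often each value occurs per id, then flatten to a plain frame
df_out = pd.crosstab(long["id"], long["letter"])
df_out = df_out.rename_axis(columns=None).reset_index()
print(df_out)
```

crosstab counts occurrences, so a value repeated within one row would show up as 2; calling clip(upper=1) on the counts before reset_index would turn them back into 0/1 flags if that matters.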

Pandas If Statements (excel equivalent)

I'm trying to create a simple if statement in Pandas.
The excel version is as follows:
=IF(E2="ABC",C2,E2)
I'm stuck on how to assign it based on a string or partial string.
Here is what I have.
df['New Value'] = df['E'].map(lambda x: df['C'] if x == 'ABC' else df['E'])
I know I'm making a mistake here, as the outcome is the entire column's values in each cell.
Any help would be much appreciated!
use np.where:
In [36]:
df = pd.DataFrame({'A':np.random.randn(5), 'B':0, 'C':np.arange(5),'D':1, 'E':['asdsa','ABC','DEF','ABC','DAS']})
df
Out[36]:
A B C D E
0 0.831728 0 0 1 asdsa
1 0.734007 0 1 1 ABC
2 -1.032752 0 2 1 DEF
3 1.414198 0 3 1 ABC
4 1.042621 0 4 1 DAS
In [37]:
df['New Value'] = np.where(df['E'] == 'ABC', df['C'], df['E'])
df
Out[37]:
A B C D E New Value
0 0.831728 0 0 1 asdsa asdsa
1 0.734007 0 1 1 ABC 1
2 -1.032752 0 2 1 DEF DEF
3 1.414198 0 3 1 ABC 3
4 1.042621 0 4 1 DAS DAS
The syntax for np.where is:
np.where(<condition>, <value if True>, <value if False>)
So where the condition is True it returns the first value, and where it is False it returns the second.
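The question also mentions matching on a partial string. A small sketch, with made-up sample values, shows both the exact comparison and a substring test built with Series.str.contains:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"C": [0, 1, 2, 3], "E": ["asdsa", "ABC", "xABCx", "DAS"]})

# Exact match, like =IF(E2="ABC", C2, E2)
df["Exact"] = np.where(df["E"] == "ABC", df["C"], df["E"])

# Partial match: True wherever E contains the substring "ABC"
contains_abc = df["E"].str.contains("ABC", regex=False, na=False)
df["Partial"] = np.where(contains_abc, df["C"], df["E"])
print(df)
```

Here regex=False treats "ABC" as a literal substring, and na=False makes missing values count as no match.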
