How do I turn a pandas DataFrame's values across multiple columns into its own indicator columns? - python-3.x

I have the following dataframe loaded up in Pandas.
print(pandaDf)
id   col1  col2  col3
12a  a     b     d
22b  d     a     b
33c  c     a     b
I am trying to convert the values spread across multiple columns into indicator columns, so the output would look like this:
Desired output:
id   a  b  c  d
12a  1  1  0  1
22b  1  1  0  1
33c  1  1  1  0
I have tried adding a value column where value = 1 and using a pivot table:
pandaDf['value'] = 1
columns = ['col1', 'col2', 'col3']
pandaDf.pivot_table(index='id', values='value', columns=columns)
However, the resulting data frame has a multilevel column index, and the pandaDf.pivot() method does not allow multiple column values.
Please advise how I could do this with an output that has a single-level index.
Thanks for taking the time to read this, and I apologize if I have made any formatting errors in posting the question. I am still learning the proper Stack Overflow syntax.

You can use one-hot encoding to solve this problem.
Here is one way to do it with pd.get_dummies, flattening the resulting MultiIndex columns and summing:
df1 = df.set_index('id')
df_out = pd.get_dummies(df1)
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out = df_out.sum(level=1, axis=1).reset_index()
print(df_out)
Output:
    id  a  c  d  b
0  12a  1  0  1  1
1  22b  1  0  1  1
2  33c  1  1  0  1

Using get_dummies
pd.get_dummies(df.set_index('id'), prefix='', prefix_sep='').sum(level=0, axis=1)
Out[81]:
     a  c  d  b
id
12a  1  0  1  1
22b  1  0  1  1
33c  1  1  0  1
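Note that on recent pandas releases (2.0+) the level= argument to sum() is no longer available. As a hedged alternative sketch, the same single-level result can be built with melt plus pd.crosstab (recreating the question's pandaDf here):
import pandas as pd

pandaDf = pd.DataFrame({'id':   ['12a', '22b', '33c'],
                        'col1': ['a', 'd', 'c'],
                        'col2': ['b', 'a', 'a'],
                        'col3': ['d', 'b', 'b']})

# wide -> long: one row per (id, letter) pair
long = pandaDf.melt(id_vars='id', value_name='letter')
# count how often each letter occurs per id; columns come out single-level
out = pd.crosstab(long['id'], long['letter']).reset_index()
print(out)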

Related

value count for an attribute from the column when there are multiple values for the attribute

I am trying to count and visualize the Netflix dataset by the country column, but when I checked the dataset I found some rows where the country column contains multiple values, such as the one below.
The following is the code to count:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries.shape
I want each of those rows to count toward every country it lists, so I get the proper per-country totals.
You can split the country column on ',' and then .explode(). The next step is .groupby():
df = df['country'].apply(lambda x: x.split(',')).explode().to_frame()
print(df.groupby('country').agg('size'))
Prints:
country
Austria           1
Canada            1
Germany           1
India             2
United Kingdom    1
United States     1
dtype: int64
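Since .explode() already yields one country per row, the same count can be written more directly with value_counts; a short sketch, assuming joint_data as in the question and stripping the spaces that typically follow the commas:
country_count = (joint_data['country']
                 .str.split(',')    # one list of countries per row
                 .explode()         # one country per row
                 .str.strip()       # drop spaces left over after the commas
                 .value_counts())   # per-country totals, sorted descending
topcountries = country_count.head(10)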
You can compile all possible values from your 'country' column, make a set out of them, and create a new column for each.
Then you can iterate the rows and fill in 1 wherever the column name appears in that row's 'country':
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})
print(df)

# one new column per distinct country, initially empty
df[[*sorted(set(','.join(df["country"]).split(",")))]] = None
for row in df.iterrows():
    row[1][[*(row[1]["country"].split(","))]] = 1
print(df)
Output:
   country     A     B     C     D     E     F     G
0    A,B,C     1     1     1  None  None  None  None
1  A,D,E,F     1  None  None     1     1     1  None
2        G  None  None  None  None  None  None     1
If you'd rather have 0 instead of None, use df.fillna(0, inplace=True) to convert them:
# 0 instead of None
df.fillna(value=0, inplace=True)
print(df)

# print sums
for c in df.columns:
    if c == "country":
        continue
    print(f"{c} {df[c].sum()}")
Output:
   country  A  B  C  D  E  F  G
0    A,B,C  1  1  1  0  0  0  0
1  A,D,E,F  1  0  0  1  1  1  0
2        G  0  0  0  0  0  0  1
A 2
B 1
C 1
D 1
E 1
F 1
G 1
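pandas also ships a shortcut for exactly this split-and-indicator step, Series.str.get_dummies; a minimal sketch on the same input frame:
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})
dummies = df["country"].str.get_dummies(sep=",")  # one 0/1 column per country
print(df.join(dummies))
print(dummies.sum())  # per-country totals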

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
   b
0  1
1  1
2  1
I want to create a column 'cumsum' with the cumulative sum of column b, starting at row 2. I also want to use iterrows for this. I tried the code below, but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
   b  cum_sum
0  1      NaN
1  1        2
2  1        3
As per your requirement, you may try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
   b  cumsum
0  1     NaN
1  1     2.0
2  1     3.0
To stick to iterrows():
i = 0
df2['cumsum'] = 0
col = list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'] + df2.iloc[i, col]
    i += 1
Outputs:
   b  cumsum
0  1       0
1  1       1
2  1       2
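For reference, the loop can be avoided entirely; a vectorised sketch that reproduces the asker's expected output (NaN in the first row, then the running total):
import numpy as np

df2['cumsum'] = df2['b'].cumsum()         # running total of b
df2.loc[df2.index[0], 'cumsum'] = np.nan  # blank out the first row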

How to collapse the 1's and 0's of various columns into a single column of a data frame?

The 0's and 1's need to be transposed to their appropriate headers in Python.
How can I achieve this and get the column final_list?
If there is always exactly one 1 per row, use DataFrame.dot (multiplying each 0/1 row by the column labels keeps exactly the labels where the value is 1):
df = pd.DataFrame({'a': [0, 1, 0],
                   'b': [1, 0, 0],
                   'c': [0, 0, 1]})

df['Final'] = df.dot(df.columns)
print (df)
   a  b  c Final
0  0  1  0     b
1  1  0  0     a
2  0  0  1     c
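When exactly one 1 per row is guaranteed, idxmax along the columns is an equivalent sketch, returning the label of the maximum (i.e. the single 1) in each row; it assumes the original 0/1 columns, before 'Final' is added:
df['Final'] = df[['a', 'b', 'c']].idxmax(axis=1)  # label of the single 1 per row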
If multiple 1s per row are possible, append a separator to the column labels and then strip the trailing one from the output Series with Series.str.rstrip:
df = pd.DataFrame({'a': [0, 1, 0],
                   'b': [1, 1, 0],
                   'c': [1, 1, 1]})

df['Final'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
   a  b  c  Final
0  0  1  1    b,c
1  1  1  1  a,b,c
2  0  0  1      c

Index order of a shuffled dataframe

I have two DataFrames, namely A and B. B is generated by shuffling the rows of A. For each row of B, I would like to know the index of the same row in A.
Example:
A = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})
B = pd.DataFrame({"a": [2, 3, 1], "b": [2, 3, 1], "c": [2, 3, 1]})
A
   a  b  c
0  1  1  1
1  2  2  2
2  3  3  3
B
   a  b  c
0  2  2  2
1  3  3  3
2  1  1  1
The answer should be [1, 2, 0], because B equals A.loc[[1, 2, 0]]. I am wondering how to do this efficiently, since my A and B are large.
I came up with a possible solution using DataFrame.merge:
A = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})
B = pd.DataFrame({"a": [2, 3, 1], "b": [2, 3, 1], "c": [2, 3, 1]})
A['index_a'] = A.index
B['index_b'] = B.index
merge_df = pd.merge(A, B, left_on=['a', 'b', 'c'], right_on=['a', 'b', 'c'])
Where merge_df is:
   a  b  c  index_a  index_b
0  1  1  1        0        2
1  2  2  2        1        0
2  3  3  3        2        1
Now you can cross-reference rows between the A and B DataFrames.
Example
You know that the row with index 0 in A is at index 2 in B.
NOTE: rows that do not appear in both dataframes will not be shown in merge_df.
IIUC, use merge:
pd.merge(B.reset_index(), A.reset_index(),
         left_on=A.columns.tolist(),
         right_on=B.columns.tolist()).iloc[:, -1].values
array([1, 2, 0], dtype=int64)
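If the rows are unique, a merge-free sketch is to hash each row of A to its position and then look up each row of B, which is linear in the number of rows (assuming A and B as first defined, without the helper index columns):
# map each row of A (as a tuple) to its positional index
pos = {tup: i for i, tup in enumerate(map(tuple, A.to_numpy()))}
# look up every row of B in that dictionary
order = [pos[tup] for tup in map(tuple, B.to_numpy())]
print(order)  # [1, 2, 0]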

Selecting data from multiple dataframes

My workbook Rule.xlsx has the following data.
sheet1:
group  ordercode  quantity
0      A          1
       B          3
1      C          1
       E          2
       D          1
sheet2:
group  ordercode  quantity
0      x          1
       y          3
1      x          1
       y          2
       z          1
I have created the dataframes using the method below:
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is a sequence written from these two dataframes:
df3:
group  ordercode  quantity
0      A          1
       B          3
0      x          1
       y          3
1      C          1
       E          2
       D          1
1      x          1
       y          2
       z          1
one group from df1, then the same group from df2.
I wish to know how I can print the data by selecting a group number (e.g. group 0, group 1, etc.).
Any suggestions?
After some comments, the solution is:
# create an OrderedDict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
# ordering of the DataFrames
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
# in a loop, look up each sheet, forward-fill the merged group cells and add a helper column
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
# finally join together, sort by group and sheet order, and remove the helper column
df = pd.concat(L).sort_values(['group', 'g']).drop(columns='g')
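To then print the data for a single group, as asked, a boolean filter on the group column is enough (a sketch against the df built above):
print(df[df['group'] == 0])  # all rows belonging to group 0
print(df[df['group'] == 1])  # all rows belonging to group 1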
