Print dictionary to file using pandas DataFrame, but changing dataframe format - python-3.x

I have a dictionary of dictionaries I want to print into a csv file. I came across a way to do this using pandas.DataFrame:
import pandas as pd
d = {'foo': {'A':'a', 'B':'b'}, 'bar': {'C':'c', 'D':'d'}}  # avoid shadowing the built-in dict
df = pd.DataFrame(d)
#df.to_csv(path_or_buf=r"results.txt", mode='w')
This gives me a formatted result like so:
  bar  foo
A NaN    a
B NaN    b
C   c  NaN
D   d  NaN
I expected (and would like to have) a DataFrame that instead looks like:
foo A a
foo B b
bar C c
bar D d
I'm new to manipulation of dataframes, so I'm not sure how to change the formatting - would I do it in the DataFrame argument? Or is there a way to change it once the dictionary is already a df?

You are looking for stack:
df.stack()
Out[91]:
A  foo    a
B  foo    b
C  bar    c
D  bar    d
dtype: object
With overlapping inner keys, the result has a MultiIndex:
d = {'foo': {'A':'a', 'B':'b'}, 'bar': {'A':'a', 'C':'c', 'D':'d'}}
df = pd.DataFrame(d)
df.stack()
Out[93]:
A  bar    a
   foo    a
B  foo    b
C  bar    c
D  bar    d
dtype: object
df.stack().reset_index()
Out[94]:
  level_0 level_1  0
0       A     bar  a
1       A     foo  a
2       B     foo  b
3       C     bar  c
4       D     bar  d
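If you want the outer key in the first column, as in the expected output in the question, one option (just a sketch; the column names outer/inner/value are only illustrative) is to swap the index levels before resetting them and then write the result out:
import pandas as pd

d = {'foo': {'A': 'a', 'B': 'b'}, 'bar': {'C': 'c', 'D': 'd'}}
df = pd.DataFrame(d)

out = df.stack().swaplevel().reset_index()  # outer key first, then inner key, then value
out.columns = ['outer', 'inner', 'value']   # illustrative names; reset_index alone gives level_0/level_1/0
out.to_csv('results.txt', index=False, header=False)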

Related

Groupby id and get each string from an id, in a different column

Hello, I just want to group the elements by id and show each string in a separate column.
Original dataframe:
id|elements|
1|a
1|b
1|c
1|d
2|a
2|b
2|b
3|a
3|a
3|b
3|c
3|c
3|c
Desired output:
id|column1|column2|column3|column4|column5|column6|
1 |a|b|c|d| | |
2 |a|b|b| | | |
3 |a|a|b|c|c|c|
Any ideas? Thank you very much in advance
Given your original data frame, you can simply do:
df.groupby('id').apply(lambda x: x['element'].to_list()).apply(pd.Series)
Output:
0 1 2 3 4 5
id
1 a b c d NaN NaN
2 a b b NaN NaN NaN
3 a a b c c c
If you do not want id to be the index, use .reset_index().
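For example, a small usage sketch (assuming the column of values is named element, as in the reconstruction below):
out = df.groupby('id').apply(lambda x: x['element'].to_list()).apply(pd.Series)
out = out.reset_index()  # 'id' becomes an ordinary column instead of the index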
Try this:
import pandas as pd

# Rebuild the example data (13 rows, matching the table in the question)
F = {'id': [1,1,1,1,2,2,2,3,3,3,3,3,3],
     'element': ['a','b','c','d','a','b','b','a','a','b','c','c','c']}
df = pd.DataFrame(data=F)

# Collect each id's elements into a list, then spread each list into its own columns
df2 = df.set_index('id').stack().groupby(level=[0, 1]).apply(list).unstack()
df3 = pd.DataFrame(df2["element"].to_list(),
                   columns=['element1', 'element2', 'element3', 'element4', 'element5', 'element6'])
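An alternative sketch, assuming the same df as constructed above: number the elements within each id with cumcount, then pivot to one column per position.
df['pos'] = df.groupby('id').cumcount()                    # 0, 1, 2, ... within each id
wide = df.pivot(index='id', columns='pos', values='element')
wide.columns = [f'element{i + 1}' for i in wide.columns]   # element1, element2, ...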

Sum a pandas dataframe

>>> df
A B C D E
0 one A foo 2.284039 0.802802
1 one B foo -1.463983 0.710178
2 two C foo -0.109677 2.930710
3 three A bar -0.356390 -1.972306
4 one B bar 1.425968 -0.285079
5 one C bar -0.657890 -0.555669
6 two A foo -0.168804 -1.930447
7 three B foo 0.488953 -2.512408
8 one C foo 0.251062 -0.465522
9 one A bar 0.427243 -0.845034
10 two B bar 0.629268 -0.892264
11 three C bar 0.171773 0.457268
I want to get the sum of column D, and the sum of column E, where column A is "one" and column C is "foo".
I know this works:
>>> x = df[df["A"] == "one"]
>>> y = x[x["C"] == "foo"]
>>> sum(y["D"])
1.0711178939632426
>>> sum(y["E"])
1.0474592505139344
Is there a more compact/elegant solution?
Using pandas, you can do:
df.groupby(['A', 'C'])[['D', 'E']].sum()
Hope it helps you.
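If you only need the single combination the question asks about (A equal to "one" and C equal to "foo"), a boolean mask is a compact alternative sketch:
mask = (df['A'] == 'one') & (df['C'] == 'foo')
df.loc[mask, ['D', 'E']].sum()   # Series with the sums of D and E for the masked rows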

Pandas: Calculating the difference between the current value and the next value that meets a criterion in a different column

I have a dataframe:
df = pd.DataFrame.from_items([('A', [10, 'foo']), ('B', [440, 'foo']), ('C', [790, 'bar']), ('D', [800, 'bar']), ('E', [7000, 'foo'])], orient='index', columns=['position', 'foobar'])
Which looks like the below:
position foobar
A 10 foo
B 440 foo
C 790 bar
D 800 bar
E 7000 foo
I would like to know the difference between each position and the next position that has the opposite value in the foobar column. Normally I would use the shift method to move down the position column:
df[comparisonCol].shift(-1) - df[comparisonCol]
but as I am using the foobar column to decide which position is applicable, I am not sure how to do this.
The result should look like:
position foobar difference
A 10 foo 780
B 440 foo 350
C 790 bar 6210
D 800 bar 6200
E 7000 foo NaN
This works if foobar has only two unique values, because then you can shift between consecutive groups in a Series:
#identify consecutive groups
a = df['foobar'].ne(df['foobar'].shift()).cumsum()
print (a)
A 1
B 1
C 2
D 2
E 3
Name: foobar, dtype: int32
#get the first position value of each group
b = df.groupby(a)['position'].first()
print (b)
foobar
1 10
2 790
3 7000
Name: position, dtype: int64
#map each row to the first position of the next group (a + 1), then subtract the current position
df['difference'] = a.add(1).map(b) - df['position']
print (df)
position foobar difference
A 10 foo 780.0
B 440 foo 350.0
C 790 bar 6210.0
D 800 bar 6200.0
E 7000 foo NaN
Detail:
print (a.add(1).map(b))
A 790.0
B 790.0
C 7000.0
D 7000.0
E NaN
Name: foobar, dtype: float64
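Putting the steps together, here is a self-contained sketch; it builds the frame with pd.DataFrame.from_dict because from_items is deprecated in newer pandas:
import pandas as pd

df = pd.DataFrame.from_dict(
    {'A': [10, 'foo'], 'B': [440, 'foo'], 'C': [790, 'bar'],
     'D': [800, 'bar'], 'E': [7000, 'foo']},
    orient='index', columns=['position', 'foobar'])
df['position'] = df['position'].astype(int)          # from_dict may leave this as object dtype

a = df['foobar'].ne(df['foobar'].shift()).cumsum()   # consecutive-group id
b = df.groupby(a)['position'].first()                # first position of each group
df['difference'] = a.add(1).map(b) - df['position']  # next group's first position minus current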

Get column names from pandas DataFrame in format dtype:object

I have a question similar to the one in the link below. Instead of returning the column names in a list, I want the column names in the format dtype: object.
For example,
A
B
C
D
Name: x, dtype: object
I am using an Excel file in xlsx format.
Link: Get list from pandas DataFrame column headers
I think you need read_excel first to create the df, and then the Series constructor or Index.to_series to build a Series from the column names:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5]})
print (df)
A B C D
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
s = pd.Series(df.columns.values, name='x')
print (s)
0 A
1 B
2 C
3 D
Name: x, dtype: object
s1 = df.columns.to_series().rename('x')
print (s1)
A A
B B
C C
D D
Name: x, dtype: object
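Since the question mentions an xlsx file, the same idea would look roughly like this (the filename here is only a placeholder, not from the question):
import pandas as pd

df = pd.read_excel('your_file.xlsx')        # placeholder path
s = pd.Series(df.columns.values, name='x')  # column names as a Series with dtype: object
print(s)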

pandas groupby apply does not broadcast into a DataFrame

Using pandas 0.19.0. The following code will reproduce the problem:
In [1]: import pandas as pd
        import numpy as np

In [2]: df = pd.DataFrame({'c1' : list('AAABBBCCC'),
                           'c2' : list('abcdefghi'),
                           'c3' : np.random.randn(9),
                           'c4' : np.arange(9)})
        df

Out[2]:
  c1 c2        c3  c4
0  A  a  0.819618   0
1  A  b  1.764327   1
2  A  c -0.539010   2
3  B  d  1.430614   3
4  B  e -1.711859   4
5  B  f  1.002522   5
6  C  g  2.257341   6
7  C  h  1.338807   7
8  C  i -0.458534   8
In [3]: def myfun(s):
            """Function does practically nothing"""
            req = s.values
            return pd.Series({'mean' : np.mean(req),
                              'std' : np.std(req),
                              'foo' : 'bar'})
In [4]: res = df.groupby(['c1', 'c2'])['c3'].apply(myfun)
res.head(10)
Out[4]:
c1  c2
A   a   foo          bar
        mean    0.819618
        std            0
    b   foo          bar
        mean     1.76433
        std            0
    c   foo          bar
        mean    -0.53901
        std            0
B   d   foo          bar
And, of course I expect this:
Out[4]:
       foo      mean  std
c1 c2
A  a   bar  0.819618    0
   b   bar   1.76433    0
   c   bar  -0.53901    0
B  d   bar   1.43061    0
Pandas automatically converts a Series to a DataFrame when returned by a function that is applied to a Series or a DataFrame. Why is the behavior different for functions applied to groups?
I am looking for an answer that produces the desired output. Bonus points for explaining the difference in behavior between pandas.Series.apply and pandas.DataFrame.apply on the one hand, and pandas.core.groupby.GroupBy.apply on the other.
An easy fix would be to unstack:
df = pd.DataFrame({'c1' : list('AAABBBCCC'),
                   'c2' : list('abcdefghi'),
                   'c3' : np.random.randn(9),
                   'c4' : np.arange(9)})

def myfun(s):
    """Function does practically nothing"""
    req = s.values
    return pd.Series({'mean' : np.mean(req),
                      'std' : np.std(req),
                      'foo' : 'bar'})

res = df.groupby(['c1', 'c2'])['c3'].apply(myfun)
res.unstack()
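Alternatively, a sketch that builds the DataFrame directly with agg and adds the constant column afterwards; note that pandas' std uses ddof=1, unlike np.std in myfun, so single-row groups give NaN instead of 0:
res = (df.groupby(['c1', 'c2'])['c3']
         .agg(['mean', 'std'])   # one column per aggregate; std here is the sample std (ddof=1)
         .assign(foo='bar'))     # constant column, as in myfun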
