I'm running this piece of code:
import pandas as pd
df = pd.read_csv("./teste/teste_1.csv", sep=";")
df.fillna(0, inplace=True)
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
print(df)
df.to_csv("./teste/9table.csv", sep=";")
print("Done")
teste_1.csv:
META;A;B;C;D;E;%
A;;24.564;;;;-0.00%
B;;2.150;;;;3.55%
C;;;15.226;;;6.14%
And getting this print:
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
However, when I save it to csv, I get this result:
META A B C D E % Total
0 A 0.0 24.564 0.0 0.0 0.0 -0.00% 24.564
1 B 0.0 2.15 0.0 0.0 0.0 3.55% 2.15
2 C 0.0 0.0 15.225999999999900 0.0 0.0 6.14% 15.225999999999900
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 8 columns):
META 3 non-null object
A 3 non-null float64
B 3 non-null float64
C 3 non-null float64
D 3 non-null float64
E 3 non-null float64
% 3 non-null object
Total 3 non-null float64
dtypes: float64(6), object(2)
memory usage: 272.0+ bytes
Pass float_format to to_csv to control how floats are written. For example, '%.2f' writes two decimal places (use '%.3f' to keep the three decimals shown in the data):
df.to_csv("./teste/9table.csv", sep=";", float_format='%.2f')
See: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
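As a minimal sketch of the effect, the snippet below writes a small frame to a string instead of a file (to_csv returns the CSV text when no path is given). The frame is a cut-down, hypothetical version of the data in the question, and '%.3f' is chosen here only because the sample values have three decimals.

```python
import pandas as pd

# Hypothetical, cut-down version of the frame from the question.
df = pd.DataFrame({
    "META": ["A", "B", "C"],
    "B": [24.564, 2.150, 0.0],
    "C": [0.0, 0.0, 15.226],
})
df["Total"] = df["B"] + df["C"]

# With no path, to_csv returns the CSV text; float_format is applied
# to every float column when the text is produced.
csv_text = df.to_csv(sep=";", index=False, float_format="%.3f")
print(csv_text)
```

The format only affects the written text; the values held in the DataFrame are unchanged.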
I have a DataFrame with columns A, B, C and D. I would like to drop all rows that contain NaN, but only where columns D and C both contain the value 0.
Eg:
Would anyone be able to help me with this issue?
Thanks & Best Regards
Michael
Use boolean indexing with a mask inverted by ~:
import numpy as np
import pandas as pd
np.random.seed(2021)
df = pd.DataFrame(np.random.choice([1,0,np.nan], size=(10, 4)), columns=list('ABCD'))
print (df)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
2 NaN 0.0 0.0 0.0
3 1.0 1.0 NaN NaN
4 NaN NaN 0.0 0.0
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
To remove rows where both D and C are 0 and at least one column is NaN, use DataFrame.all to test whether both values are 0, and chain it with & (bitwise AND) to DataFrame.any, which tests whether at least one value is NaN as reported by DataFrame.isna:
m = df[['D','C']].eq(0).all(axis=1) & df.isna().any(axis=1)
df1 = df[~m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
An alternative without ~ inverts every condition instead (De Morgan's laws): eq becomes ne, isna becomes notna, all and any swap, and & becomes | (bitwise OR), so the mask selects the rows to keep:
m = df[['D','C']].ne(0).any(axis=1) | df.notna().all(axis=1)
df1 = df[m]
print (df1)
A B C D
0 1.0 0.0 0.0 1.0
1 0.0 NaN NaN 1.0
3 1.0 1.0 NaN NaN
5 0.0 NaN 0.0 1.0
6 0.0 NaN NaN 1.0
7 0.0 1.0 NaN NaN
8 1.0 0.0 1.0 0.0
9 0.0 NaN NaN NaN
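To check that the two masks really are complements of each other, here is a small sketch on a hand-written frame (the values are made up for illustration, not taken from the random example above). It builds both masks and compares them.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering the interesting cases.
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 0.0, 1.0],
    "B": [0.0, 0.0, np.nan, 1.0, 1.0],
    "C": [0.0, 0.0, 0.0, np.nan, 0.0],
    "D": [1.0, 0.0, 0.0, np.nan, 0.0],
})

# Rows to drop: D and C are both 0 AND the row has at least one NaN.
drop_mask = df[["D", "C"]].eq(0).all(axis=1) & df.isna().any(axis=1)

# Rows to keep: D or C is not 0 OR the row has no NaN at all.
keep_mask = df[["D", "C"]].ne(0).any(axis=1) | df.notna().all(axis=1)

result = df[keep_mask]
print(result)
```

Rows 1 and 2 are dropped here. Row 4 has zeros in C and D but no NaN, so it stays, and row 3 stays because NaN in C and D is not equal to 0.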
I need some help performing a few operations over subgroups, and I am getting confused. I describe the operations below; the desired output is annotated with comments.
(1) Calculate the % frequency of appearance per subgroup
(2) Show records that do not exist in the data, filled with 0
(3) Rearrange the order of records and columns
Assume the df below as the raw data:
import numpy as np
import pandas as pd
df=pd.DataFrame({'store':[1,1,1,2,2,2,3,3,3,3],
'branch':['A','A','C','C','C','C','A','A','C','A'],
'products':['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind but I can't get the desired output.
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products accessories bags clothes shoes
store branch
1 A 0.0 0.0 1.0 1.0
C 0.0 0.0 1.0 0.0
2 C 1.0 0.0 1.0 1.0
3 A 0.0 2.0 1.0 0.0
C 0.0 0.0 1.0 0.0
# desirable output: if (1), (2) and (3) take place somehow...
products clothes shoes accessories bags
store branch
1 B 0 0 0 0 #group 1 has 1 shoes and 1 clothes for A and C, so 3 in total which transforms each number to 33.3%
A 33.3 33.3 0 0
C 33.3 0.0 0 0
2 B 0 0 0 0
A 0 0 0 0
C 33.3 33.3 33.3 0
3 B 0 0 0 0 #group 3 has 2 bags and 1 clothes for A and C, so 4 in total which transforms the 2 bags into 50% and so on
A 25 0 0 50
C 25 0 0 0
# (3) rearrangement of columns with "clothes" and "shoes" going first
# (3)+(2) branch B appeared and the order of branches changed to B, A, C
# (1) percentage calculations of the occurrences have been performed over groups that hopefully have made sense with the comments above
I have tried to handle each group separately, but i) it does not take into consideration the replaced NaN values, ii) I should avoid handling each group because I will need to concatenate afterwards a lot of groups (this df is just an example) as I will need to plot the whole group later on.
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products accessories bags clothes shoes
store branch
1 A NaN NaN 50.0 100.0 #why has it transformed on axis='columns'?
C NaN NaN 50.0 0.0
Hopefully my question makes sense. Any insight into what I try to perform is very appreciated in advance, thank you a lot!
With the help of @Quang Hoang, who tried to help with this question a day before I posted my answer, I managed to find a solution.
To explain the last step of the calculation: I divide every count by the total number of records in its store (the level-0 group), so each frequency is relative to the store rather than to a row, a column, or the whole table.
grouped_df = df.groupby(['store', 'branch', 'products']).size()\
.unstack('branch')\
.reindex(['B','C','A'], axis=1, fill_value=0)\
.stack('branch')\
.unstack('products')\
.replace({np.nan:0})\
.transform(
lambda x: x*100/df.groupby(['store']).size()
).round(1)\
.reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
Running the code above produces the desired output:
products accessories bags clothes shoes
store branch
1 B 0.0 0.0 0.0 0.0
C 0.0 0.0 33.3 0.0
A 0.0 0.0 33.3 33.3
2 B 0.0 0.0 0.0 0.0
C 33.3 0.0 33.3 33.3
3 B 0.0 0.0 0.0 0.0
C 0.0 0.0 25.0 0.0
A 0.0 50.0 25.0 0.0
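For comparison, here is a sketch of the same three steps done with an explicit full index instead of the unstack/stack round trip. It is one possible approach under stated assumptions, not the method from the answer above: the branch order B, A, C and the product order are taken from the question, and the names full_index, store_totals and pct are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "branch": ["A", "A", "C", "C", "C", "C", "A", "A", "C", "A"],
    "products": ["clothes", "shoes", "clothes", "shoes", "accessories",
                 "clothes", "bags", "bags", "clothes", "clothes"],
})

# Count occurrences per (store, branch) row and product column.
counts = (df.groupby(["store", "branch", "products"])
            .size()
            .unstack("products", fill_value=0))

# (2) + (3): force every store/branch pair to exist, in the wanted order.
full_index = pd.MultiIndex.from_product(
    [sorted(df["store"].unique()), ["B", "A", "C"]],
    names=["store", "branch"],
)
counts = counts.reindex(full_index, fill_value=0)
counts = counts.reindex(
    columns=["clothes", "shoes", "accessories", "bags"], fill_value=0
)

# (1): divide each row by the total number of records in its store.
store_totals = counts.sum(axis=1).groupby(level="store").transform("sum")
pct = counts.div(store_totals, axis=0).mul(100).round(1)
print(pct)
```

Because the full index is built with MultiIndex.from_product, pairs such as store 2 with branch A appear as all-zero rows even though they never occur in the raw data.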
I try to get new columns a and b based on the following dataframe:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (when both exist they hold the same value), then a_x and a_y are equal, so I take either of them as the new columns a and b. If b_x or b_y is 1.0, a_x and a_y differ, so I take the mean of a_x and a_y as a, and either b_x or b_y as b.
If only one of the pairs (a_x, b_x) or (a_y, b_y) is not null, I take the existing values as a and b.
My expected result looks like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get the result above? Thank you.
Use:
#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values across columns and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
The solution can be simplified, though: the mean can be used for both conditions, because the mean of two equal values is that same value, and mean skips NaN by default:
b = df.filter(like='b')
a = df.filter(like='a')
#row-wise mean of the a columns (NaN values are skipped)
a1 = a.mean(axis=1)
#forward fill missing values across columns and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
df['a'] = a1
df['b'] = b1
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
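A variant of the simplified solution, sketched on a few rows of the data: it names the columns explicitly instead of using filter(like=...), which matches any column whose name contains the letter, and uses combine_first to take b_x where present and b_y otherwise. The cast to int is an assumption based on the expected output in the question and only works when every row has at least one non-null b value.

```python
import numpy as np
import pandas as pd

# A few representative rows from the question.
df = pd.DataFrame({
    "a_x": [13.67, 13.52, 13.60, np.nan],
    "b_x": [0.0, 1.0, 1.0, np.nan],
    "a_y": [13.67, 13.17, np.nan, 8.87],
    "b_y": [0.0, 1.0, np.nan, 0.0],
})

# Row-wise mean; NaN is skipped, so a single value is returned unchanged.
df["a"] = df[["a_x", "a_y"]].mean(axis=1)

# Take b_x where it exists, otherwise fall back to b_y.
df["b"] = df["b_x"].combine_first(df["b_y"]).astype(int)
print(df)
```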
I'm working with dynamic CSV files, so I never know what the column names will be. Some examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
As I would like to create a new column with the SUM of all columns between META and %, I need to get all the names of each column, so I can create something like that:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above works only for example 1). So I need to: 1) identify all the columns; and 2) sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is the columns are between META and %, but even they are not fixed.
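If META is the first column and % is the last, select everything except the first and last columns with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing (this assumes every remaining column should be counted):
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)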
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
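EDIT: You can also select the columns between META and % by label or by position:
#label slice includes both META and %, which are not numeric, so sum numeric columns only
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1, numeric_only=True)
#positional slice includes META (not numeric) but stops before %
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1, numeric_only=True)
#most general: only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)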
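The most general variant can be wrapped in a small helper and tried on frames shaped like examples 1) and 3) from the question. This is a sketch: add_total and its start/end parameters are hypothetical names introduced here, and it assumes the marker columns exist exactly once and that everything between them is numeric.

```python
import pandas as pd

def add_total(df, start="META", end="%"):
    """Return a copy of df with a Total column summing the columns
    strictly between the start and end marker columns."""
    lo = df.columns.get_loc(start) + 1
    hi = df.columns.get_loc(end)
    out = df.copy()
    out["Total"] = out.iloc[:, lo:hi].sum(axis=1)
    return out

# Shaped like example 1): an extra column before META.
ex1 = pd.DataFrame({
    "ETC": [2.0, 4.2, 5.0], "META": ["A", "B", "C"],
    "A": [0.0, 0.0, 0.0], "B": [24.564, 2.150, 0.0],
    "C": [0.0, 0.0, 15.226], "%": ["-0.00%", "3.55%", "6.14%"],
})

# Shaped like example 3): two extra columns and different value columns.
ex3 = pd.DataFrame({
    "FILL": ["T", "F", "T"], "ETC": [2.0, 4.2, 5.0],
    "META": ["A", "B", "C"], "G": [0.0, 2.90, 0.0],
    "F": [6.70, 0.0, 34.53], "%": ["-0.00%", "3.55%", "6.14%"],
})

total1 = add_total(ex1)["Total"]
total3 = add_total(ex3)["Total"]
print(total1.tolist(), total3.tolist())
```

The ETC column is left out of the sum in both cases because it sits before META.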
I have a dataframe where 1 column is a list of values and another is the number of digits I need to round to. It looks like this:
ValueToPlot B_length
0 13.80 1.0
1 284.0 0.0
2 5.9 0.0
3 1.38 1.0
4 287.0 0.0
I am looking for an output that looks like this:
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.0 0.0 284
2 5.9 0.0 6
3 1.38 1.0 1.4
4 287.0 0.0 287
Lastly, I would like the Rounded column to be in a string format, so the final result would be:
ValueToPlot B_length Rounded
0 13.80 1.0 '13.8'
1 284.0 0.0 '284'
2 5.9 0.0 '6'
3 1.38 1.0 '1.4'
4 287.0 0.0 '287'
I have attempted to use apply function in Pandas but have not been successful. I would prefer to avoid looping if possible.
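Use chained formats
'{{:0.{}f}}'.format(3) evaluates to '{:0.3f}': the doubled braces '{{' and '}}' are escapes that produce literal braces. Then '{:0.3f}'.format(1) evaluates to '1.000'. Chaining the two calls lets the number of decimals vary per row.
#positional access via iloc; plain x[0] on a labeled Series is deprecated in recent pandas
f = lambda x: '{{:0.{}f}}'.format(int(x.iloc[1])).format(x.iloc[0])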
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
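The two formatting steps can be seen in isolation with plain Python, no pandas needed. The names digits, template and text are only for illustration.

```python
# Step 1: build the format template; doubled braces become literal braces.
digits = 3
template = "{{:0.{}f}}".format(digits)

# Step 2: apply the template to a value.
text = template.format(1)

print(template)  # the intermediate format string
print(text)      # the formatted number, as a string
```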
A little more explicit with the column names
f = lambda x: '{{:0.{}f}}'.format(int(x['B_length'])).format(x['ValueToPlot'])
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
I generally like to use assign because it produces a copy of the data frame with the new column attached. Alternatively, I can edit the original data frame in place:
f = lambda x: '{{:0.{}f}}'.format(int(x.iloc[1])).format(x.iloc[0])
df['Rounded'] = df.apply(f, 1)
Or I can use assign with an actual dictionary:
f = lambda x: '{{:0.{}f}}'.format(int(x.iloc[1])).format(x.iloc[0])
df.assign(**{'Rounded': df.apply(f, 1)})
A little long, but it works:
df.apply(lambda x : str(round(x['ValueToPlot'],int(x['B_length']))) if x['B_length']>0 else str(int(round(x['ValueToPlot'],int(x['B_length'])))),axis=1)
Out[1045]:
0 13.8
1 284
2 6
3 1.4
4 287
dtype: object
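Another way to write the same idea is an f-string with a nested precision field, applied over the two columns with zip. This is a sketch rather than a vectorized solution: it still visits each row in Python, much as apply with axis=1 does, but it avoids building a Series per row.

```python
import pandas as pd

df = pd.DataFrame({
    "ValueToPlot": [13.80, 284.0, 5.9, 1.38, 287.0],
    "B_length": [1.0, 0.0, 0.0, 1.0, 0.0],
})

# The inner braces supply the number of decimals for each row.
df["Rounded"] = [
    f"{value:.{int(digits)}f}"
    for value, digits in zip(df["ValueToPlot"], df["B_length"])
]
print(df)
```

The Rounded column holds Python strings, which matches the final format asked for in the question.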