I'm working with dynamic .csv files, so I never know the column names in advance. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
I would like to create a new column with the SUM of all the columns between META and %, so I need to get the names of those columns to build something like this:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the relevant columns; and 2) sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is that the columns to sum lie between META and %, and even those two are not in fixed positions.
Select all columns except the first and last with DataFrame.iloc, then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select columns between META and %:
#assumes META and % are non-numeric, so the sum skips them
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)
#assumes only META is non-numeric (the slice stops before %)
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)
#most general: only requires META to appear somewhere before %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
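As a quick sanity check, here is a minimal sketch applying the most general variant to example 3's layout (the data values are copied from the question):
import pandas as pd

# example 3: extra columns surround the META .. % block
df = pd.DataFrame({'FILL': ['T', 'F', 'T'],
                   'ETC': [2.0, 4.2, 5.0],
                   'META': ['A', 'B', 'C'],
                   'G': [0.00, 2.90, 0.00],
                   'F': [6.70, 0.00, 34.53],
                   '%': ['-0.00%', '3.55%', '6.14%']})

# sum only the columns strictly between META and %
start = df.columns.get_loc('META') + 1
stop = df.columns.get_loc('%')
df['Total'] = df.iloc[:, start:stop].sum(axis=1)
print(df['Total'].tolist())  # [6.7, 2.9, 34.53]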
I am faced with a small problem, the solution of which is certainly very simple, but I cannot find how to do it.
Let's say I have the following pandas dataframe df:
import pandas as pd
X = [0.78, 0.82, 1.03, 1.06, 1.21]
Y = [0.0, 0.2521, 0.4905, 0.5003, 1.0]
df = pd.DataFrame({'X':X, 'Y':Y})
df
X Y
0 0.78 0.0000
1 0.82 0.2521
2 1.03 0.4905
3 1.06 0.5003
4 1.21 1.0000
I want to recover the value of X for which Y exceeds 0.5; in other words, I am looking for a piece of code that creates a new variable val such that:
print (val)
1.06
I can only come up with complicated approaches, in the style of:
df['Z'] = df.apply(lambda row: 0 if row.Y <= 0.5 else 1, axis = 1)
df
X Y Z
0 0.78 0.0000 0
1 0.82 0.2521 0
2 1.03 0.4905 0
3 1.06 0.5003 1
4 1.21 1.0000 1
But this only shows me where the X value I want is (the first appearance of 1 in Z); it doesn't extract that value.
How could I do that in a simple way?
We can use idxmax; note that this needs at least one value greater than 0.5, because idxmax returns the first index when no value matches.
df.loc[df.Y.gt(0.5).idxmax(),'Z']=1
df.Z.fillna(0,inplace=True)
df
X Y Z
0 0.78 0.0000 0.0
1 0.82 0.2521 0.0
2 1.03 0.4905 0.0
3 1.06 0.5003 1.0
4 1.21 1.0000 0.0
If you would like a separate dataframe:
df1=df.loc[df.Y.gt(0.5)]
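If you only need the scalar val from the question, the same idxmax trick extracts it directly (a minimal sketch; it assumes at least one Y exceeds 0.5):
# index of the first row where Y > 0.5 (idxmax returns the first True)
idx = df.Y.gt(0.5).idxmax()
val = df.loc[idx, 'X']
print(val)  # 1.06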
I need some help performing a few operations over subgroups, but I am getting really confused. I will try to quickly describe the operations and the desired output in the comments.
(1) Calculate the % frequency of appearance per subgroup
(2) Add a record that does not exist in the data (branch B), filled with 0
(3) Rearrange order of records and columns
Assume the df below as the raw data:
import numpy as np
import pandas as pd

df=pd.DataFrame({'store':[1,1,1,2,2,2,3,3,3,3],
'branch':['A','A','C','C','C','C','A','A','C','A'],
'products':['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind but I can't get the desired output.
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products accessories bags clothes shoes
store branch
1 A 0.0 0.0 1.0 1.0
C 0.0 0.0 1.0 0.0
2 C 1.0 0.0 1.0 1.0
3 A 0.0 2.0 1.0 0.0
C 0.0 0.0 1.0 0.0
# desirable output: if (1), (2) and (3) take place somehow...
products clothes shoes accessories bags
store branch
1 B 0 0 0 0 #group 1 has 1 clothes and 1 shoes for A, plus 1 clothes for C, so 3 in total, which turns each count into 33.3%
A 33.3 33.3 0 0
C 33.3 0.0 0 0
2 B 0 0 0 0
A 0 0 0 0
C 33.3 33.3 33.3 0
3 B 0 0 0 0 #group 3 has 2 bags and 1 clothes for A, plus 1 clothes for C, so 4 in total, which turns the 2 bags into 50% and so on
A 25 0 0 50
C 25 0 0 0
# (3) rearrangement of columns with "clothes" and "shoes" going first
# (3)+(2) branch B appears and the order of branches changes to B, A, C
# (1) the percentage of occurrences is computed per store group, as the comments above hopefully make clear
I have tried to handle each group separately, but i) it does not take into account the replaced NaN values, and ii) I should avoid handling each group on its own, because I will need to concatenate a lot of groups afterwards (this df is just an example) in order to plot the whole result later on.
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products accessories bags clothes shoes
store branch
1 A NaN NaN 50.0 100.0 #why has it transformed on axis='columns'?
C NaN NaN 50.0 0.0
Hopefully my question makes sense. Any insight into what I try to perform is very appreciated in advance, thank you a lot!
With the help of @Quang Hoang, who tried to help with this question a day before I posted my answer, I managed to find a solution.
To explain the last bit of the calculation: every element is divided by the total count of its store (the 0th index level), so the frequency is computed per store group rather than per row, column, or grand total.
grouped_df = df.groupby(['store', 'branch', 'products']).size()\
.unstack('branch')\
.reindex(['B','C','A'], axis=1, fill_value=0)\
.stack('branch')\
.unstack('products')\
.replace({np.nan:0})\
.transform(
lambda x: x*100/df.groupby(['store']).size()
).round(1)\
.reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
Running the piece of code above produces the desired output (columns reordered by the final reindex):
products clothes shoes accessories bags
store branch
1 B 0.0 0.0 0.0 0.0
C 33.3 0.0 0.0 0.0
A 33.3 33.3 0.0 0.0
2 B 0.0 0.0 0.0 0.0
C 33.3 33.3 33.3 0.0
3 B 0.0 0.0 0.0 0.0
C 25.0 0.0 0.0 0.0
A 25.0 0.0 0.0 50.0
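A note on the division inside transform: each column passed to the lambda is a Series indexed by (store, branch), and it aligns with the per-store sizes on the shared 'store' level. If that implicit alignment feels too magical, the percentage step can be written explicitly with div; a sketch using hypothetical intermediate names counts and pct:
# counts: the unstacked count table, before the branch-B reindexing
counts = (df.groupby(['store', 'branch', 'products']).size()
            .unstack('products', fill_value=0))
# divide each row by its store's total row count, matched on the 'store' level
pct = counts.div(df.groupby('store').size(), axis=0, level='store').mul(100).round(1)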
I've got some data in a pandas DataFrame that looks like this:
df =
A B
time
0.1 10.0 1
0.15 12.1 2
0.19 4.0 2
0.21 5.0 2
0.22 6.0 2
0.25 7.0 1
0.3 8.1 1
0.4 9.45 2
0.5 3.0 1
Based on the following condition, I am looking for a generic solution that finds the first and last index of every consecutive subset.
cond = df.B == 2
So far I have tried the groupby concept, but without the expected result.
df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()
This is the output I got.
B time
first last
0 False 0.1 0.5
1 True 0.15 0.4
This is the output I would like to get:
B time
first last
0 False 0.1 0.1
1 True 0.15 0.22
2 False 0.25 0.3
3 True 0.4 0.4
4 False 0.5 0.5
How can I accomplish this with a more or less generic approach?
Create a helper Series for groups of consecutive values with Series.shift, Series.ne and a cumulative sum via Series.cumsum, then aggregate using a dictionary:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)
print (df_2)
B time
first first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
If you want to avoid the MultiIndex, use named aggregation:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg(B=('B','first'),
first=('time','first'),
last=('time','last')).reset_index(drop=True)
print (df_2)
B first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
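To see what the helper g actually contains, printing it for the df above shows one label per consecutive run of B values, which is why there are five output rows (a quick sketch):
print(g.tolist())
# [1, 2, 2, 2, 2, 3, 3, 4, 5] -> five consecutive runs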
I'm running this piece of code:
df = pd.read_csv("./teste/teste_1.csv", sep=";")
df.fillna(0, inplace=True)
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
print(df)
df.to_csv("./teste/9table.csv", sep=";")
print("Done")
teste_1.csv:
META;A;B;C;D;E;%
A;;24.564;;;;-0.00%
B;;2.150;;;;3.55%
C;;;15.226;;;6.14%
And I get this printed output:
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
However, when I save it to csv, I get this result:
META A B C D E % Total
0 A 0.0 24.564 0.0 0.0 0.0 -0.00% 24.564
1 B 0.0 2.15 0.0 0.0 0.0 3.55% 2.15
2 C 0.0 0.0 15.225999999999900 0.0 0.0 6.14% 15.225999999999900
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 8 columns):
META 3 non-null object
A 3 non-null float64
B 3 non-null float64
C 3 non-null float64
D 3 non-null float64
E 3 non-null float64
% 3 non-null object
Total 3 non-null float64
dtypes: float64(6), object(2)
memory usage: 272.0+ bytes
Try using float_format='%.2f' while saving to CSV (adjust the precision to suit your data). Let's try it like below:
df.to_csv("./teste/9table.csv", sep=";", float_format='%.2f')
See: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
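The long 15.225999999999900 is just binary floating-point representation surfacing when the float is written at full precision. If you would rather round the data itself instead of only formatting the CSV output, a sketch:
# round the numeric columns before saving; object columns are left untouched
df.round(3).to_csv("./teste/9table.csv", sep=";")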
I have a dataframe where 1 column is a list of values and another is the number of digits I need to round to. It looks like this:
ValueToPlot B_length
0 13.80 1.0
1 284.0 0.0
2 5.9 0.0
3 1.38 1.0
4 287.0 0.0
I am looking for an output that looks like this:
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.0 0.0 284
2 5.9 0.0 6
3 1.38 1.0 1.4
4 287.0 0.0 287
Lastly, I would like the Rounded column to be in a string format, so the final result would be:
ValueToPlot B_length Rounded
0 13.80 1.0 '13.8'
1 284.0 0.0 '284'
2 5.9 0.0 '6'
3 1.38 1.0 '1.4'
4 287.0 0.0 '287'
I have attempted to use the apply function in pandas but have not been successful. I would prefer to avoid looping if possible.
Use chained formats
'{{:0.{}f}}'.format(3) evaluates to '{:0.3f}'. The doubled braces '{{' and '}}' tell format to emit literal braces. Then '{:0.3f}'.format(1) evaluates to '1.000'. We can capture this concept by chaining.
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
A little more explicit with the column names
f = lambda x: '{{:0.{}f}}'.format(int(x['B_length'])).format(x['ValueToPlot'])
df.assign(Rounded=df.apply(f, 1))
ValueToPlot B_length Rounded
0 13.80 1.0 13.8
1 284.00 0.0 284
2 5.90 0.0 6
3 1.38 1.0 1.4
4 287.00 0.0 287
I generally like to use assign, as it produces a copy of the data frame with the new column attached. Alternatively, I can edit the original data frame in place:
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df['Rounded'] = df.apply(f, 1)
Or I can use assign with an actual dictionary
f = lambda x: '{{:0.{}f}}'.format(int(x[1])).format(x[0])
df.assign(**{'Rounded': df.apply(f, 1)})
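On Python 3.6+, the same chaining can be written with a nested f-string instead of doubled braces (a sketch, equivalent to the lambdas above):
# the inner {int(x['B_length'])} builds the precision of the format spec
f = lambda x: f"{x['ValueToPlot']:.{int(x['B_length'])}f}"
df.assign(Rounded=df.apply(f, axis=1))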
A little long... but it works:
df.apply(lambda x: str(round(x['ValueToPlot'], int(x['B_length'])))
         if x['B_length'] > 0
         else str(int(round(x['ValueToPlot'], int(x['B_length'])))),
         axis=1)
Out[1045]:
0 13.8
1 284
2 6
3 1.4
4 287
dtype: object