Calculate sum of specific rows using Python - python-3.x

I need to calculate the sum of specific rows in my dataframe
for example, Nombre de reboot, Passage en mode privé, Passage en mode public, Nombre de Kilomètres parcourus, Heures de roulage,Temps de trajet..
I tested this code on the first three rows:
import pandas as pd
df = pd.read_excel('mycollected_data1.xlsx')
print (df.iloc[:3, df.columns.get_indexer(['Valeurs','Valeurs.1','Valeurs.2'])])
a = df.iloc[1:3, [1,2,3]].sum()
print ('a\n', a)
this is the output:
Valeurs Valeurs.1 Valeurs.2
0 3 5 0
1 2 1 1
2 0 0 0
a
Valeurs 2.0
Valeurs.1 1.0
Valeurs.2 1.0
dtype: float64
desired output:
Valeurs Valeurs.1 Valeurs.2 sum
0 3 5 0 8
1 2 1 1 4
2 0 0 0 0
how can I make it calculate the sum of specific given rows?

You need to use axis=1 in your sum function
Original df
df
Valeurs Valeurs.1 Valeurs.2
0 3 5 0
1 2 1 1
2 0 0 0
cols = ['Valeurs','Valeurs.1','Valeurs.2']
df['sum'] = df.loc[0:2, cols].sum(axis=1)
df # final df
Valeurs Valeurs.1 Valeurs.2 sum
0 3 5 0 8
1 2 1 1 4
2 0 0 0 0

IIUC,
we can use .sum while specifying the axis as 1 to work row wise.
cols = ['Valeurs','Valeurs.1','Valeurs.2']
df['sum'] = df[cols].sum(axis=1)
print(df)
Valeurs Valeurs.1 Valeurs.2 sum
0 3 5 0 8
1 2 1 1 4
2 0 0 0 0
edit, if you need to access certain rows you can use the .loc function as #quant has
row_start = 0
row_end = 2
df.loc[row_start:row_end,cols].sum(axis=1)
Just as a word of advice, it seems like you have repeated column names with similar datatypes, i would first clean your column headings and then melt your dataframe to get a tabular model.

Related

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to create split the data format into different columns each with their own header and their values, the result should be as below:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL, will need as many columns needed each displaying its corresponding value.
The .str.split('-', expand=True) function will only delimit the data and keep all first values in same column and does not rename each column.
Is there a way to implement this in python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
reset_index().pivot('index',1,0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
Idea is first split values by -, then extract numbers and no numbers values to tuples, append to list and convert to dictionaries. It is passed in list comprehension to DataFrame cosntructor, replaced misisng values and converted to numeric:
import re
def f(x):
L = []
for val in x.split('-'):
k, v = re.findall('(\d+)(\D+)', val)[0]
L.append((v, k))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If in data exist some values without number or number only solution should be changed for more general like:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
L = []
for val in x.split('-'):
extracted = re.findall('(\d+)(\D+)', val)
if len(extracted) > 0:
k, v = extracted[0]
L.append((v, k))
else:
if val.isdigit():
L.append(('No match digit', val))
else:
L.append((val, 0))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)

Comparing two different sized pandas Dataframes and to find the row index with equal values

I need some help with comparing two pandas dataframe
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row index of dataframe 1 (df1) which the entire row is the same as the rows in dataframe 2 (df2). My expect result would be
0
3
4
6
The order of the above index does not need to be in order, all I want is the index of dataframe 1 (df1)
Is there a way without using for loop?
Thanks
Tommy
You can using merge
df1.merge(df2,indicator=True,how='left').loc[lambda x : x['_merge']=='both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')

Selective multiplication of a pandas dataframe

I have a pandas Dataframe and Series of the form
df = pd.DataFrame({'Key':[2345,2542,5436,2468,7463],
'Segment':[0] * 5,
'Values':[2,4,6,6,4]})
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 4
2 5436 0 6
3 2468 0 6
4 7463 0 4
s = pd.Series([5436, 2345])
print (s)
0 5436
1 2345
dtype: int64
In the original df, I want to multiply the 3rd column(Values) by 7 except for the keys which are present in the series. So my final df should look like
What should be the best way to achieve this in Python 3.x?
Use DataFrame.loc with Series.isin for filter Value column with inverted condition for non membership with multiple by scalar:
df.loc[~df['Key'].isin(s), 'Values'] *= 7
print (df)
Key Segment Values
0 2345 0 2
1 2542 0 28
2 5436 0 6
3 2468 0 42
4 7463 0 28
Another method could be using numpy.where():
df['Values'] *= np.where(~df['Key'].isin([5436, 2345]), 7,1)

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
sample DataFrame built with #andrew_reece's code
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE: ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100,3)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the "sample" column does not contain any zeros, or alternatively that the column is read as strings rather than numeric. Try converting the column to integer before running the above snippet.

Placing n rows of pandas a dataframe into their own dataframe

I have a large dataframe with many rows and columuns.
An example of the structure is:
a = np.random.rand(6,3)
df = pd.DataFrame(a)
I'd like to split the DataFrame into seperate data frames each consisting of 3 rows.
you can use groupby
g = df.groupby(np.arange(len(df)) // 3)
for n, grp in g:
print(grp)
0 1 2
0 0.278735 0.609862 0.085823
1 0.836997 0.739635 0.866059
2 0.691271 0.377185 0.225146
0 1 2
3 0.435280 0.700900 0.700946
4 0.796487 0.018688 0.700566
5 0.900749 0.764869 0.253200
to get it into a handy dictionary
mydict = {k: v for k, v in g}
You can use numpy.split() method:
In [8]: df = pd.DataFrame(np.random.rand(9, 3))
In [9]: df
Out[9]:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
In [10]: for x in np.split(df, len(df)//3):
...: print(x)
...:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
0 1 2
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
0 1 2
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238

Resources