Skipping every nth row in pandas - python-3.x

I am trying to slice my dataframe by skipping every 4th row. The best way I could get it done is by getting the index of every 4th row and then selecting all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.

One possible solution is to create a mask with modulo arithmetic and filter by boolean indexing:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, you can use a slightly changed version of @jpp's solution from the comments:
b = df.drop(df.index[::4])
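A small sketch (not from the original answer) illustrating the difference between the two approaches: the modulo mask is positional, so it works with any index labels, while df.drop(df.index[::4]) drops by label and therefore relies on the index values being unique, as noted above. The index range below is made up for illustration.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':range(10, 30)}, index=range(100, 120))
#positions 0, 4, 8, ... are removed regardless of the index labels
b = df[np.mod(np.arange(df.index.size),4)!=0]
#dropping by label needs unique labels; here they are unique, so both agree
c = df.drop(df.index[::4])
print (b.equals(c))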

Related

Python dataframes - How to track recent lowest value

I have a data frame called data, like below:
SN value
1 895.1
2 900.94
3 920.26
4 918.9
5 927.23
6 919.32
7 923.33
8 896.42
9 898.72
10 881.03
11 879.56
12 882.68
13 879.13
14 901.05
15 905.84
16 932.68
17 940.74
I need to keep track of the recent lowest value.
I tried:
# creating new column to track recent low and initializing all values to first value. This is to update it if a new low value is found
data['new_low'] = data.iat[0,0]
#creating another column to compare last two rows and record whether it is new low or not
data['is_new_low'] = data.value.lt(data.new_low)
#if new low is found, make current value as the new low, otherwise keep the previous value
data['new_low'] = np.where((data['is_new_low']== True),data.value,data.new_low.shift())
This code works for one pass, but when the trend flips for the second time it does not update. Refer to the row with SN 14.
My Code out put
SN value new_low is_new_low
1 895.1 NaN FALSE
2 900.94 895.1 FALSE
3 920.26 895.1 FALSE
4 918.9 895.1 FALSE
5 927.23 895.1 FALSE
6 919.32 895.1 FALSE
7 923.33 895.1 FALSE
8 896.42 895.1 FALSE
9 898.72 895.1 FALSE
10 881.03 881.03 TRUE
11 879.56 879.56 TRUE
12 882.68 882.68 TRUE
13 879.13 879.13 TRUE
14 901.05 895.1 FALSE #Here it should be 879.13. But 895.1 is coming
15 905.84 895.1 FALSE
16 850.2 895.1 TRUE
17 870.14 895.1 TRUE
Desired output is
SN value new_low is_new_low
1 895.1 NaN FALSE
2 900.94 895.1 FALSE
3 920.26 895.1 FALSE
4 918.9 895.1 FALSE
5 927.23 895.1 FALSE
6 919.32 895.1 FALSE
7 923.33 895.1 FALSE
8 896.42 895.1 FALSE
9 898.72 895.1 FALSE
10 881.03 881.03 TRUE
11 879.56 879.56 TRUE
12 882.68 882.68 TRUE
13 879.13 879.13 TRUE
14 901.05 879.13 FALSE
15 905.84 879.13 FALSE
16 850.2 850.2 TRUE
17 870.14 870.14 FALSE
How do I achieve this?
You can use df.itertuples() to iterate through all your rows and update the new_low value. For every row, check if its value is less than the new_low value in the previous row. If yes, set the new_low value to the current value; otherwise set it to the new_low value of the previous row.
for row in data.itertuples():
    if row.Index:
        if data.at[row.Index, "value"] < data.at[row.Index - 1, "new_low"]:
            data.at[row.Index, "new_low"] = data.at[row.Index, "value"]
        else:
            data.at[row.Index, "new_low"] = data.at[row.Index - 1, "new_low"]
You can also do something like this:
df['new_low'] = df['value'].cummin()
And if you'd like to use numpy for better performance, you can use:
np.minimum.accumulate(df.value)
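A minimal sketch tying the cummin idea back to the question's columns (assuming a frame with a 'value' column; note the first row's new_low comes out as the value itself rather than NaN as in the desired output):
import pandas as pd
data = pd.DataFrame({'value': [895.1, 900.94, 881.03, 879.56, 901.05]})
#running minimum so far - this is the new_low column
data['new_low'] = data['value'].cummin()
#True where the current value is lower than every earlier value
data['is_new_low'] = data['value'].lt(data['new_low'].shift())
print (data)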

In pyspark generate minimum value for a window partition based on two column value, variable and consecutive negative values

I created a DataFrame with a column 'a' which has a mix of positive and negative values:
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],
"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2],
"pos_neg": ['false','true','false','true','true','true','true','true','false','false','true','false','false','false','true','false','false','false','true','true'],
"neg_val_count":[0,1,1,2,1,1,1,1,1,0,1,1,1,1,2,2,2,2,3,3]})
df2=spark.createDataFrame(df)
The column 'pos_neg' represents whether the field in 'a' is positive or negative; if negative, it is true. 'neg_val_count' is a counter for negative values within the fields for variable 'b'. Every time the variable 'b' changes, the counter resets, and consecutive negative values are counted as a single one. Hence for variable 'B' (in column 'b') the counter is one even though there are three negative values.
I would like to generate a column which holds the minimum value for a combination of a variable in 'b' (say A) and the values in 'a' (for true cases between two false ones). For instance, for the first combination of 'A' and true the value will be -4 (it is surrounded by false); for the second combination of 'A' and true the value will be -1; for 'B' there are three consecutive true rows, so the value will be the least of them, -7. Basically, consecutive negative values are taken as one group and the minimum is taken out of them. 'expected value' refers to the outcome required:
b Sno a pos_neg neg_val_count expected value
0 A 1 3 false 0 3
1 A 2 -4 true 1 -4
2 A 3 2 false 1 2
3 A 4 -1 true 2 -1
4 B 5 -3 true 1 -7
5 B 6 -1 true 1 -7
6 B 7 -7 true 1 -7
7 C 8 -6 true 1 -6
8 C 9 1 false 1 1
9 D 10 1 false 0 1
10 D 11 -1 true 1 -1
11 D 12 1 false 1 1
12 D 13 4 false 1 4
13 D 14 5 false 1 5
14 D 15 -3 true 2 -3
15 D 16 2 false 2 2
16 D 17 3 false 2 3
17 D 18 4 false 2 4
18 D 19 -1 true 3 -2
19 D 20 -2 true 3 -2
I tried using the following, but it is not working; any support in this regard will be great.
w3 = Window.partitionBy('b','pos_neg').rowsBetween(Window.unboundedPreceding, 0).orderBy('Sno')
df2.withColumn('new_col', F.min('a').over(w3))
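A possible sketch, assuming the df2 created above (the helper columns change, grp and run_min are illustrative names, not part of the question): partitioning by 'b' and 'pos_neg' lumps non-adjacent runs together, so first build an id for each run of equal pos_neg values with lag plus a running sum, then take the window minimum per run.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('b').orderBy('Sno')
#1 where pos_neg differs from the previous row within the same 'b', else 0
change = (F.when(F.lag('pos_neg').over(w).isNull(), 0)
           .when(F.col('pos_neg') != F.lag('pos_neg').over(w), 1)
           .otherwise(0))
df3 = df2.withColumn('change', change)
#running sum of the change flags gives an id for each run of equal pos_neg values
df3 = df3.withColumn('grp', F.sum('change').over(w))
#minimum of 'a' inside each run; positive (false) rows just keep their own value
w_grp = Window.partitionBy('b', 'grp')
df3 = (df3.withColumn('run_min', F.min('a').over(w_grp))
          .withColumn('new_col', F.when(F.col('pos_neg') == 'true', F.col('run_min'))
                                  .otherwise(F.col('a'))))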

Search value in Next Month Record Pandas

Given that I have a df like this:
ID Date Amount
0 a 2014-06-13 12:03:56 13
1 b 2014-06-15 08:11:10 14
2 a 2014-07-02 13:00:01 15
3 b 2014-07-19 16:18:41 22
4 b 2014-08-06 09:39:14 17
5 c 2014-08-22 11:20:56 55
...
129 a 2016-11-06 09:39:14 12
130 c 2016-11-22 11:20:56 35
131 b 2016-11-27 09:39:14 42
132 a 2016-12-11 11:20:56 18
I need to create a column df['Checking'] to show whether the ID will appear in the next month or not, and I tried the code below:
df['Checking'] = df.apply(lambda x: check_nextmonth(x.Date, x.ID), axis=1)
where
def check_nextmonth(date, id):
    x = id in df['user_id'][df['Date'].dt.to_period('M') ==
                            (date + relativedelta(months=1)).to_period('M')].values
    return x
but it takes too long to process a single row.
How can I improve this code, or is there another way to achieve what I want?
Using pd.to_datetime with some timestamp tricks:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['tmp'] = (df['Date'] - pd.DateOffset(months=1)).dt.month
s = df.groupby('ID').apply(lambda x:x['Date'].dt.month.isin(x['tmp']))
df['Checking'] = s.reset_index(level=0)['Date']
Output:
ID Date Amount tmp Checking
0 a 2014-06-13 12:03:56 13 5 True
1 b 2014-06-15 08:11:10 14 5 True
2 a 2014-07-02 13:00:01 15 6 False
3 b 2014-07-19 16:18:41 16 6 True
4 b 2014-08-06 09:39:14 17 7 False
5 c 2014-08-22 11:20:56 18 7 False
Here's one method of doing it: within each ID group, check whether the next row's month equals the current month + 1, then assign the result back after sorting by ID.
check = df.groupby('ID').apply(lambda x : x['Date'].dt.month.shift(-1) == x['Date'].dt.month+1).stack().values
df = df.sort_values('ID').assign( checking = check).sort_index()
ID Date Amount checking
0 a 2014-06-13 12:03:56 13 True
1 b 2014-06-15 08:11:10 14 True
2 a 2014-07-02 13:00:01 15 False
3 b 2014-07-19 16:18:41 16 True
4 b 2014-08-06 09:39:14 17 False
5 c 2014-08-22 11:20:56 18 False
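Another possible sketch (not from the answers above) that sidesteps the month arithmetic at year boundaries: collect the (ID, month) periods that exist and look up each row's (ID, next month); the helper name existing is just illustrative.
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
months = df['Date'].dt.to_period('M')
#every (ID, month) combination present in the data
existing = set(zip(df['ID'], months))
#True when the same ID also appears somewhere in the following month
df['Checking'] = [(i, m + 1) in existing for i, m in zip(df['ID'], months)]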

pandas dataframe row-wise iteration with referencing previous row values for conditional matching

I have to find out how many times a bike was overspeeding, and in each instance for how long (for simplicity, for how many kms).
df = pd.DataFrame({'bike':['b1']*15, 'km':list(range(1,16)), 'speed':[20,30,38,33,28,39,26,33,35,46,53,27,37,42,20]})
>>> df
bike km speed
0 b1 1 20
1 b1 2 30
2 b1 3 38
3 b1 4 33
4 b1 5 28
5 b1 6 39
6 b1 7 26
7 b1 8 33
8 b1 9 35
9 b1 10 46
10 b1 11 53
11 b1 12 27
12 b1 13 37
13 b1 14 42
14 b1 15 20
#Expected result is
bike last_OS_loc for_how_long_on_OS
b1 4 2km
b1 11 5km
b1 15 1km
Now the logic:
Flag speed >= 30 as Overspeed_Flag.
If the speed stays above 30 for 1 km or more, that continuation is treated as an overspeed session (e.g. when b1 was between km 2 to 4, km 6 to 11, km 13 to 14; note it was NOT an overspeed session when b1 was at km 6 alone, as that was only a single row with no continuation above 30).
Then, for each session, measure how long / for how many kms the bike stayed above the overspeed limit. Refer to the expected result above.
Also find, for each overspeed session, what the last km mark was.
Kindly suggest how I can achieve this, and do let me know if anything is unclear in the question.
P.S.: I am also trying myself, but it is a little complex for me (pretty confused on how to mark whether a row is a continuation of the overspeed flag or a single instance). I will get back if I succeed. Thanks in advance.
You can use:
#boolean mask
mask = df['speed'] >= 30
#consecutive groups
df['g'] = mask.ne(mask.shift()).cumsum()
#get size of each group
df['count'] = mask.groupby(df['g']).transform('size')
#filter by mask and remove unique rows
df = df[mask & (df['count'] > 1)]
print (df)
bike km speed g count
1 b1 2 30 2 3
2 b1 3 38 2 3
3 b1 4 33 2 3
7 b1 8 33 6 4
8 b1 9 35 6 4
9 b1 10 46 6 4
10 b1 11 53 6 4
12 b1 13 37 8 2
13 b1 14 42 8 2
#aggregate first and last values
df1 = df.groupby(['bike','g'])['km'].agg([('last_OS_loc', 'last'),
('for_how_long_on_OS','first')])
#substract last with first
df1['for_how_long_on_OS'] = df1['last_OS_loc'] - df1['for_how_long_on_OS']
#data cleaning
df1 = df1.reset_index(level=1, drop=True).reset_index()
print (df1)
bike last_OS_loc for_how_long_on_OS
0 b1 4 2
1 b1 11 3
2 b1 14 1
EDIT:
print (pd.concat([mask,
mask.shift(),
mask.ne(mask.shift()),
mask.ne(mask.shift()).cumsum()], axis=1,
keys=('mask', 'shifted', 'not equal (!=)', 'cumsum')))
mask shifted not equal (!=) cumsum
0 False NaN True 1
1 True False True 2
2 True True False 2
3 True True False 2
4 False True True 3
5 True False True 4
6 False True True 5
7 True False True 6
8 True True False 6
9 True True False 6
10 True True False 6
11 False True True 7
12 True False True 8
13 True True False 8
14 False True True 9
Here is another approach using a couple of helper Series and a lambda func:
os_session = (df['speed'].ge(30) & (df['speed'].shift(-1).ge(30) | df['speed'].shift().ge(30))).astype(int)
groups = (os_session.diff(1) != 0).astype('int').cumsum()
f_how_long = lambda x: x.max() - x.min()
grouped_df = (df.groupby([os_session, groups, 'bike'])['km']
.agg([('last_OS_loc', 'max'),
('for_how_long_on_OS',f_how_long)])
.xs(1, level=0)
.reset_index(level=0, drop=True))
print(grouped_df)
last_OS_loc for_how_long_on_OS
bike
b1 4 2
b1 11 3
b1 14 1
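Either way, the other part of the question, how many times the bike was overspeeding, is simply the number of rows in the aggregated result:
print (len(grouped_df))    #3 overspeed sessions for the sample data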

How to use the na_values='?' option in the pd.read_csv() function?

I am trying to understand how the na_values='?' option of the pd.read_csv() function works, so that I can find the list of rows containing the "?" value and then remove that value.
Sample:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv')
df = pd.read_csv(StringIO(temp))
print (df)
id col1 col2 col3
0 1 13? 15 14
1 1 13 15 ?
2 1 12 15 13
3 2 ? 15 ?
4 2 18 15 13
5 2 18? 15 13
If you want to remove values containing ?, whether standalone or as a substring, you need a mask created by str.contains, then check whether there is at least one True per row with DataFrame.any:
print (df.astype(str).apply(lambda x: x.str.contains('?', regex=False)))
id col1 col2 col3
0 False True False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False True False False
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
print (m)
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
2 1 12 15 13
4 2 18 15 13
If you want to match only cells that are exactly ?, simply compare the values:
print (df.astype(str) == '?')
id col1 col2 col3
0 False False False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False False False False
m = ~(df.astype(str) == '?').any(axis=1)
print (m)
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
0 1 13? 15 14
2 1 12 15 13
4 2 18 15 13
5 2 18? 15 13
To replace all ? values with NaN, the na_values parameter is what you need; then use dropna if you want to remove all rows with NaNs:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv', na_values='?')
df = pd.read_csv(StringIO(temp), na_values='?')
print (df)
id col1 col2 col3
0 1 13? 15 14.0
1 1 13 15 NaN
2 1 12 15 13.0
3 2 NaN 15 NaN
4 2 18 15 13.0
5 2 18? 15 13.0
df = df.dropna()
print (df)
id col1 col2 col3
0 1 13? 15 14.0
2 1 12 15 13.0
4 2 18 15 13.0
5 2 18? 15 13.0
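Note that na_values only matches whole cell values, which is why 13? and 18? survive above. If those embedded ? characters should be cleaned as well, one option (a sketch continuing from the frame just read, not part of the original answer) is to strip them afterwards and convert the column back to numbers:
#remove embedded '?' from col1, then convert to numeric (invalid values become NaN)
df['col1'] = pd.to_numeric(df['col1'].astype(str).str.replace('?', '', regex=False), errors='coerce')
print (df)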
Create a list of the junk values and pass it when reading the file:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('some-data.csv', na_values=na_values)
"??" or "####" type of junk values can be converted into missing value, since in python all the blank values can be replaced with nan. Hence you can also replace these type of junk value to missing value by passing them as as list to the parameter
'na_values'.
data_csv = pd.read_csv('test.csv',na_values = ["??"])
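na_values also accepts a dict mapping column names to lists of values, which is handy when a placeholder such as "??" is junk only in some columns (a sketch; the column names here are hypothetical):
data_csv = pd.read_csv('test.csv', na_values={'col1': ['??'], 'col3': ['??', '####']})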
If you want to remove the rows which contain "?" in a pandas dataframe, you can try the following.
Suppose you have this df:
import pandas as pd
df = pd.read_csv('test.csv')
df:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
3 test?dsfsa 9/15/2016
Check whether column A contains "?" to generate the new df1:
df1 = df[~df.A.str.contains(r"\?")]
df1 will be:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
which gives you the new df1 that doesn't contain "?".
