Pandas DataFrame - SettingWithCopyWarning - python-3.x

I would like to forward-fill an hourly dataframe so that the value for hour 1 gets forward-filled into every hour 1 on the following days, and the same for each of the 24 hours.
The dataframe looks like this:
Timestamp input1 input2 input3
… … … …
01.01.2018 00:00 2 5 4
01.01.2018 01:00 3 3 2
01.01.2018 02:00 5 6 1
…
01.01.2018 22:00 2 0 1
01.01.2018 23:00 5 3 3
02.01.2018 00:00 6 2 5
02.01.2018 01:00 3 6 4
02.01.2018 02:00 3 9 6
02.01.2018 03:00 5 1 7
…
02.01.2018 23:00 2 5 1
03.01.2018 00:00 NaN NaN NaN
…
03.01.2018 23:00 NaN NaN NaN
I am using the following code for this:
for hr in range(0, 24):
    df.loc[df.index.hour == hr, Inputs] = df.loc[df.index.hour == hr, Inputs].fillna(method='ffill')
This works.
Unfortunately I am getting a warning message:
\Python\WPy-3670_32bit\python-3.6.7\lib\site-packages\pandas\core\indexing.py:543: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
How can I solve this so that I no longer get the warning?
The resulting df should have the NaNs filled.

I executed your code and didn't get the mentioned warning (nor any other).
Using .loc is exactly the right way to avoid such a warning (as the message itself says).
Maybe you are using an older version of Pandas? Upgrade to 0.25 if you have an older version and try again.
Another suspicion: maybe the warning pertains to some other instruction in your code (one without .loc)?
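For reference, a minimal sketch of the pattern that usually triggers the warning (chained indexing) next to the .loc form that avoids it; the frame here is made up:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1, 2, 3]})

# Chained indexing: the slice may be a copy, so the assignment can be lost.
# This line raises SettingWithCopyWarning in many pandas versions.
sub = df[df["b"] > 1]
sub["a"] = 0

# A single .loc call writes straight into the original frame, no warning.
df.loc[df["b"] > 1, "a"] = 0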

This works:
df[df.index.hour == hr] = df[df.index.hour == hr].fillna(method="ffill")
Very similar to .loc, but it does not tend to raise as many SettingWithCopy warnings.
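As an aside, the per-hour loop can be avoided entirely by grouping on the hour of the index and forward-filling within each group. A minimal sketch with made-up data (the question's frame would use its Inputs columns instead of "input1"):
import pandas as pd
import numpy as np

# hypothetical frame: three days of hourly data, third day all-NaN
idx = pd.date_range("2018-01-01", periods=72, freq="H")
df = pd.DataFrame({"input1": np.arange(72.0)}, index=idx)
df.iloc[48:] = np.nan

# forward-fill within each hour-of-day group, no explicit loop
df["input1"] = df.groupby(df.index.hour)["input1"].ffill()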

Related

how to update rows based on previous row of dataframe python

I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data, and I have just added a simplified version with two columns {price, amount}. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes of each product based on the time index. If a previous date does not exist for a given product, I add "NaN".
Is there a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
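For reference, a self-contained sketch that reproduces the sample above (values copied from the question):
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-01", "2019-11-02", "2019-11-03",
                            "2019-11-04", "2019-11-05"]),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})
# per-product change versus the previous row; the first row of each group is NaN
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
print(df)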

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time-series data. What I want to do is efficiently fill all the missing values in the different columns by substituting the median of the values within a timedelta of, say, "N" minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indices which have a NaN value
    for each index with a NaN value:
        extract all values using between_time with index - timedelta and index + timedelta
        find the median of the extracted values
        set the value at that index to the extracted median
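Spelled out, that naive approach might look like this sketch (assuming df has a DatetimeIndex; the name delta is illustrative):
import pandas as pd

delta = pd.Timedelta("2min")  # the "N mins" window
for col in df.columns:
    for ts in df.index[df[col].isna()]:
        # all values of this column within +/- delta of the missing stamp;
        # median() skips the NaN itself by default
        window = df.loc[ts - delta: ts + delta, col]
        df.loc[ts, col] = window.median()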
This looks like two nested for loops and not a very efficient approach. Is there an efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with a centered rolling window:
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
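Note that the question asks for a median while the sketch above fills with the rolling mean; swapping .mean() for .median() gives the median variant:
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))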

How to merge timestamps that are only few seconds apart [Pandas]

I have this dataframe with shape 22341x3:
tID DateTime
0 1 2020-04-04 10:15:40
1 2 2020-04-04 10:15:56
2 2 2020-04-04 11:07:11
3 3 2020-04-04 11:08:14
4 3 2020-04-04 11:18:46
5 4 2020-04-04 11:23:56
6 5 2020-04-04 11:24:14
7 6 2020-04-04 11:29:12
8 7 2020-04-04 11:29:23
9 8 2020-04-04 11:34:23
Now I have to create a column called merged_timestamp that merges all the timestamps that are only a few seconds apart and gives them a new number: mtID
So for example: if we consider 2020-04-04 10:15:40 as a reference, the timestamps a few seconds apart can run from 40 seconds up to 44 seconds. They can have hours and minutes with a big gap compared to the reference, but their seconds should be only a few seconds apart in order to be merged.
Any help would be appreciated.
EDIT: I tried doing dfd.resample('5s')[0:5] where dfd is my dataframe. It gives me this error TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
resample works on the index, so make the datetime your index:
df.index = pd.to_datetime(df['DateTime'])
then you can resample with:
df.resample('5s').count()
or some other aggregation, depending on what you are trying to do. You could then drop the rows that you're not interested in.
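If the aim is the common one of clustering rows whose consecutive timestamps are at most a few seconds apart, a sketch along these lines would assign the mtID (the 5-second threshold is an assumption):
import pandas as pd

df["DateTime"] = pd.to_datetime(df["DateTime"])
df = df.sort_values("DateTime")
gap = pd.Timedelta(seconds=5)               # assumed threshold
new_cluster = df["DateTime"].diff() > gap   # True where a new cluster starts
df["mtID"] = new_cluster.cumsum() + 1       # running cluster number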

Select two or more consecutive rows based on a criteria using python

I have a data set like this:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
A 2019-01-01 11.18 TX 234567 3
B 2019-01-02 12.19 WA 456789 4
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
B 2019-01-02 12.50 DC 157890 7
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
A 2019-01-04 09:40 CA 234567 11
In this data set I want to compare and select two or more consecutive rows which fit the following criteria:
User should be same
Time difference should be less than 15 mins
Cookie should be different
So if I apply the filter I should get the following data:
user time city cookie index
A 2019-01-01 11.00 NYC 123456 1
A 2019-01-01 11.12 CA 234567 2
B 2019-01-02 12.21 FL 456789 5
B 2019-01-02 12.31 VT 987654 6
A 2019-01-03 09:12 CA 123456 8
A 2019-01-03 09:27 NYC 345678 9
A 2019-01-03 09:34 TX 123456 10
So, in the above, the first two rows (indexes 1 and 2) satisfy all the conditions above. The next two (indexes 2 and 3) have the same cookie, indexes 3 and 4 have different users, 5 and 6 are selected and displayed, and 6 and 7 have a time difference of more than 15 minutes. 8, 9 and 10 fit the criteria, but 11 doesn't, as the date is 24 hours apart.
How can I solve this using a pandas dataframe? All help is appreciated.
What I have tried:
I tried creating flags using shift():
cookiediff=pd.DataFrame(df.Cookie==df.Cookie.shift())
cookiediff.columns=['Cookiediffs']
timediff=pd.DataFrame(pd.to_datetime(df.time) - pd.to_datetime(df.time.shift()))
timediff.columns=['timediff']
mask = df.user != df.user.shift(1)
timediff.timediff[mask] = np.nan
cookiediff['Cookiediffs'][mask] = np.nan
This will do the trick:
import numpy as np

# the sample data mixes "." and ":" as the time delimiter; normalize it first
df["time"] = df["time"].str.replace(":", ".")
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H.%M")

cond_ = np.logical_or(
    # row matches its predecessor: <15 min apart, same user, different cookie
    df["time"].sub(df["time"].shift()).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift())
    & df["cookie"].ne(df["cookie"].shift()),
    # row matches its successor; note the subtraction order, so the
    # difference is positive and the <15 min check is meaningful
    df["time"].shift(-1).sub(df["time"]).astype('timedelta64[m]').lt(15)
    & df["user"].eq(df["user"].shift(-1))
    & df["cookie"].ne(df["cookie"].shift(-1)),
)
res = df.loc[cond_]
A few points: you need to ensure your time column is datetime in order to make the 15-minute condition verifiable.
Then, the final filter (cond_) is obtained by comparing each row to the previous one, checking all 3 conditions, OR by doing the same against the next one (otherwise you would just get all the consecutive matching rows except the first of each run).
Outputs:
user time city cookie index
0 A 2019-01-01 11:00:00 NYC 123456 1
1 A 2019-01-01 11:12:00 CA 234567 2
4 B 2019-01-02 12:21:00 FL 456789 5
5 B 2019-01-02 12:31:00 VT 987654 6
7 A 2019-01-03 09:12:00 CA 123456 8
8 A 2019-01-03 09:27:00 NYC 345678 9
9 A 2019-01-03 09:34:00 TX 123456 10
You could use regular expressions to isolate the fields. With named groups and the groupdict() function, each line can be parsed into a dictionary, and the values of the previous line's dictionary can be compared with the current one. So iterate through the dataset line by line, keeping two dictionaries (the current one and the last one), perform a re.search() with the pattern on each line to split it into named fields, then compare the values of the two dictionaries.
So, something like:
import re
c_dict=re.search('(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}\.\d{2}) +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)',s).groupdict()
for each line of your dataset. For the first line of your dataset, this would create the dictionary {'user': 'A', 'time': '2019-01-01 11.00', 'city': 'NYC', 'cookie': '123456', 'index': '1'}. With the fields isolated, you could easily compare the values of the fields to previous lines if you stored those in another dictionary.
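A minimal sketch of that loop, assuming the rows live in a plain string named raw (both the variable name and the comparison shown are illustrative):
import re

pattern = re.compile(
    r'(?P<user>\w) +(?P<time>\d{4}-\d{2}-\d{2} \d{2}[.:]\d{2})'
    r' +(?P<city>\w+) +(?P<cookie>\d{6}) +(?P<index>\d+)')

last = None
for line in raw.splitlines():
    m = pattern.search(line)
    if m is None:
        continue                      # skip the header line
    current = m.groupdict()
    if last and current["user"] == last["user"] \
            and current["cookie"] != last["cookie"]:
        # parse current["time"] and last["time"] here and keep the
        # pair when they are less than 15 minutes apart
        pass
    last = current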

Why are there many "NaN"s in the index after importing a MultiIndex DataFrame from an Excel file?

I have an Excel file that looks like this in Excel:
2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 5 1
pm 3 2 4 5
300200 am 2 5 2 6
pm 5 1 3 7
300300 am 1 6 3 2
pm 3 7 2 3
300400 am 3 1 1 3
pm 2 5 5 2
300500 am 1 6 6 1
pm 5 7 7 5
But after I imported it with pd.read_excel and printed it, it was displayed like this in Python:
2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 5 1
NaN pm 3 2 4 5
300200 am 2 5 2 6
NaN pm 5 1 3 7
300300 am 1 6 3 2
NaN pm 3 7 2 3
300400 am 3 1 1 3
NaN pm 2 5 5 2
300500 am 1 6 6 1
NaN pm 5 7 7 5
How can I solve this to make the DataFrame look like it does in Excel, without so many "NaN"s? Thanks!
Most of the time when Excel looks like what you have in your example, it does actually have blanks where those spaces are. But, the cells are merged, so it looks pretty. When you import it into pandas, it reads them as empty or NaN.
To fix it, forward-fill the empty cells, then set them as the index:
df.ffill()
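A fuller sketch of that fix, using positional column references since the two leading columns typically come in unnamed ("data.xlsx" is a placeholder file name):
import pandas as pd

df = pd.read_excel("data.xlsx")
df.iloc[:, 0] = df.iloc[:, 0].ffill()               # fill the merged id column down
df = df.set_index([df.columns[0], df.columns[1]])   # id + am/pm as a MultiIndex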
Without access to the Excel files or knowledge of the versions it's impossible to be sure, but it just looks like you have a column of numbers (the first column) with every other row blank. Pandas expects uniformly filled columns, so while in Excel you have a sort of "structure" of the information for both AM and PM for each first-column number (id?), Pandas just sees two rows, one with an invalid first column. Depending on how you actually want to access this data, an easy fix would be to replace every NaN with the number directly above it, so each row contains either the AM or PM information for the "id". Another fix would be to change your column structure to have 2016-1-1-am and 2016-1-1-pm fields.
You're looking for the fillna method:
df = df.fillna('')
