I'm using pandas to process a table, and I'm looking to add two columns based on the CardCode column, with one condition. I have a user table with three columns (CardCode / Id_customer / Id_group), and I would like to add two columns (customer_id and group_id) to my first table according to the CardCode.
Here is my first table (CSV file):
CardCode ItemCode ItemCodeP Amount Price Currency Discount ListNum FromDate ToDate Type ReducType KeyItem
0 C8500165 BTHC48 BTHC48 1 65,000000 EUR ,000000 2 2018-10-18 00:00:00 2050-12-31 00:00:00 SPP2 amount BTHC48_C8500165_SPP2
1 C8500165 BTHC48 BTHC48 5 59,000000 EUR ,000000 2 2018-10-18 00:00:00 2050-12-31 00:00:00 SPP2 amount BTHC48_C8500165_SPP2
2 C8500165 BTHC48 BTHC48 10 49,000000 EUR ,000000 2 2018-10-18 00:00:00 2050-12-31 00:00:00 SPP2 amount BTHC48_C8500165_SPP2
3 C1400164 BTHC48 BTHC48 1 65,000000 EUR ,000000 2 2018-10-18 00:00:00 2050-12-31 00:00:00 SPP2 amount BTHC48_C1400164_SPP2
4 C1400164 BTHC48 BTHC48 5 59,000000 EUR ,000000 2 2018-10-18 00:00:00 2050-12-31 00:00:00 SPP2 amount BTHC48_C1400164_SPP2
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99994 C9204154 369398 369398 1 445,980000 EUR 30,000000 2 1980-01-01 00:00:00 2050-12-31 00:00:00 OEDG-52 percentage 369398_C9204154_OEDG-52
99995 C7300423 69031190 69031190 1 77,220000 EUR 20,000000 2 1980-01-01 00:00:00 2050-12-31 00:00:00 OEDG-52 percentage 69031190_C7300423_OEDG-52
99996 C3800239 50001160 50001160 1 -1,000000 EUR 40,000000 0 1980-01-01 00:00:00 2050-12-31 00:00:00 OEDG-52 percentage 50001160_C3800239_OEDG-52
99997 C0200028 000008309 000008309_I0450 1 779,440000 EUR 20,000000 2 1980-01-01 00:00:00 2050-12-31 00:00:00 OEDG-52 percentage 000008309_I0450_C0200028_OEDG-52
99998 C0700024 000008309 000008309_I1000 1 779,440000 EUR 40,000000 2 1980-01-01 00:00:00 2050-12-31 00:00:00 OEDG-52 percentage 000008309_I1000_C0700024_OEDG-52
My customers array:
0 1 2
0 C6710024 1 10
1 C0100003 7 10
2 C0100008 8 10
3 C0100048 9 10
4 C0100078 11 10
... ... ... ...
1899 C4400373 2798 10
1900 C7800620 2801 10
1901 C6303124 2802 10
1902 C4600023 2808 10
1903 C0600345 2811 10
Note: there are several identical CardCodes in the first table.
Thank you for your help.
If I got the question right, you want to add two new columns (customer_id and group_id) to the first table from your CSV file, looked up by CardCode from the customers table?
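If so, a left merge on CardCode should do it. A minimal sketch, with illustrative file names ('prices.csv' and 'customers.csv' stand in for your actual sources):

import pandas as pd

prices = pd.read_csv('prices.csv')
customers = pd.read_csv('customers.csv', header=None,
                        names=['CardCode', 'customer_id', 'group_id'])

# Left join: every row of the first table keeps its CardCode and picks up
# customer_id / group_id from the lookup table. Repeated CardCodes on the
# left side are fine for a many-to-one merge.
prices = prices.merge(customers, on='CardCode', how='left')

Rows whose CardCode has no match in the customers table end up with NaN in the two new columns.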
I want to simulate battery charging data.
Imagine a battery with a constant capacity, e.g. 30000. In the real world a person will start charging it at a random time between 18:00 and 18:30, so sometimes he starts at 18:29 and sometimes at 18:00; the half-hourly values therefore vary with the start time, but the total amount charged doesn't change.
index value
0 2021-01-01 00:00:00 0
1 2021-01-01 00:30:00 0
2 2021-01-01 01:00:00 0
3 2021-01-01 01:30:00 0
4 2021-01-01 02:00:00 0
... ... ...
995 2021-01-21 17:30:00 0
996 2021-01-21 18:00:00 0
997 2021-01-21 18:30:00 0
998 2021-01-21 19:00:00 0
999 2021-01-21 19:30:00 0
1000 2021-01-21 20:00:00 0
So, if the charging speed is 5000 per half hour, the inserted values sometimes look like [10, 5000, 5000, 5000, 5000, 5000, 4990] and sometimes like [2500, 5000, 5000, 5000, 5000, 5000, 2500].
I want to generate such a pattern and insert it at a given time:
index value
0 2021-01-01 00:00:00 0
1 2021-01-01 00:30:00 0
2 2021-01-01 01:00:00 0
3 2021-01-01 01:30:00 0
4 2021-01-01 02:00:00 0
... ... ...
995 2021-01-21 17:30:00 0
996 2021-01-21 18:00:00 2500
997 2021-01-21 18:30:00 5000
998 2021-01-21 19:00:00 5000
999 2021-01-21 19:30:00 5000
1000 2021-01-21 20:00:00 5000
1001 2021-01-21 20:30:00 2500
1002 2021-01-21 21:00:00 0
Assume he charges around the time given by the start parameter: if start is '2021-01-01 18:00', charging begins somewhere between 18:00 and 18:30.
The function I want:
def insertPattern(emptyTimeseriesDF, capacity, speed, start):
    return dfWithInsertedPattern
The empty time series is generated by:
import datetime
import pandas as pd

index = pd.date_range(datetime.datetime(2021, 1, 1), periods=1000, freq='30min')
columns = ['value']
df = pd.DataFrame(index=index, columns=columns)
df = df.fillna(0)
df = df.reset_index()
df
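No answer is recorded here, but one way to build such a function is to draw a random offset inside the starting half hour, give the first and last slots the two partial amounts, and fill everything in between at full speed. A minimal sketch against the frame generated above (insertPattern follows the requested signature; the rest is an assumption, not a library function):

import numpy as np
import pandas as pd

def insertPattern(emptyTimeseriesDF, capacity, speed, start):
    # Fraction of the first half hour that has already passed when
    # charging begins (uniform over the slot).
    frac = np.random.random()
    amounts = [(1 - frac) * speed]           # partial first slot
    remaining = capacity - amounts[0]
    while remaining > speed:                 # full-speed slots
        amounts.append(speed)
        remaining -= speed
    amounts.append(remaining)                # partial last slot

    out = emptyTimeseriesDF.copy()
    out['value'] = out['value'].astype(float)  # partial amounts are floats
    # Locate the row whose timestamp matches the requested start slot.
    pos = out.index[out['index'] == pd.Timestamp(start)][0]
    out.loc[pos:pos + len(amounts) - 1, 'value'] += amounts
    return out

df2 = insertPattern(df, 30000, 5000, '2021-01-21 18:00')

A draw of frac near 1 reproduces the [10, 5000, ..., 4990] shape, and frac = 0.5 the [2500, 5000, ..., 2500] shape; the amounts always sum to capacity.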
I have a situation where month and day are swapped for a few dates in my dataframe. For example, here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The date is stored as a string. As you can see, the date is in the format yyyy-dd-mm until the 12th of the month and then becomes yyyy-mm-dd. The dataframe contains three years' worth of data, and this pattern repeats across all months of all years.
My expected output is to standardize the dates to the format yyyy-mm-dd, like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote, and it gets the job done. Basically, I split the date string and do some string manipulation. However, as you can see, it's not too pretty. I'm checking to see whether there is a more elegant solution than the df.apply and the loops.
def func(x):
    d = x.split('-')
    print(d)
    if (int(d[1]) <= 12) & (int(d[2]) <= 12):
        d = [d[0], d[2], d[1]]
        x = '-'.join(d)
        return x
    else:
        return x

df['work_date'] = df['work_date'].apply(func)
You could just rebuild the column, relying on the fact that the rows are in order, there is exactly one row per date, and all days are included consecutively:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-15', freq='1D')
# You can pass df['work_date'].min(), df['work_date'].max(), or a string.
# It really depends on what format your minimum and maximum are in.
df
Out[1]:
work_date Date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also add some try/except handling (note the second fallback has to be nested: a second except ValueError on the same try block would never run):
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
    df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
    try:
        s = maxx.split('-')
        df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
    except ValueError:
        s = minn.split('-')
        df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df
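If you'd rather keep the explicit swap check but avoid the row-by-row apply, the same condition can be vectorized with pandas string methods. A sketch of the logic from func() above, nothing more:

import numpy as np

parts = df['work_date'].str.split('-', expand=True)
# Swap the last two fields whenever both could plausibly be a month,
# mirroring the condition in func() above.
swap = (parts[1].astype(int) <= 12) & (parts[2].astype(int) <= 12)
df['work_date'] = np.where(swap,
                           parts[0] + '-' + parts[2] + '-' + parts[1],
                           df['work_date'])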
I'm totally new to time series analysis and I'm trying to work through examples available online.
This is what I have currently:
import pandas as pd

# Time-based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-26 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
'4H',
kind='timestamp',
on='Datetime',
label='left'
).sum()
The function is very similar to groupby: it groups the data into chunks based on the column given in on=; in this case we group the timestamps into chunks of 4 hours.
Finally, you need some kind of aggregation, in this case sum(), to reduce each group to a single row per time chunk.
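If you also want the 4-Hour-window label from the expected output, one way is to derive it from the left edge of each resampled bin. A sketch, assuming pandas is imported as pd and the column names from the example above:

sampled_data = data.resample('4H', on='Datetime', label='left')['Count'].sum().reset_index()

# Label each bin with its start and end clock times.
start = sampled_data['Datetime'].dt.strftime('%H:%M:%S')
end = (sampled_data['Datetime'] + pd.Timedelta(hours=4)).dt.strftime('%H:%M:%S')
sampled_data['4-Hour-window'] = start + ' - ' + end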
I have a list of data with the total number of orders, and I would like to calculate the average number of orders per day of the week; for example, the average number of orders on Mondays.
0 2018-01-01 00:00:00 3162
1 2018-01-02 00:00:00 1146
2 2018-01-03 00:00:00 396
3 2018-01-04 00:00:00 848
4 2018-01-05 00:00:00 1624
5 2018-01-06 00:00:00 3052
6 2018-01-07 00:00:00 3674
7 2018-01-08 00:00:00 1768
8 2018-01-09 00:00:00 1190
9 2018-01-10 00:00:00 382
10 2018-01-11 00:00:00 3170
1. Make sure your date column is in datetime format (it looks like it already is).
2. Add a column converting the date to the day of the week.
3. Group by the day of week and take the average.
df['Date'] = pd.to_datetime(df['Date'])     # Step 1
df['DayofWeek'] = df['Date'].dt.day_name()  # Step 2
df.groupby(['DayofWeek']).mean()            # Step 3
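The sample has no column headers, so the names above are assumptions. If the frame contains other numeric columns, select just the order count before averaging, e.g. with the counts in a column called 'Orders' (again an assumed name):

df.groupby('DayofWeek')['Orders'].mean()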
I have a pd.DataFrame
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
In the df above, hour 22 doesn't show up. I want every hour included in the dataframe, like:
utc_time year month day weekday hour
0 2017-01-01 21:00:00 2017 1 1 7 21
0 2017-01-01 22:00:00 2017 1 1 7 22
1 2017-01-01 23:00:00 2017 1 1 7 23
2 2017-01-02 00:00:00 2017 1 2 1 0
3 2017-01-02 01:00:00 2017 1 2 1 1
How can I build a function to detect the missing hours and insert them into the dataframe?
IIUC, resample + ffill and bfill:
s = df.set_index('utc_time').resample('1H')
(s.ffill() + s.bfill()) / 2
Out[163]:
year month day weekday hour
utc_time
2017-01-01 21:00:00 2017 1 1 7 21
2017-01-01 22:00:00 2017 1 1 7 22
2017-01-01 23:00:00 2017 1 1 7 23
2017-01-02 00:00:00 2017 1 2 1 0
2017-01-02 01:00:00 2017 1 2 1 1
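Note that averaging the forward- and backward-fill only yields correct calendar values when a single hour is missing between two existing rows. For longer gaps, an alternative (a sketch, not part of the original answer) is to reindex against a complete hourly range and rebuild the calendar columns from the index:

import pandas as pd

full = pd.date_range(df['utc_time'].min(), df['utc_time'].max(), freq='1H')
out = df.set_index('utc_time').reindex(full)
out.index.name = 'utc_time'

# Recompute the calendar columns from the index itself instead of interpolating.
out['year'] = out.index.year
out['month'] = out.index.month
out['day'] = out.index.day
out['weekday'] = out.index.dayofweek + 1  # 1 = Monday ... 7 = Sunday, as in the sample
out['hour'] = out.index.hour
out = out.reset_index()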