why groupby does not work before i write dataframe to csv? - python-3.x

I have problem with "groupby" method, which works just fine in Pandas 1.1.5 and has problems in Pandas 1.3.2 (or i am doing something wrong). Please take a look at the code:
#! /usr/bin/env python3
import pandas as pd
import numpy as np
wallet = pd.DataFrame(columns=['Data','Ticker','Cost','Qty'])
if __name__=="__main__":
to_add=['17022019','pcr',10,10]
a_series = pd.Series(to_add, index = wallet.columns)
wallet = wallet.append(a_series, ignore_index=True)
to_add=['12042020','pcr',12,15]
a_series = pd.Series(to_add, index = wallet.columns)
wallet = wallet.append(a_series, ignore_index=True)
to_add=['19012021','peo',8,16]
a_series = pd.Series(to_add, index = wallet.columns)
wallet = wallet.append(a_series, ignore_index=True)
#wallet.to_csv('tmp.csv', sep = ';', encoding = "ISO-8859-2",index=False)
#wallet = pd.read_csv('tmp.csv', sep = ';', encoding = "ISO-8859-2")
#os.remove('tmp.csv')
print(wallet)
summary = wallet.groupby(['Ticker']).sum()
summary.reset_index()
print(summary)
Three lines before "print(wallet)" are in comment and lets keep them this way for a moment. Result for this code on Python 3.6.6 and Pandas 1.1.5 looks like that:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 1702201912042020 22 25
peo 19012021 8 16
So, pretty OK. If i "uncomment" these three lines, result looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 29064039 22 25
peo 19012021 8 16
Data in summary looks different, but it is OK. The real problem is on Python 3.9.6 and Pandas 1.3.2. With commented lines, output looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Empty DataFrame
Columns: []
Index: [pcr, peo]
So summary is empty dataframe with indexes only (why??? why does it not work? Is it my mistake, or error in pandas? and why there is difference between pandas 1.1.5 and 1.3.2???)
But when I uncomment these three commented lines, output looks like this:
Data Ticker Cost Qty
0 17022019 pcr 10 10
1 12042020 pcr 12 15
2 19012021 peo 8 16
Data Cost Qty
Ticker
pcr 29064039 22 25
peo 19012021 8 16
So again, as it should.
Questions are:
Why output looks differently between pandas 1.1.5 and 1.3.2?
Why if i just write dataframe to csv on hard-drive and read it again before making groupby on it it works normally, and without that i get result as empty dataframe?
Is there any other solution than writing it and reading again from hard-drive? maybe i can write it to some buffer in memory??
Am i doing something wrong with this groupby in pandas 1.3.2? This mechanism has been changed and i do not know about something?

Related

pandas computation on rolling 1 calendar month

I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I tried to capture a snippet with custom data for addressing the problem, below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data,sep="\s+", parse_dates=True)
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del(df['date'])
Now, to create a column 'max' to hold rolling last 1 calendar month's max() val, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
spendings
date
2021-04-25 40
2021-05-05 3
2021-05-15 2
(I could have followed the method using list comprehension here: https://stackoverflow.com/a/47199274/235415, but I would like to use panda's vectorized method. I have many DataFrames and each is very large - using list comprehension is very slow here).
Q: How to get the vectorized method of performing rolling 1 calendar month's max()?
The expected o/p, ie primarily the 'max' column (holding the max value of 'spendings' for last 1 calendar month) will be something like this:
>>> df
spendings max
date
2021-03-25 15 15
2021-04-05 20 20
2021-04-15 10 20
2021-04-25 40 40
2021-05-05 3 40
2021-05-15 2 40
2021-05-25 2 40
2021-05-27 1 3
The answer will be
[df.loc[x- pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
Out[53]: [15, 20, 20, 40, 40, 40, 40, 3]

Pandas dataframe not correct format for groupby, what is wrong?

I am trying to sum all columns based on the value of the first, but groupby.sum is unexpectedly not working.
Here is a minimal example:
import pandas as pd
data = [['Alex',10, 11],['Bob',12, 10],['Clarke',13, 9], ['Clarke',1, 1]]
df = pd.DataFrame(data,columns=['Name','points1', 'points2'])
print(df)
df.groupby('Name').sum()
print(df)
I get this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 13 9
3 Clarke 1 1
And not this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
From what i understand, the dataframe is not the right format for pandas to perform group by. I would like to understand what is wrong with it because this is just a toy example but i have the same problem with a real data-set.
The real data i'm trying to read is the John Hopkins University Covid-19 dataset:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
You forget assign output of aggregation to variable, because aggregation not working inplace. So in your solution print (df) before and after groupby returned same original DataFrame.
df1 = df.groupby('Name', as_index=False).sum()
print (df1)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
Or you can set to same variable df:
df = df.groupby('Name', as_index=False).sum()
print (df)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10

Analyse Logdata out of a cvs file all written in a long row

I have a set of a couple hundreds sensors whose data is recorded in a log file. Each measurement cycle of all sensors is written in one line of the log file in cvs format. I need to be able to structure the log file to make some analysis with plots and calculations of the values.
The format of the CVS is like the following:
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD[3..400]SensorID;ValueA;ValueB;ValueC;ValueD
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;[...]92345670;5.9;60.4;55;-33.6;
So I have a very long table with about 5000 or 6000 columns which contain my values but I'm not sure which is the right way to extract it in a easy way to perform some analysis. The table has about 600 rows.
I have written a report function in python with the help of pandas. The format I can already analyse is like the following:
Time;SensorID;ValueA;ValueB;ValueC;ValueD
11:12:123456;12345678;5.3;53.4;53;-36.6;
11:12:123457;12345679;5.5;55;54;-40;
So the time is slightly different and the Sensor ID will be different.
I use groupby(SensorID) and plots of the groupby and after that I perform some value_count() within some columns
If I understand you correct, each data line in the CSV contains ValueA - D for a set sensors, sharting the same ID and Time columns. Also your dataline ends with a semicolon which will throw pandas off.
[...]92345670;5.9;60.4;55;-33.6;
This answer leaves the semicolon stays in place since I assume you can change the process that produces the CSV file.
from io import StringIO
string = """
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD;
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;92345670;5.9;60.4;55;-33.6;
"""
df = pd.read_csv(StringIO(string), sep=';', engine='python')
# Shift the columns. We need this because the extra semicolon!
columns = df.columns[1:]
df = df.iloc[:, :-1]
df.columns = columns
df = df.set_index('Time')
# n is how many groups of sensor measurement are stored in each line
n = df.shape[1] // 5
idx = pd.MultiIndex.from_product([range(n), ['SensorID', 'ValueA', 'ValueB', 'ValueC', 'ValueD']])
result = df.stack(level=0).droplevel(-1).reset_index()
Result:
Time SensorID ValueA ValueB ValueC ValueD
0 11:12:123456 12345678 5.3 53.4 53 -36.6
1 11:12:123456 72345670 5.8 57.4 56 -39.6
2 11:12:123456 92345670 5.9 60.4 55 -33.6
Now you can send it to your analysis function.
thanks for thinking about it. Have you tested this code? I'm getting this table as output:
Time 0
0 11234 11:12:123456
1 11234 12345678
2 11234 5.3
3 11234 53.4
4 11234 53
5 11234 -36.6
6 11234 72345670
7 11234 5.8
8 11234 57.4
9 11234 56
10 11234 -39.6
11 11234 92345670
12 11234 5.9
13 11234 60.4
14 11234 55
15 11234 -33.6

Split dates into time ranges in pandas

14 [2018-03-14, 2018-03-13, 2017-03-06, 2017-02-13]
15 [2017-07-26, 2017-06-09, 2017-02-24]
16 [2018-09-06, 2018-07-06, 2018-07-04, 2017-10-20]
17 [2018-10-03, 2018-09-13, 2018-09-12, 2018-08-3]
18 [2017-02-08]
this is my data, every ID has it's own dates that range between 2017-02-05 and 2018-06-30. I need to split dates into 5 time ranges of 4 months each, so that for the first 4 months every ID should have dates only in that time range (from 2017-02-05 to 2017-06-05), like this
14 [2017-03-06, 2017-02-13]
15 [2017-02-24]
16 [null] # or delete empty rows, it doesn't matter
17 [null]
18 [2017-02-08]
then for 2017-06-05 to 2017-10-05 and so on for every 4 month ranges. Also I can't use nested for loops because the data is too big. This is what I tried so far
months_4 = individual_dates.copy()
for _ in months_4['Date']:
_ = np.where(pd.to_datetime(_) <= pd.to_datetime('2017-9-02'), _, np.datetime64('NaT'))
and
months_8 = individual_dates.copy()
range_8 = pd.date_range(start='2017-9-02', end='2017-11-02')
for _ in months_8['Date']:
_ = _[np.isin(_, range_8)]
achieved absolutely no result, data stays the same no matter what
update: I did what you said
individual_dates['Date'] = individual_dates['Date'].str.strip('[]').str.split(', ')
df = pd.DataFrame({
'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
'ID' : individual_dates['ClientId'].repeat(individual_dates['Date'].str.len())
})
df
and here is the result
Date ID
0 '2018-06-30T00:00:00.000000000' '2018-06-29T00... 14
1 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 15
2 '2018-03-14T00:00:00.000000000' '2018-03-13T00... 16
3 '2017-12-14T00:00:00.000000000' '2017-03-28T00... 17
4 '2017-05-30T00:00:00.000000000' '2017-05-22T00... 18
5 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 19
6 '2017-03-27T00:00:00.000000000' '2017-03-26T00... 20
7 '2017-12-15T00:00:00.000000000' '2017-11-20T00... 21
8 '2017-07-05T00:00:00.000000000' '2017-07-04T00... 22
9 '2017-12-12T00:00:00.000000000' '2017-04-06T00... 23
10 '2017-05-21T00:00:00.000000000' '2017-05-07T00... 24
For better performance I suggest convert list to column - flatten it and then filtering by isin with boolean indexing:
from itertools import chain
df = pd.DataFrame({
'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
'ID' : individual_dates['ID'].repeat(individual_dates['Date'].str.len())
})
range_8 = pd.date_range(start='2017-02-05', end='2017-06-05')
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'].isin(range_8)]
print (df)
Date ID
0 2017-03-06 14
0 2017-02-13 14
1 2017-02-24 15
4 2017-02-08 18

can you re-sample a series without dates?

I have a time series from months 1 to 420 (35 years). I would like to convert to an annual series using the average of the 12 months in each year so I can put in a dataframe I have with annual datapoints. I have it setup using a range with steps of 12 but it gets kind of messy. Ideally would like to use the resample function but having trouble since no dates. Any way around this?
There's no need to resample in this case. Just use groupby with integer division to obtain the average over the years.
import numpy as np
import pandas as pd
# Sample Data
np.random.seed(123)
df = pd.DataFrame({'Months': np.arange(1,421,1),
'val': np.random.randint(1,10,420)})
# Create Yearly average. 1-12, 13-24, Subtract 1 before // to get this grouping
df.groupby((df.Months-1)//12).val.mean().reset_index().rename(columns={'Months': 'Year'})
Outputs:
Year val
0 0 3.083333
1 1 4.166667
2 2 5.250000
3 3 4.416667
4 4 5.500000
5 5 4.583333
...
31 31 5.333333
32 32 5.000000
33 33 6.250000
34 34 5.250000
Feel free to add 1 to the year column or whatever you need to make it consistent with indexing in your other annual df. Otherwise, you could just use df.groupby((df.Months+11)//12).val().mean() to get the Year to start at 1.

Resources