df has three columns: date, name, and qty. For each name and date combination I want to insert the next n rows, where name is repeated, date increases by 1 business day per row, and qty is NaN whenever that name and date combination doesn't already exist in df.
>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame({'name':['abd']*3 + ['pqr']*2 + ['xyz']*1, 'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,2,5), datetime(2017,10,4), datetime(2017,10,13), datetime(2013,5,27)], 'qty':range(6)})
>>> df
name date qty
0 abd 2020-01-06 0
1 abd 2020-01-08 1
2 abd 2020-02-05 2
3 pqr 2017-10-04 3
4 pqr 2017-10-13 4
5 xyz 2013-05-27 5
I am not sure how to go about it. Any thoughts/clues? Thanks a lot!
Desired output for n=3:
name date qty
0 abd 2020-01-06 0
1 abd 2020-01-07 nan
2 abd 2020-01-08 1
3 abd 2020-01-09 nan
4 abd 2020-01-10 nan
5 abd 2020-01-13 nan
6 abd 2020-02-05 2
7 abd 2020-02-06 nan
8 abd 2020-02-07 nan
9 abd 2020-02-10 nan
10 pqr 2017-10-04 3
11 pqr 2017-10-05 nan
12 pqr 2017-10-06 nan
13 pqr 2017-10-09 nan
14 pqr 2017-10-13 4
15 pqr 2017-10-16 nan
16 pqr 2017-10-17 nan
17 pqr 2017-10-18 nan
18 xyz 2013-05-27 5
19 xyz 2013-05-28 nan
20 xyz 2013-05-29 nan
21 xyz 2013-05-30 nan
Here is a way:
from functools import reduce
n = 3
# for each name, union the business-day ranges starting at each existing date
new_index = (
df.groupby("name")
.apply(
lambda x: reduce(
lambda i, j: i.union(j),
[pd.bdate_range(i, periods=n + 1) for i in x["date"]],
)
)
.explode()
)
# pair each name with its expanded dates, then reindex the original frame
midx = pd.MultiIndex.from_frame(new_index.reset_index(), names=["name", "date"])
df_out = df.set_index(["name", "date"]).reindex(midx).reset_index()
df_out
If explode cannot be used (pandas older than 0.25):
from functools import reduce
n = 3
new_index = (
df.groupby("name")
.apply(
lambda x: reduce(
lambda i, j: i.union(j),
[pd.bdate_range(i, periods=n + 1) for i in x["date"]],
)
)
.apply(pd.Series)
.stack()
.reset_index(level=0)
.rename(columns={0:'date'})
)
df_out = new_index.merge(df, how='left', on=['name', 'date'])
df_out
Output:
name date qty
0 abd 2020-01-06 0.0
1 abd 2020-01-07 NaN
2 abd 2020-01-08 1.0
3 abd 2020-01-09 NaN
4 abd 2020-01-10 NaN
5 abd 2020-01-13 NaN
6 abd 2020-02-05 2.0
7 abd 2020-02-06 NaN
8 abd 2020-02-07 NaN
9 abd 2020-02-10 NaN
10 pqr 2017-10-04 3.0
11 pqr 2017-10-05 NaN
12 pqr 2017-10-06 NaN
13 pqr 2017-10-09 NaN
14 pqr 2017-10-13 4.0
15 pqr 2017-10-16 NaN
16 pqr 2017-10-17 NaN
17 pqr 2017-10-18 NaN
18 xyz 2013-05-27 5.0
19 xyz 2013-05-28 NaN
20 xyz 2013-05-29 NaN
21 xyz 2013-05-30 NaN
How it works:
reduce (imported from functools) folds pd.Index.union over a list of business-day ranges, producing a single combined index of dates. The ranges come from pd.bdate_range, built within the groupby for each name: one range per existing date, covering that date plus the next n business days. The exploded new_index and the names are then converted to a MultiIndex with pd.MultiIndex.from_frame, and reindex after set_index on the original dataframe fills the missing combinations with NaN qty.
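For intuition, here is the reduce step in isolation, using two of the abd dates: each date expands to itself plus the next n business days, and pd.Index.union merges the per-date ranges while dropping duplicates. A minimal sketch:
from functools import reduce
import pandas as pd
ranges = [pd.bdate_range(d, periods=4) for d in ["2020-01-06", "2020-01-08"]]
print(reduce(lambda i, j: i.union(j), ranges))
# expected dates: 2020-01-06, 01-07, 01-08, 01-09, 01-10, 01-13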
Given a data sample as follows:
date value1 value2 value3
0 2021-10-12 1.015 1.115668 1.015000
1 2021-10-13 NaN 1.104622 1.030225
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 1.077284
5 2021-10-29 1.015 1.061520 1.093443
6 2021-10-30 1.015 1.051010 1.109845
7 2021-10-31 1.015 NaN 1.126493
8 2021-11-1 1.015 NaN NaN
9 2021-11-2 1.015 1.020100 NaN
10 2021-11-3 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN
Let's say I want to drop the columns whose values are all NaN in November 2021, i.e. the range 2021-11-01 to 2021-11-30 (including the start and end dates).
Under this requirement, value3 will be dropped, since all of its values in 2021-11 are NaN. The other columns have some NaNs in 2021-11 but not all, so they are kept.
How could I achieve that in Pandas? Thanks.
EDIT:
df['date'] = pd.to_datetime(df['date'])
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df.loc[mask]
Out:
date value1 value2 value3
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.0201 NaN
10 2021-11-03 NaN 1.0100 NaN
11 2021-11-30 1.015 1.0000 NaN
You can filter the rows belonging to November 2021 and test whether each column holds only NaNs there:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, ~df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()]
Or:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, df[df['date'].dt.to_period('M') == pd.Period('2021-11')].notna().any()]
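A minimal self-contained check of the idea on a tiny made-up frame (column and date values are illustrative only):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-31", "2021-11-01", "2021-11-30"]),
    "value1": [1.0, 2.0, 3.0],        # has data in November -> kept
    "value3": [1.0, np.nan, np.nan],  # all NaN in November -> dropped
})
out = df.loc[:, df[df["date"].dt.to_period("M") == pd.Period("2021-11")].notna().any()]
print(out.columns.tolist())  # ['date', 'value1']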
EDIT: The same test also works with the explicit date mask from the question:
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df = df.loc[:, df.loc[mask].notna().any()]
Out:
date value1 value2
0 2021-10-12 1.015 1.115668
1 2021-10-13 NaN 1.104622
2 2021-10-14 NaN 1.093685
3 2021-10-15 1.015 1.082857
4 2021-10-16 1.015 1.072135
5 2021-10-29 1.015 1.061520
6 2021-10-30 1.015 1.051010
7 2021-10-31 1.015 NaN
8 2021-11-01 1.015 NaN
9 2021-11-02 1.015 1.020100
10 2021-11-03 NaN 1.010000
11 2021-11-30 1.015 1.000000
EDIT:
import numpy as np
df = df.assign(value4=np.nan)
print (df)
date value1 value2 value3 value4
0 2021-10-12 1.015 1.115668 1.015000 NaN
1 2021-10-13 NaN 1.104622 1.030225 NaN
2 2021-10-14 NaN 1.093685 NaN NaN
3 2021-10-15 1.015 1.082857 NaN NaN
4 2021-10-16 1.015 1.072135 1.077284 NaN
5 2021-10-29 1.015 1.061520 1.093443 NaN
6 2021-10-30 1.015 1.051010 1.109845 NaN
7 2021-10-31 1.015 NaN 1.126493 NaN
8 2021-11-1 1.015 NaN NaN NaN
9 2021-11-2 1.015 1.020100 NaN NaN
10 2021-11-3 NaN 1.010000 NaN NaN
11 2021-11-30 1.015 1.000000 NaN NaN
df['date'] = pd.to_datetime(df['date'])
m = df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()
m.loc['value4'] = False
print (m)
date False
value1 False
value2 False
value3 True
value4 False
dtype: bool
df = df.loc[:, ~m]
print (df)
date value1 value2 value4
0 2021-10-12 1.015 1.115668 NaN
1 2021-10-13 NaN 1.104622 NaN
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 NaN
5 2021-10-29 1.015 1.061520 NaN
6 2021-10-30 1.015 1.051010 NaN
7 2021-10-31 1.015 NaN NaN
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.020100 NaN
10 2021-11-03 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN
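If several columns must be kept regardless, the same idea extends; a sketch assuming a hypothetical protect list, run before the final selection:
protect = ["value1", "value4"]  # hypothetical: columns to keep no matter what
m[protect] = False              # continuing from the mask m computed above
df = df.loc[:, ~m]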
I am trying to do an index match between 2 data sets but am having trouble. Here is an example of what I am trying to do: I want to fill in columns "a", "b", and "c", which are empty in df, with the data from df2 where "Machine", "Year", and "Order Type" match.
The first dataframe lets call this one "df"
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2014 3000 nan nan nan
7 dcb 2015 100 nan nan nan
The second data set is called "df2"
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
4 c dcb 2015 4
3 a abc 2016 3
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2014 3000 nan nan nan
7 dcb 2015 100 nan nan 4
Thanks for help in advance
Consider DataFrame.pivot to reshape df2, then merge the result with df:
final_df = (
    df.reindex(["Machine", "Year", "Cost"], axis=1)
    .merge(
        df2.pivot(
            index=["Machine", "Year"],
            columns="Order Type",
            values="Total Count"
        ).reset_index(),
        on=["Machine", "Year"],
        how="left"
    )
)
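For reference, the intermediate df2.pivot(...) turns each Order Type into its own column; with the sample data it should look roughly like this (dtypes become float because of the introduced NaNs):
print(df2.pivot(index=["Machine", "Year"], columns="Order Type", values="Total Count"))
# Order Type      a    b    c
# Machine Year
# abc     2014  1.0  1.0  2.0
#         2016  3.0  NaN  NaN
# dcb     2015  NaN  NaN  4.0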
I want to merge the content of certain rows, but only where some specific conditions are met.
Here is the test dataframe I am working on
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which I want to join the rows where Date is NaN onto the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
Could anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
print (test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use groupby on the forward-filled dates together with transform('sum'):
d = df["Date"].ffill()
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
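The rows whose Date is still NaN remain after update. If they should be dropped once their content has been merged upward (an assumption on my part, since the desired output in the question keeps them), one option is:
df = df.dropna(subset=['Date'])  # remove the absorbed rows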
Another option works positionally on the rows just above each NaN date:
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
.str.cat(test.loc[idx]["Bal"].astype(str)))
## I tried to add two values but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645
I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values (ordered by the Date variable) of Var1 at the ID level (column 1); otherwise it takes the value NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform together with rolling and shift:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
#if not sorted ID with datetimes
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
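The shift can equivalently be applied before the rolling sum; either way the current row is excluded and the first five rows of each group stay NaN. A quick sanity check on the ID == 2 values (4 through 11):
import pandas as pd
s = pd.Series(range(4, 12))        # Var1 values for ID == 2
print(s.rolling(5).sum().shift())  # five NaNs, then 30.0, 35.0, 40.0
print(s.shift().rolling(5).sum())  # identical result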
I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside a zip file. My attempt:
import pandas as pd
import urllib.request
urllib.request.urlretrieve(link, "GDP.zip")
import zipfile
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need the parameter skiprows, because the csv header is in row 5:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN
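For completeness, a sketch that keeps everything in memory instead of writing GDP.zip to disk (same URL and zip member name as above; network access assumed):
import io
import urllib.request
import zipfile
import pandas as pd
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
with urllib.request.urlopen(link) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
csv_file = archive.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file, skiprows=4)  # header starts on line 5 of the csv
print(GDP.head())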