INDEX and MATCH with multiple criteria in Pandas Python - python-3.x

I am trying to do an index match in 2 data set but having trouble. Here is an example of what I am trying to do. I want to fill in column "a", "b", "c" that are empty in df with the df2 data where "Machine", "Year", and "Order Type".
The first dataframe lets call this one "df"
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2020 3000 nan nan nan
7 dcb 2021 100 nan nan nan
The second data set is called "df2"
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
4 c dcb 2015 4
3 a abc 2016 3
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 1 nan nan
6 dcb 2014 3000 nan nan 4
7 dcb 2015 100 nan nan nan
Thanks for help in advance

Consider DataFrame.pivot to reshape df2 to merge with df1.
final_df = (
df1.reindex(["Machine", "Type", "Cost"], axis=True)
.merge(
df.pivot(
index=["Machine", "Year"],
columns="Order Type",
values="Total Count"
).reset_index(),
on = ["Machine", "Year"]
)
)

Related

Filter the rows based on the missing values on the specific set columns - pandas

I have data frame as shown below
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN NaN 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN 140 NaN NaN
From the above df, I would like to filter the rows where exactly one of the columns D_N, D_A, C_N and C_A has non zero.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Bank In_Pay NaN NaN NaN 900
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Eat Food NaN 140 NaN NaN
I tried the below code but that does not work.
df[df.loc[:, ["D_N", "D_A", "C_N", "C_A"]].isna().sum(axis=1).eq(1)]
Use DataFrame.count for count values excluded missing values:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].count(axis=1).eq(1)]
print (df1)
ID Type Desc D_N D_A C_N C_A
1 1 Bank In_Pay NaN NaN NaN 900.0
7 2 Bank NaN NaN NaN 8.0 NaN
8 2 NaN Food 4.0 NaN NaN NaN
11 3 Eat Food NaN 140.0 NaN NaN
Your solution is possible modify with test non missing values:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].notna().sum(axis=1).eq(1)]

Remove the rows when the values of the all the specific 4 columns are NaN in pandas

I have a df as shown below
df:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN NaN NaN NaN
About the df:
whenever D_N non-NaN D_A should be non-NaN, at the same time C_N and C_A should be NaN and vice versa.
In the above data, I would like to filter rows where any of the D_N, D_A, C_N and C_A are non-NaN
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Bank NaN NaN NaN 4 700
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Bank NaN 2 300 NaN NaN
print(df.dropna(subset=["D_N", "D_A", "C_N", "C_A"], how="all"))
Prints:
ID Type Desc D_N D_A C_N C_A
0 1 Edu Education 3.0 100.0 NaN NaN
1 1 Bank In_Pay NaN NaN 8.0 900.0
2 1 Eat Food 4.0 200.0 NaN NaN
4 1 Bank NaN NaN NaN 4.0 700.0
6 2 Edu Education NaN NaN 1.0 100.0
7 2 Bank In_Pay NaN NaN 8.0 NaN
8 2 Eat Food 4.0 200.0 NaN NaN
10 3 Bank NaN 2.0 300.0 NaN NaN

Fill one column value to another one randomly selected from multiple columns in Python

Given a dataset as follows:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN NaN 13 NaN NaN NaN
2 gz 9 NaN NaN NaN 9 NaN NaN NaN
3 sz 6 NaN NaN NaN 16 NaN NaN NaN
I would like to fill value1 to randomly select one column from 'March', 'April', 'May', also fill value2 to one column randomly selected from 'Jun', 'Jul', 'Aut'.
Output desired:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN 12.0 NaN 15 NaN 15.0 NaN
1 sh 8 8.0 NaN NaN 13 NaN NaN 13.0
2 gz 9 NaN NaN 9.0 9 NaN 9.0 NaN
3 sz 6 NaN 6.0 NaN 16 16.0 NaN NaN
How could I do that in Python? Thanks.
Here is one way by defining a function which randomly selects the indices from the slice of dataframe as defined by the passed cols then fills the corresponding values from the value column (val_col) passed to the function:
def fill(df, val_col, cols):
i = np.random.choice(len(cols), len(df))
vals = df[cols].to_numpy()
vals[range(len(df)), i] = list(df[val_col])
return df.assign(**dict(zip(cols, vals.T)))
>>> df = fill(df, 'value1', ['March', 'April', 'May'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 12.0 NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN 8.0 13 NaN NaN NaN
2 gz 9 NaN 9.0 NaN 9 NaN NaN NaN
3 sz 6 NaN 6.0 NaN 16 NaN NaN NaN
>>> df = fill(df, 'value2', ['Jun', 'Jul', 'Aut'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN 12.0 15 NaN NaN 15.0
1 sh 8 NaN NaN 8.0 13 13.0 NaN NaN
2 gz 9 NaN NaN 9.0 9 NaN NaN 9.0
3 sz 6 NaN 6.0 NaN 16 NaN NaN 16.0

Combine text from multiple rows in pandas

I want to merge content for respective rows' data only where some specific conditions are met.
Here is the test dataframe I am working on
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which, I want to join the rows where there is nan into Date to the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
If anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
print (test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use
d = df["Date"].fillna(method='ffill')
df.update(df.groupby(d).transform('sum'))
print(df)
output
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
.str.cat(test.loc[idx]["Bal"].astype(str)))
## I tried to add two values but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645

using Pandas to download/load zipped csv file from URL

I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside of a zip file. My try:
import urllib.request
urllib.request.urlretrieve(link, "GDP.zip")
import zipfile
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need parameter skiprows, because csv header is in row 5:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN

Resources