DataFrame merging with ordered indices and different columns - python-3.x

I have two pandas data-frames that I want to merge. The data-frames have different columns and overlapping indices. I want to merge them while keeping the order of the indices intact.
Dataframe (d1)
                                  Dec 16  Dec 15
Balance Sheet
NON-CURRENT LIABILITIES              NaN     NaN   <-- 'all NaN' row
Other Long Term Liabilities         8.37    9.30
Long Term Provisions               13.53   12.74   <-- Not present in d2
Total Non-Current Liabilities      21.90   22.04
CURRENT LIABILITIES                  NaN     NaN   <-- 'all NaN' row
Trade Payables                     32.49   24.26
Dataframe (d2)
                                  Dec 11  Dec 10
Balance Sheet
NON-CURRENT LIABILITIES              NaN     NaN
Deferred Tax Liabilities [Net]      0.00    7.40   <-- Not present in d1
Other Long Term Liabilities        14.13    0.00
Total Non-Current Liabilities      14.13    7.40
CURRENT LIABILITIES                  NaN     NaN
Trade Payables                     77.35   60.40
I tried the following ways to merge these data-frames, but none of them worked.
d1.merge(d2, how='left', left_index=True,right_index=True)
d1.merge(d2, how='outer', left_index=True,right_index=True)
pd.merge_ordered(d1,d2,left_on=['Dec 16'],right_on=['Dec 11'])
pd.concat([d1.merge(d2, how='left', left_index=True,right_index=True),d1.merge(d2, how='right', left_index=True,right_index=True)]).drop_duplicates(subset='Dec 16',keep='last')
I am expecting the resulting dataframe to look like this
                                  Dec 16  Dec 15  Dec 11  Dec 10
Balance Sheet
NON-CURRENT LIABILITIES              NaN     NaN     NaN     NaN
Deferred Tax Liabilities [Net]       NaN     NaN    0.00    7.40   <-- from d2
Other Long Term Liabilities         8.37    9.30   14.13    0.00   <-- d1+d2 merged
Long Term Provisions               13.53   12.74     NaN     NaN   <-- from d1
Total Non-Current Liabilities      21.90   22.04   14.13    7.40   <-- d1+d2 merged
CURRENT LIABILITIES                  NaN     NaN     NaN     NaN
Trade Payables                     32.49   24.26   77.35   60.40
Note that the overall order matters (e.g., the all-NaN rows must stay in the same order), but the order of the merged indices between the all-NaN rows does not. Also, the columns of d1 should come before the columns of d2.

Use merge with how='outer', then reindex with a custom order:
In [1424]: order_index = ['NON-CURRENT LIABILITIES', 'Deferred Tax Liabilities [Net]',
      ...:                'Other Long Term Liabilities', 'Long Term Provisions',
      ...:                'Total Non-Current Liabilities', 'CURRENT LIABILITIES',
      ...:                'Trade Payables']
In [1425]: df1.merge(df2, how='outer', left_index=True, right_index=True).reindex(order_index)
Out[1425]:
Dec 16 Dec 15 Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN NaN NaN
Deferred Tax Liabilities [Net] NaN NaN 0.00 7.4
Other Long Term Liabilities 8.37 9.30 14.13 0.0
Long Term Provisions 13.53 12.74 NaN NaN
Total Non-Current Liabilities 21.90 22.04 14.13 7.4
CURRENT LIABILITIES NaN NaN NaN NaN
Trade Payables 32.49 24.26 77.35 60.4
Also, join works
In [1426]: df1.join(df2, how='outer').reindex(order_index)
Out[1426]:
Dec 16 Dec 15 Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN NaN NaN
Deferred Tax Liabilities [Net] NaN NaN 0.00 7.4
Other Long Term Liabilities 8.37 9.30 14.13 0.0
Long Term Provisions 13.53 12.74 NaN NaN
Total Non-Current Liabilities 21.90 22.04 14.13 7.4
CURRENT LIABILITIES NaN NaN NaN NaN
Trade Payables 32.49 24.26 77.35 60.4
Details
In [1417]: df1
Out[1417]:
Dec 16 Dec 15
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN
Other Long Term Liabilities 8.37 9.30
Long Term Provisions 13.53 12.74
Total Non-Current Liabilities 21.90 22.04
CURRENT LIABILITIES NaN NaN
Trade Payables 32.49 24.26
In [1418]: df2
Out[1418]:
Dec 11 Dec 10
Balance Sheet
NON-CURRENT LIABILITIES NaN NaN
Deferred Tax Liabilities [Net] 0.00 7.4
Other Long Term Liabilities 14.13 0.0
Total Non-Current Liabilities 14.13 7.4
CURRENT LIABILITIES NaN NaN
Trade Payables 77.35 60.4
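The custom order above is typed out by hand. As a rough sketch only, the order could also be derived from the two indices: keep the order of the first frame and slot each label that exists only in the second frame just before the next label that both frames share. The helper name merged_order is hypothetical (it is not a pandas function), and the frames are rebuilt here from the values in the question.

```python
import numpy as np
import pandas as pd

# Rebuild the two frames from the question.
d1 = pd.DataFrame(
    {"Dec 16": [np.nan, 8.37, 13.53, 21.90, np.nan, 32.49],
     "Dec 15": [np.nan, 9.30, 12.74, 22.04, np.nan, 24.26]},
    index=pd.Index(
        ["NON-CURRENT LIABILITIES", "Other Long Term Liabilities",
         "Long Term Provisions", "Total Non-Current Liabilities",
         "CURRENT LIABILITIES", "Trade Payables"],
        name="Balance Sheet"),
)
d2 = pd.DataFrame(
    {"Dec 11": [np.nan, 0.00, 14.13, 14.13, np.nan, 77.35],
     "Dec 10": [np.nan, 7.40, 0.00, 7.40, np.nan, 60.40]},
    index=pd.Index(
        ["NON-CURRENT LIABILITIES", "Deferred Tax Liabilities [Net]",
         "Other Long Term Liabilities", "Total Non-Current Liabilities",
         "CURRENT LIABILITIES", "Trade Payables"],
        name="Balance Sheet"),
)


def merged_order(first, second):
    """Hypothetical helper: order of `first`, with labels found only in
    `second` inserted just before the next label shared by both."""
    order = list(first)
    known = set(first)
    pending = []  # labels seen in `second` but not yet placed
    for label in second:
        if label in known:
            if pending:
                pos = order.index(label)
                order[pos:pos] = pending
                pending = []
        else:
            pending.append(label)
    order.extend(pending)  # labels after the last shared one go to the end
    return order


order_index = merged_order(d1.index, d2.index)
result = d1.join(d2, how="outer").reindex(order_index)
print(result)
```

For the data in the question this reproduces the hand-written order_index. It relies on the shared labels appearing in the same relative order in both frames, which is an assumption that may not hold for other statements.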

Related

How to melt a dataframe into a long form?

I have the following dataframe
recycling 1 metric tonne (1000 kilogram) per waste type Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 1 barrel oil is approximately 159 litres of oil NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 material Plastic Glass Ferrous Metal Non-Ferrous Metal Paper
3 energy_saved 5774 Kwh 42 Kwh 642 Kwh 14000 Kwh 4000 kWh
4 crude_oil saved 16 barrels NaN 1.8 barrels 40 barrels 1.7 barrels
What I want is to turn rows 2, 3 and 4 into columns of a new dataframe. It should look something like this:
material energy_saved crude_oil saved
plastic 5774Kwh 16 barrels
Glass 42 Kwh NaN
... ... ...
I tried using .melt, but it did not work.
Notice that each column name and its values sit in a single row. I just want them in a new dataframe as the column name and its values.
IIUC, it is just a matter of selecting those rows and transposing:
out = df.loc[[2,3,4],:].T.reset_index(drop=True)
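The one-liner above leaves the labels (material, energy_saved, crude_oil saved) as the first data row rather than as column names. A possible follow-up step, sketched here on a frame rebuilt from the question's table, promotes that first row to the header. The extra steps are a suggestion and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question (all cells are strings or NaN).
columns = ["recycling 1 metric tonne (1000 kilogram) per waste type",
           "Unnamed: 1", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4", "Unnamed: 5"]
df = pd.DataFrame(
    [
        ["1 barrel oil is approximately 159 litres of oil"] + [np.nan] * 5,
        [np.nan] * 6,
        ["material", "Plastic", "Glass", "Ferrous Metal",
         "Non-Ferrous Metal", "Paper"],
        ["energy_saved", "5774 Kwh", "42 Kwh", "642 Kwh",
         "14000 Kwh", "4000 kWh"],
        ["crude_oil saved", "16 barrels", np.nan, "1.8 barrels",
         "40 barrels", "1.7 barrels"],
    ],
    columns=columns,
)

# Step from the answer: take rows 2-4 and transpose them into columns.
out = df.loc[[2, 3, 4], :].T.reset_index(drop=True)

# Extra step: use the first transposed row as the header, then drop it.
out.columns = out.iloc[0].tolist()
out = out.iloc[1:].reset_index(drop=True)
print(out)
```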

Fill one column value to another one randomly selected from multiple columns in Python

Given a dataset as follows:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN NaN 13 NaN NaN NaN
2 gz 9 NaN NaN NaN 9 NaN NaN NaN
3 sz 6 NaN NaN NaN 16 NaN NaN NaN
I would like to copy value1 into one column randomly selected from 'March', 'April', 'May', and likewise copy value2 into one column randomly selected from 'Jun', 'Jul', 'Aut'.
Output desired:
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN 12.0 NaN 15 NaN 15.0 NaN
1 sh 8 8.0 NaN NaN 13 NaN NaN 13.0
2 gz 9 NaN NaN 9.0 9 NaN 9.0 NaN
3 sz 6 NaN 6.0 NaN 16 16.0 NaN NaN
How could I do that in Python? Thanks.
Here is one way: define a function that picks, for each row, a random column position within the passed cols, then writes that row's value from the value column (val_col) into the chosen column:
import numpy as np

def fill(df, val_col, cols):
    # For each row, pick a random position among the candidate columns
    i = np.random.choice(len(cols), len(df))
    vals = df[cols].to_numpy()
    # Write each row's value into its randomly chosen column
    vals[range(len(df)), i] = list(df[val_col])
    return df.assign(**dict(zip(cols, vals.T)))
>>> df = fill(df, 'value1', ['March', 'April', 'May'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 12.0 NaN NaN 15 NaN NaN NaN
1 sh 8 NaN NaN 8.0 13 NaN NaN NaN
2 gz 9 NaN 9.0 NaN 9 NaN NaN NaN
3 sz 6 NaN 6.0 NaN 16 NaN NaN NaN
>>> df = fill(df, 'value2', ['Jun', 'Jul', 'Aut'])
>>> df
city value1 March April May value2 Jun Jul Aut
0 bj 12 NaN NaN 12.0 15 NaN NaN 15.0
1 sh 8 NaN NaN 8.0 13 13.0 NaN NaN
2 gz 9 NaN NaN 9.0 9 NaN NaN 9.0
3 sz 6 NaN 6.0 NaN 16 NaN NaN 16.0
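Because the column choice is random, every run gives a different table, which makes the result awkward to check. A sketch of the same idea with a seeded NumPy generator passed in is shown below. The function name fill_random and the seed value 0 are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["bj", "sh", "gz", "sz"],
    "value1": [12, 8, 9, 6],
    "March": np.nan, "April": np.nan, "May": np.nan,
    "value2": [15, 13, 9, 16],
    "Jun": np.nan, "Jul": np.nan, "Aut": np.nan,
})


def fill_random(df, val_col, cols, rng):
    """Copy df[val_col] into one randomly chosen column of `cols` per row."""
    # One random column position per row, drawn from the supplied generator
    picks = rng.integers(len(cols), size=len(df))
    # Work on an explicit float copy so the input frame is not modified
    vals = df[cols].to_numpy(dtype=float, copy=True)
    vals[np.arange(len(df)), picks] = df[val_col].to_numpy()
    return df.assign(**dict(zip(cols, vals.T)))


rng = np.random.default_rng(0)  # fixed seed so reruns give the same picks
out = fill_random(df, "value1", ["March", "April", "May"], rng)
out = fill_random(out, "value2", ["Jun", "Jul", "Aut"], rng)
print(out)
```

Whatever the seed, each row should end up with exactly one filled cell per group of columns, and that cell should equal the corresponding value column.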

Split the Datetime into Year and Month column in python

How can I split the datetime values into year and month? I need one column per year
(2017_year, 2018_year, and so on), and the values under each year column should be the month of the respective year.
Example data:
call time area age
2017-12-12 19:38:00 Rural 28
2018-01-12 22:05:00 Rural 50
2018-02-12 22:33:00 Rural 76
2019-01-12 22:37:00 Urban 45
2020-02-13 00:26:00 Urban 52
Required Output:
call time area age Year_2017 Year_2018
2017-12-12 19:38:00 Rural 28 jan jan
2018-01-12 22:05:00 Rural 50 Feb Feb
2018-02-12 22:33:00 Rural 76 mar mar
2019-01-12 22:37:00 Urban 45 Apr Apr
2020-02-13 00:26:00 Urban 52 may may
I think you need to generate the years and months from the call time datetimes, so the output is different from the one you posted:
Explanation: first create a column of month names with DataFrame.assign and Series.dt.strftime, then append the years to the index with append=True to get a MultiIndex, reshape with Series.unstack, and finally join the result to the original:
df1 = (df.assign(m=df['call time'].dt.strftime('%b'))
         .set_index(df['call time'].dt.year, append=True)['m']
         .unstack()
         .add_prefix('Year_'))
print (df1)
call time Year_2017 Year_2018 Year_2019 Year_2020
0 Dec NaN NaN NaN
1 NaN Jan NaN NaN
2 NaN Feb NaN NaN
3 NaN NaN Jan NaN
4 NaN NaN NaN Feb
df = df.join(df1)
print (df)
call time area age Year_2017 Year_2018 Year_2019 Year_2020
0 2017-12-12 19:38:00 Rural 28 Dec NaN NaN NaN
1 2018-01-12 22:05:00 Rural 50 NaN Jan NaN NaN
2 2018-02-12 22:33:00 Rural 76 NaN Feb NaN NaN
3 2019-01-12 22:37:00 Urban 45 NaN NaN Jan NaN
4 2020-02-13 00:26:00 Urban 52 NaN NaN NaN Feb
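The .dt accessor only works on a datetime column, so if call time was read in as text it needs converting first. A self-contained sketch of the steps above, with the frame rebuilt from the example data and an explicit pd.to_datetime call added as an assumption about how the data arrives:

```python
import pandas as pd

df = pd.DataFrame({
    "call time": ["2017-12-12 19:38:00", "2018-01-12 22:05:00",
                  "2018-02-12 22:33:00", "2019-01-12 22:37:00",
                  "2020-02-13 00:26:00"],
    "area": ["Rural", "Rural", "Rural", "Urban", "Urban"],
    "age": [28, 50, 76, 45, 52],
})

# Assumption: the column starts out as strings, so convert it first.
df["call time"] = pd.to_datetime(df["call time"])

# Month abbreviation per row, spread into one column per year.
months_by_year = (
    df.assign(m=df["call time"].dt.strftime("%b"))
      .set_index(df["call time"].dt.year, append=True)["m"]
      .unstack()
      .add_prefix("Year_")
)

result = df.join(months_by_year)
print(result)
```

The month abbreviations produced by '%b' depend on the active locale; the outputs shown above correspond to an English locale.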

How to sum the last 7 days in Pandas between two dates

Here is my raw data:
[image: Raw Data]
Here is the data (including types) after I add the column 'Date_2wks_Ago' in Pandas:
[image]
I would like to add a new column 'Rainfall_Last7Days' that holds, for each day, the total amount of rainfall over the last week.
So (ignoring the other columns that aren't relevant) it would look a little like this:
[image: Ideal Dataset]
Does anyone know how to do this in Pandas?
My data is about 1000 observations long, so not huge.
I think what you are looking for is the rolling() function.
This section recreates a simplified version of your table:
import pandas as pd

# Daily rainfall readings and their dates
rainfall_from_9am = [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3,
                     0, 0, 0.04, 0, 0, 0, 0.04, 0.4]
date = ['2019-02-03', '2019-02-04', '2019-02-05', '2019-02-06',
        '2019-02-07', '2019-02-08', '2019-02-09', '2019-02-10',
        '2019-02-11', '2019-02-12', '2019-02-13', '2019-02-14',
        '2019-02-15', '2019-02-16', '2019-02-17', '2019-02-18']
# Create df from the lists
df = pd.DataFrame({'date': date,
                   'rainfall_from_9am': rainfall_from_9am})
This part calculates the rolling sum of rainfall for the current and previous 6 records.
df['rain_last7days']=df['rainfall_from_9am'].rolling(7).sum()
print(df)
Output:
date rainfall_from_9am rain_last7days
0 2019-02-03 4.60 NaN
1 2019-02-04 0.40 NaN
2 2019-02-05 3.60 NaN
3 2019-02-06 3.50 NaN
4 2019-02-07 3.20 NaN
5 2019-02-08 5.50 NaN
6 2019-02-09 2.20 23.00
7 2019-02-10 1.30 19.70
8 2019-02-11 0.00 19.30
9 2019-02-12 0.00 15.70
10 2019-02-13 0.04 12.24
11 2019-02-14 0.00 9.04
12 2019-02-15 0.00 3.54
13 2019-02-16 0.00 1.34
14 2019-02-17 0.04 0.08
15 2019-02-18 0.40 0.48
I am conscious that this output does not exactly match the example in your original question. Could you confirm the exact logic you are after?
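Note that rolling(7) counts rows, so it equals "the last 7 days" only when there is exactly one row per day with no gaps. If dates can be missing, a time-based window may be closer to what the question asks. The following is a sketch using the same column names as above. It assumes the dates are sorted, and it relies on the pandas default that an offset window such as '7D' uses min_periods=1, so the first six rows get partial sums instead of NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-02-03", periods=16, freq="D"),
    "rainfall_from_9am": [4.6, 0.4, 3.6, 3.5, 3.2, 5.5, 2.2, 1.3,
                          0, 0, 0.04, 0, 0, 0, 0.04, 0.4],
})

# A '7D' window covers the current day and the six calendar days before it.
rolled = df.set_index("date")["rainfall_from_9am"].rolling("7D").sum()

# Row order is unchanged, so the values can be written back positionally.
df["rain_last7days"] = rolled.to_numpy()
print(df)
```

With gap-free daily data like this, rows 6 onward should agree with the rolling(7) result shown above.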

using Pandas to download/load zipped csv file from URL

I am trying to load a csv file from the following URL into a dataframe using Python 3.5 and Pandas:
link = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
The csv file (API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv) is inside of a zip file. My try:
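import urllib.request
import zipfile
import pandas as pd

# Download the zip archive to a local file
urllib.request.urlretrieve(link, "GDP.zip")
# Open the CSV member inside the archive and read it
compressed_file = zipfile.ZipFile('GDP.zip')
csv_file = compressed_file.open('API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv')
GDP = pd.read_csv(csv_file)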
But when reading it, I got the error "pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62".
Any idea?
I think you need the skiprows parameter, because the CSV header is on line 5 of the file, so the first 4 lines have to be skipped:
GDP = pd.read_csv(csv_file, skiprows=4)
print (GDP.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN
1 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08
3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN
4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN
1961 1962 1963 1964 1965 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
2008 2009 2010 2011 \
0 ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09
1 ... 4.001201e+09 3.650083e+09 3.346517e+09 3.427023e+09
2 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10
3 ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041159e+11
4 ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10
2012 2013 2014 2015 2016 Unnamed: 61
0 NaN NaN NaN NaN NaN NaN
1 3.146152e+09 3.248925e+09 NaN NaN NaN NaN
2 2.053654e+10 2.004633e+10 2.005019e+10 1.933129e+10 NaN NaN
3 1.153984e+11 1.249121e+11 1.267769e+11 1.026269e+11 NaN NaN
4 1.231978e+10 1.278103e+10 1.321986e+10 1.139839e+10 NaN NaN
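The effect of skiprows can be tried without a network call. The following is only a sketch: it builds a small zip archive in memory whose CSV member has four preamble lines before the header, mimicking the layout described above. The member name data.csv and every value in it are invented for illustration and are not World Bank data.

```python
import io
import zipfile

import pandas as pd

# Synthetic CSV: four preamble lines, then the real header on line 5.
raw_csv = (
    "Data Source,synthetic example,\n"
    "note,preamble line 2,\n"
    "note,preamble line 3,\n"
    "note,preamble line 4,\n"
    "Country Name,Country Code,1960,1961\n"
    "Country A,AAA,,\n"
    "Country B,BBB,1.0,2.0\n"
)

# Write the CSV into an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", raw_csv)
buffer.seek(0)

# Read it back: skip the four preamble lines so line 5 becomes the header.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("data.csv") as csv_file:
        gdp = pd.read_csv(csv_file, skiprows=4)

print(gdp)
```

Using with blocks closes the archive and the member file automatically, which the snippet in the question leaves open.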
