Why are there so many "NaN" values in the index after importing a MultiIndex DataFrame from an Excel file? - python-3.x

I have an Excel file that looks like this in Excel:
          2016-1-1  2016-1-2  2016-1-3  2016-1-4
300100 am        1         3         5         1
       pm        3         2         4         5
300200 am        2         5         2         6
       pm        5         1         3         7
300300 am        1         6         3         2
       pm        3         7         2         3
300400 am        3         1         1         3
       pm        2         5         5         2
300500 am        1         6         6         1
       pm        5         7         7         5
But after I imported it with pd.read_excel and printed it, it was displayed like this in Python:
          2016-1-1  2016-1-2  2016-1-3  2016-1-4
300100 am        1         3         5         1
NaN    pm        3         2         4         5
300200 am        2         5         2         6
NaN    pm        5         1         3         7
300300 am        1         6         3         2
NaN    pm        3         7         2         3
300400 am        3         1         1         3
NaN    pm        2         5         5         2
300500 am        1         6         6         1
NaN    pm        5         7         7         5
How can I fix this so the DataFrame looks like the Excel layout, without all these "NaN" values? Thanks!

Most of the time when Excel looks like what you have in your example, it does actually have blanks where those spaces are. But the cells are merged, so it looks pretty. When you import it into pandas, it reads them as empty or NaN.
To fix it, forward-fill the empty cells, then set them as the index:
df.ffill()
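For instance, a minimal sketch (the file name and the assumption that the two label columns are the leading columns of the sheet are mine):

import pandas as pd

# Read without an index first, so the merged cells become NaN in an
# ordinary column rather than in the index.
df = pd.read_excel('data.xlsx')  # hypothetical file name

# Forward-fill the merged-cell column, then set both label columns
# as a MultiIndex.
df.iloc[:, 0] = df.iloc[:, 0].ffill()
df = df.set_index([df.columns[0], df.columns[1]])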

Without access to the Excel file or knowledge of the versions it's impossible to be sure, but it looks like you have a column of numbers (the first column) with every other row blank. Pandas expects uniformly filled columns, so while in Excel you have a sort of "structure" holding both the AM and PM information for each first-column number (an id?), pandas just sees pairs of rows, one of them with an empty first column. Depending on how you actually want to access this data, an easy fix would be to replace every NaN with the number directly above it (a forward fill), so each row contains either the AM or the PM information for its id. Another fix would be to change your column structure to have 2016-1-1-am and 2016-1-1-pm fields.

You're looking for the fillna method:
df = df.fillna('')

Related

For and if loop combination takes a lot of time in Pandas (data manipulation)

I have two datasets, each with about half a million observations. I am running the code below and it never seems to stop executing. I would like to know if there is a better way of doing it. Any input is appreciated.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values, meaning every 'sid' value in df2 has a match among the 'sid' values in df1. The 'tid' values, and consequently the 'rid' values (which are a combination of 'sid' and 'tid'), may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches an 'rid' in df1, the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in df2 will be the median 'tv' value for the matching 'sid' subset in df1.
In fact, my original task includes creating a few more columns like 'tv' in df2 (based on their values in df1; these columns exist in df1).
I believe that because my code contains a for loop combined with an if/else statement and multiple assignment statements, it is taking forever to execute. Appreciate any inputs.
df1
    sid  tid   rid  tv
0     0    0   0-0   9
1     0    1   0-1   8
2     0    3   0-3   4
3     1    5   1-5   2
4     1    7   1-7   3
5     1    9   1-9  14
6     1   10  1-10  24
7     1   11  1-11  13
8     2   14  2-14   2
9     2   16  2-16   5
10    3   17  3-17   6
11    3   18  3-18   8
12    3   20  3-20   5
13    3   21  3-21  11
14    4   23  4-23   6
df2
    sid  tid   rid
0     0    0   0-0
1     0    2   0-2
2     1    3   1-3
3     1    6   1-6
4     1    9   1-9
5     2   10  2-10
6     2   12  2-12
7     3    1   3-1
8     3   15  3-15
9     3    1   3-1
10    4   19  4-19
11    4   22  4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s, t = r.split('-')
    if [s, t] in rids:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.rid == r, 'tv']
    else:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.sid == int(s), 'tv'].median()
The expected df2 is as follows:
    sid  tid   rid    tv
0     0    0   0-0   9.0
1     0    2   0-2   8.0
2     1    3   1-3  13.0
3     1    6   1-6  13.0
4     1    9   1-9  14.0
5     2   10  2-10   3.5
6     2   12  2-12   3.5
7     3    1   3-1   7.0
8     3   15  3-15   7.0
9     3    1   3-1   7.0
10    4   19  4-19   6.0
11    4   22  4-22   6.0
You can left-merge df2 with a subset of df1 on 'rid' (only the 'tv' column is needed, though you could also pass df1 without subsetting), then fill the gaps with the per-'sid' median:
out = df2.merge(df1[['rid', 'tv']], on='rid', how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left-merge them on ['sid', 'rid'] and then fill the missing 'tv' values with the per-'sid' median of df1's 'tv' column, mapped in via the map() method:
out = df2.merge(df1, on=['sid', 'rid'], how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out = out.drop('tid_y', axis=1).rename(columns={'tid_x': 'tid'})
out
Output of out:
    sid  tid   rid    tv
0     0    0   0-0   9.0
1     0    2   0-2   8.0
2     1    3   1-3  13.0
3     1    6   1-6  13.0
4     1    9   1-9  14.0
5     2   10  2-10   3.5
6     2   12  2-12   3.5
7     3    1   3-1   7.0
8     3   15  3-15   7.0
9     3    1   3-1   7.0
10    4   19  4-19   6.0
11    4   22  4-22   6.0
Here is a suggestion without any loops, based on dictionaries:
# rid -> tv for every row of df1
matching_values = dict(zip(df1['rid'], df1['tv']))
# sid -> median tv, used where an rid has no match
median_values = df1.groupby('sid')['tv'].median().to_dict()

matched = df2['rid'].isin(df1['rid'])
df2.loc[matched, 'tv'] = df2.loc[matched, 'rid'].map(matching_values)
df2.loc[~matched, 'tv'] = df2.loc[~matched, 'sid'].map(median_values)
This should do the trick. The logic is that we first build two dictionaries: one mapping each 'rid' to its 'tv' value in df1, and one mapping each 'sid' to the median 'tv' for that 'sid' in df1. We then fill df2's 'tv' column by mapping 'rid' through the first dictionary where a match exists, and 'sid' through the second dictionary where it does not.
Don't use for loops in pandas; they are known to be slow, and you don't get to benefit from all the internal optimizations that have been made.
Try to use the split-apply-combine pattern (assembled in the sketch below):
split df1 by 'sid' to calculate the medians: df1.groupby('sid')['tv'].median()
join df2 on df1: df2.join(df1.set_index('rid')['tv'], on='rid')
fill the NaN values with the medians calculated in step 1.
(Haven't tested the code).
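Assembled into a minimal sketch (same caveat applies):

medians = df1.groupby('sid')['tv'].median()             # step 1
out = df2.join(df1.set_index('rid')['tv'], on='rid')    # step 2
out['tv'] = out['tv'].fillna(out['sid'].map(medians))   # step 3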

How to find again the index after pivoting dataframe?

I created a dataframe from a csv file containing data on the number of deaths by year (running from 1946 to 2021) and month (within each year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows (out of 902) of the output are:
dataD
   Year  Month  Deaths
0  2021      2   55500
1  2021      1   65400
2  2020     12   62800
3  2020     11   64700
4  2020     10   56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only 1 row by year and months in column, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month        1        2        3        4        5        6        7        8        9       10       11       12
Year
1946   70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1947   60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
1948   46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
1949   87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
1950   51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0
My question is:
What do I have to change in the pivoting code above in order to get the 0, 1, 2, ... index back when I output the pivoted dataframe? I understand I have to specify index=*** for the pivot instruction to run, but afterwards I would like to recover a plain index, exactly like in my first dataframe dataD.
Is that possible?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month  Year        1        2        3        4        5        6        7        8        9       10       11       12
0      1946  70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1      1947  60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
2      1948  46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
3      1949  87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
4      1950  51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0
Note that the "Month" here might look like the index name, but it is actually the columns' name, dataDW.columns.name. You can unset it if preferred:
dataDW.columns.name = None
Which then gives you:
   Year        1        2        3        4        5        6        7        8        9       10       11       12
0  1946  70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1  1947  60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
2  1948  46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
3  1949  87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
4  1950  51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0
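Putting both steps together:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
dataDW.columns.name = None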

Pandas DataFrame: how do we keep columns based on the index name?

I seem to be running into some Python or enumerate bugs that I am not quite sure how to fix (see here for more details).
Long story short, I want my data sets to have the column names 0, 4, 6, 8, 10, 12, 14.
0 4 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
But my current data looks like the following
0 4 2 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
Therefore, I would like to add code that keeps columns based on their names (keeping only 0, 4, 6, 8, 10, 12).
Is there a pandas function that can help with this?
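One way (a minimal sketch; the sample frame is mine, and it assumes the column names are integers) is DataFrame.filter:

import pandas as pd

df = pd.DataFrame([[1, 2, 5, 4, 2, 1, 7],
                   [5, 3, 0, 1, 5, 10, 7]],
                  columns=[0, 4, 2, 6, 8, 10, 12])

# filter(items=...) keeps only the listed column labels, in the given
# order, and silently skips any that are absent.
df = df.filter(items=[0, 4, 6, 8, 10, 12])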

Pull Values from a Table in Excel

I am working on creating a user-friendly character sheet for the new Pathfinder Playtest in Excel. I have run into an issue with one section, and I have come here for help; not sure if it's the right place.
I want a cell to return a value from the table below based on two other cells' values, e.g., if A1=19 and B1=4th, it would pull the number from the appropriate cell (3 in this case).
     1st  2nd  3rd  4th  5th  6th  7th  8th  9th
 1     2
 2     3
 3     3    2
 4     3    3
 5     3    3    2
 6     3    3    3
 7     3    3    3    2
 8     3    3    3    3
 9     3    3    3    3    2
10     3    3    3    3    3
11     3    3    3    3    3    2
12     3    3    3    3    3    3
13     3    3    3    3    3    3    2
14     3    3    3    3    3    3    3
15     3    3    3    3    3    3    3    2
16     3    3    3    3    3    3    3    3
17     3    3    3    3    3    3    3    3    2
18     3    3    3    3    3    3    3    3    3
19     3    3    3    3    3    3    3    3    3
20     3    3    3    3    3    3    3    3    3
I have tried the formulas below, as well as plain indexing, and I can't figure this out. Any help is appreciated, thanks!
=INDEX(P137:X156,MATCH(B2,O137:O156,1),MATCH(A10,P137:P156,1))
=INDEX(O137:O156,MATCH(1,(J125=P137:P156)*(J126=Q137:Q156)*(J127=R137:R156)*(J128=S137:S156)*(J129=T137:T156)*(J130=U137:U156)*(J131=V137:V156)*(J132=W137:W156)*(J133=X137:X156),0))
Let's say your data starts at A1, laid out like the table above. I added two simple cells where the user chooses the row and the column; both cells use data validation lists tied to your data, so no invalid value can be entered.
The formula (written with semicolon argument separators; use commas if that is your locale's separator) is:
=INDEX($1:$1048576;MATCH($C$25;$A:$A;0);MATCH($C$26;$1:$1;0))
Hope you can adapt this to your needs.
You can download the sample from Google Drive if you wish:
https://drive.google.com/open?id=1QXFmmEPMtJeiHDjKKM0o6kclpMIzaw_i
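Adapted to the ranges from the question (assuming the level numbers 1-20 sit in O137:O156, the slot grid in P137:X156, and the 1st-9th headers in P136:X136; the header row is my assumption), an exact-match version would be:
=INDEX(P137:X156,MATCH(A1,O137:O156,0),MATCH(B1,P136:X136,0))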

Data fill in specific pattern

I am trying to fill data in MS Excel. I am given the following pattern:
1 2
1
1
2 5
2 5
2
3
3 6
3
4
4
5 4
And I want my output in following format:
1 2
1 2
1 2
2 5
2 5
2 5
3 6
3 6
3 6
4
4
5 4
I tried using =IF(B2,B2,C1) in column 3, but that doesn't solve the problem for a=3 and a=4.
Any idea how to do this in Excel?
With the data sorted first (the effect of which, in this case, is merely to move the 6 up one cell) and a blank row above it, the following formula in C2, copied down, should get the result you ask for from the data sample provided:
=IF(AND(A2<>A1,B2=""),"",IF(B2<>"",B2,C1))
