Merging sheets of excel using python - python-3.x

I am trying to take data from two sheets and compare them; if a name matches, I want to append a column. Let me explain what I am doing and what I am trying to get as output using Python.
This is my Sheet1 from excel.xlsx:
it contains four columns: NAME, CLASS, AGE, and GROUP.
This is my Sheet2 from excel.xlsx:
it contains a DEFAULT column and a NAME column with extra names in it.
So now I am trying to match the names of Sheet2 with Sheet1: if a name in Sheet1 also appears in Sheet2, I want to add the DEFAULT value corresponding to that name from Sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a DEFAULT in Sheet2 and their names match names in Sheet1. Suhas and Aish don't have any DEFAULT value, so nothing appears for them.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I am getting output like this:
The DEFAULT is not coming through for Ravi.
Please help me get the expected output using Python.

Assuming you read each sheet into a DataFrame (df = Sheet1, df2 = Sheet2), it's quite easy, and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat (concat only supports 'inner'/'outer' joins; 'outer' also keeps names that appear only in df2)
df = pd.concat([df.set_index('Name'), df2.set_index('Name')['Default']], axis=1, join='outer', sort=False).reset_index()
# .join (a left join on the index by default)
df = df.set_index('Name').join(df2.set_index('Name')).reset_index()
# .map
df['Default'] = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will produce the following output:
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you write the result back over the original sheet using df.to_excel.
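For example (a sketch: a plain df.to_excel('stack.xlsx') call would replace the whole workbook, so to rewrite only Sheet1 in place you can append with openpyxl, which requires pandas 1.3+):
import pandas as pd

# Overwrite just Sheet1 inside the existing workbook, keeping Sheet2 intact
with pd.ExcelWriter('stack.xlsx', mode='a', engine='openpyxl',
                    if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)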
EDIT
So the code you shared has three problems, one of which seems to be a language barrier: you only need one of the options I gave you, not all of them. Secondly, there was a missing ' when reading the first sheet into df1. And lastly, you were inconsistent with the DataFrame names: you defined df1 and df2 but then used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
# Now choose just one of the options; I used .map here, but you can pick any of them
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)

Related

Please help me with the following pandas data frame manipulation

I have a dataframe as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'location': ['Hyd', 'Mum', 'Viz'],
    'rank1': [1, 2, 3],
    'rank2': [np.NaN, 1, 2],
})
it will look like this:
location rank1 rank2
0 Hyd 1 NaN
1 Mum 2 1
2 Viz 3 2
now I want to add a column 'source' such that it will look like below:
location rank1 rank2 source
0 Hyd 1 NaN Mum
1 Mum 2 1 Viz
2 Viz 3 2 none
As you can see above, in the first row we got Mum because rank1 = 1 in the first row equals rank2 in the second row, whose location is Mum, so we assigned Mum to source in the first row; the same logic applies to the others.
Please help me with it
Thank you
The following code should work. Since rank is assumed to be a unique value, the 'Source' column is bound to have NaN in its last row.
# let x be the dataframe
import numpy as np
import pandas as pd

x = pd.DataFrame({
    'location': ['Hyd', 'Mum', 'Viz'],
    'rank1': [1, 2, 3],
    'rank2': [np.NaN, 1, 2],
})

# Make an empty list for source
source = []

# Loop over 'rank1'
for i in x['rank1']:
    # for every i in rank1, find the index of the row with i in rank2
    # (yes, the index will be the very next one,
    # provided that the data is sorted)
    try:
        source.append(x['location'][list(x['rank2']).index(i)])
    # the last value in rank1 will not be present in rank2,
    # so catch that error and append NaN instead
    except ValueError:
        source.append(np.NaN)

# once the source list is ready, make a new column and fill in these values
x['Source'] = source
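A shorter, vectorized alternative (my own sketch, not part of the original answer): map each rank1 value to the location whose rank2 equals it, which yields NaN automatically where there is no match.

x['Source'] = x['rank1'].map(x.set_index('rank2')['location'])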

How do I validate data mapping between 2 data frames in pandas

I am trying to validate a data mapping between two data frames for specific columns. I need to validate the following:
whether values in a specific column in df1 match the mapping in a specific column in df2;
whether values in a specific column in df1 do not match the specified mapping in a specific column in df2 (a different value in df2);
whether values in a specific column in df1 have no match at all in df2.
df1 looks like this:
cp_id  cp_code
2A23   A
2A24   D
3A45   G
7A96   B
2A30   R
6A18   K
df2 looks like this:
cp_type_id  cp_type_code
2A23        8
2A24        7
3A45        3
2A44        1
6A18        8
4A08        2
The data mapping consists of sets of values, where the combination can match any value within the set, as follows:
('A','C','F','K','M') in df1 should map to (2, 8) in df2 - either 2 or 8
('B') in df1 should map to 4 in df2
('D','G','I') in df1 should map to 7 in df2
('T','U') in df1 should map to (3,5) in df2 - either 3 or 5
Note that df1 has a cp_code as R which is not mapped and that 3A45 is a mismatch. The good news is there is a unique identifier key to use.
First, I created a list for each mapping set and a merge statement to check each mapping; I ended up with 3 lists and 3 statements per set, and I am not sure that is the right way to do it.
At the end I want to combine the matches into one df that I call match, all no_matches into another df that I call no_match, and all no_mappings into another df that I call no_mapping, like the following:
Match
cp_id  cp_code  cp_type_id  cp_type_code
2A23   A        2A23        8
2A24   D        2A24        7
6A18   K        6A18        8
Mismatch
cp_id  cp_code  cp_type_id  cp_type_code
3A45   G        3A45        3
No Mapping
cp_id  cp_code  cp_type_id  cp_type_code
7A96   B        NaN         NaN
NaN    NaN      2A44        1
2A30   R        NaN         NaN
NaN    NaN      4A08        2
I am having a hard time making the no_match case work.
This is what I tried for no match:
I filtered df1 based on the set 2 codes, filtered df2 based on the codes not in mapping set 2, and for the no mapping I did a merge with on='cp_id':
no_mapping_set2 = df1_filtered.merge(df2_filtered, on='cp_id', indicator=True)
With the code above, for cp_code = 'B', for example, instead of getting only one row back, I get a lot of duplicate rows.
Just to state my level, I am a beginner in Python. Any help would be appreciated.
Thank you so much for your time.
Rob
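One way to sketch this classification (an illustration under the assumptions above, not a tested answer from the thread): encode each mapping set as a dict of allowed codes, outer-merge on the identifiers, and split on the merge indicator plus a validity check.

import pandas as pd

df1 = pd.DataFrame({'cp_id': ['2A23', '2A24', '3A45', '7A96', '2A30', '6A18'],
                    'cp_code': ['A', 'D', 'G', 'B', 'R', 'K']})
df2 = pd.DataFrame({'cp_type_id': ['2A23', '2A24', '3A45', '2A44', '6A18', '4A08'],
                    'cp_type_code': [8, 7, 3, 1, 8, 2]})

# Each cp_code maps to the set of cp_type_codes it is allowed to pair with
mapping = {**dict.fromkeys(['A', 'C', 'F', 'K', 'M'], {2, 8}),
           'B': {4},
           **dict.fromkeys(['D', 'G', 'I'], {7}),
           **dict.fromkeys(['T', 'U'], {3, 5})}

# Outer join on the shared identifier so unmatched ids survive as NaN rows
merged = df1.merge(df2, how='outer', left_on='cp_id', right_on='cp_type_id',
                   indicator=True)

both = merged['_merge'] == 'both'
valid = merged.apply(
    lambda r: r['cp_type_code'] in mapping.get(r['cp_code'], set()), axis=1)

match = merged[both & valid].drop(columns='_merge')
mismatch = merged[both & ~valid].drop(columns='_merge')
no_mapping = merged[~both].drop(columns='_merge')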

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer: currently the size of my data frame is 1050x1 (since there is only one column), and I want the new size to be 1050x2, with the second column completely empty.
In pandas, a DataFrame always has columns, so to add a new default column filled with missing values, use the current number of columns as the new column's name:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# which, for a one-column df like this, is the same as
# df[1] = np.nan
print(df)
0 1
0 2 NaN
1 3 NaN
2 4 NaN

Create pandas dataframe from csv rows in string or list format

I convert some data into a CSV-string-like format row by row; for example, the rows look like:
string format
1st row: "A,B,R,K,S,E"
2nd row: "B,C,S,E,G,Q,W,R,W" # sometimes longer rows
3rd row: "A,E,R,E,S" # sometimes shorter rows
or list format
1st row: ['A','B','R','K','S','E']
2nd row: ['B','C','S','E','G','Q','W','R','W']
3rd row: ['A','E','R','E','S']
I can also add \n at the end of each row.
I want to create a pandas dataframe from these rows but am not sure how.
Normally I just save this data into a .csv file and then use pd.read_csv, but I want to skip that step.
Thanks for the help
This will solve your problem:
import numpy as np
import pandas as pd
First_row=['A','B','R','K','S','E']
Second_row=['B','C','S','E','G','Q','W','R','W']
Third_row=['A','E','R','E','S']
df=pd.DataFrame({'1st row':pd.Series(First_row),'2nd row':pd.Series(Second_row),'3rd row':pd.Series(Third_row)})
answer=df.T
answer
0 1 2 3 4 5 6 7 8
1st row A B R K S E NaN NaN NaN
2nd row B C S E G Q W R W
3rd row A E R E S NaN NaN NaN NaN
Method - 1 : From List
Take a 2D list (a list of rows) and append each row to it; otherwise pandas would add the values as columns.
Method - 2 : From String
Split each string on ',' to get a list of values per row, then build the DataFrame the same way, as in the sketch below.
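A minimal sketch of both methods (my illustration; it assumes shorter rows should be NaN-padded, as in the output above):

import pandas as pd

# Method 1: a 2D list -- one inner list per row; short rows are NaN-padded
rows_list = [['A', 'B', 'R', 'K', 'S', 'E'],
             ['B', 'C', 'S', 'E', 'G', 'Q', 'W', 'R', 'W'],
             ['A', 'E', 'R', 'E', 'S']]
df_from_list = pd.DataFrame(rows_list)

# Method 2: comma-separated strings -- split each one into a row first
rows_str = ["A,B,R,K,S,E", "B,C,S,E,G,Q,W,R,W", "A,E,R,E,S"]
df_from_str = pd.DataFrame([row.split(',') for row in rows_str])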

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, the following should help. It does assume that you always have data from 12 months earlier, however. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Then create your DataFrame:
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# All months from the start of the data through now
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
# Get the corresponding data from 12 months before each missing month
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe to include the missing months
df = df.reindex(s)
# Fill the new trailing dates with the lagged data
df.iloc[-fill.shape[0]:] = fill.values
