How to combine a data frame with another that contains comma-separated values? - python-3.x

I am working with two data frames that I created from an Excel file. One data frame contains values that are separated by commas, that is:
  df1         df2
----------- ------------
0 LFTEG42    X,Y,Z
1 JOCOROW    1,2
2 TLR_U01    I
3 PR_UDG5    O,M
df1 and df2 are my column names. My intention is to merge the two data frames and generate the following output:
desired result
----------
0 LFTEG42X
1 LFTEG42Y
2 LFTEG42Z
3 JOCOROW1
4 JOCOROW2
5 TLR_U01I
6 .....
n PR_UDG5M
This is the code that I used, but I ended up with the following result:
input_file = pd.ExcelFile(
    'C:\\Users\\devel\\Desktop_12\\Testing\\latest_Calculation'
    + str(datetime.now()).split(' ')[0] + '.xlsx')

# convert the worksheets to dataframes
df1 = pd.read_excel(input_file, index_col=None, na_values=['NA'],
                    parse_cols="H", sheetname="Analysis")
df2 = pd.read_excel(input_file, index_col=None, na_values=['NA'],
                    parse_cols="I", sheetname="Analysis")

data_frames_merged = df1.append(df2, ignore_index=True)
current result
--------------
NaN X,Y,Z
NaN 1,2
NaN I
... ...
PR_UDG5 NaN
Questions
Why did I end up receiving a NaN (not a number) value?
How can I achieve my desired result of merging these two data frames with the comma values?

I'll break it down into steps. First, the NaN: df1.append(df2) stacks df2 underneath df1 and aligns on column names, so each row ends up with a value in only one of the two columns and the other cell is filled with NaN. The frames need to be combined side by side (axis=1) instead. A minimal sketch of what append is doing, with small stand-in frames:
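import pandas as pd

df1 = pd.DataFrame({'df1': ['LFTEG42', 'JOCOROW']})
df2 = pd.DataFrame({'df2': ['X,Y,Z', '1,2']})

# rows coming from df2 have no 'df1' value and vice versa,
# so the missing cells are filled with NaN
print(df1.append(df2, ignore_index=True))
#        df1    df2
# 0  LFTEG42    NaN
# 1  JOCOROW    NaN
# 2      NaN  X,Y,Z
# 3      NaN    1,2

With that explained, here are the steps to get the desired result: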
df = pd.concat([df1, df2], axis=1)
df.df2 = df.df2.str.split(',')
df = (df.set_index('df1').df2.apply(pd.Series).stack()
        .reset_index().drop('level_1', 1).rename(columns={0: 'df2'}))
df['New'] = df.df1 + df.df2
df
Out[34]:
df1 df2 New
0 LFTEG42 X LFTEG42X
1 LFTEG42 Y LFTEG42Y
2 LFTEG42 Z LFTEG42Z
3 JOCOROW 1 JOCOROW1
4 JOCOROW 2 JOCOROW2
5 TLR_U01 I TLR_U01I
6 PR_UDG5 O PR_UDG5O
7 PR_UDG5 M PR_UDG5M
Data input:
df1
Out[36]:
df1
0 LFTEG42
1 JOCOROW
2 TLR_U01
3 PR_UDG5
df2
Out[37]:
df2
0 X,Y,Z
1 1,2
2 I
3 O,M
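For what it's worth, on pandas 0.25 or newer the apply(pd.Series)/stack step can be collapsed into explode. A shorter sketch, assuming the same single-column frames as above:

df = pd.concat([df1, df2], axis=1)
df['df2'] = df['df2'].str.split(',')  # each cell becomes a list of tokens
df = df.explode('df2')                # one row per token, label repeated
df['New'] = df['df1'] + df['df2']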

Dirty one-liner
new_df = pd.concat([df1['df1'],
                    df2['df2'].str.split(',', expand=True).stack()
                       .reset_index(1, drop=True)], axis=1).sum(1)
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M

Also, similar to @Vaishali's answer, except using melt:
df = (pd.concat([df1, df2['df2'].str.split(',', expand=True)], axis=1)
        .melt(id_vars='df1').dropna().drop('variable', axis=1).sum(axis=1))
0 LFTEG42X
1 JOCOROW1
2 TLR_U01I
3 PR_UDG5O
4 LFTEG42Y
5 JOCOROW2
7 PR_UDG5M
8 LFTEG42Z

Setup
df1 = pd.DataFrame(dict(A='LFTEG42 JOCOROW TLR_U01 PR_UDG5'.split()))
df2 = pd.DataFrame(dict(A='X,Y,Z 1,2 I O,M'.split()))
Getting creative
df1.A.repeat(df2.A.str.count(',') + 1) + ','.join(df2.A).split(',')
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M
dtype: object
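This works because df2.A.str.count(',') + 1 is the number of tokens in each row, repeat replicates every label that many times, and ','.join(df2.A).split(',') flattens all the tokens in order, so the addition lines the two sides up element-wise.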

Related

Merging and joining two pandas data frames without including the right-side data frame columns

I have two dataframes, as given below.
df1 = pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[1,2,1,2,1,2], 'c':[1,2,4,0,0,2]})
df1
a b c
0 1 1 1
1 1 2 2
2 2 1 4
3 2 2 0
4 3 1 0
5 3 2 2
df2 = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,1,2], 'c':[1,5,6,2]})
df2
a b c
0 1 1 1
1 1 2 5
2 2 1 6
3 2 2 2
I want to apply an inner join of the two data frames without keeping the columns from df2, so I tried the code below.
merged_df = df1.merge(df2, how='inner', left_on=["a", "b"], right_on=["a","b"])
a b c_x c_y
0 1 1 1 1
1 1 2 2 5
2 2 1 4 6
3 2 2 0 2
From the above code, without dropping c_x and c_y manually, is there any way not to bring in the right dataframe's (df2's) columns?
Basically, I want all the columns from df1 and no columns from df2 after merging.
Thanks in advance.
The idea is to filter down to only the columns needed for merging, here a and b. When merging by both columns, the on parameter can be omitted (pandas then merges on the intersection of the columns present in both DataFrames):
merged_df = df1.merge(df2[["a", "b"]])
which works the same as:
merged_df = df1.merge(df2[["a", "b"]], on=['a','b'])
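As a quick check with the data above (the printed frame is my own reading of the result, not from the original post):

merged_df = df1.merge(df2[["a", "b"]])
print(merged_df)
#    a  b  c
# 0  1  1  1
# 1  1  2  2
# 2  2  1  4
# 3  2  2  0

Only df1's columns survive, since df2 contributes nothing beyond the join keys.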

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" of a dataframe from the columns "accumulated" and "weekly". The formula is: accumulated[t] = weekly[t] + accumulated[t-1].
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    # df_aux["accumulated"] = df_aux.iloc[:, 2] + df_aux.iloc[:, 3].shift(1)
    df_aux.iloc[0, 3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0, 3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes, seen as {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I use the for loop to apply the same calculation to every dataframe inside the dictionary.
EDIT 2: I tried doing the computation outside the for loop and it is still not working.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
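For what it's worth, my reading of why the shift approach cannot work: the right-hand side is evaluated once, against the column as it was before the assignment (all zeros), so shift(1) shifts those zeros rather than a running total; a vectorized assignment never feeds its results back row by row. A minimal demonstration, using the weekly numbers from the question:

import pandas as pd

df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
df['accumulated'] = 0
# the shift sees the pre-assignment zeros: [NaN, 0, 0, 0]
df['accumulated'] = df.weekly + df.accumulated.shift(1)
print(df['accumulated'].tolist())  # [nan, 1.0, 4.0, 2.0] -- not a running sum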
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0, 0]
    print(d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row: unrolling the recurrence gives accumulated[t] = weekly[1] + ... + weekly[t], which is exactly the cumulative sum minus the first weekly value. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7

Converting dataframe fraction to float

I would like to convert the string values in column B to float. I am wondering how I should do it.
A B
1 16-1/4
2 3-1/4
3 21-1/4
4 8-1/4
Update:
Give map a try to avoid pd.eval's limit of 100 rows:
df['C'] = df.B.str.replace('-', '+').map(pd.eval)
Original:
As per your comment, it seems you are adding the fraction to the whole number, so the solution would be:
df['C'] = pd.eval(df.B.str.replace('-', '+'))
Out[5]:
A B C
0 1 16-1/4 16.25
1 2 3-1/4 3.25
2 3 21-1/4 21.25
3 4 8-1/4 8.25
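If you would rather avoid evaluating strings entirely, a small parser built on the standard fractions module does the same job. A sketch, assuming every value follows the whole-hyphen-fraction pattern shown above:

from fractions import Fraction

def mixed_to_float(s):
    # '16-1/4' -> 16 + 1/4 = 16.25
    whole, frac = s.split('-')
    return int(whole) + float(Fraction(frac))

df['C'] = df['B'].map(mixed_to_float)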
Use the built-in Python function eval() (note that this evaluates '16-1/4' as sixteen minus a quarter, i.e. 15.75, rather than as the mixed fraction 16.25):
df.B = df.B.apply(eval)
Test:
In[1]: df
A B
0 1 16-1/4
1 2 3-1/4
2 3 21-1/4
3 4 8-1/4
In[2]: df.B = df.B.apply(eval)
In[3]: df
A B
0 1 15.75
1 2 2.75
2 3 20.75
3 4 7.75

In Pandas, how to filter against other dataframe with Multi-Index

I have two dataframes. The first one (df1) has a Multi-Index A,B.
The second one (df2) has those fields A and B as columns.
How do I filter df2, for a large dataset (2 million rows in each), to get only the rows in df2 where A and B are not in the multi-index of df1?
import pandas as pd

df1 = pd.DataFrame([(1, 2, 3), (1, 2, 4), (1, 2, 4), (2, 3, 4), (2, 3, 1)],
                   columns=('A', 'B', 'C')).set_index(['A', 'B'])
df2 = pd.DataFrame([(7, 7, 1, 2, 3), (7, 7, 1, 2, 4), (6, 6, 1, 2, 4),
                    (5, 5, 6, 3, 4), (2, 7, 2, 2, 1)],
                   columns=('X', 'Y', 'A', 'B', 'C'))
df1:
C
A B
1 2 3
2 4
2 4
2 3 4
3 1
df2 before filtering:
X Y A B C
0 7 7 1 2 3
1 7 7 1 2 4
2 6 6 1 2 4
3 5 5 6 3 4
4 2 7 2 2 1
df2 wanted result:
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Create a MultiIndex in df2 from the A and B columns and filter with Index.isin, using ~ to invert the boolean mask for boolean indexing:
df = df2[~df2.set_index(['A','B']).index.isin(df1.index)]
print (df)
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Another similar solution with MultiIndex.from_arrays:
df = df2[~pd.MultiIndex.from_arrays([df2['A'],df2['B']]).isin(df1.index)]
Another solution, by @Sandeep Kadapa:
df = df2[df2[['A','B']].ne(df1.reset_index()[['A','B']]).any(axis=1)]
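At two million rows an anti-join via merge with indicator=True is another common pattern; a sketch (not from the original answers), assuming the same df1/df2 as above:

# left-merge df2 against the de-duplicated key pairs of df1, then keep
# only the rows that found no partner
keys = df1.reset_index()[['A', 'B']].drop_duplicates()
merged = df2.merge(keys, on=['A', 'B'], how='left', indicator=True)
df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')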

Placing n rows of a pandas dataframe into their own dataframe

I have a large dataframe with many rows and columns.
An example of the structure is:
import numpy as np
import pandas as pd

a = np.random.rand(6, 3)
df = pd.DataFrame(a)
I'd like to split the DataFrame into separate data frames, each consisting of 3 rows.
You can use groupby:
g = df.groupby(np.arange(len(df)) // 3)
for n, grp in g:
    print(grp)
0 1 2
0 0.278735 0.609862 0.085823
1 0.836997 0.739635 0.866059
2 0.691271 0.377185 0.225146
0 1 2
3 0.435280 0.700900 0.700946
4 0.796487 0.018688 0.700566
5 0.900749 0.764869 0.253200
To get it into a handy dictionary:
mydict = {k: v for k, v in g}
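mydict[0] is then the first three-row chunk, mydict[1] the next, and so on, each its own DataFrame with the original row labels preserved.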
You can use the numpy.split() method:
In [8]: df = pd.DataFrame(np.random.rand(9, 3))
In [9]: df
Out[9]:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
In [10]: for x in np.split(df, len(df)//3):
...: print(x)
...:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
0 1 2
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
0 1 2
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
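Note that np.split with an integer count requires an even division (len(df) // 3 equal chunks here), while passing explicit split points tolerates a short final chunk. A sketch under that assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(7, 3))
# boundaries at rows 3 and 6 give chunks of 3, 3 and 1 rows
for chunk in np.split(df, np.arange(3, len(df), 3)):
    print(chunk)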
