Build Pandas DataFrame with String Entries using 2 Separate DataFrames - python-3.x

Suppose you have two separate pandas DataFrames with the same row and column indices (in my case, the column indices were constructed by .unstack()'ing a MultiIndex built using df.groupby([col1,col2])):
df1 = pd.DataFrame({'a':[.01,.02,.03],'b':[.04,.05,.06]})
df2 = pd.DataFrame({'a':[.04,.05,.06],'b':[.01,.02,.03]})
Now suppose I would like to create a 3rd DataFrame, df3, where each entry of df3 is a string which uses the corresponding element-wise entries of df1 and df2. For example,
df3.iloc[0,0] = '{:.0%}'.format(df1.iloc[0,0]) + '\n' + '{:.0%}'.format(df2.iloc[0,0])
I recognize this is probably easy enough to do by looping over all entries in df1 and df2 and creating a new entry in df3 from each pair of values (which can be slow for large DataFrames), or even by joining the two DataFrames together (which may require renaming columns), but I am wondering if there is a more pythonic / pandorable way of accomplishing this, possibly using applymap or some other built-in pandas function?
The question is similar to Combine two columns of text in dataframe in pandas/python, but that question does not consider combining multiple DataFrames into a single one.

IIUC, you just need to add df1 and df2 with '\n' in between:
df3 = df1.astype(str) + '\n' + df2.astype(str)
Out[535]:
            a           b
0  0.01\n0.04  0.04\n0.01
1  0.02\n0.05  0.05\n0.02
2  0.03\n0.06  0.06\n0.03

You can make use of the vectorized operations of pandas (given that the DataFrames share the same row and column indices):
(df1 * 100).astype(str) + '%\n' + (df2 * 100).astype(str) + '%'
You get:
            a           b
0  1.0%\n4.0%  4.0%\n1.0%
1  2.0%\n5.0%  5.0%\n2.0%
2  3.0%\n6.0%  6.0%\n3.0%
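If you want exactly the '{:.0%}' formatting from the question (whole-number percentages, no trailing .0), a minimal sketch using applymap, reusing the df1 and df2 defined above:
import pandas as pd

df1 = pd.DataFrame({'a': [.01, .02, .03], 'b': [.04, .05, .06]})
df2 = pd.DataFrame({'a': [.04, .05, .06], 'b': [.01, .02, .03]})

# Format every entry as a whole-number percentage, then join element-wise
fmt = '{:.0%}'.format
df3 = df1.applymap(fmt) + '\n' + df2.applymap(fmt)
print(df3.iloc[0, 0])  # 1%\n4%, printed on two lines
(In pandas 2.1+, DataFrame.map is the preferred name for applymap.)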

Related

How to create a single-row pandas DataFrame with headers from a big pandas DataFrame

I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row as a DataFrame from this big DataFrame, to use it as a parameter of a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame; it comes back as a Series. How can I solve this?
I think this is usually not necessary, because the index of each row's Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is very slow.
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]
This makes a list with a one-row DataFrame for each row in df. (Indexing iloc with a list keeps the DataFrame shape; pd.DataFrame(df.iloc[i]) would give you each row transposed into a single column.)
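A minimal runnable sketch of both approaches, with stand-in data (the real frame has 55000 rows):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Option 1: slice with a list so each row stays a one-row DataFrame
rows = [df.iloc[[i]] for i in range(len(df))]

# Option 2: convert a single row's Series back into a one-row DataFrame
first_row = df.iloc[0].to_frame().T

print(rows[0].equals(first_row))  # True
One caveat: with mixed column dtypes, option 2 goes through an object-dtype Series and comes back as object columns, while option 1 preserves the original dtypes.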

Splitting dataframe based on multiple column values

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df

    ID      Type  File
0  123     Phone     1
1  122  Computer     2
2  126  Computer     1
I want to split this dataframe based on Type and File. If the total count of Type is 2 (Phone and Computer) and the total number of Files is 2 (1, 2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two?
My Code:
data = {'ID': ['123', '122', '126'], 'Type': ['Phone', 'Computer', 'Computer'], 'File': [1, 2, 1]}
df = pd.DataFrame(data)
types = list(set(df['Type']))
total_splits = len(set(df['Type'])) * len(set(df['File']))
cnt = 1
for i in range(0, total_splits):
    for j in types:
        locals()["df" + str(cnt)] = df[df['Type'] == j]
        cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!
You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type', 'File']):
    typ, file = k  # unpack the (Type, File) key
    # do whatever you want here
    # d is the sub-dataframe for this (Type, File) combination
    dfs[k] = d
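For the sample data above, a quick check of what this produces. Note that groupby only yields combinations that actually occur in the data, so here you get 3 sub-frames rather than 4:
import pandas as pd

df = pd.DataFrame({'ID': ['123', '122', '126'],
                   'Type': ['Phone', 'Computer', 'Computer'],
                   'File': [1, 2, 1]})

# One sub-frame per (Type, File) combination, keyed by the tuple
dfs = {k: d for k, d in df.groupby(['Type', 'File'])}
print(sorted(dfs))  # [('Computer', 1), ('Computer', 2), ('Phone', 1)]
print(dfs[('Phone', 1)])
#     ID   Type  File
# 0  123  Phone     1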
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(3)]
gives you the first dataframe you want, i.e. Type=='Phone' and File==1 (File==1 contributes 2 and Type=='Phone' contributes 1, so the four combinations map to the mask values 0 through 3), and so on.

Python: DataFrame Index shifting

I have several dataframes that I have concatenated with pandas in the line:
xspc = pd.concat([df1,df2,df3], axis = 1, join_axes = [df3.index])
In df2 the index values read one day later than the values of df1 and df3. So, for instance, when the most current date is 7/1/19, the index values for df1 and df3 will read '7/1/19' while df2 reads '7/2/19'. I would like to concatenate the frames so that each dataframe is joined on its most recent date; in other words, I would like all the dataframe values from df1 index value '7/1/19' to be concatenated with dataframe 2 index value '7/2/19' and dataframe 3 index value '7/1/19'. What methods can I use to shift the data around to join on these non-matching index values?
You can reset the index of the data frame and then concat the dataframes
df1=df1.reset_index()
df2=df2.reset_index()
df3=df3.reset_index()
df_final = pd.concat([df1, df2, df3], axis=1)
(Note that join_axes was removed in pandas 1.0; after reset_index the three frames align positionally, so a plain concat along axis=1 does the job.)
This should work since you mentioned that the date in df2 will be one day after df1 or df3.
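An alternative that keeps the dates as the index, assuming df2 carries a DatetimeIndex (a sketch with made-up frames, not tested against the original data): shift df2's index back one day so all three frames share the same dates, then concatenate on the index.
import pandas as pd

# Hypothetical frames with a DatetimeIndex; df2's dates run one day late
idx = pd.to_datetime(['2019-06-30', '2019-07-01'])
df1 = pd.DataFrame({'x': [1, 2]}, index=idx)
df3 = pd.DataFrame({'z': [5, 6]}, index=idx)
df2 = pd.DataFrame({'y': [3, 4]}, index=idx + pd.Timedelta(days=1))

# Shift df2 back one day so the indices line up, then concat on the index
df2.index = df2.index - pd.Timedelta(days=1)
xspc = pd.concat([df1, df2, df3], axis=1).reindex(df3.index)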

Difference Between two Data frames

Is there any way to subtract the values of two existing dataframes with common headers in Java?
For example
DF1
|H0|H1|H2|H3|
|00|01|02|03|
|04|05|06|07|
|08|09|10|11|
DF2
|H0|H1|H2|H3|H4|
|01|02|03|04|12|
|05|06|07|08|13|
|09|11|12|13|14|
Subtraction example:
DF2 - DF1
|H0|H1|H2|H3|H4|
|01|01|01|01|12|
|01|01|01|01|13|
|01|01|01|01|14|
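The question asks about Java, but for reference, here is a rough pandas sketch of the same idea: subtract over the shared headers and pass the extra columns through unchanged (using the tables above as input).
import pandas as pd

df1 = pd.DataFrame({'H0': [0, 4, 8], 'H1': [1, 5, 9],
                    'H2': [2, 6, 10], 'H3': [3, 7, 11]})
df2 = pd.DataFrame({'H0': [1, 5, 9], 'H1': [2, 6, 11],
                    'H2': [3, 7, 12], 'H3': [4, 8, 13],
                    'H4': [12, 13, 14]})

# Subtract only over the headers the two frames share; columns that exist
# only in df2 (H4 here) are left as-is
common = df2.columns.intersection(df1.columns)
result = df2.copy()
result[common] = df2[common] - df1[common]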

Pandas - Aggregating column value from another dataframe based on common column between 2 dataframes

I have 2 different dataframes (shown as screenshots in the original post; from the answer below, the first has a 'K ID' column and the second has 'K ID' and 'C' columns).
I need to add a column "Present In" to the first dataframe, that lists all the items in C that correspond to the K ID in the second dataframe.
How can I do this using Pandas? Thanks! :)
I would do a groupby on df2, then map:
s = df2.groupby('K ID')['C'].apply(','.join)
df1['Present In'] = df1['K ID'].map(s).fillna('')
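A runnable sketch with made-up stand-in data (the original frames were screenshots):
import pandas as pd

# Hypothetical stand-ins for the screenshots in the post
df1 = pd.DataFrame({'K ID': [1, 2, 3]})
df2 = pd.DataFrame({'K ID': [1, 1, 2], 'C': ['a', 'b', 'c']})

# Collect every C per K ID, then look the result up for each row of df1
s = df2.groupby('K ID')['C'].apply(','.join)
df1['Present In'] = df1['K ID'].map(s).fillna('')
print(df1)
#    K ID Present In
# 0     1        a,b
# 1     2          c
# 2     3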
