How to combine rows in pandas - python-3.x

I have a dataset like this
df = pd.DataFrame({'a' : ['a', 'b' , 'b', 'a'], 'b': ['a', 'b' , 'b', 'a'] })
And I want to combine the first two rows to get a dataset like this:
df = pd.DataFrame({'a' : ['a b' , 'b', 'a'], 'b': ['a b' , 'b', 'a'] })
There are no rules other than it being the first two rows. I did not know how to combine rows, so I 'created' a method that does it via transpose(), as below:
db = df.transpose()
db["new"] = db[0].map(str) +' '+ db[1]
db.drop([0, 1], axis=1, inplace=True) # remove these two columns
cols = db.columns.tolist() # re order
cols = cols[-1:] + cols[:-1]
db = db[cols]
df = db.transpose() # reverse operation
df.reset_index()
It works, but I think there is an easier way.

You can simply add the two rows (inserting a space to match your desired output):
df.loc[0] = df.loc[0] + ' ' + df.loc[1]
df.drop(1, inplace=True)
You get
     a    b
0  a b  a b
2    b    b
3    a    a
A bit more fancy looking :)
df.loc[0] = df[:2].apply(lambda x: ' '.join(x))
df.drop(1, inplace=True)
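If you prefer not to mutate rows in place, here is a small sketch (assuming the same df as in the question) that builds the combined first row with agg and concatenates the remaining rows back on:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'b', 'a'], 'b': ['a', 'b', 'b', 'a']})

# Join the first two rows column-wise with a space: a Series {'a': 'a b', 'b': 'a b'}
top = df.iloc[:2].agg(' '.join)
# Turn it back into a one-row frame and stack the untouched rows below it
df = pd.concat([top.to_frame().T, df.iloc[2:]], ignore_index=True)
print(df)
```

This avoids the transpose round-trip entirely and keeps the space separator from the desired output.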

Related

Combine all column in df with pandas (itertools)

Hello, I have a df such as:
COL1 COL2 COL3 COL4
A    B    C    D
How can I get a new df with all combinations between columns, like this?
COL1_COL2  COL1_COL3  COL1_COL4  COL2_COL3  COL2_COL4  COL3_COL4
['A','B']  ['A','C']  ['A','D']  ['B','C']  ['B','D']  ['C','D']
I guess we could use itertools?
Indeed, itertools is useful here:
from itertools import combinations

columns = [df[c] for c in df.columns]
column_pairs = [
    pd.DataFrame(
        columns=[pair[0].name + '_' + pair[1].name],
        data=pd.concat([pair[0], pair[1]], axis=1).apply(list, axis=1),
    )
    for pair in combinations(columns, 2)
]
pd.concat(column_pairs, axis=1)
produces
COL1_COL2 COL1_COL3 COL1_COL4 COL2_COL3 COL2_COL4 COL3_COL4
-- ----------- ----------- ----------- ----------- ----------- -----------
0 ['A', 'B'] ['A', 'C'] ['A', 'D'] ['B', 'C'] ['B', 'D'] ['C', 'D']
1 ['a', 'b'] ['a', 'c'] ['a', 'd'] ['b', 'c'] ['b', 'd'] ['c', 'd']
(I added another row to the original df with a, b, c, d, to make sure it works in this slightly more general case)
The code is fairly straightforward: columns is a list of the original dataframe's columns, each as a pd.Series. combinations(columns, 2) enumerates all pairs of them. The pd.DataFrame(columns=[pair[0].name + '_' + pair[1].name], data=pd.concat([pair[0], pair[1]], axis=1).apply(list, axis=1)) call combines the first and second column of each pair into a single-column df with the combined name and values. Finally, pd.concat(column_pairs, axis=1) joins them all together.
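The same result can be sketched more compactly with a dict comprehension (assuming the single-row df from the question):

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'COL1': ['A'], 'COL2': ['B'], 'COL3': ['C'], 'COL4': ['D']})

# One output column per pair of input columns; each cell is a two-element list
out = pd.DataFrame({
    f'{c1}_{c2}': df[[c1, c2]].apply(list, axis=1)
    for c1, c2 in combinations(df.columns, 2)
})
print(out)
```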

Performing a merge in Pandas on a column containing a Python `range` or list-like

My question is an extension of this one made a few years ago.
I'm attempting a left join, but one of the columns I want to join on needs to be a range value. It needs to be a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively it seems possible using Python's in operator (as x in range(y, z) is very common), but that would involve a nasty for loop and if/else block. There has to be a better way.
Here's a simple version of my data:
# These are in any order
sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})

# The 'look-up' table
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101), range(0, 10), range(10, 101),
             range(0, 10), range(10, 101), range(0, 10), range(10, 101)],
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
I initially tried a merge to see if pandas would understand but there was a type mismatch error.
sample.merge(
    look_up,
    how='left',
    left_on=['col1', 'col2', 'col3'],
    right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Reviewing the documentation for pd.concat, it looks like it will not give me the result I want either; rather than appending, I'm still trying to get a merge-like result. I tried to follow the answer given to the question I linked at the start, but that didn't work either. It's entirely possible I misunderstood how to use np.where, but I'm also hoping there is a solution that is a little less hacky.
Here's my attempt using np.where:
s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values
l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values
i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
np.column_stack([sample.values[i], look_up.values[j]]),
columns=sample.columns.append(look_up.columns)
)
len(result) # returns 0
The result I want should look like this:
col1 col2 col3 col4
'1b' '2b' 42 'h'
'1a' '2b' 3 'c'
'1a' '2a' 21 'b'
'1b' '2a' 7 'e'
Since the ranges are pretty big and you are working with integer values, you can just compute the min and max of each range:
columns = look_up.columns
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)
(sample.merge(look_up, on=['col1', 'col2'], how='left',
              suffixes=['', '_'])
       .query('minval <= col3 <= maxval')
       [columns]
)
Output:
col1 col2 col3 col4
1 1b 2b 42 h
2 1a 2b 3 c
5 1a 2a 21 b
6 1b 2a 7 e
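An alternative sketch that skips the min/max helper columns: merge on the equality keys only, then keep the rows whose col3 falls inside the stored range object using Python's in operator (this reuses the sample and look_up frames from the question):

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

# Merge on the exact-match keys; look_up's col3 arrives as col3_lu
merged = sample.merge(look_up, on=['col1', 'col2'], how='left', suffixes=('', '_lu'))
# int(v) in range(...) is an O(1) membership test
mask = [int(v) in r for v, r in zip(merged['col3'], merged['col3_lu'])]
result = merged.loc[mask, ['col1', 'col2', 'col3', 'col4']]
print(result)
```

This keeps the look-up table untouched, at the cost of temporarily materialising every (col1, col2) match in merged.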

Does slicing a DataFrame in Python return a copy or a reference to the original DataFrame?

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df['X']
print(type(tmp_df)) # Here type is Series (as expected)
tmp_df.loc[:] = 12.3
print(tmp_df)
print(df)
This code changes the content of (original) DataFrame df.
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df.loc[['A', 'B'], ['W', 'X']]
print(type(tmp_df)) # Type is DataFrame
tmp_df.loc[:] = 12.3 # whereas, here when I change the content of the tmp_df it doesn't reflect on original array.
print(tmp_df)
print(df)
So does that mean that if we slice a Series out of a DataFrame, a reference to the original data is passed to the sliced object, whereas if a DataFrame is sliced, it doesn't point to the original DataFrame?
Please confirm whether my conclusion is correct. Help would be appreciated.
To put it in a simple manner: Indexing with lists in loc always returns a copy.
Let's work with a DataFrame df:
df = pd.DataFrame({'A': list(range(100))})
df.head(3)
Output:
   A
0  0
1  1
2  2
When we try an operation on the sliced data:
h=df.loc[[0,1,2],['A']]
h.loc[:] = 12.3
h
Output of h:
      A
0  12.3
1  12.3
2  12.3
The changes don't reflect in the original df, unlike in your first case:
df.head(3)
Output:
   A
0  0
1  1
2  2
But when you do tmp_df = df['X'], the Series tmp_df refers to the contents of column "X" in df, which is why df changes when you modify tmp_df.
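One caveat: whether df['X'] behaves as a view depends on the pandas version. Under copy-on-write (optional in pandas 2.x and the default in 3.x), modifying tmp_df no longer writes through to df. The version-independent way to be safe is an explicit .copy(); a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]})

# An explicit copy is detached from df in every pandas version
tmp = df['X'].copy()
tmp.loc[:] = 0
print(df['X'].tolist())   # [1, 2, 3] -- df is untouched

# List-based .loc indexing also returns a copy
sub = df.loc[[0, 1], ['X']]
sub.loc[:] = 99
print(df['X'].tolist())   # [1, 2, 3] -- still untouched
```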

Fetching value from dataframe with certain condition

I have a dataframe containing 3 columns (['A', 'B', 'C']) and 3 rows.
We are using a for loop to fetch a value (storing it into a variable) from the above dataframe based on a condition on column B.
Further, we are using a list to store the values held in the variable.
The question is that upon checking the list, we are getting the variable's value together with its type.
I'm not sure why this is happening, as the list should contain only the variable's value.
Can anyone please help us find a solution for this?
Thanks,
Bhuwan
The dataframe has columns A, B, C with row values a to i: df = [[a, b, c], [d, b, f], [g, b, i]].
list_1 = []
for i in range(0, 9):
    variable_1 = df['A'][df.B == 'b']
    list_1.append(variable_1)
print(list_1)
Ideal output: ['a','d','g']
while we are getting output as
['a type: object','d type: object','g type: object'].
You can get your ideal output like this:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})
list_1 = list(df[df['B'] == 'b']['A'].values) # <- this line
print(list_1)
> ['a', 'd', 'g']
You just need:
1) to filter your dataframe by column "B": df[df['B'] == 'b']
2) and only then take the values of the resulting column "A", turning them into a list
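The loop from the question isn't needed at all; a boolean mask plus .loc does the filtering and the column selection in one step. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})

# Select rows where B == 'b' and column 'A' in a single .loc call
list_1 = df.loc[df['B'] == 'b', 'A'].tolist()
print(list_1)   # ['a', 'd', 'g']
```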

Calling a list of DataFrame index with index key value

df = pd.DataFrame([[3,3,3]]*4,index=['a','b','c','d'])
While we can extract a copy of a section of an index via specifying row numbers like below:
i1=df.index[1:3].copy()
Unfortunately, we can't extract a copy of a section of an index by specifying keys (as with the df.loc method). When I try the below:
i2=df.index['a':'c'].copy()
I get the below error:
TypeError: slice indices must be integers or None or have an __index__ method
Is there any alternative to call a subset of an index based on its keys? Thank you
Simplest is loc with the index:
i1 = df.loc['b':'c'].index
print(i1)
Index(['b', 'c'], dtype='object')
Or it is possible to use get_loc for positions:
idx = df.index
i1 = idx[idx.get_loc('b') : idx.get_loc('c') + 1]
print(i1)
Index(['b', 'c'], dtype='object')
i2 = idx[idx.get_loc('b') : idx.get_loc('d') + 1]
print(i2)
Index(['b', 'c', 'd'], dtype='object')
Alternative:
i3 = idx[idx.searchsorted('b') : idx.searchsorted('d') + 1]
print(i3)
Index(['b', 'c', 'd'], dtype='object')
Try using .loc, see this documentation:
i2 = df.loc['a':'c'].index
print(i2)
Output:
Index(['a', 'b', 'c'], dtype='object')
or
df.loc['a':'c'].index.tolist()
Output:
['a', 'b', 'c']
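Another option, assuming the index is unique and sorted as in this example, is Index.slice_indexer, which converts label bounds into a positional slice directly on the Index (end label inclusive):

```python
import pandas as pd

df = pd.DataFrame([[3, 3, 3]] * 4, index=['a', 'b', 'c', 'd'])

idx = df.index
# slice_indexer('a', 'c') returns slice(0, 3), i.e. positions for 'a'..'c'
i2 = idx[idx.slice_indexer('a', 'c')]
print(i2.tolist())   # ['a', 'b', 'c']
```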
