I have multiple dataframes, each with a different number of rows and columns.
Example:
df1:
a b c d
0 1 5 6
8 9 8 7
and df2:
g h
9 8
4 5
6 7
I have to append both dataframes without changing their dimensions.
The desired output should be a single dataframe, Result_df, like this:
a b c d
0 1 5 6
8 9 8 7
g h
9 8
4 5
6 7
Can anyone please help me append these dataframes without changing their structure?
Thank you.
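One possible approach (a minimal sketch, not an answer from the thread): a single pandas frame cannot carry two different column sets, so one workaround is to demote each frame's header to an ordinary data row and stack the blocks positionally. The helper name with_header is made up for illustration.

import pandas as pd

df1 = pd.DataFrame({'a': [0, 8], 'b': [1, 9], 'c': [5, 8], 'd': [6, 7]})
df2 = pd.DataFrame({'g': [9, 4, 6], 'h': [8, 5, 7]})

# Hypothetical helper: prepend the column names as a plain data row,
# so each block keeps its own header when the blocks are stacked.
def with_header(df):
    header = pd.DataFrame([list(df.columns)])
    body = pd.DataFrame(df.to_numpy())
    return pd.concat([header, body], ignore_index=True)

# Stack the two blocks; the narrower block's missing cells become ''.
result_df = pd.concat([with_header(df1), with_header(df2)],
                      ignore_index=True).fillna('')
print(result_df.to_string(index=False, header=False))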
I am having issues trying to modify a value in columnB when columnA contains certain values, let's say 0, 2, 12, 44, 75, or 81 (I am looking for 33 different numerical values in columnA). If columnA contains one of the 33 values defined, I need to change the same row of columnB to "approved".
I have tried this:
df[(df.columnA == '0|2|12|44|75|81')][df.columnB] = 'approved'
I get an error saying that none of the indexes are in the columns, but I know that isn't correct looking at my CSV file data. I must be searching wrong or using the wrong syntax.
As others have said, if you need the value in column A to match a value in the list exactly, you can use .isin. If you need a partial match, you can convert to string dtype and use .str.contains.
Setup:
nums_to_match = [0,2,9]
df
A B
0 6 e
1 1 f
2 3 b
3 6 g
4 6 i
5 0 f
6 9 a
7 6 i
8 6 a
9 2 h
Exact match:
df.loc[df.A.isin(nums_to_match), 'B'] = 'approved'
df:
A B
0 6 e
1 1 f
2 3 b
3 6 g
4 6 i
5 0 approved
6 9 approved
7 6 i
8 6 a
9 2 approved
Partial match:
nums_to_match_str = list(map(str, nums_to_match))
df['A'] = df['A'].astype(str)
df.loc[df.A.str.contains('|'.join(nums_to_match_str), case=False, na=False), 'B'] = 'approved'
df
A B
0 1 h
1 4 c
2 6 i
3 7 d
4 3 d
5 9 approved
6 5 i
7 1 c
8 0 approved
9 5 d
I have some data like this
df = pd.DataFrame({'class':['a','a','b','b','a','a','b','c','c'],'score':[3,5,6,7,8,9,10,11,14]})
df
class score
0 a 3
1 a 5
2 b 6
3 b 7
4 a 8
5 a 9
6 b 10
7 c 11
8 c 14
I want to use the groupby function to extract the top n% of rows in each group (descending by score). I know nlargest can do it, but the number of rows differs between groups, so I don't know how to set the count per group.
I tried this:
top_n = 0.5
g = df.groupby(['class'])['score'].apply(lambda x: x.nlargest(int(round(top_n * len(x))), keep='all')).reset_index()
g
class level_1 score
0 a 5 9
1 a 4 8
2 b 6 10
3 b 3 7
4 c 8 14
but it cannot handle big data (more than 10 million rows); it is very slow. How do I speed it up? Thank you!
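One way to speed this up (a minimal sketch of my own, not from the question): sort the whole frame once and use vectorized group counters instead of a per-group apply. Note that this keeps exactly round(group size * top_n) rows per group and does not reproduce the tie handling of nlargest(..., keep='all').

import pandas as pd

df = pd.DataFrame({'class': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'c', 'c'],
                   'score': [3, 5, 6, 7, 8, 9, 10, 11, 14]})
top_n = 0.5

# Sort once globally, rank rows inside each group with cumcount, and
# keep the ranks that fall inside the top_n fraction of the group size.
s = df.sort_values('score', ascending=False)
rank = s.groupby('class').cumcount()
size = s.groupby('class')['score'].transform('size')
g = s[rank < (size * top_n).round()]
print(g.sort_values('class'))

On the sample data this returns the same rows as the apply version (a: 9 and 8; b: 10 and 7; c: 14), but in a single pass over the frame rather than one nlargest call per group.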
I have two data frames. Examples:
df1:
A B C
5 7 6
8 1 1
1 0 7
3 4 9
5 7 4
9 2 0
df2:
A B C
3 2 1
6 5 7
9 7 9
1 1 2
6 4 5
0 8 6
Both data frames have the same index.
What I want is: wherever df1's value is less than 5, update the corresponding df2 value to 0; otherwise keep it the same.
I tried the following code:
df2[df1<5]=0
but when I print df2, it shows the same values as the original df2.
I know I am missing something really simple.
Please help me.
Thank you.
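For what it's worth, a minimal sketch of one common way to do this, assuming both frames really do share the same index and column labels (the boolean-indexed assignment in the question only takes effect when those align):

import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 1, 3, 5, 9],
                    'B': [7, 1, 0, 4, 7, 2],
                    'C': [6, 1, 7, 9, 4, 0]})
df2 = pd.DataFrame({'A': [3, 6, 9, 1, 6, 0],
                    'B': [2, 5, 7, 1, 4, 8],
                    'C': [1, 7, 9, 2, 5, 6]})

# mask() zeroes every cell whose aligned df1 value is below 5 and
# returns a new frame; assigning it back avoids chained-assignment
# and copy-versus-view surprises.
df2 = df2.mask(df1 < 5, 0)
print(df2)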
I have a dataframe that I need to randomise in a very specific way with a particular rule, and I'm a bit lost. A simplified version is here:
idx type time
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 a 3
9 a 3
10 b 4
11 b 4
12 b 4
13 a 5
14 a 5
15 a 5
16 b 6
17 b 6
18 b 6
19 a 7
20 a 7
21 a 7
If we consider this as containing seven "bunches", I'd like to randomly shuffle by those bunches, i.e. retaining the time column. However, the constraint is that after shuffling, a particular bunch type (a or b in this case) cannot appear more than n (e.g. 2) times in a row. So an example correct result looks like this:
idx type time
21 a 7
20 a 7
19 a 7
7 a 3
8 a 3
9 a 3
17 b 6
16 b 6
18 b 6
6 b 2
5 b 2
4 b 2
2 a 1
3 a 1
1 a 1
14 a 5
13 a 5
15 a 5
12 b 4
11 b 4
10 b 4
I was thinking I could create a separate "order" array from 1 to 7, np.random.shuffle() it, then sort the dataframe's time values by that order. That part will probably work; I can think of ways to do it. But I'm especially struggling with the rule restricting the number of repeats.
I know roughly that I should use a while loop: shuffle as above, loop over the frame tracking the number of consecutive types, and if the count exceeds my n, break out and restart the while loop; if it completes without breaking out, set a value to end the loop. But this got very messy and didn't work.
Any ideas?
See if this works.
import pandas as pd
import numpy as np

# Build a small frame of "bunches": each time value marks one bunch.
n = [['a', 1], ['a', 1], ['a', 1],
     ['b', 2], ['b', 2], ['b', 2],
     ['a', 3], ['a', 3], ['a', 3]]
df = pd.DataFrame(n, columns=['type', 'time'])
print(df)

# Shuffle the distinct time values, i.e. the order of the bunches.
order = np.unique(df['time'].to_numpy())
print("Before Shuffling", order)
np.random.shuffle(order)
print("Shuffled", order)

# Print the first n rows of each bunch in the shuffled order.
n = 2
for i in order:
    print(df[df['time'] == i].iloc[0:n])
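The snippet above shuffles the bunch order but does not enforce the "no more than n of one type in a row" rule. Below is a minimal rejection-sampling sketch of that part (my own addition, not from the answer), assuming the number of bunches is small enough that re-shuffling until the constraint holds is cheap:

import itertools
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question: seven bunches of three rows.
df = pd.DataFrame({
    'type': list('aaabbbaaabbbaaabbbaaa'),
    'time': np.repeat([1, 2, 3, 4, 5, 6, 7], 3),
})

max_run = 2  # a bunch type may not repeat more than this many times in a row
bunches = df.drop_duplicates('time')[['time', 'type']].to_numpy()

# Keep reshuffling the bunch order until no type runs longer than max_run.
while True:
    np.random.shuffle(bunches)
    types = [t for _, t in bunches]
    longest = max(len(list(run)) for _, run in itertools.groupby(types))
    if longest <= max_run:
        break

# Rebuild the frame with each bunch's rows in the accepted order.
shuffled = pd.concat([df[df['time'] == t] for t, _ in bunches],
                     ignore_index=True)
print(shuffled)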
Using Python 3, I am trying, for each unique value in the column 'Name', to get the last 5 records from the column 'Number'. How exactly can this be done in Python?
My df looks like this:
Name Number
a 5
a 6
b 7
b 8
a 9
a 10
b 11
b 12
a 9
b 8
I saw similar examples (like this one: Get sum of last 5 rows for each unique id) in SQL, but that is time consuming and I would like to learn how to do it in Python.
My expected output df would be like this:
Name 1 2 3 4 5
a 5 6 9 10 9
b 7 8 11 12 8
I think you need something like this:
df_out = df.groupby('Name').tail(5)
df_out.set_index(['Name', df_out.groupby('Name').cumcount() + 1])['Number'].unstack()
Output:
1 2 3 4 5
Name
a 5 6 9 10 9
b 7 8 11 12 8
Looks like you need pivot after a groupby.cumcount():
df1 = df.groupby('Name').tail(5)
final = (df1.assign(k=df1.groupby('Name').cumcount() + 1)
            .pivot(index='Name', columns='k', values='Number')
            .reset_index().rename_axis(None, axis=1))
print(final)
Name 1 2 3 4 5
0 a 5 6 9 10 9
1 b 7 8 11 12 8