I have two large DataFrames with the same set of columns but different values. I need to combine the values in corresponding columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria here are first readability of the code, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
import numpy as np

# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternative way (strictly pandas) of solving this; the main criterion here is readability of the code.
Another simple way is using add and radd (with '_' as the separator, to match the required output):
df1.astype(str).add(df2.astype(str).radd('_'))
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
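If the combining function becomes more complex than simple string concatenation (as the comment in the question hints), DataFrame.combine keeps everything in pandas; a minimal sketch, assuming the same '_' join:
# combine hands the function one pair of same-named columns (as Series) at a time
required = df1.combine(df2, lambda s1, s2: s1.astype(str) + '_' + s2.astype(str))
Whatever logic goes into the lambda only has to deal with a single column pair, which keeps it readable when more columns appear.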
I have two DataFrames of identical size. For simplicity's sake:
df1 =
start n end
0 20200712 50000 20200812
1 20200714 51000 20200814
2 20200716 51500 20200816
3 20200719 53000 20200819
4 20200721 54000 20200821
5 20200724 55000 20200824
6 20200729 57000 20200824
df2 =
start n end
0 20200712 0 20200812
1 20200714 15 20200814
2 20200716 51500 20200816
3 20200719 53000 20200819
4 20200721 30 20200821
5 20200724 55000 20200824
6 20200729 57000 20200824
I would like to replace all 'n' values in df2 with those from df1 when a condition is met (here, n < 50).
I have something like this in mind, which works:
df2.loc[df2['n']<50,'n'] = df1['n']
to get
start n end
0 20200712 50000 20200812
1 20200714 51000 20200814
2 20200716 51500 20200816
3 20200719 53000 20200819
4 20200721 54000 20200821
5 20200724 55000 20200824
6 20200729 57000 20200824
What is the most efficient or 'proper' way when I have multiple such 'n' columns?
Put the n columns in a list. That can easily be done by slicing the column list:
my_n_columns = list(df2.columns)[1:-1] # slicing adapted to your example
(as the n columns seem to sit between start and end)
Then apply your code to each column through a loop:
for col in my_n_columns:
    df2.loc[df2[col] < 50, col] = df1[col]
The cleaner way:
df2.where(df2 >= 50, df1)
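If only the n columns should be touched (leaving start and end alone), the same idea can be restricted to a column list; a sketch assuming the my_n_columns list from above:
df2[my_n_columns] = df2[my_n_columns].where(df2[my_n_columns] >= 50, df1[my_n_columns])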
I am trying to find a generic way to sort a DataFrame on multiple columns, where each column is sorted by a different arbitrary sort function.
For example, for input I might have
df = pd.DataFrame([[2,"Basic",6],[1,"Intermediate",9],[2,"Intermediate",6],[0,"Advanced",6],[0,"Basic",2],[1, 'Advanced', 6], [0,"Basic",3], ], columns=['Hour','Level','Value'])
Hour Level Value
0 2 Basic 6
1 1 Intermediate 9
2 2 Intermediate 6
3 0 Advanced 6
4 0 Basic 2
5 1 Advanced 6
6 0 Basic 3
and I want my output to be
Hour Level Value
0 0 Advanced 6
1 0 Basic 3
2 0 Basic 2
3 1 Advanced 6
4 1 Intermediate 9
5 2 Intermediate 6
6 2 Basic 6
I might have a function map such as:
lambdaMap = {
    "Hour": lambda x: x,
    "Level": lambda x: [['Advanced', 'Intermediate', 'Basic'].index(l) for l in x],
    "Value": lambda x: -x,
}
I can apply any one of the sorting functions individually:
sortValue="Hour"
df.sort_values(by=sortValue, key=lambdaMap[sortValue])
I could create a loop to apply each sort successively:
for (column, func) in lambdaMap.items():
    df = df.sort_values(by=column, key=func)
But none of these will produce the output I'm looking for. Is this even possible? There are plenty of examples of how to achieve similar things for specific instances, but I'm curious whether there is a way to achieve this generically, for use in building APIs and/or general support libraries.
You can convert the column to categorical and sort:
df['Level'] = pd.Categorical(df['Level'],['Advanced', 'Intermediate', 'Basic'],
ordered=True)
out = df.sort_values(['Hour','Level','Value'],ascending=[True,True,False])
print(out)
Hour Level Value
3 0 Advanced 6
6 0 Basic 3
4 0 Basic 2
5 1 Advanced 6
1 1 Intermediate 9
2 2 Intermediate 6
0 2 Basic 6
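For the fully generic case (an arbitrary key function per column, as in lambdaMap), one possible sketch is to build a frame of transformed sort keys and reindex the original frame by the resulting order; this assumes each function returns values whose ascending order is the order you want:
# Transform each column with its key function, sort on the transformed
# keys, then take the original rows in that order.
keys = df.assign(**{col: func(df[col]) for col, func in lambdaMap.items()})
out = df.loc[keys.sort_values(list(lambdaMap)).index]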
Suppose I have the following DataFrame:
d = {'col1':['a','b','c','a','c','c','c','c','c','c'],
'col2':['a1','b1','c1','a1','c1','c1','c1','c1','c1','c1'],
'col3':[1,2,3,2,3,3,3,3,3,3]}
data = pd.DataFrame(d)
I want to go through categorical columns and replace strings with integers. The usual way of doing this is to do:
col1 = {'a': 1,'b': 2, 'c':3}
data.col1 = [col1[item] for item in data.col1]
Namely, to make a dictionary for each categorical column and do the replacement. But if you have many columns, making a dictionary for each one by hand is time consuming, so I wonder if there is a better way of doing it. Also, how can I do this without a dictionary? In this example col1 has only 3 distinct values, but with many more we would have to write everything out by hand (say {'a': 1, 'b': 2, 'c': 3, ..., 'z': 26}). What is the most efficient way to go through all the categorical columns and replace the strings with numbers without building dictionaries column by column?
First get only the object columns with DataFrame.select_dtypes, then apply factorize to each column via DataFrame.apply:
cols = data.select_dtypes(object).columns
data[cols] = data[cols].apply(lambda x: pd.factorize(x)[0]) + 1
print (data)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
If possible, you could avoid the apply by using a dictionary comprehension in the assign expression (I feel a dictionary is going to be more efficient; I may be wrong):
values = {col: data[col].factorize()[0] + 1
for col in data.select_dtypes(object)}
data.assign(**values)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
3 1 1 2
4 3 3 3
5 3 3 3
6 3 3 3
7 3 3 3
8 3 3 3
9 3 3 3
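If you later need to map the integers back to the original strings, factorize also returns the uniques, so a reverse mapping can be collected while encoding; a small sketch (the mappings name is just illustrative):
mappings = {}
for col in data.select_dtypes(object):
    codes, uniques = pd.factorize(data[col])
    data[col] = codes + 1
    mappings[col] = dict(enumerate(uniques, start=1))  # e.g. {1: 'a', 2: 'b', 3: 'c'}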
When I have a DataFrame object with an unknown number of rows, I want to select 5 rows at a time.
For instance, if df has 11 rows, it will be selected 3 times (5 + 5 + 1), and if it has 4 rows, only one selection is needed.
How can I write this using pandas?
Use groupby with a little arithmetic. This should be clean.
chunks = [g for _, g in df.groupby(df.index // 5)]
Depending on how you want your output structured, you may change g to g.values.tolist() (if you want a list instead).
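One caveat: df.index // 5 assumes a default RangeIndex. With an arbitrary index you can group on row positions instead, a small variation on the same idea:
# Group by row position rather than by index label.
chunks = [g for _, g in df.groupby(np.arange(len(df)) // 5)]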
numpy.split
np.split(df, np.arange(5, len(df), 5))
Demo
df = pd.DataFrame(dict(A=range(11)))
print(*np.split(df, np.arange(5, len(df), 5)), sep='\n\n')
A
0 0
1 1
2 2
3 3
4 4
A
5 5
6 6
7 7
8 8
9 9
A
10 10
Create a loop and then use the index for indexing the DataFrame:
for i in range(0, len(df), 5):
    data = df.iloc[i:i + 5]
I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won Field Injuries Weather Fouls Players
1 2 3 1 2 8
0 3 2 0 1 5
1 4 5 3 2 6
1 3 2 1 4 5
0 2 3 0 1 6
1 4 2 0 2 8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (Field, Injuries, Weather, Fouls, Players) and 50% of the rows in those columns to place in my model.
Here is my current code, which currently selects all of the columns and rows and fits them into my model, but I would like to use a random percentage instead:
import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    train_cols = df.columns[1:]
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)
    z.append([i, exp])
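A minimal sketch of one way the random selection could look, assuming DataFrame.sample is acceptable and 50% as the example fraction (reusing the imports from the snippet above):
z = []
for i in range(10):
    # Take ~50% of the rows, then ~50% of the feature columns (axis=1).
    rows = df.sample(frac=0.5, random_state=i)
    features = rows.drop(columns='Won').sample(frac=0.5, axis=1, random_state=i)
    result = sm.Logit(rows['Won'], features).fit()
    z.append([i, np.exp(result.params)])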