I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter= q2 into 2000q2.
Can anyone help with that?
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns are not string typed, you should convert it (them) first,
df["period"] = df["Year"].astype(str) + df["quarter"]
Beware of NaNs when doing this!
If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
Small data-sets (< 150rows)
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
or slightly slower but more compact:
df.Year.str.cat(df.quarter)
Larger data sets (> 150rows)
df['Year'].astype(str) + df['quarter']
UPDATE: Timing graph Pandas 0.23.4
Let's test it on 200K rows DF:
In [250]: df
Out[250]:
Year quarter
0 2014 q1
1 2015 q2
In [251]: df = pd.concat([df] * 10**5)
In [252]: df.shape
Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop
In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop
In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop
In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop
In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop
In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop
In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop
In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop
In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
Answer contribution by #anton-vbr
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
Yields this dataframe
Year quarter period
0 2014 q1 2014q1
1 2015 q2 2015q2
This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
You can check more information about apply() method here
The method cat() of the .str accessor works really well for this:
>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"],
... ["2015", "q3"]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 q1
1 2015 q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
Year Quarter Period
0 2014 q1 2014q1
1 2015 q3 2015q3
cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:
>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
... [2015, 3]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 1
1 2015 3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
Year Quarter Period
0 2014 1 2014q1
1 2015 3 2015q3
Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):
>>> df = pd.DataFrame(
... [['USA', 'Nevada', 'Las Vegas'],
... ['Brazil', 'Pernambuco', 'Recife']],
... columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
Country State City AllTogether
0 USA Nevada Las Vegas USA - Nevada - Las Vegas
1 Brazil Pernambuco Recife Brazil - Pernambuco - Recife
Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
Use of a lamba function this time with string.format().
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print df
Quarter Year
0 q1 2014
1 q2 2015
Quarter Year YearQuarter
0 q1 2014 2014q1
1 q2 2015 2015q2
This allows you to work with non-strings and reformat values as needed.
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print df.dtypes
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print df
Quarter int64
Year object
dtype: object
Quarter Year
0 1 2014
1 2 2015
Quarter Year YearQuarter
0 1 2014 2014q1
1 2 2015 2015q2
generalising to multiple columns, why not:
columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)
You can use lambda:
combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)
And then use it with creating the new column:
df['period'] = df.apply(combine_lambda, axis = 1)
Let us suppose your dataframe is df with columns Year and Quarter.
import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})
Suppose we want to see the dataframe;
df
>>> Quarter Year
0 q1 2000
1 q2 2000
2 q3 2000
3 q4 2000
Finally, concatenate the Year and the Quarter as follows.
df['Period'] = df['Year'] + ' ' + df['Quarter']
You can now print df to see the resulting dataframe.
df
>>> Quarter Year Period
0 q1 2000 2000 q1
1 q2 2000 2000 q2
2 q3 2000 2000 q3
3 q4 2000 2000 q4
If you do not want the space between the year and quarter, simply remove it by doing;
df['Period'] = df['Year'] + df['Quarter']
Although the #silvado answer is good if you change df.map(str) to df.astype(str) it will be faster:
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop
In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop
Here is an implementation that I find very versatile:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
...: [1, 'fox', 'jumps', 'over'],
...: [2, 'the', 'lazy', 'dog']],
...: columns=['c0', 'c1', 'c2', 'c3'])
In [3]: def str_join(df, sep, *cols):
...: from functools import reduce
...: return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
...: [df[col] for col in cols])
...:
In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')
In [5]: df
Out[5]:
c0 c1 c2 c3 cat
0 0 the quick brown 0-the-quick-brown
1 1 fox jumps over 1-fox-jumps-over
2 2 the lazy dog 2-the-lazy-dog
more efficient is
def concat_df_str1(df):
""" run time: 1.3416s """
return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
and here is a time test:
import numpy as np
import pandas as pd
from time import time
def concat_df_str1(df):
""" run time: 1.3416s """
return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
def concat_df_str2(df):
""" run time: 5.2758s """
return df.astype(str).sum(axis=1)
def concat_df_str3(df):
""" run time: 5.0076s """
df = df.astype(str)
return df[0] + df[1] + df[2] + df[3] + df[4] + \
df[5] + df[6] + df[7] + df[8] + df[9]
def concat_df_str4(df):
""" run time: 7.8624s """
return df.astype(str).apply(lambda x: ''.join(x), axis=1)
def main():
df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
df = df.astype(int)
time1 = time()
df_en = concat_df_str4(df)
print('run time: %.4fs' % (time() - time1))
print(df_en.head(10))
if __name__ == '__main__':
main()
final, when sum(concat_df_str2) is used, the result is not simply concat, it will trans to integer.
Using zip could be even quicker:
df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
Graph:
import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
myfuncs = {
"df['Year'].astype(str) + df['quarter']":
lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
lambda: df[['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
"[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}
d = defaultdict(dict)
step = 10
cont = True
while cont:
lendf = len(df); print(lendf)
for k,v in myfuncs.items():
iters = 1
t = 0
while t < 0.2:
ts = timeit.repeat(v, number=iters, repeat=3)
t = min(ts)
iters *= 10
d[k][lendf] = t/iters
if t > 2: cont = False
df = pd.concat([df]*step)
pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()
This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values.
This works not only for strings but for all kind of column-dtypes
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)
Result:
Year quarter list period
0 2014 q1 [2014, q1] 2014q1
1 2015 q2 [2015, q2] 2015q2
Here is my summary of the above solutions to concatenate / combine two columns with int and str value into a new column, using a separator between the values of columns. Three solutions work for this purpose.
# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError
separator = "&&"
# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"
df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)
my take....
listofcols = ['col1','col2','col3']
df['combined_cols'] = ''
for column in listofcols:
df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
'''
As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.
%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can use assign method of DataFrame:
df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
assign(period=lambda x: x.Year+x.quarter ))
Similar to #geher answer but with any separator you like:
SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")
df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)
def madd(x):
"""Performs element-wise string concatenation with multiple input arrays.
Args:
x: iterable of np.array.
Returns: np.array.
"""
for i, arr in enumerate(x):
if type(arr.item(0)) is not str:
x[i] = x[i].astype(str)
return reduce(np.core.defchararray.add, x)
For example:
data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
df
Year quarter period
0 2000 q1 2000q1
1 2000 q2 2000q2
2 2000 q3 2000q3
3 2000 q4 2000q4
Use .combine_first.
df['Period'] = df['Year'].combine_first(df['Quarter'])
When combining columns with strings by concatenating them using the addition operator + if any is NaN then entire output will be NaN so use fillna()
df["join"] = "some" + df["col"].fillna(df["val_if_nan"])
I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. So, I want to check the balance on the 11th of each month but for some months, there is no activity for the specific day. As a result, I trying to get the latest day before the 12th of each Month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
You can do groupby with factorize
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x :x.factorize()[0])==n
df_sub = df[m].copy()
You can try filtering the dataframe where the days are less than 12 , then take last of each group(grouped by month) :
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way using expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0
I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product needs to be delivered.
'First_yr_Del' tells you how many will be delivered in the first year. After this the delivery rate reverts to 'Del_rate' - a flat rate that can be applied each year until all products are delivered.
The 'Year to Use' column tells you the first year column to begin delivery from.
EXAMPLE: Ref1 has 100 to deliver. It will start delivering in 2020 and will deliver 5 in the first year, and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought i might use something like the below to reference which columns to use in turn, but i'm not even sure if that's helpful or not as it will depend on the solution (in the proper version, base_date.year is defined as the first column in the table - 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to python and aren't sure if i'm getting ahead of myself a bit...
The way i would think to go about it would be to use a for loop, or something using iterrows, but other posts seem to say this is a bad idea and i should be using vectorisation, cython or lambdas. Of those 3 i've only managed a very simple lambda so far. The others are a bit of a mystery to me since the solution seems to suggest doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. Still does looping in the script itself (not "externalized" to some numpy / pandas function). Should be fine for 5k rows I'd guesstimate.
import pandas as pd
import numpy as np
# def gen_df1():
# create the inital df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
'First_Year_Del': [5, 2, 7, 9, 10],
'Del_rate': [10, 5, 16, 18, 30]})
# get number of rates + remainder
n, r = np.divmod((df['Total']-df['First_Year_Del']), df['Del_rate'])
# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(np.bool) + df['Yr_to_Use'])
# get the offsets for the start of delivery, year zero is 2019
offset = df['Yr_to_Use'] - 2019
# subtracting the year zero lets you use this as an index...
# get a year index; this determines the the columns that will be created
yrs = np.arange(2019, max_year+1)
# prepare a n*m array to hold the rates for all years, initalize with all zero
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))
# n: number of rows of the df, m: number of years where rates will have to be payed
# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
# concatenate: year of the first rate, all yearly rates, a final rate if there was a remainder
if r[i]: # if rest is not zero, append it as well
rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]], [r[i]]])
else: # rest is zero, skip it
rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]]])
# insert the rates at the apropriate location of the output array:
out[i, offset[i]:offset[i]+rates.shape[0]] = rates
# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
You can accomplish this using two user-defined function and apply method
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'id': ['ref1','ref2','ref3','ref4','ref5'],
'Total': [100, 20, 30, 40, 10],
'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
'First_Year_Del': [5,2,7,9,4],
'Del_rate':[10,5,16,18,30]})
def f(r):
'''
Computes values per year and respective year
'''
n = (r['Total'] - r['First_Year_Del'])//r['Del_rate']
leftover = (r['Total'] - r['First_Year_Del'])%r['Del_rate']
r['values'] = [r['First_Year_Del']] + [r['Del_rate'] for _ in range(n)] + [leftover]
r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
return r
df = df.apply(f, axis=1)
def get_year_range(r):
'''
Computes min and max year for each row
'''
r['y_min'] = min(r['years'])
r['y_max'] = max(r['years'])
return r
df = df.apply(get_year_range, axis=1)
y_min = df['y_min'].min()
y_max = df['y_max'].max()
#Initialize each year value to zero
for year in range(y_min, y_max+1):
df[year] = 0
def expand(r):
'''
Update value for each year
'''
for v, y in zip(r['values'], r['years']):
r[y] = v
return r
# Apply and drop temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)