Pandas: Concatenate multiple columns and a few additional characters [duplicate] - python-3.x

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that combines Year = 2000 and quarter = q2 into 2000q2.
Can anyone help with that?

If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns is not string-typed, you should convert it (them) first:
df["period"] = df["Year"].astype(str) + df["quarter"]
Beware of NaNs when doing this!
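If either column contains NaN, the plain + propagates it into the result. A minimal guard (a sketch on toy data), filling missing pieces with an empty string before joining:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": ["2014", "2015", np.nan], "quarter": ["q1", None, "q3"]})

# any NaN on either side would make the whole concatenated value NaN,
# so fill missing pieces with an empty string first
df["period"] = df["Year"].fillna("") + df["quarter"].fillna("")
print(df["period"].tolist())  # ['2014q1', '2015', 'q3']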
If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
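For instance, on a toy frame with three string columns (the region column is made up for illustration):

import pandas as pd

df = pd.DataFrame({"Year": ["2014", "2015"], "quarter": ["q1", "q2"], "region": ["EU", "US"]})

# agg applies '-'.join to each row of the selected (all-string) columns
df["period"] = df[["Year", "quarter", "region"]].agg("-".join, axis=1)
print(df["period"].tolist())  # ['2014-q1-EU', '2015-q2-US']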

Small datasets (< 150 rows):
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
or slightly slower but more compact:
df.Year.str.cat(df.quarter)
Larger datasets (> 150 rows):
df['Year'].astype(str) + df['quarter']
UPDATE: timing graph for Pandas 0.23.4 (image not reproduced here).
Let's test it on a 200K-row DataFrame:
In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2
In [251]: df = pd.concat([df] * 10**5)
In [252]: df.shape
Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop
In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop
In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop
In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop
In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop
In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop
In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop
In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop
In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
Answer contributed by @anton-vbr.

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
Yields this dataframe:
   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2
This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
You can find more information about the apply() method in the pandas documentation.

The method cat() of the .str accessor works really well for this:
>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"],
... ["2015", "q3"]],
... columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3
cat() even allows you to add a separator. So, for example, suppose you only have integers for year and quarter; you can do this:
>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
... [2015, 3]],
... columns=('Year', 'Quarter'))
>>> print(df)
   Year  Quarter
0  2014        1
1  2015        3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year  Quarter  Period
0  2014        1  2014q1
1  2015        3  2015q3
Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):
>>> df = pd.DataFrame(
... [['USA', 'Nevada', 'Las Vegas'],
... ['Brazil', 'Pernambuco', 'Recife']],
... columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife
Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
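A quick sketch of the na_rep behaviour on toy data (the replacement string "?" is arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "Brazil"], "State": ["Nevada", np.nan]})

# without na_rep the second row would be NaN; with it, the missing piece
# is replaced by the given string before joining
print(df["Country"].str.cat(df["State"], sep=" - ", na_rep="?").tolist())
# ['USA - Nevada', 'Brazil - ?']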

This time, use of a lambda function with str.format():
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
print(df)
  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2
This allows you to work with non-strings and reformat values as needed.
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x: '{}q{}'.format(x[0], x[1]), axis=1)
print(df)
Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2

Generalising to multiple columns, why not:
columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

You can use lambda:
combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)
Then use it to create the new column:
df['period'] = df.apply(combine_lambda, axis = 1)
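Put together on a toy frame, so the snippet is self-contained:

import pandas as pd

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)
df['period'] = df.apply(combine_lambda, axis=1)
print(df['period'].tolist())  # ['2014q1', '2015q2']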

Let us suppose your dataframe is df with columns Year and Quarter.
import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})
Suppose we want to see the dataframe:
df
>>>
  Quarter  Year
0      q1  2000
1      q2  2000
2      q3  2000
3      q4  2000
Finally, concatenate the Year and the Quarter as follows:
df['Period'] = df['Year'] + ' ' + df['Quarter']
You can now print df to see the resulting dataframe:
df
>>>
  Quarter  Year   Period
0      q1  2000  2000 q1
1      q2  2000  2000 q2
2      q3  2000  2000 q3
3      q4  2000  2000 q4
If you do not want the space between the year and quarter, simply remove it:
df['Period'] = df['Year'] + df['Quarter']

Although the @silvado answer is good, if you change df['Year'].map(str) to df['Year'].astype(str) it will be faster:
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop
In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

Here is an implementation that I find very versatile:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'],
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])
In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
   ...:                   [df[col] for col in cols])
   ...:
In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')
In [5]: df
Out[5]:
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

A more efficient approach is:
def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
and here is a time test:
import numpy as np
import pandas as pd
from time import time

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)

def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
        df[5] + df[6] + df[7] + df[8] + df[9]

def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)

def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)
    time1 = time()
    df_en = concat_df_str4(df)  # swap in concat_df_str1..3 to time the other variants
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))

if __name__ == '__main__':
    main()
Finally, note that when sum (concat_df_str2) is used without first casting the columns to str, the result is not a concatenation at all: the values are added as integers.
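A small sketch of that pitfall on a toy integer frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.sum(axis=1).tolist())               # [4, 6]        -- numeric addition
print(df.astype(str).sum(axis=1).tolist())   # ['13', '24']  -- string concatenation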

Using zip could be even quicker:
df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
Graph:
import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
    "df['Year'].astype(str) + df['quarter']":
        lambda: df['Year'].astype(str) + df['quarter'],
    "df['Year'].map(str) + df['quarter']":
        lambda: df['Year'].map(str) + df['quarter'],
    "df.Year.str.cat(df.quarter)":
        lambda: df.Year.str.cat(df.quarter),
    "df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
        lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].astype(str).sum(axis=1)":
        lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)":
        lambda: df[['Year','quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1),
    "[''.join(i) for i in zip(df['Year'].map(str), df['quarter'])]":
        lambda: [''.join(i) for i in zip(df['Year'].map(str), df['quarter'])],
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t / (iters / 10)  # iters was advanced once past the measured run
        if t > 2: cont = False
    df = pd.concat([df] * step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values.
This works not only for strings but for other column dtypes as well, provided the values are mapped to str inside the join (e.g. ''.join(map(str, x))).
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)
Result:
   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2

Here is my summary of the above solutions for concatenating / combining two columns with int and str values into a new column, using a separator between the values. All three solutions work for this purpose.
# be cautious with the separator: some strings can trip you up when pasted into a
# shell (e.g. unbalanced quoting raises "SyntaxError: EOL while scanning string literal")
separator = "&&"
# pd.Series.str.cat() does not work here: .str cannot be used on a column of int
# values (and plain .cat() raises "AttributeError: Can only use .cat accessor with
# a 'category' dtype"), so cast with map(str) or astype(str) first
df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x: '{} && {}'.format(x[0], x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)
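A minimal, self-contained check of the first approach, assuming a toy frame with an integer Year and a string quarter:

import pandas as pd

df = pd.DataFrame({'Year': [2014, 2015], 'quarter': ['q1', 'q2']})
separator = '&&'

# map(str) converts the int Year values so the string + operator applies
df['period'] = df['Year'].map(str) + separator + df['quarter']
print(df['period'].tolist())  # ['2014&&q1', '2015&&q2']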

My take:
listofcols = ['col1', 'col2', 'col3']
df['combined_cols'] = ''
for column in listofcols:
    df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
# note: this leaves a leading space; strip it with df['combined_cols'].str.lstrip() if unwanted

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.
%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
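To assign the result back to a column, the same expression works on the right-hand side; a self-contained sketch on toy data (the speedup comes from doing the str cast on the raw NumPy array rather than through the Series machinery):

import pandas as pd

df = pd.DataFrame({'Year': [2014, 2015], 'quarter': ['q1', 'q2']})

# .values drops the pandas wrapper, so astype(str) runs at the NumPy level;
# ndarray + Series still combines elementwise and returns a Series
df['period'] = df['Year'].values.astype(str) + df['quarter']
print(df['period'].tolist())  # ['2014q1', '2015q2']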

One can use the assign method of DataFrame:
df = (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
        .assign(period=lambda x: x.Year + x.quarter))

Similar to the @geher answer, but with any separator you like:
SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")
df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)
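Since INPUT_COLUMNS is not defined above, here is a minimal sketch of the same trick with concrete (made-up) column names:

import pandas as pd

df = pd.DataFrame({"Year": ["2014", "2015"], "quarter": ["q1", "q2"]})

SEP = " "
INPUT_COLUMNS = ["Year", "quarter"]
# interleave a literal 'sep' column name between the input columns:
# ['Year', 'sep', 'quarter']
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")

# assign adds a constant 'sep' column, and sum(axis=1) string-concatenates each row
result = df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)
print(result.tolist())  # ['2014 q1', '2015 q2']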

from functools import reduce
import numpy as np

def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)
For example:
data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
df
   Year quarter  period
0  2000      q1  2000q1
1  2000      q2  2000q2
2  2000      q3  2000q3
3  2000      q4  2000q4

Use .combine_first (note that this fills missing values in Year from Quarter; it does not concatenate the two columns):
df['Period'] = df['Year'].combine_first(df['Quarter'])
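A quick sketch of what combine_first actually does (toy data), for contrast with the concatenation approaches above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": ["2014", np.nan], "Quarter": ["q1", "q2"]})

# combine_first fills NaN in Year with the value from Quarter at the same index
print(df["Year"].combine_first(df["Quarter"]).tolist())  # ['2014', 'q2']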

When combining string columns with the addition operator +, if any value is NaN the entire output is NaN, so use fillna():
df["join"] = "some" + df["col"].fillna(df["val_if_nan"])

Related

Loop through Pandas dataframe to set values based on 2 lists of values

I have the following code that takes over an hour to run.
I have been tasked with making it run faster.
This is a sample of the Pandas dataframe. It is 750,000 rows.
   YEAR  MO  DAY  HR   TEMP
0  1948   1   12   6  21.02
1  1948   1   12   7  39.02
This is the existing code:
mintempf_list = [-25.6, -29.6, -16.8, 8.2, 24.3, 37.4, 42.8, 40.3, 26.2, 14.0, -12.8, -20.7]
maxtempf_list = [71.6, 80.6, 91.4, 97.9, 102.2, 107.8, 111.7, 106.9, 105.8, 95.7, 86.0, 75.2]
for row in range(derive_sfc_df.shape[0]):
    mo = derive_sfc_df.at[row, 'MO']
    temp = derive_sfc_df.at[row, 'TEMP']
    for mm in range(1, 13):
        if (mo == mm and (temp < mintempf_list[mm - 1] or temp > maxtempf_list[mm - 1] or np.isnan(temp))):
            derive_sfc_df.at[row, 'TEMP'] = np.nan
I have tried using numpy.vectorize but I get errors with the indexing of the lists.
Is there any way of going through the 750,000 rows of the dataframe any faster?
I don't know how to use numpy.where or Series.isin with a pandas dataframe.
Any help would be greatly appreciated.
You can turn your array data into another dataframe, merge, update column on condition and drop extra columns:
temps_df = pd.DataFrame({'MO': range(1, len(mintempf_list) + 1),
                         'min': mintempf_list, 'max': maxtempf_list})
df = df.merge(temps_df, on='MO', how='left')
df.loc[~df['TEMP'].between(df['min'], df['max']), 'TEMP'] = np.nan
df = df.drop(['min', 'max'], axis=1)
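A compact end-to-end check of this approach on the sample rows (a sketch; df stands in for derive_sfc_df):

import numpy as np
import pandas as pd

df = pd.DataFrame({'YEAR': [1948, 1948], 'MO': [1, 1], 'DAY': [12, 12],
                   'HR': [6, 7], 'TEMP': [21.02, 39.02]})

mintempf_list = [-25.6, -29.6, -16.8, 8.2, 24.3, 37.4, 42.8, 40.3, 26.2, 14.0, -12.8, -20.7]
maxtempf_list = [71.6, 80.6, 91.4, 97.9, 102.2, 107.8, 111.7, 106.9, 105.8, 95.7, 86.0, 75.2]

temps_df = pd.DataFrame({'MO': range(1, 13), 'min': mintempf_list, 'max': maxtempf_list})
df = df.merge(temps_df, on='MO', how='left')
# between() is False for NaN too, so out-of-range and missing temps both end up NaN
df.loc[~df['TEMP'].between(df['min'], df['max']), 'TEMP'] = np.nan
df = df.drop(['min', 'max'], axis=1)
print(df)  # TEMP is unchanged here: both sample values are within January's bounds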

How do I perform inter-row operations within a pandas.dataframe

How do I write the nested for loop to access every other row with respect to a given row within a pandas.DataFrame?
I am trying to perform some operations between rows in a pandas.DataFrame.
The operation for my example code is calculating Euclidean distances between each row and each other row.
The results are then saved into a list in the form [(row_reference, name, dist)].
I understand how to access each row in a pandas.DataFrame using df.iterrows(), but I'm not sure how to access every other row with respect to the current row in order to perform the inter-row operation.
import pandas as pd
import numpy
import math

df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8}, {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])

# Euclidean distance function where x1=c1_row1, x2=c1_row2, y1=c2_row1, y2=c2_row2
def edist(x1, x2, y1, y2):
    dist = math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
    return dist

# Calculate Euclidean distance for one row (e.g. Bill) against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
    results = []
    # secondary loop to look for OTHER rows with respect to the current row
    # results.append([row2['name'], edist(...)])
    all_results.append([index, results])
I hope to perform some operation edist() on all rows with respect to the current row/index.
I expect the loop to do the following:
In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all
With the following expected resulting output:
OUT[1]:
[[0, [['James', 4.123105625617661], ['John', 20.12461179749811]]],
[1, [['John', 16.1245154965971], ['Bill', 4.123105625617661]]],
[2, [['Bill', 20.12461179749811], ['James', 16.1245154965971]]]]
If your data is not too large, you can use scipy's distance_matrix:
from scipy.spatial import distance_matrix

all_results = pd.DataFrame(distance_matrix(df[['c1','c2']], df[['c1','c2']]),
                           index=df['name'],
                           columns=df['name'])
Output:
name        Bill      James       John
name
Bill    0.000000   4.123106  20.124612
James   4.123106   0.000000  16.124515
John   20.124612  16.124515   0.000000
Consider shift to avoid any row-wise looping. Because the calculation is straightforward arithmetic, run the expression directly on the columns, with NumPy's help, for a vectorized calculation:
import numpy as np

df = (df.assign(c1_shift=lambda x: x['c1'].shift(1),
                c2_shift=lambda x: x['c2'].shift(1)))
df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) +
                     np.power(df['c2'] - df['c2_shift'], 2))
print(df)
#     name  c1  c2  c1_shift  c2_shift       dist
# 0   Bill   3   8       NaN       NaN        NaN
# 1  James   4  12       3.0       8.0   4.123106
# 2   John  12  26       4.0      12.0  16.124515
Should you want every row combination with each other, consider a cross join on itself and query out the reverse duplicates:
df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
        .query("name_x < name_y")
        .drop(columns=['key']))
df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
                     np.power(df['c2_x'] - df['c2_y'], 2))
print(df)
#   name_x  c1_x  c2_x name_y  c1_y  c2_y       dist
# 1   Bill     3     8  James     4    12   4.123106
# 2   Bill     3     8   John    12    26  20.124612
# 5  James     4    12   John    12    26  16.124515

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
        Date  measurement
  2018-01-10            8
  2018-01-11            2
  2018-01-12            7
  2018-01-13            3
  2018-01-14            1
  2018-01-15            1
  2018-01-16            6
  2018-01-17            9
  2018-01-18            8
  2018-01-19            4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
        Date  event_id
  2018-01-11   event_a
  2018-01-14   event_b
  2018-01-16   event_c
  2018-01-19   event_d
I want to give the dates in df1 an event_id from df2, but only if the date falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
        Date event_id  measurement
  2018-01-10  event_a            4
  2018-01-11  event_a            2
  2018-01-12  event_b            1
  2018-01-13  event_b            5
  2018-01-14  event_b            5
  2018-01-15  event_c            4
  2018-01-16  event_c            6
  2018-01-17  event_d            6
  2018-01-18  event_d            9
  2018-01-19  event_d            6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows), and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? Meaning can't df1 and df2 just be lists of tuples or list of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records -> e.g. d1 = df1.to_records() what you get is an array of tuples, basically with the same structure as the dataframe.
Now run your algorithm but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2
Use a third list of tuples d3 where you store the newly created data (each tuple is a row)
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This will speed up your code significantly, I'd assume by more than 100-1000%. It does increase memory usage though, so if you are low on memory try to avoid the pandas dataframes altogether, or dereference the unused frames df1 and df2 once you have used them to create the records (and if you run into problems, call gc manually).
EDIT: Here is a version of your code using the procedure above:
# assumes d1 = df1.to_records(index=False) and d2 = df2.to_records(index=False)
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
You can try the df.apply() method to achieve this. Refer to pandas.DataFrame.apply. I think my code will run faster than yours.
My approach:
Merge the two dataframes df1 and df2 to create a new one, df3:
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse:
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() method to apply to each row in df3:
new_event_id = np.nan
def set_event_date(df3):
    global new_event_id
    if df3.event_id is not np.nan:
        new_event_id = df3.event_id
    return new_event_id
Apply set_event_date() to each row in df3:
df3['new_event_id'] = df3.apply(set_event_date,axis=1)
The final output will be:
         Date  measurement new_event_id
0  2018-01-11            2      event_a
1  2018-01-12            1      event_a
2  2018-01-13            3      event_a
3  2018-01-14            6      event_b
4  2018-01-15            3      event_b
5  2018-01-16            5      event_c
6  2018-01-17            7      event_c
7  2018-01-18            9      event_c
8  2018-01-19            7      event_d
9  2018-01-20            4      event_d
Let me know once you have tried my solution and whether it runs faster than yours. Thanks.

Python Dataframe Single Row with Label

import pandas as pd
data = ["X", "Y", "Z", "A", "B"]
label = ['a','b','c','d','e']
df = pd.DataFrame(data, columns=label)
print(df)
I want to get the dataframe to be:
a b c d e
X Y Z A B
I am getting
ValueError: Shape of passed values is (1, 5), indices imply (5, 5)
How to fix this to get the desired dataframe ?
Pass it as a list of lists:
In [439]: pd.DataFrame([data], columns=label)
Out[439]:
   a  b  c  d  e
0  X  Y  Z  A  B
For large data you can use a slightly more complicated but very fast solution: convert the list to a numpy array and then reshape:
import numpy as np

df = pd.DataFrame(np.array(data).reshape(-1, len(data)), columns=label)
print(df)
   a  b  c  d  e
0  X  Y  Z  A  B
Timings:
N = 100
data = ["X", "Y", "Z", "A", "B"] * N
label = ['a','b','c','d','e'] * N
In [30]: %timeit pd.DataFrame([data], columns=label)
10 loops, best of 3: 178 ms per loop
In [31]: %timeit pd.DataFrame(np.array(data).reshape(-1, len(data)), columns=label)
1000 loops, best of 3: 1.06 ms per loop
N = 1000
In [35]: %timeit pd.DataFrame([data], columns=label)
1 loop, best of 3: 1.7 s per loop
In [36]: %timeit pd.DataFrame(np.array(data).reshape(-1, len(data)), columns=label)
100 loops, best of 3: 3.83 ms per loop

Pandas Set Top Row as MultiIndex Level 1

Given the following data frame:
d2 = pd.DataFrame({'Item': ['items', 'y', 'z', 'x'],
                   'other': ['others', 'bb', 'cc', 'dd']})
d2
    Item   other
0  items  others
1      y      bb
2      z      cc
3      x      dd
I'd like to create a multiindexed set of headers such that the current headers become level 0 and the current top row becomes level 1.
Thanks in advance!
Another solution is to create the header with MultiIndex.from_tuples:
cols = list(zip(d2.columns, d2.iloc[0, :]))
c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
print(pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:]))
    Item   other
0  items  others
1      y      bb
2      z      cc
3      x      dd
Or if the column level names are not important:
cols = list(zip(d2.columns, d2.iloc[0, :]))
d2.columns = pd.MultiIndex.from_tuples(cols)
print(d2[1:])
   Item  other
  items others
1     y     bb
2     z     cc
3     x     dd
Timings:
len(df)=400k:
In [63]: %timeit jez(d22)
100 loops, best of 3: 6.22 ms per loop
In [64]: %timeit piR(d2)
10 loops, best of 3: 84.9 ms per loop
len(df)=40:
In [70]: %timeit jez(d22)
The slowest run took 4.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 941 µs per loop
In [71]: %timeit piR(d2)
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.36 ms per loop
Code:
import pandas as pd

d2 = pd.DataFrame({'Item': ['items', 'y', 'z', 'x'],
                   'other': ['others', 'bb', 'cc', 'dd']})
print(d2)
d2 = pd.concat([d2] * 100000).reset_index(drop=True)
#d2 = pd.concat([d2] * 10).reset_index(drop=True)
d22 = d2.copy()

def piR(d2):
    return d2.T.set_index(0, append=True).T

def jez(d2):
    cols = list(zip(d2.columns, d2.iloc[0, :]))
    c1 = pd.MultiIndex.from_tuples(cols, names=[None, 0])
    return pd.DataFrame(data=d2[1:].values, columns=c1, index=d2.index[1:])

print(piR(d2))
print(jez(d22))
print((piR(d2) == jez(d22)).all())
Item   items     True
other  others    True
dtype: bool
Transpose the DataFrame, call set_index on the first column with parameter append=True, then transpose back:
d2.T.set_index(0, append=True).T
