Pandas Create DataFrame with ColumnNames from a list - python-3.x

Considering the following list made up of sub-lists as elements, I need to create a pandas dataframe:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
The desired output is as follows, with the first element of each sub-list becoming a column name in the dataframe.
tom nick juli
0 10 15 14
Is there a way by which this output can be achieved?

Use a dictionary comprehension and pass it to the DataFrame constructor:
print ({x[0]: x[1:] for x in data})
{'tom': [10], 'nick': [15], 'juli': [14]}
df = pd.DataFrame({x[0]: x[1:] for x in data})
print (df)
tom nick juli
0 10 15 14
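Because the comprehension slices with x[1:], each sub-list may carry more than one value per column; a quick sketch with an extra value per name:
data = [['tom', 10, 11], ['nick', 15, 16], ['juli', 14, 15]]
print(pd.DataFrame({x[0]: x[1:] for x in data}))
   tom  nick  juli
0   10    15    14
1   11    16    15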

You could also use dict + extended iterable unpacking:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
result = pd.DataFrame(dict((column, values) for column, *values in data))
print(result)
Output
tom nick juli
0 10 15 14

You can also do:
pd.DataFrame(data).set_index(0).T
0 tom nick juli
1 10 15 14
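The transpose keeps the old labels around; a small optional cleanup sketch:
df = pd.DataFrame(data).set_index(0).T
df.columns.name = None          # drop the leftover "0" columns name
df = df.reset_index(drop=True)  # renumber rows from 0 instead of 1
print(df)
   tom  nick  juli
0   10    15    14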

Related

Pandas: Merging rows into one

I have the following table:

Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:

Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some other related (agg, groupby other fields, et cetera)):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:

         c2
Name
Anne    Foo
Tom     Bar
Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()

   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN:
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add the line df.replace('', np.nan, inplace=True) before the code above (this needs import numpy as np).
# Output
   Name  Age Data_1 Data_2
0  Anne   20    NaN    Bar
1   Tom   10   Test    Foo
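For reference, a minimal end-to-end sketch of that approach on the sample data (assuming the empty cells start out as empty strings):
import numpy as np
import pandas as pd

data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

out = (df.replace('', np.nan)          # empty strings -> NaN so stack() drops them
         .set_index(['Name', 'Age'])
         .stack()                      # long format: one row per non-missing cell
         .groupby(level=[0, 1, 2])
         .apply(''.join)               # merge the surviving value(s) per cell
         .unstack()                    # back to one column per Data_* field
         .reset_index())
print(out)
#    Name  Age Data_1 Data_2
# 0  Anne   20    NaN    Bar
# 1   Tom   10   Test    Foo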

Pandas : Concatenate multiple columns and few additional characters [duplicate]

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that combines Year = 2000 and quarter = q2 into 2000q2.
Can anyone help with that?
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns is not string-typed, you should convert it (them) first:
df["period"] = df["Year"].astype(str) + df["quarter"]
Beware of NaNs when doing this!
If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
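Note that '-'.join expects every value to already be a string; a small sketch that casts first (sample data assumed from the question):
import pandas as pd

df = pd.DataFrame({'Year': [2014, 2015], 'quarter': ['q1', 'q2']})
# cast the numeric column(s) to str before joining
df['period'] = df[['Year', 'quarter']].astype(str).agg('-'.join, axis=1)
print(df['period'].tolist())  # ['2014-q1', '2015-q2']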
Small data sets (< 150 rows)
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
or slightly slower but more compact:
df.Year.str.cat(df.quarter)
Larger data sets (> 150 rows)
df['Year'].astype(str) + df['quarter']
UPDATE: Timing graph Pandas 0.23.4
Let's test it on a 200K-row DF:
In [250]: df
Out[250]:
Year quarter
0 2014 q1
1 2015 q2
In [251]: df = pd.concat([df] * 10**5)
In [252]: df.shape
Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop
In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop
In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop
In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop
In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop
In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop
In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop
In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop
In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop
In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
Answer contribution by @anton-vbr
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
Yields this dataframe
Year quarter period
0 2014 q1 2014q1
1 2015 q2 2015q2
This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
You can find more information about the apply() method in the pandas documentation.
The method cat() of the .str accessor works really well for this:
>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"],
... ["2015", "q3"]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 q1
1 2015 q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
Year Quarter Period
0 2014 q1 2014q1
1 2015 q3 2015q3
cat() even allows you to add a separator so, for example, suppose you only have integers for year and quarter, you can do this:
>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
... [2015, 3]],
... columns=('Year', 'Quarter'))
>>> print(df)
Year Quarter
0 2014 1
1 2015 3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
Year Quarter Period
0 2014 1 2014q1
1 2015 3 2015q3
Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):
>>> df = pd.DataFrame(
... [['USA', 'Nevada', 'Las Vegas'],
... ['Brazil', 'Pernambuco', 'Recife']],
... columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
Country State City AllTogether
0 USA Nevada Las Vegas USA - Nevada - Las Vegas
1 Brazil Pernambuco Recife Brazil - Pernambuco - Recife
Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
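A small sketch of na_rep in action (toy series, not from the question):
import pandas as pd

s = pd.Series(['a', None, 'c'])
t = pd.Series(['x', 'y', 'z'])
print(s.str.cat(t, sep='-').tolist())              # ['a-x', nan, 'c-z']
print(s.str.cat(t, sep='-', na_rep='?').tolist())  # ['a-x', '?-y', 'c-z']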
Using a lambda function this time with str.format().
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
print(df)
Quarter Year
0 q1 2014
1 q2 2015
Quarter Year YearQuarter
0 q1 2014 2014q1
1 q2 2015 2015q2
This allows you to work with non-strings and reformat values as needed.
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x: '{}q{}'.format(x[0], x[1]), axis=1)
print(df)
Quarter int64
Year object
dtype: object
Quarter Year
0 1 2014
1 2 2015
Quarter Year YearQuarter
0 1 2014 2014q1
1 2 2015 2015q2
Generalising to multiple columns, why not:
columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)
You can use lambda:
combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)
And then use it with creating the new column:
df['period'] = df.apply(combine_lambda, axis = 1)
Let us suppose your dataframe is df with columns Year and Quarter.
import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})
Suppose we want to see the dataframe:
df
>>> Quarter Year
0 q1 2000
1 q2 2000
2 q3 2000
3 q4 2000
Finally, concatenate the Year and the Quarter as follows.
df['Period'] = df['Year'] + ' ' + df['Quarter']
You can now print df to see the resulting dataframe.
df
>>> Quarter Year Period
0 q1 2000 2000 q1
1 q2 2000 2000 q2
2 q3 2000 2000 q3
3 q4 2000 2000 q4
If you do not want the space between the year and quarter, simply remove it:
df['Period'] = df['Year'] + df['Quarter']
Although the @silvado answer is good, if you change df.map(str) to df.astype(str) it will be faster:
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 µs per loop
In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 µs per loop
Here is an implementation that I find very versatile:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
...: [1, 'fox', 'jumps', 'over'],
...: [2, 'the', 'lazy', 'dog']],
...: columns=['c0', 'c1', 'c2', 'c3'])
In [3]: def str_join(df, sep, *cols):
...: from functools import reduce
...: return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep),
...: [df[col] for col in cols])
...:
In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')
In [5]: df
Out[5]:
c0 c1 c2 c3 cat
0 0 the quick brown 0-the-quick-brown
1 1 fox jumps over 1-fox-jumps-over
2 2 the lazy dog 2-the-lazy-dog
More efficient is:
def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
and here is a time test:
import numpy as np
import pandas as pd
from time import time

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)

def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
        df[5] + df[6] + df[7] + df[8] + df[9]

def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)

def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)
    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))

if __name__ == '__main__':
    main()
Finally, note that when sum (concat_df_str2) is used, the result is not simply the concatenation; it can be converted to an integer.
Using zip could be even quicker:
df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
Graph:
import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
    "df['Year'].astype(str) + df['quarter']":
        lambda: df['Year'].astype(str) + df['quarter'],
    "df['Year'].map(str) + df['quarter']":
        lambda: df['Year'].map(str) + df['quarter'],
    "df.Year.str.cat(df.quarter)":
        lambda: df.Year.str.cat(df.quarter),
    "df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
        lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].astype(str).sum(axis=1)":
        lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
        lambda: df[['Year','quarter']].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1),
    "[''.join(i) for i in zip(df['Year'].map(str), df['quarter'])]":
        lambda: [''.join(i) for i in zip(df["Year"].map(str), df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t / (iters / 10)  # divide by the iteration count actually timed
        if t > 2: cont = False
    df = pd.concat([df] * step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()
This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values.
This works not only for strings but for all kinds of column dtypes, provided the values are cast to str before joining:
import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list'] = df[['Year', 'quarter']].values.tolist()
df['period'] = df['list'].apply(lambda xs: ''.join(map(str, xs)))  # map(str) handles non-string dtypes
print(df)
Result:
Year quarter list period
0 2014 q1 [2014, q1] 2014q1
1 2015 q2 [2015, q2] 2015q2
Here is my summary of the above solutions for concatenating / combining two columns with int and str values into a new column, using a separator between the column values. Three solutions work for this purpose.
# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError
separator = "&&"
# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"
df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)
My take:
listofcols = ['col1', 'col2', 'col3']
df['combined_cols'] = ''
for column in listofcols:
    df['combined_cols'] = df['combined_cols'] + ' ' + df[column]
As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.
%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can use the assign method of DataFrame:
df = (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
        .assign(period=lambda x: x.Year + x.quarter))
Similar to the @geher answer, but with any separator you like:
SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")
df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)
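To make the trick above concrete, here is a hypothetical run with the question's columns (INPUT_COLUMNS and SEP are assumptions):
import pandas as pd

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
INPUT_COLUMNS = ['Year', 'quarter']
SEP = ' '
cols_with_sep = ',sep,'.join(INPUT_COLUMNS).split(',')  # ['Year', 'sep', 'quarter']
print(df.assign(sep=SEP)[cols_with_sep].sum(axis=1).tolist())  # ['2014 q1', '2015 q2']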
from functools import reduce
import numpy as np

def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)  # np.char.add is the modern alias
For example:
data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
df
Year quarter period
0 2000 q1 2000q1
1 2000 q2 2000q2
2 2000 q3 2000q3
3 2000 q4 2000q4
Use .combine_first.
df['Period'] = df['Year'].combine_first(df['Quarter'])
(Note that combine_first fills missing values in Year from Quarter rather than concatenating the two columns.)
When combining string columns by concatenating them with the addition operator +, if any value is NaN the entire output for that row will be NaN, so use fillna():
df["join"] = "some" + df["col"].fillna(df["val_if_nan"])

How do I perform inter-row operations within a pandas.dataframe

How do I write the nested for loop to access every other row with respect to a row within a pandas.dataframe?
I am trying to perform some operations between rows in a pandas.dataframe
The operation for my example code is calculating Euclidean distances between each row with each other row.
The results are then saved into a some list in the form
[(row_reference, name, dist)].
I understand how to access each row in a pandas.dataframe using df.iterrows(), but I'm not sure how to access every other row with respect to the current row in order to perform the inter-row operation.
import pandas as pd
import numpy
import math

df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8}, {'name': "James", 'c1': 4, 'c2': 12},
                   {'name': "John", 'c1': 12, 'c2': 26}])

# Euclidean distance function where x1=c1_row1, x2=c1_row2, y1=c2_row1, y2=c2_row2
def edist(x1, x2, y1, y2):
    dist = math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
    return dist

# Calculate Euclidean distance for one row (e.g. Bill) against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
    results = []
    # secondary loop to look for OTHER rows with respect to the current row
    # results.append(row2['name'], edist())
    all_results.append((row, results))
I hope to perform some operation edist() on all rows with respect to the current row/index.
I expect the loop to do the following:
In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all
With the following expected resulting output:
OUT[1]:
[[0, [['James', 4.123105625617661], ['John', 20.12461179749811]]],
[1, [['John', 16.1245154965971], ['Bill', 4.123105625617661]]],
[2, [['Bill', 20.12461179749811], ['James', 16.1245154965971]]]]
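For reference, a minimal nested-iterrows sketch that fills in the asker's secondary loop (the inner-list ordering may differ slightly from the expected output above):
all_results = []
for i, row in df.iterrows():
    results = []
    for j, other in df.iterrows():
        if i != j:  # skip the current row itself
            results.append([other['name'],
                            edist(row['c1'], other['c1'], row['c2'], other['c2'])])
    all_results.append([i, results])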
If your data is not too long, you can check out SciPy's distance_matrix:
from scipy.spatial import distance_matrix

all_results = pd.DataFrame(distance_matrix(df[['c1','c2']], df[['c1','c2']]),
                           index=df['name'],
                           columns=df['name'])
Output:
name Bill James John
name
Bill 0.000000 4.123106 20.124612
James 4.123106 0.000000 16.124515
John 20.124612 16.124515 0.000000
Consider shift and avoid any row-wise looping. Because the calculation is straightforward arithmetic, run the expression directly on the columns, using NumPy for the vectorized math.
import numpy as np

df = (df.assign(c1_shift=lambda x: x['c1'].shift(1),
                c2_shift=lambda x: x['c2'].shift(1)))

df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) +
                     np.power(df['c2'] - df['c2_shift'], 2))
print(df)
# name c1 c2 c1_shift c2_shift dist
# 0 Bill 3 8 NaN NaN NaN
# 1 James 4 12 3.0 8.0 4.123106
# 2 John 12 26 4.0 12.0 16.124515
Should you want every row combination with each other, consider a cross join of the frame with itself (on pandas 1.2+ you can also write pd.merge(df, df, how="cross")) and query out the reverse duplicates:
df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
        .query("name_x < name_y")
        .drop(columns=['key']))

df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
                     np.power(df['c2_x'] - df['c2_y'], 2))
print(df)
# name_x c1_x c2_x name_y c1_y c2_y dist
# 1 Bill 3 8 James 4 12 4.123106
# 2 Bill 3 8 John 12 26 20.124612
# 5 James 4 12 John 12 26 16.124515

Avoiding explicit for-loop in Python with pandas dataframe

I would like to find a better way to carry out the following process.
#import packages
import pandas as pd
I have defined a pandas dataframe.
# Create dataframe
data = {'name': ['Jason', 'Jason', 'Tina', 'Tina', 'Tina'],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
After the dataframe is created, I want to add an extra column to the dataframe. This column contains the rank based on the values in the coverage column for every name separately.
# Add a column with ranks based on 'coverage' for every name separately.
df_end = pd.DataFrame()
for person_names in df.groupby('name').groups:
    one_name = df.groupby('name').get_group(person_names)
    one_name['coverageRank'] = one_name['coverage'].rank()
    df_end = df_end.append(one_name)
Is it possible to achieve this simple task in a simpler way? Maybe without using the for-loop?
I think you need DataFrameGroupBy.rank:
df['coverageRank'] = df.groupby('name')['coverage'].rank()
print (df)
coverage name reports coverageRank
0 25 Jason 4 1.0
1 94 Jason 24 2.0
2 57 Tina 31 1.0
3 62 Tina 2 2.0
4 70 Tina 3 3.0
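If you need different tie-breaking or ordering, rank() accepts its usual parameters inside the groupby as well; for example:
# highest coverage gets rank 1; ties share the same dense rank
df['coverageRank'] = df.groupby('name')['coverage'].rank(method='dense', ascending=False)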

Pandas: How do I get the key (index) of the first and last row of a dataframe

I have a dataframe (df) with a datetime index and one field "myfield"
I want to find out the first and last datetime of the dataframe.
I can access the first and last dataframe element like this:
df.iloc[0]
df.iloc[-1]
for df.iloc[0] I get the result:
myfield    myfieldcontent
Name: 2017-07-24 00:00:00, dtype: float64
How can I access the datetime of the row?
You can select from the index by [0] or [-1]:
df = pd.DataFrame({'myfield':[1,4,5]}, index=pd.date_range('2015-01-01', periods=3))
print (df)
myfield
2015-01-01 1
2015-01-02 4
2015-01-03 5
print (df.iloc[-1])
myfield 5
Name: 2015-01-03 00:00:00, dtype: int64
print (df.index[0])
2015-01-01 00:00:00
print (df.index[-1])
2015-01-03 00:00:00
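Note that index[0] and index[-1] are positional (first and last rows as stored); if the index might be unsorted, min() and max() return the earliest and latest timestamps instead:
print(df.index.min())  # earliest timestamp in the index
print(df.index.max())  # latest timestamp in the index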
If you are using pandas 1.1.4 or higher, you can use the name attribute of the row Series:
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
print("Index value: ", df.iloc[-1].name)  # a pandas Timestamp
# Convert to python datetime
print("Index datetime: ", df.iloc[-1].name.to_pydatetime())
jezrael's answer is perfect. Just to provide an alternative, if you insist on using loc then you should first reset_index.
import pandas as pd
df = pd.DataFrame({'myfield': [1, 4, 5]}, index=pd.date_range('2015-01-01', periods=3))
df = df.reset_index()
print(df['index'].iloc[0])
print(df['index'].iloc[-1])
