How to export multiple Excel files based on one column - excel

I am trying to export multiple Excel files based on the value of one column. For example:
import pandas as pd
df = pd.DataFrame({'state':['PA','PA','TX','TX'],'county':['Centre','Berks','Austin','Taylor'],'a':[4,3,2,1],'b':[3,4,5,6]})
df
How can I export this dataframe to multiple Excel files based on the value of the "state" column? For example, one Excel file containing only the rows with "state" = "PA" and another with "state" = "TX". Thanks.

Solution for n elements in the state column.
1. Imagine that this is your dataframe:
import pandas as pd
df = pd.DataFrame({'state':['PA','PA','TX','TX','RX'],'county':['Centre','Berks','Austin','Taylor','Mike'],'a':[4,3,2,1,0],'b':[3,4,5,6,7]})
print(df)
  state  county  a  b
0    PA  Centre  4  3
1    PA   Berks  3  4
2    TX  Austin  2  5
3    TX  Taylor  1  6
4    RX    Mike  0  7
2. The idea: Series.unique
df['state'].unique()
array(['PA', 'TX', 'RX'], dtype=object)
As you can see, unique returns the distinct, non-repeated elements present in the series.
3. For loop
You can use a for loop to filter the dataframe on each of the unique state values returned by unique:
for state in df['state'].unique():
    print(df[df['state'].eq(state)])
    print('-'*20)
  state  county  a  b
0    PA  Centre  4  3
1    PA   Berks  3  4
--------------------
  state  county  a  b
2    TX  Austin  2  5
3    TX  Taylor  1  6
--------------------
  state  county  a  b
4    RX    Mike  0  7
4. Send to Excel
for state in df['state'].unique():
    df[df['state'].eq(state)].to_excel(state + '.xlsx')
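A variant of the same step (a sketch, not part of the original answer) iterates over df.groupby('state'), which yields each state value together with its sub-DataFrame, so the boolean filter does not have to be rebuilt on every pass:
# Sketch: groupby yields (state, sub-DataFrame) pairs; each group is written
# to its own workbook. index=False just omits the row index from the file.
for state, sub in df.groupby('state'):
    sub.to_excel(state + '.xlsx', index=False)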
On the use of DataFrame.eq, DataFrame.ne and the operator ~:
My suggestion in your comment to use ~ was because there were only two states.
The following expressions are equivalent:
~df.eq(a)
df.ne(a)
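For example, a quick check of the equivalence on a small Series (a minimal sketch):
import pandas as pd
s = pd.Series(['PA', 'PA', 'TX', 'TX'])
print(~s.eq('PA'))   # negated equality mask: False False True True
print(s.ne('PA'))    # same mask, expressed directly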

Related

Algo to identify slightly different uniquely identifiable common names in 3 DataFrame columns

Sample DataFrame df has 3 columns that identify any given person: name, nick_name, initials. They can have slight differences in the way they are specified, but looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize these 3 columns to a single value for each person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
   ID        name nick_name initials
0   0    Theodore      Tedy       TR
1   1      Thomas       Tom       Tb
2   2    Theodore       Ted      TRo
3   3   Christian     Chris       CS
4   4    Theodore       Ted       TR
5   5  Theodore R       Ted       TR
6   6      Thomas     Tommy       tb
7   7       Tomas       Tom       TB
8   8    Cristian     Chris       CS
In this case the desired output is as follows:
   ID       name nick_name initials
0   0   Theodore       Ted       TR
1   1     Thomas       Tom       TB
2   2   Theodore       Ted       TR
3   3  Christian     Chris       CS
4   4   Theodore       Ted       TR
5   5   Theodore       Ted       TR
6   6     Thomas       Tom       TB
7   7     Thomas       Tom       TB
8   8  Christian     Chris       CS
The common value can be anything as long as it is normalized to the same value. For example, the name can be Theodore or Theodore R - both are fine.
My actual DataFrame is about 4000 rows. Could someone suggest an optimal algorithm to do this?
You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I use a basic dictionary approach to collect similar rows together, then overwrite each cluster with a designated master row. Note that this leaves a CSV with many duplicate rows; I don't know if this is what you want, but if not, it's easy enough to take the duplicates out (see the sketch after the code).
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz

def cluster_rows(df):
    # Collect rows whose 'name' values are at least `threshold` percent similar
    # into the same cluster, keyed by the first name seen for that cluster.
    row_clusters = {}
    threshold = 90
    name_rows = list(df.iterrows())
    for i, nr in name_rows:
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters

def normalize_rows(row_clusters):
    # Overwrite every row in a cluster with the values of its first (master) row.
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                row[key] = master[key]
    return row_clusters

if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
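If the duplicate rows are unwanted, one way to remove them (a minimal sketch, reusing the names from the code above) is a drop_duplicates call before writing:
# Sketch: after normalize_rows every row in a cluster carries identical values,
# so drop_duplicates keeps one row per cluster.
result = pd.DataFrame(chain(*normalized.values())).drop_duplicates()
result.to_csv('norm-names.csv', index=False)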

pandas remove records conditionally based on record count of groups

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
and a mapping dataframe:
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products as per the mapping: product A's data should be matched with product B's data. The logic: product A has 4 records, so product B should also keep 4 records, and those 4 records should be the weeks before and after product A's last week number, including that last week itself. Product A's last week is 4, so product B keeps week 3 (one week before), week 4, and weeks 5 and 6 (two weeks after).
Similarly, product C has 3 records, so product D should also keep 3 records, taken from the weeks before and after product C's last week number. Product C's last week is 3, so product D keeps weeks 2, 3 and 4.
The wanted dataframe will be like below; I want to remove those yellow (highlighted) records.
Define the following function, selecting rows from df for the products in the current row of mapping:
def selRows(row, df):
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    lastWk_1 = rows_1.Week.iat[-1]
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames - the results of calling selRows on:
- each group of rows from df2, for consecutive countries (the outer loop),
- each row from mapping (the inner loop).
Then concat concatenates all of them into a single DataFrame.
Solution: first create a mapped column using the mapping DataFrame, and build dictionaries of the group length and the last (maximal) Week value for groups by Country and Product:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
# subtract 1 to get the previous week value
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
Then map the dictionary values with Series.map and filter out rows with smaller Week values, then filter by the second dictionary of lengths with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, 1).map(dprev) <= df2['Week']]
         .groupby(['Country','mapp'])
         .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
                Country Product  Week  val mapp
Country mapp
UK      A    6       UK       B     3    7    A
             7       UK       B     4    8    A
             8       UK       B     5    9    A
             9       UK       B     6   10    A
        C    16      UK       D     2    6    C
             17      UK       D     3    7    C
             18      UK       D     4    8    C
Then take the original rows not matched by mapping['Product1'], append the new df3 and sort:
df = (df2[~df2['Product'].isin(mapping['Product1'])]
        .append(df3, ignore_index=True)
        .sort_values(['Country','Product'])
        .drop('mapp', axis=1))
print(df)
   Country Product  Week  val
0       UK       A     1    5
1       UK       A     2    4
2       UK       A     3    3
3       UK       A     4    1
7       UK       B     3    7
8       UK       B     4    8
9       UK       B     5    9
10      UK       B     6   10
4       UK       C     1    5
5       UK       C     2    5
6       UK       C     3    5
11      UK       D     2    6
12      UK       D     3    7
13      UK       D     4    8
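Note that DataFrame.append was removed in pandas 2.0, so on recent pandas versions the last step can be written with pd.concat instead (a sketch under that assumption, otherwise identical to the code above):
# Sketch for pandas >= 2.0, where DataFrame.append no longer exists.
df = (pd.concat([df2[~df2['Product'].isin(mapping['Product1'])], df3],
                ignore_index=True)
        .sort_values(['Country','Product'])
        .drop('mapp', axis=1))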

How to combine multiple rows of a pandas dataframe into one between two other row values (Python 3)?

I have a pandas dataframe with a single column that contains name, address, and phone info, separated by blank or NaN rows like this:
data
0 Business name one
1 1234 address ln
2 Town, ST 55655
3 (555) 555-5555
4 nan
5 Business name two
6 5678 address dr
7 New Town, ST 55677
8 nan
9 Business name three
10 nan
and so on...
What I want is this:
                  Name            Addr1               Addr2           Phone
0    Business name one  1234 address ln      Town, ST 55655  (555) 555-5555
1    Business name two  5678 address dr  New Town, ST 55677
2  Business name three
I am using Python 3 and have been stuck; any help is much appreciated!
You can use:
- create groups for each row with isnull and cumsum
- to align with the non-NaN rows, add reindex
- remove NaNs with dropna, then set_index to a MultiIndex with cumcount
- reshape with unstack
a = df['data'].isnull().cumsum().reindex(df.dropna().index)
print (a)
0    0
1    0
2    0
3    0
5    1
6    1
7    1
9    2
Name: data, dtype: int32
df = df.dropna().set_index([a, a.groupby(a).cumcount()])['data'].unstack()
df.columns = ['Name','Addr1','Addr2','Phone']
print (df)
                     Name            Addr1               Addr2           Phone
data
0       Business name one  1234 address ln      Town, ST 55655  (555) 555-5555
1       Business name two  5678 address dr  New Town, ST 55677            None
2     Business name three             None                None            None
If there are multiple address rows, it is possible to create the columns dynamically:
df.columns = (['Name'] +
              ['Addr{}'.format(x+1) for x in range(len(df.columns) - 2)] +
              ['Phone'])
df['group'] = df['data'].str.contains('Business').cumsum().replace({True:1}).ffill()
df1 = df.groupby('group')['data'].apply(list).apply(pd.Series).dropna(axis=1, thresh=1)
df1.columns = ['Name','Addr1','Addr2','Phone']
df1
Out[1221]:
                      Name            Addr1               Addr2           Phone
group
1.0      Business name one  1234 address ln      Town, ST 55655  (555) 555-5555
2.0      Business name two  5678 address dr  New Town, ST 55677             NaN
3.0    Business name three              NaN                 NaN             NaN

How to group by two Columns using Pandas?

I am working on an algorithm which requires grouping by two columns. Pandas supports grouping by two columns using:
df.groupby([col1, col2])
But the resulting dataframe is not the required dataframe.
Work Setup:
Python : v3.5
Pandas : v0.18.1
Pandas Dataframe - Input Data:
          Type  Segment
id
1     Domestic        1
2       Salary        3
3          NRI        1
4       Salary        4
5       Salary        3
6          NRI        4
7       Salary        4
8       Salary        3
9       Salary        4
10         NRI        4
Required Dataframe:
Count of [Domestic, Salary, NRI] in each Segment
         Domestic  Salary  NRI
Segment
1               1       3    1
3               0       0    0
4               0       3    2
Experiments:
group = df.groupby(['Segment', 'Type'])
group.size()
Segment  Type      Count
1        Domestic      1
         NRI           1
3        Salary        3
4        Salary        3
         NRI           2
I am able to achieve the required dataframe using the MS Excel Pivot Table feature. Is there any way I can achieve similar results using pandas?
After the GroupBy.size operation, a multi-index (2-level index) Series object gets created, which needs to be converted into a dataframe; this can be done by unstacking the 2nd index level and optionally filling the resulting NaNs with 0.
df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0)
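A closely related alternative (a sketch, not part of the original answer) is pd.crosstab, which tabulates the Segment/Type counts directly and fills missing combinations with 0:
# Sketch: crosstab counts each Segment/Type combination, 0 where absent.
pd.crosstab(df['Segment'], df['Type'])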

How to apply a fuzzy matching function on the target and reference columns for pandas dataframes

******Edited with Solution Below*******
I have carefully read the guidelines, hope the question is acceptable.
I have two pandas dataframes. I need to apply a fuzzy matching function to the target and reference columns and merge the data based on the similarity score, preserving the original data.
I have checked similar questions, e.g. see:
is it possible to do fuzzy match merge with python pandas?
but I am not able to use this solution.
So far I have:
df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
import distance
fulldf=[]
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        if distance.jaccard(name1, name2) < 0.6:
            fulldf.append({'RefName': name1, 'Score': distance.jaccard(name1, name2), 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
How can I include the 'NameId' and 'Type' (and possibly other columns) in the final output, e.g.:
df1_NameId RefName df1_Type df1_NewColumn Score df2_NameId TarName df2_Type df2_NewColumn
1 robert johnes Person … 0.0000 1 roberto johnes Person …
Is there a way to code this so that it is easily scalable and can be performed on datasets with hundreds of thousands of rows?
I have solved the original problem by unpacking the dataframes in the loop:
import distance
import pandas as pd
#Create test Dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results=[]
#Create two generators objects to loop through each dataframe row one at the time
#Call each dataframe element that you want to have in the final output in the loop
#Append results to the empty list you created
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df = pd.DataFrame(results)
print(result_df)
I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to the product gives the result you require.
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
Result:
   NameId_x                  RefName  Type_x  mergekey  NameId_y                 TarName  Type_y  Distance
0         1            robert johnes  Person         0         1          roberto johnes  Person  0.000000
1         1            robert johnes  Person         0         2           lew malinosky  Person  0.705882
2         1            robert johnes  Person         0         3  andreatta della blatta  Person  0.538462
3         2             lew malinsky  Person         0         1          roberto johnes  Person  0.764706
4         2             lew malinsky  Person         0         2           lew malinosky  Person  0.083333
5         2             lew malinsky  Person         0         3  andreatta della blatta  Person  0.666667
6         3  gioberto delle lanterne  Person         0         1          roberto johnes  Person  0.533333
7         3  gioberto delle lanterne  Person         0         2           lew malinosky  Person  0.588235
8         3  gioberto delle lanterne  Person         0         3  andreatta della blatta  Person  0.250000
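On pandas 1.2 or newer the dummy mergekey column is unnecessary, because merge supports a cross join directly (a sketch under that assumption):
# Sketch for pandas >= 1.2: how="cross" builds the Cartesian product
# without a dummy key column.
df_merged = pd.merge(df1, df2, how="cross")
df_merged["Distance"] = df_merged.apply(
    lambda x: distance.jaccard(x.RefName, x.TarName), axis=1)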
