Count from a dataframe, then sort based on that count [duplicate] - python-3.x

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1

Use value_counts(), as @DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]

If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.

df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()

df.apply(pd.value_counts).fillna(0)
value_counts - returns an object containing counts of unique values
apply - counts the frequency in every column; if you set axis=1, you get the frequency in every row
fillna(0) - makes the output tidier by replacing NaN with 0

In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.
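If you want the result shaped exactly like the freq table in the question (a category column plus a freq column), here is a minimal sketch; the rename_axis/reset_index step and the optional re-sort are just one way to do it:
import pandas as pd

df = pd.DataFrame({'category': ['cat a', 'cat b', 'cat a']})   # the question's data
counts = df['category'].value_counts()                          # descending by default
freq_table = counts.rename_axis('category').reset_index(name='freq')
freq_table = freq_table.sort_values('freq')                     # re-sort ascending if needed
print(freq_table)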

Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326

As everyone said, the fastest solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.
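A minimal sketch of the np.bincount() idea mentioned above; it only applies when the values are small non-negative integers, and the sample Series here is just an assumption for illustration:
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])      # hypothetical integer column
counts = np.bincount(s)                # counts[i] is how many times the value i occurs
print(counts)                          # [0 2 1 3]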

You can also do this with pandas by casting your columns to categories first, i.e. dtype="category", e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992

Without any libraries, you could do this instead:
def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

I believe this should work fine for any DataFrame columns list.
def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return pd.DataFrame(column_list_df)

column_list_df = column_list(df)  # df is your DataFrame
column_list_df.rename(columns={0: "Feature", 1: "Value_count"})
The column_list function iterates over the column names and counts the unique values in each column.

@metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!
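If you then need the Counter result back as a pandas object (to match the value_counts()-style output above), a minimal sketch reusing df and Counter from the snippet above:
freq = pd.Series(Counter(df['NumA'])).sort_values(ascending=False)
print(freq.head())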

The following code creates a frequency table for the values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value 300 appears in that column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

n_values = data.income.value_counts()

# First unique value count
n_at_most_50k = n_values[0]

# Second unique value count
n_greater_50k = n_values[1]

n_values
Output:
<=50K    34014
>50K     11208
Name: income, dtype: int64

n_greater_50k, n_at_most_50k
Output:
(11208, 34014)

Your data:
category
cat a
cat b
cat a
Solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

Related

Add Column For Results Of Dataframe Resample [duplicate]

I have the following data frame in IPython, where each row is a single stock:
In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker 21206 non-null values
Company 21210 non-null values
Country 21210 non-null values
MarketCap 21210 non-null values
PriceReturn 21210 non-null values
SEDOL 21210 non-null values
yearmonth 21210 non-null values
dtypes: float64(2), int64(1), object(4)
I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the "yearmonth" column.
This works as expected:
In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204 -0.109444
201205 -0.290546
But then I want to sort of "broadcast" these values back to the indices in the original data frame, and save them as constant columns where the dates match.
In [263]: dateGrps = bdata.groupby("yearmonth")
In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
TypeError: 'DataFrameGroupBy' object does not support item assignment
I realize this naive assignment should not work. But what is the "right" Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?
In the end, I want a column called "MarketReturn" that will be a repeated constant value for all indices whose date matches the output of the groupby operation.
One hack to achieve this would be the following:
marketRetsByDate = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))
for elem in marketRetsByDate.index.values:
    bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]
But this is slow, bad, and unPythonic.
In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})
In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
A B month A_r
0 -0.040710 0.182269 0 -0.331816
1 -0.004867 0.642243 1 2.448232
2 -0.162191 0.442338 4 2.045909
3 -0.979875 1.367018 5 -2.736399
4 -1.126198 0.338946 5 -2.736399
5 -0.992209 -1.343258 1 2.448232
6 -1.450310 0.021290 0 -0.331816
7 -0.675345 -1.359915 9 2.722156
While I'm still exploring all of the incredibly smart ways that apply concatenates the pieces it's given, here's another way to add a new column in the parent after a groupby operation.
In [236]: df
Out[236]:
yearmonth return
0 201202 0.922132
1 201202 0.220270
2 201202 0.228856
3 201203 0.277170
4 201203 0.747347
In [237]: def add_mkt_return(grp):
     .....:     grp['mkt_return'] = grp['return'].sum()
     .....:     return grp
     .....:
In [238]: df.groupby('yearmonth').apply(add_mkt_return)
Out[238]:
yearmonth return mkt_return
0 201202 0.922132 1.371258
1 201202 0.220270 1.371258
2 201202 0.228856 1.371258
3 201203 0.277170 1.024516
4 201203 0.747347 1.024516
As a general rule when using groupby(), if you use the .transform() function pandas will return a table with the same length as your original. When you use other functions like .sum() or .first() then pandas will return a table where each row is a group.
I'm not sure how this works with apply but implementing elaborate lambda functions with transform can be fairly tricky so the strategy that I find most helpful is to create the variables I need, place them in the original dataset and then do my operations there.
If I understand what you're trying to do correctly first you can calculate the total market cap for each group:
bdata['group_MarketCap'] = bdata.groupby('yearmonth')['MarketCap'].transform('sum')
This will add a column called "group_MarketCap" to your original data which would contain the sum of market caps for each group. Then you can calculate the weighted values directly:
bdata['weighted_P'] = bdata['PriceReturn'] * (bdata['MarketCap']/bdata['group_MarketCap'])
And finally you would calculate the weighted average for each group using the same transform function:
bdata['MarketReturn'] = bdata.groupby('yearmonth')['weighted_P'].transform('sum')
I tend to build my variables this way. Sometimes you can pull off putting it all in a single command but that doesn't always work with groupby() because most of the time pandas needs to instantiate the new object to operate on it at the full dataset scale (i.e. you can't add two columns together if one doesn't exist yet).
Hope this helps :)
May I suggest the transform method (instead of aggregate)? If you use it in your original example it should do what you want (the broadcasting).
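A minimal sketch of what that transform-based broadcasting could look like; bdata here is a small hypothetical frame shaped like the question's data, not the asker's actual dataset:
import pandas as pd

bdata = pd.DataFrame({
    'yearmonth': [201204, 201204, 201205, 201205],
    'PriceReturn': [0.01, -0.02, 0.03, -0.01],
    'MarketCap': [100.0, 300.0, 200.0, 200.0],
})

# weight each return by its market cap, then broadcast the per-group sums
# back onto every row with transform('sum')
weighted = bdata['PriceReturn'] * bdata['MarketCap']
bdata['MarketReturn'] = (weighted.groupby(bdata['yearmonth']).transform('sum')
                         / bdata.groupby('yearmonth')['MarketCap'].transform('sum'))
print(bdata)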
I did not find a way to assign the results to the original dataframe, so I just store the results from the groups and concatenate them. Then I sort the concatenated dataframe by index to restore the original order of the input dataframe. Here is a sample code:
In [10]: df = pd.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})
In [11]: df.head()
Out[11]:
month A B
0 4 -0.029106 -0.904648
1 2 -2.724073 0.492751
2 7 0.732403 0.689530
3 2 0.487685 -1.017337
4 1 1.160858 -0.025232
In [12]: res = []
In [13]: for month, group in df.groupby('month'):
    ...:     new_df = pd.DataFrame({
    ...:         'A^2+B': group.A ** 2 + group.B,
    ...:         'A+B^2': group.A + group.B**2
    ...:     })
    ...:     res.append(new_df)
    ...:
In [14]: res = pd.concat(res).sort_index()
In [15]: res.head()
Out[15]:
A^2+B A+B^2
0 -0.903801 0.789282
1 7.913327 -2.481270
2 1.225944 1.207855
3 -0.779501 1.522660
4 1.322360 1.161495
This method is pretty fast and extensible. You can derive any feature here.
Note: If the dataframe is too large, concat may cause an out-of-memory error.

Merge two Dataframes in combination with .isin() or .contains() or difflib? [duplicate]

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
.
If these were columns, in the same vein you could apply to the column then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
Using fuzzywuzzy
Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:
Example dataframes
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will get returned, sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
Installation:
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy
I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish

def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
For a general approach: fuzzy_merge
For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:
import difflib
def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other = df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None
Here are some use cases with two sample dataframes:
print(df1)
key number
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
print(df2)
key_close letter
0 three c
1 one a
2 too b
3 fours d
4 a very different string e
With the above example, we'd get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
And we could do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 NaN NaN
For a right join, all non-matching keys in the left dataframe are set to None:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
key number key_close letter
0 one 1.0 one a
1 two 2.0 too b
2 three 3.0 three c
3 four 4.0 fours d
4 None NaN a very different string e
Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:
print(df2)
letter
one a
too b
three c
fours d
a very different string e
We'd get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
In order to solve this the above function get_closest_match will return the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...
I would just do a separate step and use difflib's get_close_matches to create a new column in one of the 2 dataframes, and then merge/join on the fuzzy-matched column, as sketched below.
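A minimal sketch of that separate-step approach, reusing the df1/df2 from the question (the matched_key column name is just an illustrative choice):
import difflib
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2, 3, 4, 5]},
                   index=['one', 'two', 'three', 'four', 'five'])
df2 = pd.DataFrame({'letter': list('abcde')},
                   index=['one', 'too', 'three', 'fours', 'five'])

# build the fuzzy-matched key as an ordinary column, then join on it
df2['matched_key'] = [difflib.get_close_matches(k, df1.index, n=1)[0]
                      for k in df2.index]
print(df1.join(df2.set_index('matched_key')))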
I used the fuzzymatcher package and it worked well for me. Visit this link for more details on it.
use the below command to install
pip install fuzzymatcher
Below is the sample Code (already submitted by RobinL above)
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Errors you may get
ZeroDivisionError: float division by zero---> Refer to this
link to resolve it
OperationalError: No Such Module: fts4 --> download the sqlite3.dll
from here and replace the DLL file in your Python or Anaconda
DLLs folder.
Pros:
Works fast. In my case, I compared one dataframe with 3,000 rows against another dataframe with 170,000 records. It also uses SQLite3 full-text search, so it is faster than many alternatives.
Can check across multiple columns and 2 dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name might be the same, but the address is also a good thing to check.
Gives you a score for all the closest matches for the same record; you choose the cutoff score.
Cons:
Original package installation is buggy
Requires C++ and Visual Studio to be installed too
Won't work with 64-bit Anaconda/Python
There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenco methods, with some great examples here.
import pandas as pd
import fuzzy_pandas as fpd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
results = fpd.fuzzy_merge(df1, df2,
left_on='Key',
right_on='Key',
method='levenshtein',
threshold=0.6)
results.head()
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying get_close_matches, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.
import difflib
import numpy as np
import pandas as pd

def fuzzy_match(a, b):
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan
You can use d6tjoin for that
import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']
index number index_right letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 five e
It has a variety of additional features such as:
check join quality, pre and post join
customize similarity function, eg edit distance vs hamming distance
specify max distance
multi-core compute
For details see
MergeTop1 examples - Best match join examples notebook
PreJoin examples - Examples for diagnosing join problems
I have used fuzzywuzz in a very minimal way whilst matching the existing behaviour and keywords of merge in pandas.
Just specify your accepted threshold for matching (between 0 and 100):
from fuzzywuzzy import process
def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):

    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x) != str:
            return None
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
        if score >= threshold:
            return match
        else:
            return None

    if on is not None:
        left_on = on
        right_on = on

    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply,
        df=df,
        column=left_on,
        threshold=threshold
    )

    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')

    del merged_df['tmp']

    return merged_df
Try it out using the example data:
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
fuzzy_merge(df1, df2, on='Key', threshold=80)
Using thefuzz
Using SeatGeek's great package thefuzz, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.
Sample data
df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})
col_a col_b
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})
col_a col_b
0 one a
1 too b
2 three c
3 fours d
4 five e
Function used to do the matching
def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit)  # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
    )

    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left)  # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)  # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True)  # Drop column of match tuples

    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name  # Stash index name
    else:
        index_name = 'index'  # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)  # The previous index has now become a column: rename for ease of reference

    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )

    return df_matches
Use function and merge data
import pandas as pd
from thefuzz import process
df_matches = fuzzy_match(
    df1,
    df2,
    'col_a',
    'col_a',
    threshold=60,
    limit=1
)

df_output = df1.merge(
    df_matches,
    how='left',
    left_index=True,
    right_on='df_left_id'
).merge(
    df2,
    how='left',
    left_on='df_right_id',
    right_index=True,
    suffixes=['_df1', '_df2']
)
df_output.set_index('df_left_id', inplace=True) # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table
df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']] # Drop columns used in the matching
df_output.index.name = 'id'
id col_a_df1 col_b_df1 col_b_df2
0 one 1 a
1 two 2 b
2 three 3 c
3 four 4 d
4 five 5 e
Tip: Fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.
For more complex use cases to match rows with many columns you can use recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here
If the join axis is numeric, this could also be used to match indexes with a specified tolerance:
def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values

    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # min. of each column
    mask_i = np.arange(mask_j.shape[0])

    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]

    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]

    df2_.index = df1_.index

    out = pd.concat([df1_, df2_], axis=1)
    return out
TheFuzz is the new version of fuzzywuzzy.
In order to fuzzy-join string-elements in two big tables you can do this:
Use apply to go row by row
Use swifter to parallel, speed up and visualize default apply function (with colored progress bar)
Use OrderedDict from collections to get rid of duplicates in the output of merge and keep the initial order
Increase limit in thefuzz.process.extract to see more options for merge (stored in a list of tuples with % of similarity)
'*' You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just the one best-matched item (without specifying any limit). However, be aware that several results could have the same % of similarity and you will get only one of them.
'**' Somehow swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead.
from thefuzz import process
from collections import OrderedDict
import swifter
def match(x):
    matches = process.extract(x, df1, limit=6)
    matches = list(OrderedDict((m, True) for m in matches).keys())
    print(f'{x:20} : {matches}')
    return str(matches)
df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))
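For the extractOne variant mentioned in the note above, a minimal sketch reusing df1, df2 and process from the snippet above (plain .apply is used here for simplicity); extractOne returns a single (match, score) tuple per row:
best = df2['name'].apply(lambda x: process.extractOne(x, df1))
df2['best_match'] = best.str[0]     # the matched string
df2['match_score'] = best.str[1]    # its similarity score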

groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurences

I have a problem where I would like to count the number of times the current value has not changed in a dataframe over rolling periods.
For example:
df = pd.DataFrame({'col':list('aaaabbab')})
would somehow give output of
0
1
2
3
0
1
0
0
I have been trying something along the following
df['col'] = df['col'] == df['col'].shift(1)
df.rolling(window=3).sum().reset_index(drop=True, level=0)
I have added in the rolling as I will want to look at the full data set in terms of rolling periods but even without having it over rolling periods I can not quite figure out the logic.
I am not sure if I am missing something simple or this may not be possible using shift
You need to generate a grouper for the change in values. For this compare each value with the previous one and apply a cumsum. This gives you groups in the itertools.groupby style ([1, 1, 1, 1, 2, 2, 3, 4]), finally group and apply a cumcount.
df['count'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                 .cumcount()
               )
output:
col count
0 a 0
1 a 1
2 a 2
3 a 3
4 b 0
5 b 1
6 a 0
7 b 0
edit: for fun here is a solution using itertools (much faster):
from itertools import groupby, chain
df['count'] = list(chain(*(list(range(len(list(g))))
                           for _, g in groupby(df['col']))))
NB. this runs much faster (88 µs vs 707 µs on the provided example)
I can't comment, so just to add some more to @mozway's answer.
My goal was to count consecutive values for an entire huge dataframe efficiently.
The problem I encountered is that, by construction,
np.nan == np.nan
returns False, so you could have a whole column full of only NaN and yet the counter would stay at 0.
A simple workaround would be to replace all NaN in your df by a value not already in it.
For instance, in the case of a float dataset you could do
df.fillna('NA')
which works, but changing the dtype of your columns to object makes the following code much slower (20x on my setup).
I would rather advised something like :
all_values = list(np.unique(np.array(df)))
all_values = [a for a in all_values if a == a]
unik_val = min(all_values) - 1
temp = df.fillna(unik_val).copy()

from itertools import groupby, chain

for col in temp.columns:
    temp[col] = list(chain(*(list(range(len(list(g))))
                             for _, g in groupby(temp[col]))))
temp

How to split a pandas column into multiple columns [duplicate]

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I'm not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]
2017 Answer - pandas 0.20: .ix is deprecated. Use .loc
See the deprecation in the docs
.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.
Let's assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.
# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step
# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat
# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar
# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat
# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned
# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar
# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat
# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z
# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y
Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.
The DataFrame.ix index is what you want to be accessing. It's a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
b c d e
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced
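Since .ix is deprecated (per the note above), the same column slice can be written with .loc; a quick self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
df.loc[:, 'b':]    # same columns b..e that df.ix[:, 'b':] selects above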
Lets use the titanic dataset from the seaborn package as an example
# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')
using the column names
>> titanic.loc[:,['sex','age','fare']]
using the column indices
>> titanic.iloc[:,[2,3,6]]
using ix (Older than Pandas <.20 version)
>> titanic.ix[:,['sex','age','fare']]
or
>> titanic.ix[:,[2,3,6]]
using the reindex method
>> titanic.reindex(columns=['sex','age','fare'])
Also, given a DataFrame
data
as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th columns), the iloc method of the pandas dataframe is what you need and can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:
>>> data.iloc[:,[0,3]]
will give you
a d
0 0.883283 0.100975
1 0.614313 0.221731
2 0.438963 0.224361
3 0.466078 0.703347
4 0.955285 0.114033
5 0.268443 0.416996
6 0.613241 0.327548
7 0.370784 0.359159
8 0.692708 0.659410
9 0.806624 0.875476
You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]
And if you came here looking for slicing two ranges of columns and combining them together (like me) you can do something like
op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print op
This will create a new dataframe with the first 899 columns (indices 0 to 898) and all columns from index 3593 onwards (assuming you have some 4000 columns in your data set).
Here's how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.
In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))
In [44]: df
Out[44]:
a b c d e f g
0 0.409038 0.745497 0.890767 0.945890 0.014655 0.458070 0.786633
1 0.570642 0.181552 0.794599 0.036340 0.907011 0.655237 0.735268
2 0.568440 0.501638 0.186635 0.441445 0.703312 0.187447 0.604305
3 0.679125 0.642817 0.697628 0.391686 0.698381 0.936899 0.101806
In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
### with 2 different column ranges, index based slicing:
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]
These two are equivalent:
>>> print(df2.loc[140:160,['Relevance','Title']])
>>> print(df2.ix[140:160,[3,7]])
If the data frame looks like this:
group name count
fruit apple 90
fruit banana 150
fruit orange 130
vegetable broccoli 80
vegetable kale 70
vegetable lettuce 125
and you want the output to look like this:
group name count
0 fruit apple 90
1 fruit banana 150
2 fruit orange 130
then you can use the logical operator np.logical_not:
df[np.logical_not(df['group'] == 'vegetable')]
More about it:
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html
Other logical operators:
logical_and(x1, x2, /[, out, where, ...])  Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2, /[, out, where, ...])   Compute the truth value of x1 OR x2 element-wise.
logical_not(x, /[, out, where, ...])       Compute the truth value of NOT x element-wise.
logical_xor(x1, x2, /[, out, where, ...])  Compute the truth value of x1 XOR x2 element-wise.
You can use the method truncate
df = pd.DataFrame(np.random.rand(10, 5), columns = list('abcde'))
df_ab = df.truncate(before='a', after='b', axis=1)
df_cde = df.truncate(before='c', axis=1)

Panda remove duplicates but keep relationship [duplicate]

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can do also something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
n = 1_000_000
df = pd.DataFrame({
'A': np.random.randint(0, 20, n),
'B': np.random.randint(0, 20, n),
'C': np.random.uniform(size=n),
'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() to ensure equal solution):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates at column A; if you want, you can also get a nice and clean index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First you need to sort this DF as Column A as ascending and column B as descending
# Then you can drop the duplicate values in A column
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index().
While the already-given answers cover the question, I made a small change by adding the name of the column on which the max() function is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
A very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.
Firstly, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest to lowest value:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then, drop duplicates on column "A", keeping only the first item, which is already the one with the highest value:
df.drop_duplicates(subset='A', inplace=True)
This also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
