Performing a Principal Component Analysis to reconstruct time series creates more values than expected - python-3.x

I want to perform a Principal Component Analysis following this notebook to reconstruct the DJIA (fetched with alpha_vantage) from its components (fetched with Quandl). Yet it seems that I create more values than the original dataframe contains when reconstructing the index by multiplying the principal components by their weights:
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
Indeed, daily_df_components is built from the DJIA components returned by the Quandl API, which seems to cover more dates than the library I use to get the DJIA index itself, alpha_vantage.
Here is the full code
"""
Obtaining the components data from quandl
"""
import quandl
QUANDL_API_KEY = 'MYKEY'
quandl.ApiConfig.api_key = QUANDL_API_KEY
SYMBOLS = [
    'AAPL', 'MMM', 'BA', 'AXP', 'CAT',
    'CVX', 'CSCO', 'KO', 'DD', 'XOM',
    'GS', 'HD', 'IBM', 'INTC', 'JNJ',
    'JPM', 'MCD', 'MRK', 'MSFT', 'NKE',
    'PFE', 'PG', 'UNH', 'UTX', 'TRV',
    'VZ', 'V', 'WMT', 'WBA', 'DIS'
]
wiki_symbols = ['WIKI/%s' % symbol for symbol in SYMBOLS]
df_components = quandl.get(
    wiki_symbols,
    start_date='2017-01-01',
    end_date='2017-12-31',
    column_index=11)
df_components.columns = SYMBOLS
filled_df_components = df_components.fillna(method='ffill')
daily_df_components = filled_df_components.resample('24h').ffill()
daily_df_components = daily_df_components.fillna(method='bfill')
"""
Download the all-time DJIA dataset
"""
from alpha_vantage.timeseries import TimeSeries
# Update your Alpha Vantage API key here...
ALPHA_VANTAGE_API_KEY = 'MYKEY'
ts = TimeSeries(key=ALPHA_VANTAGE_API_KEY, output_format='pandas')
df, meta_data = ts.get_intraday(symbol='DIA',interval='1min', outputsize='full')
# Finding eigenvectors and eigenvalues
from sklearn.decomposition import KernelPCA
fn_z_score = lambda x: (x - x.mean())/x.std()
df_z_components = daily_df_components.apply(fn_z_score)
fitted_pca = KernelPCA().fit(df_z_components)
fn_weighted_average = lambda x: x/x.sum()
weighted_values = fn_weighted_average(fitted_pca.lambdas_)[:5]
# Reconstructing the Dow Average with PCA
import numpy as np
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
# Combine PCA and Index to compare
df_combined = djia_2020_weird.copy()
df_combined['pca_5'] = reconstructed_values
But it returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-2808dc14f789> in <module>()
9 # Combine PCA and Index to compare
10 df_combined = djia_2020_weird.copy()
---> 11 df_combined['pca_5'] = reconstructed_values
12 df_combined = df_combined.apply(fn_z_score)
13 df_combined.plot(figsize=(12,8));
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
746 if len(data) != len(index):
747 raise ValueError(
--> 748 "Length of values "
749 f"({len(data)}) "
750 "does not match length of index "
ValueError: Length of values (361) does not match length of index (14)
Indeed, reconstructed_values is 361 values long while df_combined is only 14 values long...
Here is this last dataframe:
DJI
date
2021-01-21 NaN
2021-01-22 311.37
2021-01-23 310.03
2021-01-24 310.03
2021-01-25 310.03
2021-01-26 309.01
2021-01-27 309.49
2021-01-28 302.17
2021-01-29 305.25
2021-01-30 299.20
2021-01-31 299.20
2021-02-01 299.20
2021-02-02 302.13
2021-02-03 307.86
Maybe the reason is that the notebook author was able to get data for the whole year he was interested in, whereas when I run the code it seems that I only get data spanning two months?

Ahoy there, I'm the author of the notebook. It seems Quandl no longer provides historical prices of DJIA after the time of writing, and copyright wasn't granted to redistribute the data. For research, you may consider other free stock tickers to proxy DJIA.
The example usages in the repo have been updated to demonstrate KernelPCA, as explained here.
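For readers hitting the same length mismatch: reconstructed_values is indexed by daily_df_components (a year of daily rows), while the intraday DIA frame only spans a few recent sessions, so the two can never line up. Below is a minimal sketch of one way to align them, assuming Alpha Vantage's get_daily endpoint and its usual '4. close' column label (verify against your library version):
import pandas as pd

# Sketch (not the notebook's original code): pull daily DIA closes and
# reindex them onto the same dates as the Quandl components.
dia_daily, _ = ts.get_daily(symbol='DIA', outputsize='full')
dia_close = dia_daily['4. close']          # column name assumed; check your alpha_vantage version
dia_close.index = pd.to_datetime(dia_close.index)

# Align to the components' daily index so both series have the same length
dia_aligned = dia_close.reindex(daily_df_components.index).ffill().bfill()

df_combined = dia_aligned.to_frame(name='DIA')
df_combined['pca_5'] = reconstructed_values  # lengths now match
df_combined = df_combined.apply(fn_z_score)
df_combined.plot(figsize=(12, 8))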

Related

Huggingface Transformers NER - Offset Mapping Causing ValueError in NumPy boolean array indexing assignment

I was trying out the NER tutorial Token Classification with W-NUT Emerging Entities (https://huggingface.co/transformers/custom_datasets.html#tok-ner) in google colab using the Annotated Corpus for Named Entity Recognition data on Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv).
I will outline my process in detail to facilitate an understanding of what I was doing and to let the community help me figure out the source of the indexing assignment error.
To load the data from google drive where I have saved it, I used the following code
# import pandas library
import pandas as pd
# columns to select
cols_to_select = ["Sentence #", "Word", "Tag"]
# google drive data path
data_path = '/content/drive/MyDrive/Colab Notebooks/ner/ner_dataset.csv'
# load the data from google colab
dataset = pd.read_csv(data_path, encoding="latin-1")[cols_to_select].fillna(method = 'ffill')
I run the following code to parse the sentences and tags
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def retrieve(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
# get full data
getter = SentenceGetter(dataset)
# get sentences
sentences = [[s[0] for s in sent] for sent in getter.sentences]
# get tags/labels
tags = [[s[1] for s in sent] for sent in getter.sentences]
# take a look at the data
print(sentences[0][0:5], tags[0][0:5], sep='\n')
I then split the data into train, val, and test sets
# import the sklearn module
from sklearn.model_selection import train_test_split
# split data in to temp and test sets
temp_texts, test_texts, temp_tags, test_tags = train_test_split(sentences,
                                                                tags,
                                                                test_size=0.20,
                                                                random_state=15)
# split data into train and validation sets
train_texts, val_texts, train_tags, val_tags = train_test_split(temp_texts,
                                                                temp_tags,
                                                                test_size=0.20,
                                                                random_state=15)
After splitting the data, I created encodings for tags and the tokens
unique_tags=dataset.Tag.unique()
# create tags to id
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
# create id to tags
id2tag = {id: tag for tag, id in tag2id.items()}
I then installed the transformer library in colab
# install the transformers library
! pip install transformers
Next I imported the small bert model
# import the transformers module
from transformers import BertTokenizerFast
# import the small bert model
model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
I then created the encodings for the tokens
# create train set encodings
train_encodings = tokenizer(train_texts,
                            is_split_into_words=True,
                            return_offsets_mapping=True,
                            padding=True,
                            max_length=128,
                            truncation=True)
# create validation set encodings
val_encodings = tokenizer(val_texts,
                          is_split_into_words=True,
                          return_offsets_mapping=True,
                          padding=True,
                          max_length=128,
                          truncation=True)
# create test set encodings
test_encodings = tokenizer(test_texts,
                           is_split_into_words=True,
                           return_offsets_mapping=True,
                           padding=True,
                           max_length=128,
                           truncation=True)
The tutorial uses offset mapping to handle the problems that arise with word-piece tokenization, specifically the mismatch between tokens and labels. It is when running the offset-mapping code from the tutorial that I get the error. Below is the offset-mapping function used in the tutorial:
# the offset function
import numpy as np
def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
# return the encoded labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
test_labels = encode_tags(test_tags, test_encodings)
After running the above code, it gives me the following error, and I can't figure out where the source of the error lies. Any help and pointers would be appreciated.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-afdff0186eb3> in <module>()
17
18 # return the encoded labels
---> 19 train_labels = encode_tags(train_tags, train_encodings)
20 val_labels = encode_tags(val_tags, val_encodings)
21 test_labels = encode_tags(test_tags, test_encodings)
<ipython-input-19-afdff0186eb3> in encode_tags(tags, encodings)
11
12 # set labels whose first offset position is 0 and the second is not 0
---> 13 doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
14 encoded_labels.append(doc_enc_labels.tolist())
15
ValueError: NumPy boolean array indexing assignment cannot assign 38 input values to the 37 output values where the mask is true
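This question has no answer in the thread, but a hedged diagnostic sketch (using only the variables defined above, not a fix) is to locate the documents where the boolean mask selects a different number of positions than there are word-level labels; a common culprit is a sentence whose word-piece tokenization exceeds max_length=128, so some word labels are truncated away:
import numpy as np

# Diagnostic sketch: find documents where the label count differs from the
# number of first-subword positions selected by the offset-mapping mask.
labels = [[tag2id[tag] for tag in doc] for doc in train_tags]
for i, (doc_labels, doc_offset) in enumerate(zip(labels, train_encodings.offset_mapping)):
    arr_offset = np.array(doc_offset)
    mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
    if mask.sum() != len(doc_labels):
        print(f"doc {i}: {len(doc_labels)} labels vs {mask.sum()} masked positions")
        print(train_texts[i])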

KeyError: "None of [Index(['23/01/2020' ......,\n dtype='object', length=9050)] are in the [columns]"

I am learning pandas and matplotlib on my own by using some public dataset via
this api link
I'm using colab and below are my codes:
import datetime
import io
import json
import pandas as pd
import requests
import matplotlib.pyplot as plt
confirm_resp = requests.get('https://api.data.gov.hk/v2/filterq=%7B%22resource%22%3A%22http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv%22%2 C%22section%22%3A1%2C%22format%22%3A%22json%22%7D').content
confirm_df = pd.read_json(io.StringIO(confirm_resp.decode('utf-8')))
confirm_df.columns = confirm_df.columns.str.replace(" ", "_")
pd.to_datetime(confirm_df['Report_date'])
confirm_df.columns = ['Case_no', 'Report_date', 'Onset_date', 'Gender', 'Age',
'Name_of_hospital_admitted', 'Status', 'Resident', 'Case_classification', 'Confirmed_probable']
confirm_df = confirm_df.drop('Name_of_hospital_admitted', axis = 1)
confirm_df.head()
and this is what the dataframe looks like:
Case_no  Report_date  Onset_date  Gender  Age  Status      Resident         Case_classification  Confirmed_probable
1        23/01/2020   21/01/2020  M       39   Discharged  Non-HK resident  Imported case        Confirmed
2        23/01/2020   18/01/2020  M       56   Discharged  HK resident      Imported case        Confirmed
3        24/01/2020   20/01/2020  F       62   Discharged  Non-HK resident  Imported case        Confirmed
4        24/01/2020   23/01/2020  F       62   Discharged  Non-HK resident  Imported case        Confirmed
5        24/01/2020   23/01/2020  M       63   Discharged  Non-HK resident  Imported case        Confirmed
When I try to make a simple plot with the below code:
x = confirm_df['Report_date']
y = confirm_df['Case_classification']
confirm_df.plot(x, y)
It gives me the below error:
KeyError Traceback (most recent call last)
<ipython-input-17-e4139a9b5ef1> in <module>()
4 y = confirm_df['Case_classification']
5
----> 6 confirm_df.plot(x, y)
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
912 if is_integer(x) and not data.columns.holds_integer():
913 x = data_cols[x]
--> 914 elif not isinstance(data[x], ABCSeries):
915 raise ValueError("x must be a label or position")
916 data = data.set_index(x)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2910 if is_iterator(key):
2911 key = list(key)
-> 2912 indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
2913
2914 # take() does not accept boolean indexers
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "*None of [Index(['23/01/2020', '23/01/2020', '24/01/2020', '24/01/2020', '24/01/2020',\n '26/01/2020', '26/01/2020', '26/01/2020', '29/01/2020', '29/01/2020',\n ...\n '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021',\n '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021'],\n dtype='object', length=9050)] are in the [column*s]"
I have tried to make the plot with and without converting Report date to datetime object, I tried substitute x value with all the columns in the data frame, but all give me the same error code.
Appreciate if anyone can help me to understand how to handle these issues here and going forward. I've spent hours to resolve it but cannot find the answers.
I did not encounter this issue before when I downloaded some notebooks and datasets from Kaggle to follow along.
Thank you and happy new year.
First, you need to assign the converted date back to the column:
confirm_df['Report_date'] = pd.to_datetime(confirm_df['Report_date'])
Second, when the plot method is called on a dataframe object, you need to provide only the column names as arguments (1).
confirm_df.plot(x='Report_date', y='Case_classification')
But the above code still throws an error because 'Case_classification' is not numeric data.
You are trying to plot datetime vs. categorical data, so a normal line plot won't work, but something like this could work (2):
# I used only first 15 examples here, full dataset is kinda messy
confirm_df.iloc[:15, :].groupby(['Report_date', 'Case_classification']).size().unstack().plot.bar()
(1)pandas.DataFrame.plot
(2)How to plot categorical variable against a date column in Python
Several problems. First, the links were incorrect, I have edited them (probably just a copy/paste error). Second, you have to assign the converted datetime series back to the dataframe. Use print(confirm_df.dtypes) to see the difference. Then, the dataset is not ordered by date, but matplotlib expects an ordered x-axis. Well, actually, the problem was that the parser misinterpreted the datetime objects. I have added dayfirst=True to ensure that the dates are read correctly. Finally, what do you want to plot here? Just the cases by date? The number of cases per group by date? Your original code implies just the former but this is not really informative, is it?
import io
import pandas as pd
import requests
import matplotlib.pyplot as plt
print("starting download")
confirm_resp = requests.get('https://api.data.gov.hk/v2/filter?q=%7B%22resource%22%3A%22http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv%22%2C%22section%22%3A1%2C%22format%22%3A%22json%22%7D').content
print("finished download")
confirm_df = pd.read_json(io.StringIO(confirm_resp.decode('utf-8')))
confirm_df.columns = confirm_df.columns.str.replace(" ", "_")
confirm_df['Report_date'] = pd.to_datetime(confirm_df['Report_date'], dayfirst=True)
confirm_df.columns = ['Case_no', 'Report_date', 'Onset_date', 'Gender', 'Age',
'Name_of_hospital_admitted', 'Status', 'Resident', 'Case_classification', 'Confirmed_probable']
confirm_df = confirm_df.drop('Name_of_hospital_admitted', axis = 1)
print(confirm_df.dtypes)
fig, ax = plt.subplots(figsize=(20, 5))
ax.plot(confirm_df['Report_date'], confirm_df['Case_classification'])
plt.tight_layout()
plt.show()
Sample output:
Some grouping and data aggregation might be more informative, but you have to decide what you want to display first before writing the code.
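For instance, one possible aggregation (a sketch; daily case counts per classification is an assumption about what you actually want to display):
# Sketch: count reported cases per day and classification, then plot
daily_counts = (confirm_df
                .groupby([confirm_df['Report_date'].dt.date, 'Case_classification'])
                .size()
                .unstack(fill_value=0))
daily_counts.plot(figsize=(20, 5))
plt.tight_layout()
plt.show()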

Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module

I was trying auto feature engineering and selecting, so for that, I used the Boston house price dataset available in sklearn.
from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y= data.target
y = pd.DataFrame(y)
Then I implemented the feature transformation library on the dataset.
import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)
After this, I implemented another function to find the score of each feature in relation to the label.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))
This gave error as following.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
1 from sklearn.feature_selection import SelectKBest, chi2
2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
4 dfscores = pd.DataFrame(X_new.scores_)
5 dfcolumns = pd.DataFrame(X_new_done.columns)
ValueError: Input X must be non-negative.
I had a few negative numbers in my dataset. So how can I overcome this problem?
Note: df has no transformations of y; it only contains transformations of x.
You have a feature with all negative values:
df['exp(x005)*log(x000)']
returns
0 -3630.638503
1 -2212.931477
2 -4751.790753
3 -3754.508972
4 -3395.387438
...
501 -2022.382877
502 -1407.856591
503 -2998.638158
504 -1973.273347
505 -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message Input X must be non-negative says it all: Pearson's chi square test (goodness of fit) does not apply to negative values. It's logical because the chi square test assumes frequencies distribution and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts the input is non-negative.
In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to [0, 1] interval as suggested by EdChum.
If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
sklearn.feature_selection.f_regression computes ANOVA f-value
sklearn.feature_selection.mutual_info_classif computes the mutual information
Since the whole point of this procedure is to prepare the features for another method, it's not a big deal to pick any one of them; the end result is usually the same or very close.
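Following that advice, here is a minimal sketch using the variables from the question (f_regression sidesteps the non-negativity constraint and suits the continuous Boston target; rescaling to [0, 1] is the alternative if you want to keep chi2):
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Sketch: score the autofeat features with the ANOVA F-value instead of chi2,
# which has no non-negativity requirement.
selector = SelectKBest(f_regression, k=20)
X_new_done = selector.fit_transform(df, y.values.ravel())
featureScores = pd.DataFrame({'Specs': df.columns, 'Score': selector.scores_})
print(featureScores.nlargest(10, 'Score'))

# Alternative sketch: rescale every feature to [0, 1] so chi2 accepts it
# (statistical caveats aside).
df_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)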

getting error while forming train matrix in book recommendation system

I am new to data science and facing issues while creating a book recommendation system by collaborative filtering. Can someone please advise on the below error.
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)
from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0]
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    # [user_id index, book_id index] = given rating
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
train_data_matrix
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
2 for line in train_data.itertuples():
3 #[user_id index, book_id index] = given rating.
----> 4 train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
5 train_data_matrix
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The most probable cause of the error is a mismatch in the index values: the raw user_id and isbn values are used directly as matrix positions. I can see ISBN is an int type, but what about user_id?
Fix:
The fix is to create a unique index for these n_users * n_book entries.
Method 1: build another unique dataframe for consumers and items and use its index.
Method 2: create a dict with the unique values as keys and some index as values.
Whichever method is used, it should be applied consistently across the rest of the process, or it will result in a mismatch of book-item ratings.
This fix uses Method 2 (a sketch of Method 1 is shown at the end of this answer).
# Method 2
user_dict = {}
for item, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = item

book_dict = {}
for item, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = item

print(len(user_dict.keys()), len(book_dict.keys()))

train_data_matrix = np.zeros((len(user_dict), len(book_dict)))
for line in train_data.itertuples():
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]
Hope this helps. A snapshot of the data would probably help to pin this down further.
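And here is a sketch of Method 1, using pandas' factorize to build the unique indexes (the 'rating' column name below is an assumption; match it to your CSV's actual header):
# Method 1 (sketch): let pandas assign consecutive integer codes
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

user_codes, user_uniques = pd.factorize(df['user_id'])
book_codes, book_uniques = pd.factorize(df['isbn'])
df = df.assign(user_idx=user_codes, book_idx=book_codes)

train_data, test_data = train_test_split(df, test_size=0.5)
train_data_matrix = np.zeros((len(user_uniques), len(book_uniques)))
for user_idx, book_idx, rating in zip(train_data['user_idx'],
                                      train_data['book_idx'],
                                      train_data['rating']):   # 'rating' is a hypothetical column name
    train_data_matrix[user_idx, book_idx] = rating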

fuzzy lookup between 2 series/columns of nonidentical lengths

I am trying to do a fuzzy lookup between two series/columns of df1 and df2, where df1 is the dictionary file (to be used as a base) and df2 is the target file (to be looked up against).
import pandas as pd
df1 = pd.DataFrame(data ={'Brand_var':['Altmeister Bitter','Altos Las Hormigas Argentinian Wine','Amadeus Contri Sparkling Wine','Amadeus Cream Liqueur','Amadeus Sparkling Sparkling Wine']})
df2 = pd.DataFrame(data = {'Product':['1960 Altmeister 330ML CAN METAL','Hormi 12 Yr Bottle','test']})
I looked for some solutions on SO but unfortunately couldn't find one.
Used:
df3 = df2['ProductLongDesc'].apply(lambda x: difflib.get_close_matches(x, df1['Brand_var'])[0])
also :
df3 = df2['Product'].apply(lambda x: difflib.get_close_matches(x, df1['Brand_var']))
The first one gives me an index error and the second one gives me just the indexes.
My desired output is a mapping between df1 and df2 items using a fuzzy lookup, printing both Brand_var and Product for their respective matches.
Desired Output:
Brand_var Product
Altmeister Bitter 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine Hormi 12 Yr Bottle
Non-matching items (e.g. test in df2) can be ignored.
Note: the matching strings may also be non-identical, e.g. with one or two letters missing. :(
Thank you in advance for taking your time out for this issue. :)
Even if you install fuzzywuzzy, you are still left with the problem of how to choose a proper heuristic to select the right product and to cut the products that are matched incorrectly (explanation below).
Install fuzzywuzzy:
pip install fuzzywuzzy
fuzzywuzzy has several methods for ratio calculation (examples on GitHub). You face the problem: how to choose the best one? I tried them on your data, but all of them failed.
Code:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
# df1 = ...
# df2 = ...
def get_top_by_ratio(x, df2):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Apply this function to your data:
partial_ratio = (df1.Brand_var.apply(lambda x: get_top_by_ratio(x, df2))
                 .apply(pd.Series)  # convert returned Series of tuples into pd.DataFrame
                 .rename(columns={0: 'ratio', 1: 'Product'}))  # just rename columns
print(partial_ratio)
Out:
0 65 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 50 test # Altos Las Hormigas Argentinian Wine
2 33 test
3 50 test
4 50 test
That's not good. Other ratio methods such as fuzz.ratio, fuzz.token_sort_ratio, etc. failed too.
So I guess extending the heuristic to compare words, not only characters, might help. Define a function that creates a vocabulary from your data, encode all the sentences, and use a more sophisticated heuristic that looks at words too:
def create_vocab(df1, df2):
    # Leave index 0 free for unknown words
    all_words = set((df1.Brand_var.str.cat(sep=' ') + ' ' + df2.Product.str.cat(sep=' ')).split())
    vocab = dict([(w, i + 1) for i, w in enumerate(all_words)])
    return vocab

def encode(string, vocab):
    """This function encodes a string with the vocabulary"""
    return [vocab[w] if w in vocab else 0 for w in string.split()]
Define new heuristic:
def get_top_with_heuristic(x, df2, vocab):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio_per_char = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    # compare two strings by words
    ratio_per_word = np.array([fuzz.partial_ratio(x, encode(val, vocab)) for val in product_values])
    ratio = ratio_per_char + ratio_per_word
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Create vocabulary, apply sophisticated heuristic to the data:
vocab = create_vocab(df1, df2)
heuristic_rating = (df1.Brand_var.apply(lambda x: get_top_with_heuristic(x, df2, vocab))
                    .apply(pd.Series)
                    .rename(columns={0: 'ratio', 1: 'Product'}))
print(heuristic_rating)
Out:
ratio Product
0 73 1960 Altmeister 330ML CAN METAL # Altmeister Bitter
1 61 Hormi 12 Yr Bottle # Altos Las Hormigas Argentinian Wine
2 45 Hormi 12 Yr Bottle
3 50 test
4 50 test
It seems to be correct! Concatenate this dataframe to df1, change index:
result_heuristic = pd.concat((df1, heuristic_rating), axis=1).set_index('Brand_var')
print(result_heuristic)
Out:
ratio Product
Brand_var
Altmeister Bitter 73 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine 61 Hormi 12 Yr Bottle
Amadeus Contri Sparkling Wine 45 Hormi 12 Yr Bottle
Amadeus Cream Liqueur 50 test
Amadeus Sparkling Sparkling Wine 50 test
Now you should choose some rule of thumb to cut the incorrect data. For this example ratio <= 50 works well, but you probably need some research to define the best heuristic and the correct threshold. You will get some errors anyway. Choose an acceptable error rate, e.g. 2%, 5%, ... and improve your algorithm until you reach it (this task is similar to the validation of machine learning classification algorithms).
Cut incorrect "predictions":
result = result_heuristic[result_heuristic.ratio > 50][['Product']]
print(result)
Out: Product
Brand_var
Altmeister Bitter 1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine Hormi 12 Yr Bottle
Hope it helps!
P.S. Of course, this algorithm is very slow; when you optimize it you should, for example, cache the diffs, etc.
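As an aside, fuzzywuzzy also ships a process.extractOne helper that does the per-row lookup and thresholding in one call; the scorer and cutoff in this sketch are assumptions you would still have to tune on your data, as discussed above:
from fuzzywuzzy import fuzz, process

# Sketch: pick the best Product for each Brand_var, dropping weak matches
def best_product(brand, choices, cutoff=60):
    match = process.extractOne(brand, choices,
                               scorer=fuzz.token_set_ratio,
                               score_cutoff=cutoff)
    return match[0] if match else None

df1['Product'] = df1['Brand_var'].apply(lambda b: best_product(b, df2['Product'].tolist()))
print(df1.dropna(subset=['Product']))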
