Fuzzywuzzy match 2 columns... script keeps running - python-3.x

I'm trying to match 2 columns of ~50,000 rows each with Fuzzywuzzy.
Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.
I'm trying to match the typo ones with correct ones. When running my script below, the kernel keeps executing for hours & doesn't provide a result.
Any ideas on how to improve?
Many thanks!
Update link to files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)

There are a couple of things you can change to improve the performance:
Use RapidFuzz instead of FuzzyWuzzy. It implements the same algorithms but is considerably faster (I am the author).
The process functions preprocess every string you pass to them (lowercasing, removing non-alphanumeric characters, and trimming whitespace). Right now you're preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time; it can be done once before the loop instead.
You're only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable, and inside extractOne RapidFuzz uses the results of previous comparisons to improve performance.
The following snippet implements these changes for your code. Keep in mind that you're still performing 50k^2 comparisons, so while this should be a lot faster than your current solution, it will still take a while.
import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
company_mapping = {company: utils.default_process(company) for company in correct.Correct}
for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
Out of interest I performed a quick benchmark calculating the average runtime when using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).

Related

What would be the R equivalent of this Python machine learning program?

As part of a school assignment on DSLs and code generation, I have to translate the following program written in Python/Scikit-learn into R (the topic of the exercise is a hypothetical Machine Learning DSL).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('boston.csv', sep=',')
df.head()
y = df["medv"]
X = df.drop(columns=["medv"])
clf = DecisionTreeRegressor()
scoring = ['neg_mean_absolute_error','neg_mean_squared_error']
results = cross_validate(clf, X, y, cv=6,scoring=scoring)
print('mean_absolute_errors = '+str(results['test_neg_mean_absolute_error']))
print('mean_squared_errors = '+str(results['test_neg_mean_squared_error']))
Since I'm a perfect newbie in Machine Learning, and especially in R, I can't do it.
Could someone help me?
Sorry for the late answer, probably you have already finished your school assignment. Of course we cannot just do it for you, you probably have to figure it out by yourself. Moreover, I don't get exactly what you need to do. But some tips are:
Read a csv file
data <-read.csv(file="name_of_the_file", header=TRUE, sep=",")
data <-as.data.frame(data)
The header=TRUE indicates that the file's first row contains the names of the columns; the sep=',' is the same as in Python (the separator in the file is ',').
The as.data.frame makes sure that your data is kept in a dataframe format.
Add/delete a column
data <- data[, names(data) != "name_of_the_column_to_be_deleted"] #delete a column
data$name_of_column_to_be_added<- c(1:10) #add column
In order to add a column you will need to add the elements it will include. Also the # symbol indicates the beginning of a comment.
Modelling
For the modelling part I am not sure what you want to achieve, but R offers a huge selection of algorithms to choose from. For example, if you want to grow a tree, take a look at https://www.statmethods.net/advstats/cart.html, which uses the following script to grow a tree:
library(rpart) # provides rpart() and the kyphosis data set
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method="class", data=kyphosis)

Using weighted adjacency matrices to calculate global efficiency of said matrix using networkx

I have been trying to study the impact on a network by looking at deletions of different combinations of nodes.
To study this I have used the networkx graph theory metric, global efficiency. However, I found that the networkx code ignores edge weights when calculating global efficiency, so I changed the source code to take the weight into account. It seems to be working and gives me different values than the unweighted approach, but it is exceptionally slow (about 20 times slower).
How can I speed up these computations?
##The code I am running
import networkx
import numpy as np
from networkx import algorithms
from networkx.algorithms import efficiency
from networkx.algorithms.efficiency import global_efficiency
import pandas
data=pandas.read_csv("ones.csv")
lol = data.values.tolist()
data=pandas.read_csv("twos.csv")
lol2 = data.values.tolist()
combo=[["10pp", "10d"]]
GE_list=[]
for row in combo:
    values = row
    datasafe=pandas.read_csv("b1.csv", index_col=0)
    datasafe.loc[values, :] = 0
    datasafe[values] = 0
    g=networkx.from_pandas_adjacency(datasafe)
    ge=global_efficiency(g)
    GE_list.append(ge)
extra=[""]
extra2=["full"]
combo.append(extra)
combo.append(extra2)
datasafe=pandas.read_csv("b1.csv", index_col=0)
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
values = ["s6-8","p9-46v","p47r","p10p","IFSp","IFSa",'IFJp','IFJa','i6-8','a9-46v','a47r','a10p','9p','9a','9-46d','8C','8BL','8AV','8AD','47s','47L','10pp','10d','46','45','44']
datasafe=pandas.read_csv("b1.csv", index_col=0)
datasafe.loc[values, :] = 0
datasafe[values] = 0
g=networkx.from_pandas_adjacency(datasafe)
ge=global_efficiency(g)
GE_list.append(ge)
output=pandas.DataFrame(list(zip(combo, GE_list)))
output.to_csv('delete 1.csv',index=None)
##The change I made to the original networkx code
try:
    eff = 1 / nx.shortest_path_length(G, u, v)
## changed to
try:
    eff = 1 / nx.shortest_path_length(G, u, v, weight='weight')
Previously, with my unweighted graphs, I was able to process my data in 2 hours; currently it's taking the same time to do a twentieth of the data. Please do suggest any improvements to my code or any other pieces of code that I can run.
PS: I don't have a great understanding of Python, so please do bear with me :)
With weights, the breadth-first search is replaced by Dijkstra's algorithm, which increases the runtime by a factor of log|V|; see the second comment on https://stackoverflow.com/a/25449911
If runtime is a problem, you should consider replacing networkx, which is implemented in pure Python, with a library backed by a C or C++ implementation such as graph-tool or igraph. For a (probably biased) performance comparison, see https://graph-tool.skewed.de/performance
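Staying within networkx, part of the slowdown may also come from calling shortest_path_length once per node pair, which starts a separate Dijkstra search for every pair. A minimal sketch of an alternative, assuming the edge weights are meant to be read as distances, is to run one search per source node with all_pairs_dijkstra_path_length. The function name weighted_global_efficiency is hypothetical and not part of networkx.

```python
import networkx as nx


def weighted_global_efficiency(G, weight="weight"):
    """Average of 1/d(u, v) over all ordered node pairs (hypothetical helper).

    d(u, v) is the weighted shortest-path length; unreachable pairs add 0.
    """
    n = len(G)
    denom = n * (n - 1)
    if denom == 0:
        return 0.0
    total = 0.0
    # One Dijkstra run per source node instead of one per node pair.
    for source, dists in nx.all_pairs_dijkstra_path_length(G, weight=weight):
        for target, d in dists.items():
            # Skip the source itself and guard against zero-length paths.
            if target != source and d > 0:
                total += 1.0 / d
    return total / denom
```

In the script from the question this would replace the call global_efficiency(g) with weighted_global_efficiency(g), leaving the installed networkx source untouched.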

how to use pandas to organize sales data into 12 months and find the 10 most profitable products for those 12 months?

I need to write a program that organizes the data in the provided spreadsheet to find the top 10 most profitable products for each month. The program needs to take an input from the user to specify the year for which to compile the data.
I've gotten as far as printing all of the products sold in each month by their highest profitability but I don't know how to make it print only the top 10 for each month.
I'm also lost on how to take an input from the user to select only certain year for the program to compile the data.
Please help.
the link to download the files for my project: https://drive.google.com/drive/folders/1VkzTWydV7Qae7hOn6WUjDQutQGmhRaDH?usp=sharing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
xl = pd.ExcelFile("SalesDataFull.xlsx")
OrdersOnlyData = xl.parse("Orders")
df_year = OrdersOnlyData["Order Date"].dt.year
OrdersOnlyData["Year"] = df_year
df_month = OrdersOnlyData["Order Date"].dt.month
OrdersOnlyData["Month"] = df_month
dataframe = OrdersOnlyData[["Year","Month","Product Name","Profit"]]
month_profit = dataframe.groupby(["Year","Month","Product Name"]).Profit.sum().sort_values(ascending=False)
month_profit = month_profit.reset_index()
month_profit = month_profit.sort_values(["Year","Month","Profit"],ascending=[True,True,False])
print(month_profit)
As @Franco pointed out, it is difficult to recommend a proper solution since you did not provide a data sample with your question. In any case, the function you are looking for is most likely nth().
This is probably what it should look like:
month_profit = month_profit.sort_values('Profit', ascending=False).groupby(['Year', 'Month']).nth(list(range(10))).sort_values(by=['Year', 'Month', 'Profit'], ascending=[True, True, False])
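The question also asks how to restrict the result to a year chosen by the user. A minimal sketch of one way to do both steps is given here, assuming a data frame with the Year, Month, Product Name and Profit columns built in the question. The function name top_products_by_month is hypothetical, and groupby(...).head(n) is used in place of nth() as a simpler way to keep the first n rows of each group.

```python
import pandas as pd


def top_products_by_month(orders, year, n=10):
    """Return the n most profitable products per month of one year."""
    # Keep only the rows for the requested year.
    one_year = orders[orders["Year"] == year]
    # Total profit per product within each month.
    totals = one_year.groupby(["Month", "Product Name"],
                              as_index=False)["Profit"].sum()
    # Sort so that the most profitable products come first in every month.
    totals = totals.sort_values(["Month", "Profit"], ascending=[True, False])
    # head(n) keeps the first n rows of each month group.
    return totals.groupby("Month").head(n).reset_index(drop=True)


# Possible usage with the data frame from the question:
# year = int(input("Year to compile: "))
# print(top_products_by_month(dataframe, year))
```

The input() call returns a string, so it is converted with int() before being compared against the integer Year column.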

RuntimeWarning: divide by zero encountered in log when using pvlib

I'm using PVLib to model a PV system. I'm pretty new to coding and Python, and this is my first time using PVLib, so not surprisingly I've hit some difficulties.
Specifically, I've created the following code using the extensive readthedocs examples at http://pvlib-python.readthedocs.io/en/latest/index.html
import pandas as pd
import numpy as np
from numpy import isnan
import datetime
import pytz
# pvlib imports
import pvlib
from pvlib.forecast import GFS, NAM, NDFD, HRRR, RAP
from pvlib.pvsystem import PVSystem, retrieve_sam
from pvlib.modelchain import ModelChain
# set location (Royal Greenwich Observatory, London, UK)
latitude, longitude, tz = 51.4769, 0.0005, 'Europe/London'
# specify time range.
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=5)
periods = 8 # number of periods that the GFS model and/or the model chain allows us to forecast power output.
# specify what irradiance variables we want
irrad_vars = ['ghi', 'dni', 'dhi']
# Use Global Forecast System model. The GFS is the US model that provides forecasts for the entire globe.
fx_model = GFS() # note: gives output in 3-hourly intervals
# retrieve data in processed format (convert temps from Kelvin to Celsius, combine elements of wind speed, complete irradiance data)
# Returns pandas.DataFrame object
fx_data = fx_model.get_processed_data(latitude, longitude, start, end)
# load module and inverter specifications
sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
cec_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
module = sandia_modules['SolarWorld_Sunmodule_250_Poly__2013_']
inverter = cec_inverters['ABB__PVI_3_0_OUTD_S_US_Z_M_A__240_V__240V__CEC_2014_']
# model a fixed system in the UK. 10 strings of 250W panels, with 40 panels per string. Gives a nominal 100kW array
system = PVSystem(module_parameters=module, inverter_parameters=inverter, modules_per_string=40, strings_per_inverter=10)
# use a ModelChain object to calculate modelling intermediates
mc = ModelChain(system, fx_model.location, orientation_strategy='south_at_latitude_tilt')
# extract relevant data for model chain
mc.run_model(fx_data.index, weather=fx_data)
# OTHER CODE AFTER THIS TO DO SOMETHING WITH THE DATA
Having used a lot of print() statements in the console to debug, I can see that at the final line
mc.run_model(fx_data.index....
I get the following error:
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1317:
RuntimeWarning: divide by zero encountered in log
module['Voco'] + module['Cells_in_Series']*delta*np.log(Ee) +
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1323:
RuntimeWarning: divide by zero encountered in log
module['C3']*module['Cells_in_Series']*((delta*np.log(Ee)) ** 2) +
As a result, when I then go on to look at the ac_power outputs, I get what looks like erroneous data (every hour with a forecast that is not NaN = 3000 W).
I'd really appreciate any help you can give as I don't know what's causing it. Maybe I'm specifying the system incorrectly?
Thanks, Matt
I think the warnings you're seeing are ok to ignore. A handful of pvlib algorithms spit out warnings due to things like 0 values at night.
I think your problem with the non-NaN values is unrelated to the warnings. Study the other modeling results (stored as mc attributes -- see documentation and source code) to see if you can track down the source of your problem.
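To illustrate where this kind of warning comes from, here is a minimal NumPy-only sketch. It is not pvlib code: the helper name log_with_warning_count and the sample values are hypothetical. Taking the logarithm of an array that contains 0 (as an irradiance series would at night) gives -inf for that entry and emits the RuntimeWarning, and np.errstate can silence it locally.

```python
import warnings

import numpy as np


def log_with_warning_count(values, ignore_divide=False):
    """Return np.log(values) and the number of warnings raised (hypothetical helper)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # divide="warn" is NumPy's default behaviour; "ignore" silences it.
        with np.errstate(divide="ignore" if ignore_divide else "warn"):
            result = np.log(values)
    return result, len(caught)


# Hypothetical irradiance-like values: 0 at night, positive during the day.
ee = np.array([0.0, 0.5, 1.0])
noisy_result, noisy_count = log_with_warning_count(ee)
quiet_result, quiet_count = log_with_warning_count(ee, ignore_divide=True)
```

Both calls return the same array, with -inf in the first position; only the warning differs. This is why the warning by itself does not explain a constant 3000 W output.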

Adding numerical values from dict to a new column in a Pandas DataFrame

I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (around 1300 of them). My logic was to create a dictionary of the unique genres and label each with an integer, then use a for loop to iterate through each row of the dataframe, check its genre, and assign the appropriate value to a new column named "genre_id". However, this runs seemingly forever, and I cannot even break out of it with Ctrl-C. The same thing happens in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data,ratings_data, how = 'inner', on=['movieId'])
print(joined.head())
pd.options.display.float_format = '{:,.2f}'.format
genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1
joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably be taking a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the "setting values from copy" warning (SettingWithCopyWarning), though this hasn't caused me issues in the past on my other projects. Any thoughts on how to do this would be greatly appreciated.
EDIT Found a solution using the Series map feature.
joined['genre_id'] = joined.genres.map(genre_dict)
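A small self-contained sketch of the map approach is given here, using a made-up four-row frame in place of the real dataset. It also shows pd.factorize, which builds the integer codes directly without a hand-written dictionary; the genre_code column name is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the merged movies/ratings frame (made-up values).
joined = pd.DataFrame(
    {"genres": ["Comedy", "Drama", "Comedy", "Action|Drama"]}
)

# Same idea as the loop in the question: one integer per unique genre.
genre_dict = {
    genre: idx
    for idx, genre in enumerate(joined["genres"].unique(), start=1)
}

# Vectorized lookup: one dictionary lookup per row, no row-by-row assignment.
joined["genre_id"] = joined["genres"].map(genre_dict)

# Alternative: factorize numbers values in order of first appearance,
# starting at 0, so 1 is added to match the dictionary above.
codes, uniques = pd.factorize(joined["genres"])
joined["genre_code"] = codes + 1
```

Both columns end up identical here because unique() and factorize() both number the genres in order of first appearance.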
I don't have enough reputation to comment, so I'm posting this as an answer. This is a suggestion for the right way to handle categorical values in a dataset: you can use the built-in sklearn.preprocessing.OneHotEncoder class, which does the work you want.
For a better understanding with examples, check "One Hot Encode Sequence Data in Python". Let me know if this works for you.
