Adding numerical values from dict to a new column in a Pandas DataFrame - python-3.x

I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (there are around 1300 of them). My logic was to create a dictionary of the unique genres, labelling each with an integer, and then use a for loop to iterate through each row of the dataframe, check its genre, and assign the appropriate value to a new column named "genre_id". However, this has been causing an infinite loop that I cannot break even with Ctrl-C. I have the same issue in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data,ratings_data, how = 'inner', on=['movieId'])
print(joined.head())
pd.options.display.float_format = '{:,.2f}'.format
genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1
joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably be taking a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the SettingWithCopyWarning, though this hasn't caused me issues in my other projects. Any thoughts on how to do this would be greatly appreciated.
EDIT: Found a solution using the Series map method:
joined['genre_id'] = joined.genres.map(genre_dict)
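A rough alternative, as a sketch (pandas has a built-in shortcut for exactly this): factorize returns integer codes and the unique genres in one call, so the dictionary loop is not needed.
# Sketch: factorize assigns a 0-based integer code to each unique genre
codes, uniques = pd.factorize(joined['genres'])
joined['genre_id'] = codes + 1  # +1 only to mimic the 1-based IDs from the dictionary above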

I don't have enough reputation to comment, so I am posting this as a suggestion: the standard procedure for handling categorical values in a dataset is to use the built-in sklearn.preprocessing.OneHotEncoder, which does the work you wanted to do.
For a better understanding with examples, check the article One Hot Encode Sequence Data in Python. Let me know if this works for you.
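A minimal sketch of what that could look like with the column from the question (the exact options are an assumption on my part):
from sklearn.preprocessing import OneHotEncoder

# One binary column per unique genre string; fit_transform returns a sparse matrix
encoder = OneHotEncoder()
genre_onehot = encoder.fit_transform(joined[['genres']])
print(encoder.categories_[0][:5])  # the genre labels behind the first few output columns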

Related

Stuck using pandas to build RPG item generator

I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item I intend to add effects that would change the damage, armor, and so on that get displayed. So I was thinking of having the new item be its own list and then using damage = item[2] + 3 or whatever I need.
The error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
here is my code so far:
import pandas as pd
import random
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
def get_item():
    item_class = [random.choices(df, weights=(45,40,15), k=1)]  # this part seemed to work. When I printed item_class it printed one of the entire lists at the correct odds
    item = item_class.sample()
    print(item)  # to see if the program is working
get_item()
I think you are getting slightly confused between lists and list elements. This should work; I stubbed your dfs with simple ones:
import pandas as pd
import random
# Actual data. Comment it out if you do not have the csv files
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
# My stubs -- uncomment and use this instead of the line above if you want to run this specific example
# df = [pd.DataFrame({'weapons' : ['w1','w2']}), pd.DataFrame({'armor' : ['a1','a2', 'a3']}), pd.DataFrame({'aether' : ['e1','e2', 'e3', 'e4']})]
def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45,40,15), k=1)
    # I added [0] to choose the first element of item_class, which is a list of length 1 from the line above
    item = item_class[0].sample()
    print(item)  # to see if the program is working
get_item()
This prints random rows from the random dataframes that I set up, such as:
weapons
1 w2
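As for using the selected item afterwards (a sketch of my own, assuming the csv files have columns such as damage): .sample() returns a one-row DataFrame, so the stats can be read by column name instead of positional list indexing.
# Sketch: turn the sampled one-row DataFrame into a Series and read stats by name
item = item_class[0].sample().iloc[0]
# damage = item['damage'] + 3   # assumes a 'damage' column exists in weapons.csv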

Fuzzywuzzy match 2 columns... script keeps running

I'm trying to match 2 columns of ~50.000 instances with Fuzzywuzzy.
Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.
I'm trying to match the typo ones with correct ones. When running my script below, the kernel keeps executing for hours & doesn't provide a result.
Any ideas on how to improve?
Many thanks!
Update link to files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
There are a couple of things you can change to improve the performance:
Use Rapidfuzz instead of Fuzzywuzzy, since it implements the same algorithms, but is quite a bit faster (I am the author)
The process functions preprocess all strings you pass to them (lowercase them, remove non-alphanumeric characters, and trim whitespace). Right now you're preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time and could instead be done once before the loop
You're only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable, and inside extractOne rapidfuzz uses the results of previous comparisons to improve performance
The following snippet implements these changes for your code. Keep in mind that you're still performing 50k^2 comparisons, so while this should be a lot faster than your current solution, it will still take a while.
import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
company_mapping = {company: utils.default_process(company) for company in correct.Correct}
for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
Out of interest I performed a quick benchmark calculating the average runtime when using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).
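One further option, as an addition to the answer above: extractOne accepts a score_cutoff, and rapidfuzz can then abandon weak candidates early, which may shave the runtime down further when only strong matches matter.
# Sketch: discard weak candidates early; extractOne returns None if nothing reaches the cutoff
result = process.extractOne(
    utils.default_process(customer),
    company_mapping,
    processor=None,
    score_cutoff=80)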

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and also attempted to use slicing to only look at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re
purch1 = pd.read_csv('purchases.csv')
#Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, axis=1)
#break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01, here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with... The full record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can add parse_dates to read_csv to convert the column to datetimes, and then create a dictionary of DataFrames, dfs; selection is done by key:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2016 = dfs[2016]
It is not recommended, but it is possible to create DataFrames by year groups:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g
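A small usage note on the dictionary approach (my addition, not part of the original answer): it is easy to check which years are present before indexing, and dfs.get avoids a KeyError for a missing year.
# Sketch: inspect the available years and select one defensively
print(sorted(dfs.keys()))
Ord2016 = dfs.get(2016)   # None instead of a KeyError if 2016 is absent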

how to use pandas to organize sales data into 12 months and find the 10 most profitable products for those 12 months?

I need to write a program that organizes the data in the provided spreadsheet to find the top 10 most profitable products for each month. The program needs to take an input from the user to specify the year in which to compile the data.
I've gotten as far as printing all of the products sold in each month by their highest profitability, but I don't know how to make it print only the top 10 for each month.
I'm also lost on how to take an input from the user to select only a certain year for the program to compile the data.
Please help.
the link to download the files for my project: https://drive.google.com/drive/folders/1VkzTWydV7Qae7hOn6WUjDQutQGmhRaDH?usp=sharing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
xl = pd.ExcelFile("SalesDataFull.xlsx")
OrdersOnlyData = xl.parse("Orders")
df_year = OrdersOnlyData["Order Date"].dt.year
OrdersOnlyData["Year"] = df_year
df_month = OrdersOnlyData["Order Date"].dt.month
OrdersOnlyData["Month"] = df_month
dataframe = OrdersOnlyData[["Year","Month","Product Name","Profit"]]
month_profit = dataframe.groupby(["Year","Month","Product Name"]).Profit.sum().sort_values(ascending=False)
month_profit = month_profit.reset_index()
month_profit = month_profit.sort_values(["Year","Month","Profit"],ascending=[True,True,False])
print(month_profit)
As #Franco pointed out, it is difficult to recommend a proper solution since you did not provide a data sample together with your question. In any case, the function that you are looking for is most likely nth().
This is probably how it should look:
month_profit = (month_profit.sort_values('Profit', ascending=False)
                .groupby(['Year', 'Month'])
                .nth(list(range(10)))
                .sort_values(by=['Year', 'Month', 'Profit'], ascending=[True, True, False]))
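An arguably simpler spelling, as a sketch of my own (assuming the goal is just the 10 highest-profit rows within each Year/Month group), is groupby().head(10) after sorting; the user-supplied year from the question can then be applied as a plain filter:
# Sketch: top 10 rows per (Year, Month) group by Profit
top10 = (month_profit
         .sort_values(['Year', 'Month', 'Profit'], ascending=[True, True, False])
         .groupby(['Year', 'Month'], sort=False)
         .head(10))

# Hypothetical handling of the user-chosen year
year = int(input("Enter the year to compile: "))
print(top10[top10['Year'] == year])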

Pandas - iterating to fill values of a dataframe

I'm trying to build a data-frame of time series data. I have to retrieve the data from an API and every (i,j) entry in the data-frame (where "i" is the row and "j" is the column) has to be iterated through and filled individually.
Here's an idea of the type of thing I'm trying to do (note the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time
def retrievedata(string):
    # take string
    # do some stuff with api
    # return float
    ...
label_list = ['label1','label1','label1', etc...]
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=(i for i in label_list))
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now, I want to iterate over it and assign values to every (i,j) entry in the data frame based on a function I wrote to pull data. Note that the function I wrote has to be specific to a certain column (as it takes the column label as input). And on top of that, each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-pythonic, non-numpy, non-pandas way of doing things, so I'd like to find a nicer, more efficient, less computationally intensive way of doing this.
I'm assuming it's going to have to be some combination of: iterrows(), itertuples(), df.loc[], df.at[]
I'm stumped though.
Any ideas?
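One possible direction, as a sketch (assuming retrievedata(label) returns a single float per call, as in the stub above): build the frame from comprehensions rather than writing each cell with df.at.
# Sketch: one dict per time step keeps the API calls interleaved row by row,
# and the DataFrame is built in a single constructor call
rows = [{label: retrievedata(label) for label in label_list}
        for _ in range(discrete_points)]
df = pd.DataFrame(rows)
The total number of API calls is the same either way, so the real cost is still the network round-trips; if the API offers a bulk endpoint, requesting several labels per call would help far more than any pandas-side change.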
