Merge pandas DataFrames (i.e. Orders and Inventory), then re-adjust Inventory - python-3.x

I have an order list with a separate inventory system (Google Sheets). Using Pandas, I'm trying to merge the two for an efficient "pick list" and have had some mild success. However, in testing (adding multiple quantities for an order, having multiple orders with the same item/SKU type) it starts breaking down.
from io import StringIO
import pandas as pd

orders = "orderNumber,SKU,Quantity\r\n11111,GreenSneakers,2\r\n11111,Brown_Handbag,1\r\n22222,GreenSneakers,1\r\n33333,Blue_Handbag,1"
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")
inventory = "SKU,Location\r\nGreenSneakers,DA13A\r\nGreenSneakers,DA13A\r\nRed_Handbag,DA12A\r\nGreenSneakers,DB34C\r\nGreenSneakers,DB33C\r\n"
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)
df_pList = df_orders.merge(df_inventory.drop_duplicates(subset=['SKU']), on='SKU', how='left')
print(df_pList)
pseudo desired output:
'
orderNumber, SKU, Quantity, Location
11111, GreenSneakers, 1, DB34C
11111, GreenSneakers, 1, DB33C
11111, Brown_Handbag, 1, NA
22222, GreenSneakers, 1, DA13A
33333, Blue_Handbag, 1, NA
'
Is merge even a way to solve this type of problem? I'm trying to stay away from looping if possible.

Below makes three dataframes.
df_pickList is what you were asking to make
copy_inventory contains what inventory would be if you picked everything (in case you want to write the DataFrame back out to overwrite your inventory file). You could elect not to make the copy and draw down df_inventory directly, but especially in beta it's handy to keep a copy for manipulation.
df_outOfStock is a handy bucket to catch things you don't have in inventory. Cross-check it against current orders to see what you need to order.
from io import StringIO
import pandas as pd

orders = """orderNumber,SKU,Quantity
11111,GreenSneakers,2
11111,Brown_Handbag,1
22222,GreenSneakers,1
33333,Blue_Handbag,1
"""
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")

inventory = """SKU,Location
GreenSneakers,DA13A
GreenSneakers,DA13A
Red_Handbag,DA12A
GreenSneakers,DB34C
GreenSneakers,DB33C
"""
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)

# DataFrame.append was removed in pandas 2.0, so rows are collected in
# plain lists and turned into DataFrames at the end
outOfStock_rows = []  # rows to report a lack of stock
pickList_rows = []    # rows for the pick list
copy_inventory = df_inventory.copy()  # copy of inventory to draw down
for orderIndex, orderLineItem in df_orders.iterrows():
    # inventory is one row per physical item, so do Quantity picks per order line item
    for repeat in range(orderLineItem["Quantity"]):
        availableInventory = copy_inventory.loc[copy_inventory["SKU"] == orderLineItem["SKU"]]
        if len(availableInventory) == 0:
            # failed to find any stock of this SKU to pull
            outOfStock_rows.append(orderLineItem)
        else:
            pickList_rows.append({"orderNumber": orderLineItem["orderNumber"],
                                  "SKU": orderLineItem["SKU"],
                                  "Quantity": 1,
                                  "Location": availableInventory.iloc[0]["Location"]})
            copy_inventory.drop(index=availableInventory.index[0], inplace=True)
df_outOfStock = pd.DataFrame(outOfStock_rows)
df_pickList = pd.DataFrame(pickList_rows)
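For reference, since the question asks about avoiding loops: a hedged, loop-free sketch of the same pick logic. It expands each order line into one row per unit, then pairs units with inventory rows by rank within each SKU (relying on the Location sort above); the pick helper column name is my own.

units = df_orders.loc[df_orders.index.repeat(df_orders["Quantity"])].copy()
units["Quantity"] = 1  # one row per physical unit ordered
units["pick"] = units.groupby("SKU").cumcount()  # rank units within each SKU
inv = df_inventory.copy()
inv["pick"] = inv.groupby("SKU").cumcount()      # rank inventory rows the same way
df_pickList = units.merge(inv, on=["SKU", "pick"], how="left").drop(columns="pick")

Units with no matching inventory row come back with Location = NaN, matching the NA rows in the desired output.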
Thanks, this was a fun little exercise compared to dealing with non-integer quantities (e.g. feet of angle iron).
(Original wrong answer below)
I would recommend concatenating the rows into a single table (not merging and/or overwriting values), then using group by to allow the aggregation of values.
As a primer, I would start with these two links on putting your data together:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby

Related

Iterating over an API response breaks for only one column in a Pandas dataframe

Problem: In my dataframe, when looping through zip codes in a weather API, I am getting the SAME value for the column "Desc": every row reads "cloudy", which is incorrect for some zip codes. I think it is taking the value from the very last zip code in the list and applying it to every row in the Desc column.
But if I run only zip code 32303 and comment out all the other zip codes, the value for "Desc" is correct: it is now listed as sunny/clear, which proves the values are incorrect when looping.
Heck, it's Florida! ;)
Checking other weather sources I know "sunny/clear" is the correct value for 32303, not "cloudy". So for some reason, iterating is breaking on the column Desc only. I've tried so many options and am just stuck. Any ideas how to fix this?
import requests
import pandas as pd

api_key = 'a14ac278e4c4fdfd277a5b37e1dbe87a'
# Create a dictionary of zip codes for the team
zip_codes = {
    55446: "You",
    16823: "My Boo",
    94086: "Your Boo",
    32303: "Mr. Manatee",
    95073: "Me"
}
# Create a list of zip codes
zip_list = list(zip_codes.keys())
# Create a list of names
name_list = list(zip_codes.values())
# For team data, create a pandas DataFrame from the dictionary
df1 = pd.DataFrame(list(zip_codes.items()),
                   columns=['Zip Code', 'Name'])
# Create empty lists to hold the API response data
city_name = []
description = []
weather = []
feels_like = []
wind = []
clouds = []
# Loop through each zip code
for zip_code in zip_list:
    # Make a request to the OpenWeatherMap API
    url = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}"
    response = requests.get(url).json()
    # Store the response data in the appropriate empty list
    city_name.append(response['name'])
    description = response['weather'][0]['main']
    weather.append(response['main']['temp'])
    feels_like.append(response['main']['feels_like'])
    wind.append(response['wind']['speed'])
    clouds.append(response['clouds']['all'])
    # rain.append(response['humidity']['value'])
# For weather data, create df from lists
df2 = pd.DataFrame({
    'City': city_name,
    'Desc': description,
    'Temp (F)': weather,
    'Feels like': feels_like,
    'Wind (mph)': wind,
    'Clouds %': clouds,
    # 'Rain (1hr)': rain,
})
# Merge df1 & df2, round decimals, and don't display index or zip.
df3 = pd.concat([df1, df2], axis=1, join='inner').drop('Zip Code', axis=1)
df3[['Temp (F)', 'Feels like', 'Wind (mph)', 'Clouds %']] = df3[['Temp (F)', 'Feels like', 'Wind (mph)', 'Clouds %']].astype(int)
# Don't truncate df
pd.set_option('display.width', 150)
# Print the combined DataFrames
display(df3.style.hide_index())
Example output; note that every row of "Desc" has the same value "Clouds", but I know that is not correct because some should differ:
Name City Desc Temp (F) Feels like Wind (mph) Clouds %
You Minneapolis Clouds 1 -10 12 100
My Boo Bellefonte Clouds 10 -1 15 100
Your Boo Sunnyvale Clouds 54 53 6 75
Mr. Manatee Tallahassee Clouds 49 49 3 0
Me Soquel Clouds 53 52 5 100
For example, if I comment out all the zip codes except for 32303: "Mr. Manatee", then I get a different value:
Name City Desc Temp (F) Feels like Wind (mph) Clouds %
Mr. Manatee Tallahassee Clear 49 49 3 0
To solve this, I tried another approach, below, which DOES give correct values for each zip code. The problem is that several of the columns are json values, and if I can't fix the code above, then I need to parse them and show only the relevant values. But my preference would be to fix the code above!
import requests
import pandas as pd
import json

zip_codes = {
    95073: "Me",
    55446: "You",
    16823: "My Boo",
    94086: "Your Boo",
    32303: "Mr. Manatee"
}
# Create a list of zip codes
zip_list = list(zip_codes.keys())
# Create a list of names
name_list = list(zip_codes.values())
# Create a list of weather data
weather_list = []
# Set the API key
api_key = 'a14ac278e4c4fdfd277a5b37e1dbe87a'
# Get the weather data from the openweather API
for zip_code in zip_list:
    api_url = f'http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}'
    response = requests.get(api_url).json()
    weather_list.append(response)
# Create the dataframe
df = pd.DataFrame(weather_list)
# Add the name column
df['Name'] = name_list
# Parse the 'weather' column
# THIS DOESN'T WORK! df.weather.apply(lambda x: x[x]['main'])
# Drop unwanted columns (assign the result back, or it is discarded)
df = df.drop(['coord', 'base', 'visibility', 'dt', 'sys', 'timezone', 'cod'], axis=1)
I tried a different approach but got unusable json values. I tried various ways to fix looping in my first approach but I still get the same values for "Desc" instead of unique values corresponding to each zip code.
Like jqurious said, you had a bug in your code:
description = response['weather'][0]['main']
This assignment rebinds description on every iteration instead of appending to it, so description ends up holding only the final zip code's value, which is then repeated across the whole dataframe. No wonder they are all the same.
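The minimal fix is to append to the list instead of rebinding the name:

description.append(response['weather'][0]['main'])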
Since you are collecting data to build a dataframe, it's better to use a list of dictionaries rather than a series of lists:
data = []
for zip_code in zip_list:
    url = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}"
    response = requests.get(url).json()
    data.append({
        "City": response["name"],
        "Desc": response["weather"][0]["main"],
        "Temp (F)": response["main"]["temp"],
        "Feels like": response["main"]["feels_like"],
        "Wind (mph)": response["wind"]["speed"],
        "Clouds %": response["clouds"]["all"]
    })
# You don't need to redefine the column names here
df2 = pd.DataFrame(data)
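As an aside, the commented-out lambda in your second approach fails because x[x] tries to index the list with itself. Each cell of the raw weather column is a list containing one dict, so take element 0 and then the 'main' key (a sketch against the second approach's df; the guard assumes some responses could carry an empty list):

df['Desc'] = df['weather'].apply(lambda x: x[0]['main'] if x else None)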

Any optimized way to iterate Excel and provide data to pd.read_sql() as a string, one by one

# Here I have to apply the loop which can provide me the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating the DataFrame)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the openpyxl library, iterate over rows and columns, and then provide the values? Hope I am clear with the question; if in any doubt, please comment.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine

# add user, pass, host, and database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = ''  # placeholder column
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
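Note the question runs the two queries on different connections (con1 and con2), presumably one per database. In that case, create one engine per source and pass the matching engine to each read_sql; the dialect strings below are assumptions to adapt to your actual databases:

con1 = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST1}/{DB1}")  # assumed source for SQLQuery
con2 = create_engine(f"oracle+cx_oracle://{USER}:{PWD}@{HOST2}:1521/?service_name={SERVICE}")  # assumed Oracle source
# inside the loop above:
df1 = pd.read_sql(row['SQLQuery'], con1)
df2 = pd.read_sql(row['Oracle Queries'], con2)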

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, the decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have three columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by state and decade, assigning the result to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints out all the columns in the file with the same frequencies, whereas I just want three columns: State, Decade, and Victim Count.
You should call reset_index on the groupby result, and then select the columns from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want before grouping:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or select from the grouped result (State and Decade land in the index after groupby, so reset it first):
print(counts.reset_index().loc[:, ['State', 'Decade', 'Victim Count']])
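If the victim count is really just the number of rows per group (one row per victim), a slightly cleaner sketch under that assumption uses size, which also avoids the typo-prone column selection:

counts = df.groupby(['State', 'Decade']).size().reset_index(name='Victim Count')
print(counts)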

Compare column names of multiple pandas dataframes

In the code below, I created a list of dataframes. Now I want to check that all the dataframes in the dataframes list have the same column names (I just want to compare headers, not values), and if the condition is not met, it should error out.
import os
import pandas as pd

dataframes = []
list_of_files = os.listdir(os.path.join(folder_location, quarter, "inputs"))
for files in list_of_files:
    # sheet_name and usecols are the current spellings of the old
    # sheetname/parse_cols arguments
    df = pd.read_excel(os.path.join(folder_location, quarter, "inputs", files),
                       header=[0, 1], sheet_name="Ratings Inputs",
                       usecols="B:AC", index_col=None).reset_index()
    df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)]
                                           + list(df.columns)[1:])
    dataframes.append(df)
Not the most elegant solution, but it will get you there:
import numpy as np

np.all([sorted(dataframes[0].columns) == sorted(i.columns) for i in dataframes])
sorted serves two purposes here: it converts the column indexes to plain lists, and it keeps the comparison from failing just because the columns are in a different order.
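Since you want the script to error out when the check fails, a minimal sketch wrapping the same comparison in an exception (the message text is my own):

base_cols = sorted(dataframes[0].columns)
for df in dataframes[1:]:
    if sorted(df.columns) != base_cols:
        raise ValueError("Input files do not all have the same column headers")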

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is for each combination of UID and UID2 check if there is both a row with EventType = A and EventType = B, and then calculate the time difference, and then add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, I create a helper dataframe that pivots EventType into columns:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
Finally, I take the diff and merge it back onto df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
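The merged TimeDiff comes back as a timedelta; to get the plain minutes shown in the desired output, one extra hedged step (assuming the merge result is assigned to a variable):

result = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index(), how='left')
result['TimeDiff'] = result['TimeDiff'].dt.total_seconds() / 60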
