I have a data frame with 3 columns: journal, abstract, and word count. I'm trying to calculate the average word count of the abstract for each journal and then order them in descending order to show the journals with the longest average abstract.
I've managed to get to a point where I have just the journal and average word count. That was done by doing:
import pyspark.sql.functions as sql
from pyspark.sql.functions import col, lit

newDF = marchDF.select("journal", "abstract").withColumn("wordcount", lit("0").cast("integer")).withColumn("wordcount", sql.size(sql.split(sql.col("abstract"), " ")))
nonullDF = newDF.filter(col("journal").isNotNull()).filter(col("abstract").isNotNull())
groupedDF = nonullDF.select("journal", "wordcount").groupBy("journal").avg("wordcount")
however, when I try to order it by wordcount, it throws the error:
cannot resolve '`wordcount`' given input columns: [avg(wordcount), journal];;
I've tried:
orderedDF = groupedDF.orderBy(col("wordcount")).desc().show(5)
and:
orderedDF = groupedDF.sort(col("wordcount")).desc.show(5)
but both throw that same error and I don't understand why.
That's because, as the error says, there is no column named wordcount. The column you want to order by is called avg(wordcount), so you can do:
orderedDF = groupedDF.orderBy("avg(wordcount)", ascending=False).show(5)
Alternatively, you can rename the avg column to wordcount during the aggregation:
import pyspark.sql.functions as F
groupedDF = nonullDF.select("journal", "wordcount").groupBy("journal").agg(F.avg("wordcount").alias("wordcount"))
orderedDF = groupedDF.orderBy("wordcount", ascending=False).show(5)
Note also the correct syntax for ordering in descending order.
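If you prefer the desc() style from your attempts, desc() is a method on the column expression, not on the DataFrame, so it has to go inside orderBy. A minimal sketch using the aliased groupedDF from the alternative above:
from pyspark.sql.functions import col

# desc() is called on the column inside orderBy, not on the result of orderBy
groupedDF.orderBy(col("wordcount").desc()).show(5)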
I am new to Python, please help me with the problem below.
I have a dictionary as below:
city = {"AP":"VIZAG","TELANGANA":"HYDERABAD"}
and I also have a list, which I need to loop over for all state tables, as below:
states=['AP','HYDERABAD']
for st in states:
    df = spark.sql(f"""select * from {st} where city = {city}["{st}"]""")
In the above df I am trying to filter the city based on the dictionary value for each state, but I am not able to do it.
New answer
By combining two filter conditions you can do the expected filtering.
import pyspark.sql.functions as F

selected_state = 'AP'
df = df.filter(
    (F.col('state') == selected_state)
    & (F.col('city') == city[selected_state])
)
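If you do need to keep building the SQL string as in your loop, a minimal sketch is below; it assumes each state has its own table and that the list you loop over only contains keys of the city dict (names mirror the question):
states = list(city.keys())  # ['AP', 'TELANGANA']
for st in states:
    # interpolate the dict lookup as a quoted SQL string literal
    df = spark.sql(f"select * from {st} where city = '{city[st]}'")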
Old answer
It is a simple change: You can use isin to filter a column based on a list [Docs].
cities = list(city.keys())
df = df.filter(F.col('city').isin(cities))
If you want to construct more complex conditions based on a dictionary see this question.
[Edit] Updated answer based on the OP's comment. I'll leave the old one in there for completeness.
Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
import sys
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)

test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that I have some_dict with values either as test1 or test2 and based on the value, I either take the first or last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows down to 5.
Let's assume your dataframe is named df and contains duplicates; use the method below.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10, False)
Change the orderBy condition if there is a specific criterion, so that the particular record you want to keep ends up at the top of its partition.
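If you would rather stay close to the first/last window expressions from the question, one possible sketch (assuming some_dict maps column names to either 'test1' or 'test2') is to select the window columns and then drop duplicates, since each expression already yields a single value per id:
import sys
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)

# build one aliased column per key, taking first or last depending on some_dict
agg_cols = [
    (F.first(k, True) if v == 'test1' else F.last(k, True)).over(w).alias(k)
    for k, v in some_dict.items()
]

# the window already computes one value per id, so distinct() reduces n rows per id to 1
result = df.select('id', *agg_cols).distinct()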
I have an order list with a separate inventory system (Google Sheets). Using Pandas, I'm trying to merge the two for an efficient "pick list" and have had some mild success. However, in testing (adding multiple quantities for an order, having multiple orders with the same item/SKU type) it starts breaking down.
orders = "orderNumber,SKU,Quantity\r\n11111,GreenSneakers,2\r\n11111,Brown_Handbag,1\r\n22222,GreenSneakers,1\r\n33333,Blue_Handbag,1"
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")
inventory = "SKU,Location\r\nGreenSneakers,DA13A\r\nGreenSneakers,DA13A\r\nRed_Handbag,DA12A\r\nGreenSneakers,DB34C\r\nGreenSneakers,DB33C\r\n"
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)
df_pList = df_orders.merge(df_inventory.drop_duplicates(subset=['SKU']), on='SKU', how='left')
print(df_pList)
pseudo desired output:
'
orderNumber, SKU, Quantity, Location
11111, GreenSneakers, 1, DB34C
11111, GreenSneakers, 1, DB33C
11111, Brown_Handbag, 1, NA
22222, GreenSneakers, 1, DA13A
33333, Blue_Handbag, 1, NA
'
Is Merge even a way to solve this type of a problem? Trying to stay away from looping if possible.
The code below makes three dataframes:
df_pickList is what you were asking to make.
copy_inventory contains what inventory would be if you pick everything (in case you want to just write the DataFrame out to overwrite your inventory file). You could elect not to make the copy and use your df_inventory directly, but especially in beta it's handy to make a copy for manipulation.
df_outOfStock is a handy bucket to catch things you don't have in inventory. Cross-check it against current orders to see what you need to get on order.
from io import StringIO
import pandas as pd
import copy
orders = """orderNumber,SKU,Quantity
11111,GreenSneakers,2
11111,Brown_Handbag,1
22222,GreenSneakers,1
33333,Blue_Handbag,1
"""
str_orders = StringIO(orders, newline='\n')
df_orders = pd.read_csv(str_orders, sep=",")
inventory = """SKU,Location
GreenSneakers,DA13A
GreenSneakers,DA13A
Red_Handbag,DA12A
GreenSneakers,DB34C
GreenSneakers,DB33C
"""
str_inventory = StringIO(inventory, newline='\n')
df_inventory = pd.read_csv(str_inventory, sep=",")
df_inventory = df_inventory.sort_values(by='Location', ascending=False)
df_outOfStock = pd.DataFrame()  # placeholder to report a lack of stock
df_pickList = pd.DataFrame()    # placeholder to build the pick list
copy_inventory = copy.deepcopy(df_inventory)  # make a copy of inventory to decimate

for orderIndex, orderLineItem in df_orders.iterrows():
    # since inventory location is 1 row per item, we need to do that many picks per order line item
    for repeat in range(orderLineItem["Quantity"]):
        availableInventory = copy_inventory.loc[copy_inventory["SKU"] == orderLineItem["SKU"], :]
        if len(availableInventory) == 0:
            # failed to find any stock for this SKU
            # DataFrame.append was removed in pandas 2.x, so use pd.concat instead
            df_outOfStock = pd.concat([df_outOfStock, orderLineItem.to_frame().T], ignore_index=True)
        else:
            pickRow = {"orderNumber": orderLineItem["orderNumber"],
                       "SKU": orderLineItem["SKU"],
                       "Quantity": 1,
                       "Location": availableInventory.iloc[0]["Location"]}
            df_pickList = pd.concat([df_pickList, pd.DataFrame([pickRow])], ignore_index=True)
            copy_inventory.drop(index=availableInventory.index[0], inplace=True)
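After the loop you can inspect the three frames, for example:
print(df_pickList)
print(df_outOfStock)
print(copy_inventory)  # stock remaining after all picks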
Thanks, this was a fun little exercise compared to dealing with non-integer quantities (e.g. feet of angle iron).
(Original wrong answer below)
I would recommend concatenating the rows into a single table (not merging and/or overwriting values), then using group by to allow the aggregation of values.
As a primer, I would start with these two links on putting your data together:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby
I have a fake dataset representing a list of areas. These areas contain members, and each member has a value.
I would like to count for each area, the number of unique members whose value satisfies a condition. I managed to deal with the issue but I would like to know if there is a cleaner way to do so in Pandas.
Here is my attempt so far:
import pandas as pd

# Building the fake dataset
dummy_dict = {
    "area": ["A", "A", "A", "A", "B", "B"],
    "member": ["O1", "O2", "O2", "O3", "O1", "O1"],
    "value": [90, 200, 200, 150, 120, 120]
}
df = pd.DataFrame(dummy_dict)
# Counting the number of unique members that satisfy the condition by zone
value_cutoff = 100
df["nb_unique_members"] = df.groupby("area")["member"].transform("nunique")
df.loc[df["value"]>=value_cutoff,"tmp"] = df.loc[df["value"]>=value_cutoff].groupby("area")["member"].transform("nunique")
df["nb_unique_members_above_cutoff"] = df.groupby("area")["tmp"].transform("mean")
df.head()
Is there a better way to do this in Pandas? Thanks in advance!
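For what it's worth, one shorter sketch (assuming the column names above): filter first, count the unique members per area, then map the counts back onto the frame:
above_cutoff = df[df["value"] >= value_cutoff].groupby("area")["member"].nunique()
df["nb_unique_members_above_cutoff"] = df["area"].map(above_cutoff).fillna(0).astype(int)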