dynamically create columns in pyspark sql select statement - python-3.x

I have a pyspark dataframe called unique_attributes. The dataframe has the columns productname, productbrand, producttype, weight, and id. I am partitioning by some columns and trying to get the first value of the id column using a window function. I would like to be able to dynamically pass a list of columns to partition by, so that, for example, if I wanted to add the weight column to the partition I could just pass a list instead of coding another col('weight') into the select. Does anyone have a suggestion for how to accomplish this? I have an example below.
current code:
w2 = Window().partitionBy(['productname',
                           'productbrand',
                           'producttype']).orderBy(unique_attributes.id.asc())
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            first("id", True).over(w2).alias('matchid')).distinct()
desired dynamic code:
column_list = ['productname',
               'productbrand',
               'producttype',
               'weight']

w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())

# somehow creates
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            col('weight'),
            first("id", True).over(w2).alias('matchid')).distinct()
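One way that should work, as a minimal sketch (not from the original post; it assumes the import shown below and that every name in column_list is a column of unique_attributes), is to build the col() expressions from the same list with a comprehension and unpack them into select:

from pyspark.sql.functions import col, first

# Reuse the column_list and w2 defined above; the comprehension produces one
# col() expression per partition column, and * unpacks them into select()
# alongside the windowed first("id") expression.
first_item_id_df = unique_attributes \
    .select(*[col(c) for c in column_list],
            first("id", True).over(w2).alias('matchid')) \
    .distinct()

Adding weight (or any other column) then only requires changing column_list.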

Related

How can I reset the index in a .groupby output?

I have the following dataframe. I want to group the name and brand columns by their respective unique values.
[image: dataframe]
I wrote the following python code to group them:
high_products = products.reset_index().groupby(['name', 'brand'])[['name', 'brand', 'count_name']]
and printed the output with the following code:
high_products.head().sort_values(by='count_name', ascending=False)
[image: grouped]
As you can see from the image above, it appears to be grouping by the index as well. In the end, I just want the unique name and brand values and their respective count_name counts.
How can I do this with .groupby?
Thank you.
Change the first line of your code to the following (a count aggregation on name and brand):
high_products = products.groupby(['name', 'brand']).agg(['count']).reset_index()
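For illustration, a minimal sketch with a made-up products frame (the data is hypothetical; the column names follow the question):

import pandas as pd

products = pd.DataFrame({
    'name':       ['soap', 'soap', 'shampoo'],
    'brand':      ['acme', 'acme', 'brite'],
    'count_name': [1, 1, 1],
})

# One row per unique (name, brand) pair, with the count of grouped rows.
high_products = products.groupby(['name', 'brand']).agg(['count']).reset_index()
print(high_products)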

How to compute unique values for a group of rows and create a column for all records using that value?

I have a dask dataframe with only 'Name' and 'Value' columns, similar to the table below.
How do I compute the 'Average' column? I tried groupby in dask, but that just gives me a dataframe of two records containing the averages of A and B.
You can just left join your original table with the new one on Name. From https://docs.dask.org/en/latest/dataframe-joins.html:
small = small.repartition(npartitions=1)
result = big.merge(small)
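Putting the two steps together, a rough sketch (the data here is made up; it assumes a dask dataframe with the question's 'Name' and 'Value' columns):

import pandas as pd
import dask.dataframe as dd

big = dd.from_pandas(
    pd.DataFrame({'Name': ['A', 'A', 'B', 'B'], 'Value': [1, 3, 10, 30]}),
    npartitions=2)

# The small, two-row result the question already has: one average per Name.
small = big.groupby('Name')['Value'].mean().reset_index()
small = small.rename(columns={'Value': 'Average'})

# Left join it back onto the original frame, as in the dask docs snippet above.
small = small.repartition(npartitions=1)
result = big.merge(small, on='Name', how='left')
print(result.compute())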

How to insert a new column to a specific index to a Dask DataFrame?

With pandas, I can insert a new column to a specific location like below:
df_all.insert(loc=10, column="label", value=label_column, allow_duplicates=True)
How can I add a new column to a specific location with dask? (to a dask dataframe)
Naively, I would add a column
df["label"] = label_column
And then I would rearrange columns with another getitem call
df = df[[..., "label", ...]]
There might be a cleaner way to do this, but this should work fine.
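A concrete sketch of that idea (the column names and target position are made up, just to show the reorder on a small dask dataframe):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}),
                    npartitions=1)

# Stand-in for the question's label_column.
df['label'] = df['a'] % 2

# Rebuild the column order with "label" at the desired position (index 1 here).
cols = list(df.columns)
cols.insert(1, cols.pop(cols.index('label')))
df = df[cols]
print(df.compute())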

Spark: Filter & withColumn using row values?

I need to create a column called sim_count for every row in my spark dataframe, whose value is the count of all other rows from the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using when?
Is something like this possible? I have already implemented this logic using a UDF, but serialization of the dataframe's rdd map is very costly and I am trying to see if there is a faster alternative to find this count value.
Edit
<Row's col_1 val> refers to the outer-scope row I am calculating the count for, NOT the inner-scope row inside the df.where. For example, I know this is incorrect syntax, but I'm looking for something like:
df.withColumn('sim_count',
    f.when(
        f.col("col_1").isNotNull(),
        (
            df.where(
                f.col("price_list").between(f.col("col1"), f.col("col2"))
            ).count()
        )
    ).otherwise(f.lit(None).cast(LongType()))
)
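One UDF-free pattern that may fit here (not from the post; a rough sketch in which the join condition is only a hypothetical stand-in for the real matching rules) is a conditional self-join followed by an aggregation:

from pyspark.sql import functions as f

# Give each row a key to group the match counts back onto.
left = df.withColumn('row_id', f.monotonically_increasing_id())
right = df.select(f.col('price_list').alias('other_price_list'))

sim_counts = (left
              .join(right,
                    f.col('other_price_list').between(f.col('col1'), f.col('col2')),
                    how='left')
              .groupBy('row_id')
              .agg(f.count('other_price_list').alias('sim_count')))

result = left.join(sim_counts, on='row_id', how='left')

Rows with no match keep sim_count = 0 thanks to the left join, and everything stays in the dataframe API, avoiding the costly UDF serialization.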

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format as the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format as the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe's format. Isn't there some easy way to do this that is guaranteed to produce the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series from the for loop back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But for some moronic (though I'm sure Pythonic) reason it transposes it, so that the previous row is now the column and the previous columns are now multiple rows! So you just transpose it back.
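A tiny illustration of that transpose behaviour (made-up row data):

import pandas as pd

row = pd.Series({'symbol': 'AAPL', 'shares': 10}, name='2016-01-04')
print(row.to_frame())    # one column ('2016-01-04'); the old columns are rows
print(row.to_frame().T)  # one row again, with 'symbol' and 'shares' as columns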
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over the groups, and each group is a dataframe. So if you group by a value that is unique for each and every row, you'll get a single-row dataframe every time you iterate over a group:
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    pass  # do stuff with the single-row dataframe `group`
If I can suggest an alternative way, then it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including the index and column names.
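For example (hypothetical orders frame), each slice keeps the parent's columns and index:

import pandas as pd

orders = pd.DataFrame({'symbol': ['AAPL', 'MSFT'], 'shares': [10, 5]},
                      index=['2016-01-04', '2016-01-05'])

for index, order in orders.iterrows():
    row_df = orders.loc[index:index]            # one-row DataFrame, same columns
    print(type(row_df).__name__, row_df.shape)  # DataFrame (1, 2)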
