Apply custom function to groupby in vaex - python-3.x

I want to apply some custom logic to each individual group obtained by groupby. It is easy to do in pandas. How can I apply a custom function to the groups created by groupby in vaex?
For example, suppose I want to find the min index and max index of each group and based on that, do some operation on the rows present in that group.
Is this possible in vaex?

I think vaex intentionally doesn't support this at the moment; see for example this GitHub issue: https://github.com/vaexio/vaex/issues/752.
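For comparison, the pandas pattern the question describes (custom per-group logic based on each group's min and max index) looks like the sketch below; since vaex lacks a per-group apply, a common workaround is to stick to vaex's built-in aggregators or convert the relevant columns to pandas with df.to_pandas_df() first. The column names here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [10, 20, 30, 40, 50],
})

def span(g):
    # custom per-group logic: use the group's min and max index
    # to operate on the rows of that group
    lo, hi = g.index.min(), g.index.max()
    return g.loc[hi, "value"] - g.loc[lo, "value"]

result = df.groupby("group").apply(span)
print(result.to_dict())  # {'a': 10, 'b': 20}
```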

Related

PySpark Design Pattern for Combining Values Based on Criteria

Hi, I am new to PySpark and want to create a function that takes a table of duplicate rows and a dict mapping each field name to a priority list of {"source": ..., "approach": ...} entries as input, and creates a new record. The new record will be equal to the first non-null value in the priority list, where each "approach" is a function.
For example, the input table looks like this for a specific component:
And given this priority dict:
The output record should look like this:
The new record looks like this because, for each field, a selected function dictates how the value is chosen. (E.g. phone is equal to 0.75: Amazon's most complete record is null, so you coalesce to the next approach in the list, which is the value of phone for the most complete record for Google, 0.75.)
Essentially, I want to write a PySpark function that groups by components and then applies the appropriate function for each column to get the correct value. While I have a function that "works", its time complexity is terrible, as I am naively looping through each component, then each column, then each approach in the list to build the record.
Any help is much appreciated!
I think you can solve this using pyspark.sql.functions.when. See this blog post for some complicated usage patterns. You're going to want to group by id, and then use when statements to implement your logic. For example, 'title': {'source': 'Google', 'approach': 'first record'} can be implemented as
from pyspark.sql.functions import col, first, lit, when

df.groupBy('id').agg(
    first(when(col('source') == lit('Google'), col('title')), ignorenulls=True).alias('title')
)
(Note that a bare when(...) is not an aggregate expression, so it needs to be wrapped in an aggregation function like first inside agg.)
'Most recent' and 'most complete' are more complicated and may require some self-joins, but you should still be able to use when clauses to get the aggregates you need.

How to keep the Cognos List structure?

I have a list with some grouped columns and some non-grouped columns, as well as a List Header and Footer.
I also have some automatically created TOTALS for some of the metrics, as well as some manually created totals for other metrics.
QUESTION: I now simply need to REPLACE a metric on the report with another metric. The problem is that the list displays automatic totals for some of the metrics, including the one I need to replace. Will replacing it break (destroy) the structure of my list so that I need to recreate it, or is there a way to replace the metric without affecting my list structure, so I don't need to recreate my totals?
Regards!
I believe you can go to the query for the list.
In the query, find the data item.
Change the expression definition to the new metric you want.
Change the name and label properties too.
This way you do not have to adjust the layout.

Assigning indexes across rows within python DataFrame

I'm currently trying to assign a unique indexer across rows, rather than alongside columns. The main problem is these values can never repeat, and must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an indexer to that, but my concern is that I won't be able to easily modify the dataframe and still preserve the same index values for each cell with this method.
I'm expecting my df to look something like this below:
Sample DataFrame
I haven't yet found a viable solution so haven't got any code to show yet. Any solutions would be much appreciated. Thank you.

Pandas dataframe to array for further use

I've got a dataframe which contains a csv of selling KPIs (quantity, article number, and the corresponding date).
I need to split the dataframe into multiple frames, each containing the data for one article number (e.g. frame1 = 123, frame2 = 345, and so on).
How can I dynamically split it like this for further use in sklearn's KMeans (matching different article numbers and their selling KPIs)?
thanks a lot
thanks a lot
You can group by the article number using groupby.
grouped = df.groupby(['article_number'])
You can then access the individual groups using
grouped.groups
or directly apply aggregation functions like grouped['quantity'].sum() to get a new frame with the respective values for each group.
Also refer to the User Guide.
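As a concrete sketch of the answer above (column names assumed from the question), splitting into one frame per article number can be done by iterating over the groupby object, which yields (key, sub-frame) pairs:

```python
import pandas as pd

df = pd.DataFrame({
    "article_number": [123, 123, 345, 345],
    "quantity": [5, 7, 2, 9],
})

# one sub-frame per article number, keyed by the article number
frames = {article: sub for article, sub in df.groupby("article_number")}

print(sorted(frames))                    # [123, 345]
print(frames[345]["quantity"].tolist())  # [2, 9]
```

Each frames[n] can then be fed to KMeans separately.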

How to automatically index DataFrame created from groupby in Pandas

Using the Kiva Loan_Data from Kaggle I aggregated the Loan Amounts by country. Pandas allows them to be easily turned into a DataFrame, but indexes on the country data. The reset_index can be used to create a numerical/sequential index, but I'm guessing I am adding an unnecessary step. Is there a way to create an automatic default index when creating a DataFrame like this?
Use as_index=False (see the pandas documentation on groupby and the split-apply-combine pattern):
df.groupby('country', as_index=False)['loan_amount'].sum()
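A quick demonstration on toy data (values invented, column names matching the question): with as_index=False the result keeps a default sequential RangeIndex instead of being indexed by country.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Peru"],
    "loan_amount": [100, 200, 300],
})

out = df.groupby("country", as_index=False)["loan_amount"].sum()
print(out.index.tolist())           # [0, 1] -- sequential, not country names
print(out["loan_amount"].tolist())  # [300, 300]
```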
