Pandas dataframe to array for further use - python-3.x

I've got a dataframe, read from a CSV of selling KPIs (quantity, article number and the corresponding date).
I need to split the dataframe into multiple frames, each containing the data for one article number (e.g. frame1 = article 123, frame2 = article 345, and so on).
How can I split it dynamically like this for further use in scikit-learn's k-means (matching the different article numbers with their selling KPIs)?
Thanks a lot.

You can group by the article number using groupby:
grouped = df.groupby(['article_number'])
You can then access the individual groups using
grouped.groups
or directly apply aggregation functions such as grouped['quantity'].sum() to get a new frame with the respective values for each group.
Also refer to the User Guide.
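For illustration, a minimal sketch that splits per article number and feeds aggregated quantities to k-means (the file name, the column names article_number and quantity, and the cluster count are assumptions based on the question):
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv('sales.csv')  # assumed CSV with article_number, quantity and date columns
# one sub-frame per article number, e.g. frames[123], frames[345], ...
frames = {article: sub_df for article, sub_df in df.groupby('article_number')}
# aggregate the quantity per article and cluster the articles
per_article = df.groupby('article_number')[['quantity']].sum()
labels = KMeans(n_clusters=3, n_init=10).fit_predict(per_article)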

Related

Grouping dataset by dates and converting columns into categories

I have a Power Query table with data structured as in picture 1: one row per date, and the capacity input per vendor split into separate columns. This causes some issues when creating pivot tables, as I cannot simply filter which vendor I want to look at by a single column.
I think the solution would be to structure the data as shown in picture 2, but is there a way to change from one format to the other? In other situations I need the data structured as it is, so ideally I need a way to present the same dataset in either format as needed.
Current table:
Preferable table:
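In Power Query this kind of reshape is typically done by selecting the vendor columns and applying the Unpivot Columns transform; for comparison, the same wide-to-long reshape in pandas is melt. A sketch with made-up column names:
import pandas as pd
# assumed wide layout: one row per date, one capacity column per vendor
wide = pd.DataFrame({'date': ['2024-01-01', '2024-01-02'], 'vendor_a': [10, 12], 'vendor_b': [7, 9]})
# unpivot to one row per (date, vendor) pair, which is easy to filter and pivot
long = wide.melt(id_vars='date', var_name='vendor', value_name='capacity')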

Can you apply a formula to the count filter of a pivot table? Trying to find duplicates in a large dataset

I have a huge dataset of brands, stores and devices. A store can have multiple brands. There are no duplicate device numbers. I need to figure out, for each brand (Pop, Rock, etc.), how many stores have duplicate brands (e.g., Store 110 in the table below has two Pop-brand devices).
I have figured it out on this simple example dataset using pivot tables. However, I need to apply this to a huge dataset with over 100 brands and thousands of stores. Is there a way to automate the process so I can figure out, for each brand, how many duplicate stores there are?
Is there a way for me to apply a field that returns a value if the count of stores is over 1? Then can I summarize this by brand?
How about you just add a column that tells you whether the brand is a duplicate for that store? You can then put the result in a pivot table, or just filter it to find your answer:
=COUNTIFS(C:C,C2,A:A,A2)>1
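For comparison, the same duplicate flag can be sketched in pandas (the store, brand and device column names and the sample values are made up):
import pandas as pd
df = pd.DataFrame({'store': [110, 110, 111, 112], 'brand': ['Pop', 'Pop', 'Rock', 'Pop'], 'device': [1, 2, 3, 4]})
# flag rows whose (store, brand) pair occurs more than once, like COUNTIFS(...)>1
df['duplicate'] = df.groupby(['store', 'brand'])['device'].transform('size') > 1
# number of stores with duplicates, summarized per brand
per_brand = df[df['duplicate']].groupby('brand')['store'].nunique()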

Split up tables according to keywords

I can download Excel exports of my monthly transactions from my bank. It gives me a table with the following columns:
[date][account number][amount][debit/credit][account name][code][description]
I want to create a file that splits this up into different tables according to certain keywords, such as:
Put all transactions where the description contains an item from a list of keywords into category X (e.g., put all transactions where the description contains "spotify" or "netflix" into the table for subscriptions).
Put all transactions that don't match any criteria into a miscellaneous table.
Add up all items per table per month.
I can't figure out which functions to use to achieve this. The FILTER function seems about right, but too limited. Any ideas?
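The question asks about Excel functions, but as a sketch of the same keyword-based split in pandas (the file name, column names and keyword lists are assumptions):
import pandas as pd
tx = pd.read_excel('transactions.xlsx')  # assumed columns: date, amount, description, ...
tx['date'] = pd.to_datetime(tx['date'])
categories = {'subscriptions': ['spotify', 'netflix']}  # extend with further keyword lists
def categorize(description):
    text = str(description).lower()
    for category, keywords in categories.items():
        if any(keyword in text for keyword in keywords):
            return category
    return 'miscellaneous'
tx['category'] = tx['description'].apply(categorize)
# monthly totals per category
monthly = tx.groupby([tx['date'].dt.to_period('M'), 'category'])['amount'].sum()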

Python3 Pandas dataframes: beside columns names are there also columns labels?

Many database management systems, such as Oracle or SQL Server, and even statistical software like SAS, allow having field labels besides field names.
E.g., in a DBMS one may have a table called "Table1" with, among other fields, two fields called "income_A" and "income_B".
Now, in the DBMS logic, "income_A" and "income_B" are the field names.
Besides a name, those two fields can also have plain-English labels associated with them, which clarify their actual meaning, such as "A - Income of households with dependent children where both parents work and have a post-degree level of education" and "B - Income of empty-nester households where only one parent works".
Is there anything like that in Python3 Pandas dataframes?
I mean, I know I can give a dataframe column a "label" (which is, seen from the above DBMS perspective, more like a "name", in the sense that you can use it to refer to the column itself).
But can I also associate a longer description to the column, something that I can choose to display instead of the column "label" in print-outs and reports or that I can save into dataframe exports, e.g., in MS Excel format? Or do I have to do it all using data dictionaries, instead?
It does not seem that there is a way to store such meta info other than in the column name. But the column name can be quite verbose; I tested up to 100 characters. Make sure to pass the column names as a collection (e.g. a list).
Such a long name could be annoying to use for indexing in the code. You could use loc/iloc or assign the name to a string for use in indexing.
In [10]: pd.DataFrame([1, 2, 3, 4], columns=['how long can this be i want to know please tell me'])
Out[10]:
   how long can this be i want to know please tell me
0                                                    1
1                                                    2
2                                                    3
3                                                    4
This page shows that the columns don't really have any attributes to play with other than the labels:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html
There is some more info you can get about a dataframe:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
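One workaround, along the lines of the data-dictionary idea mentioned in the question, is to keep short column names for coding and map them to long labels only for print-outs and exports. A minimal sketch (the values and the output file name are made up, and the labels are shortened versions of the question's examples):
import pandas as pd
df = pd.DataFrame({'income_A': [50000, 62000], 'income_B': [41000, 39000]})
# data dictionary: short field name -> descriptive label
labels = {
    'income_A': 'A - Income of households with dependent children where both parents work',
    'income_B': 'B - Income of empty-nester households where only one parent works',
}
# work with the short names in code, rename only when exporting or printing
df.rename(columns=labels).to_excel('report.xlsx', index=False)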

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the user_id column (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant one):
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will reduce the size of each dataframe further, as each new dataframe will hold only the distinct values of the user_id column:
dfAuseridDist = dfAuserid.distinct()
dfBuseridDist = dfBuserid.distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection.
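A sketch of that last step, reusing the minimalist dataframes from the previous steps:
dfAuseridDist.join(broadcast(dfBuseridDist), ['user_id'], how='inner').count()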
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since that gets rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
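As a further alternative not mentioned in the answers above, Spark's DataFrame.intersect returns only the distinct rows present in both inputs, so the whole query could also be sketched as:
dfA.select("user_id").intersect(dfB.select("user_id")).count()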
