How to count unique values in aggregation in koalas - apache-spark

New to koalas and trying to do something really basic. I am simply trying to count the unique values in a column in an aggregation. In pandas I would do:
df.groupby('columnname').agg({'column_i_want_count_of_unique_values' : pd.Series.nunique})
But ks.Series.nunique, for example, doesn't work, and 'count' also does not seem to give the right answer.
Pretty frustrating for something so simple and common, and annoying that I can't find it in the documentation for something that bills itself as porting pandas to Spark.

You can use the nunique function:
df.groupby('columnname')['column_i_want_count_of_unique_values'].nunique()

I suppose the correct syntax is:
df.groupby('columnname').agg({'column_i_want_count_of_unique_values' : 'nunique'})
Source: https://github.com/databricks/koalas/pull/512
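As a quick end-to-end check, here is a minimal sketch of both spellings (the column names are placeholders from the question, and it assumes the databricks.koalas package is importable):

import databricks.koalas as ks

# Toy data using the question's placeholder column names
kdf = ks.DataFrame({
    'columnname': ['a', 'a', 'b', 'b', 'b'],
    'column_i_want_count_of_unique_values': [1, 1, 2, 3, 3],
})

# Both forms count distinct values per group: a -> 1, b -> 2
print(kdf.groupby('columnname')['column_i_want_count_of_unique_values'].nunique())
print(kdf.groupby('columnname').agg({'column_i_want_count_of_unique_values': 'nunique'}))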

Related

COUNTIFS when counting distinct values

I've got a table that looks something like this
What I want is a formula that counts how many distinct machines are used at each hospital between 04/06/2021 and 07/06/2021.
So I would get something like this
I feel like this could be done with COUNTIFS if I could somehow make it count only distinct values. I'm just not sure how. Can someone help me with this? Thanks
To expand on the comments, if you have the newest version of Excel you can use functions like FILTER, COUNTA, and UNIQUE.
For example, after getting the unique hospitals with UNIQUE, try something like:
=COUNTA(UNIQUE(
FILTER($B$2:$B$10,
(($C$2:$C$10<DATEVALUE("1/5/2021"))*
($C$2:$C$10>=DATEVALUE("1/1/2021"))*
($A$2:$A$10=$F2)))))
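For comparison, the same distinct-count-per-group idea expressed in pandas (a sketch with made-up data, since the original table isn't shown):

import pandas as pd

# Hypothetical machine-usage records; the real table isn't shown above
df = pd.DataFrame({
    'hospital': ['A', 'A', 'B', 'B', 'B'],
    'machine':  ['m1', 'm1', 'm2', 'm3', 'm3'],
    'date': pd.to_datetime(['2021-06-04', '2021-06-05', '2021-06-04',
                            '2021-06-06', '2021-06-07']),
})

# Filter to the date window, then count distinct machines per hospital
window = (df['date'] >= '2021-06-04') & (df['date'] <= '2021-06-07')
print(df[window].groupby('hospital')['machine'].nunique())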

Why am I getting a SettingWithCopyWarning when trying to filter rows in pandas? How can I properly filter my data rows?

I'm pretty new to data analytics in general, let alone pandas and Python, but I've done a lot of searching to try and resolve this issue and can't pinpoint what exactly the issue is. I'm hoping someone can catch it. When I run my function, it works, but I am met with the following warning: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would ideally like to resolve this and learn what the proper best practice is. I have pinpointed the issue to this single line of code:
region_filtered_data = data.loc[(data['region'] == region[:-1]) & (data['type'] == 'conventional')]
To the best of my understanding, the most common cause of the issue is reassigning a value to some part of the dataframe, which is not what I'm doing here. I am just trying to filter my dataframe to only those records whose value in the "region" column matches the one I specify and whose "type" is conventional.
Try this one:
region_filtered_data = data.loc[(data['region'] == region) & (data['type'] == 'conventional')]
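If you go on to assign to the filtered frame later, the usual way to avoid the warning is an explicit .copy(). A minimal sketch with made-up data:

import pandas as pd

data = pd.DataFrame({
    'region': ['West', 'East', 'West'],
    'type': ['conventional', 'organic', 'conventional'],
    'price': [1.2, 1.5, 1.1],
})
region = 'West'

# .copy() makes it explicit that this is a new frame, not a view,
# so later assignments won't trigger SettingWithCopyWarning
region_filtered_data = data.loc[
    (data['region'] == region) & (data['type'] == 'conventional')
].copy()
region_filtered_data['price'] = region_filtered_data['price'] * 2
print(region_filtered_data)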

How do you know when a dataframe column can / should be added to a method's parameters?

In Use .corr to get the correlation between two columns, one answer uses
Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])
but consulting the pandas docs on .corr, neither the parameters nor the example indicate that you should pass the column to correlate with as a parameter to .corr().
How do you know when you can, or should, put a dataframe column reference inside a method call, as here with .corr?
This is a good point... it is quite frustrating that the pandas.DataFrame.corr documentation doesn't explain that the input dataframe must be a two-column dataframe, nor discuss what types those columns ought to be for the chosen correlation coefficient. That's the kind of thing you could add as a contribution to the pandas project, and I think it would be valuable.
On the other hand, the question you're asking does have an answer in the pandas.Series.corr documentation.
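In short, Series.corr takes the other series as its first argument, which is why the column reference goes inside the call. A small sketch with made-up numbers:

import pandas as pd

df = pd.DataFrame({
    'Citable docs per Capita': [0.1, 0.2, 0.3, 0.4],
    'Energy Supply per Capita': [10.0, 21.0, 29.0, 41.0],
})

# Series.corr(other): correlate one column with another
print(df['Citable docs per Capita'].corr(df['Energy Supply per Capita']))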

getting a groupby error through a for loop

from itertools import groupby

fashion = [1, 1, 2, 3, 3, 3, 21, 1, 1, 1, 5, 5, 5, 5, 3, 3, 2, 6]
for key, group in groupby(fashion):
    print(key, ':', list(group))
I have written the above code to group certain numbers together and get a list for each. For example, I want an outcome such as:
1 : [1,1,1,1,1]
2 : [2,2]
Can someone please tell me what's wrong with my code?
fashion is neither a sorted list for itertools.groupby() nor a pandas DataFrame. Please post fully functioning code and clarify whether you are attempting to use pandas to group a series or a DataFrame.
How to make good reproducible pandas examples
It contains good advice for getting help with pandas features.
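For reference, the usual fix for the desired output above: itertools.groupby only groups consecutive runs, so sort the list first.

from itertools import groupby

fashion = [1, 1, 2, 3, 3, 3, 21, 1, 1, 1, 5, 5, 5, 5, 3, 3, 2, 6]
# Sorting first makes equal values adjacent, so groupby collects them all
for key, group in groupby(sorted(fashion)):
    print(key, ':', list(group))
# 1 : [1, 1, 1, 1, 1]
# 2 : [2, 2]
# ...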

pandas method .isin() automatically converts datatypes in element comparisons?

Wondering if someone can help me here. When I take a regular Python list containing strings and check to see if a pandas series (pla.id) has a value that matches a value in that list, it works.
Which is great and all, but I wonder how it's able to compare strings to ints... is there documentation somewhere that states it will convert under the hood before comparing those values?
I wasn't able to find anything on the .isin() page of the pandas documentation.
Also super interesting: when I try pandas indexing, it fails due to a type comparison.
So my two questions:
Does the pandas.Series.isin(some_list_of_strings) method automatically convert the values in the series (which are int values) to strings before doing a value comparison?
If so, why doesn't pandas indexing, i.e. df[df.series == 'some value'], do the same thing? What is the thinking behind this? To accomplish the same thing I would have to do df[df.series.astype(str) == ''] or df[df.series.astype(str).isin(some_list_of_strings)] to access those values in the df that match.
After some digging I think this might be due to the pandas object datatype, but I have no understanding of why this works. Also, this doesn't explain why the below works... since it is an int dtype.
Thanks in advance!
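For anyone who wants to poke at this, a small sketch of the comparisons described above; exact .isin() behavior with mixed types has varied across pandas versions, so the explicit cast from the question is the unambiguous route:

import pandas as pd

s = pd.Series([1, 2, 3])           # int dtype
wanted = ['1', '2']                # list of strings

print(s.isin(wanted))              # result may differ across pandas versions
print(s.astype(str).isin(wanted))  # explicit cast makes the comparison well-defined
print(s.astype(str) == '1')        # same idea for boolean indexing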
