np.std change ddof within groupby - python-3.x

I was comparing a standard deviation function I wrote by hand against numpy's built-in np.std.
There was a slight difference in the returned values.
I looked it up and numpy uses ddof=0 by default.
I am trying to figure out how to pass that within a groupby and I am failing.
My groupby is simply this: grouped = houses.groupby('Yr Sold').agg({'SalePrice': np.std})
If I use: np.std(ddof=1) it errors out saying I am missing the required positional argument 'a'.
I looked that up and I see what it is, but it seems to me that 'a' is my 'SalePrice' column.
I have tried a few different ways but every single thing I try results in a syntax error.
Using the groupby syntax above, how do I pass the ddof=1 parameter to adjust numpy's default behavior?

I figured out how to solve my problem, just not by directly using the syntax above.
std_dev_dict = {}
for id, group in houses.groupby('Yr Sold'):
    std_dev_dict[id] = np.std(group['SalePrice'], ddof=1)
print(std_dev_dict)
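For the groupby syntax itself, one option is to wrap np.std in a lambda so that ddof=1 is bound before agg calls it on each group. The sketch below assumes a tiny made-up houses frame (the values are illustrative, not from the original data) and also shows pandas' own std, which defaults to ddof=1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real `houses` frame.
houses = pd.DataFrame({
    "Yr Sold": [2006, 2006, 2006, 2007, 2007, 2007],
    "SalePrice": [100.0, 120.0, 140.0, 200.0, 210.0, 250.0],
})

# agg calls the function with each group's column as the only argument,
# so the lambda supplies ddof=1 on np.std's behalf.
grouped = houses.groupby("Yr Sold").agg(
    {"SalePrice": lambda s: np.std(s.to_numpy(), ddof=1)}
)

# pandas' own std already uses ddof=1 by default; it is spelled out here.
pandas_std = houses.groupby("Yr Sold")["SalePrice"].std(ddof=1)

print(grouped)
print(pandas_std)
```

Calling np.std(ddof=1) on its own fails because that call runs immediately with no data; the lambda defers the call until agg hands it a group.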

Related

Why do I get a naming convention error in PySpark when the name is correct?

I'm trying to groupBy a variable (column) called saleId, and then get the Sum for it, using an attribute (column) called totalAmount with the code below:
df = df.groupBy('saleId').agg({"totalAmount": "sum"})
But I get the following error:
Attribute sum(totalAmount) contains an invalid character among ,;{}()\n\t=. Please use an alias to rename it
I'm assuming there's something wrong with the way I'm using groupBy, because I get other errors even when I try the following code instead of the above one:
df = df.groupBy('saleId').sum('totalAmount')
What's the problem with my code?
OK, I figured out what went wrong.
The code in my question names the resulting column sum(totalAmount), which, as you can see, includes parentheses.
This can be avoided by using:
df = df.groupBy('saleId').agg({"totalAmount": "sum"}).withColumnRenamed('sum(totalAmount)', 'totalAmount')
or (with from pyspark.sql import functions as F)
df.groupBy('saleId').agg(F.sum("totalAmount").alias('totalAmount'))

Is there a pandas filter that allows any value? [duplicate]

I have discovered the pandas DataFrame.query method, and it does almost exactly what I need. (I had implemented my own parser before realizing it existed, but I really should be using the standard method.)
I would like my users to be able to specify the query in a configuration file. The syntax seems intuitive enough that I can expect my non-programmer (but engineer) users to figure it out.
There's just one thing missing: a way to select everything in the dataframe. Sometimes what my users want to use is every row, so they would put 'All' or something into that configuration option. In fact, that will be the default option.
I tried df.query('True') but that raised a KeyError. I tried df.query('1') but that returned the row with index 1. The empty string raised a ValueError.
The only things I can think of are 1) put an if clause every time I need to do this type of query (probably 3 or 4 times in the code) or 2) subclass DataFrame and either reimplement query, or add a query_with_all method:
import pandas as pd
class MyDataFrame(pd.DataFrame):
    def query_with_all(self, query_string):
        if query_string.lower() == 'all':
            return self
        else:
            return self.query(query_string)
And then use my own class every time instead of the pandas one. Is this the only way to do this?
Keep things simple, and use a function:
def query_with_all(data_frame, query_string):
    if query_string == "all":
        return data_frame
    return data_frame.query(query_string)
Whenever you need to use this type of query, just call the function with the data frame and the query string. There's no need to use any extra if statements or subclass pd.DataFrame.
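A minimal runnable sketch of that helper follows. The sample frame and its column names are made up for illustration, and the case-insensitive, whitespace-tolerant match on "all" is an assumption borrowed from the question's subclass rather than part of the function above.

```python
import pandas as pd

def query_with_all(data_frame, query_string):
    """Return every row for the sentinel 'all', else delegate to DataFrame.query."""
    # Assumption: accept 'All', 'ALL', ' all ' and so on from the config file.
    if query_string.strip().lower() == "all":
        return data_frame
    return data_frame.query(query_string)

# Hypothetical data for demonstration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

everything = query_with_all(df, "All")   # sentinel: no filtering applied
subset = query_with_all(df, "a > 1")     # ordinary query string

print(everything)
print(subset)
```

Note that the sentinel branch returns the original frame rather than a copy, so callers that mutate the result may want to call .copy() first.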
If you're restricted to using df.query, you can use a global variable
ALL = slice(None)
df.query('@ALL', engine='python')
If you're not allowed to use global variables, and if your DataFrame isn't MultiIndexed, you can use
df.query('tuple()')
All of these will properly handle NaN values.
df.query('ilevel_0 in ilevel_0') will always return the full dataframe, also when the index contains NaN values or even when the dataframe is completely empty.
In your particular case you could then define a global variable all_true = 'ilevel_0 in ilevel_0' (as suggested in the comments by Zero) so that your engineers could use the name of the global variable in their config file instead.
This statement is just a dirty way to express the True query you already tried. ilevel_0 is a more formal way of making sure you are referring to the index. See the docs here for more details on using in and ilevel_0: https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method

Where can I find an overview of how the ec2.instancesCollection is built

In boto3 there's a function:
ec2.instances.filter()
The documentation:
http://boto3.readthedocs.org/en/latest/reference/services/ec2.html#instance
says it returns a list(ec2.Instance). I wish...
when I try printing the return I get this:
ec2.instancesCollection(ec2.ServiceResource(), ec2.Instance)
I've tried searching for any mention of an ec2.instanceCollection, but the only thing I found was something similar for ruby.
I'd like to iterate through this instanceCollection so I can see how big it is, what machines are present and things like that.
The problem is I have no idea how it works, and when it's empty, iteration doesn't work at all (it throws an error).
The filter method does not return a list, it returns an iterable. This is basically a Python generator that will produce the desired results on demand in an efficient way.
You can use this iterator in a loop like this:
for instance in ec2.instances.filter():
    # do something with instance
    print(instance.id)
or if you really want a list you can turn the iterator into a list with:
instances = list(ec2.instances.filter())
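To illustrate the difference between a lazy iterable and a list without needing AWS credentials, here is a small sketch that uses a plain generator as a hypothetical stand-in for ec2.instances.filter(). The fake_filter function and the instance IDs are invented for the example; they are not boto3 API.

```python
def fake_filter(instance_ids):
    """Hypothetical stand-in for ec2.instances.filter(): yields items lazily."""
    for instance_id in instance_ids:
        yield instance_id

# Looping over an empty lazy iterable simply runs the body zero times.
seen = []
for instance_id in fake_filter([]):
    seen.append(instance_id)

# A generator has no len(); materialize it into a list to count the items.
as_list = list(fake_filter(["i-aaa", "i-bbb"]))
count = len(as_list)

print(seen, as_list, count)
```

The same two patterns, a for loop or list(...), are what the answer above applies to the real collection.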
I'm adding this answer because 5 years later I had the same question and went round in circles trying to find the answer.
First off, the return type in the documentation is wrong (still). As you say, it states that the return type is: list(ec2.Instance)
where it should be: ec2.instancesCollection.
At the time of writing there's an open issue on GitHub covering this - https://github.com/boto/boto3/issues/2000.
When you call the filter method a ResourceCollection is created for the particular type of resource against which you called the method. In this case the resource type is instance which gives an instancesCollection. You can see the code for the ResourceCollection superclass of instancesCollection here:
https://github.com/boto/boto3/blob/develop/boto3/resources/collection.py
The documentation here gives an overview of the collections: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/collections.html
To get to how to use it and actually answer your question, what I did was to turn the iterator into a list and iterate over the list if the size is > 0.
testList = list(ec2.instances.filter(Filters=filters))
if len(testList) > 0:
    for item in testList:
        ...
This may well not be the best way of doing it but it worked for me.

When to use list with sum in python

Here are two examples:
sum(list(map(lambda x:x,range(10))))
and
sum(range(10))
The second example does not require a list(), but the first one does. Why?
How do I know when list() is a necessity? The same question applies to using list() with min() and max().
I am running python 3.3.5 with ipython 2.2.0. Here is what I see:
print(sum) results in <built-in function sum> from python console and <function sum at 0x7f965257eb00> from ipythonNotebook. Looks like an issue with hidden imports in notebook.
Neither of the examples requires the use of list. The sum builtin function works with any iterable, so converting the result of map to a list isn't necessary.
Just in case, make sure you are indeed using the builtin sum function. Doing something like from numpy import * would override that. (you can simply print sum and see what you get).
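A quick sketch of both points, using only the standard library: sum, min and max each accept the iterator that map returns in Python 3, and comparing sum against builtins.sum is one way to check whether the name has been shadowed.

```python
import builtins

# In Python 3, map() returns a lazy iterator; sum/min/max consume it directly.
total_from_map = sum(map(lambda x: x, range(10)))
total_from_list = sum(list(map(lambda x: x, range(10))))
total_from_range = sum(range(10))

smallest = min(map(lambda x: x, range(10)))
largest = max(map(lambda x: x, range(10)))

# True when `sum` still refers to the builtin (False after e.g. a star import
# that rebinds the name).
is_builtin_sum = sum is builtins.sum

print(total_from_map, total_from_list, total_from_range)
print(smallest, largest, is_builtin_sum)
```

If is_builtin_sum turns out False in a notebook, builtins.sum can be called explicitly to get the standard behavior.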
I guess the first one just forces the output of the map function to be a list. In Python 2, map() returned a list (consisting of tuples of the corresponding items when the function was None and multiple iterables were passed); in Python 3 it returns a lazy iterator instead.
But either way, based on your example, it would still work.

How do you specify range to end of list?

Consider the following statement:
process.text.readLines[3..<-1]
It seems like it should work. Essentially, strip off the first two elements of the array. However, the range operator is confused by the ending -1, since it's less than -1. You can easily solve this problem by storing the array as a variable and replacing -1 with size(), but that requires an extra line and the definition of a variable. Any other ideas how to express this easily?
I believe you could do:
process.text.readLines()[ 2..-1 ]
or:
process.text.readLines().drop( 2 )
This will also do the trick:
process.text.readLines().with { it[2..size()-1] }
It's longer than simply calling drop as suggested above, but it might read a little better depending on the larger context. with lets you get around defining a new variable.
