Turning a DataFrame into a Series with .squeeze("columns")

I'm studying how to work with data right now, so I'm following along with a tutorial on Time Series data. Among the first things the instructor does is call read_csv on the file path with squeeze=True to read the file as a Series. Unfortunately (and as you probably know), the squeeze argument has been deprecated in read_csv.
I've been reading documentation to figure out how to read a csv as a Series, and everything I try fails. The documentation itself says to use pd.read_csv('filename').squeeze('columns'), but when I check the type afterward, it is always still a DataFrame.
I've looked up various other methods online, but none of them seem to work. I'm doing this on a Jupyter Notebook using Python3 (which the tutorial uses as well).
If anyone has any insights into why I cannot change the type in this way, I would appreciate it. I'm not sure if I've misunderstood the tutorial altogether or if I'm not understanding the documentation.
I do literally type .squeeze("columns") when I write this out because when I write a column name or index, it fails completely. Am I doing that correctly? Is this the correct method or am I missing a better method?
Thanks for the help!
shampoo = pd.read_csv('shampoo_with_exog.csv', index_col=[0], parse_dates=True).squeeze("columns")

I would start with this...
# Change the text between the quotes to the full path of your CSV file.
df = pd.read_csv(r'c:\user\documents\shampoo_with_exog.csv')
To start, this names your DataFrame df, which is the unspoken industry standard, much like pd for pandas.
Additionally, the r prefix makes this a "raw" string, which makes it easier to put Windows directory paths into your Python code because backslashes are not treated as escape characters.
Once you are able to run this successfully, you can simply put df in a separate cell in Jupyter. This will show you what the data from your CSV looks like. Once you have done all that, you can start manipulating your data. While you can use the fancy options in pd.read_csv(), I mostly just load the data and manipulate it in the code afterward. Obviously there are reasons to do more than a bare pd.read_csv, but as you progress you can start adding things here and there. I almost never use squeeze, although I'm sure there will be those here to comment on how "essential" it is for some specific case.
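One likely explanation for the result staying a DataFrame, assuming the file has more than one data column besides the index (the name shampoo_with_exog hints at an extra exogenous column, but that is a guess since the file isn't shown): squeeze("columns") only collapses a DataFrame to a Series when exactly one column remains. A minimal sketch with made-up in-memory CSV data (the column names Month, Sales and Exog are hypothetical):

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs standing in for the real file.
one_value_column = io.StringIO(
    "Month,Sales\n2001-01-01,266.0\n2001-02-01,145.9\n"
)
two_value_columns = io.StringIO(
    "Month,Sales,Exog\n2001-01-01,266.0,1\n2001-02-01,145.9,0\n"
)

# Exactly one column is left after the index is set, so squeeze gives a Series.
single = pd.read_csv(one_value_column, index_col=[0], parse_dates=True).squeeze("columns")

# Two columns are left, so squeeze has nothing to collapse and returns the DataFrame.
multi = pd.read_csv(two_value_columns, index_col=[0], parse_dates=True).squeeze("columns")

print(type(single).__name__)
print(type(multi).__name__)

# With several columns, selecting one by name yields a Series.
sales = multi["Sales"]
print(type(sales).__name__)
```

If that is the situation, checking shampoo.columns after loading should show which column to select.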

Related

Python 3.10 datetime strptime not picking up time zone?

I have a timestamp embedded in some JSON data as a string, for ease of inspection and modification. An example looks like this:
"debug_time": 1670238819.9747384,
"last_saved": "2022-12-05 11:13:39.974725 UTC",
When loaded back in, I need to convert it back to a float for comparison against time.time() and similar things, however, I can't seem to find the magic incantations to have it restore the correct value.
In restoring the JSON data, I attempt to convert the string to a float via strptime() like this:
loaded_time = datetime.datetime.strptime(obj.last_saved, '%Y-%m-%d %H:%M:%S.%f %Z')
This does restore the timestamp to a valid datetime object, however calling .tzname() results in None, and my attempts to use loaded_time.replace(tzinfo=zoneinfo.ZoneInfo('UTC')) have not yielded any useful results.
In short, emitting loaded_time.timestamp() yields 1670267619.974725, which is 8 hours ahead of what it should be. I've tried using .astimezone(), in various permutations, but can't find a way to correctly have it convert to the client's local time.
I even tried to hard-code in my own time zone US/Pacific but it stubbornly refuses to give me that original debug_time value back.
This doesn't seem like it should be a difficult problem, but clearly I'm misunderstanding something about how python 3's time handling works. Any ideas are welcome!
Thank you for your time!
You have to use the built-in replace function, like this:
.strftime("%Y-%m-%dT%H:%M:%S.%f%Z").replace("UTC", "")
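Another way to look at it, as a sketch rather than a definitive fix: strptime accepts "UTC" for %Z but still returns a naive datetime, and .timestamp() on a naive datetime assumes local time, which would explain the 8-hour offset. Attaching UTC explicitly and keeping the object that replace() returns (it does not modify in place) gives back the saved value. The variable names below are illustrative only:

```python
import datetime

last_saved = "2022-12-05 11:13:39.974725 UTC"

# %Z matches the literal "UTC", but the parsed datetime carries no tzinfo.
naive = datetime.datetime.strptime(last_saved, "%Y-%m-%d %H:%M:%S.%f %Z")
print(naive.tzinfo)

# replace() returns a new datetime, so the result must be assigned.
aware = naive.replace(tzinfo=datetime.timezone.utc)

# timestamp() on an aware datetime does not depend on the machine's time zone.
restored = aware.timestamp()
print(restored)
```

The restored float matches the time in the last_saved string, so it can be compared against time.time() directly.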

Python 3 Tips to Shorten Code for Assignment and Getting Around TextIO

I've been going through a course and trying to find ways to shorten my code. I had an assignment to open a text file, split it, add all of the unique values to a list, and finally sort it. I passed the assignment, but I have been trying to shorten it so I can apply the same ideas to future code. The main issue I keep running into is turning the opened file into strings, so I can build lists and append to them, without read(). If I don't use read() I get back TextIO errors. I tried looking into it, but what I found involved importing os and doing some other funky stuff, which seems like it would take more time.
So if anyone would mind giving me tips to more effectively code this that are beginner friendly I would be appreciative.
romeo = open('romeo').read()
mylist = list()
for line in romeo.split():
    if line not in mylist:
        mylist.append(line)
mylist.sort()
print(mylist)
I saw that set() is pretty good for unique values, but then I don't think I can sort it, and flip-flopping between a list and a set seems wacky. I tried those swanky one-line for loops, but couldn't get it to work, e.g. for line not in mylist : mylist.append(line). I know that's not how to do it or even close, but I don't know how to convey what I mean.
So to reiterate:
1. How to get the same result without read() / getting around TextIO
2. How to write this code in a more streamlined way.
I'm new to the site and coding, so hopefully I didn't trigger anyone.
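One possible approach, offered as a sketch: a file object can be iterated line by line, so read() isn't needed, and sorted() accepts a set and returns a list, so there is no awkward back-and-forth. The helper name unique_sorted_words is made up for illustration, and the sample lines stand in for the 'romeo' file:

```python
def unique_sorted_words(lines):
    """Return the distinct whitespace-separated words in lines, sorted."""
    # The set comprehension drops duplicates; sorted() takes any iterable
    # and returns a new list.
    return sorted({word for line in lines for word in line.split()})


# A file object yields its lines when iterated, so with the real file:
# with open('romeo') as handle:
#     print(unique_sorted_words(handle))

# Stand-in data so the sketch runs without the file.
sample = ["But soft what light", "It is the east and Juliet is the sun"]
print(unique_sorted_words(sample))
```

Note that uppercase words sort before lowercase ones, the same as with list.sort() in the original code.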

Python groupby returning single value between carets

Long time listener first time caller. I am new to Python, about 3 days into this and I cannot figure out for the life of me, what is happening in this particular instance.
I brought in an XLSX file as a dataframe called dfInvoice. I want to use groupby on two columns (indexes?) but something funky is happening I think. I can't see my new grouped dataframe with the code below.
uniqueLocation = dfInvoice.groupby(['Location ID','Location'])
When I call uniqueLocation, all that is returned is this:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001B9A1C61198>
I have two questions from here.
1) what the heck is going on? I followed these steps almost identically to this (https://www.geeksforgeeks.org/python-pandas-dataframe-groupby).
2) this string of text between the carets, what should I refer to this as? I didn't know how to search for this happening because I don't exactly understand what this return is.
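For what it's worth, the text between the angle brackets is the default repr of a Python object: its class name and memory address. groupby on its own only describes how the rows are grouped; nothing is displayed until an aggregation or selection is applied. A minimal sketch, using a made-up stand-in for dfInvoice (the Amount column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the invoice data loaded from the XLSX file.
dfInvoice = pd.DataFrame({
    "Location ID": [1, 1, 2],
    "Location": ["North", "North", "South"],
    "Amount": [10.0, 5.0, 7.5],
})

# This is a DataFrameGroupBy object, not a DataFrame.
grouped = dfInvoice.groupby(["Location ID", "Location"])

# Applying an aggregation produces something that can be displayed.
totals = grouped["Amount"].sum().reset_index()
print(totals)

# If only the distinct pairs are wanted, no groupby is needed at all.
unique_locations = dfInvoice[["Location ID", "Location"]].drop_duplicates()
print(unique_locations)
```

Searching for "DataFrameGroupBy object" together with "aggregate" should turn up the relevant documentation.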

Is there a pandas filter that allows any value? [duplicate]

I have discovered the pandas DataFrame.query method, and it does almost exactly what I need (I had implemented my own parser before realizing it existed, but I really should be using the standard method).
I would like my users to be able to specify the query in a configuration file. The syntax seems intuitive enough that I can expect my non-programmer (but engineer) users to figure it out.
There's just one thing missing: a way to select everything in the dataframe. Sometimes what my users want to use is every row, so they would put 'All' or something into that configuration option. In fact, that will be the default option.
I tried df.query('True') but that raised a KeyError. I tried df.query('1') but that returned the row with index 1. The empty string raised a ValueError.
The only things I can think of are 1) put an if clause every time I need to do this type of query (probably 3 or 4 times in the code) or 2) subclass DataFrame and either reimplement query, or add a query_with_all method:
import pandas as pd
class MyDataFrame(pd.DataFrame):
    def query_with_all(self, query_string):
        if query_string.lower() == 'all':
            return self
        else:
            return self.query(query_string)
And then use my own class every time instead of the pandas one. Is this the only way to do this?
Keep things simple, and use a function:
def query_with_all(data_frame, query_string):
    if query_string == "all":
        return data_frame
    return data_frame.query(query_string)
Whenever you need to use this type of query, just call the function with the data frame and the query string. There's no need to use any extra if statements or subclass pd.DataFrame.
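As a usage sketch of the function approach with a configuration-style default: the config dictionary, its "row_filter" key, and the sample data are all hypothetical, and the case-insensitive match on "all" is an assumption borrowed from the question's own version.

```python
import pandas as pd


def query_with_all(data_frame, query_string):
    # Treat "all" (any case, surrounding spaces ignored) as "no filter".
    if query_string.strip().lower() == "all":
        return data_frame
    return data_frame.query(query_string)


# Hypothetical data and user configuration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
config = {}  # the user left the option unset

# Fall back to "All" when the option is missing from the configuration.
everything = query_with_all(df, config.get("row_filter", "All"))
filtered = query_with_all(df, "a > 1")

print(len(everything))
print(filtered)
```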
If you're restricted to using df.query, you can use a global variable
ALL = slice(None)
df.query('@ALL', engine='python')
If you're not allowed to use global variables, and if your DataFrame isn't MultiIndexed, you can use
df.query('tuple()')
All of these will properly handle NaN values.
df.query('ilevel_0 in ilevel_0') will always return the full dataframe, even when the index contains NaN values or when the dataframe is completely empty.
In your particular case you could then define a global variable all_true = 'ilevel_0 in ilevel_0' (as suggested in the comments by Zero) so that your engineers could use the name of the global variable in their config file instead.
This statement is just a dirty way to properly query True like you already tried. ilevel_0 is a more formal way of making sure you are referring to the index. See the docs here for more details on using in and ilevel_0: https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method

Getting the result of an excel formula in python

I need to open a .xlsx-file (without writing to it) in python, to change some fields and get the output after the formulas in some fields were calculated; I only know the input fields, the output field and the name of the sheet.
In code, here is how it would look if I had written the library myself:
file = excel.open("some_file.xlsx")
sheet = file[sheet_name]
for k, v in input_fields.items():
    sheet[k] = v
file.do_calculations()
print(sheet[output_field])
Is there an easy way to do this? Which library should I use to get the result of the formulas after providing new values for some fields?
Is there a better way than using something like pyoo, maybe something that doesn't require another application (a python library is clearly better) to be installed?
I'll just thank you in advance.
I have now come up with a (very ugly) solution.
I read the XML inside the .xlsx file and use eval and some regular expressions to find out which fields are needed, and I have defined some functions to run the calculations.
It works, but it would be great if there were a better solution.
If the resulting library gets finished, and I don't forget, I'll add a link to it (it will be hosted on GitHub) to this answer to my own question.
