I tried to calculate a delay in minutes by subtracting two columns, but I get the error "ValueError: unit abbreviation w/o a number".
The same operation works on two other columns.
Calculating arr_delay raises no error, but calculating dep_delay does.
data['arr_delay'] = (pd.to_timedelta(data.ATA) - pd.to_timedelta(data.STA)).dt.total_seconds()/60
data['dep_delay'] = (pd.to_timedelta(data.ATD) - pd.to_timedelta(data.STD)).dt.total_seconds()/60
arr_delay is calculated without problems, but dep_delay fails with "ValueError: unit abbreviation w/o a number".
Most likely ATD or STD contains some values that cannot be parsed as timedeltas, so pass errors='coerce' to convert those values to NaT:
data['dep_delay'] = (pd.to_timedelta(data.ATD, errors='coerce') -
pd.to_timedelta(data.STD, errors='coerce')).dt.total_seconds()/60
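To see which rows are responsible before they silently become NaT, one option is to compare the coerced result with the original column. The sketch below uses made-up sample values (the "unknown" entry is an assumption standing in for whatever bad value the real data contains); only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; the real frame comes from the asker's file.
data = pd.DataFrame({
    "STD": ["10:00:00", "11:30:00", "12:15:00"],
    "ATD": ["10:12:00", "unknown", "12:10:00"],
})

# Unparseable strings become NaT instead of raising ValueError.
atd = pd.to_timedelta(data["ATD"], errors="coerce")
std = pd.to_timedelta(data["STD"], errors="coerce")

# Rows that had a value in ATD but failed to parse.
bad_rows = data[atd.isna() & data["ATD"].notna()]
print(bad_rows)

# Delay in minutes; rows with a bad value end up as NaN.
data["dep_delay"] = (atd - std).dt.total_seconds() / 60
print(data)
```

Inspecting bad_rows first makes it easier to decide whether the values should be cleaned at the source or simply left as missing.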
I have a data frame which I groupBy Date and name and I aggregate the average of price.
df.filter("some filter")
.withColumn("price_int", df.print.cast('integer'))
.groupBy("Date", 'name')
.agg(avg(col('price_int')))
I want to get the max, and stddev of price_int column.
df.filter("some filter")
.withColumn("price_int", df.print.cast('integer'))
.groupBy("Date", 'name')
.agg(avg(col('price_int'),
max(col('price_int'),
stddev(col('price_int'),
)))
But when I add max(col('price_int')), I get an error saying "Column is not iterable".
And when I add stddev(col('price_int')), I get an error saying "stddev is not defined".
Can you please tell me how I can get the max and standard deviation of the column 'price_int'?
I'm skeptical about where the issue is, as the code you provided seems to have syntax messed up, but in any case - the below would work:
from pyspark.sql import functions as F
# Wrap the chain in parentheses so it can span several lines.
(df.filter("some filter")
   .withColumn("price_int", df.print.cast('integer'))
   .groupBy("Date", 'name')
   .agg(F.avg(F.col('price_int')).alias("Average"),
        F.stddev(F.col("price_int")).alias("Std_Deviation"),
        F.max(F.col("price_int")).alias("Maximum")))
This worked for me.
Also, note Spark has 2 different standard deviation functions:
stddev() or stddev_samp() - returns the unbiased sample standard deviation of the expression in a group
stddev_pop() - returns population standard deviation of the expression in a group.
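The difference between the sample and population versions can be illustrated without Spark. The sketch below uses Python's standard statistics module as a stand-in, with made-up numbers: statistics.stdev divides by n - 1 (the sample definition) and statistics.pstdev divides by n (the population definition).

```python
import statistics

# Hypothetical prices for one group.
prices = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample standard deviation: squared deviations divided by n - 1.
sample_std = statistics.stdev(prices)

# Population standard deviation: squared deviations divided by n.
population_std = statistics.pstdev(prices)

print(sample_std, population_std)
```

The sample value is always the larger of the two for the same data, and the gap shrinks as the group size grows.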
I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe, but you assign the whole return value to the variable df, then df will be a tuple that contains the actual dataframe, and not the dataframe itself. Unpack the return value (or index it with an integer) to get the dataframe.
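A minimal sketch of that situation, assuming a loader function that returns the frame together with a second value. The function name load_data, the sample timestamps, and the extra return value are all hypothetical; only the column names come from the question.

```python
import pandas as pd

def load_data():
    """Hypothetical loader that returns the frame plus an extra value."""
    frame = pd.DataFrame({
        "Start Time": [
            "2017-01-05 09:00:00",
            "2017-06-10 12:30:00",
            "2017-06-21 08:15:00",
        ]
    })
    return frame, len(frame)

# Assigning the whole return value gives a tuple, not a DataFrame.
result = load_data()
try:
    result["Start Time"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unpacking gives the DataFrame itself, and the three steps then work.
df, row_count = load_data()
df["Start Time"] = pd.to_datetime(df["Start Time"])
df["extracted_month"] = df["Start Time"].dt.month
most_common_month = df["extracted_month"].mode()[0]
print(most_common_month)
```

A quick print(type(df)) right after loading is usually enough to confirm whether this is what is happening.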
I have some code whose run time I timestamp, and I want to measure the average time it took to run on multiple computers, but I just can't figure out how to use the datetime module in Python.
Here is how my procedure looks:
1) The code simply writes a log into a text file, where the timestamp is produced like this:
t1=datetime.datetime.now()
...
t2=datetime.datetime.now()
stamp= t2-t1
And that stamp variable is just written in say log.txt so in the log file it looks like 0:07:23.160896 so it seems like it's %H:%M:%S.%f format.
2) Then I run a second python script which reads in the log.txt file and it reads the 0:07:23.160896 value as a string.
The problem is that I don't know how to work with this value. If I import it as a datetime, it also gets an imaginary year, month and day attached, which I don't want. I just want to work with hours, minutes, seconds and microseconds so I can add them up or take an average.
For example, I can open the log in LibreOffice and add 0:07:23.160896 to 0:00:48.065130, which gives 0:08:11.226026, and then divide by 2, which gives 0:04:05.613013. I can't find a way to do that in Python.
I have tried everything, but neither datetime.datetime nor datetime.timedelta seems to allow simple multiplication and division like that. If I do y=datetime.datetime.strptime('0:07:23.160896','%H:%M:%S.%f') it gives 1900-01-01 00:07:23.160896, and I can't take y*2 because it doesn't allow arithmetic operations. Plus, if I convert it into a timedelta, the year gets multiplied as well, which is ridiculous. I just want to add, subtract and multiply times.
Please help me find a way to do this, not just for 2 variables but ideally for an entire list of timestamps, something like average(['0:07:23.160896' , '0:00:48.065130', '0:00:14.517086',...]).
I just want to calculate the average of many timestamps and output it in the same format, the same way you can select a column in LibreOffice and apply the AVERAGE() function to get the average timestamp in that column.
As you have done, first read the string into a datetime object using strptime: t = datetime.datetime.strptime(single_time,'%H:%M:%S.%f')
After that, convert the time part of that datetime into a timedelta, so you can easily calculate with times: tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond)
Now you can do arithmetic on the timedelta objects (they support addition, and division by a number), and convert the result back into a string at the end with str(taverage)
import datetime
times = ['0:07:23.160896', '0:00:48.065130', '0:12:22.324251']
# sum the times as timedeltas while counting them
tsum = datetime.timedelta()
count = 0
for single_time in times:
    t = datetime.datetime.strptime(single_time,'%H:%M:%S.%f')
    tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond)
    tsum = tsum + tdelta
    count = count + 1
taverage = tsum / count
average_time = str(taverage)
print(average_time)
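The same idea can be wrapped in a reusable function, closer to the average([...]) call asked for in the question. This is only a sketch: the names parse_duration and average_durations are made up, the parser splits the string by hand instead of going through strptime, and it assumes every logged duration is shorter than one day (so the text never has a "1 day, " prefix).

```python
from datetime import timedelta

def parse_duration(text):
    """Parse 'H:MM:SS' or 'H:MM:SS.ffffff' (as printed by str(timedelta))."""
    hours, minutes, seconds = text.strip().split(":")
    whole, _, fraction = seconds.partition(".")
    # Pad or cut the fraction to exactly six digits of microseconds.
    micro = int(fraction.ljust(6, "0")[:6]) if fraction else 0
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(whole), microseconds=micro)

def average_durations(texts):
    """Return the mean of the given duration strings as a timedelta."""
    durations = [parse_duration(t) for t in texts]
    # sum() needs a timedelta start value, since the default start is 0.
    return sum(durations, timedelta()) / len(durations)

average = average_durations(['0:07:23.160896', '0:00:48.065130'])
print(str(average))  # 0:04:05.613013, the value from the LibreOffice example
```

Handling the fraction separately also covers log lines where the microseconds happened to be zero and str() printed no fractional part.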
My first python project that didn't print 'Hello World' - so be gentle. Tried answers from similar questions but they don't seem to work.
I'm working with an Excel file, parsing as pandas dataframe.
I have a calculated column that calculates the number of days to later be added to a date. The number of days to add column is done as below, with 'choices' being a list of integers. This seems to work fine.
choices = [0,0,925,778,567,608, 638,730]
df['Days_to_add'] = np.select(conditions, choices, default=0)
I now want to add this to an existing date column, to return a new column with the new date. So far I've tried this, but Jupyter says it's deprecated and will raise a TypeError in a future version:
df["Estimated Start"] = pd.to_timedelta(df["Date1"]) + df['Days_to_add']
Also tried this:
df['Estimated_Start'] = df.Max_Dec_Date + pd.DateOffset(df['Days_to_add'])
And something else that told me to use timedelta index, and something else that pointed to timedelta range. I think the problem is something to do with trying to add an integer to a series?
No success with any of it. Help?
The date column is a datetime, not a timedelta, so convert it with pd.to_datetime and convert only the day counts with pd.to_timedelta.
The addition should go like this:
df["Estimated Start"] = pd.to_datetime(df["Date1"]) + pd.to_timedelta(df['Days_to_add'], unit='D')
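A small self-contained sketch of that line in action. The dates and day counts below are made up for illustration; only the column names "Date1", "Days_to_add" and "Estimated Start" come from the question.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Excel data.
df = pd.DataFrame({
    "Date1": ["2021-01-01", "2021-03-01", "2021-01-01"],
    "Days_to_add": [0, 30, 365],
})

# Dates become datetimes, integer day counts become timedeltas,
# and the two can then be added element by element.
df["Estimated Start"] = (
    pd.to_datetime(df["Date1"])
    + pd.to_timedelta(df["Days_to_add"], unit="D")
)
print(df)
```

Adding a plain integer column to a datetime column is ambiguous (days? nanoseconds?), which is why the unit has to be stated explicitly.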
In a small portion of code for a retail auditing calculator, I'm attempting to allow the input of a retail value and multiply it by up to 2 entered quantities. The expected (intended) result is $X*Y=$Z.
I've attempted to modify the code a couple of ways and seem to be stuck on why this math isn't working correctly.
I've attempted a number of different configurations in the code and the most I've achieved is the following:
#Retail value of item, whole number (i.e. $49.99 entered as 4999)
rtlVAL = input("Retail Value: ")
#Quantity of Items - can be multiplied for full stack items, default if no entry is '1'
qt1 = float(input("Quantity 1: ")) #ex. 4
qt2 = float(input("Quantity 2: ") or "1") #ex " "
#Convert the Retail Value to financial format (i.e. 4999 to $49.99)
rtl = float("{:.2}".format (rtlVAL))
# Screen Output
qtyVAL = int(qt1)*int(qt2)
print("$" + str(qtyVAL*rtl))
The entered values are:
Retail Value: 4999
Quantity 1: 4
Quantity 2: (blank)
The expected calculation is 4999 (i.e. $49.99) * 4 * 1 (because no entry defaults to a value of 1), and the expected result is $199.96.
The result of this code is $196.0, so not only is the total wrong, it is also missing the second decimal place.
I'm not sure why the math comes out differently from what I expect.
What am I missing here?
On line 9, I've tried the following:
rtl = float("{:.2f}".format (rtlVAL))
rtl = int("{:.2f}".format (rtlVAL))
The return was
ValueError: Unknown format code 'f' for object of type 'str'
If I change line 13 to:
print("$" + float(qtyVAL*rtl))
I get
TypeError: must be str, not float
Using either of the prior alterations in conjunction with the latter returns the ValueError.
Python 3.4 and 3.6
I did search a few other SO questions regarding Python, math, floating point, and formatting, but they were asking about something far more advanced and entangled than this, so I wasn't able to glean an answer I could apply here. Others applied mainly to Python 2.7, where raw_input() is used instead of input(); as far as I understand, in Python 3.x it is simply input(), wrapped in int() to get out of the str value.
I did not see this as a duplicate, but if I missed something, I do apologize. It isn't intentional.
No need to mess around with number formats:
rtl = float(rtlVAL)/100
Just divide the retail value by 100 to get the dollar value
EDIT:
Incidentally, the reason it was coming up with 196 is that "{:.2}" applied to a string keeps only its first two characters, 49 in your case, so the quantity was multiplied by 49.0 instead of 49.99.
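Putting the pieces together, here is one way the calculation could look. This is a sketch rather than the answer's exact code: it keeps the total in whole cents and only splits it into dollars and cents for display, instead of dividing by 100 up front. The function name format_total is made up, and the entries are passed in as strings to stand in for what input() would return.

```python
def format_total(retail_entry, qty1_entry, qty2_entry=""):
    """Return the total as '$D.CC' from a retail value entered in cents."""
    cents = int(retail_entry)             # '4999' means $49.99
    qty1 = int(qty1_entry)
    qty2 = int(qty2_entry.strip() or "1")  # blank entry defaults to 1
    total_cents = cents * qty1 * qty2
    dollars, remainder = divmod(total_cents, 100)
    return "${}.{:02d}".format(dollars, remainder)

# What the original format call actually did to the entered string:
truncated = "{:.2}".format("4999")
print(truncated)                          # 49

print(format_total("4999", "4", ""))      # $199.96
```

Working in integer cents avoids float rounding in the total and always prints two decimal places; .format() is used instead of an f-string so the code also runs on Python 3.4.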