I wrote this function to reference an existing date column and create a new column called wbm (short for "week beginning Monday").
def wbmFunc(df, col):
    if df[col].weekday() == 0:
        return df[col]
    else:
        return df[col] + timedelta(days=(0 - df[col].weekday()))

df['wbm'] = wbmFunc(df, 'date')
Why does it return the below error?
AttributeError: 'Series' object has no attribute 'weekday'
Since you want to access a datetime-like property of a Series, you have to use:
series.dt.weekday
Also note that since weekday is a property here, you don't call it as a function on the series.
You can refer to the pandas documentation on this topic.
It looks like you want to construct a new column holding the week-beginning Monday for a given date. Even if you fix the property bug, there is still a problem: the if condition would be evaluated against a whole Series, which is ambiguous. Why not use pd.offsets? You can try the following code for the same purpose:
def wbmFunc(df, col):
    w_mon = pd.offsets.Week(weekday=0)
    return df[col].apply(w_mon.rollback)
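If the column is already of datetime64 dtype, a fully vectorized alternative is to subtract the day-of-week offset as a timedelta. A minimal sketch, assuming a column named 'date' with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01',    # a Monday
                                           '2024-01-03',    # a Wednesday
                                           '2024-01-07'])}) # a Sunday

# Subtract the weekday number (Monday == 0) as days to roll back to Monday;
# Mondays themselves are unchanged since their offset is zero
df['wbm'] = df['date'] - pd.to_timedelta(df['date'].dt.weekday, unit='D')
print(df)
```

This avoids the per-element apply, which matters on large frames.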
Could someone perhaps assist me in finding a solution to this problem? I'm currently
learning how to code. I'm attempting to create a new column that displays the current
price as it fluctuates in real-time. I tried "stock_info.get_live_price('NIO')"; it
works when only one ticker is inserted, but not when the variable 'stock_name' is
inserted.
import pandas as pd
from yahoo_fin import stock_info

def My_portfolio1():
    df = pd.DataFrame({
        'stock_names': ['NIO', 'JMIA', 'SVRA'],
        'price': [1, 3, 4],
        'quantity': [200, 100, 400],
        'entry_price': [3, 4, 5],
        'current_price': [2, 3, 1]
    })
    df['new_value'] = df['current_price'] - df['entry_price']
    df['pnl'] = df['new_value'] * df['quantity']
    df['live_update'] = stock_info.get_live_price('stock_name')
    return df

My_portfolio1()
Thank you so much, everyone! As a result, I decided to make a variable for each of the tickers and use the loc function to place them in the appropriate rows and columns.
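A sketch of that per-ticker .loc approach. The lookup function here is a made-up stand-in for yahoo_fin's stock_info.get_live_price (same shape: one ticker symbol in, one price out), so the example runs without a network connection:

```python
import pandas as pd

# Stand-in for stock_info.get_live_price; the prices are invented
def get_live_price(ticker):
    fake_prices = {'NIO': 5.2, 'JMIA': 3.9, 'SVRA': 0.8}
    return fake_prices[ticker]

df = pd.DataFrame({'stock_names': ['NIO', 'JMIA', 'SVRA']})

# Look up each ticker and place its price in the matching row via .loc
for ticker in df['stock_names']:
    df.loc[df['stock_names'] == ticker, 'live_update'] = get_live_price(ticker)
```

With the real library, df['stock_names'].apply(stock_info.get_live_price) would achieve the same in one line, since apply passes each cell value to the function individually.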
I need to check if an employee has checked out during the break.
To do so, I need to see if there is a time at which Door Name is RDC_OUT-1 within the interval [12:15:00; 14:15:00].
import pandas as pd

df_by_date = pd.DataFrame({
    'Time': ['01/02/2019 07:02:07', '01/02/2019 10:16:55', '01/02/2019 12:27:20',
             '01/02/2019 14:08:58', '01/02/2019 15:32:28', '01/02/2019 17:38:54'],
    'Door Name': ['RDC_OUT-1', 'RDC_IN-1', 'RDC_OUT-1', 'RDC_IN-1', 'RDC_OUT-1', 'RDC_IN-1']})

df_by_date['Time'] = pd.to_datetime(df_by_date['Time'])
df_by_date['hours'] = pd.to_datetime(df_by_date['Time'], format='%H:%M:%S').apply(lambda x: x.time())
print('hours \n', df_by_date['hours'])

out = '12:15:00'
inn = '14:15:00'
pause = 0
for i in range(len(df_by_date)):
    if out < str((df_by_date['hours'].iloc[i]).where(df_by_date['Door Name'].iloc[i] == 'RDC_IN-1')) < inn:
        pause += 1
        print('Break outside ')
    else:
        print('Break inside')
When running the code above, I got this error:
if (out < ((df_by_date['hours'].iloc[i]).where(df_by_date['Door Name'].iloc[i]=='RDC_OUT-1')) < inn) :
AttributeError: 'datetime.time' object has no attribute 'where'
When you iterate over the DataFrame/Series like this, you are selecting one cell at a time.
The cell you are selecting is of type datetime.time, and where only exists on a complete DataFrame/Series, so it cannot be called on a single cell inside a loop.
Instead, apply it to the whole column, like:
sub_df = df_by_date['hours'].where(condition)
and then to count the matching rows you can use sub_df.count() (non-matching rows are replaced with NaN, which count() skips).
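A minimal vectorized sketch of that idea, checking whether any RDC_OUT-1 event falls inside the break window (assuming, as in the question, that the hours column holds datetime.time objects; the sample rows are trimmed down):

```python
import datetime
import pandas as pd

df_by_date = pd.DataFrame({
    'Time': pd.to_datetime(['01/02/2019 07:02:07', '01/02/2019 12:27:20',
                            '01/02/2019 15:32:28']),
    'Door Name': ['RDC_OUT-1', 'RDC_OUT-1', 'RDC_IN-1']})
df_by_date['hours'] = df_by_date['Time'].dt.time

start = datetime.time(12, 15, 0)
end = datetime.time(14, 15, 0)

# Boolean mask: exit events whose clock time falls inside the break window
mask = ((df_by_date['Door Name'] == 'RDC_OUT-1')
        & df_by_date['hours'].between(start, end))
print(mask.sum())  # number of check-outs during the break
```

Comparing datetime.time objects directly avoids the string comparison in the original loop, which only works by accident of lexicographic ordering.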
My first python project that didn't print 'Hello World' - so be gentle. Tried answers from similar questions but they don't seem to work.
I'm working with an Excel file, parsing as pandas dataframe.
I have a calculated column that calculates the number of days to later be added to a date. The number of days to add column is done as below, with 'choices' being a list of integers. This seems to work fine.
choices = [0,0,925,778,567,608, 638,730]
df['Days_to_add'] = np.select(conditions, choices, default=0)
I now want to add this to an existing date column to return a new column with the new date. So far I've tried this, but Jupyter says it's deprecated and will raise a TypeError in a future version:
df["Estimated Start"] = pd.to_timedelta(df["Date1"]) + df['Days_to_add']
Also tried this:
df['Estimated_Start'] = df.Max_Dec_Date + pd.DateOffset(df['Days_to_add'])
And something else that told me to use timedelta index, and something else that pointed to timedelta range. I think the problem is something to do with trying to add an integer to a series?
No success with any of it. Help?
The date column is not a timedelta but a datetime, and the integer day counts need converting to a timedelta, so the addition should go like this:
df["Estimated Start"] = pd.to_datetime(df["Date1"]) + pd.to_timedelta(df['Days_to_add'], unit='D')
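A quick self-contained sketch of that pattern, with made-up dates and day counts in place of the asker's Excel data:

```python
import pandas as pd

df = pd.DataFrame({'Date1': ['2021-01-01', '2021-06-15'],
                   'Days_to_add': [0, 925]})

# Convert the dates to datetime and the integer day counts to timedeltas,
# then add them elementwise to get the shifted dates
df['Estimated Start'] = (pd.to_datetime(df['Date1'])
                         + pd.to_timedelta(df['Days_to_add'], unit='D'))
print(df['Estimated Start'])
```

The key point is that both operands must be datetime-like: adding a plain integer Series to a datetime Series is what triggered the deprecation warning.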
I am writing a function where the argument is a pandas Series and I want to be able to print the name of the pandas series. Here is the function I have so far:
def chi2_ind_reps(x):
    if chi2_ind(df['n_killed'], x) is True:
        print('n_killed is dependent on ')
    if chi2_ind(df['n_injured'], x) is True:
        print('n_injured is dependent on ')
For example, I want chi2_ind_reps(df['date']) to return
n_killed is dependent on df['date']
n_injured is dependent on df['date']
I have tried to use the str() function but that would just return the entire series as string objects.
Any suggestions?
You can use the name attribute of the series.
See also this question for a similar case.
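A minimal sketch: selecting a column yields a Series that carries the column label in its name attribute, which you can interpolate into the message (chi2_ind is the asker's own function and is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-01-01', '2018-01-02'],
                   'n_killed': [0, 1]})

x = df['date']  # selecting a column keeps its label as the Series name
print(x.name)   # -> date
print(f'n_killed is dependent on {x.name}')
```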
I'm using a map function to generate a new column whose value depends on a column that already exists in the dataframe.
def computeTechFields(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return (row.col1, row.col2, row.col3, tech1)

delta2rdd = delta.map(computeTechFields)
The problem is that my main dataframe has more than 150 columns that I have to return with the map function so in the end I have something like this :
return (row.col1, row.col2, row.col3, row.col4, row.col5, row.col6, row.col7, row.col8, row.col9, row.col10, row.col11, row.col12, row.col13, row.col14, row.col15, row.col16, row.col17, row.col18 ..... row.col149, row.col150, row.col151, tech1)
As you can see, it is really long to write and difficult to read. So I tried to do something like this :
return (row.*, tech1)
But of course it did not work.
I know that the "withColumn" function exists but I don't know much about its performance and could not make it work anyway.
Edit (What happened with the withColumn function) :
def computeTech1(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return tech1

delta2 = delta.withColumn("tech1", computeTech1)
And it gave me this error :
AssertionError: col should be Column
I tried to do something like this :
return col(tech1)
The error was the same
I also tried :
delta2 = delta.withColumn("tech1", col(computeTech1))
This time, the error was :
AttributeError: 'function' object has no attribute '_get_object_id'
End of the edit
So my question is, how can I return all the columns + a few more within my UDF used by the map function ?
Thanks !
Not super firm with Python, so people might correct me on the syntax here, but the general idea is to make your function a UDF with a column as input, then call that inside withColumn. I used a lambda here, but with some fiddling it should also work with a named function.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

computeTech1UDF = udf(
    lambda c: 0 if c != VALUE_TO_COMPARE else 1, IntegerType())
delta2 = delta.withColumn("tech1", computeTech1UDF(col("col1")))
What you tried did not work since you did not provide withColumn with a column expression (see http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn). Using the UDF wrapper achieves exactly that.