How to use groupby function by leaving out leap day [duplicate] - python-3.x

I have a data frame in pandas like this:
ID Date Element Data_Value
0 USW00094889 2014-11-12 TMAX 22
1 USC00208972 2009-04-29 TMIN 56
2 USC00200032 2008-05-26 TMAX 278
3 USC00205563 2005-11-11 TMAX 139
4 USC00200230 2014-02-27 TMAX -106
I want to remove all leap days, and my code is
df = df[~((df.Date.month == 2) & (df.Date.day == 29))]
but an AttributeError occurred:
'Series' object has no attribute 'month'
What's wrong with my code?

Use dt accessor:
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]

Add the dt accessor because you are working with a Series, not with a DatetimeIndex:
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
Or invert the condition, chaining with | for bitwise OR and != for not equal:
df = df[(df.Date.dt.month != 2) | (df.Date.dt.day != 29)]
Or use strftime to convert to MM-DD format:
df = df[df.Date.dt.strftime('%m-%d') != '02-29']

Another way you can try, below, in case your Date column is not a proper datetime but rather a str:
df[~df.Date.str.endswith('02-29')]
Or, if it's in datetime format, you can convert it to str:
df[~df.Date.astype(str).str.endswith('02-29')]
Or even use contains:
df[~df.Date.str.contains('02-29')]
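For reference, a minimal self-contained sketch (my own toy data, not from the question) showing the dt-accessor filter end to end:
import pandas as pd

# Toy frame; Date must be parsed to datetime first, otherwise .dt raises AttributeError
df = pd.DataFrame({'Date': ['2016-02-28', '2016-02-29', '2016-03-01'],
                   'Data_Value': [10, 20, 30]})
df['Date'] = pd.to_datetime(df['Date'])
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
print(df)  # the 2016-02-29 row is gone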

Related

"'Column' object is not callable" - ge le gt lt methods

I have a dataframe and I'm trying to filter based on end_date if it's >= or < a certain date.
However, I'm getting a "not callable" error.
line 148, in <module>
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
TypeError: 'Column' object is not callable
Here is my code:
df_x = df_x.join(df_di_meet, trim(df_x.application_id) == trim(df_di_meet.application_id), "left")\
.select(df_x["*"], df_di_meet["end_date"])
# ... Cast end_date to timestamp ...end_date format looks like 2013-12-20 23:59:00.0000000
df_x = df_x.withColumn("end_date",(col("end_date").cast("timestamp")))
# ... Here df_s1 >= 2022-08-17
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
#... Here df_s2 < 2022-08-17
df_s2 = df_x.filter(df_x["end_date"].lt(lit("2022-08-17")))
What I'm trying to do is check some additional logic as well, like the code below, but since it wasn't working with a when clause I decided to break the dataframes apart and check each one separately. Is there an easier way, or how could I get the code below to work?
df_x = df_x.withColumn("REV_STAT_TYP_DES", when((df_x.review_statmnt_type_desc == lit("")) & (df_x("end_date").ge(lit("2022-08-17"))), "Not Released")
when((df_x.review_statmnt_type_desc == lit("")) & ((df_x("end_date").lt(lit("2022-08-17"))) | (df_x.end_date == lit(""))), "Not Available")
.otherwise(None))
There are recommendations for making difficult code look cleaner; according to them, conditional statements may be better understood and maintained if they are separated into different variables. Look at how I've added isnull checks to some of the variables - it would have been a lot more difficult if they were not refactored into separate variables.
from pyspark.sql import functions as F
no_review = (F.col("review_statmnt_type_desc") == "") | F.isnull("review_statmnt_type_desc")
no_end_date = (F.col("end_date") == "") | F.isnull("end_date")
not_released = no_review & (F.col("end_date") >= F.lit("2022-08-17"))
not_available = no_review & ((F.col("end_date") < F.lit("2022-08-17")) | no_end_date)
Also, you don't need the otherwise clause if it returns null (that's the default behaviour).
df_x = df_x.withColumn(
"REV_STAT_TYP_DES",
F.when(not_released, "Not Released")
.when(not_available, "Not Available")
)
df_x("end_date") --> This is wrong way of accessing a spark dataframe column. That's why python is assuming it as a callable and you are getting that error.
df_x["end_date"] --> This is how you should access the column (or df_x.end_date)
UPDATE:
Only now noticed: .ge() or .le() style methods won't work with Spark dataframe Column objects. You can use any of the below ways of filtering:
from pyspark.sql.functions import col
df_s1 = df_x.filter(df_x["end_date"] >='2022-08-17')
# OR
df_s1 = df_x.filter(df_x.end_date>='2022-08-17')
# OR
df_s1 = df_x.filter(col('end_date')>='2022-08-17')
# OR
df_s1 = df_x.filter("end_date>='2022-08-17'")
# OR
# you can use df_x.where() instead of df_x.filter
You probably got confused between pandas and pyspark. Anyway, this is how you do it:
DataFrame
df = spark.createDataFrame([("2022-08-16",1),("2019-06-24",2),("2022-08-19",3)]).toDF("date","increment")
Pyspark
from pyspark.sql.functions import col, lit, to_date

df_x = df.withColumn('date', to_date('date'))
df_x.filter(col('date') > to_date(lit("2022-08-17"))).show()
Pandas
df_x = df.toPandas()
df_s1 = df_x.assign(date= pd.to_datetime(df_x['date'])).query("date.gt('2022-08-17')", engine='python')
or
df_x[df_x['date']>'2022-08-17']
Use SQL-style free-form case/when syntax in the expr() function. That way it is also portable.
df_x = (df_x.withColumn("REV_STAT_TYP_DES",
        expr(""" case
                   when review_statmnt_type_desc = '' and end_date >= '2022-08-17' then 'Not Released'
                   when review_statmnt_type_desc = '' and (end_date < '2022-08-17' or end_date is null) then 'Not Available'
                   else null
                 end
             """)))

How can I create a column that flags when another datetime column has changed date?

How can I create a column 'Marker' that flags (0 or 1) when another datetime column 'DT' has changed date?
df = pd.DataFrame()
df['Obs']=float_array
df['DT'] = pd.to_datetime(datetime_array)
df['Marker'] = 0
print(type(df['DT'].dt))
<class 'pandas.core.indexes.accessors.DatetimeProperties'>
df['Marker'] = df.where(datetime.date(df.DT.dt) == datetime.date(df.DT.shift(1).dt),1)
TypeError: an integer is required (got type DatetimeProperties)
Use Series.dt.date to convert to dates; to convert True/False to 1/0, use Series.view:
df['Marker'] = (df.DT.dt.date != df.DT.dt.date.shift()).view('i1')
Or numpy.where:
df['Marker'] = np.where(df.DT.dt.date == df.DT.dt.date.shift(), 0, 1)
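A quick toy run (my own sample data) of the numpy.where variant:
import pandas as pd
import numpy as np

df = pd.DataFrame({'DT': pd.to_datetime(['2023-01-01 09:00',
                                         '2023-01-01 17:00',
                                         '2023-01-02 08:00'])})
df['Marker'] = np.where(df.DT.dt.date == df.DT.dt.date.shift(), 0, 1)
print(df['Marker'].tolist())  # [1, 0, 1] - the first row flags 1 because shift() yields a missing value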

How to apply a function with multiple arguments to a specific column in Pandas?

I'm trying to apply a function to a specific column in this dataframe
datetime PM2.5 PM10 SO2 NO2
0 2013-03-01 7.125000 10.750000 11.708333 22.583333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667
2 2013-03-03 76.916667 120.541667 61.291667 81.000000
3 2013-03-04 22.708333 44.583333 22.854167 46.187500
4 2013-03-06 223.250000 265.166667 116.236700 142.059383
5 2013-03-07 263.375000 316.083333 97.541667 147.750000
6 2013-03-08 221.458333 297.958333 69.060400 120.092788
I'm trying to apply this function(below) to a specific column(PM10) of the above dataframe:
range1 = [list(range(0,50)), list(range(51,100)), list(range(101,200)),
          list(range(201,300)), list(range(301,400)), list(range(401,2000))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val
Where "x" can be any column and "y" = Range1
Available Options
df.PM10.apply(c1_c2,args(df.PM10,range1),axis=1)
df.PM10.apply(c1_c2)
I've tried these couple of available options and none of them seems to be working. Any suggestions?
Not sure what the expected output from the function is, but to get the function called you can try the following:
from functools import partial
df.PM10.apply(partial(c1_c2, y=range1))
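A minimal sketch of that partial() call on toy integer data (my own values, chosen so the membership test can actually match):
import pandas as pd
from functools import partial

range1 = [list(range(0, 50)), list(range(51, 100))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            return max(a) + 1 - min(a)

s = pd.Series([10, 60])
print(s.apply(partial(c1_c2, y=range1)).tolist())  # [50, 49]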
Update:
OK, I think I understand a little better. This should work, but range1 is a list of lists of integers; your data doesn't have integers, so the new column comes up empty. I created another list based on your initial data that works. See below:
df = pd.read_csv('pm_data.txt', header=0)
range1 = [[7.125000, 10.750000, 11.708333, 22.583333], list(range(0,50)), list(range(51,100)),
          list(range(101,200)), list(range(201,300)), list(range(301,400)), list(range(401,2000))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val
df['function']=df.PM10.apply(lambda x: c1_c2(x,range1))
print(df.head(10))
datetime PM2.5 PM10 SO2 NO2 new_column function
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000 16.458333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167 NaN
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083 NaN
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167 NaN
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333 NaN
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167 NaN
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917 NaN
Only the first item in 'function' had a match, because it came from your initial data and so satisfied 'if x in a'.
Old Code:
I'm also not sure what you are doing, but you can use a lambda to modify columns or create new ones, like this:
import pandas as pd
I created a data file to import from the data you posted above:
datetime,PM2.5,PM10,SO2,NO2
2013-03-01,7.125000,10.750000,11.708333,22.583333
2013-03-02,30.750000,42.083333,36.625000,66.666667
2013-03-03,76.916667,120.541667,61.291667,81.000000
2013-03-04,22.708333,44.583333,22.854167,46.187500
2013-03-06,223.250000,265.166667,116.236700,142.059383
2013-03-07,263.375000,316.083333,97.541667,147.750000
2013-03-08,221.458333,297.958333,69.060400,120.092788
Here is how I import it,
df = pd.read_csv('pm_data.txt', header=0)
and create a new column and apply a function to the data in 'PM10'
df['new_column'] = df['PM10'].apply(lambda x: x+15 if x < 30 else x/20)
which yields,
datetime PM2.5 PM10 SO2 NO2 new_column
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917
Let me know if this helps.
"I've tried these couple of available options and none of them seems to be working..."
What do you mean by this? What's your output, are you getting errors or what?
I see a couple of problems:
the range1 lists contain int while your column values are float, so c1_c2() will return None.
even if the data types within range1 and the columns were the same, c1_c2() would return None when a value is not in range1.
Below is how I would do it, assuming the data-types match:
def c1_c2(x):
    range1 = [list of lists]
    for a in range1:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val
    return x  # returns the original value if not in range1
df.PM10.apply(c1_c2)
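If the column really holds floats, the membership test 'x in a' against lists of ints will almost never match. A float-friendly variant (my own adaptation, not part of the answer above) treats each sub-list as an interval and compares against its bounds:
def c1_c2_float(x, y):
    # interpret each sub-list as the closed interval [min(a), max(a)]
    for a in y:
        if min(a) <= x <= max(a):
            return max(a) + 1 - min(a)
    return x  # original value if no interval matches

df['function'] = df.PM10.apply(lambda x: c1_c2_float(x, range1))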

Pandas Dataframe - sort list element by date, when date is substring of element

I want to sort the data in each cell of the column named SESSIONS, based on a date (YYYY_MM_DD) that sits inside the elements (strings) forming the list. The SESSIONS column can hold various numbers of sessions and can also be empty. One cell of the SESSIONS column contains a list of sessions (like "li", which I put together as an example for testing).
Below is how it worked OK when doing it outside the df (2019_04_20 appears as the latest):
li = ['WE233JP_2015_03_03__13_31_21','WE238JP_2019_04_20__16_40_59','WE932LT_2017_10_12__08_35_49']
li.sort(key = lambda x: datetime.strptime(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x), '%Y_%m_%d'))
print(li)
When I try to apply it to the df with the codes below (2 attempts):
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(datetime.strptime(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x), '%Y_%m_%d')))
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x)))
In both cases I got an error: TypeError: expected string or bytes-like object
Simple non-date sorting like below works OK:
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(x))
Any suggestions on how to sort the df on an extracted part of a string formatted as a date?
Let's try Series.map with a custom sort key function.
Sample `df`:
sessions
0 [WE233JP_2015_03_03__13_31_21, WE238JP_2019_04_20__16_40_59, WE932LT_2017_10_12__08_35_49]
1 NaN
import re
import numpy as np

sort_func = lambda x: pd.to_datetime(re.findall(r'^[^_]+_(.+)__.+', x)[0],
                                     format='%Y_%m_%d', errors='coerce')
df['sorted_sessions'] = df.sessions.map(lambda y: sorted(y, key=sort_func)
                                        if y is not np.nan else y)
Out[1455]:
sessions \
0 [WE233JP_2015_03_03__13_31_21, WE238JP_2019_04_20__16_40_59, WE932LT_2017_10_12__08_35_49]
1 NaN
sorted_sessions
0 [WE233JP_2015_03_03__13_31_21, WE932LT_2017_10_12__08_35_49, WE238JP_2019_04_20__16_40_59]
1 NaN
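Note the errors='coerce' in sort_func: any element whose middle part is not a valid YYYY_MM_DD becomes NaT instead of raising, so one malformed session string won't abort the whole operation.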

Convert pandas column from object type [] in python 3

I have read this Pandas: convert type of column and this How to convert datatype:object to float64 in python?
I have current output of df:
Day object
Time object
Open float64
Close float64
High float64
Low float64
Day Time Open Close High Low
0 ['2019-03-25'] ['02:00:00'] 882.2 882.6 884.0 882.1
1 ['2019-03-25'] ['02:01:00'] 882.9 882.9 883.4 882.9
2 ['2019-03-25'] ['02:02:00'] 882.8 882.8 883.0 882.7
So I cannot use this:
day_ = df.loc[df['Day'] == '2019-06-25']
My final purpose is to extract rows from the df by filtering the value of the column "Day" on a specific condition.
I think the reason the df.loc above fails to execute is that the dtype of Day is object, so I cannot execute df.loc.
So I try to convert the above df to something like this:
Day Time Open Close High Low
0 2019-03-25 ['02:00:00'] 882.2 882.6 884.0 882.1
1 2019-03-25 ['02:01:00'] 882.9 882.9 883.4 882.9
2 2019-03-25 ['02:02:00'] 882.8 882.8 883.0 882.7
I have tried:
df=pd.read_csv('output.csv')
df = df.convert_objects(convert_numeric=True)
#df['Day'] = df['CTR'].str.replace('[','').astype(np.float64)
df['Day'] = pd.to_numeric(df['Day'].str.replace(r'[,.%]',''))
But it does not work, with an error like this:
ValueError: Unable to parse string "['2019-03-25']" at position 0
I am a novice at pandas and this may be a duplicate!
Please help me find a solution. Thanks a lot.
Try this, I hope it works.
First remove the list brackets from Day, then filter using .loc:
df = pd.DataFrame(data={'Day':[['2016-05-12']],
'day2':[['2016-01-01']]})
df['Day'] = df['Day'].apply(''.join)
df['Day'] = pd.to_datetime(df['Day']).dt.date.astype(str)
days_df=df.loc[df['Day'] == '2016-05-12']
Second Solution
If the list is stored as a string:
from ast import literal_eval
df2 = pd.DataFrame(data={'Day':["['2016-05-12']"],
'day2':["['2016-01-01']"]})
df2['Day'] = df2['Day'].apply(literal_eval)
df2['Day'] = df2['Day'].apply(''.join)
df2['Day'] = pd.to_datetime(df2['Day']).dt.date.astype(str)
days_df=df2.loc[df2['Day'] == '2016-05-12']
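With either approach, Day ends up as a plain 'YYYY-MM-DD' string, so the equality filter from the question (df.loc[df['Day'] == '2019-06-25']) works unchanged.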
