Syntax error for pyspark.sql.functions.from_utc_timestamp(timestamp, tz) - apache-spark

I tried importing pyspark.sql.functions.from_utc_timestamp(timestamp, tz), but it always shows an invalid syntax error. How do I use this function to convert a set of values in epoch time to UTC in Spark?

The syntax error comes from putting the call signature in the import statement; import just the name. Some example code (taken from the docs, and modified) to do what you want:
>>> from pyspark.sql.functions import from_utc_timestamp
>>> df = sqlContext.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(from_utc_timestamp(df.t, "PST").alias('t')).collect()
[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]
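The question also asks about epoch values. A minimal sketch for that case, assuming the epochs are Unix seconds; from_unixtime renders them as strings in the session time zone, so the exact output below will vary with your configuration:
>>> from pyspark.sql.functions import from_unixtime
>>> epoch_df = sqlContext.createDataFrame([(1428476400,)], ['unix_time'])
>>> epoch_df.select(from_unixtime(epoch_df.unix_time).alias('ts')).collect()
[Row(ts=u'2015-04-08 00:00:00')]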

Related

Extract date-month-year from UTC format string in Python

I am using the below code to extract date-month-year from a date string; however, it gives me the error
time data '2020-05-11T04:47:54.530000' does not match format '%Y-%m-%d %H:%M:%S.%f'
Can anyone help?
from datetime import datetime
cr_date="2020-05-11T04:47:54.530000"
datetime.strptime(cr_date, '%Y-%m-%d %H:%M:%S.%f').strftime('%m/%d/%Y')
Regarding your own code, just add the T to the format string; see the following:
from datetime import datetime
cr_date = "2020-05-11T04:47:54.530000"
date_object = datetime.strptime(cr_date, '%Y-%m-%dT%H:%M:%S.%f').strftime('%m/%d/%Y')
# date_object == '05/11/2020'
Another way to solve this is using a regex to pull out just the date part:
import re
from datetime import datetime
cr_date="2020-05-11T04:47:54.530000"
match = re.search(r'\d{4}-\d{2}-\d{2}', cr_date)
date = datetime.strptime(match.group(), '%Y-%m-%d').date()
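That gives a date object, which you can format the same way:
date.strftime('%m/%d/%Y')
# '05/11/2020'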
If you use Python 3.7 or higher, use fromisoformat:
from datetime import datetime
cr_date="2020-05-11T04:47:54.530000"
datetime.fromisoformat(cr_date)
# datetime.datetime(2020, 5, 11, 4, 47, 54, 530000)
You can strftime that however you want now, e.g.
datetime.fromisoformat(cr_date).strftime('%m/%d/%Y')
# '05/11/2020'

Is there a quick way to find the global quantile of a pandas dataframe?

I want to find the 0.2 quantile (20th percentile) of a pandas dataframe as a whole, not per column. I know that the .quantile function can find quantiles along a specific axis, but is there a fast shortcut to the quantile of the whole dataframe, given that all of its columns are integers?
Example of the desired result, using a pandas series as an intermediary:
>>> import pandas as pd
>>> df = pd.DataFrame(data={1: [55, 11, 13, 9, 11],
...                         2: [56, 75, 31, 1, 25]})
>>> df.quantile(.2) # this finds two quantiles, one per column
1 10.6
2 20.2
Name: 0.2, dtype: float64
# The workaround
>>> s = df[1].append(df[2])
>>> s.quantile(.2)
10.6
You can use numpy's np.quantile [numpy-doc] for that; by default it operates on the flattened array, which gives exactly the global quantile:
>>> import numpy as np
>>> np.quantile(df, 0.2)
10.6
Or we can use the numpy import that pandas exposes directly (note that pd.np is deprecated in recent pandas versions, so prefer importing numpy yourself):
>>> pd.np.quantile(df, 0.2)
10.6
Another option is melt, which stacks all columns into a single value column first:
df.melt().value.quantile(0.2)
Out[309]: 10.6
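All of these agree with the series workaround. As a further sketch, a pure-pandas equivalent is to stack the frame into one Series before taking the quantile:
>>> df.stack().quantile(0.2)
10.6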

KeyError in time/date components of datetime - what to do?

I am using a pandas DataFrame with datetime indexing. I know from the xarray documentation that datetime indexing can be done as ds['date.year'], with ds being the xarray DataArray, date the date index, and year the year component of the dates. Xarray points to datetime components, which in turn leads to DatetimeIndex, the latter being pandas documentation. So I thought of doing the same with pandas, as I really like this feature.
However, it is not working for me. Here is what I did so far:
# Import required modules
import pandas as pd
import numpy as np
# Create DataFrame (name: df)
df = pd.DataFrame({'Date': ['2017-04-01', '2017-04-01',
                            '2017-04-02', '2017-04-02'],
                   'Time': ['06:00:00', '18:00:00',
                            '06:00:00', '18:00:00'],
                   'Active': [True, False, False, True],
                   'Value': np.random.rand(4)})
# Combine str() information of Date and Time and format to datetime
df['Date']=pd.to_datetime(df['Date'] + ' ' + df['Time'],format = '%Y-%m-%d %H:%M:%S')
# Make the combined data the index
df = df.set_index(df['Date'])
# Erase the rest, as it is not required anymore
df = df.drop(['Time','Date'], axis=1)
# Show me the first day
df['2017-04-01']
OK, so this shows me only the first day's entries. So far, so good.
However
df['Date.year']
results in KeyError: 'Date.year'
I would expect an output like
array([2017,2017,2017,2017])
What am I doing wrong?
EDIT:
I have a workaround which I can go on with, but I am still not satisfied, as it doesn't answer my question. Instead of a pandas DataFrame I used an xarray Dataset, and now this works:
# Load modules
import pandas as pd
import numpy as np
import xarray as xr
# Prepare time array
Date = ['2017-04-01','2017-04-01', '2017-04-02','2017-04-02']
Time = ['06:00:00','18:00:00', '06:00:00','18:00:00']
time = [Date[i] + ' ' + Time[i] for i in range(len(Date))]
time = pd.to_datetime(time,format = '%Y-%m-%d %H:%M:%S')
# Create Dataset (name: ds)
ds = xr.Dataset({'time': time,
                 'Active': [True, False, False, True],
                 'Value': np.random.rand(4)})
ds['time.year']
which gives:
<xarray.DataArray 'year' (time: 4)>
array([2017, 2017, 2017, 2017])
Coordinates:
* time (time) datetime64[ns] 2017-04-01T06:00:00 ... 2017-04-02T18:00:00
Just in terms of what you're doing wrong, you are
a) trying to call an index as a series, and
b) chaining commands within a string: df['Date'] is a single column, while df['Date.year'] looks for a column literally named 'Date.year'.
If your datetime is the index, use .year on the index; if it's a series, use .dt.year:
df.index.year
# or, assuming your dtype is a proper datetime (your code indicates it is):
df.Date.dt.year
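For the frame built above, the index route gives something like this (the exact repr varies with the pandas version):
df.index.year
# Int64Index([2017, 2017, 2017, 2017], dtype='int64', name='Date')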
Hope that helps, bud.

datetime match format when the month is a string

I don't know which matching format to use when the month is a string. I have the following date to convert, 11-OCT-2017, for which %d-%m-%Y isn't working. I'm using the matplotlib.dates module.
matplotlib.dates imports dateutil's parser, so you could use that:
>>> import matplotlib.dates
>>> matplotlib.dates.dateutil.parser.parse('11-OCT-2017')
datetime.datetime(2017, 10, 11, 0, 0)
Or, if you are trying to parse into a matplotlib datenum, then the month format is %b:
>>> parse = matplotlib.dates.strpdate2num('%d-%b-%Y')
>>> parse('11-OCT-2017')
736613.0
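Note that strpdate2num was deprecated in newer matplotlib releases and later removed. A sketch of the same conversion using datetime.strptime with date2num; matplotlib 3.3 also changed the default date epoch to 1970-01-01, which is why this returns 17450.0 instead of 736613.0 on recent versions:
>>> from datetime import datetime
>>> from matplotlib.dates import date2num
>>> date2num(datetime.strptime('11-OCT-2017', '%d-%b-%Y'))
17450.0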

Why does a Python UDF return unexpected datetime objects, whereas the same function applied over an RDD gives proper datetime objects?

I am not sure if I am doing anything wrong, so pardon me if this looks naive. My problem is reproducible with the following data:
from pyspark.sql import Row
df = sc.parallelize([Row(C3=u'Dec 1 2013 12:00AM'),
                     Row(C3=u'Dec 1 2013 12:00AM'),
                     Row(C3=u'Dec 5 2013 12:00AM')]).toDF()
I have created a function to parse these date strings as datetime objects for further processing:
from datetime import datetime
def date_convert(date_str):
    date_format = '%b %d %Y %I:%M%p'
    try:
        dt = datetime.strptime(date_str, date_format)
    except ValueError as v:
        if len(v.args) > 0 and v.args[0].startswith('unconverted data remains: '):
            # Strip the trailing characters strptime could not parse, then retry
            # ('unconverted data remains: ' is 26 characters long)
            date_str = date_str[:-(len(v.args[0]) - 26)]
            dt = datetime.strptime(date_str, date_format)
        else:
            raise
    return dt
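A quick local check of the parser (this matches the RDD output further down):
date_convert(u'Dec 1 2013 12:00AM')
# datetime.datetime(2013, 12, 1, 0, 0)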
Now if I make a UDF out of this and apply it to my dataframe, I get unexpected data:
from pyspark.sql.functions import udf
date_convert_udf = udf(date_convert)
df.select(date_convert_udf(df.C3).alias("datetime")).take(2)
The result looks like this:
Out[40]:
[Row(datetime=u'java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2013,MONTH=11,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=1,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]'),
Row(datetime=u'java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2013,MONTH=11,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=1,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]')]
but if I use it after converting the dataframe to an RDD, it returns a proper Python datetime object:
df.rdd.map(lambda row:date_convert(row.C3)).collect()
Out[42]:
[datetime.datetime(2013, 12, 1, 0, 0),
datetime.datetime(2013, 12, 1, 0, 0),
datetime.datetime(2013, 12, 5, 0, 0)]
I want to achieve the same thing with the dataframe. How can I do that, and what is wrong with this approach (a UDF over a dataframe)?
It's because you have to set the return type of your UDF. Without an explicit type, udf defaults to returning strings, so the datetime objects are pushed through Spark's internal Java representation and stringified, which is where the GregorianCalendar text comes from. Since you are trying to obtain timestamps, declare the return type like this:
from pyspark.sql.types import TimestampType
date_convert_udf = udf(date_convert, TimestampType())
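With the typed UDF, the same select now yields proper datetimes:
df.select(date_convert_udf(df.C3).alias("datetime")).take(2)
# [Row(datetime=datetime.datetime(2013, 12, 1, 0, 0)),
#  Row(datetime=datetime.datetime(2013, 12, 1, 0, 0))]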
