how to convert string date time to int using python - python-3.x

I am doing data preprocessing and trying to convert a date string column into an int, but I get an error. Please help me convert it.
I have data like this:
0 Apr-12
1 Apr-12
2 Mar-12
3 Apr-12
4 Apr-12
I tried this:
d=df['d_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
I got an error.
ValueError Traceback (most recent call last)
<ipython-input-134-173081812744> in <module>()
----> 1 d=test['first_payment_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4036 else:
4037 values = self.astype(object).values
-> 4038 mapped = lib.map_infer(values, f, convert=convert_dtype)
4039
4040 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-134-173081812744> in <lambda>(x)
----> 1 d=test['first_payment_date'].apply(lambda x: datetime.strptime(x, '%m%Y'))
~\Anaconda3\lib\_strptime.py in _strptime_datetime(cls, data_string, format)
563 """Return a class cls instance based on the input string and the
564 format string."""
--> 565 tt, fraction = _strptime(data_string, format)
566 tzname, gmtoff = tt[-2:]
567 args = tt[:6] + (fraction,)
~\Anaconda3\lib\_strptime.py in _strptime(data_string, format)
360 if not found:
361 raise ValueError("time data %r does not match format %r" %
--> 362 (data_string, format))
363 if len(data_string) != found.end():
364 raise ValueError("unconverted data remains: %s" %
ValueError: time data 'Apr12' does not match format '%m%Y'

IIUC, you need to use %b-%y, since Apr is %b and 12 is %y. Refer to Python's strftime directives for more information. Once you convert to datetime objects, you can then convert them to UNIX timestamps.
df:
col
0 Apr-12
1 Apr-12
For an int datetime:
pd.Series(pd.to_datetime(df['col'], format='%b-%y').values.astype(float)).div(10**9)
Output:
0 1.333238e+09
1 1.333238e+09
dtype: float64
Explanation:
pd.to_datetime(df['col'], format='%b-%y')
Outputs:
0 2012-04-01
1 2012-04-01
Name: col, dtype: datetime64[ns]
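If a true integer is needed rather than a float, a minimal alternative sketch (assuming the same df['col'] as above) is to view the parsed datetime64[ns] values as int64 nanoseconds and floor-divide down to whole seconds:
import pandas as pd

df = pd.DataFrame({'col': ['Apr-12', 'Apr-12', 'Mar-12']})

# Parse abbreviated month / two-digit year, view as int64 nanoseconds since
# the UNIX epoch, then floor-divide down to whole seconds
seconds = pd.to_datetime(df['col'], format='%b-%y').astype('int64') // 10**9
print(seconds)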

Related

How to subset a xarray.Dataset according to lat/lon values taken from a SRTM DEM extents

I have a year-wise (1980-2020) precipitation dataset in netCDF format. I am importing the files with xarray to get 40 years of merged precipitation values:
import netCDF4
import numpy
import xarray as xr
import pandas as pd
prcp=xr.open_mfdataset('/home/hrsa/Sayantan/HAR_V2/prcp/HARv2_d10km_d_2d_prcp_*.nc',combine = 'nested', concat_dim="time")
prcp
which renders:
xarray.Dataset
Dimensions:      time: 14976, west_east: 381, south_north: 252
Coordinates:
    time         (time)                   datetime64[ns]  1980-01-01 ... 2020-12-31
    west_east    (west_east)              float32         -1.675e+06 -1.665e+06 ... 2.125e+06
    south_north  (south_north)            float32         -7.45e+05 -7.35e+05 ... 1.765e+06
    lon          (south_north, west_east) float32         dask.array<chunksize=(252, 381), meta=np.ndarray>
    lat          (south_north, west_east) float32         dask.array<chunksize=(252, 381), meta=np.ndarray>
Data variables:
    prcp         (time, south_north, west_east) float32   dask.array<chunksize=(366, 252, 381), meta=np.ndarray>
Attributes: (33)
This is a large dataset, so I need to subset it according to an SRTM image whose extents (in EPSG:4326) are defined as
# Extents of the SRTM DEM covering Panchi_B and the SASE AWS/Base Camp
min_lon = 77.0
min_lat = 32.0
max_lon = 78.0
max_lat = 33.0
In order to subset according to the above coordinates I have tried the following:
prcp = prcp.sel(lat = slice(min_lat,max_lat), lon = slice(min_lon,max_lon))
The error output:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/.pyenv/versions/3.9.7/envs/v3.9.7/lib/python3.9/site-packages/xarray/core/indexing.py:73, in group_indexers_by_index(data_obj, indexers, method, tolerance)
72 try:
---> 73 index = xindexes[key]
74 coord = data_obj.coords[key]
KeyError: 'lat'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
Input In [25], in <cell line: 1>()
----> 1 prcp = prcp.sel(lat = slice(min_lat,max_lat), lon = slice(min_lon,max_lon))
File ~/.pyenv/versions/3.9.7/envs/v3.9.7/lib/python3.9/site-packages/xarray/core/dataset.py:2501, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2440 """Returns a new dataset with each array indexed by tick labels
2441 along the specified dimension(s).
2442
(...)
2498 DataArray.sel
2499 """
2500 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2501 pos_indexers, new_indexes = remap_label_indexers(
2502 self, indexers=indexers, method=method, tolerance=tolerance
2503 )
2504 # TODO: benbovy - flexible indexes: also use variables returned by Index.query
2505 # (temporary dirty fix).
2506 new_indexes = {k: v[0] for k, v in new_indexes.items()}
File ~/.pyenv/versions/3.9.7/envs/v3.9.7/lib/python3.9/site-packages/xarray/core/coordinates.py:421, in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
414 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "remap_label_indexers")
416 v_indexers = {
417 k: v.variable.data if isinstance(v, DataArray) else v
418 for k, v in indexers.items()
419 }
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
424 # attach indexer's coordinate to pos_indexers
425 for k, v in indexers.items():
File ~/.pyenv/versions/3.9.7/envs/v3.9.7/lib/python3.9/site-packages/xarray/core/indexing.py:110, in remap_label_indexers(data_obj, indexers, method, tolerance)
107 pos_indexers = {}
108 new_indexes = {}
--> 110 indexes, grouped_indexers = group_indexers_by_index(
111 data_obj, indexers, method, tolerance
112 )
114 forward_pos_indexers = grouped_indexers.pop(None, None)
115 if forward_pos_indexers is not None:
File ~/.pyenv/versions/3.9.7/envs/v3.9.7/lib/python3.9/site-packages/xarray/core/indexing.py:84, in group_indexers_by_index(data_obj, indexers, method, tolerance)
82 except KeyError:
83 if key in data_obj.coords:
---> 84 raise KeyError(f"no index found for coordinate {key}")
85 elif key not in data_obj.dims:
86 raise KeyError(f"{key} is not a valid dimension or coordinate")
KeyError: 'no index found for coordinate lat'
How can I resolve this issue? Any help will be appreciated. Thank you.
############# Edit (for @Robert Wilson) ##################
In order to find out the ranges, I did the following:
lon = prcp.lon.to_dataframe()
lon
lat = prcp.lat.to_dataframe()
lat
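For context (a sketch only, not part of the original thread): since lat and lon here are two-dimensional coordinates without an index, sel() cannot slice on them directly; a boolean mask combined with where() is one way such a box subset is commonly built:
# Sketch: mask on the 2-D lat/lon coordinates and drop cells outside the box.
# Assumes the prcp Dataset and the min/max bounds defined in the question.
mask = (
    (prcp.lat >= min_lat) & (prcp.lat <= max_lat)
    & (prcp.lon >= min_lon) & (prcp.lon <= max_lon)
)
prcp_box = prcp.where(mask, drop=True)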

TypeError: 'float' object cannot be interpreted as an integer on linspace

TypeError Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
-->1 plot_freq(signal, sample_rate)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
2 def plot_freq(signal, sample_rate, fft_size=512):
3 xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4 freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
5 xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
6 plt.figure(figsize=(20, 5))
File <__array_function__ internals>:5, in linspace(*args, **kwargs)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
23 @array_function_dispatch(_linspace_dispatcher)
24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
25 axis=0):
26 """
27 Return evenly spaced numbers over a specified interval.
28
(...)
118
119 """
--> 120 num = operator.index(num)
121 if num < 0:
122 raise ValueError("Number of samples, %s, must be non-negative." % num)
TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
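For reference, np.linspace requires its num argument to be an integer, and fft_size/2 produces a float in Python 3; a minimal sketch of the usual fix (hypothetical function name, not from the original notebook) uses integer division:
import numpy as np

def plot_freq_spectrum(signal, sample_rate, fft_size=512):
    # fft_size // 2 keeps num an int, which np.linspace requires
    xf = np.fft.rfft(signal, fft_size) / fft_size
    freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)
    return freq, xf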

slice in xarray gives error 'float' object cannot be interpreted as an integer

I am trying to slice data by longitude using xarray.
The data is in a netcdf file I created from measurements I made.
The xarray.Dataset has the following attributes:
Dimensions:   (lat: 1321, lon: 1321)
Data variables:
    lon    (lon) float64 8.413 8.411 8.409 ... 4.904 4.905
    lat    (lat) float64 47.4 47.4 47.41 ... 52.37 52.37
    data   float64 ...   # dimension: 1321
my code is:
import xarray as xr
obs = xr.open_dataset('data.nc')
obs=obs['data'].sel(lon=slice(4.905, 8.413))
The error I get is TypeError: 'float' object cannot be interpreted as an integer
I could not find out whether it is an error in my code or an error in xarray. I would expect such an error when using isel instead of sel. I could not find any solution here or in the xarray documentation.
Full error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-434-5b37e4c5d0c6> in <module>
----> 1 obs=obs['data'].sel(lon=slice(4.905, 8.413))
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataarray.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
1059 method=method,
1060 tolerance=tolerance,
-> 1061 **indexers_kwargs,
1062 )
1063 return self._from_temp_dataset(ds)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2066 self, indexers=indexers, method=method, tolerance=tolerance
2067 )
-> 2068 result = self.isel(indexers=pos_indexers, drop=drop)
2069 return result._overwrite_indexes(new_indexes)
2070
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py in isel(self, indexers, drop, **indexers_kwargs)
1933 var_indexers = {k: v for k, v in indexers.items() if k in var_value.dims}
1934 if var_indexers:
-> 1935 var_value = var_value.isel(var_indexers)
1936 if drop and var_value.ndim == 0 and var_name in coord_names:
1937 coord_names.remove(var_name)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in isel(self, indexers, **indexers_kwargs)
1058
1059 key = tuple(indexers.get(dim, slice(None)) for dim in self.dims)
-> 1060 return self[key]
1061
1062 def squeeze(self, dim=None):
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in __getitem__(self, key)
701 array `x.values` directly.
702 """
--> 703 dims, indexer, new_order = self._broadcast_indexes(key)
704 data = as_indexable(self._data)[indexer]
705 if new_order:
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in _broadcast_indexes(self, key)
540
541 if all(isinstance(k, BASIC_INDEXING_TYPES) for k in key):
--> 542 return self._broadcast_indexes_basic(key)
543
544 self._validate_indexers(key)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in _broadcast_indexes_basic(self, key)
568 dim for k, dim in zip(key, self.dims) if not isinstance(k, integer_types)
569 )
--> 570 return dims, BasicIndexer(key), None
571
572 def _validate_indexers(self, key):
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in __init__(self, key)
369 k = int(k)
370 elif isinstance(k, slice):
--> 371 k = as_integer_slice(k)
372 else:
373 raise TypeError(
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in as_integer_slice(value)
344
345 def as_integer_slice(value):
--> 346 start = as_integer_or_none(value.start)
347 stop = as_integer_or_none(value.stop)
348 step = as_integer_or_none(value.step)
~/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in as_integer_or_none(value)
340
341 def as_integer_or_none(value):
--> 342 return None if value is None else operator.index(value)
343
344
I want to select the entire data because eventually I want to subtract the whole array from a bigger dataset with a wider grid. That bigger dataset is a netCDF file as well, and for that one I managed to slice the longitude with the exact same code I am trying on this smaller dataset where I get the error. The only difference is that the bigger netCDF uses a float32 format, and I don't suspect this could cause the error.
Any help is appreciated. Thank you.
I think I found the problem.
When I created the netcdf file for the observations, I made a mistake in the createDimension part when I named the lon and lat dimensions. Because of this, lat and lon showed up under 'Data variables' in the netcdf file, when they should have shown up under 'Coordinates'.
wrong was something like:
# Specifying dimensions (names do not match the variable names below)
f.createDimension('longitude', len(lon_list))
f.createDimension('latitude', len(lat_list))
# Building variables: 'lon' and 'lat' are not coordinate variables here,
# because their names differ from the dimensions they are defined on
longitude = f.createVariable('lon', float, ('longitude',), zlib=True)
latitude = f.createVariable('lat', float, ('latitude',), zlib=True)
data = f.createVariable('data', float, ('latitude', 'longitude'), zlib=True)
correct was:
# Specifying dimensions (names match the variable names)
f.createDimension('lon', len(lon_list))
f.createDimension('lat', len(lat_list))
# Building variables: 'lon' and 'lat' now match their dimension names, so
# xarray picks them up as indexed coordinates
longitude = f.createVariable('lon', float, ('lon',), zlib=True)
latitude = f.createVariable('lat', float, ('lat',), zlib=True)
data = f.createVariable('data', float, ('lat', 'lon'), zlib=True)
This is a little late, but I just ran into a similar issue with the same undecipherable error when trying to slice by a variable.
I think the problem is that if you try to slice by a variable that isn't a coordinate, you get an error that isn't very informative.
data = data.assign_coords({"lat":data.lat,"lon":data.lon})
would have fixed this without rewriting the netcdf file.
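With either fix, lat and lon end up under Coordinates and label-based selection works; a quick check might look like this (a sketch, reusing the data.nc name from the question):
import xarray as xr

obs = xr.open_dataset('data.nc')
print(obs.coords)                                   # lat and lon should be listed here
sliced = obs['data'].sel(lon=slice(4.905, 8.413))   # no longer raises TypeError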

error using resample, says 'only valid with DatetimeIndex' even though I used to_datetime and set_index

I have a dataframe with two columns: 'case' and 'outDateTime'.
index     case          outDateTime
71809     10180227.0    2013-01-01 01:41:01
71810     10180229.0    2013-01-01 04:20:05
71811     10180230.0    2013-01-01 06:20:22
575       10180232.0    2013-01-01 02:01:13
23757     10180233.0    2013-01-01 01:48:49
My goal is to count the number of cases in each hour of every day. For this data it would be:
2013-01-01 01AM = 2
2013-01-01 02AM = 1
2013-01-01 03AM = 0
2013-01-01 04AM = 1
and so on
I wanted to use df.resample('H').agg(), but I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I find it weird because I used to_datetime and set_index, so my dataframe's index is the datetime and its type is datetime.
My code:
pd.to_datetime(Dout['outDateTime'],dayfirst=True)
Dout.set_index('outDateTime',inplace=True)
Dout.isnull().values.any()
False
Dout = Dout.resample('H').agg()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-196-75f9fbadd6cc> in <module>
----> 1 Dout = Dout.resample('H').agg()
~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base, on, level)
7108 axis=axis, kind=kind, loffset=loffset,
7109 convention=convention,
-> 7110 base=base, key=on, level=level)
7111 return _maybe_process_deprecations(r,
7112 how=how,
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in resample(obj, kind, **kwds)
1146 """ create a TimeGrouper and return our resampler """
1147 tg = TimeGrouper(**kwds)
-> 1148 return tg._get_resampler(obj, kind=kind)
1149
1150
~/anaconda3/lib/python3.7/site-packages/pandas/core/resample.py in _get_resampler(self, obj, kind)
1274 raise TypeError("Only valid with DatetimeIndex, "
1275 "TimedeltaIndex or PeriodIndex, "
-> 1276 "but got an instance of %r" % type(ax).__name__)
1277
1278 def _get_grouper(self, obj, validate=True):
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
I think you are looking for:
df.set_index('outDateTime').resample('H').size()
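Note that the question's snippet never assigns the pd.to_datetime result back to the column, which is why the index stays a plain Index; a minimal sketch of the full pipeline (with the dataframe rebuilt from the sample rows above) would be:
import pandas as pd

# Rebuild the sample dataframe from the question
Dout = pd.DataFrame({
    'case': [10180227.0, 10180229.0, 10180230.0, 10180232.0, 10180233.0],
    'outDateTime': ['2013-01-01 01:41:01', '2013-01-01 04:20:05',
                    '2013-01-01 06:20:22', '2013-01-01 02:01:13',
                    '2013-01-01 01:48:49'],
})

# Assign the converted column back, set it as the index, then count rows per hour
Dout['outDateTime'] = pd.to_datetime(Dout['outDateTime'])
counts = Dout.set_index('outDateTime').resample('H').size()
print(counts)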

fit_transform error using CountVectorizer

So I have a dataframe X which looks something like this:
X.head()
0 My wife took me here on my birthday for breakf...
1 I have no idea why some people give bad review...
3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4 General Manager Scott Petello is a good egg!!!...
6 Drop what you're doing and drive here. After I...
Name: text, dtype: object
And then,
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-8ff79b91e317> in <module>()
----> 1 X = cv.fit_transform(X)
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
790 for doc in raw_documents:
791 feature_counter = {}
--> 792 for feature in analyze(doc):
793 try:
794 feature_idx = vocabulary[feature]
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
264
265 return lambda doc: self._word_ngrams(
--> 266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
230
231 if self.lowercase:
--> 232 return lambda x: strip_accents(x.lower())
233 else:
234 return strip_accents
~/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
574 return self.getnnz()
575 else:
--> 576 raise AttributeError(attr + " not found")
577
578 def transpose(self, axes=None, copy=False):
AttributeError: lower not found
No idea why.
You need to specify the column name of the text data even if the dataframe has a single column.
X_countMatrix = cv.fit_transform(X['text'])
Because CountVectorizer expects an iterable of documents as input, and when you supply a dataframe as an argument, the only thing that gets iterated is the column names. So even if you did not get an error, the result would be incorrect. Lucky that you got an error and had a chance to correct it.
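A minimal sketch of the difference (with a tiny made-up 'text' column standing in for the question's data):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame({'text': ['My wife took me here on my birthday for breakfast',
                           'I have no idea why some people give bad reviews']})

cv = CountVectorizer()
# Pass the Series of documents, not the whole DataFrame
X_countMatrix = cv.fit_transform(X['text'])
print(X_countMatrix.shape)   # (n_documents, n_vocabulary_terms)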
