Python xarray: Processing data for a loop with method='nearest' at different locations - subset

Is it possible to have an xarray with multiple columns all having the same coordinates? In The following example I create an xarray and then I want to extract time series data at different locations. However, to do this I have to create a numpy array to store this data and its coordinates.
#Sample from the data in the netCDF file
ds['temp'] = xr.DataArray(data=np.random.rand(2,3,4), dims=['time','lat','lon'],
coords=dict(time=pd.date_range('1900-1-1',periods=2,freq='D'),
lat=[25.,26.,27.],lon=[-85.,-84.,-83.,-82.]))
display(ds)
#lat and lon locations to extract temp values
locations=np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])
#Extract time series at different locations
temp=np.empty([ds.shape[0], len(locations)])
lat_lon=np.empty([len(locations),2])
for n in range(locations.shape[0]):
lat_lon[n,0]=ds.sel(lat=locations[n,0],
lon=locations[n,1], method='nearest').coords['lat'].values
lat_lon[n,1]=ds.sel(lat=locations[n,0],
lon=locations[n,1], method='nearest').coords['lon'].values
temp[:,n]=ds.sel(lat=locations[n,0],
lon=locations[n,1], method='nearest')
print(temp)
print(lat_lon)
#Find maximum temp for all locations:
temp=temp.max(1)
The output of this code is:
array([[[0.67465371, 0.0710136 , 0.03263631, 0.41050204],
[0.26447469, 0.46503577, 0.5739435 , 0.33725726],
[0.20353832, 0.01441925, 0.26728572, 0.70531547]],
[[0.75418953, 0.20321738, 0.41129902, 0.96464691],
[0.53046103, 0.88559914, 0.20876142, 0.98030988],
[0.48009467, 0.7906767 , 0.09548439, 0.61088112]]])
Coordinates:
time (time) datetime64[ns] 1900-01-01 1900-01-02
lat (lat) float64 25.0 26.0 27.0
lon (lon) float64 -85.0 -84.0 -83.0 -82.0
temp (time, lat, lon) float64 0.09061 0.6634 ... 0.5696 0.4438
Attributes: (0)
[[0.26447469 0.5739435 0.01441925]
[0.53046103 0.20876142 0.7906767 ]]
[[ 26. -85.]
[ 26. -83.]
[ 27. -84.]]
More simply, is there a way to find the maximum temp across all locations for every timestamp without creating the intermediate temp array?

When you create the sample data, you specify 3 values of latitude and 4 values of longitude. That means 12 values in total, on a 2D grid (3D if we add time).
When you want to query values for 3 specific points, you have to query each point individually. As far as I know, there are two ways to do that:
Write a loop and store the result on an intermediate array (your solution)
Stack dimensions and query longitude and latitude simultaneously.
First, you have to express your locations as a list/array of tuples:
locations=np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])
coords=[(coord[0], coord[1]) for coord in locations]
print(coords)
[(25.6, -84.7), (26.0, -83.0), (26.5, -84.1)]
Then you interpolate your data for the specified locations, stack latitude and longitude to a new dimension coord, select your points.
(ds
.interp(lon=locations[:,1], lat=locations[:,0], method='linear') # interpolate on the grid
.stack(coord=['lat','lon']) # from 3x3 grid to list of 9 points
.sel(coord=coords)) # select your three points
.temp.max(dim='coord') # get largest temp value from the coord dimension
)
array([0.81316195, 0.56967184]) # your largest values at both timestamps
The downside is that xarray doesn't support interpolation for unlabeled multi-index, which is why first you need to interpolate (NOT simply find the nearest neighbor) the grid on your set of latitudes and longitudes.

Related

How to graph Binance API Orderbook with Pandas-matplotlib?

the data comes in 3 columns after (orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
list has 100 values of each. timestamp is the same
the problem is that I don't know how to access/index the values inside each row list [price, volume] of each column
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid doing a loop.... there has to be a way to just plot the data
I hope someone can undertand my problem. I just want to plot price on x and volume on y. The goal is to make it live
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:13 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to native pandasonic type of datetime,
but bids is of object type (actually, a string).
and, as I suppose, this is the same when read from your input file.
And now the main task: The first step is to extract both numbers from bids,
convert them to float and int and save in respective columns:
orderbook = orderbook.join(orderbook.bids.str.extract(
r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or other plotting function.
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
pythonic list.
Then you can also convert timestamp column to datetime just like
above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with respective components of the
source column and you can create your graphics jus like above.

Using sparse matrix instead of numpy distance matrix

I have a list of car ids and a list of x,y coordinates.
I want to calculate the distance between each of the coordinates.
The problem is ,after trying for weeks now, the distance matrix has limits and I'm dealing with Gigas of files with a resulting matrix of millions of rows and columns.
Can this be done using sparse to make it more effecient?
list_coordinates = []
for line in coordinates.readlines():
list_coordinates.append(line.strip().split(','))
list_coordinates_int = [list(map(float, x)) for x in list_coordinates]
list_car_id = []
for line in car_ids.readlines():
list_car_id.append(line.strip().split(' '))
df = pd.DataFrame(list_coordinates_int, columns=['xcord', 'ycord'], index=list_car_id)
df2=pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
list_coordinates : [['875.88', '588.26'], ['751.49', '656.55']]
list_coordinates_int : [[875.88, 588.26], [751.49, 656.55]]
list_car_id : [['car.0', 'car2.0', 'car.0', 'car2.0', 'car.0']]
the resulting df2 is like this:
car.0 car2.0 car.4
car.0 0.000000 141.902770 0.702140
car2.0 141.902770 0.000000 141.205831
car.4 141.902770 0.702140 0.000000
is there a way i could get the same df2 using sparse or any other mehtod than distance matrix?

Does resampling two matrices of same size with same random state result in rows of same indices?

I have a data points in a csr numpy matrix and labels in a pandas series.
I want to do down sampling of the dataset.
I tried re-sampling the data points(matrix) and labels(pandas series) separately using same random state.
X4_train_undersampled = resample(X4_train,replace=False, n_samples=41615, random_state=123)
y_train_undersampled = resample(y_train, replace=False , n_samples=41615, random_state=123)
I want to whether this is the right method to do it.
if yes, how can i test if the same rows are sampled in data points and labels.
if No, please provide another way to do down-sampling.

How to use list comprehension in pandas to create a new series for a plot?

This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype. For example, a matrix of unique ratetype x avg customer number for that ratetype is what I want to create efficiently. The lambda expression for getting the rows where the value is equal to each individual ratetype then getting the average customer number for that type then creating a series based on these two lists that are equal in size and length and accurate, is way over my head for pandas.
The number of different ratetypes can be in the hundreds. Reading it into a list via lambda would logically be a better choice than hard coding each possibility, as the list is going to only increase in size and new variability.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df['ratetype','number_of_customers']
fixed = df.loc['ratetype']=='fixed']
avg_fixed_custno = fixed.mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]
My goal is to create and plot these subplots using matplot.pyplot.
data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
}
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)
fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')

How to classify a frequency domain data in Python 3? (could not convert string to float)

I've converted a DataFrame from time domain to frequency domain using :
df = np.fft.fft(df)
Now I need to classify the the data using several machine learning algorithms such as Random Forest and Gaussian Naive Bayes. The problem is I keep getting this error:
could not convert string to float: '(2.9510193818016135-0.47803712350473193j)'
I tried to convert the strings to floats in DataFrame but it is still giving me the same error.
How can I solve this problem in order to get my classification results?
Assuming your results are like the following form, you first need to cast to a real complex type:
In[84]:
# data setup
df = pd.DataFrame({'fft':['(2.9510193818016135-0.47803712350473193j)']})
df
Out[84]:
fft
0 (2.9510193818016135-0.47803712350473193j)
Now cast to complex type:
In[85]:
df['complex'] = df['fft'].apply(complex)
df
Out[85]:
fft complex
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
Now you can extract as polar coords using apply with cmath.polar:
In[86]:
import cmath
df['polar_x'],df['polar_y'] = df['complex'].apply(lambda x: cmath.polar(x)[0]), df['complex'].apply(lambda x: cmath.polar(x)[1])
df
Out[86]:
fft complex \
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
polar_x polar_y
0 2.989487 -0.160595
Now the dtypes are compatible so you can pass the float columns:
In[87]:
df.dtypes
Out[87]:
fft object
complex complex128
polar_x float64
polar_y float64
dtype: object
You can also use cmath.rect if desired

Resources