Using sparse matrix instead of numpy distance matrix - python-3.x

I have a list of car IDs and a list of x,y coordinates.
I want to calculate the distance between each pair of coordinates.
The problem is, after trying for weeks now, that the distance matrix has limits: I'm dealing with gigabytes of files, and the resulting matrix has millions of rows and columns.
Can this be done with a sparse matrix to make it more efficient?
import pandas as pd
from scipy.spatial import distance_matrix

# coordinates and car_ids are the already-opened input files
list_coordinates = []
for line in coordinates.readlines():
    list_coordinates.append(line.strip().split(','))
list_coordinates_int = [list(map(float, x)) for x in list_coordinates]

list_car_id = []
for line in car_ids.readlines():
    list_car_id.append(line.strip().split(' '))

df = pd.DataFrame(list_coordinates_int, columns=['xcord', 'ycord'], index=list_car_id)
df2 = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
list_coordinates : [['875.88', '588.26'], ['751.49', '656.55']]
list_coordinates_int : [[875.88, 588.26], [751.49, 656.55]]
list_car_id : [['car.0', 'car2.0', 'car.0', 'car2.0', 'car.0']]
The resulting df2 looks like this:
             car.0      car2.0       car.4
car.0     0.000000  141.902770    0.702140
car2.0  141.902770    0.000000  141.205831
car.4   141.902770    0.702140    0.000000
Is there a way I could get the same df2 using sparse matrices, or any other method than distance_matrix?
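One possible direction (a sketch, not tested on data of your size): a distance matrix is only worth storing sparsely if most pairs can be dropped, for example because you only care about cars within some cutoff distance. Under that assumption, scipy's cKDTree can build the sparse matrix directly; the 100.0 cutoff below is just a placeholder.

import numpy as np
from scipy.spatial import cKDTree

coords = np.asarray(list_coordinates_int)    # (n, 2) array of x, y positions
tree = cKDTree(coords)

# Only pairs closer than max_distance are stored; every farther pair is simply
# absent from the result, which is what keeps the matrix sparse.
sparse_d = tree.sparse_distance_matrix(tree, max_distance=100.0)  # scipy.sparse dok_matrix
sparse_d = sparse_d.tocsr()                  # CSR is more convenient for arithmetic

If you truly need every pairwise distance, no sparse format will save memory; in that case processing the matrix in chunks (for example with sklearn.metrics.pairwise_distances_chunked) keeps the working set bounded instead.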

Related

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets', with the corresponding original columns underneath. I am not concerned about order, but I do need proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
Thanks in advance!
Two things to notice: you want set_axis rather than reindex, and sorting the tuples by the original column order ensures that each label is assigned to the correct column (that is what the sorted(..., key=...) call does).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1    cities              countries            planets
level2       nyc    london       canada      chile    earth
0       0.028033  0.540977    -0.056096   1.675698 -0.328630
1       1.170465 -1.003825     0.882126   0.453294 -1.127752
2      -0.187466 -0.192546     0.269802  -1.225172 -0.548491
3       2.272900 -0.085427     0.029242  -2.258696  1.034485
4      -1.243871 -1.660432    -0.051674   2.098602 -2.098941
5      -0.820820 -0.289754     0.019348   0.176778  0.395959
6       1.346459 -0.260583     0.212008  -1.071501  0.945545
7       0.673351  1.133616     1.117379  -0.531403  1.467604
8       0.332187 -3.541103    -0.222365   1.035739 -0.485742
9      -0.605965 -1.442371    -1.628210  -0.711887 -2.104755

Python xarray: Processing data for a loop with method='nearest' at different locations

Is it possible to have an xarray with multiple columns all having the same coordinates? In the following example I create an xarray and then want to extract time series data at different locations. However, to do this I have to create a NumPy array to store the data and its coordinates.
import numpy as np
import pandas as pd
import xarray as xr

# Sample from the data in the netCDF file (ds is normally loaded from that file)
ds = xr.Dataset()
ds['temp'] = xr.DataArray(data=np.random.rand(2, 3, 4), dims=['time', 'lat', 'lon'],
                          coords=dict(time=pd.date_range('1900-1-1', periods=2, freq='D'),
                                      lat=[25., 26., 27.], lon=[-85., -84., -83., -82.]))
display(ds)

# lat and lon locations to extract temp values
locations = np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])

# Extract time series at different locations
temp = np.empty([ds['temp'].shape[0], len(locations)])
lat_lon = np.empty([len(locations), 2])
for n in range(locations.shape[0]):
    nearest = ds.sel(lat=locations[n, 0], lon=locations[n, 1], method='nearest')
    lat_lon[n, 0] = nearest.coords['lat'].values
    lat_lon[n, 1] = nearest.coords['lon'].values
    temp[:, n] = nearest['temp']
print(temp)
print(lat_lon)

# Find maximum temp for all locations:
temp = temp.max(1)
The output of this code is:
array([[[0.67465371, 0.0710136 , 0.03263631, 0.41050204],
[0.26447469, 0.46503577, 0.5739435 , 0.33725726],
[0.20353832, 0.01441925, 0.26728572, 0.70531547]],
[[0.75418953, 0.20321738, 0.41129902, 0.96464691],
[0.53046103, 0.88559914, 0.20876142, 0.98030988],
[0.48009467, 0.7906767 , 0.09548439, 0.61088112]]])
Coordinates:
time (time) datetime64[ns] 1900-01-01 1900-01-02
lat (lat) float64 25.0 26.0 27.0
lon (lon) float64 -85.0 -84.0 -83.0 -82.0
temp (time, lat, lon) float64 0.09061 0.6634 ... 0.5696 0.4438
Attributes: (0)
[[0.26447469 0.5739435 0.01441925]
[0.53046103 0.20876142 0.7906767 ]]
[[ 26. -85.]
[ 26. -83.]
[ 27. -84.]]
More simply, is there a way to find the maximum temp across all locations for every timestamp without creating the intermediate temp array?
When you create the sample data, you specify 3 values of latitude and 4 values of longitude. That means 12 values in total, on a 2D grid (3D if we add time).
When you want to query values for 3 specific points, you have to query each point individually. As far as I know, there are two ways to do that:
1. Write a loop and store the result in an intermediate array (your solution).
2. Stack dimensions and query longitude and latitude simultaneously.
First, you have to express your locations as a list/array of tuples:
locations=np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])
coords=[(coord[0], coord[1]) for coord in locations]
print(coords)
[(25.6, -84.7), (26.0, -83.0), (26.5, -84.1)]
Then you interpolate your data for the specified locations, stack latitude and longitude to a new dimension coord, select your points.
(ds
 .interp(lon=locations[:, 1], lat=locations[:, 0], method='linear')  # interpolate on the grid
 .stack(coord=['lat', 'lon'])  # from the 3x3 grid to a list of 9 points
 .sel(coord=coords)            # select your three points
 .temp.max(dim='coord')        # get the largest temp value along the coord dimension
)
array([0.81316195, 0.56967184]) # your largest values at both timestamps
The downside is that xarray doesn't support interpolation on an unlabeled multi-index, which is why you first need to interpolate the grid onto your set of latitudes and longitudes (rather than simply finding the nearest neighbour).
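If nearest-neighbour lookup (as in the original loop) is acceptable instead of interpolation, a sketch using xarray's vectorized indexing also avoids the intermediate NumPy arrays: DataArray indexers that share a new points dimension make .sel pick one grid cell per location rather than the full lat x lon cross product.

lat_pts = xr.DataArray(locations[:, 0], dims='points')
lon_pts = xr.DataArray(locations[:, 1], dims='points')

# One nearest grid cell per (lat, lon) pair, giving dims (time, points)
temp_at_points = ds['temp'].sel(lat=lat_pts, lon=lon_pts, method='nearest')
max_per_time = temp_at_points.max(dim='points')   # largest temp at each timestamp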

How can I select specific values from list and plot a seaborn boxplot?

I have a list (length 300) of lists (each length 1000). I want to sort the list of 300 by the median of each list of 1000, and then plot a seaborn boxplot of the top 10 (i.e. the 10 lists with the greatest median).
I am able to plot the entire list of 300 but don't know where to go from there.
I can plot a range of the points, but how do I plot, for example, data[3], data[45], and data[129] all in the same plot?
ax = sns.boxplot(data = data[0:50])
I can also work out which items in the list are in the top 10 by doing this (but I realise this is not the most elegant way!)
array_median = np.median(data, axis=1)
np_sortedarray = np.sort(np.array(array_median))
sort_panda = pd.DataFrame(array_median)
TwoL = sort_panda.reset_index()
TwoL.sort_values(0)
Ultimately I want a boxplot with 10 boxes, showing the list items that have the greatest median values.
Example of data: list of 300 x 1000
[[1.236762285232544,
1.2303414344787598,
1.196462631225586,
...1.1787045001983643,
1.1760116815567017,
1.1614983081817627,
1.1546586751937866],
[1.1349891424179077,
1.1338907480239868,
1.1239897012710571,
1.1173863410949707,
...1.1015456914901733,
1.1005324125289917,
1.1005228757858276],
[1.0945734977722168,
...1.091795563697815]]
I modified your example data a bit just to make it easier.
import seaborn as sns
import pandas as pd
import numpy as np
data = [[1.236762285232544, 1.2303414344787598, 1.196462631225586, 1.1787045001983643, 1.1760116815567017, 1.1614983081817627, 1.1546586751937866],
[1.1349891424179077, 1.1338907480239868, 1.1239897012710571, 1.1173863410949707, 1.1015456914901733, 1.1005324125289917, 1.1005228757858276]]
To sort your data, since it is a list of lists rather than a NumPy array, you can use the sorted function with a key that computes the median of each inner list; that median is the value the function sorts by. Setting reverse=True sorts from highest to lowest.
sorted_data = sorted(data, key = lambda x: np.median(x), reverse = True)
To select the top n lists, add [:n] to the end of the previous statement.
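For example, to keep the 10 lists with the largest medians:
top_10 = sorted(data, key=np.median, reverse=True)[:10]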
To plot in Seaborn, it's easiest to convert your data to a pandas.DataFrame.
df = pd.DataFrame(sorted_data).T
That makes a DataFrame with 10 columns (or 2 in this example). We can rename the columns to make each dataset clearer.
df = df.rename(columns={k: f'Data{k+1}' for k in range(len(sorted_data))}).reset_index()
And to plot 2 (or 10) boxplots in one plot, you can reshape the dataframe to have 2 columns, one for the data and one for the dataset number (ID) (credit here).
df = pd.wide_to_long(df, stubnames = ['Data'], i = 'index', j = 'ID').reset_index()[['ID', 'Data']]
And then you can plot it.
sns.boxplot(x='ID', y = 'Data', data = df)
See this answer for fetching the top 10 elements:
data = np.array(data)                 # fancy indexing needs a NumPy array
idx = (-array_median).argsort()[:10]  # indices of the 10 largest medians
data[idx]
Also, you can get particular elements of data like this:
data[[3, 45, 129]]

How to use list comprehension in pandas to create a new series for a plot?

This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype; for example, I want to efficiently build a matrix of each unique ratetype against the average customer number for that ratetype. Writing the expression that selects the rows matching each individual ratetype, takes the average customer number for that type, and then builds a series from those two equal-length lists is way over my head in pandas.
The number of different ratetypes can be in the hundreds. Reading them into a list programmatically would be a better choice than hard-coding each possibility, as the list will only grow in size and variability.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df[['ratetype', 'numberofcustomers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed['numberofcustomers'].mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]  # one hand-built entry per ratetype
My goal is to create and plot these subplots using matplotlib.pyplot.
import matplotlib.pyplot as plt

data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
        'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
        }
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)

fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')
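A minimal sketch of the aggregation step with groupby (assuming the columns are named ratetype and numberofcustomers, as in the sample above) avoids hard-coding one variable per ratetype:

# Average customer number per ratetype, plus how often each ratetype occurs,
# computed directly from the raw rows.
avg_cust_nos = df.groupby('ratetype')['numberofcustomers'].mean()
rt_counts = df['ratetype'].value_counts()

summary = pd.DataFrame({'ratetypes': rt_counts, 'Avg_cust_numbers': avg_cust_nos})
summary = summary.sort_values(by='ratetypes', ascending=False)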

How to classify a frequency domain data in Python 3? (could not convert string to float)

I've converted a DataFrame from time domain to frequency domain using :
df = np.fft.fft(df)
Now I need to classify the data using several machine learning algorithms, such as Random Forest and Gaussian Naive Bayes. The problem is that I keep getting this error:
could not convert string to float: '(2.9510193818016135-0.47803712350473193j)'
I tried to convert the strings to floats in DataFrame but it is still giving me the same error.
How can I solve this problem in order to get my classification results?
Assuming your results are of the following form, you first need to cast the strings to an actual complex type:
In[84]:
# data setup
df = pd.DataFrame({'fft':['(2.9510193818016135-0.47803712350473193j)']})
df
Out[84]:
fft
0 (2.9510193818016135-0.47803712350473193j)
Now cast to complex type:
In[85]:
df['complex'] = df['fft'].apply(complex)
df
Out[85]:
fft complex
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
Now you can extract as polar coords using apply with cmath.polar:
In[86]:
import cmath
df['polar_x'],df['polar_y'] = df['complex'].apply(lambda x: cmath.polar(x)[0]), df['complex'].apply(lambda x: cmath.polar(x)[1])
df
Out[86]:
fft complex \
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
polar_x polar_y
0 2.989487 -0.160595
Now the dtypes are compatible, so you can pass the float columns to your classifier:
In[87]:
df.dtypes
Out[87]:
fft object
complex complex128
polar_x float64
polar_y float64
dtype: object
You can also use cmath.rect if desired, to convert the polar coordinates back to rectangular form.
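As a side note, the same polar decomposition can be done in a vectorized way with NumPy instead of apply; a minimal sketch, assuming the complex column created above:

import numpy as np

# np.abs gives the magnitude and np.angle the phase in radians,
# equivalent to applying cmath.polar element-wise.
df['polar_x'] = np.abs(df['complex'])
df['polar_y'] = np.angle(df['complex'])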
