Python 3.6 adjacency matrix: how to obtain it in a better way

The problem starts with a classic CSV file, for example:
date;origing;destiny;minutes;distance
19-02-2020;A;B;36;4
20-02-2020;A;B;33;4
24-02-2020;B;A;37;4
25-02-2020;A;C;20;7
27-02-2020;C;B;20;3
28-02-2020;A;B;37.2;4
28-02-2020;A;Z;44;10
My first idea consists of solving it in a classical programming way:
a loop plus counter variables, with the counters represented in a matrix like:
A B C Z
A 0 3 1 1
B 1 0 0 0
C 0 1 0 0
Z 0 0 0 0
My first question is whether there is a better, more automatic way of implementing this in Python, instead of using a classical algorithm based on loops and counters.
And what about obtaining more complex adjacency matrices, for example one whose values are the average of the times?

There are packages such as networkx, but you could also use pandas' groupby.
I doubt that groupby is the fastest option, and networkx may well be faster, but my guess is that groupby at least beats an explicit loop.
import pandas as pd
import numpy as np
M = pd.read_csv('../sample_data.csv', sep=';')
M['constant'] = 1
print(M)
date origing destiny minutes distance constant
0 19-02-2020 A B 36.0 4 1
1 20-02-2020 A B 33.0 4 1
2 24-02-2020 B A 37.0 4 1
3 25-02-2020 A C 20.0 7 1
4 27-02-2020 C B 20.0 3 1
5 28-02-2020 A B 37.2 4 1
6 28-02-2020 A Z 44.0 10 1
With groupby we can get the counts:
counts = M.groupby(['origing','destiny']).count()[['constant']]
counts
constant
origing destiny
A B 3
C 1
Z 1
B A 1
C B 1
Then store those values in a matrix of zeros. This helper maps a pair of single-letter labels to matrix indices:
def key_map(key):
    a, b = key
    return (ord(a) - ord('A'), ord(b) - ord('A'))
Applying it to the group keys gives the indices:
counts['constant'].keys().map(key_map).values
We can set those indices to any values. I use the counts here, but the same groupby can aggregate a sum, an average, or anything else from the other columns:
indici = np.array( [tuple(x) for x in counts['constant'].keys().map(key_map).values] )
indici = tuple(zip(*indici))
and store them with
Z = np.zeros((26,26))
Z[ indici ] = counts['constant']
I print only the top-left corner with
print(Z[:3,:3])
[[0. 3. 1.]
[1. 0. 0.]
[0. 1. 0.]]
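As an alternative to mapping letters to indices by hand, the same matrices can be sketched with pandas' crosstab and pivot_table. This is a minimal sketch, not the answerer's method: the names trips, nodes, counts and mean_minutes are made up for illustration, and the CSV is inlined so the example runs on its own.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
csv_text = """date;origing;destiny;minutes;distance
19-02-2020;A;B;36;4
20-02-2020;A;B;33;4
24-02-2020;B;A;37;4
25-02-2020;A;C;20;7
27-02-2020;C;B;20;3
28-02-2020;A;B;37.2;4
28-02-2020;A;Z;44;10
"""
trips = pd.read_csv(io.StringIO(csv_text), sep=';')

# Every label that appears as an origin or a destination
nodes = sorted(set(trips['origing']) | set(trips['destiny']))

# Count matrix: rows are origins, columns are destinations
counts = (pd.crosstab(trips['origing'], trips['destiny'])
          .reindex(index=nodes, columns=nodes, fill_value=0))

# Average travel time per (origin, destination); NaN where no trip exists
mean_minutes = (trips.pivot_table(index='origing', columns='destiny',
                                  values='minutes', aggfunc='mean')
                .reindex(index=nodes, columns=nodes))

print(counts)
print(mean_minutes)
```

The reindex step makes both matrices square over the same labels, so a node such as Z that never appears as an origin still gets a row.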

Related

What's the best way of converting a numeric array in a text file to a numpy array?

So I'm trying to create an array from a text file, the text file is laid out as follows. The numbers in the first two columns both go to 165:
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
4 1 -5.9966018301500000E-06 1.6619345194700000E-06
4 2 -7.0818069269700000E-06 -6.7836271726900000E-06
4 3 -1.3622983381300000E-06 -1.3443472287100000E-05
4 4 -6.0257787358300000E-06 3.9396371953800000E-06
I'm trying to write a function where an array is made using the numbers in the 3rd columns, taking their positions in the array from the first two columns, and the empty cells are 0s. For example:
1 0 0 0
0 0 0 0
-9.09330871579000e-05 -2.72203236159000e-09 3.47098516014000e-05 0
-3.20359140030000e-06 2.63274401218000e-05 1.41881793294000e-05 1.22860589447000e-05
At the same time, I'm also trying to make a second array but using the numbers from the 4th column not the 3rd. The code that I've written so far is as follows and this is the array produced, I'm not even sure where the 4.41278e-08 comes from:
import numpy as np
def createarray(filepath,maxdegree):
    Cnm = np.zeros((maxdegree+1,maxdegree+1))
    Snm = np.zeros((maxdegree+1,maxdegree+1))
    fid = np.genfromtxt(filepath)
    for row in fid:
        for n in range(0,maxdegree):
            for m in range(0,maxdegree):
                Cnm[n+1,m+1]=row[2]
                Snm[n+1,m+1]=row[3]
    return [Cnm, Snm]
0 0 0 0
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
I'm not getting any errors but I'm also not getting the right array. Can anyone shed some light on what I'm doing wrong?
Your data appear to be in COO sparse matrix format already. This means that you could write your own function, but you could also capitalize on the work done in the scipy.sparse package.
For example this code creates a function that would generate one of your matrices at a time. You could modify it to return both matrices.
import numpy as np
from scipy import sparse
def createarray(filepath, maxdegree, value_column):
    """Create single array from file"""
    # load sparse data into numpy array
    data = np.loadtxt(filepath)
    # loadtxt returns floats, so cast the index columns to integers
    rows = data[:, 0].astype(int)
    cols = data[:, 1].astype(int)
    # use coo_matrix to create the sparse matrix where the
    # values are found in the value_column column of data
    M = sparse.coo_matrix((data[:, value_column], (rows, cols)), shape=(maxdegree+1, maxdegree+1))
    # if you need a numpy array call toarray() otherwise you
    # can return M which is sparse and more memory efficient
    return M.toarray()
Then for the first matrix you wanted to create you would set value_column to 2, and for the second you would set value_column to 3.
# first matrix
Cnm = createarray(filepath, maxdegree, 2)
# second matrix
Snm = createarray(filepath, maxdegree, 3)
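As for the 4.41278e-08 in the question: the nested loops do not use the first two columns at all. For every row of the file they overwrite each cell from [1,1] to [maxdegree,maxdegree] with that row's values, so only the last row read survives. A plain NumPy sketch that uses the first two columns as indices could look like this. The function name create_arrays and the inlined sample are assumptions for illustration, and rows beyond maxdegree are simply skipped.

```python
import io
import numpy as np

def create_arrays(source, maxdegree):
    """Build Cnm and Snm from rows of the form: n m C S."""
    data = np.loadtxt(source)
    n = data[:, 0].astype(int)   # row index (first column)
    m = data[:, 1].astype(int)   # column index (second column)
    # keep only the entries that fit in the requested matrix size
    keep = (n <= maxdegree) & (m <= maxdegree)
    Cnm = np.zeros((maxdegree + 1, maxdegree + 1))
    Snm = np.zeros_like(Cnm)
    # fancy indexing places every value at its (n, m) position at once
    Cnm[n[keep], m[keep]] = data[keep, 2]
    Snm[n[keep], m[keep]] = data[keep, 3]
    return Cnm, Snm

# First rows of the sample file from the question
sample = io.StringIO("""0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
""")
Cnm, Snm = create_arrays(sample, 3)
print(Cnm)
```

Both matrices come out of a single pass over the file, and the cells that the file never mentions stay at zero.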

Converting dataframe fraction to float

I would like to convert the string values in column B to float. How can I do that?
A B
1 16-1/4
2 3-1/4
3 21-1/4
4 8-1/4
Update:
Try map to avoid the 100-row limit on pd.eval:
df['C'] = df.B.str.replace('-', '+').map(pd.eval)
Original:
Going by your comment, it seems you are adding the fraction to the whole number, so the solution would be
df['C'] = pd.eval(df.B.str.replace('-', '+'))
Out[5]:
A B C
0 1 16-1/4 16.25
1 2 3-1/4 3.25
2 3 21-1/4 21.25
3 4 8-1/4 8.25
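Alternatively, use Python's built-in eval() function. Note that eval reads the hyphen as a minus sign, so 16-1/4 becomes 15.75 rather than 16.25, and that eval should only be applied to trusted input: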
df.B = df.B.apply(eval)
Test:
In[1]: df
A B
0 1 16-1/4
1 2 3-1/4
2 3 21-1/4
3 4 8-1/4
In[2]: df.B = df.B.apply(eval)
In[3]: df
A B
0 1 15.75
1 2 2.75
2 3 20.75
3 4 7.75
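If the strings are mixed numbers (whole part, hyphen, fraction), they can also be parsed without any eval call. The sketch below is one possible approach using the standard library's fractions module. The helper name mixed_to_float is made up for illustration, and it assumes every value looks like whole-numerator/denominator or is a plain integer.

```python
from fractions import Fraction
import pandas as pd

def mixed_to_float(text):
    """Convert a string such as '16-1/4' to 16.25 (whole part plus fraction)."""
    whole, _, frac = text.partition('-')
    # Fraction parses 'numerator/denominator' exactly; a missing fraction counts as 0
    return float(int(whole) + (Fraction(frac) if frac else 0))

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['16-1/4', '3-1/4', '21-1/4', '8-1/4']})
df['C'] = df['B'].map(mixed_to_float)
print(df)
```

Because nothing is evaluated as code, a malformed entry raises a ValueError instead of running arbitrary expressions.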

Pandas dataframe how to add column of distance from previous row?

I have a dataframe of locations:
df = X Y
1 1
2 1
2 1
2 2
3 3
5 5
5.5 5.5
I want to add a column with the distance to the previous point:
So it will be:
df = X Y Distance
1 1 0
2 1 1
2 1 0
2 2 1
3 3 2
5 5 2
5.5 5.5 1
What is the best way to do so?
You can use the pd.Series.diff method.
For instance, to compute the Euclidean distance, also using np.sqrt, you would do this:
import numpy as np
df["Distance"] = np.sqrt(df.X.diff()**2 + df.Y.diff()**2)
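A complete version of this idea might look as follows. It is a sketch that rebuilds the question's frame by hand. Since diff leaves NaN in the first row (there is no previous point), fillna(0) is added here as an assumption to match the 0 the question asks for.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 2, 2, 3, 5, 5.5],
                   'Y': [1, 1, 1, 2, 3, 5, 5.5]})

# diff() gives the change from the previous row; the first row has no
# predecessor, so its NaN is replaced by 0
dx = df['X'].diff()
dy = df['Y'].diff()
df['Distance'] = np.sqrt(dx**2 + dy**2).fillna(0)

# Manhattan variant, in case that is the metric actually wanted
df['Manhattan'] = (dx.abs() + dy.abs()).fillna(0)
print(df)
```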

Merge distance matrix results and original indices with Python Pandas

I have a pandas DataFrame with a list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But I can't manage to easily connect between the two now. I want to have a df that contains a tuple for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, convert to vector and all sorts but I feel I just over-complicate things with no success.
Hope this helps!
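import io
import pandas as pd
from sklearn.metrics.pairwise import manhattan_distances
d = ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
# pd.compat.StringIO was removed from pandas; io.StringIO does the same job
df = pd.read_csv(io.StringIO(d), sep=r'\s+')
# Note: passing the whole frame also counts stop_id in the distance, which
# reproduces the matrix from the question; pass df[['stop_lat', 'stop_lon']]
# to measure on the coordinates only
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)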
print(distance_df)
output:
1 2 3 4 6
1 0.000000 1.412176 2.334370 3.422297 5.247050
2 1.412176 0.000000 1.151232 2.047153 4.165126
3 2.334370 1.151232 0.000000 1.104079 3.143274
4 3.422297 2.047153 1.104079 0.000000 2.175247
6 5.247050 4.165126 3.143274 2.175247 0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist=distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
stop_id_1 stop_id_2 distance
0 1 1 0.000000
1 1 2 1.412176
2 1 3 2.334370
3 1 4 3.422297
4 1 6 5.247050
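Another suggestion is to build a DataFrame from the distance matrix (dist_matrix here stands for the array from the question) and merge it with the original frame on the index:
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)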
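The long format above lists every ordered pair, including each stop with itself. If only one row per unordered pair is wanted, as in the question, the upper triangle of the matrix can be selected with np.triu_indices. The sketch below swaps in a plain NumPy Manhattan distance on the coordinates only, so it does not need sklearn and its numbers differ from the matrix above, which also counted stop_id. The names stops, dist and pairs are illustrative.

```python
import numpy as np
import pandas as pd

stops = pd.DataFrame({
    'stop_id': [1, 2, 3, 4, 6],
    'stop_lat': [32.183939, 31.870034, 31.984553, 31.888550, 31.956576],
    'stop_lon': [34.917812, 34.819541, 34.782828, 34.790904, 34.898125],
})

# Pairwise Manhattan distance on latitude/longitude via broadcasting
coords = stops[['stop_lat', 'stop_lon']].to_numpy()
dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)

# Positions strictly above the diagonal: each unordered pair exactly once
i, j = np.triu_indices(len(stops), k=1)
ids = stops['stop_id'].to_numpy()
pairs = pd.DataFrame({'stop_id_1': ids[i],
                      'stop_id_2': ids[j],
                      'distance': dist[i, j]})
print(pairs)
```

Indexing the stop_id array with i and j keeps the original identifiers, so it does not matter that stop_id is not incremental.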

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':[0,4,4,4],
'B':[0,4,4,0],
'C':[0,4,4,4],
'D':[4,0,0,4],
'E':[4,0,0,0],
'Name':['a','a','b','c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (just for rows 0 and 1). It uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns; 0,0,0,4,4 vs, 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement by checking that its element-wise product with the other row is zero everywhere and that their element-wise sum is nowhere zero.
import numpy as np
def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    row, col = np.triu_indices(n, 1)
    # ensure two rows are complete:
    # their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    # ensure two rows do not overlap:
    # their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    # a pair is complementary iff it is complete and does not
    # overlap; both tests must hold for the same pair of rows
    pair = c & o
    complement = sorted(set(row[pair]).union(col[pair]))
    # return slice
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
