calculation of distance matrix in a faster approach - python-3.x

I have a dataframe
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
a = {'b':['cat','bat','cat','cat','bat','No Data','bat','No Data']}
df11 = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
and I have a distance function:
def distancemetric(x):
    list1 = x['b'].tolist()
    result11 = []
    sortlist11 = [process.extract(ele, list1, limit=11000000, scorer=fuzz.token_set_ratio) for ele in list1]
    d11 = [dict(element) for element in sortlist11]
    finale11 = [(k, element123[k]) for k in list1 for element123 in d11]
    result11.extend([x[1] for x in finale11])
    final_result11 = np.reshape(result11, (len(x.index), len(x.index)))
    return final_result11
I call the function with:
values1 = distancemetric(df11)
Here the token_set_ratio method compares only two strings. When I pass an array of strings it gives me an average, which I don't need.
This code works, but it is slow. Is there any way to make it run faster?
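One possible way to speed this up (a sketch, assuming the column contains many repeated strings, as in the toy data above) is to score each unique pair of strings only once with fuzz.token_set_ratio and then broadcast the scores back to the full matrix, instead of calling process.extract against the whole column for every row:

import numpy as np
from fuzzywuzzy import fuzz

def distancemetric_fast(x):
    values = x['b'].tolist()
    uniques = sorted(set(values))
    # score every unique pair once instead of every row pair
    lookup = {(a, b): fuzz.token_set_ratio(a, b) for a in uniques for b in uniques}
    n = len(values)
    out = np.empty((n, n), dtype=int)
    # broadcast the precomputed scores to the full n x n matrix
    for i, a in enumerate(values):
        for j, b in enumerate(values):
            out[i, j] = lookup[(a, b)]
    return out

values1 = distancemetric_fast(df11)

The name distancemetric_fast is only illustrative; if a library switch is an option, the rapidfuzz package also provides much faster implementations of the same scorers.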

Related

Altair/Vega-Lite heatmap: Filter top k

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
df = pd.read_excel(r"path\to\df").fillna(0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know that what you want to show is the entries for c, e, b, y and a (and that it will not change later), you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
I paste an example of each below. By the way, I converted the original data that you pasted into a csv file which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
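For example (a hedged illustration of that suggestion, using a truncated made-up subset of the table rather than the full data):

import pandas as pd

data = {'Lowest_Taxon': ['a', 'b', 'c'],
        'p1': [0.032, 0.681, 40.659],
        'p2': [0.000, 9.315, 1.245]}
df = pd.DataFrame(data)
# df.to_dict(orient='list') produces such a dict from an existing frame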
Simple approach:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to however many of the top entries you want to see:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
n = 5 # number of entries to display
alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab = 'mean(Relative_abundance)',
    count_of_samples = 'valid(Relative_abundance)',
    groupby = ['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame = [None, None]
).transform_filter(
    # rows of the k-th ranked taxon all tie at rank (k-1)*count_of_samples + 1,
    # so keeping rank <= (n-1)*count_of_samples + 1 retains exactly the top n taxa
    (alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)
)

Getting the elements of list in specific range in Python using negative indexing

Input
list1 = ['Apple','Google','MS','Facebook']
print(list1)
list1[-4:1]
Output
['Apple', 'Google', 'MS', 'Facebook']
['Apple']
Can anyone please explain the result?
When you use negative indexing, you start at index -1. It would seem silly to say list1[-0] and have it be different from list1[0]. Because of this, your code becomes "grab the elements starting from the 4th-to-last and going up to index 1". Another way to think of it is that list1[-4] is the same as list1[len(list1) - 4]. So here the slice covers the range [0, 1) and only returns the first element.
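A quick illustration of that equivalence (using the list from the question):

list1 = ['Apple', 'Google', 'MS', 'Facebook']
print(list1[-4])                # 'Apple' -- same as list1[len(list1) - 4], i.e. list1[0]
print(list1[-4:1])              # ['Apple'] -- slice from index 0 up to (but not including) 1
print(list1[len(list1) - 4:1])  # ['Apple'] -- identical once the negative index is converted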
Index -4 means the 4th-from-last element, so list1[-4:1] is the same as list1[0:1], which is the same as [list1[0]], which is ['Apple'].
(To give another example, list1[1:-1] == list1[1:3] == [list1[1], list1[2]] == ['Google', 'MS'].)

Multiplying two RDD in pyspark

I am new to pyspark. I have been trying to multiply two sparse RDDs. The code which I have tried generates two sparse matrices, and I have written a function to multiply the two RDDs, but I think this is not the solution because the computation does not occur in parallel. Can someone help me with it? How can I multiply the RDDs in parallel? I tried a lot of resources on various sites but could not come up with a solution.
import findspark
findspark.init()
import numpy as np
import pyspark
import random
from scipy.sparse import rand
sc = pyspark.SparkContext(appName="matrix")
np.random.seed(42)
n=4
x = rand(n, n, density=0.25)
y = rand(n, n, density=0.25)
A = x.A
B = y.A
rdd_x = sc.parallelize(A)
rdd_y = sc.parallelize(B)
def multiply(r1, r2):
    A = r1.collect()
    B = r2.collect()
    result = []
    for i in range(len(B[0])):
        total = 0
        for j in range(len(A)):
            total += A[j] * B[j][i]
        result.append(total)
    return result
C = multiply(rdd_x,rdd_x)
print(C)
sc.stop()
If you're using collect() anyway, you might as well use np.multiply():
C = np.multiply(np.array(rdd_x.collect()), np.array(rdd_y.collect()))
Or if you want a dot product, you can use np.dot():
C = np.dot(np.array(rdd_x.collect()), np.array(rdd_y.collect()))
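If the goal is a product that is actually computed in parallel rather than on the driver, one option (a hedged sketch, not part of the answer above) is Spark's distributed matrix types in pyspark.mllib, which can be built from the row RDDs defined in the question:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# wrap each numpy row in an IndexedRow so Spark knows its position
mat_x = IndexedRowMatrix(rdd_x.zipWithIndex().map(lambda t: IndexedRow(t[1], t[0])))
mat_y = IndexedRowMatrix(rdd_y.zipWithIndex().map(lambda t: IndexedRow(t[1], t[0])))

# BlockMatrix.multiply performs the matrix product across the cluster
product = mat_x.toBlockMatrix().multiply(mat_y.toBlockMatrix())
C = product.toLocalMatrix().toArray()  # collect the (small) result for inspection
print(C)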

Can I using multiple processes to read different subsets of numpy array (or pandas dataframe) safely?

I want to use multiple processes to read each 2-column combination of a numpy array (or pandas dataframe), such as array[:, 1:3] and array[:, 2:4].
I wonder whether it is safe to read array[:, 1:3] in one process and array[:, 2:4] in another process at the same time?
The example code is shown:
import time
import numpy as np
import pandas as pd
from itertools import combinations
from multiprocessing import Pool, Value, Lock, Array
g = np.load('input.npy')
c = Value('i', 0, lock=True)
def count_valid_pairs(i):
    pair = g[:, i]
    global c
    if pair.max() > 100:
        with c.get_lock():
            c.value += 1
    return

if __name__ == '__main__':
    t_start = time.time()
    cpus = 20
    p = Pool(processes=cpus)
    r = p.imap_unordered(count_valid_pairs, combinations(range(g.shape[1]), 2))
    p.close()
    p.join()
    print("Total {} pairs have max value > 100".format(c.value))

How can I find out which operands are supported before looping through pandas dataframe?

I am attempting to iterate over rows in a Series within a Pandas DataFrame. I would like to take the value in each row of the column csv_df['Strike'] and plug it into variable K, which gets called in function a.
Then, I want the output a1 and a2 to be put into their own columns within the DataFrame.
I am receiving the error TypeError: unsupported operand type(s) for *: 'int' and 'zip', and I figure that if I can find out which operands are supported, I could convert a1 and a2 to that.
Am I thinking about this correctly?
Note: S is just a static number as the df is just one row, while K has many rows.
Code is below:
from scipy.stats import norm
from math import sqrt, exp, log, pi
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
import fix_yahoo_finance as yf
yf.pdr_override()
import numpy as np
import datetime
from pandas_datareader import data, wb
import matplotlib.pyplot as plt
#To get data:
start = datetime.datetime.today()
end = datetime.datetime.today()
df = data.get_data_yahoo('AAPL', start, end) #puts data into a pandas dataframe
csv_df = pd.read_csv('./AAPL_TEST.csv')
for row in csv_df.itertuples():
    def a(S, K):
        a1 = 100 * K
        a2 = S
        return a1
    S = df['Adj Close'].items()
    K = csv_df['strike'].items()
    a1, a2 = a(S, K)
    df['new'] = a1
    df['new2'] = a2
It seems an alternate way of doing what you want would be to apply your method to each data frame separately, as in:
df = data.get_data_yahoo('AAPL', start, end)
csv_df = pd.read_csv('./AAPL_TEST.csv')
df['new'] = csv_df['strike'].apply(lambda x: 100 * x)
df['new2'] = df['Adj Close']
Perhaps, applying the calculation directly to the Pandas Series (a column of your data frame) is a way to avoid defining a method that is used only once.
Plus, I wouldn't define a method within a loop as you have.
Cheers
ps. I believe you have forgotten to return both values in your method.
def a(S, K):
    a1 = 100 * K
    a2 = S
    return (a1, a2)
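With both values returned, a hedged usage sketch (reusing the column names from the question's code) is to call the method on the Series directly, which vectorises the arithmetic and removes the loop; the results are written to csv_df here, since a1 has one value per strike row:

S = df['Adj Close'].iloc[0]   # a single price, since df has only one row
K = csv_df['strike']          # the whole Series of strikes

a1, a2 = a(S, K)              # a1 is a Series (100 * each strike), a2 is the scalar S
csv_df['new'] = a1
csv_df['new2'] = a2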

Resources