Calculating a distance matrix in a faster way - python-3.x
I have a dataframe
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
a = {'b':['cat','bat','cat','cat','bat','No Data','bat','No Data']}
df11 = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
and I have a distance function:
def distancemetric(x):
    list1 = x['b'].tolist()
    result11 = []
    # score every string against the whole column; limit is set high so no match is dropped
    sortlist11 = [process.extract(ele, list1, limit=11000000, scorer=fuzz.token_set_ratio) for ele in list1]
    d11 = [dict(element) for element in sortlist11]
    finale11 = [(k, element123[k]) for k in list1 for element123 in d11]
    result11.extend([x[1] for x in finale11])
    final_result11 = np.reshape(result11, (len(x.index), len(x.index)))
    return final_result11
I call the function by
values1 = distancemetric(df11)
Here the token_set_ratio method compares only two strings. When I pass an array of strings it gives me an average score, which I don't need.
This code works, but it is slow. Is there any way to make it run faster?
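One possible direction for speeding this up (just a sketch, not a verified drop-in replacement; the name distancemetric_fast is made up here): since the column contains many repeated strings, fuzz.token_set_ratio only needs to be called once per pair of unique values, and the cached scores can then be broadcast back to the full matrix. Whether this reproduces the exact ordering of the original output is worth double-checking.
import numpy as np
from fuzzywuzzy import fuzz

def distancemetric_fast(x):
    values = x['b'].tolist()
    uniques = list(dict.fromkeys(values))  # unique strings, original order preserved
    # score each pair of unique strings exactly once
    scores = {(a, b): fuzz.token_set_ratio(a, b) for a in uniques for b in uniques}
    n = len(values)
    out = np.empty((n, n), dtype=int)
    for i, a in enumerate(values):
        for j, b in enumerate(values):
            out[i, j] = scores[(a, b)]  # reuse the cached pairwise score
    return out

# values1 = distancemetric_fast(df11)  # same (n, n) shape as the original result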
Related
Altair/Vega-Lite heatmap: Filter top k
I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save

# Read in the file and fill empty cells with zero
df = pd.read_excel("path\to\df")
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later), you could simply apply a transform_filter on the field Lowest_Taxon. If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms. For both I paste an example below. By the way, I converted the original data that you pasted into a csv file which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then simply be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach: set n to how many of the top entries you want to see
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

n = 5  # number of entries to display

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab = 'mean(Relative_abundance)',
    count_of_samples = 'valid(Relative_abundance)',
    groupby = ['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame = [None, None]
).transform_filter(
    alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1
)
Getting the elements of a list in a specific range in Python using negative indexing
Input
list1 = ['Apple','Google','MS','Facebook']
print(list1)
list1[-4:1]
Output
['Apple', 'Google', 'MS', 'Facebook']
['Apple']
Can anyone please explain the result?
When you use negative indexing, you start at index -1. It would seem silly to say list1[-0] and have it be different from list1[0]. Because of this, your code becomes "grab the elements starting from the 4th from the end and going up to (but not including) index 1". Another way to think of it is that list1[-4] is the same as list1[len(list1) - 4]. So for this you're going over the range [0, 1) and only returning the first element.
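A quick illustration of that equivalence (a small sketch using the list from the question):
list1 = ['Apple', 'Google', 'MS', 'Facebook']
# -4 counts from the end, so list1[-4] and list1[len(list1) - 4] refer to the same element
print(list1[-4:1])               # ['Apple']
print(list1[len(list1) - 4:1])   # ['Apple'] -- identical to list1[0:1]
print(list1[1:-1])               # ['Google', 'MS'] -- same as list1[1:3]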
This may help you.
from math import sqrt
from sklearn.cluster import MiniBatchKMeans
import pandas_datareader as dr
from matplotlib import pyplot as plt
import pandas as pd
import matplotlib.cm as cm
import seaborn as sn

start = '2020-1-1'
end = '2021-1-1'
tickers = ['AXP','AAPL','BA','CAT','CSCO','CVX','XOM','GS']
prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker,'yahoo',start)['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except:
        pass
prices_df = pd.concat(prices_list,axis=1)
prices_df.sort_index(inplace=True)
prices_df.head()
Index -4 means the 4th element from the end, so list1[-4:1] is the same as list1[0:1], which is the same as [list1[0]], which is ['Apple']. (To give another example, list1[1:-1] == list1[1:3] == [list1[1], list1[2]] == ['Google', 'MS'].)
Multiplying two RDDs in pyspark
I am new to pyspark. I have been trying to multiply two sparse RDDs. The code which I have tried generates two sparse matrices, and I have written a function to multiply the two RDDs, but I think this is not the solution because the computation does not occur in parallel. Can someone help me with it? How can I multiply the RDDs in parallel? I tried out a lot of resources on the sites but could not come up with a solution.
import findspark
findspark.init()
import numpy as np
import pyspark
import random
from scipy.sparse import rand

sc = pyspark.SparkContext(appName="matrix")
np.random.seed(42)
n = 4
x = rand(n, n, density=0.25)
y = rand(n, n, density=0.25)
A = x.A
B = y.A
rdd_x = sc.parallelize(A)
rdd_y = sc.parallelize(B)

def multiply(r1, r2):
    A = r1.collect()
    B = r2.collect()
    result = []
    for i in range(len(B[0])):
        total = 0
        for j in range(len(A)):
            total += A[j] * B[j][i]
        result.append(total)
    return result

C = multiply(rdd_x, rdd_x)
print(C)
sc.stop()
If you're using collect() anyway, you might as well use np.multiply():
C = np.multiply(np.array(rdd_x.collect()), np.array(rdd_y.collect()))
Or if you want a dot product, you can use np.dot():
C = np.dot(np.array(rdd_x.collect()), np.array(rdd_y.collect()))
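If the goal is to keep the multiplication on the cluster instead of collecting everything to the driver, one possible direction (a sketch only, using pyspark.mllib's distributed matrix types and assuming the sc, A and B objects from the question are already defined) is a BlockMatrix multiply:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# wrap each dense row in an IndexedRow so Spark keeps track of row positions
mat_a = IndexedRowMatrix(sc.parallelize([IndexedRow(i, row) for i, row in enumerate(A)])).toBlockMatrix()
mat_b = IndexedRowMatrix(sc.parallelize([IndexedRow(i, row) for i, row in enumerate(B)])).toBlockMatrix()

product = mat_a.multiply(mat_b)          # distributed matrix product, computed on the executors
C = product.toLocalMatrix().toArray()    # only the final result is brought back to the driver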
Can I use multiple processes to read different subsets of a numpy array (or pandas dataframe) safely?
I want to use multiple processes to get each 2-column combination in a numpy array (or pandas dataframe), such as array[:, 1:3], array[:, 2:4]. I wonder, is it safe to get array[:, 1:3] in one process and get array[:, 2:4] in another process? The example code is shown:
import time
import numpy as np
import pandas as pd
from itertools import combinations
from multiprocessing import Pool, Value, Lock, Array

g = np.load('input.npy')
c = Value('i', 0, lock=True)

def count_valid_pairs(i):
    pair = g[:, i]
    global c
    if pair.max() > 100:
        with c.get_lock():
            c.value += 1
    return

if __name__ == '__main__':
    t_start = time.time()
    cpus = 20
    p = Pool(processes=cpus)
    r = p.imap_unordered(count_valid_pairs, combinations(range(g.shape[1]), 2))
    p.close()
    p.join()
    print("Total {} pairs has max value > 100".format(c.value))
How can I find out which operands are supported before looping through a pandas dataframe?
I am attempting to iterate over rows in a Series within a Pandas DataFrame. I would like to take the value in each row of the column csv_df['Strike'] and plug it into variable K, which gets called in function a. Then, I want the output a1 and a2 to be put into their own columns within the DataFrame. I am receiving the error: TypeError: unsupported operand type(s) for *: 'int' and 'zip', and I figure that if I can find out which operands are supported, I could convert a1 and a2 to that. Am I thinking about this correctly?
Note: S is just a static number as the df is just one row, while K has many rows. Code is below:
from scipy.stats import norm
from math import sqrt, exp, log, pi
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
import fix_yahoo_finance as yf
yf.pdr_override()
import numpy as np
import datetime
from pandas_datareader import data, wb
import matplotlib.pyplot as plt

#To get data:
start = datetime.datetime.today()
end = datetime.datetime.today()
df = data.get_data_yahoo('AAPL', start, end) #puts data into a pandas dataframe
csv_df = pd.read_csv('./AAPL_TEST.csv')

for row in csv_df.itertuples():
    def a(S, K):
        a1 = 100 * K
        a2 = S
        return a1
    S = df['Adj Close'].items()
    K = csv_df['strike'].items()
    a1, a2 = a(S, K)
    df['new'] = a1
    df['new2'] = a2
It seems an alternate way of doing what you want would be to apply your method to each data frame separately, as in:
df = data.get_data_yahoo('AAPL', start, end)
csv_df = pd.read_csv('./AAPL_TEST.csv')

df['new'] = csv_df['strike'].apply(lambda x: 100 * x)
df['new2'] = df['Adj Close']
Perhaps, applying the calculation directly to the Pandas Series (a column of your data frame) is a way to avoid defining a method that is used only once. Plus, I wouldn't define a method within a loop as you have. Cheers
ps. I believe you have forgotten to return both values in your method.
def a(S, K):
    a1 = 100 * K
    a2 = S
    return (a1, a2)