I have a CSV file with the content below, and I want to calculate the cosine similarity from one ID to the remaining IDs in the file.
I have loaded it into a pandas DataFrame as follows:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# parse each bracketed string in the Vector column into a flat numeric array
old_df['Vector'] = old_df.apply(lambda row: np.array(np.matrix(row.Vector)).ravel(), axis=1)

l = []
for a in old_df['Vector']:
    l.append(a)
A = np.array(l)
similarities = cosine_similarity(A)
The output looks fine. However, I do not know how to find which GUID (or ID) is similar to which other GUID (or ID), and I only want to get the top k pairs with the highest similarity score.
Could you please help me solve this issue?
Thank you.
|Index | GUID | Vector |
|-------|-------|---------------------------------------|
|36099 | b770 |[-0.04870541 -0.02133574 0.03180726] |
|36098 | 808f |[ 0.0732905 -0.05331331 0.06378368] |
|36097 | b111 |[ 0.01994788 0.00417582 -0.09615131] |
|36096 | b6b5 |[0.025697 -0.08277534 -0.0124591] |
|36083 | 9b07 |[ 0.025697 -0.08277534 -0.0124591] |
|36082 | b9ed |[-0.00952298 0.06188576 -0.02636449] |
|36081 | a5b6 |[0.00432161 0.02264584 -0.0341924] |
|36080 | 9891 |[ 0.08732156 0.00649456 -0.02014138] |
|36079 | ba40 |[0.05407356 -0.09085554 -0.07671648] |
|36078 | 9dff |[-0.09859556 0.04498474 -0.01839088] |
|36077 | a423 |[-0.06124249 0.06774347 -0.05234318] |
|36076 | 81c4 |[0.07278682 -0.10460124 -0.06572364] |
|36075 | 9f88 |[0.09830415 0.05489364 -0.03916228] |
|36074 | adb8 |[0.03149953 -0.00486591 0.01380711] |
|36073 | 9765 |[0.00673934 0.0513557 -0.09584251] |
|36072 | aff4 |[-0.00097896 0.0022945 0.01643319] |
Example code to get the top k cosine similarities and their corresponding GUIDs and row indexes:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))

vectors = []
for v in df['Vector']:
    vectors.append(v)
vectors_num = len(vectors)
A = np.array(vectors)

# Get the similarity matrix and mask the lower triangle (including the
# diagonal) so every pair is counted only once
similarities = cosine_similarity(A)
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))

k = 2
if k > vectors_num:
    k = vectors_num

# Get the top k similarities and their GUID pairs, in ascending order of score
top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
top_k_similarities = similarities[top_k_indexes]
top_k_pair_GUID = []
for i, j in zip(*top_k_indexes):
    pair_GUID = (df.iloc[i]["GUID"], df.iloc[j]["GUID"])
    top_k_pair_GUID.append(pair_GUID)
print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
Outputs:
Data:
GUID Vector
0 b770 [-0.1, -0.2, 0.3]
1 808f [0.1, -0.2, -0.3]
2 b111 [-0.1, 0.2, -0.3]
Similarities:
[[-2. -0.42857143 -0.85714286]
[-2. -2. 0.28571429]
[-2. -2. -2. ]]
top_k_indexes:
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
top_k_pair_GUID:
[('b770', '808f'), ('808f', 'b111')]
top_k_similarities:
[-0.42857143 0.28571429]
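If, instead of the top k pairs overall, you want the k most similar other GUIDs for each GUID, a minimal sketch reusing the same A and df from above (recomputing the matrix without masking the lower triangle) could look like this:
full = cosine_similarity(A)
np.fill_diagonal(full, -2)   # exclude each GUID's similarity with itself
k = 2
for i, guid in enumerate(df["GUID"]):
    top = np.argsort(full[i])[::-1][:k]
    print(guid, [(df.iloc[j]["GUID"], round(full[i, j], 4)) for j in top])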
I have a Spark question: for each entity k, the input is a sequence of probabilities p_i, each with an associated value v_i. For example, the data can look like this:
|entity | Probability | value |
|-------|-------------|-------|
|A      | 0.8         | 10    |
|A      | 0.6         | 15    |
|A      | 0.3         | 20    |
|B      | 0.8         | 10    |
Then, for entity A, I expect the average value to be 0.8*10 + (1-0.8)*0.6*15 + (1-0.8)*(1-0.6)*0.3*20 + (1-0.8)*(1-0.6)*(1-0.3)*MAX_VALUE_DEFINED.
How can I achieve this in Spark using the DataFrame agg function? I find it challenging given the need to group by entity and then compute over the ordered sequence of rows.
You can use a UDF to perform such custom calculations. The idea is to use collect_list to gather all probabilities and values of A in one place so you can loop through them. However, collect_list does not respect the order of your records, which might lead to a wrong calculation. One way to fix this is to generate an ID for each row using monotonically_increasing_id and sort on it.
import pyspark.sql.functions as F

@F.pandas_udf('double')
def markov_udf(values):
    def markov(lst):
        # you can implement your markov logic here
        s = 0
        for i, prob, val in lst:
            s += prob
        return s
    return values.apply(markov)

(df
 .withColumn('id', F.monotonically_increasing_id())
 .groupBy('entity')
 .agg(F.array_sort(F.collect_list(F.array('id', 'probability', 'value'))).alias('values'))
 .withColumn('markov', markov_udf('values'))
 .show(10, False)
)
+------+------------------------------------------------------+------+
|entity|values |markov|
+------+------------------------------------------------------+------+
|B |[[3.0, 0.8, 10.0]] |0.8 |
|A |[[0.0, 0.8, 10.0], [1.0, 0.6, 15.0], [2.0, 0.3, 20.0]]|1.7 |
+------+------------------------------------------------------+------+
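Note that the markov function above is only a placeholder that sums the probabilities (hence 1.7 for A and 0.8 for B). A minimal sketch of the actual expected-value logic from the question, assuming the entries arrive sorted by the generated id (which array_sort guarantees here) and that MAX_VALUE_DEFINED is a constant you define yourself (20.0 below is only a stand-in), could be:
def markov(lst):
    MAX_VALUE_DEFINED = 20.0  # hypothetical stand-in, replace with your own constant
    s, remaining = 0.0, 1.0
    for _id, prob, val in lst:   # each entry is [id, probability, value]
        s += remaining * prob * val
        remaining *= (1 - prob)
    # whatever probability mass is left goes to the default maximum value
    return s + remaining * MAX_VALUE_DEFINED
With the stand-in value of 20.0 this yields 11.4 for A, which matches the window-function answer below.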
There may be a better solution, but I think this does what you needed.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('A', 0.8, 10),
('A', 0.6, 15),
('A', 0.3, 20),
('B', 0.8, 10)],
['entity', 'Probability', 'value']
)
w_desc = W.partitionBy('entity').orderBy(F.desc('value'))
w_asc = W.partitionBy('entity').orderBy('value')
df = df.withColumn('_ent_max_val', F.max('value').over(w_desc))
df = df.withColumn('_prob2', 1 - F.col('Probability'))
df = df.withColumn('_cum_prob2', F.product('_prob2').over(w_asc) / F.col('_prob2'))
df = (df.groupBy('entity')
.agg(F.round((F.max('_ent_max_val') * F.product('_prob2')
+ F.sum(F.col('_cum_prob2') * F.col('Probability') * F.col('value'))
),2).alias('mean_value'))
)
df.show()
# +------+----------+
# |entity|mean_value|
# +------+----------+
# | A| 11.4|
# | B| 10.0|
# +------+----------+
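As a sanity check (note that this query substitutes the entity's own maximum value for MAX_VALUE_DEFINED): for A the result is 0.8*10 + 0.2*0.6*15 + 0.2*0.4*0.3*20 + 0.2*0.4*0.7*20 = 8 + 1.8 + 0.48 + 1.12 = 11.4, and for B it is 0.8*10 + 0.2*10 = 10.0, matching the output above.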
I want to drop rows in a PySpark DataFrame where a certain column contains an empty map. How do I do this? I can't seem to declare a typed empty MapType against which to compare my column. I have seen that in Scala, you can use typedLit, but there seems to be no such equivalent in PySpark. I have also tried using lit(...) and casting to a struct<string,int> but I have found no acceptable argument for lit() (tried using None which returns null and {} which is an error).
I'm sure this is trivial but I haven't seen any docs on this!
Here is a solution using the PySpark built-in function size:
from pyspark.sql.functions import col, size
df = spark.createDataFrame(
[(1, {1:'A'} ),
(2, {2:'B'} ),
(3, {3:'C'} ),
(4, {}),
(5, None)]
).toDF("id", "map")
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- map: map (nullable = true)
# | |-- key: long
# | |-- value: string (valueContainsNull = true)
df.withColumn("is_empty", size(col("map")) <= 0).show()
# +---+--------+--------+
# | id| map|is_empty|
# +---+--------+--------+
# | 1|[1 -> A]| false|
# | 2|[2 -> B]| false|
# | 3|[3 -> C]| false|
# | 4| []| true|
# | 5| null| true|
# +---+--------+--------+
Note that the condition is size <= 0 since, in the case of null, the function returns -1 (if the spark.sql.legacy.sizeOfNull setting is true; otherwise it returns null). See the Spark SQL documentation of size for more details.
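Since the original goal is to drop the rows whose map is empty, a minimal sketch of the actual filter, reusing the same size condition and the same df as above, would be:
# Keep only rows with a non-empty map; with the data above this keeps
# ids 1, 2 and 3 and drops the empty map (4) and the null map (5).
df_non_empty = df.filter(size(col("map")) > 0)
df_non_empty.show()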
Generic solution: comparing Map column and literal Map
For a more generic solution we can use the built-in function size in combination with a UDF which appends the string key + value of each item into a sorted list (thank you @jxc for pointing out the problem with the previous version). The hypothesis here is that two maps are equal when:
they have the same size
the string representation of key + value is identical between the items of the maps
The literal map is created from an arbitrary Python dictionary combining keys and values via map_from_arrays:
from pyspark.sql.functions import udf, lit, size, when, map_from_arrays, array
df = spark.createDataFrame([
    [1, {}],
    [2, {1:'A', 2:'B', 3:'C'}],
    [3, {1:'A', 2:'B'}]
]).toDF("key", "map")

dict = {1: 'A', 2: 'B'}
map_keys_ = array([lit(k) for k in dict.keys()])
map_values_ = array([lit(v) for v in dict.values()])
tmp_map = map_from_arrays(map_keys_, map_values_)

to_strlist_udf = udf(lambda d: sorted([str(k) + str(d[k]) for k in d.keys()]))

def map_equals(m1, m2):
    return when(
        (size(m1) == size(m2)) &
        (to_strlist_udf(m1) == to_strlist_udf(m2)), True
    ).otherwise(False)
df = df.withColumn("equals", map_equals(df["map"], tmp_map))
df.show(10, False)
# +---+------------------------+------+
# |key|map |equals|
# +---+------------------------+------+
# |1 |[] |false |
# |2 |[1 -> A, 2 -> B, 3 -> C]|false |
# |3 |[1 -> A, 2 -> B] |true |
# +---+------------------------+------+
Note: as you can see, the PySpark == operator also works pretty well for array comparison.
I have a 'text' column in which arrays of tokens are stored. How do I filter all these arrays so that only tokens that are at least three letters long are kept?
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()
# Had tried this but have TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
# in col('text')], ''))
# df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
This does it. You can decide whether to exclude the row or not; I added an extra column and then filtered it out, but the choice is yours:
from pyspark.sql import functions as f
columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
    (3, ['ok'])
]
df = spark.createDataFrame(vals, columns)
#df.show()
df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))"))
df2.show()
# This is the actual piece of logic you are looking for.
df3 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))")).where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()
returns:
+---+--------------+--------------+
| id| text|text_left_over|
+---+--------------+--------------+
| 1| [I, am, good]| [good]|
| 2|[You, are, ok]| [You, are]|
| 3| [ok]| []|
+---+--------------+--------------+
+---+--------------+
| id|text_left_over|
+---+--------------+
| 1| [good]|
| 2| [You, are]|
+---+--------------+
This is the solution:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words_filtered', filter_length_udf(col('words')))
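As a side note, on Spark 3.1+ the same thing can be done without a Python UDF by using the built-in higher-order filter function. A sketch, assuming the same df_stemmed and words column as above:
import pyspark.sql.functions as F

df_final_words = df_stemmed.withColumn(
    'words_filtered', F.filter('words', lambda x: F.length(x) >= 3))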
I have a DataFrame with N attributes (Atr1, Atr2, Atr3, ..., AtrN) and an individual instance with the same attributes [1..N-1], i.e. all except the Nth one.
I want to check whether there is any instance in the DataFrame with the same values for attributes [1..N-1] as my instance, and if such an occurrence exists, my goal is to get that instance from the DataFrame with all attributes [1..N].
For example, if I have:
Instance:
[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]
Dataframe:
+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
| 'C' | 'B' | 21 | 'H' |
+------+------+------+------+
| 'D' | 'B' | 21 | 'J' |
+------+------+------+------+
| 'E' | 'B' | 21 | 'K' |
+------+------+------+------+
| 'A' | 'B' | 24 | 'I' |
+------+------+------+------+
I want to get the 4th row of the DataFrame also with the value of Atr4.
I tried it with "filter()" method like this:
df.filter("Atr1 = 'C' and Atr2 = 'B', and Atr3 = 24").take(1)
And I get the result I wanted, but it took much time.
So, my question is: is there any way to do the same but in less time?
Thanks!
You can use locality-sensitive hashing (MinHashLSH) to find the nearest neighbor and check whether it is the same or not.
Since your data has strings, you need to process it before applying LSH.
We will be using PySpark ML's feature module.
Start with StringIndexer and OneHotEncoder:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame([('C','B',21,'H'),('D','B',21,'J'),('E','c',21,'K'),('A','B',24,'J')], ["attr1","attr2","attr3","attr4"])

for col_ in ["attr1","attr2","attr4"]:
    stringIndexer = StringIndexer(inputCol=col_, outputCol=col_+"_")
    model = stringIndexer.fit(df)
    df = model.transform(df)
    encoder = OneHotEncoder(inputCol=col_+"_", outputCol="features_"+col_, dropLast=False)
    df = encoder.transform(df)

df = df.drop("attr1","attr2","attr4","attr1_","attr2_","attr4_")
df.show()
df.show()
+-----+--------------+--------------+--------------+
|attr3|features_attr1|features_attr2|features_attr4|
+-----+--------------+--------------+--------------+
| 21| (4,[2],[1.0])| (2,[0],[1.0])| (3,[1],[1.0])|
| 21| (4,[0],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
| 21| (4,[3],[1.0])| (2,[1],[1.0])| (3,[2],[1.0])|
| 24| (4,[1],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
+-----+--------------+--------------+--------------+
Add an id and assemble all the feature vectors:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("id", monotonically_increasing_id())
df.show()

assembler = VectorAssembler(inputCols=["features_attr1", "features_attr2", "features_attr4", "attr3"],
                            outputCol="features")
df_ = assembler.transform(df)
df_ = df_.select("id", "features")
df_.show()
+----------+--------------------+
| id| features|
+----------+--------------------+
| 0|(10,[2,4,7,9],[1....|
| 1|(10,[0,4,6,9],[1....|
|8589934592|(10,[3,5,8,9],[1....|
|8589934593|(10,[1,4,6,9],[1....|
+----------+--------------------+
Create your MinHashLSH model and search for the nearest neighbors:
from pyspark.ml.feature import MinHashLSH

mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345)
model = mh.fit(df_)
model.transform(df_)
key = df_.select("features").collect()[0]["features"]
model.approxNearestNeighbors(df_, key, 1).collect()
output
[Row(id=0, features=SparseVector(10, {2: 1.0, 4: 1.0, 7: 1.0, 9: 21.0}), hashes=[DenseVector([-1272095496.0])], distCol=0.0)]
I am trying to find a way to name the columns of a DataFrame using strings coming from Excel or from scraping the web.
So how can I transform "colname" into colname below?
df = DataFrame(colname = [1, 2])
I tried
df = DataFrame(symbol("colname") = [1, 2])
or
df = DataFrame([1, 2], [symbol("colname")])
and many other combinations, but no success.
I see questions related to deleting columns based on string column names but no question/answer for naming columns in the first place.
Maybe you can try something like this in two steps using the names! function.
using DataFrames
newname = ["colname1", "colname2"]
df = DataFrame(v1 = [1, 2], v2 = [3, 4])
names!(df.colindex, map(parse, newname))
df
# 2x2 DataFrames.DataFrame
# | Row | colname1 | colname2 |
# |-----|----------|----------|
# | 1 | 1 | 3 |
# | 2 | 2 | 4 |
Here are the versions of Julia and DataFrames.jl I used:
versioninfo()
# Julia Version 0.4.0-dev+6991
# Commit 811a977 (2015-08-26 04:02 UTC)
# Platform Info:
# System: Linux (x86_64-unknown-linux-gnu)
# CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
# WORD_SIZE: 64
# BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
# LAPACK: libopenblas
# LIBM: libopenlibm
# LLVM: libLLVM-svn
Pkg.installed("DataFrames")
# v"0.6.9"