PySpark cosine-similarity Transformer - apache-spark

I have a DataFrame with two columns, each containing vectors, e.g.
+-------------+------------+
| v1          | v2         |
+-------------+------------+
| [1,1.2,0.4] | [2,0.4,5]  |
| [1,.2,0.6]  | [2,.2,5]   |
| .           | .          |
| .           | .          |
| .           | .          |
| [0,1.2,.6]  | [2,.2,0.4] |
+-------------+------------+
I would like to add another column to this DataFrame that contains the cosine similarity between the two vectors in each row.
Is there a Transformer for this?
Is a Transformer the right approach for this task?
If it is the right approach and there is no such Transformer, could you give me a pointer on how to write one myself?

I am not aware of any Transformer that can directly compute cosine similarity here.
You can write your own UDF for such functionality:
from pyspark.ml.linalg import DenseVector
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

v = [(DenseVector([1, 1.2, 0.4]), DenseVector([2, 0.4, 5])),
     (DenseVector([1, 2, 0.6]), DenseVector([2, 0.2, 5])),
     (DenseVector([0, 1.2, 0.6]), DenseVector([2, 0.2, 0.4]))]
dfv1 = spark.createDataFrame(v, ['v1', 'v2'])
dfv1 = dfv1.withColumn('v1v2', F.struct([F.col('v1'), F.col('v2')]))
dfv1.show(truncate=False)
Here's the DataFrame with combined vectors:
+-------------+-------------+------------------------------+
|v1 |v2 |v1v2 |
+-------------+-------------+------------------------------+
|[1.0,1.2,0.4]|[2.0,0.4,5.0]|[[1.0,1.2,0.4], [2.0,0.4,5.0]]|
|[1.0,2.0,0.6]|[2.0,0.2,5.0]|[[1.0,2.0,0.6], [2.0,0.2,5.0]]|
|[0.0,1.2,0.6]|[2.0,0.2,0.4]|[[0.0,1.2,0.6], [2.0,0.2,0.4]]|
+-------------+-------------+------------------------------+
Now we can define our UDF for cosine similarity:
cosine_similarity_udf = F.udf(lambda v: float(v[0].dot(v[1]) / (v[0].norm(2) * v[1].norm(2))), FloatType())
dfv1 = dfv1.withColumn('cosine_similarity', cosine_similarity_udf(dfv1['v1v2']))
dfv1.show(truncate=False)
The last column shows the cosine similarity:
+-------------+-------------+------------------------------+-----------------+
|v1 |v2 |v1v2 |cosine_similarity|
+-------------+-------------+------------------------------+-----------------+
|[1.0,1.2,0.4]|[2.0,0.4,5.0]|[[1.0,1.2,0.4], [2.0,0.4,5.0]]|0.51451445 |
|[1.0,2.0,0.6]|[2.0,0.2,5.0]|[[1.0,2.0,0.6], [2.0,0.2,5.0]]|0.4328257 |
|[0.0,1.2,0.6]|[2.0,0.2,0.4]|[[0.0,1.2,0.6], [2.0,0.2,0.4]]|0.17457432 |
+-------------+-------------+------------------------------+-----------------+
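If you prefer to skip the intermediate struct column, the UDF can also take the two vector columns directly. A minimal sketch along the same lines, reusing dfv1 and the imports from above:
cos_sim_udf = F.udf(lambda a, b: float(a.dot(b) / (a.norm(2) * b.norm(2))), FloatType())

# Pass both vector columns straight to the UDF, no struct needed.
dfv1 = dfv1.withColumn('cosine_similarity', cos_sim_udf(F.col('v1'), F.col('v2')))
dfv1.show(truncate=False)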

Related

Dataframe column is list of strings: how to apply transformation to each element?

Assuming a DataFrame where the content of a column is a list of 0 to n strings:
import pandas as pd

df = pd.DataFrame({'col_w_list': [['c/100/a/111', 'c/100/a/584', 'c/100/a/324'],
                                  ['c/100/a/327'],
                                  ['c/100/a/324', 'c/100/a/327'],
                                  ['c/100/a/111', 'c/100/a/584', 'c/100/a/999'],
                                  ['c/100/a/584', 'c/100/a/327', 'c/100/a/999']]})
How would I go about transforming the column (either the same or a new one) if all I wanted was the last set of digits, meaning
| | target_still_list |
|--|-----------------------|
|0 | ['111', '584', '324'] |
|1 | ['327'] |
|2 | ['324', '327'] |
|3 | ['111', '584', '999'] |
|4 | ['584', '327', '999'] |
I know how to handle this one list at a time
from os import path
ls = ['c/100/a/111','c/100/a/584','c/100/a/324']
new_ls = [path.split(x)[1] for x in ls]
# or, alternatively
new_ls = [x.split('/')[3] for x in ls]
But I have failed at doing the same over a dataframe. For instance
df['target_still_list'] = df['col_w_list'].apply([lambda x: x.split('/')[3] for x in df['col_w_list']])
Throws an AttributeError at me.
How to apply transformation to each element?
For a data frame, you can use pandas.DataFrame.applymap.
For a series, you can use pandas.Series.map or pandas.Series.apply, which is your posted solution.
Your error is caused by the lambda expression: apply passes each element x to it, and since x is already a list, you can iterate over its items directly.
The correct code should be:
df['target_still_list'] = df['col_w_list'].apply(lambda x: [item.split('/')[-1] for item in x])
# or
# df['target_still_list'] = df['col_w_list'].map(lambda x: [item.split('/')[-1] for item in x])
# or (NOTE: This assignment works only if df has only one column.)
# df['target_still_list'] = df.applymap(lambda x: [item.split('/')[-1] for item in x])
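If you would rather avoid an explicit lambda, an alternative that leans on pandas string methods (assuming pandas >= 0.25 for Series.explode) could look like this:
# Explode to one row per string, take the last path segment, then regroup by the original index.
df['target_still_list'] = (
    df['col_w_list']
    .explode()
    .str.split('/')
    .str[-1]
    .groupby(level=0)
    .agg(list)
)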

How can we make a function do different things based on the nature of its input?

How do we implement Predicate Dispatching in python?
Suppose that we have a function named funky_the_function.
funky_the_function should test its input against a criterion and then call some other function based on the result of the test.
Below are some examples of test predicates:
import string

class Predicates:
    @classmethod
    def is_numeric_string(cls, chs: str) -> bool:
        """
        +-----------------+--------+
        |      INPUT      | OUTPUT |
        +-----------------+--------+
        | "9821"          | True   |
        | "3038739984"    | True   |
        | "0"             | True   |
        | "3.14"          | False  |
        | "orange"        | False  |
        | "kiwi 5 pear 0" | False  |
        +-----------------+--------+
        """
        return all(ch in string.digits for ch in chs)

    @classmethod
    def is_just_one_thing(cls, thing):
        """
        Returns True if `thing` is just one thing (not many things),
        i.e. if str(thing) is the same as the concatenation of the
        to-stringed versions of all of its elements
        (the whole is the sum of its parts).
        +--------------------------+--------+
        |          INPUT           | OUTPUT |
        +--------------------------+--------+
        | int(4)                   | True   |
        | str(4)                   | True   |
        | float(9.17)              | True   |
        | str("ABCDE")             | True   |
        | [int(1), str(2), int(3)] | False  |
        | (8, 3)                   | False  |
        | [8]                      | False  |
        | ["A", "B", "C"]          | False  |
        +--------------------------+--------+
        """
        if hasattr(thing, "__iter__"):
            return str(thing) == "".join(str(elem) for elem in thing)
        else:  # thing is not iterable
            return True
We have a handful of different versions of a function, and which version should be called depends on what its inputs are. How do we implement Predicate Dispatching in Python?
Suppose that there were eight different forms of the funky_the_function function. We could:
- write eight different implementations of funky_the_function,
- write eight different test predicates, and
- write eight different classes.
After that we could write a funky_the_function function which:
- tests its input, and
- based on the result of the test, passes the input into one of several different class constructors.
Dispatch using @singledispatch from Python's functools library:
from functools import singledispatch

class ArgsOne:
    def __init__(self, args):
        self.args = args

class ArgsTwo:
    def __init__(self, args):
        self.args = args

def funky_the_function(*args):
    # test_one and test_two are the test predicates (not shown here).
    if test_one(args):
        obj = ArgsOne(args)
        return _funky_the_function(obj)
    elif test_two(args):
        obj = ArgsTwo(args)
        return _funky_the_function(obj)

@singledispatch
def _funky_the_function(arg):
    pass

@_funky_the_function.register
def _(arg: ArgsOne):
    print("implementation one")

@_funky_the_function.register
def _(arg: ArgsTwo):
    print("implementation two")

How to find out items in each partition after repartition in Java Spark

I have a Java ArrayList with a few Integer values.
I have created a Dataset from the ArrayList.
I used System.out.println(DF.javaRDD().getNumPartitions()); and it resulted in 1 partition.
I wanted to divide the data into 3 partitions, so I used repartition().
I want to find out the number of items in each partition after repartitioning.
In Scala it is straightforward:
DF.repartition(3).mapPartitions((it) => Iterator(it.length));
But the same syntax does not work in Java, since there is no length method on Java's Iterator interface.
How should we interpret the mapPartitions function?
mapPartitions(FlatMapFunction<java.util.Iterator<T>,U> f)
What parameters will the inner function take, and what is its return type?
SparkSession sessn = SparkSession.builder().appName("RDD to DF").master("local").getOrCreate();
List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());
Try this-
List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = spark.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());
MapPartitionsFunction<Integer, Integer> f =
it -> ImmutableList.of(JavaConverters.asScalaIteratorConverter(it).asScala().length()).iterator();
DF.repartition(3).mapPartitions(f,
Encoders.INT()).show(false);
/**
* 2
* +-----+
* |value|
* +-----+
* |6 |
* |8 |
* |6 |
* +-----+
*/
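For reference, if you happen to be on the Python side instead, the same per-partition count can be sketched with PySpark's RDD API (glom gathers each partition into a list); this assumes an active SparkSession named spark:
# Count items per partition after repartitioning to 3.
df = spark.createDataFrame([(i,) for i in range(1, 21)], ['value'])
print(df.repartition(3).rdd.glom().map(len).collect())   # e.g. [7, 7, 6]; the exact split depends on the partitioner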

Spark-java : Exception in thread "main" org.apache.spark.sql.AnalysisException [duplicate]

I am new to Spark SQL.
In MS SQL we have the LEFT keyword, e.g. LEFT(Columnname, 1) IN ('D','A') THEN 1 ELSE 0.
How do I implement the same in Spark SQL?
You can use substring function with positive pos to take from the left:
import org.apache.spark.sql.functions.substring
substring(column, 0, 1)
and negative pos to take from the right:
substring(column, -1, 1)
So in Scala you can define
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.substring
def left(col: Column, n: Int) = {
  assert(n >= 0)
  substring(col, 0, n)
}

def right(col: Column, n: Int) = {
  assert(n >= 0)
  substring(col, -n, n)
}

val df = Seq("foobar").toDF("str")

df.select(
  Seq(left _, right _).flatMap(f => (1 to 3).map(i => f($"str", i))): _*
).show
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
|substring(str, 0, 1)|substring(str, 0, 2)|substring(str, 0, 3)|substring(str, -1, 1)|substring(str, -2, 2)|substring(str, -3, 3)|
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
| f| fo| foo| r| ar| bar|
+--------------------+--------------------+--------------------+---------------------+---------------------+---------------------+
Similarly in Python:
from pyspark.sql.functions import substring
from pyspark.sql.column import Column
def left(col, n):
    assert isinstance(col, (Column, str))
    assert isinstance(n, int) and n >= 0
    return substring(col, 0, n)

def right(col, n):
    assert isinstance(col, (Column, str))
    assert isinstance(n, int) and n >= 0
    return substring(col, -n, n)
import org.apache.spark.sql.functions._
Use substring(column, 0, 1) instead of the LEFT function, where:
0 : starting position in the string (Spark's substring is 1-based, but a pos of 0 behaves the same as 1)
1 : number of characters to be selected
Example: consider the LEFT function
LEFT(upper(SKU), 2)
The corresponding Spark SQL statement would be:
substring(upper(SKU), 1, 2)
To build upon user6910411's answer, you can also use isin to build a new column with the result of your character comparison.
The final full code would look something like this:
import org.apache.spark.sql.functions._

df.select(substring($"Columnname", 0, 1) as "ch")
  .withColumn("result", when($"ch".isin("D", "A"), 1).otherwise(0))
There are Spark SQL right and left functions as of Spark 2.3
Suppose you have the following DataFrame.
+----------------------+
|some_string |
+----------------------+
|this 23 has 44 numbers|
|no numbers |
|null |
+----------------------+
Here's how to get the leftmost two elements using the SQL left function:
df.select(expr("left(some_string, 2)").as("left_two")).show(false)
+--------+
|left_two|
+--------+
|th |
|no |
|null |
+--------+
Passing in SQL strings to expr() isn't ideal. Scala API users don't want to deal with SQL string formatting.
I created a library called bebe that provides easy access to the left function:
df.select(bebe_left(col("some_string"), lit(2)).as("left_two")).show()
+--------+
|left_two|
+--------+
|th |
|no |
|null |
+--------+
The Spark SQL right and bebe_right functions work in a similar manner.
You can use the Spark SQL functions with the expr hack, but it's better to use the bebe functions that are more flexible and type safe.
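If you are working from PySpark rather than the Scala API, the same expr hack is available there too. A minimal sketch, assuming the same some_string column as above:
from pyspark.sql import functions as F

df.select(F.expr("left(some_string, 2)").alias("left_two")).show(truncate=False)
df.select(F.expr("right(some_string, 2)").alias("right_two")).show(truncate=False)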

Spark join dataframes & datasets

I have a DataFrame called Link with a dynamic number of fields/columns in a Row.
Some fields, however, have the structure [ClassName]Id and contain an id.
[ClassName]Id fields are always of type String.
I have a couple of Datasets, each of a different type [ClassName].
Each Dataset has at least the fields id (String) and typeName (String), where typeName is always filled with the String value of the [ClassName].
e.g. if I had 3 Datasets of type A, B and C:
Link:
+----+-----+-----+-----+
| id | AId | BId | CId |
+----+-----+-----+-----+
| XX | A01 | B02 | C04 |
| XY | null| B05 | C07 |
A:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| A01 | A | ... | ... |
B:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| B02 | B | ... | ... |
The preferred end result would be the Link DataFrame where each Id field is either replaced by or appended with a field called [ClassName] containing the original object.
Result:
+----+----------------+----------------+----------------+
| id | A | B | C |
+----+----------------+----------------+----------------+
| XX | A(A01, A, ...) | B(B02, B, ...) | C(C04, C, ...) |
| XY | null | B(B05, B, ...) | C(C07, C, ...) |
Things I've tried:
A recursive call on joinWith. The first call succeeds, returning a tuple/Row where the first element is the original Row and the second the matched [ClassName]. However, the second iteration starts nesting these results. Trying to 'unnest' these results using map either results in Encoder hell (since the resulting Row is not a fixed type) or the encoding is so complex that it results in a Catalyst error.
A join as RDD: I can't work this one out yet.
Any ideas are welcome.
So I figured out how I could do what I want. I made some changes for it to work for me.
For reference purposes I will show my steps; maybe it can be useful for someone in the future.
First I declare a datatype that shares all properties of A, B, C, etc. that I'm interested in, and make the classes extend from this supertype:
case class Base(id: String, typeName: String)
case class A(override val id: String, override val typeName: String) extends Base(id, typeName)
Next I load the Link DataFrame:
val linkDataFrame = spark.read.parquet("[path]")
I want to convert this DataFrame into something joinable. This means creating a placeholder for the joined sources and a way to convert all the single Id fields (AId, BId, etc.) into a Map of source -> id. Spark has a sql map function that is useful here. Also, we need to convert the Base class to a StructType for use in the encoder. I tried multiple ways, but couldn't circumvent the specific declaration (otherwise casting errors).
val linkDataFrame = spark.read.parquet("[path]")
case class LinkReformatted(ids: Map[String, Long], sources: Map[String, Base])
// Maps each column ending with Id into a Map of (columnname1 (-Id), value1, columnname2 (-Id), value2)
val mapper = linkDataFrame.columns.toList
  .filter(
    _.matches("(?i).*Id$")
  )
  .flatMap(
    c => List(lit(c.replaceAll("(?i)Id$", "")), col(c))
  )

val baseStructType = ScalaReflection.schemaFor[Base].dataType.asInstanceOf[StructType]
All these parts made it possible to create a new DataFrame with the Id's all in one field called ids and a placeholder for the sources in an empty Map[String, Base]
val linkDatasetReformatted = linkDataFrame.select(
    map(mapper: _*).alias("ids")
  )
  .withColumn("sources", lit(null).cast(MapType(StringType, baseStructType)))
  .as[LinkReformatted]
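As an aside, the same alternating key/value trick exists in PySpark as F.create_map, in case you ever need to mirror this step there. A purely illustrative sketch (the link_df name and the Id-suffix filter below are my own assumptions):
from pyspark.sql import functions as F

# Collect every column ending in 'Id' (skipping the plain id column) and build
# alternating (source name, id value) arguments for create_map.
id_cols = [c for c in link_df.columns if c.lower().endswith('id') and c.lower() != 'id']
mapper = [arg for c in id_cols for arg in (F.lit(c[:-2]), F.col(c))]

link_reformatted = link_df.select(F.create_map(*mapper).alias('ids'))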
The next step was to join all source Datasets (A, B, etc.) to this reformatted Link dataset. A lot of stuff happens in this tail-recursive method:
import scala.annotation.tailrec

@tailrec
def recursiveJoinBases(sourceDataset: Dataset[LinkReformatted], datasets: List[Dataset[Base]]): Dataset[LinkReformatted] = datasets match {
  case Nil => sourceDataset // Nothing left to join, return it
  case baseDataset :: remainingDatasets => {
    val typeName = baseDataset.head.typeName // extract the type from base (each field has the same value)
    val masterName = "source" // something to name the source
    val joinedDataset = sourceDataset.as(masterName) // joining source
      .joinWith(
        baseDataset.as(typeName), // with a base A, B, etc
        col(s"$typeName.id") === col(s"$masterName.ids.$typeName"), // join on source.ids.[typeName]
        "left_outer"
      )
      .map {
        case (source, base) => {
          // append to or create the map of sources
          val newSources = if (source.sources == null) Map(typeName -> base) else source.sources + (typeName -> base)
          source.copy(sources = newSources)
        }
      }
      .as[LinkReformatted]
    recursiveJoinBases(joinedDataset, remainingDatasets)
  }
}
You now end up with a Dataset of LinkReformatted records where, for each typeName -> id entry in the ids field, there is a corresponding typeName -> Base entry in the sources field.
For me that was enough. I could extract everything I needed using some map function over this final Dataset.
I hope this somewhat helps. I understand it's not the exact solution I was asking about, nor is it all very straightforward.
