Calculate distance between vectors from different Spark dataframes - apache-spark

I have two Spark dataframes:
> df1
+--------+-----------------------------+
| word | init_vec|
+--------+-----------------------------+
| venus |[-0.235, -0.060, -0.609, ...]|
+--------+-----------------------------+
> df2
+-----------------------------+-----+
| targ_vec | id|
+-----------------------------+-----+
|[-0.272, -0.070, -0.686, ...]| 45ha|
+-----------------------------+-----+
|[-0.234, -0.060, -0.686, ...]| 98pb|
+-----------------------------+-----+
|[-0.562, -0.334, -0.981, ...]| c09j|
+-----------------------------+-----+
I need to find the Euclidean distance between init_vec from df1 and each vector in the targ_vec column of df2, and return the top 3 vectors closest to init_vec.
> desired_output
+--------+-----------------------------+-----+----------+
| word | targ_vec| id| distance|
+--------+-----------------------------+-----+----------+
| venus |[-0.234, -0.060, -0.686, ...]| 98pb|some_value|
+--------+-----------------------------+-----+----------+
| venus |[-0.221, -0.070, -0.613, ...]| tg67|some_value|
+--------+-----------------------------+-----+----------+
| venus |[-0.240, -0.091, -0.676, ...]| jhg6|some_value|
+--------+-----------------------------+-----+----------+
I need to implement this using PySpark.

After a cross join between df1 and df2 to add df1.init_vec to all the rows of df2:
import pyspark.sql.functions as f

df1 = (df1
    .withColumn('distance', f.sqrt(f.expr(
        'aggregate(transform(targ_vec, (element, idx) -> power(abs(element - element_at(init_vec, cast(idx + 1 as int))), 2)), cast(0 as double), (acc, value) -> acc + value)'
    )))
)
Then you can sort the dataframe and keep the 3 rows with the least distance values.
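Putting it together, a minimal end-to-end sketch might look like this (assuming Spark 2.4+ for the higher-order functions and that both init_vec and targ_vec are array<double> columns; the joined result is kept in a separate variable instead of overwriting df1):
import pyspark.sql.functions as f

# Cross join: attach the single init_vec row of df1 to every row of df2.
joined = df2.crossJoin(df1.select('word', 'init_vec'))

# Element-wise squared differences, summed with aggregate(), then sqrt().
dist = f.sqrt(f.expr(
    'aggregate(transform(targ_vec, (element, idx) -> '
    'power(element - element_at(init_vec, cast(idx + 1 as int)), 2)), '
    'cast(0 as double), (acc, value) -> acc + value)'
))

# Keep the 3 closest vectors.
top3 = (joined
        .withColumn('distance', dist)
        .orderBy('distance')
        .limit(3)
        .select('word', 'targ_vec', 'id', 'distance'))
top3.show(truncate=False)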

Related

Merging a series of pandas dataframes into a single dataframe

I have a series of pandas data frames stored in the variable df, similar to below:
df
| 0     | 1      |
+-------+--------+
| ABCD  | WXYZ   |

| 0     | 1      |
+-------+--------+
| DEFJ  | HJKL   |

| 0     | 1      |
+-------+--------+
| ZXCT  | WYOM   |

| 0     | 1      |
+-------+--------+
| TYZX  | NMEX   |
I want to merge them into a single pandas data frame as below:
| 0 | 1 |
+-------+--------+
|ABCD | WXYZ |
|DEFJ | HJKL |
|ZXCT | WYOM |
|TYZX | NMEX |
So how can I merge a series of pandas dataframes into one single pandas dataframe?
As your code is now, you're only outputting one dataframe with a single row (overwriting the others).
Try this:
# Copy the names to pandas dataframes and save them in a list
import pandas as pd
dfs = []
for j in range(0,5):
    for i in divs[j].find_elements_by_tag_name('a'):
        i = i.get_attribute('text')
        i = parse_name(i)
        df = pd.DataFrame(i)
        df = df.transpose()
        dfs.append(df)
# Aggregate all dataframes in one
new_df = dfs[0]
for df in dfs[1:]:
    new_df = new_df.append(df)
# Update index
new_df = new_df.reset_index(drop=True)
# Print first five rows
new_df.head()
0 1
0 Lynn Batten Emeritus Professor
1 Andrzej Goscinski Emeritus Professor
2 Jemal Abawajy Professor
3 Maia Angelova Professor
4 Gleb Beliakov Professor
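Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the aggregation step above can be written as a single pd.concat call over the collected list (a small sketch reusing the dfs list built above):
import pandas as pd

# Concatenate all collected dataframes at once and rebuild the index.
new_df = pd.concat(dfs, ignore_index=True)
new_df.head()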
There are four ways to concat or merge dataframes; you may refer to this post.
These are the most common implementations:
import pandas as pd
df1 = pd.DataFrame({0:['ABCD'], 1:['WXYZ']})
df2 = pd.DataFrame({0:['DEFJ'], 1:['HJKL']})
df3 = pd.DataFrame({0:['ZXCT'], 1:['WYOM']})
...
df = pd.concat([df1, df2, df3], axis=0)
print(df.head())
Or, if you have a list of dataframes with the same headers, you can try:
dfs = [df1, df2, df3 ..]
df = pd.concat(dfs, axis=0)
And the simplest way is to just use df.append:
df = df.append(anotherdf)

break one DF row into multiple rows in another DF

I am looking to convert one DF into another.
The difference is that 1 row in DF1 may become 3 rows in DF2.
example DF1
cust_id | email_id_1 | email_id_2 | email_id_3 |
1 |one_1#m.com | one_2#m.com| one_3#m.com|
then DF2 will be like
cust_id | email_id |
1 |one_1#m.com |
1 |one_2#m.com |
1 |one_3#m.com |
I have written the code below, which gives me the error AttributeError: 'str' object has no attribute 'cast'.
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Create a schema for the dataframe
dfSchema = StructType([
    StructField('CUST_ID', LongType()),
    StructField('EMAIL_ADDRESS', StringType())
])
dfData = []
for row in initialCustEmailDetailsDF.rdd.collect():
    if row["email_address_1"] != "":
        temp1 = [row["cust_id"].cast(LongType()), row["email_address_1"]]
        # error: AttributeError: 'str' object has no attribute 'cast'
        # (row["cust_id"] is a plain Python str here, not a Column)
        dfData.append(temp1)
    if row["email_address_2"] != "":
        temp2 = [row["cust_id"].cast(LongType()), row["email_address_2"]]
        dfData.append(temp2)
    if row["email_address_3"] != "":
        temp3 = [row["cust_id"].cast(LongType()), row["email_address_3"]]
        dfData.append(temp3)
# Convert list to RDD
rdd = spark.sparkContext.parallelize(dfData)
# Create data frame
df = spark.createDataFrame(rdd, dfSchema)
df.show()
You may be looking for explode_outer:
df.show()
+-------+-----------+-----------+-----------+
|cust_id| email_id_1| email_id_2| email_id_3|
+-------+-----------+-----------+-----------+
| 1|one_1#m.com|one_2#m.com| null|
| 2|one_1#m.com| null|one_3#m.com|
| 3|one_1#m.com|one_2#m.com|one_3#m.com|
+-------+-----------+-----------+-----------+
import pyspark.sql.functions as F
df2 = df.select(
'cust_id',
F.explode_outer(
F.array('email_id_1', 'email_id_2', 'email_id_3')
).alias('email_id')
)
df2.show()
+-------+-----------+
|cust_id| email_id|
+-------+-----------+
| 1|one_1#m.com|
| 1|one_2#m.com|
| 1| null|
| 2|one_1#m.com|
| 2| null|
| 2|one_3#m.com|
| 3|one_1#m.com|
| 3|one_2#m.com|
| 3|one_3#m.com|
+-------+-----------+
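If blank addresses should be dropped entirely (mirroring the != "" checks in the question's code), a follow-up filter on the exploded column handles both nulls and empty strings; a minimal sketch on top of the df2 result above:
import pyspark.sql.functions as F

# Keep only rows with a real email address (drop nulls and empty strings).
df3 = df2.where(F.col('email_id').isNotNull() & (F.col('email_id') != ''))
df3.show()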

Spark: Join two dataframes on an array type column

I have a simple use case
I have two dataframes, df1 and df2, and I am looking for an efficient way to join them.
df1: Contains my main dataframe (billions of records)
+--------+-----------+--------------+
|doc_id |doc_name |doc_type_id |
+--------+-----------+--------------+
| 1 |doc_name_1 |[1,4] |
| 2 |doc_name_2 |[3,2,6] |
+--------+-----------+--------------+
df2: Contains labels of doc types (40,000 records); as it's a small one, I am broadcasting it.
+------------+----------------+
|doc_type_id |doc_type_name |
+------------+----------------+
| 1 |doc_type_1 |
| 2 |doc_type_2 |
| 3 |doc_type_3 |
| 4 |doc_type_4 |
| 5 |doc_type_5 |
| 6 |doc_type_6 |
+------------+----------------+
I would like to join these two dataframes to result in something like this:
+--------+------------+--------------+----------------------------------------+
|doc_id |doc_name |doc_type_id |doc_type_name |
+--------+------------+--------------+----------------------------------------+
| 1 |doc_name_1 |[1,4] |["doc_type_1","doc_type_4"] |
| 2 |doc_name_2 |[3,2,6] |["doc_type_3","doc_type_2","doc_type_6"]|
+--------+------------+--------------+----------------------------------------+
Thanks
We can use array_contains + groupBy + collect_list functions for this case.
Example:
val df1=Seq(("1","doc_name_1",Seq(1,4)),("2","doc_name_2",Seq(3,2,6))).toDF("doc_id","doc_name","doc_type_id")
val df2=Seq(("1","doc_type_1"),("2","doc_type_2"),("3","doc_type_3"),("4","doc_type_4"),("5","doc_type_5"),("6","doc_type_6")).toDF("doc_type_id","doc_type_name")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df1.createOrReplaceTempView("tbl")
df2.createOrReplaceTempView("tbl2")
spark.sql("select a.doc_id,a.doc_name,a.doc_type_id,collect_list(b.doc_type_name) doc_type_name from tbl a join tbl2 b on array_contains(a.doc_type_id,int(b.doc_type_id)) = TRUE group by a.doc_id,a.doc_name,a.doc_type_id").show(false)
//+------+----------+-----------+------------------------------------+
//|doc_id|doc_name |doc_type_id|doc_type_name |
//+------+----------+-----------+------------------------------------+
//|2 |doc_name_2|[3, 2, 6] |[doc_type_2, doc_type_3, doc_type_6]|
//|1 |doc_name_1|[1, 4] |[doc_type_1, doc_type_4] |
//+------+----------+-----------+------------------------------------+
Another way to achieve this is by using explode + join + collect_list:
val df3=df1.withColumn("arr",explode(col("doc_type_id")))
df3.join(df2,df2.col("doc_type_id") === df3.col("arr"),"inner").
groupBy(df3.col("doc_id"),df3.col("doc_type_id"),df3.col("doc_name")).
agg(collect_list(df2.col("doc_type_name")).alias("doc_type_name")).
show(false)
//+------+-----------+----------+------------------------------------+
//|doc_id|doc_type_id|doc_name |doc_type_name |
//+------+-----------+----------+------------------------------------+
//|1 |[1, 4] |doc_name_1|[doc_type_1, doc_type_4] |
//|2 |[3, 2, 6] |doc_name_2|[doc_type_2, doc_type_3, doc_type_6]|
//+------+-----------+----------+------------------------------------+
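For completeness, a rough PySpark equivalent of the explode + join + collect_list approach (column names follow the question; the broadcast hint reflects the note that df2 is small, and the cast mirrors the int() cast in the Scala SQL, so it may be unnecessary if doc_type_id is already an integer):
from pyspark.sql import functions as F

# Explode the array so each doc_type_id element gets its own row.
df3 = df1.withColumn('arr', F.explode('doc_type_id'))

result = (df3
          .join(F.broadcast(df2), df3['arr'] == df2['doc_type_id'].cast('int'), 'inner')
          .groupBy(df3['doc_id'], df3['doc_type_id'], df3['doc_name'])
          .agg(F.collect_list(df2['doc_type_name']).alias('doc_type_name')))
result.show(truncate=False)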

How to find avg length of each column in pyspark? [duplicate]

This question already has an answer here:
Apply a transformation to multiple columns pyspark dataframe
(1 answer)
Closed 4 years ago.
I have created data frame like below:
from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1],lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 2.2|Jalfaizy| aa|Jalfaizy|
| 230| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+----+--------+-----+--------+
I want to find the average length of each column, i.e. the total number of characters in a particular column divided by the number of rows. Below is my expected output:
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
|2.5 | 5.5 | 2.75 | 6 |
+----+--------+-----+--------+
This is actually pretty straightforward. We will be using a projection for the column length and an aggregation for the average:
from pyspark.sql.functions import length, col, avg
selection = ['lname','mname','name']
schemaPeople \
.select(*(length(col(c)).alias(c) for c in selection)) \
.agg(*(avg(col(c)).alias(c) for c in selection)).show()
# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# | 5.5| 2.75| 6.0|
# +-----+-----+----+
This way, you'll be able to pass the names of the columns dynamically.
What we are doing here is actually unpacking the argument list (selection).
Reference: Control Flow Tools - Unpacking Argument Lists.
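The question's expected output also includes the ages column; assuming it should be averaged as well, it only needs to be added to the selection list and the snippet above works unchanged:
# Hypothetical extension: average the ages column too.
selection = ['ages', 'lname', 'mname', 'name']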
I think you can just create new columns for the individual lengths and then group the dataframe. Then you would end up with something like:
from pyspark.sql.functions import avg, col, length

df_new = spark.createDataFrame([
    ("25", "Ankit", "Ankit", "Ankit"), ("2.2", "Jalfaizy", "aa", "Jalfaizy"),
    ("230", "saurabh", "bb", "saurabh"), ("26", "aa", "bb", "Bala")
], ("age", "lname", "mname", "name"))

df_new.withColumn("len_age", length(col("age"))).withColumn("len_lname", length(col("lname")))\
    .withColumn("len_mname", length(col("mname"))).withColumn("len_name", length(col("name")))\
    .groupBy().agg(avg("len_age"), avg("len_lname"), avg("len_mname"), avg("len_name")).show()
Result:
+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
| 2.5| 5.5| 2.75| 6.0|
+------------+--------------+--------------+-------------+
In Scala it can be done this way; I guess it can be converted to Python by the author:
val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
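A rough Python translation of that Scala snippet, assuming the schemaPeople dataframe and column names from the question:
from pyspark.sql.functions import avg, col, length

average_column_list = ['ages', 'lname', 'mname', 'name']
columns = [avg(length(col(name))).alias(name) for name in average_column_list]

result = schemaPeople.select(*columns)
result.show()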

PySpark: Compare array values in one dataFrame with array values in another dataFrame to get the intersection

I have the following two DataFrames:
l1 = [(['hello','world'],), (['stack','overflow'],), (['hello', 'alice'],), (['sample', 'text'],)]
df1 = spark.createDataFrame(l1)
l2 = [(['big','world'],), (['sample','overflow', 'alice', 'text', 'bob'],), (['hello', 'sample'],)]
df2 = spark.createDataFrame(l2)
df1:
["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]
df2:
["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]
For every row in df1, I want to calculate the number of times all the words in the array occur in df2.
For example, the first row in df1 is ["hello","world"]. Now, I want to check df2 for the intersection of ["hello","world"] with every row in df2.
| ARRAY | INTERSECTION | LEN(INTERSECTION)|
|["big","world"] |["world"] | 1 |
|["sample","overflow","alice","text","bob"] |[] | 0 |
|["hello","sample"] |["hello"] | 1 |
Now, I want to return the sum(len(intersection)). Ultimately, I want the resulting df1 to look like this:
df1 result:
ARRAY INTERSECTION_TOTAL
| ["hello","world"] | 2 |
| ["stack","overflow"] | 1 |
| ["hello","alice"] | 2 |
| ["sample","text"] | 3 |
How do I solve this?
I'd focus on avoiding the Cartesian product first. I'd try to explode and join:
from pyspark.sql.functions import explode, monotonically_increasing_id
df1_ = (df1.toDF("words")
.withColumn("id_1", monotonically_increasing_id())
.select("*", explode("words").alias("word")))
df2_ = (df2.toDF("words")
.withColumn("id_2", monotonically_increasing_id())
.select("id_2", explode("words").alias("word")))
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
.groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+
| words|sum(count)|
+-----------------+----------+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+----------+
If intermediate values are not needed it could be simplified to:
df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+
| words|count|
+-----------------+-----+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+-----+
and you could omit adding ids.
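If a cross join is acceptable for the data sizes involved, the Spark 2.4+ built-ins array_intersect and size map almost directly onto the intermediate table in the question; a sketch:
from pyspark.sql import functions as F

# Pair every row of df1 with every row of df2, measure each intersection,
# then sum the sizes per df1 row.
result = (df1.toDF('words').crossJoin(df2.toDF('other'))
          .withColumn('overlap', F.size(F.array_intersect('words', 'other')))
          .groupBy('words')
          .agg(F.sum('overlap').alias('intersection_total')))
result.show()
Unlike the inner join, rows of df1 with no overlap at all still show up with a total of 0; note that array_intersect de-duplicates elements, which matches len(intersection) here since the example arrays contain no repeated words.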
