Udf not working - python-3.x

Can you help me optimize this code and make it work?
This is the original data:
+--------------------+-------------+
| original_name|medicine_name|
+--------------------+-------------+
| Venlafaxine| Venlafaxine|
| Lacrifilm 5mg/ml| Lacrifilm|
| Lacrifilm 5mg/ml| null|
| Venlafaxine| null|
|Vitamin D10,000IU...| null|
| paracetamol| null|
| mucolite| null|
I expect to get data like this:
+--------------------+-------------+
| original_name|medicine_name|
+--------------------+-------------+
| Venlafaxine| Venlafaxine|
| Lacrifilm 5mg/ml| Lacrifilm|
| Lacrifilm 5mg/ml| Lacrifilm|
| Venlafaxine| Venlafaxine|
|Vitamin D10,000IU...| null|
| paracetamol| null|
| mucolite| null|
This is the code:
distinct_df = spark.sql("select distinct medicine_name as medicine_name from medicine where medicine_name is not null")
distinct_df.createOrReplaceTempView("distinctDF")
def getMax(num1, num2):
    pmax = (num1>=num2)*num1+(num2>num1)*num2
    return pmax
def editDistance(s1, s2):
    ed = (getMax(length(s1), length(s2)) - levenshtein(s1,s2))/
    getMax(length(s1), length(s2))
    return ed
editDistanceUdf = udf(lambda x,y: editDistance(x,y), FloatType())
def getSimilarity(str):
    res = spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
    res['medicine_name'].take(1)
    return res
getSimilarityUdf = udf(lambda x: getSimilarity(x), StringType())
res_df = df.withColumn('m_name', when((df.medicine_name.isNull)|(df.medicine_name.=="null")),getSimilarityUdf(df.original_name)
    .otherwise(df.medicine_name)).show()
Now I'm getting this error:
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'function' object has no attribute '_get_object_id'

There are a number of problems with your code:
You cannot use SparkSession or distributed objects inside a udf, so getSimilarity simply cannot work. If you want to compare objects like this, you have to join.
If length and levenshtein come from pyspark.sql.functions, they cannot be used inside a UserDefinedFunction. They are designed to generate SQL expressions, mapping from *Column to Column.
Column.isNull is a method, not a property, so it should be called:
df.medicine_name.isNull()
The following
df.medicine_name.=="null"
is not syntactically valid Python (it looks like a Scala calque) and would raise a SyntaxError.
Even if SparkSession access were allowed in a UserDefinedFunction, this wouldn't be a valid substitution:
spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
You should use string formatting methods, and quote the substituted value so that SQL reads it as a string literal:
spark.sql("select medicine_name, editDistanceUdf('{str}', medicine_name) from distinctDf where editDistanceUdf('{str}', medicine_name)>=0.85 order by 2".format(str=str))
There may be other problems, but since you didn't provide an MCVE, anything else would be pure guessing.
Once you fix the smaller mistakes, you have two choices:
Use crossJoin:
combined = df.alias("left").crossJoin(spark.table("distinctDf").alias("right"))
Then apply the udf, filter, and use one of the methods listed in Find maximum row per group in Spark DataFrame to find the closest match in each group.
Use built-in approximate matching tools, as explained in Efficient string matching in Apache Spark.
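To make the scoring concrete, here is a minimal plain-Python sketch of the similarity the question is after: (max length - edit distance) / max length, keeping the best candidate at or above 0.85. It is not Spark code and not the asker's implementation. The names levenshtein, similarity and best_match are made up for illustration. In Spark the same score would be built from column expressions (for example the built-in levenshtein, length and greatest functions) on the cross-joined rows, not inside a udf.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """(max_len - distance) / max_len, the score described in the question."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - levenshtein(a, b)) / longest


def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate at or above threshold, else None."""
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


known = ["Venlafaxine", "Lacrifilm"]  # hypothetical distinct medicine names
print(best_match("Venlafaxin", known))
print(best_match("mucolite", known))
```

A sketch like this is handy for checking by hand which rows a given threshold would actually fill in before running the join on the full data.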

Related

Is there a way to use a map/dict in Pyspark to avoid CASE WHEN condition equals pairs?

I have a problem in Pyspark creating a column based on values in another column for a new dataframe.
It's tedious, and it doesn't seem like good practice, to write a long chain of
CASE
WHEN column_a = 'value_1' THEN 'value_x'
WHEN column_a = 'value_2' THEN 'value_y'
...
WHEN column_a = 'value_289' THEN 'value_xwerwz'
END
In cases like this, in Python, I'm used to using a dict or, even better, a configparser file to avoid the if/else conditions. I just pass the key and Python returns the desired value. There is also a 'fallback' option that covers the ELSE clause.
The problem, it seems to me, is that we are not treating a single row but all of them in one command, so a dict/map/configparser doesn't look like an option. I thought about using a loop with a dict, but that seems too slow and a waste of computation, since we would repeat all the conditions.
I'm still looking for this practice, if I find it, I'll post it here. But, you know, probably a lot of people already use it and I don't know yet. But if there is no other way, ok. Use many WHEN THEN conditions won't be a choice.
Thank you
I tried to use a dict and searched for solutions like this
You could create a function which converts a dict into a Spark F.when, e.g.:
import pyspark.sql.functions as F
def create_spark_when(column, conditions, default):
    when = None
    for key, value in conditions.items():
        current_when = F.when(F.col(column) == key, value)
        if when is None:
            when = current_when.otherwise(default)
        else:
            when = current_when.otherwise(when)
    return when
df = spark.createDataFrame([(0,), (1,), (2,)])
df.show()
my_conditions = {1: "a", 2: "b"}
my_default = "c"
df.withColumn(
    "my_column",
    create_spark_when("_1", my_conditions, my_default),
).show()
Output:
+---+
| _1|
+---+
| 0|
| 1|
| 2|
+---+
+---+---------+
| _1|my_column|
+---+---------+
| 0| c|
| 1| a|
| 2| b|
+---+---------+
One option is to create a DataFrame from the dictionary and perform a join.
This would work:
Creating a Dataframe:
dict={"value_1": "value_x", "value_2": "value_y"}
dict_df=spark.createDataFrame([(k,v) for k,v in dict.items()], ["key","value"])
Performing the join:
df.alias("df1")\
.join(F.broadcast(dict_df.alias("df2")), F.col("column_a")==F.col("key"))\
.selectExpr("df1.*","df2.value as newColumn")\
.show()
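We can broadcast dict_df because it is small. Note that this is an inner join, so rows whose column_a has no matching key are dropped; use a left join (with F.coalesce and a default value) if you need an ELSE-style fallback.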
Alternatively, you can use a UDF - but that is not recommended.

Finding index position of a character in a string and use it in substring functions in dataframe

data frame :
I need to truncate the String column value based on the # position. The result should be :
I am trying this code but it is throwing a TypeError :
Though I can achieve the desired result using SparkSql or by creating a function in Python, is there any way that it can be done in pyspark itself?
Another way is to use locate within the substr function, but this can only be used with expr.
from pyspark.sql import functions as func
spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name',
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()
# +-------------------+------------+
# | acct| acct_name|
# +-------------------+------------+
# | WALGREENS #6411| WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# | CVS| CVS|
# +-------------------+------------+
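For reference, the per-value logic behind that expression can be sketched in plain Python: find the position of "#", keep what comes before it, and leave values without "#" untouched. This is not Spark code. The function name truncate_at_hash is made up, and unlike the substr/locate version (which assumes exactly one space before the "#"), this sketch strips any trailing whitespace.

```python
def truncate_at_hash(value):
    """Return the part of value before the first '#', without trailing spaces."""
    if value is None:
        return None  # mirror SQL behaviour: null in, null out
    position = value.find("#")  # 0-based index, -1 when '#' is absent
    if position == -1:
        return value
    return value[:position].rstrip()


for acct in ["WALGREENS #6411", "CVS/PHARMACY #08864", "CVS"]:
    print(repr(truncate_at_hash(acct)))
```

Keep in mind that Python's find is 0-based while SQL's locate is 1-based and returns 0 when nothing is found, which is why the Spark expression needs the like('%#%') guard.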
You can use the split() function to achieve this. I used split with # as the delimiter to get the required value and removed the trailing spaces with rtrim().
My input:
+---+-------------------+
| id| string|
+---+-------------------+
| 1| WALGREENS #6411|
| 2|CVS/PHARMACY #08864|
| 3| CVS|
| 4| WALGREENS|
| 5| Test #1234|
+---+-------------------+
Try using the following code:
from pyspark.sql.functions import split,col,rtrim
df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
#you can also use substring_index()
#df.withColumn("result", substring_index(df['string'], '#',1))
df = df.withColumn('New_string', rtrim(df['New_string']))
df.show()

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

I am using spark 2.4.5 and I need to calculate the sentiment score from a token list column (MeaningfulWords column) of df1, according to the words in df2 (spanish sentiment dictionary). In df1 I must create a new column with the scores list of tokens and another column with the mean of scores (sum of scores / count words) of each record. If any token in the list (df1) is not in the dictionary (df2), zero is scored.
The DataFrames look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
| ID| MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003| [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score| word|
+-----+----------+
| 1.68|abandonado|
| 3.18| abejas|
| 2.8| aborto|
| 2.46| abrasador|
| 8.13| abrazo|
+-----+----------+
The new columns to add in df1, should look like this:
+------------------+---------------------+
| MeanScore| ScoreList|
+------------------+---------------------+
| 2.95|[3.10, 2.50, 1.28,...|
| 2.15|[1.15, 3.50, 2.75,...|
| 2.75|[4.20, 1.00, 1.75,...|
| 3.25|[3.25, 2.50, 3.20,...|
| 3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but joining on columns with different data types gives an error.
I have also tried converting the Dataframes to RDD and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]
RDD1=swRemoved.rdd.map(list)
RDD2=Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
I'd like to use native PySpark functions instead of UDFs, if possible, to avoid hurting performance. I'm a beginner in Spark, though, and I'd like to find the Spark way to do this.
You could do this by first joining with array_contains on word, then using groupBy with the aggregations first, collect_list, and mean (Spark 2.4+).
Welcome to SO.
df1.show()
#+------------------+----------------------------+
#|ID |MeaningfulWords |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien] |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score| word|
#+-----+---------+
#| 1.68| casa|
#| 2.8| alejado|
#| 1.03| buen|
#| 3.68| gusto|
#| 0.68| clientes|
#| 2.1|contentos|
#| 2.68| servi|
#| 1.18| resto|
#| 1.98| bien|
#+-----+---------+
from pyspark.sql import functions as F
df1.join(df2, F.expr("""array_contains(MeaningfulWords,word)"""),'left')\
.groupBy("ID").agg(F.first("MeaningfulWords").alias("MeaningfullWords")\
,F.collect_list("score").alias("ScoreList")\
,F.mean("score").alias("MeanScore"))\
.show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID |MeaningfullWords |ScoreList |MeanScore |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien] |[1.18, 1.98] |1.58 |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68] |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975 |
#+------------------+----------------------------+-----------------------+------------------+
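One thing to check against the requirement in the question: with a left join on array_contains, tokens that are missing from the dictionary produce no joined row, so they are left out of ScoreList and of the mean instead of counting as zero, and collect_list does not promise the token order. A plain-Python reference of the intended per-record scoring can help verify results on a sample. This is only a sketch. The name score_tokens is hypothetical, and the dictionary is assumed to fit in an ordinary dict.

```python
def score_tokens(tokens, word_scores):
    """Score each token (0.0 when unknown) and return (score_list, mean_score)."""
    score_list = [word_scores.get(token, 0.0) for token in tokens]
    # Mean over all tokens, so unknown words pull the average down.
    mean_score = sum(score_list) / len(score_list) if score_list else 0.0
    return score_list, mean_score


# Sample dictionary taken from the df2 rows shown above.
word_scores = {"resto": 1.18, "bien": 1.98, "casa": 1.68}

print(score_tokens(["resto", "bien"], word_scores))
print(score_tokens(["resto", "bien", "desconocido"], word_scores))
```

If the zero-for-missing rule matters, the Spark version would need to explode the token list, left join on the single word, and fill null scores with 0 before aggregating.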

Convert string type to array type in spark sql

I have a table in Spark SQL in Databricks with a string column. I converted it into a new column of array datatype, but the values are still a single string. The datatype is array in the table schema.
Column as String
Data1
[2461][2639][2639][7700][7700][3953]
Converted to Array
Data_New
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then I write it as parquet and use it as a Spark SQL table in Databricks.
When I search for a string using the array_contains function, I get false:
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for the whole string, the query returns true.
Please suggest how I can split this string into an array so that I can find any element using the array_contains function.
Just remove leading and trailing brackets from the string then split by ][ to get an array of strings:
df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1 |Data_New |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
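Actually, this is not really an array but one full string, so you need a regex or similar. Escape the brackets, otherwise [2461] is read as a character class, and match against the original string column:
expr = r"\[2461\]"
df_new.filter(df_new["Data1"].rlike(expr))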
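A pattern such as "[2461]" written without escapes is a character class, meaning "any one of the characters 2, 4, 6 or 1", so it matches far more rows than intended. The short demonstration below uses Python's re module purely as an illustration. Spark's rlike uses Java regular expressions, but square brackets mean the same thing there. The sample rows are made up.

```python
import re

rows = ["[2461][2639][7700]", "[7700][1953]"]  # hypothetical sample values

naive = "[2461]"               # character class: any of '2', '4', '6', '1'
escaped = re.escape("[2461]")  # literal text "[2461]", brackets escaped

naive_hits = [bool(re.search(naive, row)) for row in rows]
escaped_hits = [bool(re.search(escaped, row)) for row in rows]

print(naive_hits)    # [True, True]: the second row matches only because it contains '1'
print(escaped_hits)  # [True, False]: only the row that really contains [2461]
```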
import
from pyspark.sql import functions as sf, types as st
create table
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
| col1|
+--------------------+
|[2461][2639][2639...|
| null|
+--------------------+
convert type
def spliter(x):
    if x is not None:
        return x[1:-1].split("][")
    else:
        return None
udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
| col1| array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
| null| null| null|
+--------------------+--------------------+-----+

Does GraphFrames api support creation of Bipartite graphs?

Does GraphFrames api support creation of Bipartite graphs in the current version?
Current version: 0.1.0
Spark version : 1.6.1
As pointed out in the comments to this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both have more than enough flexibility to let you create bipartite graphs. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex / object types. While that works with RDDs, it's not going to work for DataFrames. A row in a DataFrame has a fixed schema -- it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.
Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that's essentially a linear sub-type of both types of objects in your bipartite graph -- it has to contain all of the fields of both types of objects. This is actually pretty easy -- a join with full_outer is going to give you that. Something like this:
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
).toDF("id","team","record")
You could then create a super-set DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+
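The idea of a superset row, where every vertex carries all columns of both vertex types and the ones that don't apply are null, can also be sketched without Spark. The following is a plain-Python illustration only. The function name full_outer_join is made up, and rows are modelled as dicts keyed by column name.

```python
def full_outer_join(left_rows, right_rows, key="id"):
    """Merge two lists of dict rows on key, filling absent columns with None."""
    left_by_key = {row[key]: row for row in left_rows}
    right_by_key = {row[key]: row for row in right_rows}

    # Collect the union of column names, preserving first-seen order.
    columns = []
    for row in list(left_rows) + list(right_rows):
        for name in row:
            if name not in columns:
                columns.append(name)

    merged = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        combined = dict.fromkeys(columns)  # every column present, default None
        combined.update(left_by_key.get(k, {}))
        combined.update(right_by_key.get(k, {}))
        merged.append(combined)
    return merged


players = [{"id": 1, "name": "dave", "age": 34},
           {"id": 2, "name": "griffin", "age": 44}]
teams = [{"id": 101, "team": "lions", "record": "7-1"},
         {"id": 102, "team": "tigers", "record": "5-3"},
         {"id": 103, "team": "bears", "record": "0-9"}]

for vertex in full_outer_join(players, teams):
    print(vertex)
```

As in the DataFrame version, this relies on player ids and team ids never colliding, which is what keeps the two sides of the bipartite graph apart.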
You could possibly do it a little cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+
I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always create a vertex via joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
  (1L, Player("dave", 34)),
  (2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(dave,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
With all respect to the previous answer, this seems like a more flexible way to handle it -- without having to share a trait between the combined objects.
