Find a substring from a URL / the value of a key in a URL - apache-spark

I have a table which has a url column.
I need to find all the values corresponding to the tag query parameter.
TableA
#+------------------------------------------------------------------+
#| url                                                               |
#+------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true               |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2     |
#| https://www.google.com/active/search?tag=inreview&type=addtional  |
#| https://www.google.com/filter/search?&type=nonactive              |
#+------------------------------------------------------------------+
output
#+--------------+
#| Tag          |
#+--------------+
#| final        |
#| presubmitted |
#| inreview     |
#+--------------+
I am able to do it in Spark SQL via the query below:
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Is there any option via the DataFrame API or a regular expression?

PySpark:
from pyspark.sql.functions import split

df \
    .withColumn("partialURL", split("url", "tag=")[1]) \
    .withColumn("tag", split("partialURL", "&")[0]) \
    .drop("partialURL")
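For completeness, the parse_url logic from the Spark SQL query can also be used directly on the DataFrame through expr. A minimal sketch, assuming df is the DataFrame holding the url column shown above:
from pyspark.sql import functions as F

# parse_url is a Spark SQL function; expr() makes it usable from the DataFrame
# API even in PySpark versions that do not expose a dedicated parse_url helper.
df.withColumn("Tag", F.expr("parse_url(url, 'QUERY', 'tag')")) \
    .select("Tag") \
    .show(truncate=False)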

You can try the below implementation -
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val extract: String => String = StringUtils.substringBetween(_, "tag=", "&")
val parse = udf(extract)

val urlDS = Seq("https://www.amazon.in/primeday?tag=final&value=true",
  "https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2",
  "https://www.google.com/active/search?tag=inreview&type=addtional",
  "https://www.google.com/filter/search?&type=nonactive").toDS

urlDS.withColumn("tag", parse($"value")).show(false)
+----------------------------------------------------------------+------------+
|value                                                           |tag         |
+----------------------------------------------------------------+------------+
|https://www.amazon.in/primeday?tag=final&value=true             |final       |
|https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2   |presubmitted|
|https://www.google.com/active/search?tag=inreview&type=addtional|inreview    |
|https://www.google.com/filter/search?&type=nonactive            |null        |
+----------------------------------------------------------------+------------+

The fastest solution is likely substring based, similar to Pardeep's answer. An alternative approach is to use a regex that does some light input checking, similar to:
^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)
This checks that the string starts with an (optional) http/https/ftp scheme, the colon and slashes, then at least one character (lazily), and that tag=<string of interest> appears either somewhere in the middle or at the very end of the string. On regex101 you can see that the first three example URLs match while the last one does not.
The tag values you want are in capture group 1, so if you use regexp_extract (PySpark docs), you'll want to use an idx of 1 to extract them.
The main difference between this answer and Pardeep's is that this one won't extract values from strings that don't conform to the regex, e.g. the last example URL doesn't match. In these edge cases, regexp_extract returns an empty string, which you can convert to NULL or otherwise process as you wish afterwards.
Since we're invoking a regex engine, this approach is likely a little slower, but the performance difference might be imperceptible in your application.
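A minimal PySpark sketch of this regex approach (df and url are the table and column from the question; converting the empty string to NULL is optional):
from pyspark.sql import functions as F

pattern = r"^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)"

# Capture group 1 holds the tag; for non-matching rows regexp_extract yields
# an empty string, which is turned into NULL here to mimic parse_url.
df.withColumn("Tag", F.regexp_extract("url", pattern, 1)) \
    .withColumn("Tag", F.when(F.col("Tag") == "", None).otherwise(F.col("Tag"))) \
    .select("Tag") \
    .show(truncate=False)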

Related

Is there a way to use a map/dict in Pyspark to avoid CASE WHEN condition equals pairs?

I have a problem in Pyspark creating a column based on values in another column for a new dataframe.
It's tedious, and it seems like bad practice to me, to use a lot of
CASE
WHEN column_a = 'value_1' THEN 'value_x'
WHEN column_a = 'value_2' THEN 'value_y'
...
WHEN column_a = 'value_289' THEN 'value_xwerwz'
END
In cases like this, in Python, I'm used to using a dict or, even better, a configparser file, and avoiding the if/else chain. I just pass the key and Python returns the desired value. We also have a 'fallback' option for the ELSE clause.
The problem seems to be that we are not treating a single row but all of them in one command, so using a dict/map/configparser looks like an unavailable option. I thought about using a loop with a dict, but it seems too slow and a waste of computation, as we would repeat all the conditions.
I'm still looking for this practice; if I find it, I'll post it here. But, you know, probably a lot of people already use it and I just don't know it yet. If there is no other way, ok, but using many WHEN/THEN conditions wouldn't be my choice.
Thank you
I tried to use a dict and searched for solutions like this
You could create a function which converts a dict into a Spark F.when, e.g.:
import pyspark.sql.functions as F

def create_spark_when(column, conditions, default):
    when = None
    for key, value in conditions.items():
        current_when = F.when(F.col(column) == key, value)
        if when is None:
            when = current_when.otherwise(default)
        else:
            when = current_when.otherwise(when)
    return when

df = spark.createDataFrame([(0,), (1,), (2,)])
df.show()
my_conditions = {1: "a", 2: "b"}
my_default = "c"
df.withColumn(
    "my_column",
    create_spark_when("_1", my_conditions, my_default),
).show()
Output:
+---+
| _1|
+---+
|  0|
|  1|
|  2|
+---+
+---+---------+
| _1|my_column|
+---+---------+
|  0|        c|
|  1|        a|
|  2|        b|
+---+---------+
One choice is to create a dataframe out of the dictionary and perform a join.
This would work:
Creating a Dataframe:
dict = {"value_1": "value_x", "value_2": "value_y"}
dict_df = spark.createDataFrame([(k, v) for k, v in dict.items()], ["key", "value"])
Performing the join:
df.alias("df1") \
    .join(F.broadcast(dict_df.alias("df2")), F.col("column_a") == F.col("key")) \
    .selectExpr("df1.*", "df2.value as newColumn") \
    .show()
We can broadcast the dict_df as it is small.
Alternatively, you can use a UDF - but that is not recommended.
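For reference, a minimal sketch of that UDF route (the dict contents and column names reuse the join example above; the mapping variable name and the fallback value are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

mapping = {"value_1": "value_x", "value_2": "value_y"}
fallback = "value_z"  # illustrative default, playing the role of the ELSE clause

# Plain Python dict lookup wrapped in a UDF; every row goes through the Python
# interpreter, which is why when/otherwise or a broadcast join is preferred.
lookup = F.udf(lambda v: mapping.get(v, fallback), StringType())
df = df.withColumn("newColumn", lookup(F.col("column_a")))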

Finding the index position of a character in a string and using it in substring functions in a dataframe

I have a dataframe with a String column, and I need to truncate the value based on the position of the # character. The code I am trying throws a TypeError.
Though I can achieve the desired result using Spark SQL or by creating a function in Python, is there any way that it can be done in PySpark itself?
Another way is to use locate within the substr function, but this can only be used with expr.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name',
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()
# +-------------------+------------+
# |               acct|   acct_name|
# +-------------------+------------+
# |    WALGREENS #6411|   WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# |                CVS|         CVS|
# +-------------------+------------+
You can use the split() function to achieve this. I used split with # as the delimiter to get the required value and removed the trailing spaces with rtrim().
My input:
+---+-------------------+
| id|             string|
+---+-------------------+
|  1|    WALGREENS #6411|
|  2|CVS/PHARMACY #08864|
|  3|                CVS|
|  4|          WALGREENS|
|  5|         Test #1234|
+---+-------------------+
Try using the following code:
from pyspark.sql.functions import split,col,rtrim
df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
#you can also use substring_index()
#df.withColumn("result", substring_index(df['string'], '#',1))
df = df.withColumn('New_string', rtrim(df['New_string']))
df.show()
Output:
+---+-------------------+------------+
| id|             string|  New_string|
+---+-------------------+------------+
|  1|    WALGREENS #6411|   WALGREENS|
|  2|CVS/PHARMACY #08864|CVS/PHARMACY|
|  3|                CVS|         CVS|
|  4|          WALGREENS|   WALGREENS|
|  5|         Test #1234|        Test|
+---+-------------------+------------+

How to read this custom file in spark-scala using dataframes

I have a file which is of format:
ID|Value
1|name:abc;org:tec;salary:5000
2|org:Ja;Designation:Lead
How do I read this with Dataframes?
The required output is:
1,name,abc
1,org,tec
2,org,Ja
2,designation,Lead
Please help
You will need a bit of ad-hoc string parsing because I don't think there is a built-in parser doing exactly what you want. I hope you are confident in your format and in the fact that the special characters (|, :, and ;) do not appear in your fields, because they would break the parsing.
That said, you can get your result with a couple of simple splits and an explode to put each key:value property on its own line.
import org.apache.spark.sql.functions._
import spark.implicits._

val raw_df = sc.parallelize(List("1|name:abc;org:tec;salary:5000", "2|org:Ja;Designation:Lead"))
  .map(_.split("\\|"))
  .map(a => (a(0), a(1))).toDF("ID", "value")

raw_df
  .select($"ID", explode(split($"value", ";")).as("key_value"))
  .select($"ID", split($"key_value", ":").as("key_value"))
  .select($"ID", $"key_value"(0).as("property"), $"key_value"(1).as("value"))
  .show
result:
+---+-----------+-----+
| ID| property|value|
+---+-----------+-----+
| 1| name| abc|
| 1| org| tec|
| 1| salary| 5000|
| 2| org| Ja|
| 2|Designation| Lead|
+---+-----------+-----+
Edit: alternatively, you could reach for a built-in parser on the value field, for example the str_to_map SQL function (via expr) for this key:value;key:value format, or from_json (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) if the field were first reshaped into valid JSON. You would however still need to explode the result into separate lines and dispatch each element of the resulting map into the desired columns. With the simple example you gave, this would not be simpler and hence boils down to a question of taste.

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B|   C|  D"|null|
+----+----+----+----+
In pyspark I am reading like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the role of the '"' character is now played by '\u0000', but if my CSV file contains a '\u0000' character, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts as a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
This is just a workaround, in case the option suggested by @pault doesn't work:
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
|             Column|
+-------------------+
|      "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in list(range(4)):
    df = df.withColumn('Col'+str(i), split(df.Column, ',')[i])

df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+

UDF not working

Can you help me optimize this code and make it work?
This is the original data:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|         null|
|         Venlafaxine|         null|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
I expect to get data like this:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|         Venlafaxine|  Venlafaxine|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
This is the code:
distinct_df = spark.sql("select distinct medicine_name as medicine_name from medicine where medicine_name is not null")
distinct_df.createOrReplaceTempView("distinctDF")

def getMax(num1, num2):
    pmax = (num1>=num2)*num1+(num2>num1)*num2
    return pmax

def editDistance(s1, s2):
    ed = (getMax(length(s1), length(s2)) - levenshtein(s1,s2))/
         getMax(length(s1), length(s2))
    return ed

editDistanceUdf = udf(lambda x,y: editDistance(x,y), FloatType())

def getSimilarity(str):
    res = spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
    res['medicine_name'].take(1)
    return res

getSimilarityUdf = udf(lambda x: getSimilarity(x), StringType())

res_df = df.withColumn('m_name', when((df.medicine_name.isNull)|(df.medicine_name.=="null")),getSimilarityUdf(df.original_name)
    .otherwise(df.medicine_name)).show()
Now I'm getting this error:
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'function' object has no attribute '_get_object_id'
There are a bunch of problems with your code:
You cannot use the SparkSession or distributed objects inside a udf, so getSimilarity just cannot work. If you want to compare objects like this, you have to join.
If length and levenshtein come from pyspark.sql.functions, they cannot be used inside UserDefinedFunctions. They are designed to generate SQL expressions, mapping from *Column to Column.
Column.isNull is a method, not a property, so it should be called:
df.medicine_name.isNull()
The following
df.medicine_name.=="null"
is not syntactically valid Python (it looks like a Scala calque) and would raise a SyntaxError.
Even if SparkSession access were allowed in a UserDefinedFunction, this wouldn't be a valid substitution:
spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
You should use string formatting methods instead:
spark.sql("select medicine_name, editDistanceUdf({str}, medicine_name) from distinctDf where editDistanceUdf({str}, medicine_name)>=0.85 order by 2".format(str=str))
There may be other problems, but since you didn't provide an MCVE, anything else would be pure guessing.
Once you fix the smaller mistakes, you have two choices:
Use crossJoin:
combined = df.alias("left").crossJoin(spark.table("distinctDf").alias("right"))
Then apply the edit-distance logic, filter, and use one of the methods listed in Find maximum row per group in Spark DataFrame to keep the closest match in each group (a sketch of this route follows below the list).
Use built-in approximate matching tools as explained in Efficient string matching in Apache Spark
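A minimal sketch of the crossJoin route, using only built-in Column functions for the similarity (no UDF); the column names and the 0.85 threshold come from the question, and medicine_name on the lookup side is renamed to candidate only to avoid a duplicate column name:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

candidates = spark.table("distinctDf").withColumnRenamed("medicine_name", "candidate")
combined = df.crossJoin(F.broadcast(candidates))

# Normalised edit-distance similarity expressed with Column functions.
max_len = F.greatest(F.length("original_name"), F.length("candidate"))
similarity = (max_len - F.levenshtein("original_name", "candidate")) / max_len

w = Window.partitionBy("original_name").orderBy(F.desc("similarity"))

best = (combined
        .withColumn("similarity", similarity)
        .filter(F.col("similarity") >= 0.85)
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .select("original_name", F.col("candidate").alias("m_name")))

# Fill missing medicine_name values with the best candidate, if any.
res_df = (df.join(best, "original_name", "left")
            .withColumn("medicine_name", F.coalesce("medicine_name", "m_name"))
            .drop("m_name"))
res_df.show()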
