pyspark replace repeated backslash character with empty string - apache-spark

In PySpark, how do I replace the text "\"\"" with an empty string? I tried regexp_replace(F.col('new'), '\\', ''), but it is not working.
The .csv file contains:
|"\\\"\\\""|
df.show() displays it like this:
\"\"
But I expect an empty ('') string.

You should escape the quotes and the \ in the regex.
The regex for the text "\"\"" is \"\\\"\\\"\"
The Spark Scala code below works fine, and the same should work in PySpark as well (a PySpark sketch follows the Scala output).
val inDF = List(""""\"\""""").toDF()
inDF.show()
/*
+------+
| value|
+------+
|"\"\""|
+------+
*/
inDF.withColumn("value", regexp_replace('value, """\"\\\"\\\"\"""", "")).show()
/*
+-----+
|value|
+-----+
| |
+-----+
*/
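For reference, here is a rough PySpark equivalent of the Scala snippet above. It is only a sketch: it assumes an active SparkSession named spark and builds the sample column by hand.
from pyspark.sql import functions as F

# one-row dataframe whose column holds the literal six characters "\"\""
df = spark.createDataFrame([(r'"\"\""',)], ['new'])
df.show()
# +------+
# |   new|
# +------+
# |"\"\""|
# +------+

# each literal backslash is doubled for the regex engine; a raw string
# avoids a second layer of Python escaping
df.withColumn('new', F.regexp_replace('new', r'"\\"\\""', '')).show()
# +---+
# |new|
# +---+
# |   |
# +---+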

The text and the pattern you're using don't match each other.
The text you gave as an example corresponds to the output "" while the pattern corresponds to the output \
Try running the following in the playground to see what I mean.
print("\"\"")
print('\\')
I'm not sure about the rest, as I haven't used PySpark, and your code snippet may not include enough information to determine whether there are any other issues.

Related

Finding index position of a character in a string and use it in substring functions in dataframe

Data frame:
I need to truncate the string column value based on the # position. The result should be:
I am trying this code, but it is throwing a TypeError:
Though I can achieve the desired result using Spark SQL or by creating a function in Python, is there any way that it can be done in PySpark itself?
Another way is to use locate within the substr function, but this can only be used with expr.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name',
               # rows without a '#' keep the full value; others are cut just before the ' #'
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()
# +-------------------+------------+
# | acct| acct_name|
# +-------------------+------------+
# | WALGREENS #6411| WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# | CVS| CVS|
# +-------------------+------------+
You can use the split() function to achieve this. I used split with # as the delimiter to get the required value, and removed the trailing space with rtrim().
My input:
+---+-------------------+
| id| string|
+---+-------------------+
| 1| WALGREENS #6411|
| 2|CVS/PHARMACY #08864|
| 3| CVS|
| 4| WALGREENS|
| 5| Test #1234|
+---+-------------------+
Try using the following code:
from pyspark.sql.functions import split,col,rtrim
df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
#you can also use substring_index()
#df.withColumn("result", substring_index(df['string'], '#',1))
df = df.withColumn('New_string', rtrim(df['New_string']))
df.show()
Output:
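For the input above, splitting on # and trimming the trailing space should give roughly:
+---+-------------------+------------+
| id|             string|  New_string|
+---+-------------------+------------+
|  1|    WALGREENS #6411|   WALGREENS|
|  2|CVS/PHARMACY #08864|CVS/PHARMACY|
|  3|                CVS|         CVS|
|  4|          WALGREENS|   WALGREENS|
|  5|         Test #1234|        Test|
+---+-------------------+------------+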

Find out substring from url/value of a key from url

I have a table which has a url column.
I need to find out all the values corresponding to the tag key.
TableA
#+---------------------------------------------------------------------+
#| url |
#+---------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |
#| https://www.google.com/active/search?tag=inreview&type=addtional |
#| https://www.google.com/filter/search?&type=nonactive |
#+---------------------------------------------------------------------+
output
#+------------------+
#| Tag |
#+------------------+
#| final |
#| presubmitted |
#| inreview |
#+------------------+
I am able to do it in Spark SQL as below:
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Is there any option via the dataframe API or a regular expression?
PySpark:
from pyspark.sql.functions import split

df \
    .withColumn("partialURL", split("url", "tag=")[1]) \
    .withColumn("tag", split("partialURL", "&")[0]) \
    .drop("partialURL")
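If you want to stay in the dataframe API but reuse the same parse_url logic as the Spark SQL version, expr() can wrap it. This is only a sketch, assuming a dataframe df with a url column like TableA above:
from pyspark.sql.functions import expr

# parse_url is a SQL function; expr() lets us call it on a dataframe column
df.withColumn("Tag", expr("parse_url(url, 'QUERY', 'tag')")).show(truncate=False)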
You can try the below implementation -
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

// pull out everything between "tag=" and the following "&" (null if no match)
val extract: String => String = StringUtils.substringBetween(_, "tag=", "&")
val parse = udf(extract)

val urlDS = Seq("https://www.amazon.in/primeday?tag=final&value=true",
  "https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2",
  "https://www.google.com/active/search?tag=inreview&type=addtional",
  "https://www.google.com/filter/search?&type=nonactive").toDS

urlDS.withColumn("tag", parse($"value")).show(false)
+----------------------------------------------------------------+------------+
|value |tag |
+----------------------------------------------------------------+------------+
|https://www.amazon.in/primeday?tag=final&value=true |final |
|https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |presubmitted|
|https://www.google.com/active/search?tag=inreview&type=addtional|inreview |
|https://www.google.com/filter/search?&type=nonactive |null |
+----------------------------------------------------------------+------------+
The fastest solution is likely substring based, similar to Pardeep's answer. An alternative approach is to use a regex that does some light input checking, similar to:
^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)
This checks that the string starts with an optional http/https/ftp scheme followed by the two slashes and at least one character (lazily), and that tag=<string of interest> appears either somewhere in the middle or at the very end of the string.
Visually (courtesy of regex101), the matches look like:
The tag values you want are in capture group 1, so if you use regexp_extract (PySpark docs), you'll want to use idx of 1 to extract them.
The main difference between this answer and Pardeep's is that this one won't extract values from strings that don't conform to the regex, e.g. the last string in the image above doesn't match. In these edge cases, regexp_extract returns an empty string, which you can process as you wish afterwards.
Since we're invoking a regex engine, this approach is likely a little slower, but the performance difference might be imperceptible in your application.
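A minimal PySpark sketch of this approach, assuming a dataframe df with a url column as in TableA above and using the pattern as written:
from pyspark.sql.functions import regexp_extract

pattern = r'^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)'

# idx=1 extracts capture group 1 (the tag value); non-matching rows yield ''
df.withColumn('Tag', regexp_extract('url', pattern, 1)).show(truncate=False)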

Spark Read csv with missing quotes

val data = spark.read
  .option("delimiter", "\t")
  .option("quote", "\"")
  .csv("file:///opt/spark/test1.tsv")
incorrectly interprets lines with missing quotes, even though the tab delimiter is there.
For example, the line:
"aaa" \t "b'bb \t 222
is interpreted as "aaa", "b'bb 222"
instead of
"aaa", "b'bb", "222"
According to the documentation, delimiters inside quotes are ignored.
I can get around the problem by redefining the default quote character, for example:
.option("quote","+")
but it's not a good solution.
If the quotes are not closed properly, the only option is to keep them when creating the dataframe and drop them later using custom logic.
scala> spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv").show()
+-----+-----+---+
| _c0| _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb| 22|
+-----+-----+---+
Now, if you know which column might have an issue, just apply the following logic.
scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
| _c0| _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22| aaa|
+-----+-----+---+------------------+

Is there a way spark won't escape the backslash coming at beginning of each column?

I have a column that has a Windows address as follows:
\aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf
After reading it into a dataset, when I try to read the column it escapes the first backslash and prints the value as follows. Is there a way to skip this?
aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf
By default, Apache Spark does not remove the backslash:
val df1 = sc.parallelize(
  Seq(
    (1, "khan /, vaquar", "30", "/aod140med01MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf"),
    (2, "Zidan /, khan", "5", "vkhan1MediaExtractorCatalog20190820Hub26727007444841620183_6727007462021489387.nmf"),
    (3, "Zerina khan", "1", "test")
  )).toDF("id", "name", "age", "string").show
Please share your full code to further debug the issue.

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In PySpark I am reading it like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the role of the '"' character is now played by '\u0000', but if my CSV file contains a '\u0000' character, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
This is just a workaround, in case the option suggested by @pault doesn't work -
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in range(4):
    df = df.withColumn('Col' + str(i), split(df.Column, ',')[i])

df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+
