How to ignore double quotes when reading CSV file in Spark? - apache-spark

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in Spark, where the value of every field is exactly as written in the CSV (i.e. I would like to treat the " character as a regular character and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In PySpark I am reading it like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the role of the '"' character is now played by '\u0000'. But if my CSV file contained a '\u0000' character, I would get the wrong result there as well.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.

From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
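The same setting can also be expressed in the option-chaining style used in the question (the path is the placeholder from the example above); this should behave identically:
dfr = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("quote", "")  # empty string turns quoting off
    .load("path/to/some/file.csv")
)
dfr.show()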

This is just a workaround, in case the option suggested by @pault doesn't work:
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in range(4):
    df = df.withColumn('Col' + str(i), split(df.Column, ',')[i])

df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+
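If the number of columns is not fixed at four, the loop bound can be derived from the data instead of hard-coded; a small sketch of that idea, using the same df as above (parts is just a temporary helper column):
from pyspark.sql import functions as F

df = df.withColumn('parts', F.split(F.col('Column'), ','))
n = df.select(F.max(F.size('parts'))).first()[0]  # the widest row decides the column count
for i in range(n):
    df = df.withColumn('Col' + str(i), F.col('parts')[i])
df = df.drop('Column', 'parts')
df.show()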

Related

Finding index position of a character in a string and use it in substring functions in dataframe

I have a data frame with a String column, and I need to truncate each value at the position of the # character (keeping the part before it). The code I am trying throws a TypeError. Though I can achieve the desired result using Spark SQL or by creating a function in Python, is there any way that it can be done in PySpark itself?
Another way is to use locate inside substr, but that combination can only be used through expr.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name',
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()
# +-------------------+------------+
# | acct| acct_name|
# +-------------------+------------+
# | WALGREENS #6411| WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# | CVS| CVS|
# +-------------------+------------+
You can use the split() function to achieve this. I used split with # as the delimiter to get the required value and removed the trailing space with rtrim().
My input:
+---+-------------------+
| id| string|
+---+-------------------+
| 1| WALGREENS #6411|
| 2|CVS/PHARMACY #08864|
| 3| CVS|
| 4| WALGREENS|
| 5| Test #1234|
+---+-------------------+
Try using the following code:
from pyspark.sql.functions import split, col, rtrim

df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
# you can also use substring_index():
# df.withColumn("result", substring_index(df['string'], '#', 1))
df = df.withColumn("New_string", rtrim(df["New_string"]))
df.show()
Output: the New_string column now holds everything before the #, with the trailing space removed (WALGREENS, CVS/PHARMACY, CVS, WALGREENS, Test).

pyspark replace repeated backslash character with empty string

In PySpark, how do I replace the text "\"\"" with an empty string? I tried regexp_replace(F.col('new'), '\\', ''), but it is not working.
The .csv file contains
|"\\\"\\\""|
df.show() displays it like this:
\"\"
but I am expecting an empty ('') string.
You should escape the quotes and the backslashes in the regex.
The regex for the text "\"\"" is \"\\\"\\\"\".
The Spark Scala code below works fine, and the same approach should work in PySpark as well.
val inDF = List(""""\"\""""").toDF()
inDF.show()
/*
+------+
| value|
+------+
|"\"\""|
+------+
*/
inDF.withColumn("value", regexp_replace('value, """\"\\\"\\\"\"""", "")).show()
/*
+-----+
|value|
+-----+
| |
+-----+
*/
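For completeness, a minimal PySpark sketch of the same replacement; it assumes a DataFrame df with a string column named new, as in the question:
from pyspark.sql import functions as F

# Raw string: the pattern matches the six characters " \ " \ " " in order.
df = df.withColumn('new', F.regexp_replace(F.col('new'), r'\"\\\"\\\"\"', ''))
df.show()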
The text and the pattern you're using don't match each other.
The text you gave as an example prints as "" (two double-quote characters), while the pattern prints as a single backslash.
Try running the following in a Python playground to see what I mean.
print("\"\"")
print('\\')
Not sure about the rest as I haven't used pyspark and your code snippet may not include enough information to determine if there are any other issues.

Find out substring from url/value of a key from url

I have a table with a url column, and I need to find the value corresponding to the tag parameter in each URL.
TableA
#+-------------------------------------------------------------------+
#| url                                                               |
#+-------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true               |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2     |
#| https://www.google.com/active/search?tag=inreview&type=addtional  |
#| https://www.google.com/filter/search?&type=nonactive              |
#+-------------------------------------------------------------------+
output
#+------------------+
#| Tag              |
#+------------------+
#| final            |
#| presubmitted     |
#| inreview         |
#+------------------+
I am able to do it in Spark SQL as below:
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Is there any option via the DataFrame API or a regular expression?
PySpark:
from pyspark.sql.functions import split

df \
    .withColumn("partialURL", split("url", "tag=")[1]) \
    .withColumn("tag", split("partialURL", "&")[0]) \
    .drop("partialURL")
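parse_url itself is also reachable from the DataFrame API through expr, which keeps the logic of the Spark SQL query above; a sketch, assuming df is the DataFrame behind TableA:
from pyspark.sql import functions as F

df.withColumn('Tag', F.expr("parse_url(url, 'QUERY', 'tag')")).select('Tag').show(truncate=False)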
You can try the implementation below (Scala):
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val extract: String => String = StringUtils.substringBetween(_, "tag=", "&")
val parse = udf(extract)
val urlDS = Seq("https://www.amazon.in/primeday?tag=final&value=true",
  "https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2",
  "https://www.google.com/active/search?tag=inreview&type=addtional",
  "https://www.google.com/filter/search?&type=nonactive").toDS
urlDS.withColumn("tag", parse($"value")).show(false)
+----------------------------------------------------------------+------------+
|value |tag |
+----------------------------------------------------------------+------------+
|https://www.amazon.in/primeday?tag=final&value=true |final |
|https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |presubmitted|
|https://www.google.com/active/search?tag=inreview&type=addtional|inreview |
|https://www.google.com/filter/search?&type=nonactive |null |
+----------------------------------------------------------------+------------+
The fastest solution is likely substring based, similar to Pardeep's answer. An alternative approach is to use a regex that does some light input checking, similar to:
^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)
This checks that the string starts with an optional http/https/ftp scheme followed by //, then has at least one character (matched lazily), and that tag=<string of interest> appears either somewhere in the middle or at the very end of the string.
You can visualize the matches on regex101: the tag values you want land in capture group 1, so if you use regexp_extract (PySpark docs), you'll want an idx of 1 to extract them.
The main difference between this answer and Pardeep's is that this one won't extract values from strings that don't conform to the regex, e.g. the last URL in the question's example doesn't match. In those edge cases regexp_extract returns an empty string, which you can convert to NULL and then process as you wish.
Since we're invoking a regex engine, this approach is likely a little slower, but the performance difference might be imperceptible in your application.
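A PySpark sketch of this regex approach (df and the url column name are assumptions carried over from the question):
from pyspark.sql import functions as F

pattern = r'^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)'
tags = df.withColumn('Tag', F.regexp_extract(F.col('url'), pattern, 1))
# Non-matching rows come back as an empty string; convert them to NULL if preferred.
tags = tags.withColumn('Tag', F.when(F.col('Tag') == '', None).otherwise(F.col('Tag')))
tags.select('Tag').show(truncate=False)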

Spark Read csv with missing quotes

val data = spark.read
  .option("delimiter", "\t")
  .option("quote", "\"")
  .csv("file:///opt/spark/test1.tsv")
incorrectly interprets lines with missing quotes, even though the tab delimiter exists.
For example, the line:
"aaa" \t "b'bb \t 222
is interpreted as "aaa", "b'bb 222"
instead of
"aaa", "b'bb", "222"
According to the documentation, delimiters inside quotes are ignored.
I can get around the problem by redefining the default quote character, for example:
.option("quote","+")
but that is not a good solution.
If the quotes are not closed properly, the only option is to keep them when creating the dataframe and drop them later using custom logic.
scala> spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv").show()
+-----+-----+---+
| _c0| _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb| 22|
+-----+-----+---+
Now, if you know which column might have an issue, just apply the following logic.
scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
| _c0| _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22| aaa|
+-----+-----+---+------------------+
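If more than one column can be affected, the same cleanup can be looped over every string column; in PySpark terms, a sketch of that idea (df being the DataFrame read with quoting disabled, as above):
from pyspark.sql import functions as F

# Strip stray double quotes from every string-typed column.
for name, dtype in df.dtypes:
    if dtype == 'string':
        df = df.withColumn(name, F.regexp_replace(F.col(name), '"', ''))
df.show()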

How to read this custom file in spark-scala using dataframes

I have a file which is of format:
ID|Value
1|name:abc;org:tec;salary:5000
2|org:Ja;Designation:Lead
How do I read this with Dataframes?
The required output is:
1,name,abc
1,org,tec
2,org,Ja
2,designation,Lead
Please help
You will need a bit of ad-hoc string parsing, because I don't think there is a built-in parser that does exactly what you want. I hope you are confident in your format and in the fact that the special characters (|, :, and ;) do not appear in your fields, because that would break everything.
That said, you get your result with a couple of simple splits and an explode to put each property of the dictionary on a separate line.
val raw_df = sc.parallelize(List("1|name:abc;org:tec;salary:5000", "2|org:Ja;Designation:Lead"))
  .map(_.split("\\|"))
  .map(a => (a(0), a(1)))
  .toDF("ID", "value")

raw_df
  .select($"ID", explode(split($"value", ";")).as("key_value"))
  .select($"ID", split($"key_value", ":").as("key_value"))
  .select($"ID", $"key_value"(0).as("property"), $"key_value"(1).as("value"))
  .show
result:
+---+-----------+-----+
| ID| property|value|
+---+-----------+-----+
| 1| name| abc|
| 1| org| tec|
| 1| salary| 5000|
| 2| org| Ja|
| 2|Designation| Lead|
+---+-----------+-----+
Edit: alternatively, you can parse the value field with the SQL function str_to_map (via expr), or with from_json (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) if the values were actual JSON. You would however still need to explode the result into separate lines and dispatch each element of the resulting map into the desired columns. With the simple example you gave, this would not be simpler, and hence it boils down to a question of taste.
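A PySpark sketch of the str_to_map route mentioned above (the path is a placeholder; the file is assumed to have the ID|Value header line and | delimiter shown in the question):
from pyspark.sql import functions as F

df = (spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .csv("path/to/custom_file.txt"))

parsed = (
    df.withColumn("kv", F.expr("str_to_map(Value, ';', ':')"))
      .select("ID", F.explode("kv").alias("property", "value"))
)
parsed.show()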
