Spark Read csv with missing quotes - apache-spark

spark.read
val data = spark.read
.option("delimiter", "\t")
.quote("quote", "\"")
.csv("file:///opt/spark/test1.tsv")
incorrectly interprets lines with missing quotes, even though tab delimeter exists
for example line:
"aaa" \t "b'bb \t 222
is interpreted as "aaa", "b`bb 222"
instead of
"aaa", "b`bb", "222"
according to the documentation deli-meters inside quotes are ignored.
I can get around the problem by re defining default quote for example:
.option("quote","+")
but it's not a good solution

if quotes are not closing properly, the only option is to keep it when creating dataframe and later on drop it using custom logic.
scala> spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv").show()
+-----+-----+---+
| _c0| _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb| 22|
+-----+-----+---+
Now if you know which column, might have an issue just apply the following logic.
scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
| _c0| _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22| aaa|
+-----+-----+---+------------------+

Related

pyspark replace repeated backslash character with empty string

In pyspark , how to replace the text ( "\"\"") with empty string .tried with regexp_replace(F.col('new'),'\\' ,''). but not working.
in .csv File contains
|"\\\"\\\""|
df.show is showing like this
\"\"
But i am expecting to print empty('') string
You should escape quotes and \ in regex.
Regex for text "\"\"" is \"\\\"\\\"\"
Below spark-scala code is working fine and same should work in pyspark also.
val inDF = List(""""\"\""""").toDF()
inDF.show()
/*
+------+
| value|
+------+
|"\"\""|
+------+
*/
inDF.withColumn("value", regexp_replace('value, """\"\\\"\\\"\"""", "")).show()
/*
+-----+
|value|
+-----+
| |
+-----+
*/
The text and the pattern you're using don't match with each other.
The text you gave as an example would equal to an output of "" while the pattern would be equal to an output of \
Try running the following in the playground to see what I mean.
print("\"\"")
print('\\')
Not sure about the rest as I haven't used pyspark and your code snippet may not include enough information to determine if there are any other issues.

Find out substring from url/value of a key from url

I have a table which has url column
I need to find out all the values correspond to tag
TableA
#+---------------------------------------------------------------------+
#| url |
#+---------------------------------------------------------------------+
#| https://www.amazon.in/primeday?tag=final&value=true |
#| https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |
#| https://www.google.com/active/search?tag=inreview&type=addtional |
#| https://www.google.com/filter/search?&type=nonactive |
output
#+------------------+
#| Tag |
#+------------------+
#| final |
#| presubmitted |
#| inreview |
I am able to do it in spark sql via below
spark.sql("""select parse_url(url,'QUERY','tag') as Tag from TableA""")
Any option via dataframe or regular expression.
PySpark:
df \
.withColumn("partialURL", split("url", "tag=")[1]) \
.withColumn("tag", split("partialURL", "&")[0]) \
.drop("partialURL")
You can try the below implementation -
val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val extract: String => String = StringUtils.substringBetween(_,"tag=","&")
val parse = udf(extract)
val urlDS = Seq("https://www.amazon.in/primeday?tag=final&value=true",
"https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2",
"https://www.google.com/active/search?tag=inreview&type=addtional",
"https://www.google.com/filter/search?&type=nonactive").toDS
urlDS.withColumn("tag",parse($"value")).show()
+----------------------------------------------------------------+------------+
|value |tag |
+----------------------------------------------------------------+------------+
|https://www.amazon.in/primeday?tag=final&value=true |final |
|https://www.filipkart.in/status?tag=presubmitted&Id=124&key=2 |presubmitted|
|https://www.google.com/active/search?tag=inreview&type=addtional|inreview |
|https://www.google.com/filter/search?&type=nonactive |null |
+----------------------------------------------------------------+------------+
The fastest solution is likely substring based, similar to Pardeep's answer. An alternative approach is to use a regex that does some light input checking, similar to:
^(?:(?:(?:https?|ftp):)?\/\/).+?tag=(.*?)(?:&.*?$|$)
This checks that the string starts with a http/https/ftp protocol, the colon and slashes, at least one character (lazily), and either tag=<string of interest> appears somewhere in the middle or at the very end of the string.
Visually (courtesy of regex101), the matches look like:
The tag values you want are in capture group 1, so if you use regexp_extract (PySpark docs), you'll want to use idx of 1 to extract them.
The main difference between this answer and Pardeep's is that this one won't extract values from strings that don't conform to the regex, e.g. the last string in the image above doesn't match. In these edge cases, regexp_extract will return a NULL, which you can process as you wish afterwards.
Since we're invoking a regex engine, this approach is likely a little slower, but the performance difference might be imperceptible in your application.

How to use Split function in spark sql with delemter |#|?

My column is having data as,
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delemeted by |#| (pipe ,# , pipe).
How to split this with spark sql.
I am trying spark.sql("select split(col,'|#|')").show() but it is not giving me proper result.
I tried escaping \ but still no luck.
Can anyone knows what is going on here..
Note: I need solution for spark sql only.
I am not sure if I have understood your problem statement properly or not but to split a string by its delimiter is fairly simple and can be done in a variety of ways.
One of the methods is to use SUBSTRING_INDEX -
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by -
%sql
select *,substring_index(col1,'|#|',1) as value1, substring_index(col1,'|#|',2) as value2, substring_index(col1,'|#|',3) as value3 from testSplit
Result -
OR - Split Function Documentation
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Result -
Do let me know if this fulfills your requirement or not .
Check below code.
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data,split('abc|#|pqr|#|xyz', '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[abc, pqr, xyz]|
+---------------+---------------+
Note: inside spark.sql function pass your select query between """ """ & escape special symbols with \\.

How to read this custom file in spark-scala using dataframes

I have a file which is of format:
ID|Value
1|name:abc;org:tec;salary:5000
2|org:Ja;Designation:Lead
How do I read this with Dataframes?
The required output is:
1,name,abc
1,org,tec
2,org,Ja
2,designation,Lead
Please help
You will need a bit of ad-hoc string parsing because I don't think there is a built in parser doing exactly what you want. I hope you are confident in your format and in the fact that special characters (|, :, and ;) do not appear in your fields because it would screw everything up.
That being given, you get your result with a couple of simple splits and an explode to put each property in the dictionary on different lines.
val raw_df = sc.parallelize(List("1|name:abc;org:tec;salary:5000", "2|org:Ja;Designation:Lead"))
.map(_.split("\\|") )
.map(a => (a(0),a(1))).toDF("ID", "value")
raw_df
.select($"ID", explode(split($"value", ";")).as("key_value"))
.select($"ID", split($"key_value", ":").as("key_value"))
.select($"ID", $"key_value"(0).as("property"), $"key_value"(1).as("value"))
.show
result:
+---+-----------+-----+
| ID| property|value|
+---+-----------+-----+
| 1| name| abc|
| 1| org| tec|
| 1| salary| 5000|
| 2| org| Ja|
| 2|Designation| Lead|
+---+-----------+-----+
Edit: alternatively, you can use the from_json function (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) on the value field to parse it. You would however still need to explode the result into separate lines and dispatch each element of the resulting object in the desired column. With the simple example you gave, this would not be simpler and hence boils down to a question of taste.

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In pyspark I am reading like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the function of char '"' is now done by '\u0000', but if my CSV file contains a '\u0000' char, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except comas) to simply be copied into their respective data frame cell. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
path="path/to/some/file.csv",
header="true",
inferSchema="true",
quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
This is just a work around, if the option suggested by #pault doesn't work -
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in list(range(4)):
df = df.withColumn('Col'+str(i),split(df.Column, ',')[i])
df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+

Resources