Escape newline character in DataFrame - apache-spark

I have a Parquet table in Hive which I read via Spark and write to a delimited file. The code I use is this:
var x = spark.table("myschema.my_table")
x.write.mode("overwrite").format("csv").save("/tmp/abc")
So far, so good. But the Hive table can contain data that has \n in it. When I write the data, that character breaks the line, creating an extra broken record. The character can be in any column. How can I replace it with a space while writing? I tried the following, but it didn't work:
x.write.mode("overwrite").format("csv").option("multiline", "true").save("/tmp/abc")

There is no option provided by Spark to replace \n with a space while writing a DataFrame to CSV. Check the available options here.
You can use regexp_replace to replace \n with a space and then write the DataFrame to CSV.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
val inDF = List(("a\nb", "c\nd", "d\ne", "ef")).toDF("col1", "col2", "col3", "col4")
inDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a
b| c
d| d
e| ef|
+----+----+----+----+ */
val outDF = inDF.columns.foldLeft(inDF)((df, c) => df.withColumn(c, regexp_replace(col(c), "\n", "")))
outDF.show()
/*
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| ab| cd| de| ef|
+----+----+----+----+ */
outDF.write.option("header", true).csv("outputPath")

Related

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

I am using Spark 2.4.5 and I need to calculate a sentiment score from a token list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), zero is scored.
The DataFrames look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
| ID| MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003| [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score| word|
+-----+----------+
| 1.68|abandonado|
| 3.18| abejas|
| 2.8| aborto|
| 2.46| abrasador|
| 8.13| abrazo|
+-----+----------+
The new columns to add to df1 should look like this:
+------------------+---------------------+
| MeanScore| ScoreList|
+------------------+---------------------+
| 2.95|[3.10, 2.50, 1.28,...|
| 2.15|[1.15, 3.50, 2.75,...|
| 2.75|[4.20, 1.00, 1.75,...|
| 3.25|[3.25, 2.50, 3.20,...|
| 3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but using columns with different data types gives an error.
I have also tried converting the DataFrames to RDDs and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
I want to use native PySpark functions instead of UDFs to avoid hurting performance, if possible. But I'm a beginner in Spark and would like to find the Spark way to do this.
Welcome to SO. You could do this by first joining with array_contains(MeaningfulWords, word), then grouping by ID with the aggregations first, collect_list, and mean (Spark 2.4+).
df1.show()
#+------------------+----------------------------+
#|ID |MeaningfulWords |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien] |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score| word|
#+-----+---------+
#| 1.68| casa|
#| 2.8| alejado|
#| 1.03| buen|
#| 3.68| gusto|
#| 0.68| clientes|
#| 2.1|contentos|
#| 2.68| servi|
#| 1.18| resto|
#| 1.98| bien|
#+-----+---------+
from pyspark.sql import functions as F
df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID")\
   .agg(F.first("MeaningfulWords").alias("MeaningfullWords"),
        F.collect_list("score").alias("ScoreList"),
        F.mean("score").alias("MeanScore"))\
   .show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID |MeaningfullWords |ScoreList |MeanScore |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien] |[1.18, 1.98] |1.58 |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68] |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975 |
#+------------------+----------------------------+-----------------------+------------------+

rlike in a Scala DataFrame is giving an error

I am trying to convert the Hive SQL statement below into a Spark DataFrame expression, but I am getting an error.
case when (lower(message_txt) rlike '.*sampletext(\\s?is\\s?)newtext.*' ) then 'P' else 'Y'
Sample data: message_txt = "This is new sampletext, followed by newtext"
Please help me with the equivalent Spark DataFrame statement.
Use when(lower($"value").rlike(""".sampletext(\sis\s?)newtext."""),lit('P')).otherwise("Y")
scala> df.withColumn("condition",when(lower($"value").rlike(""".sampletext(\s?is\s?)newtext."""),lit('P')).otherwise("Y")).show(false)
+-------------------------------------------+---------+
|value |condition|
+-------------------------------------------+---------+
|This is new sampletext, followed by newtext|Y |
+-------------------------------------------+---------+
Add end at the end of the case statement in SQL.
Example:
In Spark SQL:
val df=Seq(("This is new sampletext, followed by newtext")).toDF("message_txt")
df.createOrReplaceTempView("tmp")
spark.sql("select case when (lower(message_txt) rlike '.sampletext(\\s?is\\s?)newtext.' ) then 'P' else 'Y' end from tmp").show()
//Result
//+--------------------------------------------------------------------------------+
//|CASE WHEN lower(message_txt) RLIKE .sampletext(s?iss?)newtext. THEN P ELSE Y END|
//+--------------------------------------------------------------------------------+
//| Y|
//+--------------------------------------------------------------------------------+
In dataframe API:
df.withColumn("status", when(lower(col("message_txt")).rlike(".sampletext(\\s?is\\s?)newtext."),"P").otherwise("Y")).show()
//Result
//+--------------------+------+
//| message_txt|status|
//+--------------------+------+
//|This is new sampl...| Y|
//+--------------------+------+
UPDATE:
Checking for the strings sampletext and newtext in the message_txt column.
//using rlike
df.withColumn("status", when(lower(col("message_txt")).rlike("sampletext.*newtext"),"P").otherwise("Y")).show()
//using like
df.withColumn("status", when(lower(col("message_txt")).like("%sampletext%newtext%"),"P").otherwise("Y")).show()
//+--------------------+------+
//| message_txt|status|
//+--------------------+------+
//|This is new sampl...| P|
//+--------------------+------+
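If you would rather keep the Hive CASE expression verbatim inside the DataFrame API, expr can parse it directly; a rough sketch against the same df (the status column name is only for illustration):
import org.apache.spark.sql.functions.expr

// expr() accepts the full CASE WHEN ... END, including rlike, and returns a Column
df.withColumn("status",
  expr("""case when lower(message_txt) rlike '.*sampletext(\\s?is\\s?)newtext.*' then 'P' else 'Y' end""")
).show(false)
// gives Y for the sample message, matching the results above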

How to deal with white space in column names to use spark coalesce function in expr method

I am working on Spark's coalesce functionality in my project. The code works fine on columns without spaces but fails on columns whose names contain spaces.
e1.csv
id,code,type,no root
1,,A,1
2,,,0
3,123,I,1
e2.csv
id,code,type,no root
1,456,A,1
2,789,A1,0
3,,C,0
The logic code:
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e2.csv");
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("coalesce(`a.no root`,`b.no root`) AS `a.no root`");
newDS.show();
What I have tried
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("""coalesce(`a.no root`,`b.no root`) AS `a.no root`""");
The expected result would be like:
no root
1
0
1
Using the following expression
val newDS = df1.as("a").join(df2.as("b")).where("a.id==b.id").selectExpr("coalesce(a.`no root`,b.`no root`) AS `a.no root`")
will generate the expected output
+---------+
|a.no root|
+---------+
| 1|
| 0|
| 1|
+---------+
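The key detail is that the backticks must wrap only the column name, not the alias qualifier: a.`no root` resolves the alias first and then the quoted name, while `a.no root` is looked up as one literal column called "a.no root", which doesn't exist. If you would rather avoid quoting altogether, renaming the spaced columns before the join also works; a rough Scala sketch (the name no_root is just illustrative):
import org.apache.spark.sql.functions.col

// Hypothetical alternative: remove the space from the column name up front
val a = df1.withColumnRenamed("no root", "no_root").as("a")
val b = df2.withColumnRenamed("no root", "no_root").as("b")
val result = a.join(b, col("a.id") === col("b.id")).
  selectExpr("coalesce(a.no_root, b.no_root) AS no_root")
result.show()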

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In pyspark I am reading like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the function of char '"' is now done by '\u0000', but if my CSV file contains a '\u0000' char, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cell. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
This is just a workaround, in case the option suggested by @pault doesn't work:
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in range(4):
    df = df.withColumn('Col' + str(i), split(df.Column, ',')[i])
df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+

Spark csv to dataframe skip first row

I am loading a CSV into a DataFrame using:
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
  option("delimiter", ",").load("file.csv")
My input file, however, contains a date in the first row and the header in the second row.
Example:
20160612
id,name,age
1,abc,12
2,bcd,33
How can I skip this first row while converting the CSV to a DataFrame?
Here are several options that I can think of, since the Databricks CSV module doesn't seem to provide a skip-lines option:
Option one: Add a "#" character in front of the first line, and the line will be automatically considered as comment and ignored by the data.bricks csv module;
Option two: Create your customized schema and specify the mode option as DROPMALFORMED, which will drop the first line since it contains fewer tokens than expected in the customSchema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val customSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("mode", "DROPMALFORMED").
schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message above, which says it dropped the malformed line.
Option three: Write your own parser to drop any line that doesn't split into exactly three tokens:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
  filter(lines => lines.length == 3 && lines(0) != "id").
  map(row => (row(0), row(1), row(2))).
  toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
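A small variation on option three: if the goal is to drop exactly the first physical line no matter what it contains, zipWithIndex can do that before parsing. A sketch under the same assumptions as above (the header check is hard-coded to the sample file):
// Pair each line with its index, drop index 0 (the date line), then parse as before
val noDate = sc.textFile("pathToYourCsvFile").
  zipWithIndex().
  filter(_._2 > 0).
  map(_._1)
val parsed = noDate.filter(_ != "id,name,age").
  map(_.split(",")).
  map(row => (row(0), row(1), row(2))).
  toDF("id", "name", "age")
parsed.show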
