spark replacing empty string with 0 - apache-spark

Given this mocked up sample:
<?xml version="1.0" encoding="utf-8"?>
<Report xsi:schemaLocation="foo"
Name="foo"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="foo">
<stuff>
<list>
<Details key="abdce"
Code=""/>
<Details key="12346"
Code="10"/>
</list>
</stuff>
</Report>
When I read this file into a DataFrame in Spark, Spark picks Long as the data type for the Code column (which is fine). However, when the Code value is an empty string ("") in the XML, it is replaced with a 0.
val df = spark.read.format("xml").option("rootTag", "Report").option("rowTag", "Report").option("nullValue", "").load(source_file)
df.withColumn("row", explode($"stuff.list.Details")).select($"row._Code",$"row._key").show
That returns:
+-----+----------+
|_Code| _key|
+-----+----------+
| 0| abdce|
| 10| 12346|
+-----+----------+
How do I prevent replacing the empty string with a 0? It should be null on the first row.
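A possible workaround (a sketch only, not from the original post): supply an explicit schema that declares _Code as a string, so the inferred Long type never coerces the empty attribute into 0; combined with nullValue "", it should then surface as null. The nested field names below mirror the inferred layout shown above and are assumptions about how spark-xml lays out this document.
import org.apache.spark.sql.types._
// Explicit schema: keep both attributes as strings (the "_" prefix is spark-xml's attribute convention).
val schema = new StructType()
  .add("stuff", new StructType()
    .add("list", new StructType()
      .add("Details", ArrayType(new StructType()
        .add("_Code", StringType)
        .add("_key", StringType)))))
val df = spark.read.format("xml")
  .option("rowTag", "Report")
  .option("nullValue", "")
  .schema(schema)
  .load(source_file)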

Related

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean score (sum of scores / word count) for each record. If any token in the list (df1) is not in the dictionary (df2), it is scored as zero.
The DataFrames look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
| ID| MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003| [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score| word|
+-----+----------+
| 1.68|abandonado|
| 3.18| abejas|
| 2.8| aborto|
| 2.46| abrasador|
| 8.13| abrazo|
+-----+----------+
The new columns to add to df1 should look like this:
+------------------+---------------------+
| MeanScore| ScoreList|
+------------------+---------------------+
| 2.95|[3.10, 2.50, 1.28,...|
| 2.15|[1.15, 3.50, 2.75,...|
| 2.75|[4.20, 1.00, 1.75,...|
| 3.25|[3.25, 2.50, 3.20,...|
| 3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but joining on columns with different data types gives an error.
I have also tried converting the DataFrames to RDDs and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
I want to use native PySpark functions instead of UDFs, if possible, to avoid hurting performance. But I'm a beginner in Spark and I would like to find the Spark way to do this.
You could do this by first joining on array_contains(MeaningfulWords, word), then grouping by ID with the aggregations first, collect_list, and mean (Spark 2.4+).
Welcome to SO.
df1.show()
#+------------------+----------------------------+
#|ID |MeaningfulWords |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien] |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score| word|
#+-----+---------+
#| 1.68| casa|
#| 2.8| alejado|
#| 1.03| buen|
#| 3.68| gusto|
#| 0.68| clientes|
#| 2.1|contentos|
#| 2.68| servi|
#| 1.18| resto|
#| 1.98| bien|
#+-----+---------+
from pyspark.sql import functions as F
df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID").agg(F.first("MeaningfulWords").alias("MeaningfullWords"),
                      F.collect_list("score").alias("ScoreList"),
                      F.mean("score").alias("MeanScore"))\
   .show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID |MeaningfullWords |ScoreList |MeanScore |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien] |[1.18, 1.98] |1.58 |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68] |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975 |
#+------------------+----------------------------+-----------------------+------------------+
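One hedged follow-up, since the question asks that tokens missing from the dictionary be scored as zero: with the array_contains left join above, an unmatched token simply produces no row, so it never contributes a 0. A minimal sketch of a variant (same assumed column names) that explodes the tokens first and coalesces missing scores to 0:
from pyspark.sql import functions as F
scored = (df1
    .withColumn("word", F.explode("MeaningfulWords"))              # one row per token
    .join(df2, "word", "left")                                      # unmatched tokens get a null score
    .withColumn("score", F.coalesce(F.col("score"), F.lit(0.0)))    # ...which is then scored as 0
    .groupBy("ID")
    .agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
         F.collect_list("score").alias("ScoreList"),
         F.mean("score").alias("MeanScore")))
scored.show(truncate=False)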

Replace Null values with no value in spark sql

I am writing a CSV file to a data lake from a DataFrame which has null values. Spark SQL explicitly writes the value null for the null values. I want to replace these nulls with nothing, not with some other string.
When I write the CSV file from Databricks, it looks like this:
ColA,ColB,ColC
null,ABC,123
ffgg,DEF,345
null,XYZ,789
I tried replacing the nulls with '' using na.fill, but when I do that, the file gets written like this:
ColA,ColB,ColC
'',ABC,123
ffgg,DEF,345
'',XYZ,789
And I want my CSV file to look like this. How do I achieve this from Spark SQL? I am using Databricks. Any help in this regard is highly appreciated.
ColA,ColB,ColC
,ABC,123
ffg,DEF,345
,XYZ,789
Thanks!
I think we need to use .saveAsTextFile in this case instead of the csv writer.
Example:
df.show()
//+----+----+----+
//|col1|col2|col3|
//+----+----+----+
//|null| ABC| 123|
//| dd| ABC| 123|
//+----+----+----+
//extract the header from the dataframe
val header = spark.sparkContext.parallelize(Seq(df.columns.mkString(",")))
//union the header with the data, strip the Row brackets and the literal "null", then save
header.union(df.rdd.map(x => x.toString)).map(x => x.replaceAll("\\[|\\]|null", "")).coalesce(1).saveAsTextFile("<path>")
//content of the file
//col1,col2,col3
//,ABC,123
//dd,ABC,123
If the first field in your data is not null, then you can use the csv option:
df.write.option("nullValue", null).mode("overwrite").csv("<path>")
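As another hedged alternative (not from the original answer, and assuming Spark 2.4+ where the CSV writer has an emptyValue option): fill the string nulls with empty strings and tell the writer to emit empty values without surrounding quotes, for example:
df.na.fill("")
  .write
  .option("emptyValue", "")
  .mode("overwrite")
  .csv("<path>")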

How to read this custom file in spark-scala using dataframes

I have a file which is of format:
ID|Value
1|name:abc;org:tec;salary:5000
2|org:Ja;Designation:Lead
How do I read this with DataFrames?
The required output is:
1,name,abc
1,org,tec
2,org,Ja
2,designation,Lead
Please help
You will need a bit of ad hoc string parsing because I don't think there is a built-in parser doing exactly what you want. I hope you are confident in your format and in the fact that the special characters (|, :, and ;) never appear inside your fields, because they would break everything.
That said, you can get your result with a couple of simple splits and an explode to put each property of the dictionary on its own line.
val raw_df = sc.parallelize(List("1|name:abc;org:tec;salary:5000", "2|org:Ja;Designation:Lead"))
  .map(_.split("\\|"))
  .map(a => (a(0), a(1))).toDF("ID", "value")

raw_df
  .select($"ID", explode(split($"value", ";")).as("key_value"))
  .select($"ID", split($"key_value", ":").as("key_value"))
  .select($"ID", $"key_value"(0).as("property"), $"key_value"(1).as("value"))
  .show
result:
+---+-----------+-----+
| ID| property|value|
+---+-----------+-----+
| 1| name| abc|
| 1| org| tec|
| 1| salary| 5000|
| 2| org| Ja|
| 2|Designation| Lead|
+---+-----------+-----+
Edit: alternatively, you could use the from_json function (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) on the value field to parse it. You would, however, still need to explode the result into separate lines and dispatch each element of the resulting object into the desired columns. With the simple example you gave, this would not be any simpler, so it boils down to a question of taste.
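For completeness, a small sketch of yet another built-in route (not mentioned above): Spark SQL's str_to_map can split the value field in one step, assuming the same raw_df as before and that ; and : never occur inside the values themselves.
import org.apache.spark.sql.functions.{explode, expr}
// str_to_map(text, pairDelim, keyValueDelim) builds a map column; exploding it
// yields one row per key/value pair, aliased here to property and value.
raw_df
  .select($"ID", explode(expr("str_to_map(value, ';', ':')")).as(Seq("property", "value")))
  .show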

How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV (I would like to treat the " character as a regular character, and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In PySpark I am reading it like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the above example, as the role of the '"' character is now played by '\u0000', but if my CSV file contains a '\u0000' character, I would also get the wrong result.
Therefore, my question is:
How do I disable the quote option, so that no character acts like a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
This is just a workaround, in case the option suggested by #pault doesn't work:
from pyspark.sql.functions import split
df = spark.createDataFrame([('"A,B","C", D"',),('""A,"B","""C", D"D"',)], schema = ['Column'])
df.show()
+-------------------+
| Column|
+-------------------+
| "A,B","C", D"|
|""A,"B","""C", D"D"|
+-------------------+
for i in range(4):
    df = df.withColumn('Col' + str(i), split(df.Column, ',')[i])

df = df.drop('Column')
df.show()
+----+----+-----+-----+
|Col0|Col1| Col2| Col3|
+----+----+-----+-----+
| "A| B"| "C"| D"|
| ""A| "B"|"""C"| D"D"|
+----+----+-----+-----+

Running a high volume of Hive queries from PySpark

I want to execute a very large number of Hive queries and store the results in a DataFrame.
I have a very large dataset structured like this:
+-------------------+-------------------+---------+--------+--------+
| visid_high| visid_low|visit_num|genderid|count(1)|
+-------------------+-------------------+---------+--------+--------+
|3666627339384069624| 693073552020244687| 24| 2| 14|
|1104606287317036885|3578924774645377283| 2| 2| 8|
|3102893676414472155|4502736478394082631| 1| 2| 11|
| 811298620687176957|4311066360872821354| 17| 2| 6|
|5221837665223655432| 474971729978862555| 38| 2| 4|
+-------------------+-------------------+---------+--------+--------+
I want to create a derived dataframe which uses each row as input for a secondary query:
result_set = []
for session in sessions.collect()[:100]:
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date, session['visid_high'], session['visid_low'], session['visit_num'])
    result = hc.sql(query).collect()
    result_set.append(result)
This works as expected for a hundred rows, but causes Livy to time out with higher loads.
I tried using map or foreach:
def f(session):
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date, session.visid_high, session.visid_low, session.visit_num)
    return hc.sql(query)

test = sampleRdd.map(f)
This causes PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable. I understand from this answer and this answer that the Spark context object is not serializable.
I didn't try generating all the queries first and then running them as a batch, because I understand from this question that batch querying is not supported.
How do I proceed?
What I was looking for is:
Querying all required data in one go by writing the appropriate joins
Adding custom columns based on the values of the large DataFrame, using pyspark.sql.functions.when() and df.withColumn(), then
Flattening the resulting dataframe with df.groupBy() and pyspark.sql.functions.sum()
I think I didn't fully realize that Spark handles DataFrames lazily. The supported way of working is to define the large DataFrames and then the appropriate transformations. Spark will try to execute the data retrieval and the transformations in one go, at the last moment and distributed across the cluster. I was trying to limit the scope up front, which led to unsupported functionality.
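For illustration, a hedged sketch of what that join-based approach might look like here; the names hit_data, sessions, hc, and date are assumptions carried over from the question, and the when()/withColumn() custom columns would slot in between the join and the final aggregation.
from pyspark.sql import functions as F
# Pull the day's hits once, instead of issuing one Hive query per session row.
hits = hc.table("hit_data").where(F.col("dt") == date)
# One distributed join replaces the per-row SELECTs...
joined = hits.join(sessions, on=["visid_high", "visid_low", "visit_num"], how="inner")
# ...and a single groupBy/agg produces the per-session prop8 counts.
result = (joined
    .groupBy("visid_high", "visid_low", "visit_num", "prop8")
    .agg(F.count(F.lit(1)).alias("hit_count")))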
