Why did Spark interchange values of two columns? - apache-spark

Can someone please explain why Spark interchanges the values of two columns when querying a DataFrame?
The values of ProposedAction are returned for SimpleMatchRate and vice versa.
Here is the code sample:
import os
import glob
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as ST, StructField as SF, StringType as STR
spark = (SparkSession.builder
.master("local")
.appName("Fuzzy")
.config("spark.jars", "../jars/mysql-connector-java-8.0.29.jar")
.config("spark.driver.extraClassPath", "../jars/mysql-connector-java-8.0.29.jar")
.getOrCreate())
customschema = ST([
SF("Matched", STR()),
SF("MatchRate", STR()),
SF("ProposedAction", STR()), # e.g. is_new
SF("SimpleMatchRate", STR()), # e.g. 76.99800
SF("Status", STR())])
files = [file for file in glob.glob('../source_files/*fuzzy*')]
df = spark.read.csv(files, sep="\t", header="true", encoding="UTF-8", schema=customschema)
df.printSchema()
root
|-- Matched: string (nullable = true)
|-- MatchRate: string (nullable = true)
|-- ProposedAction: string (nullable = true)
|-- SimpleMatchRate: string (nullable = true)
|-- Status: string (nullable = true)
Now if I try to query the df as a table:
df.createOrReplaceTempView("tmp_table")
spark.sql("""SELECT MatchRate, ProposedAction, SimpleMatchRate
FROM tmp_table LIMIT 5""").show()
I get:
+-----------+----------------+-----------------+
| MatchRate | ProposedAction | SimpleMatchRate |
+-----------+----------------+-----------------+
| 0.043169 | 0.000000 | is_new |
| 88.67153 | 98.96907 | is_linked |
| 89.50349 | 98.94736 | is_linked |
| 99.44025 | 100.00000 | is_dupe |
| 90.78082 | 98.92473 | is_linked |
+-----------+----------------+-----------------+

I found what I was doing wrong: my schema definition did not follow the order of the columns in the input file. ProposedAction comes after SimpleMatchRate, as shown below:
. . .Matched MatchRate SimpleMatchRate ProposedAction status
I modified the definition as below and the issue is fixed:
customschema = ST([
SF("Matched", STR()),
SF("MatchRate", STR()),
SF("SimpleMatchRate", STR()),
SF("ProposedAction", STR()), # Now in the correct position as in the input file
SF("Status", STR())])

Related

iterate array in pyspark /nested elements

I have input_data as
[[2022-04-06,test],[2022-04-05,test2]]
The schema of the input_data is:
|-- source: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #date: string (nullable = true)
| | |-- user: string (nullable = true)
I am looking for output as:
+----------+-----+
|date      |user |
+----------+-----+
|2022-04-06|test |
|2022-04-05|test2|
+----------+-----+
I have created a df from input_data and applied explode on it; further, I am thinking of exploding the result of that as well:
df.select(explode(df.source))
Is there any better way to achieve the output in Spark SQL or with a Spark DataFrame?
Note: I am getting #date and not date in input_data, so applying Spark SQL is also a bit of a challenge.
Use the inline SQL function:
df.selectExpr("inline(source)").show()
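If the #date field name gets in the way afterwards, a small follow-up sketch (assuming the same df as above) renames it:
# inline() expands the array of structs into one row per element, with one
# column per struct field; the leading '#' can then be dropped by renaming.
df.selectExpr("inline(source)") \
  .withColumnRenamed("#date", "date") \
  .show()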

pyspark can't stop reading empty string as null (spark 3.0)

I have a CSV data file like this (^ as the delimiter):
+----+------+-----+
| ID | name | age |
+----+------+-----+
|  0 |      |     |
|  1 | Mike |  20 |
+----+------+-----+
When I do
df = spark.read.option("delimiter", "^").option("quote","").option("header", "true").option(
"inferSchema", "true").csv(xxxxxxx)
Spark reads the two empty columns in the "0" row as null.
df.show():
+---+----+----+
| ID|name| age|
+---+----+----+
|  0|null|null|
|  1|Mike|  20|
+---+----+----+
How can I stop PySpark from reading the data as null and read it as an empty string instead?
I have tried adding some options at the end:
1. option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2. option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3. option("nullValue", None).option("emptyValue", None)
4. option("nullValue", "xxx").option("emptyValue", "xxx")
But no matter what I do, PySpark still reads the data as null. Is there a way to make PySpark read the empty string as-is?
Thanks
It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is df.na.fill(...):
df = spark.read.csv('your_data_path', sep='^', header=True)
# root
# |-- ID: string (nullable = true)
# |-- name: string (nullable = true)
# |-- age: string (nullable = true)
# Fill all columns
# df = df.na.fill('')
# Fill specific columns
df = df.na.fill('', subset=['name', 'age'])
df.show(truncate=False)
Output
+---+----+---+
|ID |name|age|
+---+----+---+
|0 | | |
|1 |Mike|20 |
+---+----+---+
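For comparison only, a hedged per-column alternative to na.fill using coalesce:
from pyspark.sql import functions as F
# Replace nulls with empty strings column by column; same effect as
# df.na.fill('', subset=['name', 'age']) above.
df = df.select(
    "ID",
    F.coalesce("name", F.lit("")).alias("name"),
    F.coalesce("age", F.lit("")).alias("age"),
)
df.show(truncate=False)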

Reading XML File Through Dataframe

I have an XML file in the format below.
<nt:vars>
<nt:var id="1.3.0" type="TimeStamp"> 89:19:00.01</nt:var>
<nt:var id="1.3.1" type="OBJECT ">1.9.5.67.2</nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:vars>
I built a DataFrame on it using the code below. Although the code displays 3 rows and retrieves the id and type fields, it does not display the actual values, which are 89:19:00.01, 1.9.5.67.2, and AB-CD-EF.
spark.read.format("xml").option("rootTag","nt:vars").option("rowTag","nt:var").load("/FileStore/tables/POC_DB.xml").show()
Could you please help me with any other options I need to add to the line above to bring in the values as well?
You can instead specify rowTag as nt:vars:
from pyspark.sql import functions as F

df = spark.read.format("xml").option("rowTag", "nt:vars").load("file.xml")
df.printSchema()
root
|-- nt:var: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _id: string (nullable = true)
| | |-- _type: string (nullable = true)
df.show(truncate=False)
+-------------------------------------------------------------------------------------------+
|nt:var |
+-------------------------------------------------------------------------------------------+
|[[ 89:19:00.01, 1.3.0, TimeStamp], [1.9.5.67.2, 1.3.1, OBJECT ], [AB-CD-EF, 1.3.9, STRING]]|
+-------------------------------------------------------------------------------------------+
And to get the values as separate rows, you can explode the array of structs:
df.select(F.explode('nt:var')).show(truncate=False)
+--------------------------------+
|col |
+--------------------------------+
|[ 89:19:00.01, 1.3.0, TimeStamp]|
|[1.9.5.67.2, 1.3.1, OBJECT ] |
|[AB-CD-EF, 1.3.9, STRING] |
+--------------------------------+
Or if you just want the values:
df.select(F.explode('nt:var._VALUE')).show()
+------------+
| col|
+------------+
| 89:19:00.01|
| 1.9.5.67.2|
| AB-CD-EF|
+------------+
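If you want the id, type, and value side by side rather than just the values, a short sketch along the same lines (assuming the df and the F import above) would be:
# Explode the array, then pull each struct field out into its own column.
# Field names follow the _VALUE/_id/_type convention shown in the schema above.
df.select(F.explode("nt:var").alias("var")) \
  .select(
      F.col("var._id").alias("id"),
      F.col("var._type").alias("type"),
      F.col("var._VALUE").alias("value"),
  ).show(truncate=False)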

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

I'm running PySpark SQL code on the Hortonworks sandbox:
18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3
# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
**Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.**
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();
I see that column desc is null; I am not sure whether a null column needs to be handled differently when creating the DataFrame and calling methods on it.
The same error occurs when running the SQL query. The error seems to be caused by the "order by" clause; if I remove order by, the query runs successfully.
Please let me know if you need more info; an answer on how to handle this error would be appreciated.
I tried to check whether the name field contains any commas, as suggested by Chandan Ray.
There is no comma in the name field.
rdd1.count()
=> 1345
rdd2.count()
=> 1345
# clipping id and name column from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]) )
rdd_name.count()
=>1345
rdd_name_comma = rdd_name.filter (lambda x : True if x[1].find(",") != -1 else False )
rdd_name_comma.count()
==> 0
I found the issue: it was caused by one bad record in which a comma was embedded in a string. Even though the string was double-quoted, Python split it into two columns.
I tried using the Databricks package:
# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
# on pyspark
schema1 = StructType ([ StructField("id",IntegerType(), True), \
StructField("cat_id",IntegerType(), True), \
StructField("name",StringType(), True),\
StructField("desc",StringType(), True),\
StructField("price",DecimalType(), True), \
StructField("url",StringType(), True)
])
df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
+---+------+--------------------+----+-----+--------------------+
| id|cat_id| name|desc|price| url|
+---+------+--------------------+----+-----+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 60|http://images.acm...|
| 2| 2|Under Armour Men'...| | 130|http://images.acm...|
| 3| 2|Under Armour Men'...| | 90|http://images.acm...|
| 4| 2|Under Armour Men'...| | 90|http://images.acm...|
| 5| 2|Riddell Youth Rev...| | 200|http://images.acm...|
df1.printSchema()
root
|-- id: integer (nullable = true)
|-- cat_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: decimal(10,0) (nullable = true)
|-- url: string (nullable = true)
df1.count()
1345
I suppose your name field has a comma in it, so it is being split as well; that is why 7 columns are expected.
There might be some malformed lines.
Please try the code below to route any bad records into a separate path:
val df = spark.read.format("csv").option("badRecordsPath", "/tmp/badRecordsPath").load("csvpath")
// It will read the CSV and create a DataFrame; any malformed records are moved to the path you provided.
// Please read:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
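Note that badRecordsPath is a Databricks-specific option. As a hedged sketch for plain Spark 2.x+, the built-in CSV parse modes give a similar effect (schema1 is the StructType defined earlier in this question):
# Silently drop rows that do not fit the schema:
clean_df = (spark.read
            .format("csv")
            .option("mode", "DROPMALFORMED")
            .schema(schema1)
            .load("/user/maria_dev/spark_data/products.csv"))
# Or keep every row and capture the raw text of bad ones in an extra column:
from pyspark.sql.types import StructType, StructField, StringType
permissive_schema = StructType(schema1.fields + [StructField("_corrupt_record", StringType(), True)])
bad_aware_df = (spark.read
                .format("csv")
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(permissive_schema)
                .load("/user/maria_dev/spark_data/products.csv"))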
Here is my take on cleaning up such records; we encounter these situations regularly:
a. An anomaly in the data: when the file was created, nobody checked whether "," was the best delimiter for the columns.
Here is my solution for this case:
Solution a: In such cases, we would like the process to identify, as part of data cleansing, whether a record is a qualified record. The remaining records, if routed to a bad file/collection, give us the opportunity to reconcile them.
Below is the structure of my dataset (product_id,product_name,unit_price)
1,product-1,10
2,product-2,20
3,product,3,30
In the above case, product,3 is supposed to be read as product-3, which might have been a typo when the product was registered. In such a case, the sample below would work.
>>> trec_clean, trec_bad = [], []
>>> tf = open("C:/users/ip2134/pyspark_practice/test_file.txt")
>>> trec = tf.read().splitlines()
>>> for rec in trec:
...     if rec.count(",") == 2:
...         trec_clean.append(rec)
...     else:
...         trec_bad.append(rec)
...
>>> trec_clean
['1,product-1,10', '2,product-2,20']
>>> trec_bad
['3,product,3,30']
>>> trec
['1,product-1,10', '2,product-2,20', '3,product,3,30']
The other alternative for dealing with this problem would be to check whether skipinitialspace=True would help parse out the columns.
(Ref: Python parse CSV ignoring comma with double-quotes)
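Along the same lines, a hedged sketch that addresses the quoted-comma case found above: parse each line with Python's csv module, which honours double quotes, instead of a plain x.split(","):
import csv
# csv.reader does not split on commas inside double-quoted fields.
def parse_line(line):
    return next(csv.reader([line]))
rdd2 = rdd1.map(parse_line)
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])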

Inserting arrays into parquet using spark sql query

I need to add complex data types to a parquet file using the SQL query option.
I've had partial success using the following code:
self._operationHandleRdd = spark_context_.sql(u"""INSERT OVERWRITE
TABLE _df_Dns VALUES
array(struct(struct(35,'ww'),5,struct(47,'BGN')),
struct(struct(70,'w'),1,struct(82,'w')),
struct(struct(86,'AA'),1,struct(97,'ClU'))
)""")
spark_context_.sql("select * from _df_Dns").collect()
[Row(dns_rsp_resource_record_items=[Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'),
dns_rsp_rr_type=1, dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU'))])]
So this returns an array with 3 items, but the last item appears three times.
Has anyone encountered such issues and found a way around it using just Spark SQL and not Python?
Any help is appreciated.
Using your example:
from pyspark.sql import Row
df = spark.createDataFrame([
Row(dns_rsp_resource_record_items=[Row(
dns_rsp_rr_name=Row(
seqno=35, value=u'ww'),
dns_rsp_rr_type=5,
dns_rsp_rr_value=Row(seqno=47, value=u'BGN')),
Row(
dns_rsp_rr_name=Row(
seqno=70, value=u'w'),
dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(
seqno=82, value=u'w')),
Row(
dns_rsp_rr_name=Row(
seqno=86, value=u'AA'),
dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(
seqno=97,
value=u'ClU'))])])
df.write.saveAsTable("_df_Dns")
Overwriting and inserting new rows work fine with your code (apart from the extra parenthesis):
spark.sql(u"INSERT OVERWRITE \
TABLE _df_Dns VALUES \
array(struct(struct(35,'ww'),5,struct(47,'BGN')), \
struct(struct(70,'w'),1,struct(82,'w')), \
struct(struct(86,'AA'),1,struct(97,'ClU')) \
)")
spark.sql("select * from _df_Dns").show(truncate=False)
+---------------------------------------------------------------+
|dns_rsp_resource_record_items |
+---------------------------------------------------------------+
|[[[35,ww],5,[47,BGN]], [[70,w],1,[82,w]], [[86,AA],1,[97,ClU]]]|
+---------------------------------------------------------------+
The only possible reason I see for the weird outcome you get is that your initial table had a compatible but different schema.
df.printSchema()
root
|-- dns_rsp_resource_record_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dns_rsp_rr_name: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- dns_rsp_rr_type: long (nullable = true)
| | |-- dns_rsp_rr_value: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
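One hedged way to check that hypothesis is to compare the saved table's schema with the DataFrame you expect before running the INSERT:
# INSERT OVERWRITE matches columns positionally, so a compatible but
# differently laid-out schema in the existing table can produce surprising results.
spark.table("_df_Dns").printSchema()
df.printSchema()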
