PySpark can't stop reading empty strings as null (Spark 3.0) - apache-spark

I have a CSV data file like this (^ as delimiter):
ID^name^age
0^^
1^Mike^20
When I do
df = spark.read.option("delimiter", "^").option("quote", "").option("header", "true") \
    .option("inferSchema", "true").csv(xxxxxxx)
Spark defaults the two columns after the 0 in that row to null.
df.show():
+---+----+----+
| ID|name| age|
+---+----+----+
|  0|null|null|
|  1|Mike|  20|
+---+----+----+
How can I stop PySpark from reading the data as null and keep it as an empty string instead?
I have tried adding some options at the end:
1. option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2. option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3. option("nullValue", None).option("emptyValue", None)
4. option("nullValue", "xxx").option("emptyValue", "xxx")
But no matter what I do, PySpark still reads the data as null. Is there a way to make PySpark read the empty string as-is?
Thanks

It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is using df.na.fill(...):
df = spark.read.csv('your_data_path', sep='^', header=True)
df.printSchema()
# root
#  |-- ID: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)

# Fill all columns:
# df = df.na.fill('')
# Fill specific columns:
df = df.na.fill('', subset=['name', 'age'])
df.show(truncate=False)
Output
+---+----+---+
|ID |name|age|
+---+----+---+
|0  |    |   |
|1  |Mike|20 |
+---+----+---+
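If you prefer to keep the replacement inside a select, coalesce gives the same result (a sketch, not from the original answer; it assumes the affected columns are strings, as in the inferred schema above):
from pyspark.sql import functions as F

# Replace null with '' in every string column; equivalent to df.na.fill('') for strings.
string_cols = [fld.name for fld in df.schema.fields if fld.dataType.simpleString() == 'string']
df = df.select([
    F.coalesce(F.col(c), F.lit('')).alias(c) if c in string_cols else F.col(c)
    for c in df.columns
])
df.show(truncate=False)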

Related

Why did Spark interchange values of two columns?

Can someone please explain why Spark interchanges the values of two columns when querying a DataFrame?
The values of ProposedAction are returned for SimpleMatchRate and vice versa.
Here is the code sample:
import os
import glob

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType as ST, StructField as SF, StringType as STR

spark = (SparkSession.builder
         .master("local")
         .appName("Fuzzy")
         .config("spark.jars", "../jars/mysql-connector-java-8.0.29.jar")
         .config("spark.driver.extraClassPath", "../jars/mysql-connector-java-8.0.29.jar")
         .getOrCreate())

customschema = ST([
    SF("Matched", STR()),
    SF("MatchRate", STR()),
    SF("ProposedAction", STR()),   # e.g. is_new
    SF("SimpleMatchRate", STR()),  # e.g. 76.99800
    SF("Status", STR())])

files = [file for file in glob.glob('../source_files/*fuzzy*')]
df = spark.read.csv(files, sep="\t", header="true", encoding="UTF-8", schema=customschema)
df.printSchema()
root
|-- Matched: string (nullable = true)
|-- MatchRate: string (nullable = true)
|-- ProposedAction: string (nullable = true)
|-- SimpleMatchRate: string (nullable = true)
|-- Status: string (nullable = true)
Now if I try to query the df as a table:
df.createOrReplaceTempView("tmp_table")
spark.sql("""SELECT MatchRate, ProposedAction, SimpleMatchRate
FROM tmp_table LIMIT 5""").show()
I get:
+-----------+----------------+-----------------+
| MatchRate | ProposedAction | SimpleMatchRate |
+-----------+----------------+-----------------+
| 0.043169 | 0.000000 | is_new |
| 88.67153 | 98.96907 | is_linked |
| 89.50349 | 98.94736 | is_linked |
| 99.44025 | 100.00000 | is_dupe |
| 90.78082 | 98.92473 | is_linked |
+-----------+----------------+-----------------+
I found what I was doing wrong. My schema definition did not correctly follow the order of columns in the input file. ProposedAction comes after SimpleMatchRate like below:
. . .Matched MatchRate SimpleMatchRate ProposedAction status
I modified the definition as below and the issue was fixed:
customschema = ST([
    SF("Matched", STR()),
    SF("MatchRate", STR()),
    SF("SimpleMatchRate", STR()),
    SF("ProposedAction", STR()),  # Now in the correct position, matching the input file
    SF("Status", STR())])

Convert spark dataframe with string column to StructType column

I have a CSV file with a header as "message" and rows as
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
I want to convert them into different columns a, b, c.
I tried the following code:
df1 = (spark.read.format("csv")
       .option("header", "true")
       .option("delimiter", "^")
       .option("inferSchema", "false")
       .load("testing.csv"))
But it reads everything as a single string column.
df1.printSchema() --> string
Your file is in JSON format, with the first line being "message".
The first line can be ignored by using the mode "DROPMALFORMED" when reading with Spark's DataFrameReader.
file : json-test.txt
message
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
Reading the JSON file while ignoring bad records (the initial "message" record):
val jsondf = spark.read
  .option("multiLine", false)
  .option("mode", "DROPMALFORMED")
  .json("files/file-reader-test/json-test.txt")
jsondf.show()
output:
+---+-------+----+
| a| b| c|
+---+-------+----+
| 1|hello 1|1234|
| 2|hello 2|2345|
+---+-------+----+
schema :
jsondf.printSchema()
root
|-- a: long (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
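For completeness, the same read in PySpark (a sketch; the path is the illustrative one used above):
# Ignore the malformed first line and let Spark infer the JSON columns.
jsondf = (spark.read
          .option("multiLine", False)
          .option("mode", "DROPMALFORMED")
          .json("files/file-reader-test/json-test.txt"))
jsondf.show()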

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

I'm running PySpark SQL code on the Hortonworks sandbox.
18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3
# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();
I see that the desc column is empty; I'm not sure whether an empty column needs to be handled differently when creating the DataFrame or calling methods on it.
The same error occurs when running the SQL query. It seems the SQL error is due to the "order by" clause; if I remove "order by", the query runs successfully.
Please let me know if you need more info; I'd appreciate an answer on how to handle this error.
I checked whether the name field contains any commas, as suggested by Chandan Ray.
There's no comma in the name field.
rdd1.count()
=> 1345
rdd2.count()
=> 1345
# extracting the id and name columns from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]))
rdd_name.count()
=>1345
rdd_name_comma = rdd_name.filter(lambda x: x[1].find(",") != -1)
rdd_name_comma.count()
==> 0
I found the issue: it was due to one bad record where a comma was embedded in a string. Even though the string was double-quoted, Python's split still breaks it into two columns.
I tried using the Databricks spark-csv package instead:
# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
# on pyspark
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])
df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
+---+------+--------------------+----+-----+--------------------+
| id|cat_id| name|desc|price| url|
+---+------+--------------------+----+-----+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 60|http://images.acm...|
| 2| 2|Under Armour Men'...| | 130|http://images.acm...|
| 3| 2|Under Armour Men'...| | 90|http://images.acm...|
| 4| 2|Under Armour Men'...| | 90|http://images.acm...|
| 5| 2|Riddell Youth Rev...| | 200|http://images.acm...|
df1.printSchema()
root
|-- id: integer (nullable = true)
|-- cat_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: decimal(10,0) (nullable = true)
|-- url: string (nullable = true)
df1.count()
1345
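For reference, the quoted-comma problem can also be handled without the extra package by parsing each line with Python's csv module instead of str.split (a sketch, reusing rdd1 and the column list from above):
import csv

# csv.reader honours double quotes, so a comma inside a quoted field stays in one
# column instead of being split into two.
rdd2 = rdd1.map(lambda line: next(csv.reader([line])))
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])
df1.count()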
I suppose your name field has a comma in it, so it is splitting on that as well and expecting 7 columns.
There might be some malformed lines.
Please try the code below to route any bad records in the file to a separate path:
val df = spark.read.format("csv").option("badRecordsPath", "/tmp/badRecordsPath").load("csvpath")
// This reads the CSV into a DataFrame; any malformed records are moved to the badRecordsPath you provided.
// See: https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
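If badRecordsPath is not available (it is a Databricks feature), plain Spark's PERMISSIVE mode can keep the raw text of malformed rows in a dedicated column instead. A sketch in PySpark, assuming Spark 2.3+ and the schema1 defined earlier in the question:
from pyspark.sql.types import StructType, StructField, StringType

# Extend the expected schema with a corrupt-record column, then inspect the rows
# that failed to parse against the declared schema.
schema_with_corrupt = StructType(schema1.fields + [StructField("_corrupt_record", StringType(), True)])
checked = (spark.read
           .schema(schema_with_corrupt)
           .option("mode", "PERMISSIVE")
           .option("columnNameOfCorruptRecord", "_corrupt_record")
           .csv("/user/maria_dev/spark_data/products.csv"))
checked.cache()  # Spark requires caching before filtering on the corrupt-record column alone
checked.filter("_corrupt_record IS NOT NULL").show(truncate=False)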
Here is my take on cleaning such records; we normally encounter situations like this:
a. An anomaly in the data: when the file was created, nobody checked whether "," was the best delimiter for the columns.
Here is my solution for this case:
Solution a: In such cases, we want the process to identify, as part of data cleansing, whether a record is qualified. Routing the remaining records to a bad file/collection gives us the opportunity to reconcile them later.
Below is the structure of my dataset (product_id,product_name,unit_price)
1,product-1,10
2,product-2,20
3,product,3,30
In the above case, "product,3" was supposed to be read as "product-3", which might have been a typo when the product was registered. In such a case, the sample below would work.
>>> tf = open("C:/users/ip2134/pyspark_practice/test_file.txt")
>>> trec = tf.read().splitlines()
>>> trec_clean, trec_bad = [], []   # collect qualified and bad records separately
>>> for rec in trec:
...     if rec.count(",") == 2:
...         trec_clean.append(rec)
...     else:
...         trec_bad.append(rec)
...
>>> trec_clean
['1,product-1,10', '2,product-2,20']
>>> trec_bad
['3,product,3,30']
>>> trec
['1,product-1,10', '2,product-2,20', '3,product,3,30']
The other alternative for dealing with this problem would be to see whether skipinitialspace=True works to parse out the columns.
(Ref: Python parse CSV ignoring comma with double-quotes)
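For what it's worth, skipinitialspace only strips whitespace that follows a delimiter, so it would not repair the extra comma in "3,product,3,30"; the count-based check above is still needed for that. A minimal illustration (a sketch):
import csv

# skipinitialspace=True drops the space after each comma but does not merge fields.
rows = list(csv.reader(["1, product-1, 10"], skipinitialspace=True))
print(rows)  # [['1', 'product-1', '10']]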

Aggregating tuples within a DataFrame together [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I am currently trying to do some aggregation on the Services column. I would like to group all the similar services, sum the values, and if possible flatten this into a single row.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()
pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)
df2 = df.select('cid',functions.explode_outer(df.nodes.services))
finaldataFrame = df2.select('cid',(functions.explode_outer(df2.col)).alias('Services'))
finaldataFrame.show()
I am quite new to PySpark. I have been looking at resources and trying to create a UDF to apply to that column, but the map function in PySpark only works for RDDs and not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)
First, extract the service and the value from the Services column by position. Note this assumes that the value is always in position 0 and the service is always in position 1 (as shown in your example).
import pyspark.sql.functions as f
df2 = df.select(
'cid',
f.col("Services").getItem(0).alias('value').cast('integer'),
f.col("Services").getItem(1).alias('service')
)
df2.show()
#+------------------+-------+-------+
#| cid| value|service|
#+------------------+-------+-------+
#|845124826013182686| 112931| serv1|
#|845124826013182686| 146936| serv1|
#|845124826013182686| 32718| serv2|
#|845124826013182686| 28839| serv2|
#|845124826013182686| 8710| serv2|
#|845124826013182686|2093140| serv3|
#+------------------+-------+-------+
Note that I cast the value to integer, but it may already be an integer depending on how your schema is defined.
Once the data is in this format, it's easy to pivot() it. Group by the cid column, pivot the service column, and aggregate by summing the value column:
df2.groupBy('cid').pivot('service').sum("value").show()
#+------------------+------+-----+-------+
#| cid| serv1|serv2| serv3|
#+------------------+------+-----+-------+
#|845124826013182686|259867|70267|2093140|
#+------------------+------+-----+-------+
Update
Based on the schema you provided, you will have to get the value and service by name, rather than by position:
df2 = df.select(
'cid',
f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
f.col("Services").getItem("name").alias('service')
)
The rest is the same. Also, no need to cast to integer as cpuCoreInSeconds is already a long.
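One optional refinement (a sketch): if the set of services is known up front, passing the values to pivot() avoids the extra pass Spark otherwise makes to discover the distinct services:
# Same pivot as above, but with the pivot values listed explicitly.
df2.groupBy('cid').pivot('service', ['serv1', 'serv2', 'serv3']).sum('value').show()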

Spark Adding a column consisting of a tuple to a dataframe

I am using Spark 1.6 and I want to add a column to a dataframe. The new column is actually a constant sequence: Seq("-0", "-1", "-2", "-3").
Here is my original dataframe:
scala> df.printSchema()
root
|-- user_name: string (nullable = true)
|-- test_name: string (nullable = true)
df.show()
+---------+---------+
|user_name|test_name|
+---------+---------+
|    user1|      SAT|
|    user9|      GRE|
|    user7|     MCAT|
+---------+---------+
I want to add this extra column (attempt) so that the new dataframe becomes:
+---------+---------+------------------------+
|user_name|test_name|attempt                 |
+---------+---------+------------------------+
|user1    |SAT      |Seq("-0","-1","-2","-3")|
|user9    |GRE      |Seq("-0","-1","-2","-3")|
|user7    |MCAT     |Seq("-0","-1","-2","-3")|
+---------+---------+------------------------+
How do I do that?
You can use the withColumn function:
import org.apache.spark.sql.functions._
df.withColumn("attempt", lit(Array("-0","-1","-2","-3")))
You can add it using typedLit (Spark 2.2+).
import org.apache.spark.sql.functions.typedLit
df.withColumn("attempt", typedLit(Seq("-0", "-1", "-2", "-3")))
