How to specify ; as the field delimiter when reading a CSV file in Spark

How to specify ; as the field delimiter in Spark
I have the code below:

Dataset<Row> dataset = session.read().format("csv").option("delimiter", ...);

Please let me know what value I can pass here.

You can use the following piece of code to load data from a file delimited with ";"; the delimiter can be changed to any other value.
Input:
San;1;100
Ku;3;200
Nam;3;200
Spark Code:
val df = spark.read.format("csv").option("delimiter",";").load("test.dat")
df.printSchema()
df.show()
Output:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|San| 1|100|
| Ku| 3|200|
|Nam| 3|200|
+---+---+---+
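
For reference, the equivalent read in PySpark (a minimal sketch, assuming the same test.dat file):

df = spark.read.format("csv").option("delimiter", ";").load("test.dat")
df.printSchema()
df.show()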

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't write an array-type column directly to a CSV, so I used the explode function to pull the fields I need out into columns. But when writing the DataFrame to CSV I get an error from the explode calls; from what I understand it's not possible to do this with two explodes in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("sample")
         .getOrCreate())

df = (spark.read.option("multiline", "true")
      .json("data/origin/crops.json"))

# This is the part that fails:
df2 = (df.select(explode('history').alias('history'),
                 explode('trial').alias('trial'))
         .select('history.started_at', 'history.finished_at',
                 col('id'), 'trial.is_trial', 'trial.ws10_max'))

(df2.write.format('com.databricks.spark.csv')
 .mode('overwrite')
 .option("header", "true")
 .save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this:

+----------+-----------+--------+--------+
|started_at|finished_at|is_trial|ws10_max|
+----------+-----------+--------+--------+
|First     |row        |row     |        |
|Second    |row        |row     |        |
+----------+-----------+--------+--------+
Thank you!
Use explode on the array column and select("struct.*") on the struct column:

df2 = (df.select('trial', 'id', explode('history').alias('history'))
       .select('id', 'history.*', 'trial.*'))
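
From there, the write from the question should work; for example (a sketch using the built-in csv source instead of com.databricks.spark.csv, assuming the same output path):

(df2.write.format('csv')
 .mode('overwrite')
 .option("header", "true")
 .save('data/output/'))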

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON records:
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
The following code produces the DataFrame below (I'm running the code on Databricks):

df = (spark.read
      .format("csv")
      .schema(mySchema)
      .load(dataPath))
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is because the JSON is nested, and I'm trying to convert to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
The schema is inferred by default, or you can supply your own; in your case, set multiLine to False in PySpark.
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
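For example, an explicit schema for the constructor records could be defined like this (a sketch; field names taken from the sample JSON above):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# schema matching the sample records
someschema = StructType([
    StructField("constructorId", LongType(), True),
    StructField("constructorRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True),
])
someDF = spark.read.json(somepath, someschema, multiLine=False)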
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef| name|nationality| url|
+-------------+--------------+----------+-----------+--------------------+
| 1| mclaren| McLaren| British|http://en.wikiped...|
| 2| bmw_sauber|BMW Sauber| German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2
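
Since the goal is CSV output, the flattened DataFrame can then be written directly (a sketch; the output path is an assumption):

(df.write
 .mode("overwrite")
 .option("header", "true")
 .csv("/FileStore/tables/constructors_csv"))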

pyspark save json handling nulls for struct

I'm using PySpark with Spark 2.4 and Python 3. When writing the DataFrame as a JSON file, if a struct column is null I want it written as {}, and if a struct field is null I want it written as "". For example:
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- child1: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
|-- child2: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
>>> df.show()
+---+------------+------------+
| id| child1| child2|
+---+------------+------------+
|123|[John, Matt]|[Paul, Matt]|
|111|[Jack, null]| null|
|101| null| null|
+---+------------+------------+
df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')
Result:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""}}
{"id":"111"}
Output Required:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""},"child2": {}}
{"id":"111","child1":{},"child2": {}}
I tried some maps and UDFs but was not able to achieve what I need. Appreciate your help here.
Spark 3.x
If you set the option ignoreNullFields to False (in Spark 3.x it defaults to true), you will get output like the following. It's not exactly an empty struct as you requested, but the schema is still correct.
df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}
Spark 2.x
Since the option above does not exist in Spark 2.x, there is a "dirty fix": mimic the JSON structure yourself by wrapping the row in a struct, which bypasses the null check. Again, the result is not exactly what you're asking for, but the schema is correct.
import pyspark.sql.functions as F

(df
 .select(F.struct(
     F.col('id'),
     # coalesce replaces a null struct with a struct of null fields
     F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
     F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
 ).alias('json'))
 .coalesce(1).write.mode("overwrite").format('json').save('/home/test')
)
{"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
{"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
{"json":{"id":"101","child1":{},"child2":{}}}

How to modify a dataframe in-place so that its ArrayType column can't be null (nullable = false and containsNull = false)?

Take the following example dataframe:
val df = Seq(Seq("xxx")).toDF("a")
Schema:
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
How can I modify df in-place so that the resulting dataframe is not nullable anywhere, i.e. has the following schema:
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
I understand that I can re-create the dataframe while enforcing a non-nullable schema, as in "Change nullable property of column in spark dataframe":
spark.createDataFrame(df.rdd, StructType(StructField("a", ArrayType(StringType, false), false) :: Nil))
But this is not an option under structured streaming, so I want it to be some kind of in-place modification.
So the way to achieve this is with a UserDefinedFunction
// Problem setup
val df = Seq(Seq("xxx")).toDF("a")
df.printSchema
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
Onto the solution:
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.functions.{udf, col}
// We define a sub schema with the appropriate data type and null condition
val subSchema = ArrayType(StringType, containsNull = false)
// We create a UDF that applies this sub schema
// while specifying the output of the UDF to be non-nullable
val applyNonNullableSchemaUdf = udf((x:Seq[String]) => x, subSchema).asNonNullable
// We apply the UDF
val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))
And there you have it.
// Check new schema
newSchemaDF.printSchema
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
// Check that it actually works
newSchemaDF.show
+-----+
| a|
+-----+
|[xxx]|
+-----+

Convert spark dataframe with string column to StructType column

I have a CSV file with a header as "message" and rows as
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
I want to convert them in different columns a,b,c.
I tried the following code:
df1 = (spark.read.format("csv")
       .option("header", "true")
       .option("delimiter", "^")
       .option("inferSchema", "false")
       .load("testing.csv"))

But it reads the whole line as a single string column.

df1.printSchema()  # --> string
Your file is actually in JSON format, with the first line being the header "message". That first line can be dropped by setting the mode option to DROPMALFORMED when reading with Spark's DataFrameReader.
file : json-test.txt
message
{"a":1,"b":"hello 1","c":"1234"}
{"a":2,"b":"hello 2","c":"2345"}
Reading the JSON file while ignoring bad records (the initial header record):
val jsondf = spark.read
.option("multiLine", false)
.option("mode", "DROPMALFORMED")
.json("files/file-reader-test/json-test.txt")
jsondf.show()
output:
+---+-------+----+
| a| b| c|
+---+-------+----+
| 1|hello 1|1234|
| 2|hello 2|2345|
+---+-------+----+
Schema:
jsondf.printSchema()
root
|-- a: long (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
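
Since the question uses PySpark, the equivalent read there would be (a minimal sketch, assuming the same file path):

jsondf = (spark.read
          .option("multiLine", "false")
          .option("mode", "DROPMALFORMED")
          .json("files/file-reader-test/json-test.txt"))
jsondf.show()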
