Avoid losing data type for the partitioned data when writing from Spark

I have a dataframe like below.
itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0
I would like to save this dataframe as a partitioned parquet file:
df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)
For this dataframe, when I read the data back, itemCategory will have the String data type.
However, at times I have dataframes from other tenants like the one below.
itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0
In this case, after the data is written out partitioned and read back, the resulting dataframe will have Int as the data type of itemCategory.
Parquet files carry metadata that describes the data type. How can I specify the data type for the partition column so it is read back as String instead of Int?

If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", spark will infer all partition columns as Strings.
In spark 2.0 or greater, you can set like this:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
In 1.6, like this:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The downside is you have to do this each time you read the data, but at least it works.
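For example, a minimal round trip might look like this (a sketch assuming spark is an active SparkSession and path points at the partitioned output from the question):
// Disable partition column type inference before reading
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
// itemCategory now comes back as a string even when all of its values look numeric
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)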

As you partition by the itemCategory column, its values are stored in the directory structure and not in the actual parquet files. Spark infers the datatype from the values; if all values are integers, the column type will be int.
One simple solution would be to cast the column to StringType after reading the data:
import org.apache.spark.sql.types.StringType
import spark.implicits._
df.withColumn("itemCategory", $"itemCategory".cast(StringType))
Another option would be to duplicate the column itself. Then one of the columns will be used for the partitioning and, hence, be stored in the directory structure, while the other copy is saved normally in the parquet file with its original type. To make a duplicate, simply use:
df.withColumn("itemCategoryCopy", $"itemCategory")

Read it with a schema:
import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema()
// will print
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)
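For comparison, reading the same path without supplying the schema lets partition type inference kick in (with default settings), and itemCategory would come back as an integer here:
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: integer (nullable = true)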
See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u

Related

Spark infer schema with limit during a read.csv

I'd like to infer a Spark DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD always seems to equal the total number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a Dataset[String].
The csv method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
import spark.implicits._
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema", "true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
  StructType(
    StructField(_c0,IntegerType,true),
    StructField(_c1,IntegerType,true),
    StructField(_c2,StringType,true)
  )
Which is the correct Schema for my limited input data set.
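To actually read the sample from files rather than a hard-coded list, one possible approach (the path here is just illustrative, and csv accepting a Dataset[String] assumes Spark 2.2+) is to take only the first rows as a Dataset[String]:
// Take the first 100 raw lines and infer the schema from those only
val sampleDS = spark.read.textFile("/data/input/csv").limit(100)
val sampleDF = spark.read.option("inferSchema", "true").csv(sampleDS)
sampleDF.schema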
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}

def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}
Unlike GMc's answer above, this approach directly infers the schema the same way DataFrameReader.csv() does behind the scenes (but without building an additional Dataset with that schema only to retrieve the schema back from it).
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When trying to retrieve samples from data, Spark has only two kinds of methods:
1. Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
2. Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough).
Since you mentioned you want to use a specific small number of rows and want to avoid touching all the data, I provided a solution based on option 2.
PS: The DataFrameReader.textFile method accepts paths to files and folders, and it also has a varargs variant, so you could pass in one or more files or folders.
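For example, a possible way to call it (the file location and sample size are illustrative, and spark is assumed to be an active SparkSession):
// Infer the schema from the first 100 lines only, then read the full data with it
val inferredSchema = inferSchemaFromSample(spark, "/data/input/csv", 100, isFirstRowHeader = true)
val df = spark.read
  .option("header", "true")
  .schema(inferredSchema)
  .csv("/data/input/csv")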

Expand an array to columns in spark with structured streaming

I have this problem:
I'm reading the data from Kafka using Structured Streaming; the data are CSV rows. When I get the data from Kafka, I have a streaming dataframe where the CSV row is inside "value" as a byte sequence.
sDF2 = sDF.selectExpr("CAST(value as string)").select( split("value",","))
Using this I get a new dataframe where "value" is a string containing the CSV row.
How can I get a new dataframe where I have parsed and split the CSV fields into dataframe columns?
Example:
csv row is "abcd,123,frgh,1321"
The schema of sDF, which contains the data read from Kafka, is
key, value, topic, timestamp, etc., where value is a byte sequence with no type.
sDF2.schema has only one column (named value, of type string).
I'd like the new dataframe to be
sDF3.col1 = abcd
sDF3.col2 = 123
sDF3.col3 = frgh ...etc
where all the columns are String.
I could still do this:
sDF3 = sDF2.select( sDF2.csv[0].alias("EventId").cast("string"),
sDF2.csv[1].alias("DOEntitlementId").cast("string"),
sDF2.csv[3].alias("AmazonSubscriptionId").cast("string"),
sDF2.csv[4].alias("AmazonPlanId").cast("string"),
... etc ...
but it looks ugly.
I have not tried it, but something like this should work.
sDF2 = (
    sDF.selectExpr("CAST(value AS STRING) AS csv")
       .selectExpr(
           "split(csv, ',')[0] AS DOEntitlementId",
           "split(csv, ',')[1] AS AmazonSubscriptionId",
           "split(csv, ',')[2] AS AmazonPlanId"))

How to handle null values when writing to parquet from Spark

Until recently, Parquet did not support null values - a questionable premise. In fact, a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However, it will be a long time before Spark supports that new Parquet feature, if ever. Here is the associated (closed, will-not-fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regard to null column values today when writing out dataframes to parquet? I can only think of very ugly, horrible hacks like writing empty strings and... well... I have no idea what to do with numerical values to indicate null, short of putting some sentinel value in and having my code check for it (which is inconvenient and bug-prone).
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.
The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
As per the comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
and the result can be safely written to Parquet.
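For instance, a quick round trip (the output path is illustrative) confirms the column keeps its declared type:
// Write the typed null column to Parquet and read it back
spark.sql("SELECT CAST(null AS DOUBLE) AS comments").write.mode("overwrite").parquet("/tmp/comments_parquet")
spark.read.parquet("/tmp/comments_parquet").printSchema()
// root
//  |-- comments: double (nullable = true)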
I wrote a PySpark solution for this (df is a dataframe with columns of NullType):
# get dataframe schema
my_schema = list(df.schema)
null_cols = []

# iterate over schema list to filter for NullType columns
for st in my_schema:
    if str(st.dataType) == 'NullType':
        null_cols.append(st)

# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df.withColumn(mycolname, df[mycolname].cast('string'))

How to write just the `row` value of a DataFrame to a file in spark?

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
| raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?
The reason this happens is that when you write JSON you are writing the whole dataframe, so the column name raw_event wraps each value.
Your first option is to simply write it as text:
df.write.text(filename)
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert this to a proper dataframe, select the elements (the parsed columns, which include all members of the JSON), and only then save it:
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")
val schema = StructType(Seq(StructField("a", StringType), StructField("b", ArrayType(IntegerType)), StructField("c", StructType(Seq(StructField("d", IntegerType), StructField("e", IntegerType))))))
df.withColumn("jsonData", from_json($"raw_event", schema)).select("jsonData.*").write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which would result in nulls) and therefore add a filter to remove them.
Note that in both cases you don't get the escaping for the " characters. If you want that, you would need to use the first option and first apply a UDF that adds the escaping (a sketch follows below).
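As a rough sketch of that last point (the escaping helper and the output path are not part of the original answer, just an illustration):
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical helper that escapes double quotes before writing the raw strings as text
val escapeQuotes = udf((s: String) => if (s == null) null else s.replace("\"", "\\\""))
df.select(escapeQuotes(col("raw_event")).as("raw_event")).write.text("/data/test_escaped")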

Spark sql how to execute sql command in a loop for every record in input DataFrame

I have a DataFrame with the following schema:
%> input.printSchema
root
|-- _c0: string (nullable = true)
|-- id: string (nullable = true)
I have another DataFrame on which I need to execute a SQL command:
val testtable = testDf.registerTempTable("mytable")
%>testDf.printSchema
root
|-- _1: integer (nullable = true)
sqlContext.sql(s"SELECT * from mytable WHERE _1=$id").show()
$id should come from the input DataFrame, and the SQL command should execute for every id in the input table.
Assuming you can work with a single new DataFrame containing all the rows in testDf that match the values in the id column of input, you can do an inner join operation, as stated by Alberto:
val result = input.join(testDf, input("id") === testDf("_1"))
result.show()
Now, if you want a new, different DataFrame for each distinct value present in testDf, the problem is considerably harder. If this is the case, I would suggest you make sure the data in your lookup table can be collected as a local list, so you can loop through its values and create a new DataFrame for each one as you already planned (this is not recommended):
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._
// id is a string in the input schema, so convert it before filtering on the integer column _1
val localArray: Array[Int] = input.map { case Row(_, id: String) => id.toInt }.collect
val result: Array[DataFrame] = localArray.map { i =>
  testDf.where(testDf("_1") === i)
}
Anyway, unless the lookup table is very small, I suggest that you adapt your logic to work with the single joined DataFrame of my first example.
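One detail worth noting (not part of the original answers): in the question's schemas, input.id is a string while testDf._1 is an integer, so you may want to cast the key explicitly so the join comparison happens on integers. A minimal sketch:
// Align the key types before joining: id is a string, _1 is an integer
val result = input.join(testDf, input("id").cast("int") === testDf("_1"))
result.show()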
