Processing json strings in a Spark dataframe column - apache-spark

When I look for ways to parse JSON within a string column of a dataframe, I keep running into results that simply show how to read JSON file sources. My source is actually a Hive ORC table, and one of its columns contains strings in a JSON-like format. I'd really like to convert that into something parsed, like a map.
I'm having trouble finding a way to do this:
import java.util.Date
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSON
val items = sql("select * from db.items limit 10")
//items.printSchema
val internal = items.map {
  case Row(externalid: Long, itemid: Long, locale: String,
           internalitem: String, version: Long,
           createdat: Date, modifiedat: Date)
    => JSON.parseFull(internalitem)
}
I thought this should work, but maybe there's a more Spark-idiomatic way of doing it, because I get the following error:
java.lang.ClassNotFoundException: scala.Any
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
Specifically, my input data looks approximately like this:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{'name':'thing','attr':{
21:{'attrname':'size','attrval':'big'},
42:{'attrname':'color','attrval':'red'}
}}", 1, 2017-05-05…, 2017-05-06…
Yes, it's not exactly RFC 7158 JSON.
The attr keys can be 5 to 30 of any of 80,000 values, so I wanted to get to something like this instead:
externalid, itemid, locale, internalitem, version, createdat, modifiedat
123, 321, "en_us", "{"name":"thing","attr":[
  {"id":21,"attrname":"size","attrval":"big"},
  {"id":42,"attrname":"color","attrval":"red"}
]}", 1, 2017-05-05…, 2017-05-06…
Then flatten the internalitem to fields and explode the attr array:
externalid, itemid, locale, name, attrid, attrname, attrval, version, createdat, modifiedat
123, 321, "en_us", "thing", 21, "size", "big", 1, 2017-05-05…, 2017-05-06…
123, 321, "en_us", "thing", 42, "color", "red", 1, 2017-05-05…, 2017-05-06…

I've never used such computations myself, but I have some advice for you:
Before doing any operation on columns yourself, check the sql.functions package, which contains a whole bunch of helpful functions for working with columns, like date extraction and formatting, string concatenation and splitting, ... and it also provides a couple of functions for working with JSON objects: from_json and json_tuple.
To use those methods you simply need to import them and call them inside a select method, like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val transformedDf = df.select($"externalid", $"itemid", … from_json($"internalitem", schema), $"version" …)
First of all, you have to create a schema for your JSON column and put it in the schema variable.
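As a rough sketch (not from the original answer), assuming the internalitem strings are normalized to the valid-JSON shape described in the question (a name field plus an attr array of id/attrname/attrval objects), the schema and the flattening select could look something like this; from_json needs Spark 2.1+ and the other column names are taken from the question:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Schema for the normalized internalitem payload from the question
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("attr", ArrayType(StructType(Seq(
    StructField("id", LongType),
    StructField("attrname", StringType),
    StructField("attrval", StringType)
  ))))
))

val flattened = items
  .withColumn("parsed", from_json(col("internalitem"), schema))
  .withColumn("attr", explode(col("parsed.attr")))
  .select(
    col("externalid"), col("itemid"), col("locale"),
    col("parsed.name").as("name"),
    col("attr.id").as("attrid"),
    col("attr.attrname").as("attrname"),
    col("attr.attrval").as("attrval"),
    col("version"), col("createdat"), col("modifiedat"))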
Hope it helps.

Related

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck.
%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING
Error I'm getting:
org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type
'IntegerType' to 'bam_user' with type 'StringType'
SQL doesn't support this, but it can be done in Python:
from pyspark.sql.functions import col
# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
    'column_1': 'int',
    'column_2': 'string',
    'column_3': 'double'
}
# load to dataframe, change types
df = spark.read.format('delta').load(table_path)
for column in types_to_change:
    df = df.withColumn(column, col(column).cast(types_to_change[column]))
# save df with new types overwriting the schema
df.write.format("delta").mode("overwrite").option("overwriteSchema",True).save("dbfs:" + table_path)
There is no option to change the data type of a column or drop a column in place. You can read the data into a dataframe, modify the data types with the help of withColumn() (and drop() for columns you no longer need), and overwrite the table, as in the sketch below.
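For reference, a minimal Scala sketch of that read / cast / overwrite flow (the table path is a placeholder, and bam_user is just the column named in the question's error message):
import org.apache.spark.sql.functions.col

val tablePath = "/mnt/delta/my_table"  // placeholder, not from the original answer

spark.read.format("delta").load(tablePath)
  .withColumn("bam_user", col("bam_user").cast("string"))
  // .drop("some_column")  // optionally drop columns here (hypothetical name)
  .write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .save(tablePath)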
There is no real way to do this using SQL, unless you copy the data to a different table altogether. That option involves INSERTing the data into a new table, then DROPping and re-CREATEing the original with the new structure, and is therefore risky.
The way to do this in Python is as follows:
Let's say this is your table:
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
You can check the table structure using the following:
DESCRIBE TABLE person
If you need to change the id to String, this is the code:
%py
from pyspark.sql.functions import col
df = spark.read.table("person")
df1 = df.withColumn("id",col("id").cast("string"))
(df1.write
    .format("parquet")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("person"))
A couple of pointers: the format in this table is Parquet, which is the default for Databricks, so you can omit the "format" line (note that Python is very sensitive to whitespace).
Regarding Databricks:
If the format is "delta", you must specify it.
Also, if the table is partitioned, it's important to mention that in the code, for example:
(df1.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("col_to_partition1", "col_to_partition2")
    .option("overwriteSchema", "true")
    .save(table_location))
Here, table_location is the path where the Delta table is saved.
(some of this answer is based on this)
Suppose you want to change the data type of column "column_name" to "int" in table "delta_table_name":
spark.read.table("delta_table_name") \
    .withColumn("column_name", col("column_name").cast("int")) \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", True).saveAsTable("delta_table_name")
Read the table using Spark.
Use the withColumn method to cast the column you want.
Write the table back with mode "overwrite" and overwriteSchema set to True.
Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
from pyspark.sql import functions as F
spark.read.table("<TABLE NAME>") \
.withColumn("<COLUMN NAME> ",F.col("<COLUMN NAME>").cast("<DATA TYPE>")) \
.write.format("delta").mode("overwrite").option("overwriteSchema",True).saveAsTable("<TABLE NAME>")

Pyspark Streaming with Pandas UDF

I am new to Spark Streaming and pandas UDFs. I am working on a PySpark consumer from Kafka; the payload is in XML format, and I am trying to parse the incoming XML by applying a pandas UDF.
@pandas_udf("col1 string, col2 string", PandasUDFType.GROUPED_MAP)
def test_udf(df):
    import xmltodict
    from collections import MutableMapping
    xml_str = df.iloc[0, 0]
    df_col = ['col1', 'col2']
    doc = xmltodict.parse(xml_str, dict_constructor=dict)
    extract_needed_fields = {k: doc[k] for k in df_col}
    return pd.DataFrame([{'col1': 'abc', 'col2': 'def'}], index=[0], dtype="string")
data=df.selectExpr("CAST(value AS STRING) AS value")
data.groupby("value").apply(test_udf).writeStream.format("console").start()
I get the below error
File "pyarrow/array.pxi", line 859, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Is this the right approach? What am I doing wrong?
It looks like this is more of an undocumented limitation than a bug.
You cannot return any pandas dtype that is stored in an extension array with a method named __arrow_array__, because PySpark always specifies a mask. The "string" dtype you used is stored in a StringArray, which is such a case. After I converted the string dtype to object (i.e. not passing dtype="string" for the returned DataFrame), the error went away.
While converting a pandas dataframe to a PySpark one, I stumbled upon this error as well:
Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol
My pandas dataframe had datetime-like values that I tried to convert to "string". I initially used the astype("string") method, which looked like this:
df["time"] = (df["datetime"].dt.time).astype("string")
When I tried to get the info of this dataframe, it seemed like it was indeed converted to a string type:
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null string
But the error kept coming back.
Solution
To avoid it, I instead used the apply(str) method:
df["time"] = (df["datetime"].dt.time).apply(str)
which gave me an object dtype:
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null object
After that, the conversion was successful:
spark.createDataFrame(df)
# DataFrame[datetime: string, date: string, year: bigint, month: bigint, day: bigint, day_name: string, time: string, hour: bigint, minute: bigint]

Not able to convert from RDD to Dataset successfully at runtime [duplicate]

This question already has answers here:
How to convert a simple DataFrame to a DataSet Spark Scala with case class?
(2 answers)
Closed 4 years ago.
I am trying to run this small Spark program (Spark version 2.1.1):
val rdd = sc.parallelize(List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt")))
import spark.implicits._
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails] // Error Line
carDetails.map(car => {
  val name = if (car.name == "Tesla") "S" else car.name
  CarDetails(car.year, name, car.model)
}).collect().foreach(print)
It is throwing error on this line:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails]
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`year`' given input columns: [_1, _2, _3];
There is no compilation error!
I tried many changes, like using a List instead of an RDD. I also tried first converting to a DS and then calling as[CarDetails], but it didn't work. Now I am clueless.
Why is it taking the columns as _1, _2 and _3 when I have already given the case class?
case class CarDetails(year: Int, name: String, model: String)
I tried changing year from Int to Long in the case class. It still did not work.
Edit:
I changed this line after referring to the probable duplicate question, and it worked:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd)
.withColumnRenamed("_1","year")
.withColumnRenamed("_2","name")
.withColumnRenamed("_3","model")
.as[CarDetails]
But, I am still not clear as to why I need to rename the columns even after explicitly mapping to a case class.
The rules of as conversion are explained in detail in the API docs:
The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
To explain this with code: conversion from a case class to Tuple* is valid (fields are matched structurally):
Seq(CarDetails(2012, "Tesla", "S")).toDF.as[(Int, String, String)]
but conversion from Tuple* to an arbitrary case class is not (fields are matched by name). You have to rename the fields first (ditto):
Seq((2012, "Tesla", "S")).toDF("year", "name", "model").as[CarDetails]
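Applied to the asker's rdd of tuples, the select-with-alias route from the quoted docs is a sketch like the following (equivalent to the withColumnRenamed chain in the question's edit):
import spark.implicits._

val carDetails = spark.createDataset(rdd)
  .select($"_1".as("year"), $"_2".as("name"), $"_3".as("model"))
  .as[CarDetails]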
This has quite interesting practical implications:
A tuple-typed Dataset cannot contain extraneous fields:
case class CarDetailsWithColor(
year: Int, name: String, model: String, color: String)
Seq(
CarDetailsWithColor(2012, "Tesla", "S", "red")
).toDF.as[(Int, String, String)]
// org.apache.spark.sql.AnalysisException: Try to map struct<year:int,name:string,model:string,color:string> to Tuple3, but failed as the number of fields does not line up.;
While case class typed Dataset can:
Seq(
(2012, "Tesla", "S", "red")
).toDF("year", "name", "model", "color").as[CarDetails]
Of course, starting with case class typed variant would save you all the trouble:
sc.parallelize(Seq(CarDetails(2012, "Tesla", "S"))).toDS

How do I programmatically parse a fixed-width text file in PySpark?

This post does a great job of showing how to parse a fixed-width text file into a Spark dataframe with PySpark (pyspark parse text file).
I have several text files I want to parse, but they each have slightly different schemas. Rather than having to write out the same procedure for each one like the previous post suggests, I'd like to write a generic function that can parse a fixed-width text file given the widths and column names.
I'm pretty new to PySpark, so I'm not sure how to write a select statement where the number of columns and their types are variable.
Any help would be appreciated!
Say we have a text file like the one in the example thread:
00101292017you1234
00201302017 me5678
in "/tmp/sample.txt". And a dictionary containing for each file name, a list of columns and a list of width:
schema_dict = {
    "sample": {
        "columns": ["id", "date", "string", "integer"],
        "width": [3, 8, 3, 4]
    }
}
We can load the dataframes and split them into columns iteratively, using:
import numpy as np
input_path = "/tmp/"
df_dict = dict()
for file in schema_dict.keys():
    df = spark.read.text(input_path + file + ".txt")
    start_list = np.cumsum([1] + schema_dict[file]["width"]).tolist()[:-1]
    df_dict[file] = df.select(
        [
            df.value.substr(
                start_list[i],
                schema_dict[file]["width"][i]
            ).alias(schema_dict[file]["columns"][i])
            for i in range(len(start_list))
        ]
    )
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+

Defining DataFrame Schema for a table with 1500 columns in Spark

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for this type of table with more than 1500 columns? Is there any option other than hard-coding the column names along with the datatypes?
Using Case class
Using StructType.
The Spark version used is 1.4.
For this type of requirement, I'd suggest the case class approach to prepare a dataframe.
Yes, there are some limitations, like product arity, but we can overcome them.
For Scala versions < 2.11, you can do it like the example below:
Prepare a class which extends Product and overrides its methods, namely:
productArity(): Int — this returns the number of attributes. In the referenced example, it's 33.
productElement(n: Int): Any — given an index, this returns the attribute. As protection, we also include a default case, which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean — this is the last of the three functions, and it serves as a boundary condition when an equality check is being done against the class.
For an example implementation, you can refer to the Student case class, which has 33 fields in it, and the example student dataset description (see the sketch after this list for the general shape).
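Since the linked Student example isn't reproduced here, here is a hedged, minimal sketch of the pattern with just three made-up fields instead of 33:
// Illustration of a class extending Product, the workaround for the
// 22-field case class limit of Scala < 2.11 (field names/types are made up).
class Record(val id: Long, val name: String, val amount: Double)
  extends Product with Serializable {

  // Number of attributes (3 here; 33 in the linked Student example)
  override def productArity: Int = 3

  // Return the attribute at the given index, guarding with IndexOutOfBoundsException
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => amount
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // Boundary condition for equality checks against this class
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}
On Scala 2.11+ the 22-field case class limit is gone, but for something as wide as 1500 columns a programmatically built StructType is usually the more practical route.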
Another option:
Use StructType to define the schema and create the dataframe (if you don't want to use the spark-csv API).
The options for reading a table with 1500 columns:
1) Using a case class
A case class would not work, because it's limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use the StructType to define the schema and create the dataframe.
3) Using the spark-csv package
You can use the spark-csv package with .option("inferSchema", "true"), which will automatically infer the schema from the file, for example:
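(A sketch for Spark 1.4 with the external spark-csv package; the file path is a placeholder.)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // scan the file to infer column types
  .load("/path/to/input.csv")     // placeholder path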
You can keep a schema with hundreds of columns in JSON format, and then read this JSON file to construct your custom schema.
For example, your schema JSON could be:
[
  {
    "columnType": "VARCHAR",
    "columnName": "NAME",
    "nullable": true
  },
  {
    "columnType": "VARCHAR",
    "columnName": "AGE",
    "nullable": true
  },
  .
  .
  .
]
Now you can read the JSON and parse it into a case class (with field names matching the JSON keys) to build the StructType.
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the Oracle column-type strings in the JSON schema to Spark DataTypes.
val dataType = Map(
  "VARCHAR" -> StringType,
  "NUMERIC" -> LongType,
  "TIMESTAMP" -> TimestampType,
  .
  .
  .
)
// Assumes json4s for parse()/extract(), as the original parse/extract calls suggest,
// and that org.apache.spark.sql.types._ is imported (used by the dataType map above)
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

def parseJsonForSchema(jsonFilePath: String): StructType = {
  implicit val formats = DefaultFormats
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  val fields = parsedJson.extract[List[Field]]
  val schemaColumns = fields.map(field =>
    StructField(field.columnName, dataType(field.columnType), field.nullable))
  StructType(schemaColumns)
}
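A hedged usage sketch (the paths are placeholders, and the spark-csv reader is just one way to apply the resulting StructType):
val customSchema = parseJsonForSchema("/path/to/table_schema.json")  // placeholder path

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)             // use the constructed schema instead of inferring it
  .load("/path/to/table_dump.csv")  // placeholder path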
