How to write just the `row` value of a DataFrame to a file in spark? - apache-spark

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
| raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?

The reason this happens is that when you write JSON you are writing the whole DataFrame, in which the column name is raw_event.
Your first option is to simply write it as text:
df.write.text(filename)
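Applied to the DataFrame in the question, a minimal sketch would be (the output directory name is hypothetical):
// selddf has a single string column, which is what write.text expects
selddf.select("raw_event").write.text("/data/test_text")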
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert the string into a proper DataFrame, select the elements (the content of the column, which includes all members of the JSON) and only then save it:
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", ArrayType(IntegerType)),
  StructField("c", StructType(Seq(
    StructField("d", IntegerType),
    StructField("e", IntegerType))))))

df.withColumn("jsonData", from_json($"raw_event", schema))
  .select("jsonData.*")
  .write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which would result in null) and therefore add a filter to remove them.
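For example, a minimal sketch of that filter, reusing the imports, df and schema defined above (the output path is hypothetical):
// from_json yields null for rows that fail to parse, so drop them before writing
df.withColumn("jsonData", from_json($"raw_event", schema))
  .filter($"jsonData".isNotNull)
  .select("jsonData.*")
  .write.json("bla_filtered.json")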
Note that in both cases you don't get escaping for the ". If you want that, you would need to use the first option and first apply a UDF which adds the escaping.
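A minimal sketch of such a UDF, applied to the asker's selddf (the helper name and output path are hypothetical):
import spark.implicits._
import org.apache.spark.sql.functions.udf

// Hypothetical helper: put a backslash before every double quote
val addEscaping = udf((s: String) => if (s == null) null else s.replace("\"", "\\\""))

selddf
  .select(addEscaping($"raw_event").as("raw_event"))
  .write.text("/data/test_escaped")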

Related

How to remove special characters from dataframe using udf function

I am a learner in Spark SQL. Could anyone please help with the scenario below?
package name: sparksql, class name: custommethod, method name: removespecialchar
1. Create a custom method in Scala which takes one string as an argument and returns one string. The method has to remove the numbers 0 to 9 and the special characters - ? , / _ ( ) [ ] from one column of the dataframe using the replaceAll function.
input: windows-X64 (os system)
output: windows x os system
2. I have a dataframe called df1 with 6 columns inside another class called sparksql2.
3. Import the package, instantiate the custommethod method inside the sparksql2 class, and register the method generated in the above step as a UDF for invoking it from a Spark SQL dataframe.
4. Call the above UDF in the DSL by passing a single column name as an argument to get the special characters removed from the dataframe, and save the result as JSON to an HDFS location.
You don't need UDFs for that; you can just use plain Spark and define it in a function with regexp_replace.
Take this example:
import org.apache.spark.sql.{SparkSession,DataFrame}
import org.apache.spark.sql.functions.regexp_replace
def removeFromColumn(spark: SparkSession, columnName: String, df: DataFrame) =
  df.select(regexp_replace(
    df(columnName),
    "[0-9]|\\[|\\]|\\-|\\?|\\(|\\)|\\,|_|/",
    ""
  ).as(columnName))
With this you can use it on a DataFrame without the trouble of registering a UDF:
import spark.implicits._

val df = Seq("2res012-?,/_()[]ult").toDF("columnName")
removeFromColumn(spark, "columnName", df).show()
Output:
+----------+
|columnName|
+----------+
| result|
+----------+
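Step 4 of the question also asks to save the result as JSON to an HDFS location; a minimal sketch (the path is hypothetical):
removeFromColumn(spark, "columnName", df)
  .write.mode("overwrite")
  .json("hdfs:///user/learner/cleaned_output")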

Avoid losing data type for the partitioned data when writing from Spark

I have a dataframe like below.
itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0
I would like to save this dataframe as partitioned parquet file:
df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)
For this dataframe, when I read the data back, itemCategory will have String as its data type.
However, at times I have dataframes from other tenants as below.
itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0
In this case, after being written with the partitioning and read back, the resulting dataframe will have Int as the data type of itemCategory.
Parquet file has the metadata that describe the data type. How can I specify the data type for the partition so it will be read back as String instead of Int?
If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", spark will infer all partition columns as Strings.
In spark 2.0 or greater, you can set like this:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
In 1.6, like this:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The downside is you have to do this each time you read the data, but at least it works.
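For example, a sketch of reading the data back with the setting applied (reusing the path from the question):
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val readBack = spark.read.parquet(path)
readBack.printSchema()
// itemCategory now comes back as string instead of int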
As you partition by the itemCategory column, this data will be stored in the file structure and not in the actual parquet files. Spark infers the data type depending on the values; if all values are integers, the column type will be int.
One simple solution would be to cast the column to StringType after reading the data:
import spark.implicits._
import org.apache.spark.sql.types.StringType

df.withColumn("itemCategory", $"itemCategory".cast(StringType))
Another option would be to duplicate the column itself. Then one of the columns will be used for the partitioning and, hence, be saved in the file structure. However, the other duplicated column would be saved normally in the parquet file. To make a duplicate simply use:
df.withColumn("itemCategoryCopy", $"itemCategory")
Read it with a schema:
import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema()
// will print
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)
See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u

Spark sql how to execute sql command in a loop for every record in input DataFrame

Spark sql how to execute sql command in a loop for every record in input DataFrame
I have a DataFrame with following schema
%> input.printSchema
root
|-- _c0: string (nullable = true)
|-- id: string (nullable = true)
I have another DataFrame on which I need to execute a SQL command:
val testtable = testDf.registerTempTable("mytable")
%>testDf.printSchema
root
|-- _1: integer (nullable = true)
sqlContext.sql(s"SELECT * from mytable WHERE _1=$id").show()
$id should come from the input DataFrame, and the SQL command should execute for all ids in the input table.
Assuming you can work with a single new DataFrame containing all the rows present in testDf that match the values in the id column of input, you can do an inner join operation, as stated by Alberto:
val result = input.join(testDf, input("id") === testDf("_1"))
result.show()
Now, if you want a new, different DataFrame for each distinct value present in testDf, the problem is considerably harder. If this is the case, I would suggest you make sure the data in your lookup table can be collected as a local list, so you can loop through its values and create a new DataFrame for each one, as you already thought (this is not recommended):
import org.apache.spark.sql.{DataFrame, Row}

val localArray: Array[String] = input.map { case Row(_, id: String) => id }.collect
val result: Array[DataFrame] = localArray.map { i =>
  testDf.where(testDf("_1") === i)
}
Anyway, unless the lookup table is very small, I suggest that you adapt your logic to work with the single joined DataFrame of my first example.

How to pass whole Row to UDF - Spark DataFrame filter

I'm writing filter function for complex JSON dataset with lot's of inner structures. Passing individual columns is too cumbersome.
So I declared the following UDF:
val records: DataFrame = sqlContext.jsonFile("...")
def myFilterFunction(r: Row): Boolean = ???
sqlc.udf.register("myFilter", (r: Row) => myFilterFunction(r))
Intuitively I'm thinking it will work like this:
records.filter("myFilter(*)=true")
What is the actual syntax?
You have to use the struct() function to construct the row when making the call to the function; follow these steps.
Import Row,
import org.apache.spark.sql._
Define the UDF
def myFilterFunction(r:Row) = {r.get(0)==r.get(1)}
Register the UDF
sqlContext.udf.register("myFilterFunction", myFilterFunction _)
Create the dataFrame
val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")
Use the UDF
records.filter(callUdf("myFilterFunction",struct($"text",$"text2"))).show
When you want all columns to be passed to the UDF:
records.filter(callUdf("myFilterFunction",struct(records.columns.map(records(_)) : _*))).show
Result:
+------+------+
| text| text2|
+------+------+
|sachin|sachin|
+------+------+
scala> inputDF
res40: org.apache.spark.sql.DataFrame = [email: string, first_name: string ... 3 more fields]
scala> inputDF.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
Now, I would like to filter the rows based on the gender field. I can accomplish that by using .filter($"gender" === "Male"), but I would like to do it with .filter(function).
So I defined my anonymous functions:
val isMaleRow = (r: Row) => { r.getAs[String]("gender") == "Male" }
val isFemaleRow = (r: Row) => { r.getAs[String]("gender") == "Female" }
inputDF.filter(isMaleRow).show()
inputDF.filter(isFemaleRow).show()
I feel the requirement can be done in a better way, i.e. without declaring a UDF and invoking it.
In addition to the first answer: when we want all columns to be passed to the UDF, we can use
struct("*")
If you want to take an action over the whole row and process it in a distributed way, take the row from the DataFrame, send it to a function as a struct and then convert it to a dictionary to execute the specific action. It is very important to call the collect method on the final DataFrame, because Spark is lazily evaluated and won't process the full data unless you tell it to explicitly.
In my case I needed to send each row of a DataFrame to an index as a dictionary object:
1. Import the libraries.
2. Declare the UDF; the lambda must receive the row structure.
3. Execute the specific function, in this case sending a dictionary (the row structure converted to a dict) to the index.
4. Have the source DataFrame call the withColumn method, which tells Spark to execute this for each row before the call to collect; this allows the function to run in a distributed way. Don't forget to assign the result to another DataFrame variable.
5. Call the collect method to run the process and distribute the function.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
myUdf = udf(lambda row: sendToES(row.asDict()), IntegerType())
dfWithControlCol = df.withColumn("control_col", myUdf(struct([df[x] for x in df.columns])))
dfWithControlCol.collect()

How to read a nested collection in Spark

I have a parquet table with one of the columns being
array<struct<col1,col2,..colN>>
I can run queries against this table in Hive using LATERAL VIEW syntax.
How do I read this table into an RDD, and more importantly, how do I filter, map etc. this nested collection in Spark?
I could not find any references to this in the Spark documentation. Thanks in advance for any information!
ps. I felt might be helpful to give some stats on the table.
Number of columns in main table ~600. Number of rows ~200m.
Number of "columns" in nested collection ~10. Avg number of records in nested collection ~35.
There is no magic in the case of a nested collection. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way.
Reading such a nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
Now that you have an RDD[Outer], you can apply any wanted transformation or action.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the spark-SQL library only to read the parquet file. You could for example select only the wanted columns directly on the DataFrame, before mapping it to a RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
  case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
    case List(a: String, b: String) => (a, b)
  }).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
The answers above are all great and tackle this question from different sides; Spark SQL is also quite a useful way to access nested data.
Here's an example of how to use explode() in SQL directly to query a nested collection.
SELECT hholdid, tsp.person_seq_no
FROM ( SELECT hholdid, explode(tsp_ids) as tsp
FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested array of structs, which has many attributes, including person_seq_no which I'm selecting in the outer query above.
The above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options for dealing with nested structures.
Also have a look at the following JIRAs for a Hive-compatible way to query nested data using LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty but you still want to have attributes from the parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
A notable unresolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields
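For completeness, a hedged sketch of the OUTER variant mentioned above: since Spark 2.2 the DataFrame API exposes it as explode_outer (the dataFrame below is the one from the earlier Scala example):
import org.apache.spark.sql.functions.{col, explode_outer}

// Keeps parent rows even when the nested collection is empty or null
dataFrame.select(col("key"), explode_outer(col("inners")).as("inner")).show()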
