spark read orc with specific columns - apache-spark

I have an ORC file; when read with the option below, it reads all the columns.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
But I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?

Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")

Spark has a lazy execution model, so you can chain any data transformations in your code without any immediate effect. The real work only starts once an action is called, and Spark is smart enough not to do extra work.
So you can write it like this:
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._

val inDF: DataFrame = spark.read.orc("/some/path/")
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// the real work starts only after this action
val result: Array[Row] = filteredDF.collect()
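If you want to confirm that only those two columns are actually read from the ORC files, you can look at the physical plan (continuing the snippet above; the exact output format depends on your Spark version):
// With column pruning, the ORC scan in the physical plan should only
// request the selected columns (look at the ReadSchema of the scan node).
filteredDF.explain(true)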

Related

Do Parquet files preserve column order of Spark DataFrames?

Does creating a Spark DataFrame and saving it in Parquet format guarantee that the order of columns in the parquet file will be preserved?
Ex) A Spark DataFrame is created with columns A, B, C, and saved as Parquet. When the Parquet files are read, will the column order always be A, B, C?
I've noticed that if I save a Spark DataFrame, and then read the parquet files, the column order is preserved:
df.select("A", "B", "C").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
Then, if I save by selecting a different order of columns, and then read the parquet files, I can see the order is also what I expect:
df.select("C", "B", "A").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- C: string (nullable = true)
|-- B: string (nullable = true)
|-- A: string (nullable = true)
However, I can't seem to find any documentation supporting this, and the comments on this post, "Is there a possibility to keep column order when reading parquet?", contain conflicting information.
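A quick way to check it on your own data (purely empirical, not a documented guarantee; the output path below is just a placeholder) is to compare the column lists after a round trip:
// Empirical check only: write in a chosen column order, read it back, compare.
val ordered = df.select("C", "B", "A")
ordered.write.mode("overwrite").parquet("/tmp/column_order_check")

val reread = spark.read.parquet("/tmp/column_order_check")
assert(reread.columns.toSeq == ordered.columns.toSeq,
  s"column order changed: ${reread.columns.mkString(", ")}")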

How to write result of streaming query to multiple database tables?

I am using Spark Structured Streaming and reading from a Kafka topic. The goal is to write the messages to multiple tables in a PostgreSQL database.
The message schema is:
root
|-- id: string (nullable = true)
|-- name: timestamp (nullable = true)
|-- comment: string (nullable = true)
|-- map_key_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Writing to one table (after dropping map_key_value) works with the code below:
message.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.write.format("jdbc")
    .option("url", "url")
    .option("user", "username")
    .option("password", "password")
    .option(JDBCOptions.JDBC_TABLE_NAME, "table_1")
    .mode(SaveMode.Append).save()
}.outputMode(OutputMode.Append()).start().awaitTermination()
I want to write the message to two DB tables: table_1 (id, name, comment), while table_2 needs to have the map_key_value.
You will need N streaming queries for N sinks; table_1 and table_2 each count as a separate sink.
writeStream does not currently write to JDBC, so you should use the foreachBatch operator.
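Inside a single foreachBatch you can reuse the micro-batch DataFrame and write it to both tables. A rough sketch follows; the JDBC URL is a placeholder, and the choice to explode the map into key/value rows for table_2 is an assumption (JDBC cannot store a MapType column directly):
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

message.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Cache the micro-batch so it is not recomputed for the second write
  batchDF.persist()

  // table_1 gets the flat columns
  batchDF.select("id", "name", "comment")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
    .option("user", "username")
    .option("password", "password")
    .option("dbtable", "table_1")
    .mode(SaveMode.Append)
    .save()

  // table_2 keeps the map content, here exploded into one row per key/value pair
  batchDF.select($"id", explode($"map_key_value").as(Seq("key", "value")))
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
    .option("user", "username")
    .option("password", "password")
    .option("dbtable", "table_2")
    .mode(SaveMode.Append)
    .save()

  batchDF.unpersist()
}.outputMode(OutputMode.Append()).start().awaitTermination()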

Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.
spark.read.format('orc').load('hive managed table path')
When I do a printSchema on the fetched DataFrame, it is as follows:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the DataFrame. When applying an action like show(), I get an error saying:
java.lang.IllegalArgumentException: Include vector the wrong length
Did someone face the same issue? If yes, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID table, but Spark doesn't support this yet.
Maybe you can export your Hive table to normal ORC files and then read them with Spark, or try using alternatives like Hive JDBC as described here.
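If you go the Hive JDBC route, a very rough sketch looks like the following (host, port, database, table and credentials are placeholders, and the Hive JDBC driver has to be on the Spark classpath):
// Sketch: read the ACID table through HiveServer2 instead of reading its ORC files directly
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "your_managed_table")
  .option("user", "hive_user")
  .option("password", "hive_password")
  .load()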
As I am not sure about your versions, you can try other ways to load the ORC file.
Using SQLContext:
val df = sqlContext.read.format("orc").load(orcfile)
Or:
val df = spark.read.option("inferSchema", true).orc("filepath")
Or Spark SQL (recommended):
import spark.sql
sql("SELECT * FROM table_name").show()

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext,
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read the data using read.json(rdd) and get a Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details in Programmatically specifying the schema in the Spark SQL Documentation.
For example, in your specific case it would be:
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column, legal_name, being a map.
When you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your DataFrame from your JSON dataset with the desired schema.
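Putting it together, a minimal sketch (the column and field types follow your example; rdd is assumed to be an RDD[String] with one JSON document per line):
import org.apache.spark.sql.types._

// legal_name declared explicitly as map<string,string> instead of a struct
val schema = StructType(List(
  StructField("legal_name", MapType(StringType, StringType, valueContainsNull = true))
))

val df = sqlContext.read.schema(schema).json(rdd)
df.printSchema()
// root
//  |-- legal_name: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)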

Multiple aggregations on nested structure in a single Spark statement

I have a JSON structure like this:
{
  "a": 5,
  "b": 10,
  "c": {
    "c1": 3,
    "c4": 5
  }
}
I have a DataFrame created from this structure with several million rows. What I need are aggregations on several keys, like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Am I just missing the syntax? Or is there a different way to do it? Most importantly, Spark should only scan the data once for all the aggregations.
It is possible, but your JSON must be on one line: each line is treated as a separate JSON object.
import org.apache.spark.sql.functions.{count, sum}
import sqlContext.implicits._

val json = sc.parallelize(
  "{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
This works fine; note that the JSON is formatted to be on one line.
jsons.printSchema() prints:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)
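Since all the aggregates go into a single agg call, they are computed in one scan of the data; you can keep adding expressions to the same statement, for example (just an illustration on the sample DataFrame above):
import org.apache.spark.sql.functions.{avg, count, sum}

// Every aggregate below is evaluated in the same single pass over jsons
jsons.agg(
  count($"b")  as "cntB",
  sum($"c.c4") as "sumC",
  avg($"a")    as "avgA",
  sum($"c.c1") as "sumC1"
).show()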
