Reading orc file of Hive managed tables in pyspark - apache-spark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.
spark.read.format('orc').load('hive managed table path')
When I call printSchema() on the resulting DataFrame, I get the following:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the DataFrame. When I apply an action such as show(), I get this error:
java.lang.IllegalArgumentException: Include vector the wrong length
Has anyone faced the same issue? If so, can you please suggest how to resolve it?

It's a known issue.
You get that error because you're trying to read a Hive ACID (transactional) table, and Spark still has no native support for that layout.
You could export the Hive table to plain ORC files and read those with Spark, or try an alternative such as Hive JDBC.
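One way to recognise this situation early is to check whether the DataFrame's leading columns match the ACID wrapper shown in the question's printSchema() output. The following is only a minimal sketch: the helper name looks_like_hive_acid is hypothetical, and the column list is copied from the question rather than from any official specification.

```python
# Wrapper columns that Hive ACID puts around each row, copied from the
# printSchema() output in the question.
ACID_WRAPPER_COLUMNS = [
    "operation",
    "originalTransaction",
    "bucket",
    "rowId",
    "currentTransaction",
    "row",
]


def looks_like_hive_acid(column_names):
    """Return True if the leading columns match the Hive ACID wrapper layout."""
    leading = list(column_names)[: len(ACID_WRAPPER_COLUMNS)]
    return leading == ACID_WRAPPER_COLUMNS


# Column names from the question's DataFrame (what df.columns would return).
question_columns = ACID_WRAPPER_COLUMNS + ["partition_by_column"]
print(looks_like_hive_acid(question_columns))  # True
print(looks_like_hive_acid(["col1", "col2"]))  # False
```

With a real DataFrame this would be called as looks_like_hive_acid(df.columns). A True result suggests the files belong to a transactional table and should be exported or read through Hive instead.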

I am not sure which versions you are using, but you can try other ways to load the ORC file.
Using SQLContext:
val df = sqlContext.read.format("orc").load(orcfile)
OR
val df = spark.read.option("inferSchema", true).orc("filepath")
OR Spark SQL (recommended):
import spark.sql
sql("SELECT * FROM table_name").show()

Related

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON fields
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
I'm running the code on Databricks. The following code produces the following DataFrame:
df = (spark.read
    .format("csv")
    .schema(mySchema)
    .load(dataPath)
)
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is because the JSON is nested, and I'm trying to convert to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
The schema is inferred by default, or you can supply your own. In your case, set multiLine to false in PySpark:
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef| name|nationality| url|
+-------------+--------------+----------+-----------+--------------------+
| 1| mclaren| McLaren| British|http://en.wikiped...|
| 2| bmw_sauber|BMW Sauber| German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2
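Since the question ultimately asks for CSV, the flat DataFrame can be written straight back out once it has been read as JSON. The sketch below assumes an existing SparkSession is passed in as spark; the function name json_lines_to_csv and the output directory are hypothetical.

```python
def json_lines_to_csv(spark, json_path, csv_dir):
    """Read line-delimited JSON and write it back out as CSV with a header row."""
    # multiLine=False treats every line of the input as one JSON object,
    # which matches the two sample records in the question.
    df = spark.read.json(json_path, multiLine=False)
    # Spark writes a directory of part files, not a single .csv file.
    df.write.mode("overwrite").option("header", "true").csv(csv_dir)
    return df
```

It could be called as json_lines_to_csv(spark, "/FileStore/tables/SOabc2.txt", "/FileStore/tables/constructors_csv"), where the second path is only a placeholder for wherever the CSV should go.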

pyspark with hive - can't properly create with partition and save a table from a dataframe

I'm trying to convert json files to parquet with very few transformations (adding date) but I then need to partition this data before saving it to parquet.
I'm hitting a wall on this area.
Here is the creation process of the table:
df_temp = spark.read.json(data_location) \
.filter(
cond3
)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
then regarding the save of the conversion:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the create table to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, but I'm running out of ideas.
When using USING instead of LIKE, I got this error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
Ps - Once the schema of the table is defined with the partitions, I want to simply use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with spark.
df_temp = spark.read.json(data_location)  # ... (filter and date column as in the question)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
CREATE TABLE IF NOT EXISTS {1}
USING PARQUET
PARTITIONED BY (customer_id, date)
LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto: it uses a weird customer_id out of nowhere and doesn't append the different dates.
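A possible explanation for the odd customer_id values with insertInto, offered as an assumption rather than a confirmed diagnosis: insertInto matches columns by position and ignores their names, while a table created with PARTITIONED BY keeps its partition columns at the end of the schema. If the DataFrame still has customer_id in second position, the values end up in the wrong columns. The sketch below reorders the column names from the question's schema so the partition columns come last; the helper name order_for_insert_into is hypothetical.

```python
def order_for_insert_into(columns, partition_columns):
    """Return the column names with the partition columns moved to the end."""
    data_columns = [c for c in columns if c not in partition_columns]
    return data_columns + list(partition_columns)


# Column order taken from the printSchema() output in the question.
schema_columns = [
    "action_id", "customer_id", "duration", "initial_value", "item_class",
    "set_value", "start_time", "stop_time", "undo_event",
    "year", "month", "day", "date",
]

ordered = order_for_insert_into(schema_columns, ["customer_id", "date"])
print(ordered[-2:])  # ['customer_id', 'date']
```

With a live DataFrame, the reordered list could then be applied before writing, for example df_final.select(*ordered).write.insertInto("duration").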

spark read orc with specific columns

I have an ORC file. When I read it as shown below, it reads all the columns.
val df= spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
But I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can chain any data transformations in your code without any immediate effect. Spark only starts doing the work once an action is called, and it is smart enough not to do extra work: with a columnar format such as ORC, only the selected columns are read.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()

python spark write timestamp as long in elasticsearch

I'm reading data from a JDBC source and writing it directly into an Elasticsearch index. When I queried the data in ES, I saw that all timestamp fields in my DataFrame had been converted to long.
Below is the code I use to save:
(spark_df1.write.format("org.elasticsearch.spark.sql")
    .option('es.index.auto.create', 'true')
    .option("es.write.operation", "index")
    .option('es.host', 'localhost')
    .option('es.mapping.date.rich', "True")
    .option('es.mapping.id', 'Ticket')
    .mode("append")
    .save("index_esche/type"))
When I run spark_df.printSchema(), I get:
|-- Createdon: timestamp (nullable = true)
|-- Updatedon: timestamp (nullable = true)
|-- Resolvedon: timestamp (nullable = true)

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext
val a = sqlContext.sql("select * from student_record")
the schema is as below.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way to read the data using read.json(rdd) and get a Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details under "Programmatically Specifying the Schema" in the Spark SQL documentation.
For example, in your specific case it would be
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column, legal_name, being a map.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your DataFrame from your JSON dataset with the desired schema.
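For PySpark users, the same idea can be sketched with a DDL-formatted schema string, which avoids importing the type classes. This assumes a Spark version whose DataFrameReader.schema accepts DDL strings and an existing SparkSession passed in as spark; the function name read_json_with_map is hypothetical.

```python
# DDL-formatted schema: a single column, legal_name, typed as a string-to-string map.
LEGAL_NAME_SCHEMA = "legal_name MAP<STRING, STRING>"


def read_json_with_map(spark, json_source):
    """Read JSON with legal_name forced to a map instead of an inferred struct."""
    # Supplying the schema up front skips inference, so the nested object
    # is parsed as key/value pairs rather than fixed struct fields.
    return spark.read.schema(LEGAL_NAME_SCHEMA).json(json_source)
```

Calling printSchema() on the result should then report legal_name as a map with string keys and values, as in the Hive table schema from the question.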
