Parse Hive text format RDD[String] to DataFrame using the schema of an existing table - apache-spark

I have an RDD[String] where each String is a row of Hive text-format data. The Hive table is in a Hive database, so I can get its schema. Is there a way to let Spark parse the RDD[String] into a DataFrame using that schema, so I don't have to do it manually?

If each string in your RDD[String] represents a record with a particular structure such as (id, name, salary), you can create a case class in Scala, convert your RDD[String] to an RDD of that case class, and then call toDF() to convert the RDD to a DataFrame.
If your file is delimited, you can use the csv source to create the DataFrame directly from the delimited file if you are using Spark 2.x or later. If you are using Spark 1.6.x or earlier, you can use the external spark-csv package for the same purpose.
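A minimal sketch of the case-class approach, assuming a spark-shell session where sc and sqlContext already exist; the Employee fields, the sample rows, and the delimiter handling are illustrative assumptions:
case class Employee(id: Int, name: String, salary: Double)

import sqlContext.implicits._          // needed for toDF()

val sep = "\u0001"                     // Hive's default text-format field delimiter (^A)
val raw = sc.parallelize(Seq(
  Seq("1", "Alice", "50000.0").mkString(sep),
  Seq("2", "Bob", "60000.0").mkString(sep)))

val df = raw
  .map(_.split(sep))
  .map(f => Employee(f(0).toInt, f(1), f(2).toDouble))
  .toDF()

df.printSchema()                       // id, name, salary with their inferred types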
Hope it helps.
Regards,
Neeraj

Related

Which file formats can I save a pyspark dataframe as?

I would like to save a huge PySpark DataFrame as a Hive table. How can I do this efficiently? I am looking to use saveAsTable(name, format=None, mode=None, partitionBy=None, **options) from pyspark.sql.DataFrameWriter.
# Let's say I have my dataframe, my_df
# Am I able to do the following?
my_df.saveAsTable('my_table')
My question is which formats are available for me to use and where can I find this information for myself? Is OrcSerDe an option? I am still learning about this. Thank you.
The following file formats are supported (a short usage sketch follows the list):
text
csv
jdbc
json
parquet
orc
Reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
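For example, a format name can be passed to the writer before calling saveAsTable (a minimal sketch in Scala against the DataFrameWriter linked above; myDf and the table name are assumptions, and PySpark's format= argument accepts the same strings):
// myDf is assumed to be an existing DataFrame in a session with Hive support enabled
myDf.write
  .format("orc")             // any of the formats listed above
  .mode("overwrite")
  .saveAsTable("my_table")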
So I was able to write the pyspark dataframe to a compressed Hive table by using a pyspark.sql.DataFrameWriter. To do this I had to do something like the following:
my_df.write.orc('my_file_path')
That did the trick.
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.write
I am using pyspark 1.6.0 btw

Wrong schema in case of DataFrame obtained from ORC Hive table [duplicate]

I have a directory containing ORC files. I am creating a DataFrame using the code below:
var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");
It returns a DataFrame with this schema:
[_col0: int, _col1: bigint]
whereas the expected schema is:
[scan_nbr: int, visit_nbr: bigint]
When I query files in Parquet format I get the correct schema.
Am I missing any configuration(s)?
Adding more details
This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)
We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.
Data is written by upstream Hive jobs, a simple CTAS (CREATE TABLE sample STORED AS ORC as SELECT ...).
I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
The problem is the Hive version (1.2.1), which has this bug: HIVE-4243.
This was fixed in 2.0.0.
Setting
sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')
fixes this.
If you have the Parquet version of the table as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for the ORC table, so I had to move it to the end):
import functools

tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')  # move the partition column to the end

# fix column names for the ORC table
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
Alternatively, we can use:
val df = hiveContext.read.table("tableName")
df.schema or df.columns will then give the actual column names.
If a version upgrade is not an option, a quick fix is to rewrite the ORC files using Pig. That seems to work just fine.

Read parquet files in Spark with pattern matching

I'm running Spark 1.3.0 and want to read a number of Parquet files based on pattern matching. The Parquet files are basically the underlying files of a Hive DB, and I want to read only some of the files (across different folders). The folder structure is:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/some/meta/files/
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/01/file1.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/02/file2.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160103/01/file3.parq
Something like
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd={[0-9]*}")
I want to ignore the meta files and load only the parquet files inside the date folders. Is this possible?
You can use a wildcard in the Parquet path like so (works on 1.5; didn't test on 1.3):
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd*")
Another thing you can do, in case that doesn't work, is to create an external table in Hive partitioned by yymmdd (a sketch of the table definition follows the snippet below) and read the Parquet data from that table using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM ...")
You can't use regular expressions in the path, though.
Also, I think your folder structure is problematic. It should be:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/
or
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/part=01
and not:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=150204/1
because, the way it is now, I think you will have trouble using the folder names (yymmdd) as partitions, since the files are not directly under the partition folders.

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, it seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kind of misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark data source table.
It also stores something in the Hive metastore, but not what you intend.
This remark was made on the spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach (a sketch follows):
1. Use CREATE TABLE ... via Spark SQL to register the table in the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3).
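A minimal sketch of those two steps, assuming Spark 1.3 with a HiveContext named sqlContext and an existing DataFrame named myDataFrame; the table name and column list are assumptions for illustration:
// 1. Register the table in the Hive metastore via HiveQL
sqlContext.sql(
  "CREATE TABLE IF NOT EXISTS my_dataframe (user_id STRING, email STRING, ts STRING) STORED AS PARQUET")

// 2. Write the DataFrame's rows into that table; column order must match the table definition
myDataFrame.insertInto("my_dataframe", true)   // true = overwrite existing data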
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I create the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema (the schema is empty) if I create the table with partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.partitionBy("day")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("day")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")  // suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the data source table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the docs, Spark only puts the data onto HDFS but won't create the table in Hive. You can then manually go into the hive shell and create an external table with the proper schema and partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done this in PySpark, Spark version 2.3.0:
Create an empty table where we need to save/overwrite the data, like:
create table databaseName.NewTableName like databaseName.OldTableName;
Then run the command below:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is that you can't read this table with Hive, but you can read it with Spark.
Note: MSCK REPAIR TABLE adds metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore to the Hive metastore.
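If needed, the partition repair can also be issued from Spark itself in 2.x (a one-line sketch; the table name is the one assumed above):
spark.sql("MSCK REPAIR TABLE databaseName.NewTableName")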
