Converting CSV to ORC with Spark - apache-spark

I've seen this blog post by Hortonworks for support for ORC in Spark 1.2 through datasources.
It covers version 1.2 and it addresses the issue or creation of the ORC file from the objects, not conversion from csv to ORC.
I have also seen ways, as intended, to do these conversions in Hive.
Could someone please provide a simple example for how to load plain csv file from Spark 1.6+, save it as ORC and then load it as a data frame in Spark.

I'm going to ommit the CSV reading part because that question has been answered quite lots of time before and plus lots of tutorial are available on the web for that purpose, it will be an overkill to write it again. Check here if you want !
ORC support :
Concerning ORCs, they are supported with the HiveContext.
HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive but ORCs, Window function and other feature depends on HiveContext which reads the configuration from hive-site.xml on the classpath.
You can define a HiveContext as following :
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
If you are working with the spark-shell, you can directly use sqlContext for such purpose without creating a hiveContext since by default, sqlContext is created as a HiveContext.
Specifying as orc at the end of the SQL statement below ensures that the Hive table is stored in the ORC format. e.g :
val df : DataFrame = ???
df.registerTempTable("orc_table")
val results = hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
Saving as an ORC file
Let’s persist the DataFrame into the Hive ORC table we created before.
results.write.format("orc").save("data_orc")
To store results in a hive directory rather than user directory, use this path instead /apps/hive/warehouse/data_orc (hive warehouse path from hive-default.xml)

Related

Table created with "stored as Parquet" option using PySpark SQL or Hive does not actually store data files in Parquet format

I create table on Hadoop cluster using PySpark SQL:spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet") and load some data with: spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"), however the resulting files do not seem to be Parquet files, they're missing ".snappy.parquet" extension.
The same problem occurs when repeating those steps in Hive.
But surprisingly when I create table using PySpark DataFrame: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
So, my question is: what's wrong with the SQL way of creating and populating Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022 after #mazaneicha answer)
Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do is to check the content of the files with hdfs dfs -tail (and -head). And in all cases there is "PAR1" both at the beginning and at the end of the file. And even more - the meta-data of parquet version (implementation):
Method # of files Total size Parquet version File name
Hive Insert 8 34.7 G Jparquet-mr version 1.10.0 xxxxxx_x
PySpark SQL Insert 8 10.4 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto 8 10.9 G Iparquet-mr version 1.6.0 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable 8 11.5 G Jparquet-mr version 1.10.1 part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet
(To create the same number of files I used "repartition" with df, and "distribute by" with SQL).
So, considering the above mentioned, it's still not clear:
Why there is no file extension in 3 out of 4 cases?
Why files created with Hive are so big? (no compression, I suppose).
Why PySpark SQL and PySpark Dataframe versions/implementations of parquet differ and how set them explicitly?
File format is not defined by the extension, but rather by the contents. You can quickly check if format is parquet by looking for magic bytes PAR1 at the very beginning and the very end of a file.
For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
Update:
As mentioned in online docs, parquet is supported by Spark as one of the many data sources via its common DataSource framework, so that it doesn't have to rely on Hive:
"When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
You can find and review this implementation in Spark git repo (its open-source! :))

Understanding how Hive SQL gets executed in Spark

I am new to spark and hive. I need to understand what happens behind when a hive table is queried in Spark. I am using PySpark
Ex:
warehouse_location = '\user\hive\warehouse'
from pyspark.sql import SparkSession
spark =SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in spark framework or does it run in MapReduce framework of Hive.
I am just wondering how the SQL is being processed. Whether in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice Hive support means that Spark will use Hive metastore to read and write metadata. Before 2.0 there where some additional benefits (window function support, better parser), but this no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running a DF.count() and track the job via Spark UI at http://localhost:4040.

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4, to be more specific Hive 1.2.1 with TEZ 0.7.0 , Spark 2.2.0.
My task is simple. Store data in ORC file format then use Spark to process the data. To achieve this, I am doing this:
Create a Hive table through HiveQL
Use Spark.SQL("select ... from ...") to load data into dataframe
Process against the dataframe
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use SparkSQL to run the command in step 1
In your case, Hive is defining a schema over your data and providing you a query layer for Spark and external clients to communicate
Otherwise, spark.orc exists for reading and writing of dataframes directly on the filesystem

pyspark, how to read Hive tables with SQLContext?

I am new to the Hadoop ecosystem and I am still confused with few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0)
I have some Hive table that exist and I can do some SQL queries using HUE web interface with Hive (map reduce) and Impala (mpp).
I am now using pySpark (I think behind this is pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many thready that discussed the differences between the two and for various version of Spark.
With Hive context, I have no issue to query the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is subset of HiveContext, I was thinking that a basic SQL select should work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = mysqlContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added the hive-site.xml to pyspark-shell. When running
sc._conf.getAll(
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are:
Can I acess Hive table with SQLContext for simple queries (I know
HiveContext is more powerfull but for me this is just to understand
things)
If this is possible what is missing ? I couldn't find any info apart
from the hive-site.xml that I tried but doesn't seems to work
Thanks a lot
Cheers
Fabien
As mentioned in other answer, you can't use SQLContext to access Hive tables, they've given a seperate HiveContext in Spark 1.x.x which is basically an extension of SQLContext.
Reason::
Hive uses an external metastore to keep all the metadata, for example the information about db and tables. This metastore can be configured to be kept in MySQL etc. Default is derby.
This done so that all the users accessing Hive may see all the contents facilitated by metastore.
Derby creates a private metastore as a directory metastore_db in the directory from where the spark app is executed. Since this metastore is private, what ever you create or edit in this session, will not be accessible to anyone else. SQLContext basically facilitates a connection to a private metastore.
Needless to say, in Spark 2.x.x they've merged the two into SparkSession which acts as a singular entry point to spark. You can enable Hive support while creating SparkSession by .enableHiveSupport()
You cannot use standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and HiveContext.
You could use use JDBC data source, but it won't be acceptable performance wise for large scale processing.
To access SQLContext tables, you need to register it temporarily. Then you can easily make SQL queries on it. Suppose you have some data in the form of JSON. You can make it in dataframe.
Like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")
sql_df = df.registerTempTable("mytable")
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
Also you can collect the SQL data in row type array as below:-
r = FromSSQL.collect()
print r.column_Name
Try without keeping sc into sqlContext,I think when we create sqlContext object with sc spark is trying to call HiveContext but we are having sqlContext instead
>>>df=sqlContext.sql("select * from <db-name>.<table-name>")
Use the superset of SQL Context i.e HiveContext to Connect and load the hive tables to spark dataframes
>>>df=HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>>df=HiveContext(sc).table("default.text_Table")
(or)
>>> hc=HiveContext(sc)
>>> df=hc.sql("select * from default.text_Table")

Read avro data using spark dataset in java

I am newbie to spark and am trying to load avro data to spark 'dataset' (spark 1.6) using java. I see some examples in scala but not in java.
Any pointers to examples in java will be helpful. I tried to create a javaRDD and then convert it to 'dataset'. I believe there must be a straight forward way.
first of all you need to set hadoop.home.dir
System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");
then create a Spark session specifying where the avro file will be located
SparkSession spark = SparkSession .builder().master("local").appName("ASH").config("spark.cassandra.connection.host", "127.0.0.1").config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/").getOrCreate();
In my code am using an embedded spark environement
// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro") .load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
hope this helps

Resources