Can we use a DataFrame when reading data from HDFS?
I have tab-separated data in HDFS.
I googled, but only saw it being used with NoSQL data.
DataFrame is certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is natively provided in 1.4 to 1.6.1; delimited text files are supported using the spark-csv package.
If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:
sqlContext.read.
format("com.databricks.spark.csv").
option("delimiter","\t").
option("header","true").
load("hdfs:///demo/data/tsvtest.tsv").show
To run the code from spark-shell, start the shell with the spark-csv package:
spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
In Spark 2.0, CSV is natively supported, so you should be able to do something like this:
spark.read.
option("delimiter","\t").
option("header","true").
csv("hdfs:///demo/data/tsvtest.tsv").show
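With only the options above, every column comes back as a string. If typed columns are wanted, a schema can be passed to the reader. This is a minimal sketch, assuming Spark 2.0 or later; the column names `id` and `name` and the helper `readTsv` are made up for illustration and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical layout of the TSV file; adjust names and types to the real data.
val tsvSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// Reads a tab-separated file with a header line, applying the given schema
// instead of treating every column as a string.
def readTsv(spark: SparkSession, path: String, schema: StructType): DataFrame =
  spark.read
    .option("delimiter", "\t")
    .option("header", "true")
    .schema(schema)
    .csv(path)

// In spark-shell, where `spark` already exists:
// readTsv(spark, "hdfs:///demo/data/tsvtest.tsv", tsvSchema).show()
```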
If I understand correctly, you essentially want to read data from HDFS and have it automatically converted to a DataFrame.
If that is the case, I would recommend the spark-csv library. Check it out; it has very good documentation.
I have a question: is it possible to run ETL on data using Flume?
To be more specific, I have Flume configured on a spoolDir that contains CSV files, and I want to convert those files into Parquet files before storing them in Hadoop. Is it possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them using Spark on Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi is the replacement for Apache Flume.
Flume partial answers (not Parquet):
If you are flexible on the format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether it also allows Parquet in the definition, but I have heard that ORC is the only supported format.)
You could likely use a simple Hive script to move the data from the ORC table to a Parquet table, which would give you the Parquet files you asked for.
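If the Spark route from the question is chosen instead, the conversion itself is short. This is a minimal sketch, assuming Spark 2.0 or later and CSV files that start with a header row; the helper name `csvToParquet` and the paths in the usage comment are hypothetical, not taken from the question.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Reads every CSV file under `inputDir` and appends the rows as Parquet
// files under `outputDir`.
def csvToParquet(spark: SparkSession, inputDir: String, outputDir: String): Unit = {
  val csv = spark.read
    .option("header", "true")      // assumes the first line holds column names
    .option("inferSchema", "true") // extra pass over the data to guess column types
    .csv(inputDir)

  // Append, so repeated runs add new Parquet files instead of replacing old ones.
  csv.write.mode(SaveMode.Append).parquet(outputDir)
}

// In spark-shell, with made-up paths:
// csvToParquet(spark, "hdfs:///data/incoming/csv", "hdfs:///data/warehouse/parquet")
```

Flume (or NiFi) would then only need to land the raw CSV files in the input directory; the job can run on whatever schedule suits the data volume.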
It looks like Spark by default writes "org.apache.spark.sql.parquet.row.metadata" to the Parquet file footer. However, what if I want to write some arbitrary metadata (such as version=123) to a Parquet file produced by Spark?
This does NOT work:
df.write().option("version","123").parquet("somefile.parquet");
I'm using Spark version 1.6.2.
Column-level metadata: yes, see my comment.
Table-level comments/user metadata: see https://issues.apache.org/jira/browse/SPARK-10803
Sadly, not yet.
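For the column-level case, one way to attach a key/value pair is `Column.as(alias, metadata)`. This is a sketch only: the helper `withColumnMetadata` is a made-up name, and it does not add a separate key to the Parquet footer. The pair travels inside the schema that Spark itself stores there, so whether it comes back on read should be checked on your Spark version.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Returns a copy of `df` in which column `name` carries the given key/value
// pair in its schema metadata. Existing metadata on that column is kept, and
// all other columns are passed through unchanged.
def withColumnMetadata(df: DataFrame, name: String, key: String, value: String): DataFrame = {
  val meta: Metadata = new MetadataBuilder()
    .withMetadata(df.schema(name).metadata)
    .putString(key, value)
    .build()

  val columns = df.columns.map { c =>
    if (c == name) col(c).as(c, meta) else col(c)
  }
  df.select(columns: _*)
}

// Usage, assuming `df` has a column called "id" (hypothetical):
// val tagged = withColumnMetadata(df, "id", "version", "123")
// tagged.write.parquet("somefile.parquet")
// tagged.schema("id").metadata.getString("version")
```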
We have a file that we want to split into 3 and that we need to perform some data cleanup on before it can be imported into HANA Vora; otherwise everything has to be typed as String, which is not ideal.
We can import and prepare the DataFrames in Spark just fine, but when I try to write to either the HDFS filesystem or, better, to save as a table in the "com.sap.spark.vora" datasource, I get errors.
Can anyone advise on a reliable way to import the Spark-prepared datasets into HANA Vora? Thanks!
Vora currently only officially supports appending data to an existing table (using the APPEND statement). For details, see the SAP HANA Vora Developer Guide, chapter "3.5 Appending Data to Existing Tables".
This means you would have to create an intermediate file. Vora supports reading from CSV, ORC and Parquet files. A DataFrame can be saved as ORC or Parquet files directly from Spark (see https://spark.apache.org/docs/1.6.1/sql-programming-guide.htm). To write CSV files from Spark, see https://github.com/databricks/spark-csv
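The intermediate-file step might look like the sketch below, which also casts the string columns to proper types before writing, since that was the concern in the question. The helper names, column names and path are invented for illustration, and the Vora side (creating the table and appending the files) is not shown.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Casts the listed columns to the given Spark SQL type names
// (for example "int", "double", "date") and leaves every other column untouched.
def castColumns(df: DataFrame, types: Map[String, String]): DataFrame = {
  val columns = df.columns.map { c =>
    types.get(c) match {
      case Some(t) => col(c).cast(t).as(c)
      case None    => col(c)
    }
  }
  df.select(columns: _*)
}

// Writes the typed DataFrame as Parquet, replacing any previous output at `path`.
def writeIntermediate(df: DataFrame, path: String): Unit =
  df.write.mode("overwrite").parquet(path)

// Usage with made-up column names and path:
// val typed = castColumns(df, Map("id" -> "int", "amount" -> "double"))
// writeIntermediate(typed, "hdfs:///tmp/vora_staging/part1")
```

Parquet keeps the column types, so the files no longer force everything to String when they are loaded.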
I want to convert XML files to Avro. The data will be in XML format and will hit a Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the Avro files hit HDFS, I want to be able to read them into Hive tables later.
What is the best method to do this? I have tried automated schema conversion with spark-avro (without Spark Streaming), but the problem is that spark-avro converts the data and Hive cannot read it. Spark-avro converts the XML to a DataFrame and then the DataFrame to Avro. The resulting Avro file can only be read by my Spark application. I am not sure I am using this correctly.
I think I will need to define an explicit Avro schema, but I am not sure how to go about this for the XML file. It has multiple namespaces and is quite massive.
If you are on Cloudera (since you have Flume, you may have it), you can use Morphlines to do the conversion at record level. It can be used in batch or streaming mode. You can see here for more info.
I checked this blog post: https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/.
How can I use this "presto-orc" file format?
I have my data in S3 in text format. I want to rewrite it in "presto-orc" format.
I generally use Hive to write data into ORC/RCFile/Parquet.
There is no special "presto-orc" format. Presto has an optimized reader for the standard ORC format (and the Facebook DWRF variant).
You can write the files in ORC format using any program that supports it: Hive, Presto, Spark, etc.