It looks like Spark by default writes "org.apache.spark.sql.parquet.row.metadata" to the Parquet file footer. However, what if I want to write some arbitrary metadata (such as version=123) to a Parquet file produced by Spark?
This does NOT work:
df.write().option("version","123").parquet("somefile.parquet");
I'm using Spark version 1.6.2.
Column-level metadata: yes, see my comment (and the sketch below).
Table-level comments/user metadata: see https://issues.apache.org/jira/browse/SPARK-10803
Sadly, not yet supported.
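For the column-level case, here is a minimal sketch (untested), assuming a hypothetical column named "value" in your DataFrame. Spark lets you attach a Metadata object to a column, and that metadata is serialized into the schema JSON stored under org.apache.spark.sql.parquet.row.metadata in the Parquet footer:
import org.apache.spark.sql.types.MetadataBuilder
// Build custom key/value metadata and attach it to the (hypothetical) column "value"
val md = new MetadataBuilder().putString("version", "123").build()
val dfWithMeta = df.withColumn("value", df("value").as("value", md))
// The column metadata travels inside org.apache.spark.sql.parquet.row.metadata in the footer
dfWithMeta.write.parquet("somefile.parquet")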
I have a question: is it possible to perform ETL on data using Flume?
To be more specific, I have Flume configured with a spoolDir source that contains CSV files, and I want to convert those files into Parquet before storing them in Hadoop. Is that possible?
If it's not possible, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark once they are in Hadoop?
I'd probably suggest using NiFi to move the files around. Here's a specific tutorial on how to do that with Parquet. I feel NiFi is the replacement for Apache Flume.
Flume partial answers (not Parquet):
If you are flexible on the format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the sink definition also allows Parquet, but I have heard that ORC is the only supported format.)
You could then use a simple Hive script to move the data from the ORC table into a Parquet table, converting the files into the Parquet files you asked for; see the sketch below.
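A minimal sketch of that Hive step, using hypothetical table names (the ORC table written by the Hive sink and a new Parquet table):
-- Copy the ORC-backed table into a new Parquet-backed table
CREATE TABLE flume_data_parquet STORED AS PARQUET AS
SELECT * FROM flume_data_orc;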
We had some old code using the org.apache.parquet.hadoop.api.WriteSupport API to write Parquet-formatted files, and we have started to use Apache Spark to do the same thing.
The two approaches successfully generate Parquet files from the same input data, and the output data are almost identical. However, the output file sizes are quite different.
The one generated by WriteSupport is 2 GB-ish, whereas the one generated by Spark is 5.5 GB-ish. I compared the schemas and they are the same; is there any area I can look into further?
Btw, the WriteSupport one uses parquet-mr version 1.8.0; the Spark one uses 1.10.0.
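One place to dig further is the footers of the two files, comparing row groups, compression codecs and encodings. A rough sketch (untested) using the parquet-mr API, with a hypothetical file path:
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
// Read the footer and print the codec and encodings of every column chunk
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/path/to/file.parquet"))
footer.getBlocks.asScala.flatMap(_.getColumns.asScala).foreach { col =>
  println(s"${col.getPath} codec=${col.getCodec} encodings=${col.getEncodings}")
}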
I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka.
I want to write the received data to a Parquet file.
I want to be able to append data to the same Parquet dataset and to create Parquet files with partitions.
I managed to create a Parquet file with AvroParquetWriter, but I didn't find how to add partitions or append to the same file.
Before using Avro I used Spark to write the Parquet files; with Spark, writing a partitioned Parquet dataset in append mode was trivial. Should I try creating RDDs from my Avro objects and use Spark to create the Parquet files?
I want to write the Parquet files to HDFS.
Personally, I would not use Spark for this.
Rather I would use the HDFS Kafka Connector. Here is a config file that can get you started.
name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD
If you want HDFS partitions based on a field rather than the literal Kafka partition number, then refer to the configuration docs on the FieldPartitioner (a sketch is shown below). If you want automatic Hive integration, see the docs on that as well.
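A rough sketch of that partitioner configuration, where the field name "date" is a hypothetical example and the partitioner's package can differ between connector versions (io.confluent.connect.storage.partitioner vs io.confluent.connect.hdfs.partitioner):
# Partition the HDFS output by a record field instead of the Kafka partition number
partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
partition.field.name=date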
If you did want to use Spark, though, you could try AbsaOSS/ABRiS to read the Avro data into a DataFrame, and then you should be able to do something like df.write.format("parquet").save("/some/path") (not exact code, because I have not tried it); see the sketch below.
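A rough sketch of that write step (untested), assuming df already holds the decoded records and has a hypothetical "date" column to partition by:
// Append to the existing dataset and split the output into date=... subdirectories
df.write
  .mode("append")
  .partitionBy("date")
  .parquet("hdfs:///some/path")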
I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query the Parquet data. Basically, here is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet files in S3 using Spark and it's working fine. Now I am looking at Presto, and what I think I found is that it needs the Hive metastore in order to query Parquet. I could not make Presto read my Parquet files even though Parquet saves the schema. So, does that mean that when creating the Parquet files, the Spark job also has to store metadata in the Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle that I am building the schema programmatically in the Spark job and applying it while creating the Parquet files. And if I am creating the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed some light on whether there is a better alternative?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can define the tables from Spark, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet footers, but I'm not aware of an existing one.
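If you would rather have the Spark job register the table in the metastore itself, a rough sketch (untested), assuming a Hive-enabled SparkSession and hypothetical bucket and table names:
// Write Parquet to S3 and register it as an external table in the Hive metastore
df.write
  .format("parquet")
  .option("path", "s3a://s3-bucket/path/to/table/dir")  // external location; s3a:// for the Hadoop S3A filesystem
  .saveAsTable("default.xxx")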
Can we use DataFrames to read data from HDFS?
I have tab-separated data in HDFS.
I googled, but only saw examples of it being used with NoSQL data.
DataFrames are certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is provided natively in 1.4 to 1.6.1; text-delimited files are supported using the spark-csv package.
If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:
sqlContext.read.
format("com.databricks.spark.csv").
option("delimiter","\t").
option("header","true").
load("hdfs:///demo/data/tsvtest.tsv").show
To run the code from spark-shell, launch it with the spark-csv package:
spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
In Spark 2.0, CSV is natively supported, so you should be able to do something like this:
spark.read.
option("delimiter","\t").
option("header","true").
csv("hdfs:///demo/data/tsvtest.tsv").show
If I am understanding correctly, you essentially want to read data from HDFS and have it automatically converted to a DataFrame.
If that is the case, I would recommend this spark-csv library. Check it out; it has very good documentation.