IICS (Informatica) - is there a way to write a multi-part Parquet file? - iics-di

Looking for a way to create a multi-part Parquet file in IICS.
My job is running out of memory because of the data size - can a multi-part Parquet file help?

Related

How to write Avro objects to Parquet with partitions in Java? How to append data to the same Parquet file?

I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka.
I want to write the received data to a Parquet file.
I want to be able to append data to the same Parquet file and to create a Parquet file with partitions.
I managed to create a Parquet file with AvroParquetWriter, but I didn't find how to add partitions or append to the same file.
Before using Avro I used Spark to write the Parquet files. With Spark, writing a Parquet file with partitions and using append mode was trivial. Should I try creating RDDs from my Avro objects and using Spark to create the Parquet files?
I want to write the Parquet files to HDFS.
Personally, I would not use Spark for this.
Rather, I would use the Kafka Connect HDFS sink connector. Here is a config file that can get you started:
name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD
If you want HDFS partitions based on a field rather than the literal Kafka partition number, then refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.
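For example, a sketch of the relevant connector properties (the field name is a placeholder, and the exact partitioner class path can vary by connector version - check the docs):
# partition the HDFS output by a record field instead of the Kafka partition number
partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
partition.field.name=my_field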
If you did want to use Spark, though, you could try AbsaOSS/ABRiS to read the Avro data into a DataFrame; then you should be able to do something like df.write.format("parquet").path("/some/path") (not exact code, because I have not tried it).
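For the write side of the Spark route, a minimal PySpark sketch of a partitioned, append-mode Parquet write (the partition column and output path are illustrative placeholders, not from the original post):
# append to a partitioned Parquet dataset on HDFS; "event_date" and the path are placeholders
df.write \
    .mode("append") \
    .partitionBy("event_date") \
    .parquet("hdfs://hdfs-namenode.example.com:9000/data/events")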

Concatenate ORC partition files on disk?

I am using Spark 2.3 to convert some CSV data to ORC for use with Amazon Athena, and it is working fine! Athena works best with files that are not too small, so after manipulating the data a bit, I am using Spark to coalesce the partitions into a single partition before writing to disk, like so:
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')
This results in a single ORC file that is an optimal file size for use with Athena. However, the coalesce step takes a very long time. It adds about 33% to the total amount of time to convert the data!
This is obviously due to the fact that Spark cannot parallelize the coalesce step when saving to a single file. When I create the same number of partitions as there are CPUs available, the ORC write out to disk is much faster!
My question is: can I parallelize the ORC write to disk and then concatenate the files somehow? That would let me parallelize the write and merge the files without having to compress everything on a single CPU.
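For reference, a minimal PySpark sketch of the parallel write described above (one ORC file per partition; this does not address the concatenation step, and the output path is illustrative):
# write one ORC file per partition, matching the cluster's default parallelism
num_parts = spark.sparkContext.defaultParallelism
df.repartition(num_parts).write.orc("out_orc", compression='zlib', mode='append')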

Best file formats for S3 using Spark for ETL on EMR

We are planning to perform ETL processing using Spark, with the source data sitting on S3. The data volume for ETL processing is less than 100 million. What is the best format to store data in S3 in this scenario, i.e. the best compression and file format (text, sequence, Parquet, etc.)?
ORC or Parquet for queries, compressed with Snappy. Avro is another general-purpose format, but it is far less efficient for Spark SQL queries, as you have to scan a lot more data.
Important: at the time of writing (June 2017), you cannot safely use S3 as a direct destination of Spark RDD/DataFrame save() calls. See the Cloud Integration documentation for an explanation. Write to HDFS, then copy to S3.
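As a minimal PySpark sketch of that recommendation (Snappy-compressed Parquet written to an HDFS staging path first; the path is a placeholder), you could do something like:
# write Snappy-compressed Parquet to HDFS, then copy the result to S3 (e.g. with s3-dist-cp)
df.write.mode("overwrite").parquet("hdfs:///staging/etl_output", compression="snappy")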

Spark write.avro creates individual avro files

I have a spark-submit job I wrote that reads an "in" directory of JSON docs, does some processing on them using DataFrames, and then writes to an "out" directory. For some reason, though, it creates individual Avro, Parquet, or JSON files when I use the df.save or df.write methods.
In fact, I even used the saveAsTable method and it did the same thing with parquet.gz files in the Hive warehouse.
It seems to me that this is inefficient and negates the use of a container file format. Is this right? Or is this normal behavior, and is what I'm seeing just an abstraction in HDFS?
If I am right that this is bad, how do I get the data frame written to a single file instead of many?
As #zero323 said, this is normal behavior due to the many workers (to support fault tolerance).
I would suggest writing all the records to a Parquet or Avro file of Avro generic records, using something like this:
dataframe.write()
    .mode(SaveMode.Append)
    .format(FILE_FORMAT)
    .partitionBy("parameter1", "parameter2")
    .save(path);
but it won't write to a single file; it will group similar Avro generic records together, producing a smaller number of medium-sized files.
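If a single output file per run is really required, a minimal sketch (at the cost of write parallelism, as the ORC question above illustrates; the output path is a placeholder) would be:
# collapse to one partition so Spark writes a single part file
df.coalesce(1).write.mode("append").format("parquet").save("/out")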

Spark with Avro, Kryo and Parquet

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't be doing the same thing.
Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?
Parquet works very well when you need to read only a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better and faster.
Another limitation of Parquet is that it is essentially a write-once format. So you usually need to collect data in some staging area and write it to a Parquet file once a day (for example).
This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or in local files and have a batch job that converts all of them to Parquet files at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
And of course you can use Avro outside of Spark/big data. It is a fairly good serialization format, similar to Google Protocol Buffers or Apache Thrift.
This very good blog post explains the details of everything but Kryo: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.
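For illustration, a minimal PySpark sketch of enabling Kryo for Spark's internal serialization (the application name is a placeholder):
from pyspark.sql import SparkSession

# use Kryo instead of the default Java serialization for shuffle and cached data
spark = (SparkSession.builder
         .appName("kryo-example")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())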
