I am confused about the difference between the following two code snippets in Databricks:
spark.readStream.format('json')
vs
spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')
I know that using cloudFiles as the format means Databricks Auto Loader. In terms of performance and functionality, which one is better? Does anyone have experience with this?
Thanks
There are multiple differences between the two. With Auto Loader you get at least the following (see the documentation for the full details):
Better performance, scalability and cost efficiency when discovering new files. You can use either file notification mode (you are notified about new files through a cloud-native integration) or optimized file listing mode, which uses native cloud APIs to list files and directories. Spark's file streaming relies on the Hadoop APIs, which are much slower, especially if you have many nested directories and many files.
Support for schema inference and evolution. Auto Loader can detect schema changes for JSON/CSV/Avro data and adjust the schema to process new fields.
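For example, a minimal Auto Loader sketch (the input, schema and checkpoint paths are placeholders; cloudFiles.schemaLocation is what turns on schema inference and evolution):

df = (spark.readStream
      .format('cloudFiles')
      .option('cloudFiles.format', 'json')
      .option('cloudFiles.schemaLocation', '/mnt/landing/_schemas/events')  # schema inference/evolution
      .load('/mnt/landing/events'))

(df.writeStream
   .format('delta')
   .option('checkpointLocation', '/mnt/bronze/_checkpoints/events')
   .start('/mnt/bronze/events'))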
Hi All,
I am currently working on developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions - third-party tools). The existing architecture is old school and uses mostly RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
But I have been told to use the Kafka-Spark Streaming pair to handle all kinds of data.
If anyone has experience with the Kafka-Spark Streaming pair for handling all kinds of data, could you please give me brief details on whether this could be a viable solution, and better than having two different pipelines?
Thanks in advance!
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
Pipeline 1: Sqoop is a good choice for the batch load, but it will be slow because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, though I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. Having said that, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time, and you can do the batch transformations in Spark as well. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice, and a good one. If you are using the Confluent distribution of Kafka, you could also replace most of the Spark transformations with KSQL or Kafka Streams, which run the transformations in real time.
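For the Kafka side, a minimal Structured Streaming read sketch (the broker address, topic name and checkpoint path are placeholders, and the spark-sql-kafka package must be on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('kafka-ingest').getOrCreate()

# Subscribe to one topic; Kafka delivers key/value as bytes, so cast before parsing.
events = (spark.readStream
          .format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9092')
          .option('subscribe', 'trades')
          .load()
          .select(col('key').cast('string'), col('value').cast('string')))

query = (events.writeStream
         .format('console')  # replace with your real sink
         .option('checkpointLocation', '/tmp/checkpoints/trades')
         .start())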
I would say it's good to have one system for batch and one for real time; that is essentially the Lambda architecture. But if you are looking for a more unified framework, you can try Apache Beam, which provides a single framework for both batch and real-time processing and lets you choose from multiple runners to execute your pipelines.
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.
I'm using Spark (2.4) to process data stored on S3.
I'm trying to understand if there's a way to avoid listing the objects that I'm reading as my batch job's input (I'm talking about ~1M objects).
I know about S3Guard, which stores the objects' metadata, and thought that I could use it to skip the S3 listing.
I've read this Cloudera blog post:
Note that it is possible to skip querying S3 in some cases, just serving results from the Metadata Store. S3Guard has mechanisms for this but it is not yet supported in production.
I know it's quite old - is this already available in production?
As of July 2019 it is still tagged as experimental; HADOOP-14936 lists the tasks there.
The recent work has generally been on corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.
The specific feature you are talking about, "auth mode", relies on all clients using S3Guard and updating the tables, and on us being happy that we can handle the failure conditions for consistency.
For a managed table, I'm going to say Hadoop 3.3 will be ready to use this. For Hadoop 3.2, it's close. Really, more testing is needed.
In the meantime, if you can't reduce the number of files in S3, can you make sure you don't have a deep directory tree? It's that recursive directory scan which really suffers here.
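If you do want to experiment with S3Guard anyway, a minimal sketch of the relevant settings passed through the Spark session (the table name and region are placeholders; the property names come from the S3A/S3Guard documentation, so check them against your Hadoop version):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('s3guard-experiment')
         .config('spark.hadoop.fs.s3a.metadatastore.impl',
                 'org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore')
         .config('spark.hadoop.fs.s3a.s3guard.ddb.table', 'my-s3guard-table')
         .config('spark.hadoop.fs.s3a.s3guard.ddb.region', 'us-east-1')
         .getOrCreate())

# Listings for s3a:// paths now consult the DynamoDB metadata store first.
df = spark.read.json('s3a://my-bucket/input/')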
Two points here after reading https://wiki.apache.org/hadoop/AmazonS3
Not sure what to make of this below.
...
eventual consistency: changes made by one application (creation, updates and deletions) will not be visible until some undefined time.
...
Some undefined time? What does that mean for writing Spark applications then? If I have n jobs, does that mean something may not yet be visible?
How does Spark's default partitioning apply then for S3 data?
That Hadoop doc is a bit out of date; I'd google for "spark and object stores" to get some more up-to-date material.
The Spark documentation has some Spark-specific details.
Some undefined time? What does that mean for writing Spark applications then?
Good question. AWS never give the hard data here; the best empirical study is "Benchmarking Eventual Consistency: Lessons Learned from Long-Term Experimental Studies".
That showed that consistency delays depend on total AWS load, among other patterns. Because it's so variable, nobody dares give a good value for "undefined time".
My general expectations are:
Normally list inconsistency lasts a few seconds, but under load it can get worse.
If nothing has actually gone wrong with S3, then a few minutes is enough for listing inconsistencies to be resolved.
All the S3 connectors mimic rename by listing all the files under a path, then copying and deleting them, so renaming a directory immediately after processes have written to it may miss data.
Because the means by which Spark jobs commit their output to a filesystem depends on rename(), it is not safe to use them to commit the output of tasks.
If I have n jobs, does that mean something may not yet be visible?
It's worse than that. You can't rely on the rename operations within a single job to get it right.
It's why Amazon offers a consistent EMRFS option using DynamoDB for listing consistency, and why Hadoop 2.9+ has a feature, S3Guard, which uses DynamoDB for that same operation. Neither deals with update inconsistency though, which is why Hadoop 3.1's "S3A committers" default to generating unique filenames for new files.
If you are using the Apache S3A connector to commit work to S3 using the normal filesystem FileOutputCommitter then, without S3Guard, you are at risk of losing data.
Don't worry about chaining work; worry about that.
BTW: I don't know what Databricks do here. Ask them for the details.
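If you are on Hadoop 3.1+ with Spark's hadoop-cloud module on the classpath, a hedged sketch of switching to the S3A committers (property names taken from the Hadoop S3A committer documentation; verify them against your exact Spark/Hadoop versions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3a.committer.name', 'directory')
         .config('spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a',
                 'org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory')
         .config('spark.sql.sources.commitProtocolClass',
                 'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')
         .config('spark.sql.parquet.output.committer.class',
                 'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')
         .getOrCreate())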
How does Spark's default partitioning apply then for S3 data?
The partitioning is based on whatever block size the object store connector makes up. For example, for the S3A connector, it's the value of fs.s3a.blocksize.
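A minimal sketch of that (the bucket and path are placeholders; fs.s3a.blocksize is the real S3A setting, given here in bytes):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3a.blocksize', '134217728')  # 128 MB reported "block size"
         .getOrCreate())

# RDD input splits roughly follow the block size the connector reports.
rdd = spark.sparkContext.textFile('s3a://my-bucket/big-table/')
print(rdd.getNumPartitions())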
I am working with some legacy systems in the investment banking domain, which are very unfriendly in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place and a large number of transactions are stored on these systems.
The question is how to read a large number of large files on NFS and dump them onto a system where analytics can be done by something like Spark or Samza.
Back to the issue. Due to the nature of the legacy systems, we are extracting data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, because I need to run some financial analytics on this data. I have two questions:
How to efficiently read a large number of large files which are located on one or several machines?
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there are a number of references out there, including an IBM Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the CO:z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do it is zconnector, an IBM product for data ingestion between the mainframe and a Hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading a file is essentially a serial operation - that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best way is to make sure that the source the data is coming from dumps files into multiple NFS folders. From that point on I can run multiple processes to load the data into HDFS or Kafka, since those systems are highly parallelized.
How to load? One good way is to mount the NFS share into the Hadoop infrastructure and use distcp. Other possibilities open up too once we make sure files are available from a large number of NFS mounts; a sketch is below. Otherwise remember, reading a file is a serial operation. Thanks.
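One such possibility, a minimal sketch rather than the distcp route (the mount points and target path are hypothetical, and file:// paths must be visible on every worker):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('nfs-to-hdfs').getOrCreate()

# Each NFS mount contributes its own input splits, so the copies proceed in parallel.
nfs_dirs = ['file:///mnt/nfs1/export', 'file:///mnt/nfs2/export', 'file:///mnt/nfs3/export']
raw = spark.read.text(nfs_dirs)
raw.write.mode('append').text('hdfs:///landing/trades')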
I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't throw them all into one topic in a simple way without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it only runs the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome, if there is not an easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default the scheduler is FIFO... it only worked for me once I explicitly wrote the file (I couldn't set it programmatically for some reason).
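A minimal sketch of that wiring, with hypothetical paths and pool names (spark.scheduler.mode, spark.scheduler.allocation.file and the spark.scheduler.pool local property are the standard Spark settings):

# Write an allocation file and point Spark at it.
allocation_xml = """<?xml version="1.0"?>
<allocations>
  <pool name="streams">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
"""
with open('/tmp/fairscheduler.xml', 'w') as f:
    f.write(allocation_xml)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('multi-stream')
         .config('spark.scheduler.mode', 'FAIR')
         .config('spark.scheduler.allocation.file', '/tmp/fairscheduler.xml')
         .getOrCreate())

# Jobs submitted from this thread go to the named pool; use one thread (and pool) per stream.
spark.sparkContext.setLocalProperty('spark.scheduler.pool', 'streams')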