I am using a Flink program to write the streaming data that I get from Kafka to Azure Data Lake. When I use synchronisation while getting the ADLOutputStream and writing and closing, it works fine, but the performance is very poor since only one thread is writing to the data lake at a time. When I use multiple threads without synchronisation, it throws an HTTP 400 IllegalArgumentException. Is there any way for multiple threads to write to a file in Azure Data Lake?
Have another think on your design.
One approach would be to write multiple files to the Data Lake - one for each thread. Once they are in the Data Lake, you can use U-SQL or PolyBase to query over the set of files as if they were one data source. Alternatively, you could orchestrate a U-SQL job to merge the files once they are in the lake. This would be local processing and would perform well.
Using AdlOutputStream is not the right mechanism for such parallel writes. AdlOutputStream is designed for a single-writer scenario. When ingesting data in parallel from multiple threads, there are a few characteristics we commonly observe:
You want to optimize for throughput and not do synchronization across threads
Ordering (across threads) is typically not important
For specifically addressing these types of scenarios, Azure Data Lake Store provides a unique, high-performance API that we call "Concurrent Appends".
Here is the gist that shows you how to use this API: https://gist.github.com/asikaria/0a806091655c6e963eea59e89fdd40a9
The method is available on the Core class in our SDK: https://azure.github.io/azure-data-lake-store-java/javadoc/com/microsoft/azure/datalake/store/Core.html
Some points to note specific to the Azure Data Lake Store implementation of Concurrent Append:
Once a file is used with concurrent appends, you cannot use fixed offset appends with it
It is possible that you may see duplicate data in the file. This is a possible side effect of error modes and automatic retries.
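For reference, here is a minimal sketch of the kind of call the linked gist makes. The account name, token, and file path below are placeholders, and the exact Core.concurrentAppend signature should be checked against the javadoc linked above.

```java
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.Core;
import com.microsoft.azure.datalake.store.OperationResponse;
import com.microsoft.azure.datalake.store.RequestOptions;
import com.microsoft.azure.datalake.store.retrypolicies.ExponentialBackoffPolicy;

import java.nio.charset.StandardCharsets;

public class ConcurrentAppendSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders - supply your own account FQDN and OAuth2 token
        ADLStoreClient client = ADLStoreClient.createClient(
                "youraccount.azuredatalakestore.net", "<OAuth2 access token>");

        byte[] payload = "one event from one thread\n".getBytes(StandardCharsets.UTF_8);

        // Each thread can call concurrentAppend on the same file without
        // coordinating offsets; the service serializes the appends.
        // Note: automatic retries can produce duplicate records (see the notes above).
        RequestOptions opts = new RequestOptions();
        opts.retryPolicy = new ExponentialBackoffPolicy();
        OperationResponse resp = new OperationResponse();

        Core.concurrentAppend("/streaming/events.txt", payload, 0, payload.length,
                true /* autoCreate */, client, opts, resp);

        if (!resp.successful) {
            throw client.getExceptionFromResponse(resp, "Error in concurrent append");
        }
    }
}
```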
Edit: The answer from Murray Foxcraft is also suitable for long-running threads with a reasonable file-rotation policy. The only thing to watch out for with that approach is making sure you don't end up with a ton of small files.
I know there is no way to update an existing S3 object or a file on an HDFS filesystem.
However, my data sources are updated with new data on a regular basis.
Currently I am thinking mainly of JDBC data sources, but later there will also be other types of data sources (e.g. a Kafka stream).
I am wondering what would be the best way to store these large amounts of data in the cloud so that I can quickly perform operations on them in Hadoop.
I would like to execute complex SQL queries on them (for example with Spark SQL), and some kind of ML algorithm will also be performed on the datasets. These processes will be initiated by users in a web interface.
As far as I know, actions can be executed relatively quickly on S3 objects from Hadoop.
My plan is to upload only the new data (i.e. what is not yet in S3 storage) as a new object version in S3. But I'm not sure I can treat the different versions of an object as one single object and execute SQL statements and run ML algorithms on the whole dataset, not just on the chunks separately.
I'm a beginner in cloud technology. For now, only the data storage part is of interest; if I understand this part a little better, I can plan the rest more easily.
So what do you think? Can I achieve it with S3 storage type? If not, what method would you suggest?
Thanks.
I have a requirement to write up to 500k records daily to Azure SQL DB using an ADF pipeline.
I have simple calculations as part of the data transformation that can be performed in a SQL Stored Procedure activity. I've also observed Databricks notebooks being used commonly, especially because of the scalability benefits going forward. But there is the overhead of placing files in another location after transformation, managing authentication, etc., and I want to avoid any over-engineering unless absolutely required.
I've tested the SQL stored procedure and it works quite well for ~50k records (not yet tested with higher volumes).
But I'd still like to know the general recommendation between the two options, especially from experienced Azure or data engineers.
Thanks
I'm not sure there is enough information to make a solid recommendation. What is the source of the data? Why is ADF part of the solution? Is this 500k rows once per day or a constant stream? Are you loading into a staging table and then using a stored procedure to move and transform the data into another table?
Here are a couple of thoughts:
If the data operation is SQL to SQL [meaning the same SQL instance for both source and sink], then use Stored Procedures. This allows you to stay close to the metal and will perform the best. An exception would be if the computational load is really complicated, but that doesn't appear to be the case here.
Generally speaking, the only reason to call Databricks from ADF is if you already have that expertise and the resources to support it already exist.
Since ADF is part of the story, there is a middle ground between your two scenarios: Data Flows. Data Flows are a low-code abstraction over Databricks. They are ideal for in-flight data transforms and perform very well at high loads. You do not author or deploy notebooks, nor do you have to manage the Databricks configuration, and they are first-class citizens in ADF pipelines.
As an experienced (former) DBA, data engineer and data architect, I cannot see what Databricks adds in this situation. The piece of the architecture you might need to scale is the target for the INSERTs, i.e. Azure SQL Database, which is ridiculously easy to scale either manually via the portal or via the REST API, if that is even required. Consider techniques such as loading into heaps and partition switching if you need to tune the insert.
The overhead of adding an additional component to your architecture and then routing your data through it would have to be worth it, plus the additional cost of spinning up Spark clusters at the same time your database is running.
Databricks is a superb tool with a number of great use cases, e.g. advanced data transforms (i.e. things you cannot do with SQL), machine learning, streaming and others. Have a look at this free resource for a few ideas:
https://databricks.com/p/ebook/the-big-book-of-data-science-use-cases
In the Apache Pulsar topic documentation it says we can set a topic's time-based retention policy to -1 for infinite retention. What are the downsides of infinite retention, and can we use Pulsar as a message store where data lives forever in topics and build event-sourcing applications around them?
The downside is that your data will grow forever. However, due to the segment-based architecture of the underlying storage (BookKeeper), more space can be added by adding storage nodes (i.e. all the data doesn't have to fit on one machine, as is the case in some other systems).
The segment-based architecture also makes it fairly straightforward to move data to a bulk storage system (S3 or similar) while still having it available from Pulsar. However, this is still in the early stages of discussion right now.
Actually, you can and should use Pulsar's tiered storage option to offload your older data to more cost-effective storage such as S3, Google Cloud Storage, or HDFS. Unlike Kafka, Pulsar decouples the serving layer from the storage layer, which is what makes this possible. In Kafka, you would have to keep adding hard drives (and broker instances to host them).
Using Pulsar's tiered storage is the better option because it provides more organization for your data store. Since Pulsar's strength is a storage layer that separates tiered storage from the topics themselves, I would recommend going that route: your data will be both more secure and easily accessible.
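For the retention part of the question, here is a hedged sketch using the Pulsar Java admin client. The service URL and namespace are placeholders, and the offload-threshold line assumes a tiered-storage offloader (S3, GCS, HDFS, etc.) has already been configured on the brokers.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class InfiniteRetentionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder admin service URL
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        String namespace = "my-tenant/event-store"; // placeholder namespace

        // -1 for both time (minutes) and size (MB) means "retain acknowledged messages forever"
        admin.namespaces().setRetention(namespace, new RetentionPolicies(-1, -1));

        // Optional: automatically offload ledgers once a topic exceeds ~10 GB
        // (only meaningful if the brokers have a tiered-storage offloader configured)
        admin.namespaces().setOffloadThreshold(namespace, 10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```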
What are the use-cases of Hazelcast Jet? Has anyone started using it?
Our project uses Hazelcast for a distributed map holding key-value pairs and for distributed computing on those keys, running each task on the node holding the key. We use the NearCache feature as well.
I was curious to know how Hazelcast Jet differs and what problems it solves.
As of the current version (0.3), Jet's advantage over just submitting a Runnable to each partition is the ability to perform grouping by a key other than the one used in the Hazelcast map. For this to work in a distributed environment you have to send each item to the processing unit responsible for its grouping key, and this is something that is easy to get from Jet.
Beyond that, you can build a multi-stage cascade of groupBy operations, you can fork your data stream to reuse the same intermediate result in more than one way, you can build a pipeline where an I/O task distributes the processing of the data it reads across all CPU cores, etc. In short, all the advantages that a full-blown DAG computation engine offers.
By the time it reaches 1.0, Jet will also support fault-tolerant infinite stream processing, event-time-based windows, and more.
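To illustrate the "group by a key other than the map key" point, here is a minimal sketch using the later Jet Pipeline API (not the 0.3 DAG API this answer refers to). The map names and the Trade class are made up for the example.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class RegroupByValueField {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, Trade>map("trades"))   // IMap keyed by trade id
         .groupingKey(e -> e.getValue().getTicker())       // regroup by a different field
         .aggregate(AggregateOperations.counting())        // count trades per ticker
         .writeTo(Sinks.map("trades-per-ticker"));

        jet.newJob(p).join();
    }

    /** Hypothetical value type, just for the example. */
    public static class Trade implements java.io.Serializable {
        private final String ticker;
        public Trade(String ticker) { this.ticker = ticker; }
        public String getTicker() { return ticker; }
    }
}
```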
2021 answer for use cases:
Change data capture streaming - use Debezium/Hazelcast to detect changes to your database and stream them to other microservices (if the data is shared), stream changes to a data lake, or update a search engine
Real-time analytics - take a market data stream and perform technical analysis in real time, or Twitter analysis
Async job processing - e.g. a PDF conversion service
I am working on a Spark Streaming job that needs to store intermediate results in order to reuse them in the next window of the stream. The amount of data is extremely large, so there is probably no way to store it in the Spark cache. What's more, I need some way to read the data back by a 'key'.
I was thinking about Cassandra as intermediate storage but it also has some drawbacks.
Alternatively, maybe Kafka would do the job, but it would require additional work to select a given portion of the data by key.
Could you advise me what I should do?
How are such problems solved in Storm: is there any internal mechanism, or is it preferred to use some external tool?
Solr as the index plus Cassandra as NoSQL storage is working fine for my use case, where I have to process terabytes of data. In my case, though, I am using Cassandra for persistent storage of years of data.
Kafka is working fine as a replacement for JBoss A-MQ due to its simple architecture. Currently I am working with Apache Storm + Kafka for real-time stream processing in one of my projects.
Since you are storing intermediate data, I think Kafka is the best choice, provided you set the right retention period.
Have a look at one more SE question and another article.
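If you go the Kafka route, here is a hedged sketch of setting the retention period on the intermediate topic with the Kafka AdminClient. The broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class IntermediateTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "intermediate-results"); // placeholder topic

            // Keep intermediate records for 2 hours - long enough to cover the next window
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(2 * 60 * 60 * 1000L)),
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> configs =
                    Collections.singletonMap(topic, Collections.singletonList(setRetention));

            admin.incrementalAlterConfigs(configs).all().get();
        }
    }
}
```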
As you mention, Kafka has some problems with getting items by key: it really only provides APIs for a FIFO paradigm. I would advise using dedicated storage software such as Cassandra or MongoDB; I have even seen Solr used to store text. It will be easier to use something designed for key-based retrieval than to try to modify Kafka yourself and most likely introduce bugs/issues that could take forever to solve.
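As a rough sketch of that dedicated-store approach, here is what writing and reading keyed intermediate results from Spark with the spark-cassandra-connector might look like. The keyspace, table, and column names are assumptions, and key filters like this are generally pushed down to Cassandra by the connector.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class IntermediateResultsInCassandra {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("intermediate-results")
                .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
                .getOrCreate();

        // Stand-in for the output of the current window's processing
        Dataset<Row> windowResults = spark.sql("SELECT 'some-key' AS id, 42 AS value");

        // Persist intermediate results keyed by 'id'
        // (assumes a table with 'id' as the partition key already exists)
        windowResults.write()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "streaming")   // placeholder keyspace
                .option("table", "intermediate")   // placeholder table
                .mode(SaveMode.Append)
                .save();

        // In the next window, read back only the keys you need
        Dataset<Row> previous = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "streaming")
                .option("table", "intermediate")
                .load()
                .filter(col("id").equalTo("some-key"));

        previous.show();
    }
}
```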
As SQL.injection said, you'll have to manage the storage and the logic yourself; Storm doesn't offer such a mechanism.