Cluster design for downloading/streaming a dataset to a user - apache-spark

In our system, we have two components: a Cloudera Hadoop cluster (CDH) and an OpenShift "backend" system. In HDFS, we have some huge .parquet files.
We now have a business requirement to export the data matching a user-supplied filter criterion to that user in "real time" as a downloadable file. The flow is: the user enters a SQL-like filter string, for instance user='Theo' and command='execution', and sends a GET /export request to our backend service with the filter string as a parameter. The web browser should then offer a file download and immediately start downloading the result as CSV (even if it is multiple terabytes or even petabytes in size; it is the user's choice whether to try it and wait that long). The cluster should respond synchronously, but it must not cache the entire response on a single node before sending it. Instead, it should fetch data only at the user's "internet speed" and stream it directly to the user, with a buffer of e.g. 10 or 100 MB.
I now face the problem of how best to approach this requirement. My considerations:
I wanted to use Spark for that. Spark would read the Parquet file, apply the filter easily, and then "coalesce" the filtered result to the driver, which in turn streams the data back to the requesting backend/client. During this task, the driver should of course not run out of memory if the data is sent back to the backend/user too slowly; the executors should simply deliver the data at the same speed as it is "consumed".
However, I face some problems here:
The standard use case is that the user has fine-grained filters, so the exported file contains only something like 1000 lines. If I submitted a new Spark job via spark-submit for each request, I would already incur latencies of multiple seconds due to initialization and query plan creation (even if it is as simple as reading and filtering the data). I'd like to avoid that.
The cluster and the backend are strictly isolated. The operations team ideally doesn't want us to reach the cluster from the backend at all; the cluster should only call the backend. We may be able to "open" one port, but we probably won't be able to argue for something like "our backend runs the Spark driver while being connected to the cluster as execution backend".
Is it a bad design smell if we run a "server Spark job", i.e. we submit an application in "client" mode to the cluster master that also opens a port for HTTP requests and only runs a Spark pipeline on request, but holds the Spark context open all the time (and is reachable from our backend via a fixed URL)? I know that there is the "spark-job-server" project, which does this, but it still feels a bit weird given the nature of Spark and jobs, where a job would "naturally" be to download a file and not to be a server running 24 hours a day, waiting to execute some pipeline steps from time to time.
I have no idea how to limit Spark's result fetching so that the executors send data at a speed at which the driver won't run out of memory if the user requests petabytes. Any suggestions on this?
Is Spark a good choice for this task after all, or do you have any suggestions for better tooling here? (Ideally within a CDH 5.14 environment, as we can't get the operations team to install any additional tools.)
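The flow-control requirement in this question (produce rows only as fast as the client consumes them, with a fixed-size buffer in between) can be illustrated independently of Spark. The following is a minimal sketch, assuming a plain Java producer thread and a bounded queue; the class name BoundedRowBuffer and its methods are hypothetical and not part of Spark or any library. It only shows the blocking-buffer idea, not how to wire it to executors.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper: hands rows from a fast producer to a slow consumer through a fixed-size buffer. */
final class BoundedRowBuffer {
    // Unique end-of-stream marker, compared by identity so it cannot collide with a real row.
    private static final String END = new String("END");

    private final BlockingQueue<String> queue;

    BoundedRowBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Starts a daemon thread that pushes rows; put() blocks whenever the buffer is full. */
    Thread startProducer(Iterator<String> rows) {
        Thread producer = new Thread(() -> {
            try {
                while (rows.hasNext()) {
                    queue.put(rows.next());
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop producing if cancelled
            }
        });
        producer.setDaemon(true);
        producer.start();
        return producer;
    }

    /** Blocks until a row is available; returns null once the stream has ended. */
    String take() {
        try {
            String row = queue.take();
            return row == END ? null : row;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a row", e);
        }
    }

    /** Number of rows currently held in memory; never exceeds the configured capacity. */
    int buffered() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedRowBuffer buffer = new BoundedRowBuffer(2);
        buffer.startProducer(Arrays.asList("a,1", "b,2", "c,3").iterator());
        String row;
        while ((row = buffer.take()) != null) {
            System.out.println(row); // stands in for writing to the HTTP response
        }
    }
}
```

Because the producer blocks in put() while the buffer is full, memory use is bounded by the capacity regardless of how large the full result is; the open question in a Spark setting remains how to make the upstream source pause in the same way.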

Related

periodic refresh of static data in Structured Streaming and Stateful Streaming

I am trying to implement 5-minute batch monitoring using Spark Structured Streaming, where I read from Kafka, look up against two different static datasets (one huge and one smaller) as part of the ETL logic, and call a REST API to send the final results to an external application (out of billions of records from Kafka, fewer than 100 will go out to the REST API after ETL).
How can I refresh the static lookups without restarting the whole streaming application? (A StreamingQueryListener registered via the StreamingQueryManager.addListener method, with our own logic for refreshing/recreating the static df via StreamingQuery.awaitTermination? Or persist and unpersist the cache? Or any other better ideas?)
Note: I went through the article below, but I'm not sure whether HBase is the better option, as the article is an old one.
https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record is enriched with lookup information and some rules/conditions have been applied, we need to keep track of it and send updates via the REST API until it has completed its event lifecycle as per our custom logic. I am hoping a flatMapGroupsWithState implementation helps here to keep track of event state. Please suggest the best options here.
Managing group state within HDFS vs. using HBase: please suggest the best option from an operationalization and monitoring point of view in a production environment where the support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how do we keep the event state tracking consistent if the REST API fails to send updates to the end user/system?

nodejs - run a function at a specific time

I'm building a website that some users will visit. After a specific amount of time, an algorithm has to run that takes the users' input stored in the database and creates some results for them, storing the results in the database as well. The problem is that in Node.js I can't figure out where and how I should implement this algorithm so that it runs after a specific amount of time and only once (every few minutes or seconds).
The app is built with Node.js and Express.js.
For example, let's say that I start the application, and after 3 minutes the algorithm should run, take some data from the database, and, once it has created some output, store it in the database again.
What are the typical solutions for that? (At least one is enough.) Thank you!
Let's say you have a user request that saves a URL to crawl in order to get listed products.
One of the simplest ways would be:
On each user request, create a row in a "tasks" table in the DB:
userId | urlToCrawl | dateAdded | isProcessing | ....
Then, in the main Node site, you have something like setInterval(findAndProcessNewTasks, 60000),
so every 1 min (or whatever interval you need) it picks up all tasks that are not currently being worked on (where isProcessing is false).
findAndProcessNewTasks queries the DB and runs your algorithm for every record that has not been processed yet.
It also sets isProcessing to true.
Once the algorithm has finished, it removes the record from tasks (or sets some other field, like "finished", to true).
Depending on the load and the number of tasks, it may make sense to run your algorithm in another Node app.
Typically you would have a message bus (Kafka, RabbitMQ, etc.), with the main app just sending events and worker Node.js apps doing the actual job and inserting products into the DB.
This would keep the main app lightweight and allow scaling the worker apps.
From your question it's not clear whether you want to run the algorithm on the web server (perhaps processing input from multiple users) or on the client (processing the input from a particular user).
If the former, then use setTimeout(), or something similar, in your main javascript file that creates the web server listener. Your server can then be handling inputs from users (via the app listener) and in parallel running algorithms that look at the database.
If the latter, then use setTimeout(), or something similar, in the javascript code that is being loaded into the user's browser.
You may actually need some combination of the above: code running on the server to periodically do some processing on a central database, and code running in each user's browser to periodically refresh the user's display with new data pulled down from the server.
You might also want to implement a websocket and json rpc interface between the client and the server. Then, rather than having the client "poll" the server for the results of your algorithm, you can have the client listen for events arriving on the websocket.
Hope that helps!
If I understand you correctly - I would just send the data to the client-side while rendering the page and store it into some hidden tag (like input type="hidden"). Then I would run a script on the server-side with setTimeout to display the data to the client.

spring batch design advice for processing 50k files

We have more than 50k files coming in every day that need to be processed. For that, we have developed POC apps with the following design:
A polling app continuously picks up files from the FTP zone.
It validates each file and creates metadata in a DB table.
Another poller picks 10-20 files from the DB (only file ID and status) and delivers them to slave apps as messages.
A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the DB or another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch a Spring Batch job in the slaves, or should we directly implement the read, process, and write logic as plain Java or Spring bean objects?
We need some insight on launching this job, where a slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.
With Spring Batch, you get the whole retry, skip, and restart handling out of the box. Moreover, it provides readers and writers that are optimised for bulk operations. This works very well, and you only have to concentrate on writing the appropriate mappers and similar code.
If you use plain Java or Spring bean objects, you will probably end up developing such infrastructure code yourself, including all the effort needed for testing and so on.
Concerning your design:
Besides validating the file and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this:
step1:
  tasklet to create an entry in the metadata table with columns like
    FILE_TO_PROCESS: XY.txt
    STATE: START_LOADING
    DATE: ...
    ATTEMPT: ... first attempt
step2:
  read and validate each line of the file and store it in a data table
    DATA: ........
    STATE:
    FK_META_TABLE: foreign key to the metadata table
step3:
  update the metadata table with status LOAD_COMPLETED
    STATE: LOAD_COMPLETED
So, as soon as your metadata table entry gets the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
If something fails, you can just fix the file and reload it.
Then, to process further, you could just have jobs that poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all files that are ready.
You could even have several slave processes polling from time to time. Just do a read for update on the state of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
With this solution, you don't need a message infrastructure and you can still scale the whole application without any problems.
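The optimistic approach mentioned above boils down to a compare-and-set on the state column: a slave may only process a file if it manages to move the state from LOAD_COMPLETED to some "in progress" value. The following is a minimal in-memory sketch of that idea, assuming a ConcurrentHashMap as a stand-in for the metadata table; the class FileClaimRegistry and the PROCESSING state are hypothetical names introduced for illustration. Against a real database, the same effect is usually achieved with a conditional UPDATE whose WHERE clause checks the old state and by inspecting the affected row count.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the metadata table (hypothetical; a real setup would use the database). */
final class FileClaimRegistry {
    private final Map<String, String> stateByFile = new ConcurrentHashMap<>();

    /** Called by the load job once all lines of a file have been validated and stored. */
    void markLoaded(String file) {
        stateByFile.put(file, "LOAD_COMPLETED");
    }

    /**
     * Atomically moves a file from LOAD_COMPLETED to PROCESSING.
     * Returns true for exactly one caller per loaded file; everyone else gets false.
     */
    boolean tryClaim(String file) {
        return stateByFile.replace(file, "LOAD_COMPLETED", "PROCESSING");
    }

    String state(String file) {
        return stateByFile.get(file);
    }

    public static void main(String[] args) {
        FileClaimRegistry registry = new FileClaimRegistry();
        registry.markLoaded("XY.txt");
        System.out.println("slave A claims: " + registry.tryClaim("XY.txt"));
        System.out.println("slave B claims: " + registry.tryClaim("XY.txt"));
    }
}
```

Only the slave whose claim succeeds goes on to process the file, so several pollers can run side by side without a message infrastructure.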

Designing a message processing system

I have been asked to create a message processing system as follows. As I am not sure whether this is the right place to post this, feel free to move it to any other appropriate SC group.
Problem
The server has about 100 to 500 clients connected at any moment. When a client connects to the server, the server loads part of their data and caches it in memory for faster access. The server will receive between 200 and 1000 messages per second across all clients. These messages are relatively small (about 500 bytes). Any changes to data in the cache should be saved to disk as soon as possible. When a client disconnects, all their data is saved to disk and removed from the cache. Each message contains some instruction and a text message, which will be saved as a file. Instructions should be executed as fast as possible (near instant), and all clients using that file should get the update. Only writing the modified message to disk can be delayed.
Here is my solution in a diagram
My solution consists of a web server (HTTP or socket), a message queue, and two or more instances of the file server and the instruction server.
The web server grabs client messages and, if there is a message available for a client in the message queue, pushes it back to the client.
The instruction processor grabs instructions from the queue, creates the necessary message to be processed by the file server (get/set file), waits for the file to become available in the queue, and then does further processing to create another message for the client.
The file server only provides the files, either from cache or from the physical file, depending on the type of file.
Concerns:
There are peak times when the total number of connected clients might go over 10000 at once and the total messages received from clients increase to 10~15K.
I should be able to clear the queue and go back to the normal state as soon as possible (while still processing requests, obviously).
I should be able to add extra instruction processors and file servers on the fly without having to shut down the other instances.
In case the file server crashes, it shouldn't lose files, so it has to write files to disk as soon as there are any changes and processing time is available.
The file system should be in B+ tree format so that some applications (local reporting apps) can easily access files without having to go through the queue server.
My Solution
I am thinking of using Node.js for the socket/web server, and maybe a NoSQL database for the file server and a queue server such as RabbitMQ, or Node_Redis and Redis.
Questions:
Is there a better way of structuring this system?
What are my other options for components of this system?
Is it possible to run all the instances on the same server machine or even in the same application (in different threads)?
You have a couple of holes here, mostly around the web server "pushing" the message back to the client. That doesn't really work in a web-based world. You can try and use websockets, but generally, this ends up being polling based.
I don't know what the "instructions" to be executed are, but saving 1000 500-byte messages is trivial. Many NoSQL solutions boast a capacity of a million or more writes per second, especially if you let the commit to disk lag.
Don't bother with the queue for the return of the file. A good NoSQL solution will scale better. Build out a Cassandra cluster and load test it until it can handle your peak load.
This simplifies your architecture into one or more web servers, clients polling those servers for file updates, a queue for submitting "messages" to the "instruction server" (also known as an application server in web-developer terms), and a NoSQL database for the instruction server to write files to.
This makes scaling easy: you can always add more web servers, and with a decent cluster size for your NoSQL database, you should be able to scale horizontally there as well. Your only real bottleneck is your instruction server queue, which you could always throw more instruction servers at.

Hazelcast - OperationTimeoutException

I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case goes as follows:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object, and stores the object back into the map (for that I use the get and set API of the IMap object).
Every once in a while Hazelcast encounters timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate, which makes things even worse: nodes constantly leave and re-join the cluster, and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedback on my use case will be highly appreciated.
Why don't you make use of an entry processor? It is also sent to the machine owning the partition, and the load, modify, and store steps are done automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly, since there is less remoting involved.
The fact that map.get does not return for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5, we added some logging/debugging facilities for this: the slow operation detector (executing side) and the slow invocation detector (caller side), which should give you some insight into what is happening.
Do you see any Health monitor logs being printed?
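To illustrate why an atomic per-key update avoids the race that a separate get followed by set can run into, here is a minimal single-JVM analogy. It deliberately does not use the Hazelcast API: it assumes a plain java.util.concurrent.ConcurrentHashMap, where compute() plays the role that an entry processor plays on the partition owner. The class AtomicCounterStore and its methods are hypothetical names for this sketch only.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Single-JVM analogy only; this is not Hazelcast code. */
final class AtomicCounterStore {
    private final ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();

    /** Racy: another thread can write between get and put, and its update is then lost. */
    void incrementWithGetAndPut(String key) {
        Integer current = map.get(key);
        map.put(key, current == null ? 1 : current + 1);
    }

    /** Atomic: the read-modify-write runs as one operation on the key's entry. */
    void incrementAtomically(String key) {
        map.compute(key, (k, v) -> v == null ? 1 : v + 1);
    }

    int get(String key) {
        return map.getOrDefault(key, 0);
    }

    /** Runs the atomic increment from several threads and returns the final count. */
    static int runConcurrently(int threads, int perThread) {
        AtomicCounterStore store = new AtomicCounterStore();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    store.incrementAtomically("k");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return store.get("k");
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(8, 1000));
    }
}
```

With the atomic variant, no increments are lost regardless of thread interleaving; the get-and-put variant gives no such guarantee under concurrent writers.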
