I have a Spark dataframe that I need to send as the body of an HTTP POST request. The storage system is Apache Solr, and we create the Spark dataframe by reading a Solr collection. I can use the Jackson library to create the JSON and send it over HTTP POST. Also, the dataframe may have millions of records, so the preferred way is to send them in batches over HTTP.
Below are the two approaches I can think of.
We can use the foreach/foreachPartition operations of the Spark dataframe and make the HTTP POST call there, which means the HTTP call will happen within each executor (if I am not wrong). Is this approach right? Also, does it mean that if I have 3 executors, there will be 3 HTTP calls that can be made in parallel? But won't opening and closing the HTTP connection so many times cause issues? (A rough sketch of this approach is included right after the question.)
After getting the Spark dataframe, we can save it to another Solr collection (using Spark), then read the data from that collection in batches using the Solr API (the rows and start parameters), create JSON out of each batch, and send it over an HTTP request.
I would like to know which of the above two approaches is preferred.
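For context, a minimal sketch of approach 1, assuming the spark-solr connector for the read plus a hypothetical endpoint URL and batch size; it converts rows to JSON with toJSON (so no manual Jackson work is needed here) and posts one JSON array per batch. The JVM's HttpURLConnection keep-alive pooling reuses connections to the same host, which softens the open/close concern.

```scala
import java.io.OutputStream
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.SparkSession

object PostDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("solr-to-http").getOrCreate()

    // Assumption: the dataframe is read with the spark-solr connector;
    // collection name and ZooKeeper host are placeholders.
    val df = spark.read.format("solr")
      .option("collection", "my_collection")
      .option("zkhost", "zk1:2181/solr")
      .load()

    val endpoint  = "http://example.com/ingest" // hypothetical receiving endpoint
    val batchSize = 1000                        // tune to your payload limits

    // toJSON turns each Row into a JSON string; post one JSON array per batch.
    df.toJSON.rdd.foreachPartition { rows =>
      rows.grouped(batchSize).foreach { batch =>
        val payload = batch.mkString("[", ",", "]")
        val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out: OutputStream = conn.getOutputStream
        try out.write(payload.getBytes(StandardCharsets.UTF_8)) finally out.close()
        val status = conn.getResponseCode // forces the request to be sent
        if (status >= 300) sys.error(s"POST failed with HTTP $status")
        conn.disconnect()
      }
    }
  }
}
```

Each executor posts its own partitions, which matches the "3 executors, 3 parallel calls" intuition.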
"After getting the Spark dataframe, we can save it to another Solr collection (using Spark), then read the data from that collection in batches using the Solr API (the rows and start parameters), create JSON out of each batch, and send it over an HTTP request."
Out of your two approaches, the second one is best, since solrj gives you a paging feature.
1) Save your dataframe to a Solr collection as indexed documents.
2) Use the solrj API, which will interact with your Solr collection and return Solr documents based on your criteria (a rough sketch of the paging is shown below).
3) Convert them into JSON using any parser and present them in UIs or to user queries.
In fact this is not a new approach; people who use HBase with Solr do the same thing (since querying HBase is really slow compared to querying Solr collections): each HBase table is indexed as a Solr collection, queried via solrj, and presented to dashboards, e.g. ones built with AngularJS.
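A hedged sketch of steps 2 and 3 with SolrJ's start/rows paging (SolrJ 8.x assumed; Solr URL, collection name, query, and sort field are placeholders). For very deep paging, Solr's cursorMark is usually preferred over start/rows:

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object SolrPager {
  def main(args: Array[String]): Unit = {
    // Placeholder Solr URL and collection.
    val client     = new HttpSolrClient.Builder("http://localhost:8983/solr").build()
    val collection = "my_result_collection"
    val pageSize   = 500

    var start = 0
    var done  = false
    while (!done) {
      val q = new SolrQuery("*:*")
      q.setStart(start)
      q.setRows(pageSize)
      q.setSort("id", SolrQuery.ORDER.asc) // stable sort so pages don't overlap

      val docs = client.query(collection, q).getResults
      (0 until docs.size()).foreach { i =>
        val doc = docs.get(i)
        // Convert each SolrDocument to JSON with any parser and send it on / show it in the UI.
        println(doc.getFieldValueMap)
      }

      start += pageSize
      done = docs.size() < pageSize
    }
    client.close()
  }
}
```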
I am re-designing a project I built a year ago, when I was just starting to learn how to code. I used the MEAN stack back then and want to convert it to a PERN stack now. My AWS knowledge has also grown a bit and I'd like to expand on these new skills.
The application receives real-time data from an API, which I clean up, write to a database, and broadcast to connected clients.
To better conceptualize this question I will refer to the following items:
api-m1: receives the incoming data and passes it through my schema; I then send it to my socket-server.
socket-server: handles the WSS connections to the application's front-end clients. It also writes the data it gets from the scraper and api-m1 to a Postgres database. I would like to turn this into a cluster eventually, since I am using Node.js, and will incorporate Redis. Then I will run it behind an ALB with sticky sessions etc. across multiple EC2 instances.
RDS: the Postgres table that socket-server writes incoming scraper and api-m1 data to. RDS is used to fetch the most recent stored data along with user profile config data. NOTE: the main RDS data table will have at most 120-150 UID records with 6-7 columns.
From a database perspective, what would be the quickest way to write my data to RDS?
Assume that during peak times we have 20-40 records/s from api-m1 plus another 20-40 records/s from the scraper. At the end of each day I tear down the database using a Lambda function and start again (the data is only temporary and does not need to be kept for any prolonged period of time).
1. Should I INSERT each record using a SERIAL id, and then from the frontend fetch the most recent rows based on the uid?
2.a Should I UPDATE each UID, so I'd have a fixed N rows of data which I just search and update? (I can see this bottlenecking with my Postgres client.)
2.b Still use UPDATE but do BATCHED updates. (What issues will I run into if I make multiple clusters? I.e., will I run into concurrency problems where table record XYZ has an older value overwrite a more recent value because I'm using BATCH UPDATE with Node clusters? A rough sketch of this option is shown below.)
My concern is that UPDATEs are slower than INSERTs, but I don't need to squeeze out maximum speed: this section of the application isn't CPU-heavy and the real-time data isn't that intensive.
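For reference, a minimal sketch of what option 2.b could look like on Postgres via JDBC, with made-up connection details, table, and column names: one batched INSERT ... ON CONFLICT (uid) DO UPDATE, where the WHERE guard keeps an older batch from overwriting a newer row (the stale-overwrite concern from 2.b).

```scala
import java.sql.DriverManager

object BatchUpsert {
  // Hypothetical record shape: one row per UID holding the latest value.
  final case class Tick(uid: String, price: Double, ts: Long)

  def writeBatch(batch: Seq[Tick]): Unit = {
    // Placeholder RDS connection details.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://my-rds-host:5432/rt", "app_user", "secret")
    try {
      conn.setAutoCommit(false)
      // ON CONFLICT keeps a fixed N rows (one per UID) instead of an ever-growing table.
      val sql =
        """INSERT INTO latest_ticks (uid, price, ts)
          |VALUES (?, ?, ?)
          |ON CONFLICT (uid) DO UPDATE
          |  SET price = EXCLUDED.price, ts = EXCLUDED.ts
          |  WHERE EXCLUDED.ts > latest_ticks.ts  -- stale batches cannot overwrite newer rows
          |""".stripMargin
      val ps = conn.prepareStatement(sql)
      batch.foreach { t =>
        ps.setString(1, t.uid)
        ps.setDouble(2, t.price)
        ps.setLong(3, t.ts)
        ps.addBatch()
      }
      ps.executeBatch()
      conn.commit()
      ps.close()
    } finally conn.close()
  }
}
```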
To make my comments an answer:
You don't seem to need SQL semantics for anything here, so I'd just toss RDS and use e.g. Redis (or DynamoDB, I guess) for that data store.
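A minimal sketch of that suggestion with the Jedis client (host, key layout, and field names are assumptions): one hash per UID that is overwritten on every incoming record, so the store never grows beyond the ~150 UIDs, and a TTL can replace the nightly teardown Lambda.

```scala
import redis.clients.jedis.Jedis

object RedisLatest {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)   // placeholder host/port

    // Write path: one hash per UID, overwritten on every incoming record.
    val uid = "UID123"                         // hypothetical UID
    jedis.hset(s"tick:$uid", "price", "101.5")
    jedis.hset(s"tick:$uid", "ts", System.currentTimeMillis().toString)
    jedis.expire(s"tick:$uid", 60 * 60 * 24)   // auto-expire instead of a nightly teardown

    // Read path: fetch the latest values for a UID for the socket-server / frontend.
    val latest = jedis.hgetAll(s"tick:$uid")
    println(latest)

    jedis.close()
  }
}
```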
We use Apache NiFi to get data from multiple sources like Twitter and Reddit at a specific interval (for example 30 s). Then we would like to send it to Apache Kafka, and it should probably somehow group both Twitter and Reddit messages into one topic, so that Spark always receives the data from both sources for a given interval at once.
Is there any way to do that?
@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, you should route it to UpdateAttribute and set the attribute topicName to what you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you configure all the upstream routes as above, you can route all the different data sources to a single PublishKafka processor that uses ${topicName} dynamically.
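On the Spark side, a minimal Structured Streaming sketch that consumes the single combined topic (broker address and topic name are placeholders); a 30-second trigger mirrors the NiFi scheduling interval.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SocialStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nifi-kafka-spark").getOrCreate()

    // Both Twitter and Reddit flowfiles land on this one topic via PublishKafka.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "social-combined")            // placeholder topic name
      .load()

    // 30-second micro-batches: each batch contains whatever both sources produced in that window.
    val query = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```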
We are using Apache Livy 0.6.0-incubating and its REST API to call a custom Spark jar via the /batches/ API.
The custom Spark code reads data from HDFS and does some processing. The code succeeds and the REST response is also 'SUCCESS'. We want the data to be returned to the client, the way the /sessions/ API returns data. Is there a way to do this?
Note: the /sessions/ API can only accept Spark Scala code.
I have a similar setup. The way I return the data is by writing the Spark result to HDFS, and when I receive a SUCCESS I read that HDFS output from the client machine to get the result.
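A hedged sketch of that pattern, with placeholder paths: the jar submitted through /batches writes its result to an agreed-upon HDFS location, and the client reads that location (e.g. over WebHDFS or hdfs dfs -cat) once /batches/{id} reports success.

```scala
import org.apache.spark.sql.SparkSession

object LivyBatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("livy-batch-result").getOrCreate()

    // Placeholder input path and processing; the real job logic goes here.
    val input  = spark.read.parquet("hdfs:///data/input")
    val result = input.groupBy("some_key").count()

    // Write to a path the client knows about (e.g. derived from a batch id passed in args).
    // The client polls /batches/{id} and, on 'success', reads this path back.
    result.coalesce(1).write.mode("overwrite").json("hdfs:///results/batch-42")
    spark.stop()
  }
}
```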
I have 2 approaches:
Approach #1
Kafka --> Spark Stream (processing data) --> Kafka -(Kafka Consumer)-> Nodejs (Socket.io)
Approach #2
Kafka --> Kafka Connect (processing data) --> MongoDB -(mongo-oplog-watch)-> Nodejs (Socket.io)
Note: in approach #2, I use mongo-oplog-watch to detect when data is inserted.
What are the advantages and disadvantages of using Kafka as storage vs. another store like MongoDB in a real-time application context?
Kafka topics typically have a retention period (defaulting to 7 days), after which the messages are deleted. That said, there is no hard rule that you must not persist data in Kafka.
You can set the topic retention period to -1 (reference)
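For example, this can be done with the Kafka AdminClient (broker and topic name are placeholders); retention.ms = -1 disables time-based deletion for that topic.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object InfiniteRetention {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder broker

    val admin = AdminClient.create(props)
    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "events")   // placeholder topic

    // retention.ms = -1 means the topic's data is never removed by time-based retention.
    val op: AlterConfigOp = new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET)
    val ops: java.util.Collection[AlterConfigOp] = Collections.singletonList(op)
    val configs: java.util.Map[ConfigResource, java.util.Collection[AlterConfigOp]] =
      Collections.singletonMap(topic, ops)

    admin.incrementalAlterConfigs(configs).all().get()
    admin.close()
  }
}
```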
The only problem I know of with persisting data in Kafka is security. Kafka, out of the box (at least as of now), doesn't provide data-at-rest encryption. You need to go with a custom (or home-grown) solution to get that.
Protecting data-at-rest in Kafka with Vormetric
There is also a KIP for this, but it is still under discussion:
Add end to end encryption in Kafka (KIP)
MongoDB, on the other hand, seems to provide data-at-rest encryption:
Security data at rest in MongoDB
Most importantly, it also depends on the type of data you are going to store and what you want to do with it.
If you are dealing with data access that is more complex than a simple key-value model (give the key, get the value), for example querying by indexed fields (as you typically do with logs), then MongoDB probably makes sense.
In simple words: if you are querying by more than one field (other than the key), then storing the data in MongoDB could make sense. If you intend to use Kafka for such a purpose, you would probably end up creating a topic for every field that should be queried, which is too much.
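For illustration, the kind of multi-field query that is natural in MongoDB but awkward on raw Kafka topics (URI, database, collection, and fields are made up):

```scala
import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters

object MultiFieldQuery {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017") // placeholder URI
    val logs   = client.getDatabase("app").getCollection("logs")  // placeholder db/collection

    // Query by two indexed fields at once; doing this on Kafka would need either a full
    // topic scan or one derived topic per queryable field.
    val cursor = logs.find(Filters.and(
      Filters.eq("level", "ERROR"),
      Filters.gte("timestamp", 1700000000000L)
    )).iterator()

    while (cursor.hasNext) println(cursor.next().toJson)
    client.close()
  }
}
```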
I am a newbie trying to get going with ArangoDB. I want to run a batch of AQL queries that are interdependent on each other, much like what we do in PL/SQL. I tried clubbing two or more queries into one POST/GET request through Foxx but it didn't work. Can someone suggest a better way to do this, or a tutorial for it?
It all depends on what client is accessing the database.
E.g. we are using Java and the Java driver to access ArangoDB. Then you can either make a transaction call or run a single AQL query that embeds the dependent sub-queries, as sketched below.
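A hedged sketch with the ArangoDB Java driver (6.x assumed; host, database, collections, and query are placeholders): one AQL string whose second part depends on the result of a LET sub-query, executed in a single round trip.

```scala
import com.arangodb.ArangoDB
import com.arangodb.entity.BaseDocument
import com.arangodb.util.MapBuilder

object DependentAql {
  def main(args: Array[String]): Unit = {
    val arango = new ArangoDB.Builder().host("localhost", 8529).user("root").password("pw").build()
    val db = arango.db("mydb") // placeholder database name

    // One round trip: the second FOR depends on the result of the first (LET) query.
    val aql =
      """LET activeUsers = (FOR u IN users FILTER u.active == true RETURN u._id)
        |FOR o IN orders
        |  FILTER o.user IN activeUsers AND o.total >= @minTotal
        |  RETURN o
        |""".stripMargin

    val bindVars = new MapBuilder().put("minTotal", Integer.valueOf(100)).get()
    val cursor = db.query(aql, bindVars, classOf[BaseDocument])
    while (cursor.hasNext) println(cursor.next().getProperties)

    arango.shutdown()
  }
}
```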
The question is: if the AQL queries are interdependent on each other, why would you run them in one request? How would you get the results of each one?
Take a look at the Gremlin language (a graph query language): it uses WebSockets, and the result of each query is returned in binary form over the WS connection, so batching such queries wouldn't make much sense. (Just a note: ArangoDB also has a provider for the Gremlin API.)
I expect you are accessing ArangoDB over HTTP and are now trying to save HTTP requests. If that is the case, I would recommend writing your own API layer that exposes an interface where you can batch the requests. That API layer would then make the two calls to Arango (e.g. in parallel), get the results, and merge them into the final output.