Hey I am a new guy to big data. I am making a system which will fetch data from social media and process the result, for this I am using apache spark.
Following is the flow of my model:
user will save the desired keywords using a webpage made in php.
with those key words I would be fetching data from social media,
processing the data(ex, sentiments and views) and then provide it to
the end user.
Now my confusion is how should I fetch data from social media. using
apache kafka
apache flume
or by directly calling the API twitter4j(just an example).
Though I have to learn to implement all three data fetching techniques, and If I happen to use direct api then I can skip the whole hadoop part. It would be great if you guys could suggest me which one is better.
All of the above I am doing on a local machine. I have completed the ui part now I am in the phase where I have to fetch data.
Thanks.
I guess I will make this a suggestion.
You may not want to fetch data from any source using distributed system, unless you plan to DDoS their production server. If your cluster is setup behind one router, your whole cluster may be blacklisted because all nodes consistently hit the access rate limit that's adding up at your router, depending on whether the server is powerful or not. Twitter server doesn't care about 100 threads to be honest (provided you know what you are doing), but any startup will probably get to you right away.
If you have a workstation with 4 cores, having it up catching streaming data should suffice for initial stage of academic research. Or if you really want tons of data, you can perhaps do Hadoop streaming with your fetcher script as mapper and no reducer, quick and easy. If you are superstar in Java or Scala, get a fetching thread on each vcore on Spark's executor.
Now, Twitter has the REST API, which means you can pretty much fetch data in any programming language. Of course sometimes it may be easier to use existing interfaces, assuming they are well-maintained, they are almost always more robust. But I get lazy all the time. For example, I sometimes just want a sample data point, so I just pipe curl into jq to check what I want to check.
Yes, learn about jq too, will save you tons of time. And be a gentleman who doesn't DDoS people.
Related
I have code in PySpark that parallelizes heavy computation. It takes two files and several parameters, performs the computation and generates information that at this moment is stored in a CSV file (tomorrow it would be ideal that the information is stored in a Postgres database).
Now, I want to consume this functionality as a service from a system made in Django, from which users will set the parameters of the Spark service, the selection of the two files and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use Spark API-REST: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation, I get everything through blogs whose parameters and experiences correspond to a particular situation or solution. At no point does it mention, for example, how I could send files to be consumed through the API by Spark, or get the result.
Develop a simple API in Spark's Master to receive all parameters and execute the spark-submit command at the OS level. The awkwardness of this solution is that everything must be handled at the OS level, the parameter files and the final process result must be saved on disk and accessible by the Django server that wants to get it to save its information in the DB lately.
Integrate the Django app in the Master server, writing PySpark code inside It, Spark connects to the master server and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
If someone could enlighten me about this, maybe due to lack of experience I am overlooking a cleaner, more robust, or idiomatic solution.
Thanks in advance
I want to know following things so that I can fix my server architecture and make it more flexible.
Is it good to store home feed data [ex: Facebook homeFeed] to the variable for future manipulation or just fetch data related to homeFeed and manipulate everything which needs to be done on run time.
Please note that data set of home feed can contain anything. [ not developed yet ]
Is there any limit to request to MongoDB at any given time which can create a delay in data processing?
Are node.js and MongoDB a good option for social network development?
If you know anything related to social network development then please share the pros and cons.
Is it good to store home feed data [ex: Facebook homeFeed] to the variable for future manipulation or just fetch data related to homeFeed and manipulate everything which needs to be done on run time.
You can (and sometimes should) pre-compute home feed data for certain users (for example those who are the most active). You don't store that in a variable though, you cache the results with something like Redis.
Generating the home feed on a "request" basis is also possible and good.
Both approaches require careful thinking about your system's architecture, performance, scalability, robustness, fault-tolerance, etc...
Is there any limit to request to MongoDB at any given time. which can create a delay in data processing?
Yes. A MongoDB instance (or any other database) has limited resources. Look at the Sharding and Replication docs of MongoDB for more info about how to work with MongoDb at scale.
Are node.js and MongoDB are a good option for social network development.
Node.js and MongoDb are a good combinations for quick prototyping, you can get productive fairly quickly. Any language(s) you are familiair with is/are a good choice here, since your focus seems to be on architecture. Go, Java and PHP are good candidates too.
In the real world social networks are built with a lot more tools than that. Since the teams use various programming languages, databases and frameworks depending on the task at hand.
I have a simple NodeJS web app that calls several apis asynchronously and merges the results to return one big result. Now let's say that I want to optimize this. How do I do this?
I am new to NoeJS and also the concept of scaling systems. I have been reading about load balancing, distributed systems, etc... I think this is the right way to go, but honestly I don't know.
I was thinking of doing something like this -
Set up a system that has several servers, and each has an instance of a NodeJS webapp that makes an api call given a path, and returns the result.
Have a master server that grabs the result from each of these servers, and merge the result and return it to the client.
Is this right way to go? What technologies do I use? Thank you for your help.
I am guessing you are trying to setup web-crawling or api-crawling, to grab data from 3rd party end point. If that is true, you would have a list of users / IDs or something like that that you pass to the web service you call and grab the data.
First of making a large number of requests very fast and in a stable way is tricky and depends on several factors to be stable and robust.
Is the 3rd party API rate limited.
Network connection on the client machine making the requests.
Error handling for both API and client errors like connection reset etc.
Sheer volume of data you are fetching back, like if you are trying to crawl data on millions of users from 3rd party API as fast as possible.
Your instinct is correct that you would have to scale this over several servers or at-least several parallel node processes on machine with lot of resources, however start small, test, and then scale would be my recommendation. Here are a few steps.
Use a good robust node http client like axios
If you are dealing with huge number of items (username, ids. emails etc) you will need stable way of iterating over them. Put them in a database like PostgreSQL or MySQL.
From here on figure out what's the fastest rate at which your API supports calling. And write stable function to iterate over your 'input' and call the API.
Then you have a couple of options. If data you are collecting is separate for each request you make. You can save it back in the database for each input. If you literally want to merge the data from multiple API calls, you can use a key-value storage like Redis. You can give an ID to each call and create a combination key for input+request_id format, then when all requests are done, you can merge them.
When you a small scale model in place you can now add a good job manager like Kue or Bull to the mix, and split the set of inputs in database from point (2) over several jobs that can be run in parallel.
Once you have a stable job-manager for that can repeat this node process for a range of inputs , now you are at a point where you can scale.
Deploy this same code on multiple servers that all talk to same Database and Redis. Install the Node process to run using a process manager like PM2.
Finally the way setup works is, each copy of same node program fetches a different set of inputs (usernames/IDs etc) form the source database, and writes the results back to the database or Redis depending on how you want to handle the output.
Optional post processing on redis to fetch the key value pairs and merge the responses grouped by input.
Some important things you have to be hyper aware of when coding this issues are:
Memory Management: Use design patterns/code/libraries that saves you most memory. Load absolutely minimum of what you need to in memory. Eg: iterating on an array of 1 millions usernames in memory is more expensive than keeping them in database and paging over them.
Error Handing: There will be lots of them. API errors, unforeseen exceptions, memory leaks, network drops etc. Having robust error handling and recovery mechanism will save the day.
Logging: Good quality logging will be critical to keep a check on how different parts of system are doing. Look at winston.
Throttling API calls: Remember making 10,000 API calls at the same minute will likely crash your machine or even most APIs.At the very least go very slow due to memory overloads. However adding a slight delay (like 10 milliseconds) between every 10 parallel calls will be HUGE boost in speed and make the calls much more stable. This strategy is called throttling or rate-limiting the API calls. Finding a sweet spot that works for your problem is important. Yes going slow can actually make you reach goal faster!
Your question was quite broad without specific code question, this is a general strategy and hopefully will give you a good starting point and links to reference materials so you can start building your solution.
I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with source system API and saving through destination system API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such task? Is Flink's Dataset API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for. I took a look into Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint which frameworks you would use for such an task?
From my quick read on APEX it requires Hadoop underneath coupling to more dependencies than you probably want early on.
Kafka on the other hand is used for transmitting messages (it has other APIs such as streams and connect which im not as familiar with).
Im currently using Kafka to stream log files in real time from a client system. Out of the box Kafka really only provides fire and forget semantics. I have had to add a bit to make it an exactly once delivery semantic (Kafka 0.11.0 should solve this).
Overall, think of KAFKA being a more low level solution with logical message domains with queues and from what I skimmed over APEX being a more heavy packaged library with alot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing with their consumer api.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud it can quickly build up. Of course, the size of data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or the process is run on demand/periodically (sounds the latter rather than the former)
what's the latency required ? That is, what's the maximum time that it would take an update to propagate through the system ? Answer to this question influences question 1)
how much data are we talking about ? If you're up the Gbyte size, Tbyte or Pbyte ? Different tools have different 'maximum altitude'
and what format ? Do you have text files, or are you pulling from relational DBs ?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning on using to do that part ? Depending on question 3), data size, deduping usually requires a join by ID, which is done in constant time in a key value store, but requires a sort (generally O(nlogn)) in most other data systems (spark, hadoop, etc)
So, while you ponder all this questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go vs setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is amazon athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic Mapreduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html). Or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigData, etc) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex is supported with Apache Malhar which can help you build application quickly to load data using JDBC input operator and then either store it to your cloud storage ( may be S3 ) or you can do de-duplication before storing it to any sink. It also supports Dedup operator for such kind of operations. But as mentioned in previous reply, Apex do need Hadoop underneath to function.