Learning to Become a Hadoop Administrator [closed] - linux

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I would like to become a Hadoop administrator. I have a copy of the book 'Hadoop Operations' and I would like to get my hands dirty with setups and the like.
So here's the question: should I invest in a physical server for practice, or is it all done in the cloud?

Don't invest in a physical server unless you're sure (and I mean SURE) you want to spend hundreds of CPU-hours on practical exercises. A more cost-effective option may be to get an account with an IaaS provider (such as Amazon) and experiment with virtual machines. You can turn off unneeded VMs when you're not doing exercises, so your costs could be a lot smaller. Plus, you can get many VMs for short periods of time without a huge upfront investment.
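To make the trade-off concrete, here is a minimal break-even sketch. The function name and all the prices are hypothetical placeholders, not real quotes, so plug in your own numbers.

```python
def break_even_hours(server_price: float, vm_hourly_rate: float, vm_count: int = 1) -> float:
    """Hours of cluster time at which renting VMs costs as much as buying a server."""
    if server_price < 0 or vm_hourly_rate <= 0 or vm_count <= 0:
        raise ValueError("price must be non-negative; rate and VM count must be positive")
    # Renting cost grows linearly: hours * rate * number of VMs.
    return server_price / (vm_hourly_rate * vm_count)


# Hypothetical figures for illustration only.
hours = break_even_hours(server_price=1500.0, vm_hourly_rate=0.10, vm_count=4)
print(f"Renting stays cheaper below roughly {hours:.0f} hours of cluster time")
```

With these made-up figures the break-even point is about 3,750 hours of running a four-VM cluster, which is a lot of practice time. The sketch ignores power, storage and data transfer charges.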
Some of the most challenging aspects of administering Hadoop are dealing with large clusters and with clusters that are highly utilized. Unfortunately, this means there is only so much you can learn on your own, as both of those scenarios can be very expensive and time-consuming to set up. So don't try to go too deep on your toy cluster; instead, get familiar with the basics and the configuration options, and then try to find a job or a project where you could join an existing ops team.

Related

Reading SQL Server log files (.ldf) with Spark [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
This is probably far-fetched, but can Spark, or any advanced "ETL" technology you know of, connect directly to SQL Server's log file (the .ldf) and extract its data?
The goal is to get SQL Server's real-time operational data without replicating the whole database first (or selecting directly from it).
Appreciate your thoughts!
Rea
To answer your question: I have never heard of any tech that reads an LDF directly, but there are several products on the market that can "link-clone" a database almost instantly by using some internal tricks. Keep in mind that these tools do not copy the data; they only provide instant access to it for use cases like yours.
There may be some free ways to do this, especially using cloud functions, or maybe the linked-clone functions that virtual machines offer, but at this time I only know about paid products, such as those from Dell EMC, Redgate and Windocks.
The easiest ones to try that are not in the cloud are:
Red Gate SQL Clone, which has a 14-day free trial
Windocks.com (this is free for some cases, but harder to get started with)

Why don't developers run games on clusters? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
With multi-core processors now common, game developers are leaning toward the use of threads, which is discussed in this question:
Why don't large programs (such as games) use loads of different threads?
To me, this seems analogous to the idea of having multiple machines running things in clusters, or parallel computing.
Some games run on dedicated servers.
My question is: Can you use clusters to maximize parallel power, in the same way threading does on a multi-core system? Will it give the same benefit? Why/why not?
You can do that, but sharing the computation of a game engine across a cluster will introduce more bottlenecks into the system, because the machines in a cluster communicate over the network, which is far slower than the CPU and main memory.
Some games use simulation clients to share large computing loads, but they need to be pretty careful about the synchronization issues caused by network delays.
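To get a feel for how much slower the network is, here is a rough sketch that counts how many back-to-back operations fit into one frame at 60 frames per second. The latency figures are order-of-magnitude assumptions of mine, not measurements, and the names are made up for the example.

```python
# Rough, order-of-magnitude latency assumptions (not measurements).
FRAME_RATE_HZ = 60
FRAME_BUDGET_NS = 1_000_000_000 // FRAME_RATE_HZ  # about 16.7 ms per frame

LATENCY_NS = {
    "main memory access": 100,                    # ~100 ns
    "round trip inside a datacenter": 500_000,    # ~0.5 ms
    "round trip over the internet": 50_000_000,   # ~50 ms
}


def sequential_ops_per_frame(latency_ns: int, budget_ns: int = FRAME_BUDGET_NS) -> int:
    """How many back-to-back operations of this latency fit in one frame."""
    return budget_ns // latency_ns


for name, latency in LATENCY_NS.items():
    print(f"{name}: {sequential_ops_per_frame(latency)} per frame")
```

Under these assumptions a frame has room for well over a hundred thousand memory accesses, a few dozen round trips inside a datacenter, and not even one round trip over the internet. That is why any work shipped to another machine has to be coarse-grained and tolerant of delay.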

Use NoSQL on a single box [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am designing software that will be deployed to a single server. I will have about 1TB of data, and there will be more writing than reading.
I have the option of buying a good server. I also have the option of using Redis and Cassandra. But I cannot do both. I doubt whether it makes sense to run NoSQL on a single node. Will I get enough of a speedup over a traditional SQL database?
This type of question is very problematic, as it calls for an opinion, which in most cases is highly subjective.
I cannot speak on Cassandra's behalf for better or worse.
Redis is an in-memory solution, which basically means that, whether reading or writing, you'll get the best performance available today. It also means that your 1TB of data will need to fit in that one good server's RAM. Also note that you'll need additional RAM to actually operate the server (OS) and Redis itself. Depending on what you do and how, you could end up with a RAM requirement of up to 2.5-3x the data's size. That means roughly 3TB of RAM... and that's a lot.
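As a back-of-the-envelope sketch of that sizing, the snippet below applies the worst-case 2.5-3x multiplier mentioned above to a data size. The function name is made up for the example, and real overhead depends heavily on data types and persistence settings.

```python
def redis_ram_estimate_tb(data_tb: float, low: float = 2.5, high: float = 3.0) -> tuple[float, float]:
    """Worst-case RAM range (in TB) for a dataset, using a 2.5-3x overhead multiplier."""
    if data_tb < 0:
        raise ValueError("data size must be non-negative")
    return data_tb * low, data_tb * high


low_tb, high_tb = redis_ram_estimate_tb(1.0)
print(f"1 TB of data -> plan for up to {low_tb:.1f}-{high_tb:.1f} TB of RAM")
```

The multiplier is an upper bound for awkward cases, so treat the result as a ceiling for planning rather than a prediction.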
If the single-server requirement isn't hard, I'd look into loosening it. No setup, Redis or not, will offer high availability off a single box. If you use a cluster, you'll be able to scale easily using cheaper, "less good" ;), servers.
If there will be more writing than reading, then Redis is probably not your answer.
Cassandra will handle heavy writes pretty well, but the key question is: do you know your read queries ahead of time? If so, then Cassandra is a good solution. However, if you plan to do ad-hoc querying then Cassandra is not the answer. This last point is actually the key one.
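To illustrate why knowing the read queries up front matters, here is a toy in-memory analogy in Python. It is not Cassandra code and does not use any Cassandra driver; the class and field names are hypothetical. Rows are grouped by a partition key chosen for the known query, so that query is a direct lookup, while any other question has to scan everything.

```python
from collections import defaultdict


class QueryFirstTable:
    """Toy stand-in for a table laid out around one known read query."""

    def __init__(self, partition_key: str):
        self.partition_key = partition_key
        self._partitions = defaultdict(list)

    def insert(self, row: dict) -> None:
        # Writes are cheap: append to the partition the row belongs to.
        self._partitions[row[self.partition_key]].append(row)

    def by_partition(self, key) -> list:
        # The planned query: one direct lookup, no scanning.
        return list(self._partitions.get(key, []))

    def ad_hoc(self, predicate) -> list:
        # An unplanned query: every row in every partition must be checked.
        return [row for rows in self._partitions.values() for row in rows if predicate(row)]


readings = QueryFirstTable(partition_key="sensor_id")
readings.insert({"sensor_id": "a", "temp": 20})
readings.insert({"sensor_id": "a", "temp": 31})
readings.insert({"sensor_id": "b", "temp": 35})

print(readings.by_partition("a"))                # fast path, planned in advance
print(readings.ad_hoc(lambda r: r["temp"] > 30))  # full scan
```

On 1TB of data the difference between the two paths is the difference between a usable system and an unusable one, which is why ad-hoc querying points toward a traditional SQL database instead.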

What is cheaper when deployed on AWS: Node.js or Java Web Services? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
What costs less money when deployed on the Amazon cloud: Node.js or Java Web Services?
Or when does it matter? We take into consideration only one-way traffic (to the server) from many clients.
They're both going to cost roughly the same in terms of hosting costs. In terms of development costs, however, things might be different:
Node is just JavaScript. It has a huge ecosystem and lots of new developers are using it; since it's quite 'hip', it's easier to find people to hop onto new projects.
Java is old school and has been around forever, so there are tons of 'senior' guys you can hire (for good $$).
Node is quite a bit faster to develop with. If you're building a small application, you might spend much less time developing it with Node than Java.

Photo Sharing Vs Storage [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
There are a lot of photo sharing applications out there; some are making money and some aren't. Photo sharing takes a lot of space, so I wonder where they host these photos. Rich services are probably using Amazon or their own servers, but what about the rest? Do they have access to some kind of free service? Or have they purchased terabytes from their web host?
AWS S3 is what you are generally referring to. The cost is mainly due to the reliability it provides for the stored data. For photo sharing, this much reliability is generally not required (compared with, say, a financial statement).
AWS also has other options such as S3 RRS (Reduced Redundancy Storage) and Glacier, which are a lot cheaper. For example, photos that have not been accessed for a long time may be kept on Glacier (retrieval takes time, but storage is cheap). RRS can be used for any transformed images, such as thumbnails, which can be reconstructed even if lost. So the good photo-sharing services make a lot of these complicated storage decisions to manage cost.
You can read more about these storage types here: http://aws.amazon.com/s3/faqs/
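The kind of storage decision described above could be sketched like this. It is only an illustration: the tier labels are plain strings rather than AWS API values, the one-year threshold is a number I picked, and the `Photo` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    days_since_last_access: int
    is_derived: bool  # True for thumbnails and other images that can be regenerated


ARCHIVE_AFTER_DAYS = 365  # hypothetical threshold, tune to your access patterns


def choose_tier(photo: Photo) -> str:
    """Pick a storage tier label for a photo using the rules of thumb above."""
    if photo.is_derived:
        # Losing a thumbnail is harmless: it can be rebuilt from the original.
        return "reduced-redundancy"
    if photo.days_since_last_access >= ARCHIVE_AFTER_DAYS:
        # Rarely viewed originals: cheap to keep, slow to retrieve.
        return "archive"
    return "standard"


print(choose_tier(Photo("beach_thumb.jpg", 3, True)))
print(choose_tier(Photo("wedding_2009.jpg", 900, False)))
print(choose_tier(Photo("yesterday.jpg", 1, False)))
```

Here "reduced-redundancy" stands for something like S3 RRS, "archive" for something like Glacier, and "standard" for regular S3.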
There is also a case study of SmugMug on AWS. I also once heard a talk by someone from SmugMug, who described initially storing photos on their own hard disks and moving to AWS later, once S3 costs had come down. Read the details here:
AWS Case Study: SmugMug's Cloud Migration : http://aws.amazon.com/solutions/case-studies/smugmug/
