In our company, we use HDFS. So far everything works out and we can extract data by using queries.
In the past I worked a lot with R. It was always great for my analyses, so I looked into R's support for HDFS (rbase, rhdfs, ...).
Nevertheless, I am a little bit confused, since I found tons of tutorials in which the analyses are done on simple data saved in CSV files. Don't get me wrong, that's fine, but I want to ask whether it is possible to write queries, extract the data, and do some statistics in one run.
Or in other words: When we talk about statistics for data stored in HDFS, how do you handle this?
Thanks a lot, and hopefully some of you can help me see the pros and cons here.
All the best -
Peter
You might like to check out Apache Hive and Apache Spark. There are many other options, but I am not sure whether you are asking how to work on data in HDFS when it is not handed to you as a file.
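If Spark is an option, a query and the statistics can indeed be done in one run, straight from HDFS. Here is a minimal PySpark sketch (R users can do the equivalent with SparkR or sparklyr); the path and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-stats").getOrCreate()

# Read directly from HDFS -- no intermediate CSV export needed.
df = spark.read.parquet("hdfs:///data/sales/")

# "Query" and statistics in a single job: filter, group, aggregate.
stats = (df.filter(F.col("year") == 2016)
           .groupBy("region")
           .agg(F.mean("revenue").alias("mean_revenue"),
                F.stddev("revenue").alias("sd_revenue"),
                F.count(F.lit(1)).alias("n")))

stats.show()
```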
It's a bit of an architectural question. I need to design an application using Spark and Scala as the primary tools, and I want to minimise manual intervention as much as possible.
I will receive a zip containing multiple files with different structures as input at a regular interval, say daily. I need to process it using Spark, and after the transformations I need to move the data to a back-end database.
I want to understand the best way to design the application.
What would be the best way to process the zip?
Can Spark Streaming be considered an option, given the frequency of the files?
What other options should I take into consideration?
Any guidance would be really appreciated.
It's a broad question; there are batch options and streaming options, and I'm not sure of your exact requirements. You may start your research here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
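Not knowing your exact requirements, here is a minimal sketch of the file-source approach from that link, written in PySpark (the Scala API is equivalent). The paths, schema, and JDBC settings are made up, and the daily zip is assumed to be unpacked into the landing directory by an upstream step, since Spark's file source does not read zip archives directly:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, TimestampType

spark = SparkSession.builder.appName("daily-file-ingest").getOrCreate()

# Example schema -- adjust to the real file layout.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

# Watch a landing directory; each new file is picked up automatically.
raw = (spark.readStream
            .schema(schema)
            .option("header", "true")
            .csv("hdfs:///landing/daily/"))

transformed = raw.filter(col("amount") > 0)

# Push each micro-batch to the back-end database over JDBC.
def write_batch(batch_df, batch_id):
    (batch_df.write
             .format("jdbc")
             .option("url", "jdbc:postgresql://dbhost:5432/warehouse")  # hypothetical connection
             .option("dbtable", "daily_facts")
             .option("user", "etl")
             .option("password", "secret")
             .mode("append")
             .save())

query = (transformed.writeStream
                    .foreachBatch(write_batch)
                    .option("checkpointLocation", "hdfs:///checkpoints/daily-ingest")
                    .start())

query.awaitTermination()
```

For a once-a-day zip, a plain batch job triggered by a scheduler is also a perfectly valid option; streaming mainly buys you the automatic file discovery and checkpointing.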
Just wondering: is there a list of questions to ask ourselves in order to know whether Spark is the right tool or not?
Once again I spent part of the week implementing a POC with Apache Spark in order to compare its performance against pure Python code, and I was baffled when I saw the 1/100 ratio (in favor of Python).
I know that Spark is a "big data" tool and everyone keeps saying "Spark is the right tool to process TB/PB of data" but I think that is not the only thing to take into account.
In brief, my question is: when given small data as input, how can I know whether the computation will be heavy enough that Spark can actually improve things?
I'm not sure if there is such a list, but if there was, the first question would probably be
Does your data fit on a single machine?
And if the answer is 'Yes', you do not need Spark.
Spark is designed to process data, in a fault-tolerant manner, that is too large to be handled by a single machine, as an alternative to Hadoop MapReduce.
There are lots of overheads associated with operating in a distributed manner, such as fault tolerance and network communication, which cause the apparent slow-down compared to traditional tools on a single machine.
Just because Spark can be used as a parallel processing framework on a small dataset does not mean that it should be used that way. You will get faster results and less complexity by using, say, Python and parallelizing your processing with threads or processes, as in the sketch below.
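For example, here is a minimal single-machine sketch using only the Python standard library (the workload and record format are made up; processes are used rather than threads because the work is CPU-bound):

```python
from concurrent.futures import ProcessPoolExecutor

def score(record):
    # Stand-in for some CPU-bound per-record work.
    return sum(ord(c) for c in record) % 97

records = [f"record-{i}" for i in range(100_000)]

# Parallelise across the local cores: no cluster, no shuffles,
# no scheduling or serialisation overhead.
with ProcessPoolExecutor() as pool:
    results = list(pool.map(score, records, chunksize=1_000))
```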
Spark excels when you have to process a dataset that does not fit onto a single machine, when the processing is complex and time-consuming, and when the probability of encountering an infrastructure issue is high enough that a failure would mean starting again from scratch.
Comparing Spark to native Python is like comparing a locomotive to a bicycle. A bicycle is fast and agile, until you need to transport several tonnes of steel from one end of the country to the other: then - not so fun.
I'm looking into using Cassandra to store 50M+ documents that I currently have in XML format. I've been hunting around but I can't seem to find anything I can really follow on how to bulk load this data into Cassandra without needing to write some Java (not high on my list of language skills!).
I can happily write a script to convert this data into any format if it would make the loading easier, although CSV might be tricky given that the body of a document could contain just about anything!
Any suggestions welcome.
Thanks
Si
If you're willing to convert the XML to a delimited format of some kind (e.g. CSV), then here are a couple of options:
The COPY command in cqlsh. This actually got a big performance boost in a recent version of Cassandra.
The cassandra-loader utility. This is a lot more flexible and has a bunch of different options you can tweak depending on the file format.
If you're willing to write code other than Java (for example, Python), there are Cassandra drivers available for a bunch of programming languages. No need to learn Java if you've got another language you're better with.
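For instance, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the keyspace, table, and CSV layout are made up:

```python
import csv

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("docs_ks")  # hypothetical keyspace

# Prepared statement for a hypothetical documents table.
insert = session.prepare(
    "INSERT INTO documents (doc_id, title, body) VALUES (?, ?, ?)"
)

with open("documents.csv", newline="", encoding="utf-8") as f:
    rows = [(r["doc_id"], r["title"], r["body"]) for r in csv.DictReader(f)]

# Run the inserts concurrently, 100 in flight at a time.
execute_concurrent_with_args(session, insert, rows, concurrency=100)
```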
Does it make sense to use Spark (in particular, MLlib) on a single node (apart from the goal of learning the technology)?
Is there any improvement in speed?
Are you comparing this to using a non-Spark machine learning system?
It really depends what the capabilities are of the other library you might use.
If, for example, you've got all your training data stored in Parquet files, then Spark makes it very easy to read in those files and work with them, whether that's on 1 machine or 100.
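As a minimal sketch of what that looks like on one machine (the file name and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# local[*] keeps everything on the cores of this one machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("mllib-single-node")
         .getOrCreate())

df = spark.read.parquet("training_data.parquet")

# Assemble the raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

model = LogisticRegression(maxIter=50).fit(train)
print(model.summary.accuracy)
```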
So my issue comes from the Excel data I currently have, which I need to convert into 4 separate forms, each with different details. The specifics don't really matter, but what I'm trying to do is write some kind of script that would go into this data and extract the stuff I need, saving me tons of time copying and pasting.
The problem is, I'm not really sure where to start. I have done some research, so I am familiar with CSV files, and I already have a pretty good grasp of Java. What would be the best way to approach this problem? From what I have researched, Python is very helpful for these string-type manipulations, but I also know it could be done in Java using buffered reads/file writes, though I feel that could get really clunky.
Thanks
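For reference, here is a minimal Python sketch of the approach being considered, assuming the Excel sheet has been exported to CSV and each row carries a column saying which of the 4 forms it belongs to (the file and column names are made up):

```python
import csv
from collections import defaultdict

rows_by_form = defaultdict(list)

# Read the exported CSV once and bucket the rows by form type.
with open("export.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    for row in reader:
        rows_by_form[row["form_type"]].append(row)

# Write one output file per form.
for form_type, rows in rows_by_form.items():
    with open(f"form_{form_type}.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```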