How to perform a factorial experiment design in a data-scarce situation?

I am trying to analyze data for my thesis, and a factorial experimental design has been suggested. The problem is that I carried out the data collection before designing the experiment, and now my sample size is just 3. Should I run the experiment again, or is there another solution? The experiment is very expensive and time-consuming.

Related

Why is some portion of statistics not used in data science?

I have learned statistics including the mean, median, and mode, and various tests such as the Z test, F test, and chi-square test. But when participating in difficult numeric prediction challenges on Kaggle and other platforms, I hardly see anyone using statistical tests like the Z, F, or chi-square tests, or data normalization; mostly we use box plots and bar plots to look at the mean, median, mode, etc.
My question is: where are these tests an integral part of data science, and for what sort of problems were they mainly designed (research-based ones)?
What portion of statistics should ideally be used in a data science problem, and why is only some portion used if all of statistics is supposed to be essential for data science?
I am asking about tests and other statistics, not the algorithms.
You're most likely to see statistical hypothesis testing in data science if you're looking at something like A/B testing, where your goal is to determine whether there is a reliable difference between two samples and the size of that difference.
Kaggle competitions specifically are supervised learning problems rather than hypothesis testing, which is why you don't see people using things like chi-squared. (Which makes sense: if you have ten people do hypothesis testing on the same dataset, they should all get pretty much the same answer, which would make for a pretty uninteresting competition.)
Personally, I think it's good to be familiar with both statistical hypothesis testing and machine-learning techniques, since they have different uses. Hope that helps! :)
Every problem in data science requires a different approach, so generic statistical methods might not always apply. There will be problems where some statistics are simply not needed.
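To make the A/B-testing point from the first answer concrete, here is a minimal sketch of a chi-square test on a 2x2 contingency table with SciPy; the conversion counts are made-up numbers for illustration, not results from any real experiment.

```python
# Minimal A/B-test sketch: do variants A and B have different conversion rates?
# The counts below are invented for illustration.
from scipy.stats import chi2_contingency

# rows: variant A, variant B; columns: converted, did not convert
table = [[120, 880],
         [150, 850]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in conversion rates is unlikely
# to be due to chance alone; the effect size still needs to be judged separately.
```

This is exactly the kind of question (is there a reliable difference between two samples?) where the classical tests earn their keep, as opposed to the pure prediction setting of a Kaggle leaderboard.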

How can I know if Apache Spark is the right tool?

Just wondering, is there a list of questions to ask ourselves in order to know whether Spark is the right tool or not?
Once again I spent part of the week implementing a POC with Apache Spark in order to compare its performance against pure Python code, and I was baffled when I saw the 1/100 ratio (in favor of Python).
I know that Spark is a "big data" tool and everyone keeps saying "Spark is the right tool to process TB/PB of data", but I think that is not the only thing to take into account.
In brief, my question is: when given small data as input, how can I know whether the computation will be demanding enough for Spark to actually improve things?
I'm not sure if there is such a list, but if there were, the first question would probably be:
Does your data fit on a single machine?
And if the answer is 'Yes', you do not need Spark.
Spark is designed, as an alternative to Hadoop, to process data that is too large to be handled by a single machine, in a fault-tolerant manner.
There are lots of overheads associated with operating in a distributed manner, such as fault tolerance and network communication, and these cause the apparent slow-down compared to traditional tools on a single machine.
Just because Spark can be used as a parallel processing framework on a small dataset does not mean it should be used that way. You will get faster results and less complexity by using, say, plain Python and parallelizing your processing with threads or processes (a minimal sketch follows at the end of this answer).
Spark excels when you have to process a dataset that does not fit onto a single machine, when the processing is complex and time-consuming and the probability of encountering an infrastructure issue is high enough and a failure would result in starting again from scratch.
Comparing Spark to native Python is like comparing a locomotive to a bicycle. A bicycle is fast and agile, until you need to transport several tonnes of steel from one end of the country to the other: then - not so fun.
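To make the "just use Python on a single machine" advice concrete, here is a minimal sketch of parallelizing a CPU-bound step with the standard library's multiprocessing module; the transform function and data size are placeholders rather than anyone's actual workload.

```python
# A minimal sketch of single-machine parallelism with the standard library,
# before reaching for Spark. The per-record computation is a placeholder.
from multiprocessing import Pool

def transform(x):
    # stand-in for whatever per-record computation you actually need
    return x * x

if __name__ == "__main__":
    data = range(1_000_000)              # small enough to fit in memory
    with Pool() as pool:                 # one worker process per CPU core by default
        results = pool.map(transform, data, chunksize=10_000)
    print(sum(results))
```

If a version like this saturates your cores and still is not fast enough, and the data genuinely cannot fit on one machine, that is the point where the distributed overheads of Spark start to pay for themselves.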

MapReduce - Anything else besides word-counting?

I have been looking at MapReduce and reading through various papers about it and its applications, but, to me, it seems that MapReduce is only suitable for a very narrow class of scenarios that ultimately amount to word-counting.
If you look at the original paper, Google's employees provide "various" potential use cases, like "distributed grep", "distributed sort", "reverse web-link graph", "term-vector per host", etc.
But if you look closer, all those problems boil down to simply "counting words": that is, counting the number of occurrences of something in a chunk of data, then aggregating/filtering and sorting that list of occurrences (the basic shape is sketched just below).
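To be explicit about the pattern I mean, here is a minimal in-memory sketch of that map/shuffle/reduce shape in plain Python (no Hadoop involved); the documents are toy data.

```python
# The canonical MapReduce word-count shape:
# map emits (word, 1) pairs, the shuffle groups pairs by key, reduce sums counts.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

grouped = defaultdict(list)              # the "shuffle": group by key
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = sorted(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]
```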
There also are some cases where MapReduce has been used for genetic algorithms or relational databases, but they don't use the "vanilla" MapReduce published by Google. Instead they introduce further steps along the Map-Reduce chain, like Map-Reduce-Merge, etc.
Do you know of any other (documented?) scenarios where "vanilla" MapReduce has been used to perform more than mere word-counting? (Maybe for ray-tracing, video-transcoding, cryptography, etc. - in short anything "computation heavy" that is parallelizable)
Atbrox has been maintaining a list of MapReduce/Hadoop algorithms from academic papers. Here is the link. All of these could be applied for practical purposes.
MapReduce is good for problems that can be considered to be embarrassingly parallel. There are a lot of problems that MapReduce is very bad at, such as those that require lots of all-to-all communication between nodes. E.g., fast Fourier transforms and signal correlation.
There are projects using MapReduce for parallel computations in statistics. For instance, Revolution Analytics has started the RHadoop project for use with R. Hadoop is also used in computational biology and in other fields with large datasets that can be analyzed by many discrete jobs.
I am the author of one of the packages in RHadoop, and I wrote several of the examples distributed with the source and used in the tutorial: logistic regression, linear least squares, matrix multiplication, etc. There is also a paper I would like to recommend, http://www.mendeley.com/research/sorting-searching-simulation-mapreduce-framework/, which seems to strongly support the equivalence of MapReduce with classic parallel programming models such as PRAM and BSP. I often write MapReduce algorithms as ports from PRAM algorithms; see for instance blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html.

So I think the scope of MapReduce is clearly more than "embarrassingly parallel", but not infinite. I have myself experienced some limitations, for instance in speeding up some MCMC simulations; of course it could have been me not using the right approach. My rule of thumb is the following: if the problem can be solved in parallel in O(log(N)) time on O(N) processors, then it is a good candidate for MapReduce, with O(log(N)) jobs and constant time spent in each job. Other people, and the paper I mentioned, seem to focus more on the O(1)-jobs case. When you go beyond O(log(N)) time, the case for MR seems to get a little weaker, but some limitations may be inherent in the current implementation (high job overhead) rather than fundamental. It's a pretty fascinating time to be working on charting the MR territory.
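To illustrate one of the non-word-count examples mentioned above, here is a rough pure-Python sketch of linear least squares in the MapReduce shape: each mapper emits the sufficient statistics XᵀX and Xᵀy for its partition, and a single reduce sums them and solves the normal equations. This sketches the pattern only; it is not the RHadoop code.

```python
# Linear least squares in MapReduce style: mappers emit per-chunk sufficient
# statistics (X^T X, X^T y); the reducer sums them and solves the normal equations.
# This illustrates the pattern; it is not the RHadoop implementation.
import numpy as np

def map_chunk(X_chunk, y_chunk):
    # each mapper only ever sees its own partition of the data
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_stats(partials):
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=10_000)

# simulate ten mappers by splitting the data into ten chunks
partials = [map_chunk(Xc, yc)
            for Xc, yc in zip(np.array_split(X, 10), np.array_split(y, 10))]
print(reduce_stats(partials))   # should be close to beta_true
```

Because the sums are associative, the whole fit needs a single pass over the data, which matches the constant-number-of-jobs case discussed above.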

Examples of simple stats calculations with Hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good examples of how to calculate the mean and variance using Hadoop, and/or provide some sample code?
Thanks
Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately it didn't have variance last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to implement in a stable way over huge data sets, so take care!
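As a starting point for your own implementation, here is a sketch of the standard combinable-summaries approach: each partition produces (count, mean, M2), and partial summaries are merged pairwise, which maps naturally onto a Hadoop combiner plus reducer. This is plain Python illustrating the arithmetic, not Pig or PiggyBank code.

```python
# Numerically stable, combinable mean/variance:
# each partition yields (count, mean, M2); partial summaries merge in any order.
# This shows the arithmetic you would put into a combiner/reducer.

def summarize(partition):
    n, mean, m2 = 0, 0.0, 0.0
    for x in partition:                  # Welford's online update within a partition
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):                         # combine two partial summaries
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
n, mean, m2 = summarize(partitions[0])
for p in partitions[1:]:
    n, mean, m2 = merge((n, mean, m2), summarize(p))
print(mean, m2 / n)                      # population mean and variance of 1..9
```

Because the merge step is associative and commutative, partial summaries can be combined in whatever order the reducers deliver them, and you avoid the catastrophic cancellation of the naive sum-of-squares formula.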
You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these github projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).

Server-side language for CPU/memory-intensive processing

What's a good server-side language for doing some pretty CPU- and memory-intensive things that plays well with PHP and MySQL? Currently, I have a PHP script which runs some calculations based on a large subset of a fairly large database and then updates that database based on those calculations (1.5 million rows). The current implementation is very slow, taking 1-2 hours depending on other activity on the server. I was hoping to improve this and was wondering what people's opinions are on a good language for this type of task.
Where is the bottleneck? Run some real profiling and see what exactly is causing the problem. Is it the DB I/O? Is it the CPU? Is the algorithm inefficient? Are you calling slow library methods in a tight inner loop? Could precalculation be used?
You're pretty much asking what vehicle you need to get from point A to point B, and you've offered a truck, car, bicycle, airplane, jet, and helicopter. The answer won't make sense without more context.
The language isn't the issue; the issue is probably where you are doing these calculations. It sounds like you may be better off writing this in SQL, if possible. Is it? What are you doing?
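To illustrate the previous answer's point, here is a hedged sketch contrasting row-by-row updates issued from application code with a single set-based UPDATE; sqlite3 stands in for MySQL, and the table and column names are invented for the example.

```python
# Row-by-row updates from the client vs. one set-based UPDATE in the database.
# sqlite3 is used as a stand-in for MySQL; the schema is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, raw REAL, adjusted REAL)")
conn.executemany("INSERT INTO scores (raw) VALUES (?)",
                 [(i * 0.5,) for i in range(100_000)])

# Slow pattern: pull every row out, compute in the client, write each one back.
rows = conn.execute("SELECT id, raw FROM scores").fetchall()
for row_id, raw in rows:
    conn.execute("UPDATE scores SET adjusted = ? WHERE id = ?", (raw * 1.1 + 3, row_id))

# Set-based pattern: one statement, and the database does the loop internally.
conn.execute("UPDATE scores SET adjusted = raw * 1.1 + 3")
conn.commit()
```

If the calculation can be expressed in SQL (or as a stored routine), the second form avoids shipping 1.5 million rows back and forth between PHP and MySQL, which is usually where the hours go.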
I suspect your bottleneck is not the computation; it definitely takes several hours just to update a few million records.
If that's the case, you can write a custom function in C/C++ for MySQL (a UDF) and execute it from a stored procedure.
We do this in our database to re-encrypt some sensitive fields during key rotation. It cut key-rotation time from days to hours. However, it's a pain to maintain your own copy of MySQL. We have been looking for alternatives, but nothing comes close to the performance of this approach.
