CORRMAP function EEGLAB not using all the datasets - eeglab

I have a study consisting of 54 datasets, and I am using the CORRMAP function to find the components common to the three conditions that I have.
The problem is that corrmap uses only a few of the datasets rather than all the settings files in the study. As a result, it gives me the common components of a subset of the data, not of all the data.
Do you know what might be the reason for that?

Related

How can I load my own dataset for person?

How can I load a dataset for person re-identification? My dataset contains two folders, train and test.
I wish I could provide comments, but I cannot yet. Therefore, I will "answer" your question to the best of my ability.
First, you should provide a general format or example content of the dataset. This would help me provide a less nebulous answer.
Second, from the nature of your question I am assuming that you are fairly new to Python in general. Forgive me if I'm wrong in my assumption. With that assumption, there are various ways to load the data depending on what kind of data it is (i.e. text, numbers, or a mixture of text and numbers). Some of the methods are easier than others. If you are strictly loading numbers, I suggest using numpy.loadtxt(<file name>). If you are loading text, you could use the pandas package, or, if it's in a CSV file, Python's built-in csv module. Alternatively, if it's in a format that TensorFlow can read, you could use the provided data-loading functions.
Once you have loaded your data you will need to separate the data into the input and output values. Considering that Tensorflow models accept either lists or numpy arrays, you should be able to use these in your training and testing steps.
Check out the csv module (import csv), or load your dataset via open(filename, "r") or similar. It would be easiest to help if you provided more context/info.
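As a rough illustration of the two answers above, here is a minimal sketch that reads a CSV with the built-in csv module and separates the input columns from the output column. The sample data, the column names, and the choice of the last column as the label are all assumptions made for the example; nothing is known about the actual dataset layout.

```python
import csv
import io

# Hypothetical sample data standing in for a real file. With a file on disk,
# replace io.StringIO(SAMPLE) with open(filename, "r", newline="").
SAMPLE = "height,weight,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"


def load_xy(handle):
    """Read a CSV with a header row; the last column is treated as the output."""
    reader = csv.reader(handle)
    header = next(reader)
    inputs, outputs = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        *features, target = row
        inputs.append([float(value) for value in features])
        outputs.append(int(target))
    return header, inputs, outputs


header, X, y = load_xy(io.StringIO(SAMPLE))
print(header, X, y)
```

The resulting lists can be passed on as-is or converted to numpy arrays before training.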

Spark Dataset when to use Except vs Left Anti Join

I was wondering if there is a performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the two datasets can have different columns.
Your title and your explanation differ.
But if you have the same structure, you can use both methods to find missing data.
EXCEPT is a specific implementation that enforces the same structure and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you say, but can give the same result.
Use cases differ: 1) Left Anti Join can apply to many situations pertaining to missing data, such as customers with no orders (yet) or orphans in a database. 2) Except is for subtracting things, e.g. splitting data into test and training sets in machine learning.
Performance should not be a real deal breaker, as they generally serve different use cases and are therefore difficult to compare. Except will involve the same data source, whereas LAJ will involve different data sources.
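To make the semantic difference concrete, here is a small plain-Python sketch (not Spark code) of the two operations. It assumes the documented behaviour that Dataset.except corresponds to SQL's EXCEPT DISTINCT, i.e. whole rows are compared and duplicates are dropped, while a left anti join only compares the join key and keeps the left-hand rows as they are. The function names and sample rows are made up for illustration.

```python
def except_distinct(left, right):
    """Rows of `left` not present in `right`, compared as whole rows, de-duplicated."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def left_anti_join(left, right, left_key, right_key):
    """Rows of `left` whose key has no match in `right`; duplicates are kept."""
    right_keys = {right_key(row) for row in right}
    return [row for row in left if left_key(row) not in right_keys]


customers = [(1, "ann"), (2, "bob"), (2, "bob"), (3, "cy")]
same_shape = [(1, "ann")]          # same structure as customers
orders = [(100, 1, 9.99)]          # different structure: (order_id, customer_id, amount)

subtracted = except_distinct(customers, same_shape)
without_orders = left_anti_join(customers, orders, lambda c: c[0], lambda o: o[1])
print(subtracted)
print(without_orders)
```

Note that the two results differ even here: the subtract-style operation collapses the duplicate row, while the anti join does not.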

Py-tables vs Blaze vs S-Frames

I am working on an exploratory data analysis using Python on a huge dataset (~20 million records and 10 columns). I will be segmenting and aggregating the data and creating some visualizations, and I might also build some decision tree and linear regression models using that dataset.
Because the dataset is large, I need to use a data frame that allows out-of-core data storage. Since I am relatively new to Python and to working with large datasets, I want a method that would let me easily use sklearn on my data. I'm confused about whether to use PyTables, Blaze, or SFrame for this exercise. It would be much appreciated if someone could help me understand their pros and cons, and which factors are important in this kind of decision.
Good question! One option you may consider is to not use any of the aforementioned libraries, but instead read and process your file chunk by chunk. pandas can read data from (large) files chunk-wise via a file iterator, something like this:
import pandas as pd
csv = r"\path\to\file.csv"  # raw string, so backslashes are not treated as escapes
# chunksize must be an integer number of rows
it = pd.read_csv(csv, iterator=True, chunksize=20000000 // 10)
for i, chunk in enumerate(it):
    ...
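What goes inside the loop depends on the analysis, but the general pattern is to compute a partial result per chunk and merge it into a running total. The sketch below shows that pattern using only the standard library (csv and itertools.islice) as a stand-in for the pandas iterator, so it runs without any extra packages. The column names segment and amount, the sample data, and the helper names are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most `chunk_size` rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def total_by_segment(handle, chunk_size):
    """Sum `amount` per `segment`, holding only one chunk in memory at a time."""
    totals = defaultdict(float)
    reader = csv.DictReader(handle)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            totals[row["segment"]] += float(row["amount"])
    return dict(totals)


# Hypothetical sample data standing in for the large file.
SAMPLE = "segment,amount\na,1.5\nb,2.0\na,0.5\nb,3.0\nc,4.0\n"
totals = total_by_segment(io.StringIO(SAMPLE), chunk_size=2)
print(totals)
```

Aggregations such as sums and counts merge cleanly across chunks; statistics like medians do not, and need a different approach.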

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into an org.apache.spark.sql.Column, but I cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
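The two stages can be illustrated with a plain-Python analogy built on generators. This is not Spark code and says nothing about Spark's internals; it only mirrors the idea that declaring transformations reads no data, and that data starts flowing once an action consumes the result. All names here are made up for the example.

```python
reads = []  # records every row actually pulled from the "source"


def source(rows):
    """Stand-in for an input: rows are produced only when something consumes them."""
    for row in rows:
        reads.append(row)
        yield row


def fill_nulls(rows, default):
    """Lazy transformation: replace None values with `default`."""
    return ([default if value is None else value for value in row] for row in rows)


# Stage 1: describe the pipeline. Nothing has been read yet.
plan = fill_nulls(source([[1, None], [None, 4]]), default=0)
reads_before_action = len(reads)

# Stage 2: an "action" (here, list) pulls the data through the transformations.
result = list(plan)
reads_after_action = len(reads)
print(reads_before_action, reads_after_action, result)
```

In this analogy, the `plan` object plays the role of the unevaluated DataFrame: it describes what to do with each row, but holds no rows itself.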
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what you are trying to accomplish, but it is safe to say that so much migration between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.

Building a collaborative filtering recommendation engine using Spark mlLib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data, with quite good results (MSE ~ 0.9). My specific questions are:
How can I make recommendations for users who have not done any activity on the site? Isn't there an API call that returns the most popular items based on user actions? One way would be to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How can I reload the model after data has been added to the input file? I am trying to reload the model using another function, which tries to save the model, but it fails with org.apache.hadoop.mapred.FileAlreadyExistsException. One way would be to listen for incoming data on a parallel thread, save the model using model.save(sc, "target/tmp/<some target>"), and then reload it after a significant amount of data has been received. I am lost on how to achieve that.
It would be very helpful if I could get some direction here.
For the first part, you can compute, for each item_id, the number of times that item_id appears. You can use Spark's map and reduceByKey functions for that. Then take the top 10 or 20 items with the highest counts. You can also weight the counts by the recency of the items.
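As a plain-Python sketch of that counting step (the equivalent of mapping each event to (item_id, 1) and reducing by key), assuming events arrive as (user_id, item_id) pairs. The function names, the sample events, and the fallback logic for users without recommendations are illustrative assumptions, not part of any Spark API.

```python
from collections import Counter


def top_items(events, n):
    """Return the n most frequent item ids among (user_id, item_id) events."""
    counts = Counter(item_id for _user_id, item_id in events)
    return [item_id for item_id, _count in counts.most_common(n)]


def recommend(user_id, personalised, popular):
    """Use personalised results when available, otherwise fall back to popular items."""
    return personalised.get(user_id) or popular


events = [("u1", "a"), ("u2", "a"), ("u3", "b"),
          ("u1", "c"), ("u2", "b"), ("u3", "a")]
popular = top_items(events, 2)
personalised = {"u1": ["b"]}  # hypothetical output of the trained model

print(popular)
print(recommend("u1", personalised, popular))
print(recommend("new_user", personalised, popular))
```

Checking for a missing user up front avoids relying on an exception to detect the cold-start case.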
For the second part, you can save the model under a new name every time. I generally create a folder name on the fly using the current date and time, and use the same name to reload the model from the saved folder. You will always have to train the model again, using the past data plus the newly received data, and then use the retrained model to predict.
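A minimal sketch of that naming scheme, assuming a timestamp format that sorts lexicographically so the newest folder can be found with a simple max. The helper names and the model- prefix are made up for illustration; the base directory is the one from the question.

```python
from datetime import datetime


def versioned_model_path(base_dir, now):
    """Build a unique folder name such as target/tmp/model-20240102-030405."""
    return f"{base_dir}/model-{now.strftime('%Y%m%d-%H%M%S')}"


def latest_model_path(paths):
    """Pick the most recent path; works because the timestamps sort as strings."""
    return max(paths)


older = versioned_model_path("target/tmp", datetime(2024, 1, 2, 3, 4, 5))
newer = versioned_model_path("target/tmp", datetime(2024, 1, 3, 0, 0, 0))
print(older)
print(latest_model_path([newer, older]))
```

In real use, `datetime.now()` would be passed as `now`, and the resulting path would be given to model.save so that each save targets a folder that does not exist yet.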
Independent of platforms like Spark, there are some very good techniques (e.g. non-negative matrix factorization) for link prediction, which predicts links between two sets.
Other very effective recommendation techniques are:
1. Thompson Sampling, 2. MAB (multi-armed bandits). A lot depends on the raw dataset and how it is distributed. I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward.
Again, all these techniques are independent of platform. I would also recommend starting from scratch instead of using platforms like Spark, which are only useful for large datasets. You can always move to these platforms in the future for scalability.
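For readers unfamiliar with Thompson Sampling, here is a minimal platform-independent sketch for the Bernoulli case (each item is either clicked or not), using Beta posteriors from the standard library. The click rates, the number of rounds, and the function name are assumptions for illustration; a real system would observe clicks rather than simulate them.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return how often each arm was chosen."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible click rate per arm from its Beta(1 + s, 1 + f) posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        pulls[arm] += 1
        # Simulated feedback: the chosen item is clicked with its (hidden) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling([0.1, 0.5, 0.9], rounds=2000)
print(pulls)
```

With click rates this far apart, the item with the highest rate should end up being shown far more often than the others, while the weaker items still receive a few exploratory pulls early on.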
Hope it helps!
