A spherical region of space is filled with a specific distribution of smaller spheres of different sizes. Each sphere is associated with some physical properties: position, radius, mass, velocity, and ID, all represented as 1d or 3d numpy arrays. I would like to shuffle this population of spheres in a totally random manner such that any single sphere preserves all of its properties except its 3d position array. I have come across a similar question (Randomly shuffle columns except first column), but is there an easy and fast Pythonic way to do this without using a DataFrame?
Thanks for your help.
If you're using pandas, you could just shuffle one column:
df['col'] = df['col'].sample(frac=1).values
This works equally well on any subset of columns, e.g.
cols = ['col1', 'col2']
df[cols] = df[cols].sample(frac=1).values
The two columns are shuffled together, i.e. their respective values remain aligned.
See also this answer.
You can implement a Knuth shuffle (https://en.wikipedia.org/wiki/Random_permutation); it's quite straightforward.
You can adapt the algorithm to swap only the properties you want.
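As a sketch, here is what that looks like with NumPy: a Fisher-Yates (Knuth) pass that swaps only the position rows, so every other per-sphere array keeps its original order. The array names and sizes are illustrative, not from the question.

```python
import numpy as np

# Knuth (Fisher-Yates) shuffle applied only to the positions array; the
# other property arrays (mass, velocity, ID, ...) are left untouched.
rng = np.random.default_rng(0)
positions = rng.random((5, 3))   # one 3d position per sphere
original = positions.copy()

for i in range(len(positions) - 1, 0, -1):
    j = rng.integers(0, i + 1)              # j uniform on 0..i inclusive
    positions[[i, j]] = positions[[j, i]]   # swap rows i and j in place

# The multiset of positions is preserved; only their assignment to spheres changes.
```

Equivalently, `positions = positions[rng.permutation(len(positions))]` does the whole shuffle in one step.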
I am trying to apply some function to every possible pair of columns in a data frame. Iterating works, but since the data frame is huge it takes a lot of time. My data frame has around 10,000 columns and 1,000 rows.
Is there a faster way of doing this? Given below is a toy example.
TOY EXAMPLE
My function is something like this:
def foo(x, y):
    if ['alpha', 'beta'] == df[[x, y]]:
        print(x, y)
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
for i in df.columns:
    for j in df.columns:
        foo(i, j)
I have also tried loop comprehension and itertools.combinations, but it is also taking a lot of time.
z = [foo(i,j) for i,j in itertools.combinations(df.columns,2)]
My actual function is essentially the same: it checks whether 3-4 particular rows are present in the pair of columns and writes the column information to a file.
I also tried using NumPy matrices instead of the data frame, but did not achieve any significant speedup. All of the above approaches work but take a lot of time (obviously due to the huge size of the data frame), so I need help optimizing for time.
Any suggestions regarding the same would be highly appreciated. Thanks a lot.
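One way to avoid the Python-level double loop for a membership-style check like the one described is to broadcast the comparison over all column pairs at once in NumPy. A minimal sketch, assuming (hypothetically) that the check looks for a specific pair of values in one row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
arr = df.to_numpy()

# Hypothetical check: in row 1, which ordered column pairs (i, j) hold the
# values (5, 7)? Broadcasting builds the full pair mask in one shot instead
# of calling a Python function once per pair of columns.
row, target = 1, (5, 7)
mask = (arr[row][:, None] == target[0]) & (arr[row][None, :] == target[1])
pairs = np.argwhere(mask)   # each row of `pairs` is a matching (i, j) column pair
```

The same broadcasting pattern extends to checks over several rows by AND-ing one mask per row.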
I'm using this dataset of crop agriculture. To use it for training a neural network, I preprocessed the data with MinMaxScaler, which scales the data between 0 and 1. But my dataset also consists of categorical columns, which caused an error during preprocessing. So I tried encoding the categorical columns with OneHotEncoder and LabelEncoder, but I don't understand what to do with the result.
My aim is to predict "Crop_Damage".
How do I proceed ?
Link to the dataset -
https://www.kaggle.com/aniketng21600/crop-damage-information-in-india
You have several options.
You may use one-hot encoding and pass your categorical variable to the network as a one-hot vector.
You may take inspiration from NLP and its preprocessing. One-hot vectors are sparse and can be really huge (depending on the number of unique values of your categorical variable). Look at techniques like Word2vec (cat2vec) or GloVe; both aim to turn a categorical element into a dense, meaningful numeric vector.
Besides these two, Keras offers another way to obtain such a numeric vector: an embedding layer. For example, let's say your variable Crop_Damage has these values:
Huge
Medium
Little
First you assign a unique integer to every unique value of your categorical variable.
Huge = 0
Medium = 1
Little= 2
Then you pass the translated categorical values (the unique integers) to the embedding layer. An embedding layer takes a sequence of integers as input and produces a sequence of dense vectors. The values of these vectors are random at first, but during training they are optimized like regular weights of the neural network. So we can say that during training the network builds a vector representation of the categories according to the loss function.
For me, the embedding layer is the easiest way to obtain a good-enough vector representation of categorical variables. But you can try one-hot encoding first if the accuracy satisfies you.
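Here is a pure-NumPy sketch of what the embedding lookup itself does (in Keras this would be `keras.layers.Embedding(input_dim=3, output_dim=4)`; the category names come from the example above, and the output dimension is arbitrary):

```python
import numpy as np

# Each unique category gets an integer, and the integer indexes a row of a
# trainable weight matrix. In Keras this matrix is the Embedding layer's
# kernel and is updated by backprop like any other weight.
index = {"Huge": 0, "Medium": 1, "Little": 2}

rng = np.random.default_rng(0)
embedding = rng.normal(size=(3, 4))   # 3 categories -> 4-dim dense vectors

ids = np.array([index["Huge"], index["Little"], index["Huge"]])
vectors = embedding[ids]              # shape (3, 4): one dense vector per sample
```

Note that the two "Huge" samples map to the same row of the matrix, which is exactly how the network learns one shared representation per category.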
Here is a one-hot encoder. df is the data frame you are working with, and column is the name of the column you want to encode. prefix is a string that gets prepended to the column names created by pandas get_dummies. The new dummy columns are created and appended to the data frame as new columns, and the original column is then deleted.
There is an excellent series of videos on encoding data frames and other topics on YouTube here.
def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df
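A hypothetical usage, with the function repeated so the snippet runs on its own (the column name, values, and prefix below are made up, not from the dataset):

```python
import pandas as pd

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

df = pd.DataFrame({"Crop_Type": ["wheat", "rice", "wheat"], "Yield": [1, 2, 3]})
encoded = onehot_encode(df, "Crop_Type", "crop")
# encoded columns: Yield, crop_rice, crop_wheat
```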
In my situation, I would like to encode around 5 different columns in my dataset but the issue is that these 5 columns have many unique values.
If I encode them with a label encoder, I impose an arbitrary order that isn't meaningful, whereas if I use OHE or pd.get_dummies I end up with a lot of features that add too much sparseness to the data.
I am currently dealing with a supervised learning problem and the following are the unique values per column:
Job_Role : Unique categorical values = 29
Country : Unique categorical values = 12
State : Unique categorical values = 14
Segment : Unique categorical values = 12
Unit : Unique categorical values = 10
I have already looked into multiple references but am not sure about the best approach. What should I do in this situation to get the smallest number of features with the maximum positive impact on my model?
As far as I know, OneHotEncoder is usually used for these cases, but as you said, there are many unique values in your data. I looked for a solution for a project before, and I saw a few different approaches:
OneHotEncoder + PCA: I think this way is not quite right, because PCA is designed for continuous variables.[*]
Entity Embeddings: I don't know this way very well, but you can check it from the link in the title.
BinaryEncoder: this is useful when you have a large number of categories, where one-hot encoding would increase the dimensionality and in turn the model complexity. Binary encoding encodes the categorical variables with far fewer dimensions.
There are some other solutions in category_encoders library.
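To make the binary-encoding idea concrete, here is a minimal sketch in plain pandas/NumPy (`category_encoders.BinaryEncoder` does this for you, with extras such as handling unseen categories; the column name and values below are made up):

```python
import numpy as np
import pandas as pd

# Binary encoding: each category gets an integer code, and the code is
# split into its binary digits, so k categories need only ceil(log2(k))
# columns instead of k one-hot columns.
df = pd.DataFrame({"Job_Role": ["dev", "qa", "pm", "dev", "ops"]})

codes, uniques = pd.factorize(df["Job_Role"])        # dev=0, qa=1, pm=2, ops=3
n_bits = max(1, int(np.ceil(np.log2(len(uniques))))) # 4 categories -> 2 bits
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
encoded = pd.DataFrame(bits, columns=[f"Job_Role_{i}" for i in range(n_bits)])
```

For the 29-category Job_Role column this would mean 5 columns instead of 29.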
Say I have a bunch of categorical string columns in my dataframe. Then I do the transform below:
1. StringIndexer on the columns
2. VectorAssembler to assemble all the transformed columns into one vector feature column
3. VectorIndexer on the new vector feature column
Question: for step 3, does it make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you're going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to integers in the range 1 to 65000. It saves this in the metadata as a NominalAttribute. Then RFClassifier extracts it from the metadata as categorical features.
For tree-based algorithms you have to specify the maxBins parameter, which:
Must be >= 2 and >= number of categories in any categorical feature.
Too high a maxBins value leads to slow performance. To solve this, use VectorIndexer with, for example, .setMaxCategories(64). This will treat as categorical only those features with fewer than 64 unique values.
I am using Spark to design a TSP solver. Essentially, each element in the RDD is a 3-tuple (id, x, y), where id is the index of a point and (x, y) is the coordinate of that point. Given an RDD storing a sequence of 3-tuples, how can I evaluate the path cost of this sequence? For example, the sequence (1, 0, 0), (2, 0, 1), (3, 1, 1) gives the cost 1 + 1 = 2 (from the first point to the second and then to the third). It seems that to do this I have to know exactly how Spark partitions the sequence (RDD). Also, how can I evaluate the cost between the boundary points of two partitions? Or is there a simple operation for this?
With any parallel processing, you want to put serious thought into what a single data element is, so that only the data that needs to be together is together.
So instead of having every row be a point, it's likely that every row should be the array of points that defines a path; then calculating the total path length with Spark becomes easy. You'd just use whatever you would normally use to calculate the total length of an array of line segments given the defining points.
But even then it's not clear that we need the full generality of points. For the TSP, a candidate solution is a path that includes all locations, which means that we don't need to store the locations of the cities for every solution, or recalculate the distances every time. We just need to compute one matrix of distances, which we can broadcast so every Spark worker has access to it, and then look up distances instead of calculating them.
(It's actually a permutation of location ids, rather than just a list of them, which can simplify things even more.)
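A plain-NumPy sketch of the distance-matrix idea (in Spark the matrix would go through `sc.broadcast`, and each RDD element would be one candidate permutation of location ids; here the ids are renumbered from 0, and the points match the question's example):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # ids 0, 1, 2

# Pairwise Euclidean distance matrix, computed once up front.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

def path_cost(path, dist):
    # Sum of distances between consecutive ids along the candidate path;
    # each step is a lookup, not a fresh distance computation.
    return dist[path[:-1], path[1:]].sum()

cost = path_cost(np.array([0, 1, 2]), dist)
```

With the matrix broadcast, evaluating a population of candidate tours is then an embarrassingly parallel map over permutations.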