Split multiple values into two columns based on a single separator - python-3.x

I am new to pandas. I want to split the length column into two columns, a and b. The values in the length column come in pairs. For the first pair, the smaller value should go in a and the larger in b; then the next pair on the same row is compared the same way, smaller in a and larger in b.
I have a hundred rows. I don't think I can use str.split because each cell holds multiple values separated by the same delimiter. I have no idea how to do it.
The output should look like this.
Any help will be appreciated.
length                                             a                      b
{22.562,"35.012","25.456",37.342,24.541,38.241}    22.562,25.456,24.541   35.012,37.342,38.241
{21.562,"37.012",25.256,36.342}                    31.562,25.256          37.012,36.342
{22.256,36.456,26.245,35.342,25.56,"36.25"}        22.256,26.245,25.56    36.456,35.342,36.25
I have tried
df['a'] = df['length'].str.split(',').str[0::2]
df['b'] = df['length'].str.split(',').str[1::3]
With this code, the output of column b is perfect, but column a prints the first full pair and then the second. It does not give only the 0th, 2nd and 4th values.

The problem comes from the fact that your length column is made of sets, not lists.
Here is a way to do what you want by casting your length column to lists:
df['length'] = [list(x) for x in df.length] # We cast the sets as lists
df['a'] = [x[0::2] for x in df.length]
df['b'] = [x[1::2] for x in df.length]
Output:
length a \
0 [35.012, 37.342, 38.241, 22.562, 24.541, 25.456] [35.012, 38.241, 24.541]
1 [25.256, 36.342, 21.562, 37.012] [25.256, 21.562]
2 [35.342, 36.456, 36.25, 22.256, 25.56, 26.245] [35.342, 36.25, 25.56]
b
0 [37.342, 22.562, 25.456]
1 [36.342, 37.012]
2 [36.456, 22.256, 26.245]
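The slicing above picks values by position only; it does not compare the two members of each pair, and the element order of a real Python set is not guaranteed. A minimal sketch of the pairwise comparison the question asks for, assuming each cell of length is a string such as {22.562,"35.012",...} (as the sample suggests) and using a hypothetical helper named split_pairs:

```python
import pandas as pd

def split_pairs(cell):
    """Return (smaller values, larger values) for consecutive pairs in one cell."""
    # Assumption: the cell is a brace-wrapped, comma-separated string
    # in which some numbers are quoted.
    items = [float(tok.strip().strip('"')) for tok in cell.strip("{}").split(",")]
    # Walk the values two at a time and order each pair; a trailing unpaired
    # value (odd count) is ignored.
    pairs = [sorted(items[i:i + 2]) for i in range(0, len(items) - 1, 2)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

df = pd.DataFrame({"length": [
    '{22.562,"35.012","25.456",37.342,24.541,38.241}',
    '{21.562,"37.012",25.256,36.342}',
]})

parsed = [split_pairs(cell) for cell in df["length"]]
df["a"] = [smaller for smaller, _ in parsed]   # smaller member of each pair
df["b"] = [larger for _, larger in parsed]     # larger member of each pair
```

If the cells really are sets rather than strings, the original pairing is already lost and this approach would not apply.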

Related

Using value_counts() and filter elements based on number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
Problem is that some values exist one time only and I would like to ignore them. In the example above, only 1024 repeated multiple times and others are there only once. I can manually check the number of occurrences in the row and see if they are not repeated multiple times and ignore them.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
for v in values:
    # N = number of occurrences of v in row
    # if N == 1:
    #     remove v from row
    pass
I would like to know if there are other ways for that using the built-in functions in Pandas.
Some clarity requested on your question in comments above
If keys is a column and you want to retain non duplicates, please try
values=df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
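If the goal is instead to drop the values that occur only once, one possible approach is to filter the raw counts before converting them to percentages. This is only a sketch: the sample row is hypothetical and chosen to reproduce the percentages in the question, and the names raw_counts and repeated are illustrative.

```python
import pandas as pd

# Hypothetical sample mirroring the question: 1024 repeats, the others occur once.
row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

raw_counts = row.value_counts()          # absolute number of occurrences per value
repeated = raw_counts[raw_counts > 1]    # keep only values seen more than once

# Percentages are still computed against the full row, as in the question.
counts = repeated.div(len(row)).mul(100).round(1)
values = counts.index.tolist()
```

Whether the percentages should be relative to the full row or only to the retained values depends on what the histogram is meant to show.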

How does iloc[:,1:] work? Can anyone explain the [:,1:] params?

What is the meaning of the lines below? I am especially confused about how iloc[:,1:] works, and also data[:,:1].
data = np.asarray(train_df_mv_norm.iloc[:,1:])
X, Y = data[:,1:],data[:,:1]
Here train_df_mv_norm is a dataframe --
Definition: pandas iloc
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
For example:
df.iloc[:3] # slice your object, i.e. first three rows of your dataframe
df.iloc[0:3] # same
df.iloc[0, 1] # index both axis. Select the element from the first row, second column.
df.iloc[:, 0:5] # first five columns of data frame with all rows
So, your dataframe train_df_mv_norm.iloc[:,1:] will select all rows but your first column will be excluded.
Note that:
df.iloc[:,:1] selects all rows and the columns from position 0 (included) to 1 (excluded), i.e. only the first column.
df.iloc[:,1:] selects all rows and every column from position 1 onward, i.e. it excludes the first column (position 0).
To complete the answer by KeyMaker00, I add that data[:,:1] means:
The first : - take all rows.
:1 - equal to 0:1 take columns starting from column 0,
up to (excluding) column 1.
So, to sum up, the second expression reads only the first column from data.
As your expression has the form:
<variable_list> = <expression_list>
each expression is substituted under the corresponding variable (X and Y).
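The effect of the two slices can be checked on a small array. This is a minimal sketch: the 3x4 array below is a hypothetical stand-in for the result of np.asarray(train_df_mv_norm.iloc[:,1:]).

```python
import numpy as np

# Hypothetical 3x4 array standing in for the converted dataframe.
data = np.arange(12).reshape(3, 4)

# Tuple assignment: X receives the first expression, Y the second.
X, Y = data[:, 1:], data[:, :1]

print(X.shape)  # (3, 3): every column except column 0
print(Y.shape)  # (3, 1): only column 0, kept two-dimensional
```

Because :1 is a slice rather than a plain integer, Y stays a two-dimensional column instead of collapsing to a one-dimensional vector.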
Maybe this will complete the previous answers.
You will know:
what you get,
its shape,
how to use it with the column name.
df.iloc[:,1:2] # get column 1 as a DATAFRAME of shape (n, 1)
df.iloc[:,1:2].values # get column 1 as an NDARRAY of shape (n, 1)
df.iloc[:,1].values # get column 1 as an NDARRAY of shape (n,)
df.iloc[:,1] # get column 1 as a SERIES of shape (n,)
# iloc with the name of a column
df.iloc[:, df.columns.get_loc('my_col')] # maybe there are more elegant methods

How to find common elements in string cells?

I want to find the common elements in multiple (>=2) cell arrays of strings.
A related question is here, and the answer proposes to use the function intersect(), however it works for only 2 inputs.
In my case, I have more than two cells, and I want to obtain a single common subset. Here is an example of what I want to achieve:
c1 = {'a','b','c','d'}
c2 = {'b','c','d'}
c3 = {'c','d'}
c_common = my_fun({c1,c2,c3});
in the end, I want c_common={'c','d'}, since only these two strings occur in all the inputs.
How can I do this with MATLAB?
Thanks in advance,
P.S. I also need the indices from each input, but I can probably do that myself using the output c_common, so not necessary in the answer. But if anyone wants to tackle that too, my actual output will be like this:
[c_common, indices] = my_fun({c1,c2,c3});
where indices = {[3,4], [2,3], [1,2]} for this case.
Thanks,
Listed in this post is a vectorized approach to give us the common strings and indices using unique and accumarray. This would work even when the strings are not sorted within each cell array to give us indices corresponding to their positions within it, but they have to be unique. Please have a look at the sample input, output section* to see such a case run. Here's the implementation -
C = {c1,c2,c3}; % Add more cell arrays here
% Get unique strings and ID each of the strings based on their uniqueness
[unqC,~,unqID] = unique([C{:}]);
% Get count of each ID and the IDs that have counts equal to the number of
% cells arrays in C indicate that they are present in all cell arrays and
% thus are the ones to be finally selected
match_ID = find(accumarray(unqID(:),1)==numel(C));
common_str = unqC(match_ID)
% ------------ Additional work to get indices ----------------
N_str = numel(common_str);
% Store matches as a logical array to be used at later stages
matches = ismember(unqID,match_ID);
% Use ismember to find all those indices in unqID and subtract group
% lengths from them to give us the indices within each cell array
clens = [0 cumsum(cellfun('length',C(1:end-1)))];
match_index = reshape(find(matches),N_str,[]);
% Sort match_index along each column based on the respective unqID elements
[m,n] = size(match_index);
[~,sidx] = sort(reshape(unqID(matches),N_str,[]),1);
sorted_match_index = match_index(bsxfun(@plus,sidx,(0:n-1)*m));
% Subtract cumulative group lens to give us indices corres. to each cell array
common_idx = bsxfun(@minus,sorted_match_index,clens).'
Please note that at the step that calculates match_ID, accumarray(unqID(:),1) could be replaced by histc(unqID,1:max(unqID)). Also, histcounts could be another alternative there.
*Sample input, output -
c1 =
'a' 'b' 'c' 'd'
c2 =
'b' 'c' 'a' 'd'
c3 =
'c' 'd' 'a'
common_str =
'a' 'c' 'd'
common_idx =
1 3 4
3 2 4
3 1 2
As noted in the comments to this question, there is a file in File Exchange called "MINTERSECT -- Multiple set intersection." at http://www.mathworks.com/matlabcentral/fileexchange/6144-mintersect-multiple-set-intersection that contains simple code to generalize intersect to multiple sets. In a nutshell, the code gets the output from performing intersect on the first pair of cells and then perform intersect on this output with the next cell. This process continues until all cells have been compared. Note that the author points out that the code is not particularly efficient but it may be sufficient for your use case.
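The pairwise reduction described above is language-independent. Purely as an illustration of the idea (sketched in Python rather than MATLAB, and not the File Exchange code itself), with a hypothetical helper named common_elements:

```python
from functools import reduce

def common_elements(cells):
    """Intersect an arbitrary number of string collections, pair by pair."""
    # reduce() intersects the running result with the next collection in turn,
    # starting from the first collection.
    return sorted(reduce(lambda acc, cell: acc & set(cell), cells[1:], set(cells[0])))

c1 = ['a', 'b', 'c', 'd']
c2 = ['b', 'c', 'd']
c3 = ['c', 'd']
c_common = common_elements([c1, c2, c3])
print(c_common)  # ['c', 'd']
```

In MATLAB the same loop would call intersect on the running result and each remaining cell array.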

masking a double over a string

This is a question about MATLAB...
I have two matrices, one being a (5 x 1 double) :
1
2
3
1
3
And the second matrix being a (5 x 3 string), with spaces where no character appears :
a
bc
def
g
hij
I am trying to create a (5 x 1 string) output that holds the nth character from each row of matrix two, where n is the corresponding value in matrix one. I am unsure how to do this with a mask that could handle much larger matrices. My target matrix would be the following:
a
c
f
g
j
Thank you very much for the help!!!
There are so many ways you can accomplish this task. I'll give you two.
Method #1 - Generate linear indices and access elements
Use sub2ind to generate a set of linear indices that correspond to the row and column locations you want to access in your matrix. You'll note that the column locations are the ones changing, but the row locations are always increasing by 1 as you want to access each row. As such, given your string matrix A, and your columns you want to access stored in ind, just do this:
A = ['a  '; 'bc '; 'def'; 'g  '; 'hij']; % every row is padded with spaces to 3 characters
ind = [1 2 3 1 3];
out = A(sub2ind(size(A), (1:numel(ind)).', ind(:)))
out =
a
c
f
g
j
Method #2 - Create a sparse matrix, convert to logical and access
Alternatively, you can create a sparse matrix through sparse where the row indices of the non-zero entries run from 1 up to the number of elements in ind, and the column indices are the values in ind.
S = sparse((1:numel(ind)).',ind(:),true,size(A,1),size(A,2));
A = A.'; out = A(S.');
Be mindful that you are trying to access each element in a row-major fashion, yet MATLAB will do this in a column-major format. As such, we would need to transpose our data matrix, and also take our sparse matrix and transpose that too. The end result should give you the same order as Method #1.

How to change stringified numbers in data frame into pure numeric values in R

I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is convert the values from strings into pure numbers
straight from the df variable, while preserving the string names for employee.
I tried this, but it does not work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric
values from df. Such as df2 <- log2(df), etc.
Ok, there's a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.
