Create new data frame columns with binary (0/1) data based on text strings in existing column in R - string

I have an R data frame that looks like:
ID YR SC
ABX:22798 1976 A's Frnd; Cat; Cat & Mse
ABX:23798 1983 A's Frnd; Cat; Zebra Fish
ABX:22498 2010 Zebra Fish
ABX:22728 2010 Bear; Dog; Zebra Fish
ABX:22228 2011 Bear
example data:
df <- structure(list(ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"), YR = c(1976, 1983, 2010, 2010, 2011), SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish", "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear")), .Names = c("ID", "YR", "SC"), row.names = c(NA, 5L), class = "data.frame")
That I would like to transform by splitting the text string in the SC column by "; ". Then, I'd like to use the resulting lists of strings to populate new columns with binary data. The final data frame would look like this:
ID YR A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
ABX:22798 1976 1 0 1 1 0 0
ABX:23798 1983 1 0 1 0 0 1
ABX:22498 2010 0 0 0 0 0 1
ABX:22728 2010 0 1 0 0 1 1
ABX:22228 2011 0 1 0 0 0 0
I'll be analyzing a number of different datasets individually. In any given data set, there are between about 100 and 230 unique SCs entries, and the number of rows per set ranges from about 500 to several thousand. The number of SCs per row ranges from 1 to about 6 or so.
I have had a couple of starts with this, most are quite ugly. I thought the approach below looked promising (it's similar to a python pandas implementation that works well). It would be great to learn a good way to do this in R!
My starter code:
# Get list of unique SCs
SCs <- df[,2]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically,
# note that apostrophes could be a problem
SCunique <- sort(SCunique)
# create a dataframe of 0s to add to the original df
df0 <- as.data.frame(matrix(0, ncol=length(SCunique), nrow=nrow(df)))
colnames(df0) <- SCunique
...(and then...?)
I've found similar questions/answers, including:
Dummy variables from a string variable Split strings into columns in R where each string has a potentially different number of column entries
Edit: Found one more answer set of interest:
Improve text processing speed using R and data.table
Thanks in advance for your answers.

I think this should do what you're looking for:
library(reshape2)
a <- strsplit(as.character(df$SC), "; ", fixed = TRUE)
dfL <- merge(df, melt(a), by.x = "row.names", by.y = "L1", all = TRUE)
dcast(dfL, ID + YR ~ value, value.var="value", fun.aggregate=length, fill = 0)
# ID YR A's Frnd Cat Cat & Mse Zebra Fish Bear Dog
# 1 ABX:22228 2011 0 0 0 0 1 0
# 2 ABX:22498 2010 0 0 0 1 0 0
# 3 ABX:22728 2010 0 0 0 1 1 1
# 4 ABX:22798 1976 1 1 1 0 0 0
# 5 ABX:23798 1983 1 1 0 1 0 0
Factor the "value" column first if the column order is important to you.
I also have a package called "splitstackshape" that would work nicely with this if your data didn't have quotes in it. That's a bug that I'm looking into. Once it is resolved, it should be possible to do a command like:
library(reshape2)
library(splitstackshape)
dcast(concat.split.multiple(df, "SC", ";", "long"),
ID + YR ~ SC, value.var="SC", fun.aggregate=length, fill=0)
to get what you need.

Continuing on your logic, you can do something like:
##your code
# Get list of unique SCs
SCs <- df[,3] #NOTE: here, you have df[,2], but I guess should have been df[,3]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically,
# note that apostrophes could be a problem
SCunique <- sort(SCunique)
##
df0 <- sapply(SCunique, function(x) as.integer(grepl(x, SCs)))
df0
# A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
#[1,] 1 0 1 1 0 0
#[2,] 1 0 1 0 0 1
#[3,] 0 0 0 0 0 1
#[4,] 0 1 0 0 1 1
#[5,] 0 1 0 0 0 0

Related

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to create split the data format into different columns each with their own header and their values, the result should be as below:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL, will need as many columns needed each displaying its corresponding value.
The .str.split('-', expand=True) function will only delimit the data and keep all first values in same column and does not rename each column.
Is there a way to implement this in python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
reset_index().pivot('index',1,0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
Idea is first split values by -, then extract numbers and no numbers values to tuples, append to list and convert to dictionaries. It is passed in list comprehension to DataFrame cosntructor, replaced misisng values and converted to numeric:
import re
def f(x):
L = []
for val in x.split('-'):
k, v = re.findall('(\d+)(\D+)', val)[0]
L.append((v, k))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If in data exist some values without number or number only solution should be changed for more general like:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
L = []
for val in x.split('-'):
extracted = re.findall('(\d+)(\D+)', val)
if len(extracted) > 0:
k, v = extracted[0]
L.append((v, k))
else:
if val.isdigit():
L.append(('No match digit', val))
else:
L.append((val, 0))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
'cat1': [0,0,0],
'cat2': [0,0,0],
'cat3': [0,0,0],
'cat4': [0,0,0]
}
pd.DataFrame(sample,columns = ['Theme','cat1','cat2','cat3','cat4'])
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now, I need to replace the values in cat columns based on value in Theme. If the Theme column has 'never give a ten', then change cat1 as 1, similarly if the theme column has 'interaction speed', then change cat2 as 1, if the theme column has 'no feedback' in it, change 'cat3' as 1 and for 'premium' change cat4 as 1.
In this sample I have provided 4 categories, I have in total 21 categories. I can do if word in string 21 times for 21 categories, but I am looking for an efficient way to write this in a function, loop every row and go through the logic and update the corresponding columns, can anyone help please?
Thanks in advance.
Here is possible set columns names by categories with Series.str.get_dummies - columns names are sorted:
df1 = df['Theme'].str.get_dummies(',')
print (df1)
interaction speed never give a ten no feedback premium
0 0 1 0 0
1 1 0 0 0
2 0 0 1 1
If need first column in output add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
Theme interaction speed never give a ten no feedback \
0 never give a ten 0 1 0
1 interaction speed 1 0 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
If order of columns is important add DataFrame.reindex:
#removed posible duplicates with remain ordering
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
never give a ten interaction speed no feedback premium
0 1 0 0 0
1 0 1 0 0
2 0 0 1 1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
Theme never give a ten interaction speed no feedback \
0 never give a ten 1 0 0
1 interaction speed 0 1 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':[0,4,4,4],
'B':[0,4,4,0],
'C':[0,4,4,4],
'D':[4,0,0,4],
'E':[4,0,0,0],
'Name':['a','a','b','c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (just for rows 0 and 1). It uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns; 0,0,0,4,4 vs, 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.
You can identify a compliment if it's dot product is zero and it's element wise sum is nowhere zero.
def complements(df):
v = df.drop('Name', axis=1).values
n = v.shape[0]
row, col = np.triu_indices(n, 1)
# ensure two rows are complete
# their sum contains no zeros
c = ((v[row] + v[col]) != 0).all(1)
complete = set(row[c]).union(col[c])
# ensure two rows do not overlap
# their product is zero everywhere
o = (v[row] * v[col] == 0).all(1)
non_overlap = set(row[o]).union(col[o])
# we are a compliment iff we do
# not overlap and we are complete
complement = list(non_overlap.intersection(complete))
# return slice
return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)

Combining pairs in a string (Matlab)

I have a string:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'
How can I combine pairs which have the last character of 1 pair is the first character of the follow pairs into strings? And the new strings must contain all of the character 'A','B','C','D','E','F' , 'G', those characters are appeared in the sup_pairs string.
The expected output should be:
S1 = 'BAEFCGD' % because BA will be followed by AE in sup_pairs string, so we combine BAE, and so on...we continue the rule to generate S1
S2 = 'DFCEABG'
If I have AB, BC and BD, the generated strings should be both : ABC and ABD .
If there is any repeated character in the pairs like : AB BC CA CE . We will skip the second A , and we get ABCE .
This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.
First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);
The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:
>> full(A)
ans =
0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
As you can see, the edge BA, which translates to subscript (3,2) is equal to 1.
Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:
treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');
Output:
S1 = BAEFCGD
Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.
function full_paths = dft_all(A, current_path)
% A - adjacency matrix of graph
% current_path - initially just the start node (root)
% full_paths - cell array containing all paths from initial root to a leaf
n = size(A, 1); % number of nodes in graph
full_paths = cell(1,0); % return cell array
unvisited_mask = ones(1, n);
unvisited_mask(current_path) = 0; % mask off already visited nodes (path)
% multiply mask by array of nodes accessible from last node in path
unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);
% add restriction on length of paths to keep (numel == n)
if isempty(unvisited_nodes) && (numel(current_path) == n)
full_paths = {current_path}; % we've found a leaf node
return;
end
% otherwise, still more nodes to search
for node = unvisited_nodes
new_path = dft_all(A, [current_path node]); % add new node and search
if ~isempty(new_path) % if this produces a new path...
full_paths = {full_paths{1,:}, new_path{1,:}}; % add it to output
end
end
end
This is a normal Depth-first traversal except for the added condition on the length of the path in line 15:
if isempty(unvisited_nodes) && (numel(current_path) == n)
The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.
I took a shortcut here because n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. But there are no edges into or out of node 1 because I was apparently planning on using a trick that I never got around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:
A =
0 1 1 1 1 1 1 1
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.
Here's the output of the function using the updated matrix:
>> dft_all(A, 1)
ans =
{
[1,1] =
1 2 3 8 5 7 4 6
[1,2] =
1 3 2 6 7 4 8 5
[1,3] =
1 3 8 5 2 6 7 4
[1,4] =
1 3 8 5 7 4 6 2
[1,5] =
1 4 6 2 3 8 5 7
[1,6] =
1 5 7 4 6 2 3 8
[1,7] =
1 6 2 3 8 5 7 4
[1,8] =
1 6 7 4 8 5 2 3
[1,9] =
1 7 4 6 2 3 8 5
[1,10] =
1 8 5 7 4 6 2 3
}

How do I limit results to only those data frame rows and columns that contain 0s?

I am doing approximate string matching in R. I am rather inexperienced with this technique, but because I want to find instances where my x strings match parts of my y strings exactly, I am only interested in Levenshtein scores of 0 (is this the correct approach?).
What's the most convenient way to subset the results? Because I have about 10k columns and 1k rows, I'm not sure there's any way to efficiently visualize the results either. I apologize for the lack of tact in this question. I just lack experience with this.
Using Mark's data, here's a way to build the indices with apply:
rows <- apply(my.data, 1, function(x) any(!x))
cols <- apply(my.data, 2, function(x) any(!x))
my.data[rows, cols]
## V2 V3 V4
## 1 0 2 1
## 3 1 1 0
## 5 0 0 0
This will retain all rows and columns that contain a zero.
set.seed(2234)
my.data <- as.data.frame(matrix(sample(0:2,20,replace=TRUE), nrow=5))
my.data
aa <- unique(which(my.data==0,arr.ind=TRUE)[,1])
bb <- unique(which(my.data==0,arr.ind=TRUE)[,2])
my.data2 <- my.data[sort(aa),sort(bb)]
my.data2
> my.data
V1 V2 V3 V4
1 2 0 2 1
2 2 2 1 2
3 2 1 1 0
4 2 2 2 1
5 1 0 0 0
> my.data2
V2 V3 V4
1 0 2 1
3 1 1 0
5 0 0 0

Resources