Tcl: extract 2 integers from a string into a list?

I have 2 strings formatted like this:
(1234, 4567)
And I have a list
points {0 1 2 4}
I would like to extract the 2 integers from the first string and use them to replace the first two integers in the list, then extract the two integers from the second string and use them to replace the 3rd and 4th integers, so that I end up with a list of 4 integers taken from the two strings.
So far I have tried all kinds of things, but I always end up with errors or with brackets in the list, which I do not want. I feel I am missing the easy way to do this.

With the first set of values, you can parse with scan or regexp; in this case, I think scan looks better:
set input "(1234, 5678)"
scan $input "(%d,%d)" a b
To update a Tcl list (formally, one in a variable), you use lset; you can give a sequence of (zero-based) indices to it to navigate into the exact place in the list where you want to update:
set workingArea "points {0 1 2 4}"
lset workingArea 1 2 $a
lset workingArea 1 3 $b
puts $workingArea
# prints: points {0 1 1234 5678}
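For comparison, here is a minimal Python sketch of the same parse-then-update idea, this time filling all four slots as the question asks. The helper name parse_pair and the second input string "(89, 10)" are made up for illustration; only "(1234, 4567)" comes from the question.

```python
import re

def parse_pair(text):
    """Extract the two integers from a string shaped like "(1234, 5678)"."""
    match = re.fullmatch(r"\s*\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)\s*", text)
    if match is None:
        raise ValueError(f"not a pair of integers: {text!r}")
    return int(match.group(1)), int(match.group(2))

points = [0, 1, 2, 4]
# Replace the first two integers with the pair from the first string.
points[0:2] = parse_pair("(1234, 4567)")
# Replace the 3rd and 4th integers with the pair from the (hypothetical) second string.
points[2:4] = parse_pair("(89, 10)")
print(points)
```

Slice assignment plays roughly the role that lset plays in the Tcl answer: it updates positions in the existing list rather than building a nested one, so no extra brackets appear.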

Related

How to convert negations and single words with same repetitive letter

I have a data frame with a column of text data. I want to remove meaningless words and expand negations like "isn't" to "is not". Otherwise, once I remove punctuation, "isn't" becomes "isn t", and when I then remove words shorter than 2 letters, the "t" is deleted completely. So I want to do the following 3 tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove words shorter than 2 letters
For example, the df column looks like this:
user_id text data column
1 it's the coldest day
2 they aren't going
3 aa
4 how are you jkhf
5 v
6 ps
7 jkhf
The output should be-
user_id text data column
1 it is the coldest day
2 they are not going
3
4 how are you
5
6
7
How to implement this?
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function,
to accumulate a list of repetitive words.
Then add such words to your list of stop words.
1) Use SpaCy or NLTK's lemmatization tools to convert strings (though they do other things like convert plural to singular as well - so you may end up needing to write your own code to do this).
2) Use stopwords from NLTK or spacy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the row.
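As a rough sketch of how the three steps could fit together without any NLP library: the lookup tables below are hypothetical and deliberately tiny (just enough for the sample rows in the question). A real project would swap in a fuller contraction map and a proper vocabulary or stop-word list, for example from NLTK or spaCy.

```python
import re

# Hypothetical lookup tables, sized only for the sample data in the question.
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "it's": "it is"}
KNOWN_WORDS = {"it", "is", "the", "coldest", "day", "they", "are", "not",
               "going", "how", "you"}

def clean_text(text):
    """Expand contractions, then keep only known words of length >= 2."""
    kept = []
    for token in text.lower().split():
        # Step 1: expand the contraction before any punctuation is stripped.
        expanded = CONTRACTIONS.get(token, token)
        for word in expanded.split():
            word = re.sub(r"[^a-z]", "", word)  # drop leftover punctuation
            # Steps 2 and 3: drop meaningless and too-short words.
            if len(word) >= 2 and word in KNOWN_WORDS:
                kept.append(word)
    return " ".join(kept)

rows = ["it's the coldest day", "they aren't going", "aa", "how are you jkhf", "v"]
for row in rows:
    print(repr(clean_text(row)))
```

With a pandas data frame, the same function could presumably be applied to the text column with df["text"].apply(clean_text), where the column name is an assumption.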

How to interpret Tcl "list as string" as list?

For some unknown reason, the variable result in the following line of code
set result [[$sqlCmd execute] allrows -as lists]
gets a string which looks like a list: {2 3 4 5}
If I write puts "result $result => [llength $result]", it prints {2 3 4 5} => 1
If I write puts [list $result], it prints {{2 3 4 5}}, which is correct because list creates a list from one string.
Is there any way to convert this string into what it is expected to be, a list, without string processing steps such as deleting the braces and splitting the string with split? I suppose there must be some way to reinterpret it, but I'm unable to find a nice solution.
The allrows method always returns a list, one per row (even when there's only a single row returned). When the -as lists option is passed in, each element of that list is itself a list representing the columns in that row.
Thus, to iterate over the columns of that row, you'd do:
set result [[$sqlCmd execute] allrows -as lists]
set rowresult [lindex $result 0]
foreach col $rowresult {
    puts "I've got a '$col'"
}
You're usually recommended to use the default that represents rows as dictionaries indexed by column name, as that has a better representation of SQL NULLs (i.e., the column is absent then instead of being the driver-designated null value, which is often and ambiguously the empty string).
allrows is giving you a list of lists, each sublist representing a row. There is 1 row in this list, 2 3 4 5, so the length is 1. You can index or iterate over the list the usual ways to access its one element.
# If you're assuming there will only be one row
set only_row [lindex $result 0]
# Or if you want to iterate over all rows
foreach row $result {
    # do whatever with $row, for example:
    puts $row
}

Mapping of elements by number of occurrences in J

Using J language, I wish to attain a mapping of the counts of elements of an array.
Specifically, I want to input a lowercased English word with two or more letters and get back each pair of letters in the word along with its count of occurrences.
I need a verb that gives something like this, in whatever J structure you think is appropriate:
For 'cocoa':
co 2
oc 1
oa 1
For 'banana':
ba 1
an 2
na 2
For 'milk':
mi 1
il 1
lk 1
For 'to':
to 1
(For single letter words like 'a', the task is undefined and will not be attempted.)
(Order is not important, that's just how I happened to list them.)
I can easily attain successive pairs of letters in a word as a matrix or list of boxes:
2(] ;._3)'cocoa'
co
oc
co
oa
2(< ;._3)'cocoa'
┌──┬──┬──┬──┐
│co│oc│co│oa│
└──┴──┴──┴──┘
But I need help getting from there to a mapping of pairs to counts.
I am aware of ~. and ~: but I don't just want to return the unique elements or indexes of duplicates. I want a mapping of counts.
NuVoc's "Loopless" page is indicating that / (or /\. or /\) are where I should be looking for accumulation problems. I am familiar with / for arithmetic operations on numeric arrays, but for u/y I don't know what u would have to be to accumulate the list of pairs of letters that would make up y.
(NB. I can already do this in "normal" languages like Java or Python without help. Similar questions on SO are for languages with very different syntax and semantics to J. I am interested in the idiomatic J approach to this sort of problem.)
To get the list of 2-letter combinations I'd use dyadic infix (\):
2 ]\ 'banana'
ba
an
na
an
na
To count occurrences the primitive that immediately comes to mind is key (/.)
#/.~ 2 ]\ 'banana'
1 2 2
If you want to match the counts to the letter combinations you can extend the verb to the following fork:
({. ; #)/.~ 2 ]\ 'banana'
┌──┬─┐
│ba│1│
├──┼─┤
│an│2│
├──┼─┤
│na│2│
└──┴─┘
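For readers more at home outside J, here is a rough Python analogue of the infix-then-key idea: build the overlapping 2-letter windows, then count them. The function name pair_counts is made up for illustration.

```python
from collections import Counter

def pair_counts(word):
    """Count each overlapping 2-letter window of word."""
    # Same windows that 2 ]\ word produces in the J answer.
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    # Counter groups equal items and counts them, much like #/.~ does.
    return Counter(pairs)

for pair, count in pair_counts("banana").items():
    print(pair, count)
```

Counter is a dict subclass, so on Python 3.7+ the pairs come back in order of first appearance, which matches the order of the J result.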
I think that you are looking to map counts of unique items to the items. You can correct me if I am wrong.
Starting with
[t=. 2(< ;._3)'cocoa'
┌──┬──┬──┬──┐
│co│oc│co│oa│
└──┴──┴──┴──┘
You can use ~. (Nub) to return the unique items in the list
~.t
┌──┬──┬──┐
│co│oc│oa│
└──┴──┴──┘
Then if you compare the nub to the boxed list you get a matrix where the 1's are the positions that match the nub to the boxed pairs in your string
t =/ ~.t
1 0 0
0 1 0
1 0 0
0 0 1
Sum the columns of this matrix and you get the number of times each item of the nub shows up
+/ t =/ ~.t
2 1 1
Then box them so that you can combine the integers along side the boxed characters
<"0 +/ t =/ ~.t
┌─┬─┬─┐
│2│1│1│
└─┴─┴─┘
Combine them by stitching together the nub and the count using ,. (Stitch)
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│co│2│
├──┼─┤
│oc│1│
├──┼─┤
│oa│1│
└──┴─┘
[t=. 2(< ;._3)'banana'
┌──┬──┬──┬──┬──┐
│ba│an│na│an│na│
└──┴──┴──┴──┴──┘
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│ba│1│
├──┼─┤
│an│2│
├──┼─┤
│na│2│
└──┴─┘
[t=. 2(< ;._3)'milk'
┌──┬──┬──┐
│mi│il│lk│
└──┴──┴──┘
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│mi│1│
├──┼─┤
│il│1│
├──┼─┤
│lk│1│
└──┴─┘
Hope this helps.
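The nub / comparison table / column sum steps above can also be mirrored step by step in Python. This is only a sketch to show the shape of the computation; the names nub and counts_via_table are made up here.

```python
def nub(items):
    """Unique items in order of first appearance (the role of ~. above)."""
    return list(dict.fromkeys(items))

def counts_via_table(items):
    unique = nub(items)
    # One row per item, one column per unique item: 1 where they match.
    table = [[int(item == u) for u in unique] for item in items]
    # Summing each column gives how often that unique item occurs.
    totals = [sum(column) for column in zip(*table)]
    # Pair each unique item with its count (the role of stitch).
    return list(zip(unique, totals))

print(counts_via_table(["co", "oc", "co", "oa"]))
```

Like the J version, this builds a full items-by-unique table, so it is fine for short words but wasteful for long inputs.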

Count number of occurrences of a string and relabel

I have an n x 1 cell array that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the underscore (_) is necessary. Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
1. Use unique, and its third output in particular, to assign each string in your cell array a unique ID.
2. Initialize an empty array, then create a for loop that goes through each unique string (given by the first output of unique) and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
3. Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' is important as the IDs that get assigned to each unique string are done without re-ordering the elements in alphabetical order, which is important to get the job done. names will store the unique names found in your array A while ids would contain unique IDs for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output:
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids) % names is not returned here, so count the unique IDs instead
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
If you want an alternative to unique, you can work with a hash table, which in MATLAB means using the containers.Map object. You can then store the number of occurrences of each individual label and create the new labels on the go, as in the code below.
data={'table','table','chair','bike','bike','bike'};
map=containers.Map(data,zeros(numel(data),1)); % labels=keys, counts=values (zeroed)
new_data=data; % initialize matrix that will have outputs
for ii=1:numel(data)
map(data{ii}) = map(data{ii})+1; % increment counts of current labels
new_data{ii} = sprintf('%s_%d',data{ii},map(data{ii})); % format outputs
end
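The same running-count idea is easy to sketch in Python with a dictionary, which may help if the hash-table approach is new to you. The function name label_occurrences is made up for illustration.

```python
from collections import defaultdict

def label_occurrences(items):
    """Append _k to each item, where k is how often it has appeared so far."""
    seen = defaultdict(int)  # label -> occurrences seen so far
    labelled = []
    for item in items:
        seen[item] += 1
        labelled.append(f"{item}_{seen[item]}")
    return labelled

data = ["table", "table", "chair", "bike", "bike", "bike"]
print(label_occurrences(data))
```

Because the count is kept per label, a label that reappears later simply continues from where it left off, as in the desired output in the question.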
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c); % transform each string into a unique label
s = sum(tril(bsxfun(@eq, labels, labels.')), 2); % accumulated occurrence number
result = strcat(c, '_', num2str(s)); % build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
I'll give you some pseudocode; try it yourself, and post your code if it doesn't work.
Initialize a counter to 1
Iterate over the cell array
    If the current string is the same as the previous value
        increment the counter
    else
        reset the counter to 1
    end
    sprintf the string value + counter into a new array
Hope this helps!

How to change stringified numbers in data frame into pure numeric values in R

I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is convert the values from strings into pure numbers
directly in the df variable, while preserving the string names for employee.
I tried this, but it doesn't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric
values from df. Such as df2 <- log2(df), etc.
OK, there are a couple of things going on here:
R has two different data types that look like strings: factor and character.
You can't modify most R objects in place; you have to change them by assignment.
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.
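To make the levels-and-codes idea concrete, here is a small Python sketch that imitates (it does not reproduce R's actual implementation) how a factor stores values. All variable names are made up for illustration.

```python
values = ["21000", "23400", "26800"]

# A factor keeps the sorted unique strings as its levels...
levels = sorted(set(values))
# ...and stores each value as a 1-based index into those levels.
codes = [levels.index(v) + 1 for v in values]
print(codes)  # what as.numeric(df$salary) exposes

# Going through the level strings first recovers the real numbers,
# which is what as.numeric(as.character(df$salary)) does.
recovered = [float(levels[code - 1]) for code in codes]
print(recovered)
```

This is why converting the factor straight to numeric yields 1 2 3: you get the stored codes, not the text they stand for.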
