Break sentence to letters and rearrange them - string

I've tried to break a sentence down into letters and rearrange them alphabetically.
Please let me know if I could improve this code in some way.
Kind regards
sen = "the quick brown fox jumps over the lazy dog"
smallest = []
re = ''
while len(sen) > 0:
    smallest.append(min(sen))
    print(ord(min(sen)))
    re = re + min(sen)
    # remove the first occurrence of the smallest character
    sen = sen[:sen.index(min(sen))] + sen[sen.index(min(sen)) + 1:]
print(smallest)  # list
print(re)  # string

What you're doing is exactly like sorting an array of numbers: each character has a numeric value. There are many ways to sort an array of numbers; some are much faster than others, and which one suits you depends on what you need. The simplest to code are insertion sort, bubble sort, and selection sort, although all three take quadratic time; merge sort and quicksort scale better on long inputs.
You can code them yourself or find them already implemented in many languages.
There are other ways to sort an array; you can find all of them here:
https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms
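Since the question's code is Python, here is a minimal sketch of the "already done" route using the built-in sorted(). The variable name rearranged is just an illustrative choice, and spaces are kept because the original code keeps them too.

```python
sen = "the quick brown fox jumps over the lazy dog"

# sorted() returns a list of single-character strings in code-point order.
smallest = sorted(sen)

# Join the list back into one string; the 8 spaces sort before the letters.
rearranged = "".join(smallest)

print(smallest)    # list
print(rearranged)  # string
```

This replaces the repeated min/index/slice loop with a single call, so the work grows as n log n rather than quadratically.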

Related

Is there anything else used instead of slicing the String?

This is one of the practice problems from Problem solving section of Hackerrank. The problem statement says
Steve has a string of lowercase characters in range ascii['a'..'z']. He wants to reduce the string to its shortest length by doing a series of operations. In each operation he selects a pair of adjacent lowercase letters that match, and he deletes them.
For example : 'aaabbccc' -> 'ac' , 'abba' -> ''
I have tried solving this using string slicing, but this gives me a timeout error on larger strings. Is there anything else I could use?
My code:
s = list(input())
i = 1
while i < len(s):
    if s[i] == s[i-1]:
        s = s[:i-1] + s[i+1:]
        # step back to re-check the newly adjacent pair, but never below index 0
        i = max(i - 2, 0)
    i += 1
if len(s) == 0:
    print("Empty String")
else:
    print(''.join(s))
This gives me terminated due to timeout message.
Thanks for your time :)
Allocating a fresh copy of the sequence on every deletion can be expensive,
as each slice has O(N) linear cost with the length of the string.
Consider processing "aa" * int(1e6).
You will write on the order of 1e12 characters to memory
by the time you're finished.
Take a moment (well, take linear time) to
copy each character over to a mutable list element:
[c for c in giant_string]
Then you can perform dup processing by writing a tombstone
of "" to each character you wish to delete,
using just constant time.
Finally, in linear time you can scan through the survivors using "".join( ... )
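A related way to get linear time is to keep the survivors on a list used as a stack. To be clear, this is a swapped-in technique and not the tombstone scheme described above; it avoids having to skip over tombstones when looking for the previous surviving character. The function name reduce_pairs is hypothetical.

```python
def reduce_pairs(s):
    """Repeatedly delete adjacent equal pairs; return what is left."""
    stack = []  # characters that have survived so far
    for ch in s:
        if stack and stack[-1] == ch:
            # ch matches the last survivor: delete the pair in O(1)
            stack.pop()
        else:
            stack.append(ch)
    # One linear pass to build the final string
    return "".join(stack)


result = reduce_pairs("aaabbccc")
print(result if result else "Empty String")
```

Each character is pushed at most once and popped at most once, so the total work is proportional to the input length, even for "aa" * int(1e6).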
One other possible solution is to use regex. The pattern ([a-z])\1 matches a duplicate lowercase letter. The implementation would involve something like this:
import re
pattern = re.compile(r'([a-z])\1')
while pattern.search(s):  # While match is found
    s = pattern.sub('', s)  # Remove all matches from "s"
I'm not an expert at efficiency, but this seems to write fewer strings to memory than your solution. For the case of "aa" * int(1e6) that J_H mentioned, it will only write one, thanks to pattern.sub replacing all occurrences at once.

Cell array of strings not fully sorted?

I have the following Matlab function I'm working on:
function [data] = ReadAndCountWords(fileName)
    fid = fopen(fileName);
    data = textscan(fid, '%s');
    data = sort(data{1});
    for i = 1:length(data)
        str = data{i};
        str = lower(str(isstrprop(str, 'alpha')));
        disp(str);
    end
    fclose(fid);
end
Right now I am passing in a text document containing The Gettysburg Address, and I want to print out the words contained in that file in order of how many times the word occurs. To get the word count I figured I would sort the cell array and then do a string comparison within my loop since it seemed simple enough. So I tried sorting my cell array with both sortrows() and sort(), but the results are the same:
but
four
god
it
it
it
now
the
the
we
we
a
a
a
a
a
a
a
above
add
advanced
ago
all
altogether
and
and
and
and
and
and
any
are
are
are
as
battlefield
be
be
before
birth
brave
brought
but
by
can
can
cannot
cannot
cannot
cause
civil
come
conceived
conceived
consecrate
consecrated
continent
created
dead
dead
dead
dedicate
dedicate
dedicated
dedicated
dedicated
dedicated
detract
devotion
devotion
did
died
do
earth
endure
engaged
equal
far
far
fathers
field
final
fitting
for
for
for
for
for
forget
forth
fought
freedom
from
from
full
gave
gave
government
great
great
great
ground
hallow
have
have
have
have
have
here
here
here
here
here
here
here
here
highly
honored
in
in
in
in
increased
is
is
is
it
it
larger
last
liberty
little
live
lives
living
living
long
long
measure
men
men
met
might
nation
nation
nation
nation
nation
never
new
new
nobly
nor
not
not
note
of
of
of
of
of
on
on
or
or
our
our
people
people
people
perish
place
poor
portion
power
proper
proposition
rather
rather
remaining
remember
resolve
resting
say
score
sense
seven
shall
shall
shall
should
so
so
so
struggled
take
task
testing
that
that
that
that
that
that
that
that
that
that
that
that
that
the
the
the
the
the
the
the
the
the
their
these
these
they
they
they
this
this
this
this
those
thus
to
to
to
to
to
to
to
to
under
unfinished
us
us
us
vain
war
war
we
we
we
we
we
we
we
we
what
what
whether
which
which
who
who
who
will
work
world
years
Why are those first 11 words out of order? I did some research on it and couldn't find anyone having the same problem, and the Matlab documentation seems to be doing it the same way I am. Any suggestions?
Some of the words start with uppercase letters. You call the sort before the "lower" command then display the words. If you put a breakpoint at line 4, you can inspect the sorted data:
'But,'
'Fourscore'
'God,'
'It'
'It'
'It'
'Liberty,'
'Now'
'The'
'The'
'We'
'We'
'a'
'a'
'a'
'a'
'a'
'a'
'a'
'above'
...
This is sorted correctly in character-code order, where uppercase letters come before lowercase ones.
To count number of occurrences you can also use unique as follows:
data = {'and' 'And, ' 'cut' 'be.' 'dear' 'be' 'eggs' 'egg'}; %// example data
data = regexprep(lower(data), '[^a-z]', ''); %// make lower and remove special chars
[words, ~, labels] = unique(data);
count = histc(labels, 1:max(labels));
Result:
words =
'and' 'be' 'cut' 'dear' 'egg' 'eggs'
count =
2 2 1 1 1 1
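For comparison, the same normalize-then-count idea can be sketched in Python with collections.Counter. This is an illustration in a different language from the question, not a translation of the MATLAB function; the name count_words is hypothetical. It lowercases and strips non-letters before counting, then lists words from most to least frequent, which is the ordering the question asks for.

```python
import re
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    # Lowercase first, split on whitespace, then drop non-letters per token,
    # so "And," and "and" end up as the same word.
    cleaned = (re.sub(r"[^a-z]", "", token) for token in text.lower().split())
    # Skip tokens that were pure punctuation and became empty.
    return Counter(word for word in cleaned if word).most_common()


for word, count in count_words("and And, cut be. dear be eggs egg"):
    print(word, count)
```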
You're converting the strings to lower-case after you sort them. Those 11 words are the ones capitalized in the original text, thus they come at the front of the list.

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
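The two-index loop described above can be sketched as follows. It is written in Python rather than MATLAB, uses only indexing and len (no search functions), and the name is_subsequence is hypothetical.

```python
def is_subsequence(query, subject):
    """True if the chars of query appear in subject in the same order."""
    q = 0  # index into query
    s = 0  # index into subject
    while q < len(query) and s < len(subject):
        if query[q] == subject[s]:
            q += 1  # matched: advance the query index as well
        s += 1      # the subject index always advances
    # Found only if every query character was matched
    return q == len(query)


print(is_subsequence("da", "dura"))
print(is_subsequence("ad", "dura"))
```

The loop visits each character of the subject at most once, so it runs in time linear in the subject length.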
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(@eq, subject.', query);
%'// m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; %'// jj: which char of query is found within subject...
ii = ii.'; %'// ii: ... and at which position
test = nchoosek(1:numel(jj),numel(query)).'; %'// test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
%'// cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
%// cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); %// final result: 1 or 0
The code could be improved by using a better approach with regard to test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

Groovy String Comparison

I need to know if two strings "match" where "matching" basically means that there is significant overlap between the two strings. For example, if string1 is "foo" and string2 is "foobar", this should be a match. If string2 was "barfoo", that should also be a match with string1. However, if string2 was "fobar", this should not be a match. I'm struggling to find a clever way to do this. Do I need to split the strings into lists of characters first or is there a way to do this kind of comparison already in Groovy? Thanks!
Using Apache commons StringUtils:
@Grab( 'org.apache.commons:commons-lang3:3.1' )
import static org.apache.commons.lang3.StringUtils.getLevenshteinDistance
int a = getLevenshteinDistance( 'The quick fox jumped', 'The fox jumped' )
int b = getLevenshteinDistance( 'The fox jumped', 'The fox' )
// Assert a is more similar than b
assert a < b
Levenshtein distance tells you the number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another.
So to get from 'The quick fox jumped' to 'The fox jumped', you need to delete 6 chars (so it has a score of 6).
And to get from 'The fox jumped' to 'The fox', you need to delete 7 chars.
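To make the scores above concrete, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance, in Python. It is not the Apache Commons implementation, and the function name levenshtein is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-char insertions, deletions, substitutions."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching char
            ))
        prev = curr
    return prev[-1]


print(levenshtein("The quick fox jumped", "The fox jumped"))  # 6
print(levenshtein("The fox jumped", "The fox"))               # 7
```

A lower score means the strings are closer, so a threshold on this value is one way to decide what counts as a "match".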
As per your examples, plain old String.contains may suffice:
assert 'foobar'.contains('foo')
assert 'barfoo'.contains('foo')
assert !'fobar'.contains('foo')

MATLAB string handling

I want to calculate the frequency of each word in a string. For that I need to turn string into an array (matrix) of words.
For example take "Hello world, can I ask you on a date?" and turn it into
['Hello' 'world,' 'can' 'I' 'ask' 'you' 'on' 'a' 'date?']
Then I can go over each entry and count every appearance of a particular word.
Is there a way to make an array (matrix) of words in MATLAB, instead of array of just chars?
Here is a little simpler regexp:
words = regexp(s,'\w+','match');
\w here means any symbol that can appear in words (including underscore).
Notice that the last question mark will not be included. Do you need it for counting words actually?
Regular expressions
s = 'Hello world, can I ask you on a date?'
slist = regexp(s, '[^ ]*', 'match')
yields
slist =
'Hello' 'world,' 'can' 'I' 'ask' 'you' 'on' 'a' 'date?'
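The difference between the two patterns above is easy to see side by side. The sketch below uses Python's re module for illustration rather than MATLAB's regexp; it uses [^ ]+ instead of [^ ]* because Python's findall would otherwise also return empty matches.

```python
import re

s = "Hello world, can I ask you on a date?"

# \w+ keeps only word characters, so punctuation is dropped.
words = re.findall(r"\w+", s)

# [^ ]+ keeps every run of non-space characters, punctuation included.
tokens = re.findall(r"[^ ]+", s)

print(words)
print(tokens)
```

For counting word frequency the first form is usually the more useful one, since 'world,' and 'world' would otherwise be counted as different words.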
Another way to do it is like this:
s = cell(java.lang.String('Hello world, can I ask you on a date?').split('[^\w]+'));
I.e. by creating a Java String object and using its methods to do the work, then converting back to a cell array of strings. Not necessarily the best way to do a job this simple, but Java has a rich library of string handling methods & classes that can come in handy.
Matlab's ability to switch into Java at the drop of a hat can come in handy sometimes - for example, when parsing & writing XML.

Resources