Keeping only letters and digits in a string

I am recoding some open survey responses in SPSS and want to keep only the usual characters a-z and 1-9.
I have applied rtrim and ltrim, which worked on the majority, but some strings still have trailing spaces, which I assume are not actually spaces but hidden characters.
I have also removed punctuation such as "?" but I imagine there must be a more straightforward way than going through each one.
e.g. I need
"exam'ple! " or " exam!!--ple?"
to say "example"

The following syntax will create a new clean field and copy to it only the digits and letters (uppercase or lowercase) from the original field.
Note that I used 15 as the new field width and as the number of iterations in the loop. Please change 15 to the actual width of the original field.
* Declare the new string field first; SPSS cannot compute into an undeclared string variable.
string CleanField (A15).
do repeat val=1 to 15.
compute #i = number(char.substr(OrigField, val, 1), PIB).
if range(#i, 48, 57) or
   range(#i, 65, 90) or
   range(#i, 97, 122)
   CleanField=concat(rtrim(CleanField), char.substr(OrigField, val, 1)).
end repeat.
exe.
See the link suggested by @user45392 to understand how/why this works.
Also see this list for additional values you can add to the loop if you'd like.
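For readers who want to check the logic outside SPSS, here is a minimal Python sketch of the same idea: keep a character only if its code falls in one of the three ranges used above (48-57 for digits, 65-90 for uppercase, 97-122 for lowercase). The function name keep_alnum_ascii is made up for illustration and is not part of the SPSS answer.

```python
def keep_alnum_ascii(text):
    """Return text with everything except ASCII letters and digits removed."""
    kept = []
    for ch in text:
        code = ord(ch)
        # Same three ranges as the SPSS loop: digits, uppercase, lowercase
        if 48 <= code <= 57 or 65 <= code <= 90 or 97 <= code <= 122:
            kept.append(ch)
    return "".join(kept)


print(keep_alnum_ascii("exam'ple! "))      # example
print(keep_alnum_ascii(" exam!!--ple?"))   # example
```

Because the test is a whitelist, hidden characters such as a non-breaking space are dropped along with punctuation, with no need to list them one by one.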

Related

Legal Statute Sorting Algorithm (Algorithmic Challenge)

I have designed a down and dirty sorting algorithm for New Jersey Legal Statutes, but I'm looking for a better way. Statutes are formatted in the following manner:
Title - A number < 999 and may include up to 2 letters after the title. Ex. 26, 26A, 26AA
Chapter - Formatted exactly the same as title.
Paragraph - A number < 999 which may be followed by 1-2 letters, and/or a decimal point, and/or a number < 999, and/or a letter, and/or another number. Ex. 25, 25.26, 25a, 25a.26, 25a.26b, 25aa.26, etc.
I convert the title and paragraph to decimals by stripping out the whole number at the beginning and dividing it by 1000, giving me a decimal value which is converted to a string. I have assigned all letters (converted to lowercase) string values from 01-26. I append those values to the end of the initial whole-number string in sequence. Any numbers mixed in are appended as their string value.
The obvious bottleneck is the mess of possibilities in the paragraph section. I have actually split that up into paragraph (before any decimal point) and section. I apply the above logic to each of those parts if it exists.
The required sort order is 17 < 17A < 17AA < 17B < 18.
An example value conversion of 17B:26bb-2a5.1a5 would break down as the following:
Title- .01702
Chapter- .0260202
Paragraph- .002015
Section- .001015
Some more examples of statutes:
17:2-3
18B:2a-1
19AA:3-56g
26:56a-16
1:56-12.123
2:34-15.12a
The method I've devised is pretty dirty. I had to split it up into sections to ensure I had the correct values for each 'section' when converting the whole-number part to a decimal. I'm also using JS (Node), which doesn't handle large numbers well.
If anyone has a more efficient/clean way, any thoughts, or feedback, I'd greatly appreciate it.
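One possible alternative, sketched here in Python rather than JS, is to skip the decimal encoding altogether and compare tuples of tokens. This is not the method described above, only a sketch under the assumption that a statute splits on ':', '-' and '.' into segments and that each segment is a run of numbers and letters. The name statute_key is hypothetical.

```python
import re


def statute_key(statute):
    """Build a tuple sort key for a statute string such as '17B:26bb-2a5.1a5'."""
    key = []
    # Split on the separators between title, chapter, paragraph and section
    # (an en dash is accepted as well as a hyphen).
    for segment in re.split(r"[:.\-\u2013]", statute.strip().lower()):
        tokens = []
        for digits, letters in re.findall(r"(\d+)|([a-z]+)", segment):
            if digits:
                # Numbers compare numerically and sort before letters
                tokens.append((0, int(digits), ""))
            else:
                tokens.append((1, 0, letters))
        key.append(tuple(tokens))
    return tuple(key)


statutes = ["18:2-3", "17B:2-3", "17AA:2-3", "17:2-3", "17A:2-3"]
print(sorted(statutes, key=statute_key))
# ['17:2-3', '17A:2-3', '17AA:2-3', '17B:2-3', '18:2-3']
```

A shorter tuple sorts before a longer one that starts with the same tokens, which gives 17 < 17A < 17AA < 17B < 18 without building large numbers or padded strings. The same tuple-style comparison could be written as a comparator function in JS.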

Excel VBA function returns an incorrect number of characters

I am using the Excel VBA function "Left" to get the left three characters of a string, but the function only returns two. The string has more than three characters, so I'm at a loss for why it would return fewer. The official documentation indicates that the function should return the number of characters requested or the entire string if the string is shorter than the requested character count.
The error appears in the section below, but Trim() appears to be working correctly. I also can't step into the Left() function to see how it's handling the inputs.
part = Trim(part)
pref = Left(part, 3)
EDIT1: I changed the character count in the Left() function to see if the result would change accordingly. Left(part, 2) returned 1 character (just P). It appears the function is systematically returning 1 fewer character than requested.
EDIT2: I also changed the If statement to accept one fewer character for events where Left() returned the incorrect quantity.
In the snippet above, a single V should be accepted but the code skipped to the line indicated by the arrow, which shows that the comparison was still false. This is leading me to believe that there is a non-printing character leading the entire string, but I don't know how to check for that.
As mentioned in the comments, there were non-printing characters preceding the first visible character in the strings. I used the following code to remove other potentially problematic characters:
part = Trim(part)
While Left(part, 1) < Chr(32) And Len(part) > 1
    part = Right(part, Len(part) - 1)
Wend
This offered a more general way to remove problematic leading characters.
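To illustrate both the check and the fix outside VBA, here is a small Python sketch. It lists the character codes at the start of a string, which is one way to spot a non-printing character, and then mirrors the loop above by dropping leading characters below code 32. The function names and the sample value are assumptions made for this example.

```python
def leading_codes(text, count=3):
    """Return the character codes of the first few characters."""
    return [ord(ch) for ch in text[:count]]


def strip_leading_control(text):
    """Drop leading characters below code 32, mirroring the VBA loop."""
    text = text.strip(" ")  # VBA Trim removes leading and trailing spaces
    while len(text) > 1 and text[0] < chr(32):
        text = text[1:]
    return text


sample = "\x01PV123"  # hypothetical value with a hidden leading character
print(leading_codes(sample))               # [1, 80, 86]
print(strip_leading_control(sample)[:3])   # PV1
```

A code below 32 at the front of the list is the hidden character that makes Left() appear to return one character too few.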

How to sort an output file based on second parameter?

I am trying to sort a result I write into an output file after a certain character ':' by the integer after the character.
First, I tried using the sort function, but it did not work because the data was not a list. I also tried converting everything to a list of strings and sorting that, but I do not think it's the most efficient way to go.
NOTE: output file lines are all strings
**Current written output in output_file1.txt:**
hi: 011
hello: 000
hero: 001
You are done!
**Expected written output in output_file1.txt:**
hello: 000
hero: 001
hi: 011
You are done!
Thank you for your help.
with open(filepath) as file:
    r = file.readlines()
# Keep the first and last lines in place and sort the lines between them
# by the integer after ":" (the binary-looking numbers)
s = sorted(r[1:-1], key=lambda line: int(line.split(":")[1]))
s.insert(0, r[0])
s.append(r[-1])
# Write 's' into the file
I would insert the lines in order from the very beginning; you can do that efficiently using binary search.
Q: How do you compare the number in the current line with what is already in your file?
Answer:
Case 1: If the maximum number of lines is 111 (in your example you start with 001, so I assume you padded with zeros to show how many digits you expect per number), or you know the maximum number of lines, you can pad with a sufficient number of zeros. Then all you have to do is compare your current number with the last three characters of each line (line[-3:]).
Case 2: You don't know the number of digits:
Sol 2.1: You can try storing one file for words and one file for numbers and updating them in parallel; this will save you the overhead of Sol 2.2.
Sol 2.2: Split the line by the delimiter ':' and get the number (don't forget that there is a space after the delimiter).
That is what I could come up with for now!
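A minimal sketch of the binary-search insertion idea, using Python's bisect module. It assumes each line has the form "word: number" and keeps a parallel list of the numeric keys, roughly in the spirit of Sol 2.1. The helper name insert_sorted is made up for illustration.

```python
import bisect


def insert_sorted(lines, keys, new_line):
    """Insert new_line into lines so that lines stay ordered by their number.

    keys is a parallel list holding the integer of each line in lines.
    """
    key = int(new_line.rpartition(":")[2])   # int() ignores the leading space
    pos = bisect.bisect_right(keys, key)     # binary search for the slot
    keys.insert(pos, key)
    lines.insert(pos, new_line)


lines, keys = [], []
for entry in ["hi: 011", "hello: 000", "hero: 001"]:
    insert_sorted(lines, keys, entry)
print(lines)  # ['hello: 000', 'hero: 001', 'hi: 011']
```

The lines can then be written to the file once, followed by the closing "You are done!" line.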
As mentioned above, there is no way to sort a file in place.
Regarding your second question, you can use list.sort(key=sort_key). This allows you to supply a method which is applied to every element in your list when comparing the elements for sorting.
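In your case you can define a simple function which extracts the last three characters of each line, so that the list is sorted alphabetically by them:
def num_sort(x):
    # Strip the trailing newline first so the slice holds the three digits
    return x.strip()[-3:]
your_list.sort(key=num_sort)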
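Putting the pieces together for the sample file in the question, one possible sketch is shown below. It assumes that data lines look like "word: number" and that any other line, such as "You are done!", should stay after the sorted data. The function name sort_numbered_lines is hypothetical.

```python
def sort_numbered_lines(lines):
    """Sort 'word: number' lines by number and keep other lines after them."""
    data, other = [], []
    for line in lines:
        _, sep, number = line.rpartition(":")
        if sep and number.strip().isdigit():
            data.append(line)
        else:
            other.append(line)
    # int() tolerates the space after ':' and the trailing newline
    data.sort(key=lambda line: int(line.rpartition(":")[2]))
    return data + other


sample = ["hi: 011\n", "hello: 000\n", "hero: 001\n", "You are done!\n"]
print("".join(sort_numbered_lines(sample)), end="")
```

With a real file, the list would come from file.readlines() and the result would be written back with writelines().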

How to capture only one word from a sentence in Excel?

AWESOME :)
Another QUESTION:
What if I have multiple Sentences like:
[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)
[PATCHING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (CCB)
Notice that the last word that I need to take out varies from sentence to sentence. How can I make sure that I always take out the last part of the sentence? In this case, (AMB) and (CCB).
I also need to do the same with the words at the beginning:
[PROGRAMMING]
[PATCHING]
Thanks :)
You can use this for the part within []:
=MID(A2,2,FIND("]",A2)-2)
And this for the part within ():
=MID(A2,FIND("(",A2)+1,3)
googlespreadsheet sample
MID takes 3 parameters:
A text,
A starting position,
The length of the extracted text.
FIND takes 2-3 parameters and returns a position number:
Something it will look for,
The text in which it will look for the something,
The position from where it'll start looking. If not mentioned, looks from the beginning.
=MID(A2,2,FIND("]",A2)-2) with your first example becomes the following after replacing the innermost evaluation:
=MID(A2,2,FIND("]","[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)")-2)
FIND("]","[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)")
] appears at the 13th position, so this FIND() returns 13. The MID becomes:
=MID(A2,2,13-2) => =MID(A2,2,11)
And if you count the characters in PROGRAMMING, there are 11. I removed 2, because 1 is for the beginning [ to be removed, the second is for the ] to be removed.
Now, it becomes:
=MID("[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)",2,11)
Which means start (including) at character 2 and take 11 characters, which gives the text you are looking for.
The one for () is just as simple if you got the above.
You can use the MID() function if the data always follows the same pattern.
=MID(A1, FIND("(",A1, 1) + 1, LEN(A1)-FIND("(",A1, 1)-1)
Assuming the string is in A1. The first parameter is the string. The second is the start of the substring to extract. You want to start one character past the first parenthesis. The last parameter is the length of the substring to extract. You want to take the whole string minus all the characters before the parenthesis and also ignore the last parenthesis (thus the -1).
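The same position arithmetic can be checked with a short Python sketch, where str.find and slicing play the roles of FIND and MID. This is only an illustration of the logic, not a replacement for the worksheet formulas, and the function name extract_parts is an assumption.

```python
def extract_parts(text):
    """Return (text inside the leading [...], text inside the last (...))."""
    close_bracket = text.find("]")
    open_paren = text.rfind("(")
    close_paren = text.rfind(")")
    if not text.startswith("[") or close_bracket == -1:
        return None
    if open_paren == -1 or close_paren < open_paren:
        return None
    # Skip the opening '[' and '(' and stop before the closing ']' and ')'
    return text[1:close_bracket], text[open_paren + 1:close_paren]


row = "[PROGRAMMING]-Old System-TRT Operates-192.168.6.0-qwert8-plain (AMB)"
print(extract_parts(row))  # ('PROGRAMMING', 'AMB')
```

Like the LEN-based formula, this takes everything between the parentheses, so it still works when the last word is not exactly three characters long.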

Strange trim function behavior

I wonder why I get an empty string as a result when I'm expecting something completely different...
I use trim function to cut phone number from string:
select trim(leading '509960405' from '509960405509960404');
Why the result isn't 509960404 as expected?
trim strips out any characters matching a list of characters. All the characters in your string are in your "leading" list of characters. What you wrote could just as easily be written as
select trim(leading '04569' from '509960405509960404');
It removes any 0, 4, 5, 6 or 9 characters from the beginning of your string. Since your string consists of only 0, 4, 5, 6, or 9 characters, it removes them all.
@Paul clarified the behaviour of trim().
The solution you presented in the comment is potentially treacherous:
SELECT replace('509960405509960404','509960405','')
Replaces all occurrences of '509960405' not just the first. For example:
SELECT replace('509960405509960404','50996040' ,'');
Results in 54. I suspect that's not what you want.
Use a regular expression with regexp_replace():
SELECT regexp_replace('509960405509960404','^509960405' ,'');
^ .. anchors the pattern to the start of the string ("left-anchored").
regexp_replace() is more expensive than a simple replace() but also more versatile.
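The three behaviours discussed in this thread have close analogues in Python, which makes them easy to compare side by side. The sketch below is an illustration in Python, not SQL: str.lstrip treats its argument as a set of characters much like trim(leading ...), str.replace removes every occurrence, and an anchored re.sub removes the prefix only at the start.

```python
import re

phone = "509960405509960404"
prefix = "509960405"

# lstrip() strips any leading characters found in the given set
stripped_by_set = phone.lstrip(prefix)

# replace() removes every occurrence, not only the leading one
replaced_everywhere = phone.replace("50996040", "")

# An anchored regular expression removes the prefix only at the start
anchored = re.sub("^" + re.escape(prefix), "", phone)

print(repr(stripped_by_set))   # ''
print(replaced_everywhere)     # 54
print(anchored)                # 509960404
```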

Resources