Substitute string by index without using regular expressions - string

It should be very easy, but I am looking for an efficient way to perform it.
I know that I could split the string into two parts and insert the new value, but I have tried to substitute each line between the indexes 22-26 as follows:
line.replace(line[22:26],new_value)
The Problem
However, that function substitutes everything in the line that is similar to the pattern in line[22:26].
In the example below, I want to replace the marked number 1 with number 17:
Here are the results. Note the replacement of 1 with 17 in several places:
Thus I don't understand the behavior of replace command. Is there a simple explanation of what I'm doing wrong?
Why I don't want RE
The values between indexes 22-26 do not have a uniform format.
Note: I am using python 3.5 on Unix/Linux machines.

str.replace replaces every occurrence of one substring with another, everywhere in the string.
e.g.
'ab cd ab ab'.replace('ab', 'xy')
# produces output 'xy cd xy xy'
similarly,
mystr = 'ab cd ab ab'
mystr.replace(mystr[0:2], 'xy')
# also produces output 'xy cd xy xy'
What you could do instead, to replace just the characters in positions 22-26, is slice the string around them:
line = line[0:22] + new_value + line[26:]
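To make the difference concrete, here is a minimal sketch. The helper name replace_span and the sample line are made up for illustration; only the slicing expression itself comes from the answer above.

```python
def replace_span(line, start, end, new_value):
    """Return a copy of line with line[start:end] replaced by new_value."""
    return line[:start] + new_value + line[end:]


# Hypothetical sample: the substring "1" also occurs outside the target span.
line = "id=1 count=1 flag=1"

# str.replace searches for the *text* of the slice, so every "1" is replaced.
print(line.replace(line[3:4], "17"))   # id=17 count=17 flag=17

# Slicing touches only the characters at the given positions.
print(replace_span(line, 3, 4, "17"))  # id=17 count=1 flag=1
```

Because Python strings are immutable, both calls return a new string; the result has to be assigned back (as in line = ...) to take effect.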
Also, looking at your data, it seems to me to be a fixed-width text file. While my suggestion will work, a more robust way to process this data would be to read it & separate the different fields in the record first, before processing the data.
If you have access to the pandas library, it provides a useful function just for reading fixed-width files (pandas.read_fwf).

Python regular expressions with Foreign characters in python PyQT5

This problem might be very simple but I find it a bit confusing & that is why I need help.
With relevance to this question I posted that got solved, I got a new issue that I just noticed.
Source code:
from PyQt5 import QtCore,QtWidgets
app=QtWidgets.QApplication([])
def scroll():
    #QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())
window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat","catirization","cat in the clouds","catść"]
for i,cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
Output GUI:
Now as you can see I am just trying to print out the text data based on the regex r"\bcat\b" when I press the "Scroll" button and it works fine!
Output:
0 love my cat
2 cat in the clouds
3 catść
However, as you can see, item #3 should not be printed because it obviously does not match the regular expression r"\bcat\b". Yet it is printed, and I think it has something to do with the foreign characters ść, which somehow make it a match (which shouldn't happen, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
Researches I have tried
I found this question and it says something about this \p{L} & based on the answer it means:
If all you want to match is letters (including "international"
letters) you can use \p{L}.
To be honest, I'm not sure how to apply that with PyQt5. Still, I made some attempts and tried changing the regex to r'\b'+r'\p{cat}'+r'\b'. However, I got this error:
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. See the QRegularExpression.UseUnicodePropertiesOption reference:
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not accept a QRegularExpression as an argument and only takes the regex as a string, you can use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you place it at the very beginning of the regex.
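The effect of Unicode-aware word boundaries can be reproduced outside Qt. The sketch below uses Python's own re module as a stand-in, not QRegularExpression: the re.ASCII flag mimics the ASCII-only behaviour described above, while re's default handling of str patterns corresponds to the Unicode-aware mode. The function name find_matches is hypothetical.

```python
import re

items = ["0 love my cat", "1 catirization", "2 cat in the clouds", "3 catść"]


def find_matches(pattern, texts, flags=0):
    """Return the texts in which the pattern is found at least once."""
    rx = re.compile(pattern, flags)
    return [t for t in texts if rx.search(t)]


# ASCII-only \w: "ś" is not a word character, so a boundary exists after "cat".
ascii_hits = find_matches(r"\bcat\b", items, re.ASCII)

# Unicode-aware \w: "ś" is a letter, so "catść" is one word and no longer matches.
unicode_hits = find_matches(r"\bcat\b", items)

print(ascii_hits)
print(unicode_hits)
```

This only illustrates why the word boundary moves; in the PyQt5 code the change is still made through (*UCP) or UseUnicodePropertiesOption.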

Finding substring of variable length in bash

I have a string, such as time=1234, and I want to extract just the number after the = sign. However, this number could be anywhere between 0 and 100000 (e.g. time=1, time=23, time=99999, etc.).
I've tried things like ${string:5:8}, but this only works for values of a certain length.
How do I get the substring of everything after the = sign? I would prefer to do it without outside commands like cut or awk, because I will be running this script on devices that may or may not have that functionality. I know there are examples out there using outside functions, but I am trying to find a solution without the use of such.
s=time=1234
time_int=${s##*=}
echo "The content after the = in $s is $time_int"
This is a parameter expansion that removes the longest match of the pattern *= from the front of the variable's value -- thus, everything up to and including the last =.
If intending this to be non-greedy (that is, to remove only content up to the first = rather than the last =), use ${s#*=} -- a single # rather than two.
References:
The bash-hackers page on parameter expansion
BashFAQ #100 ("How do I do string manipulations in bash?")
BashFAQ #73 ("How can I use parameter expansion? How can I get substrings? [...]")
BashSheet quick-reference, parameter expansion section
If the time= part is constant, you can remove the prefix with ${str#time=}.
Let's say you have str='time=123123'; if you execute echo ${str#time=}, you get 123123.

How could I use "tr" to translate every byte?

My goal is to have every byte from an input file xor with 42 using tr.
But I was stuck at this point:
tr '\0-\377' '?'
Anyone can help? Thanks a lot..
Some restrictions:
translation has to be done by tr.
we are allowed to use bash script, but it shouldn’t use any temporary files. (only pipeline in other words)
This isn't possible with tr alone since it - as the name says - simply translates from one codeset to another. Math calculations or logical operations are not supported.
Btw, if you want to address the whole ascii range using numeric values, you are bound to octal numbers. The range would be \0-\377 in that case. But anyway, calculating the xor value is not possible.
What you can do is prepare a table of all byte values XORed with 42 and use it as SET2. I'm using Python to create that list:
xor.py
v = []
for i in range(256):
    # zero-padded 3-digit octal escape, since tr reads at most three octal digits
    v.append("\\%03o" % (i ^ 42))
print("".join(v))
Or simply:
print("".join("\\%03o" % (i ^ 42) for i in range(256)))
Then use that to create SET2 for tr:
tr '\000-\377' "$(python xor.py)" < input.file
Note: When Python is already required, why not use Python for the whole solution?
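As a rough sketch of what a Python-only version might look like (this sidesteps the "must use tr" restriction, so it is for comparison only): bytes.translate applies a 256-entry lookup table, which is the same idea as tr's SET1/SET2 mapping. The names XOR_TABLE, xor_bytes and tr_set2 are made up for this example.

```python
KEY = 42

# Lookup table: position i holds i ^ KEY, for every possible byte value.
XOR_TABLE = bytes(i ^ KEY for i in range(256))


def xor_bytes(data):
    """XOR every byte of data with KEY using the lookup table."""
    return data.translate(XOR_TABLE)


def tr_set2():
    """Render the same table as 3-digit octal escapes usable as SET2 for tr."""
    return "".join("\\%03o" % b for b in XOR_TABLE)


print(xor_bytes(b"hello"))
print(tr_set2()[:16])  # first four escapes of the table
```

To process a file in a pipeline, the data could be read from sys.stdin.buffer and written to sys.stdout.buffer so that it is handled as raw bytes rather than text.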
Edited by Mark Setchell
As Peter pointed out, this list can of course be generated in advance and then passed as a literal string for SET2. This would not require a Python call at runtime. Like this:
tr '\000-\377' '\052\053\050\051\056\057\054\055\042\043\040\041\046\047\044\045\072\073\070\071\076\077\074\075\062\063\060\061\066\067\064\065\012\013\010\011\016\017\014\015\002\003\000\001\006\007\004\005\032\033\030\031\036\037\034\035\022\023\020\021\026\027\024\025\152\153\150\151\156\157\154\155\142\143\140\141\146\147\144\145\172\173\170\171\176\177\174\175\162\163\160\161\166\167\164\165\112\113\110\111\116\117\114\115\102\103\100\101\106\107\104\105\132\133\130\131\136\137\134\135\122\123\120\121\126\127\124\125\252\253\250\251\256\257\254\255\242\243\240\241\246\247\244\245\272\273\270\271\276\277\274\275\262\263\260\261\266\267\264\265\212\213\210\211\216\217\214\215\202\203\200\201\206\207\204\205\232\233\230\231\236\237\234\235\222\223\220\221\226\227\224\225\352\353\350\351\356\357\354\355\342\343\340\341\346\347\344\345\372\373\370\371\376\377\374\375\362\363\360\361\366\367\364\365\312\313\310\311\316\317\314\315\302\303\300\301\306\307\304\305\332\333\330\331\336\337\334\335\322\323\320\321\326\327\324\325' < inputFile > outputFile

need guidance with basic function creation in MATLAB

I have to write a MATLAB function with the following description:
function counts = letterStatistics(filename, allowedChar, N)
This function is supposed to open a text file specified by filename and read its entire contents. The contents will be parsed such that any character that isn’t in allowedChar is removed. Finally it will return a count of all N-symbol combinations in the parsed text. This function should be stored in a file name “letterStatistics.m” and I made a list of some commands and things of how the function should be organized according to my professors' lecture notes:
1. Begin the function by setting the default value of N to 1 in case:
a. The user specifies a 0 or negative value of N.
b. The user doesn’t pass the argument N into the function, i.e., counts = letterStatistics(filename, allowedChar)
2. Using the fopen function, open the file filename for reading in text mode.
3. Using the function fscanf, read in all the contents of the opened file into a string variable.
4. I know there exists a MATLAB function to turn all letters in a string to lower case. Since my analysis will disregard case, I have to use this function on the string of text.
5. Parse this string variable as follows (use logical indexing or regular expressions – do not use for loops):
a. We want to remove all newline characters without words running together. For example, the text
In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
must not become
In my younger and more vulnerableyears my father gave me some advicethat I’ve been turning over in my mindever since.
To avoid this, replace all newline characters (special character \n) with a single space: ' '.
b. We will treat hyphenated words as two separate words, hence do the same for hyphens '-'.
c. Remove any character that is not in allowedChar. Hint: use regexprep with an empty string '' as an argument for replace.
d. Any sequence of two or more blank spaces should be replaced by a single blank space.
6. Use the provided permsRep function to create a matrix of all possible N-symbol combinations of the symbols in allowedChar.
7. Using the strfind function, count all the N-symbol combinations in the parsed text into an array counts. Do not loop through each character in your parsed text as you would in a C program.
8. Close the opened file using fclose.
HERE IS MY QUESTION: As you can see, I have made this list of what the function is, what it should do, and which commands it should use (fclose etc.). The trouble is that I know closing the file involves fclose, but beyond that I'm not sure how to carry out #8. The same goes for creating the function as a whole. I have a vague idea of which commands to use, but I'm unable to produce the actual code. How should I begin? Any guidance or hints would be seriously appreciated, because I'm having programmer's block and am unable to start!
I think that you are new to MATLAB, so the documentation may seem complicated. I guess the root of the problem is a basic understanding of file I/O (input/output). When you open a file using fopen, MATLAB returns a handle to that file, generally called a file ID. When you call fclose, MATLAB needs to know which file you want to close, so you have to call fclose with the correct file ID.
fid = fopen('test.txt','w'); % 'w' opens the file for writing; the default is read-only
fprintf(fid,'This is a test.\n');
fclose(fid);
fid = 0; % Optional, this will make it clear that the file is not open,
% but it is not necessary since matlab will send a not open message anyway
Regarding the function creation the syntax is something like this:
function out = myFcn(x,y)
z = x*y;
fprintf('z=%.0f\n',z); % Print value of z in the command window
out = z>0;
This is a function that checks whether the product of two numbers is positive (that is, whether both are nonzero and have the same sign) and returns true if it is. If not, it returns false. This may not be the most useful test, but it works as an example I guess.
Please comment if this is not what you want to know.

Using sed to drop strings with repeated and incremental characters?

I'm trying to use sed to drop strings containing repeated characters before appending them to a file.
So far I have this, to drop strings with consecutive repetition like 'AA' or '22', but I'm struggling with full string repetition and incremental characters.
generic string generator | sed '/\([^A-Za-z0-9_]\|[A-Za-z0-9]\)\1\{1,\}/d' >> parsed string to file
I also want to drop strings that contain any repetition, like 'ABA'.
The same goes for strings containing any ascending or descending characters, like 'AEF' or 'AFE'.
I'm assuming it would be easier to use multiple passes of sed to drop the unwanted strings.
** A little more information to try to avoid the XY problem mentioned. **
The character strings could be from 8 to 64 in length, but in this instance I'm focusing on 8. While at the same time I've restricted the string generation to only output an upper-case alpha string (A-Z). This is for a few reasons, but mainly that I don't want the generated file to have a ridiculously huge footprint.
The first pass of sed drops unnecessary outputs like 'AAAAAAAA' and 'AAAAAAAB' from the stream. This results in the file starting with the strings 'ABABABAB' and 'ABABABAC'.
Next pass I want to check that from one character to the next doesn't increase or decrease by a value of one. So strings like 'ABABABAB' would be dropped, but 'ACACACAC' would parse to the stream.
Next pass I want to drop strings that contain any repeated characters in the whole string. So strings like 'ACACACAC' would be dropped, but 'ACEBDFHJ' would parse to the file.
Hope that helps.
In order to do what you're describing with sed, you'd need to run it many times. Since sed doesn't understand the concept of "this character is incremental from this other character", you need to run it across all possible combinations:
sed '/AB/d'
sed '/BC/d'
sed '/CD/d'
sed '/DE/d'
etc.
For descending characters, the same thing:
sed '/BA/d'
sed '/CB/d'
In order to then drop strings with repeated characters, you can do something like this:
sed '/\(.\).*\1/d'
The following should do the trick:
generic string generator |sed '/\(.\).*\1/d'|sed /BA/d|sed /AB/d|sed /CB/d|sed /BC/d|sed /DC/d|sed /CD/d|sed /ED/d|sed /DE/d|sed /FE/d|sed /EF/d|sed /GF/d|sed /FG/d|sed /HG/d|sed /GH/d|sed /IH/d|sed /HI/d|sed /JI/d|sed /IJ/d|sed /KJ/d|sed /JK/d|sed /LK/d|sed /KL/d|sed /ML/d|sed /LM/d|sed /NM/d|sed /MN/d|sed /ON/d|sed /NO/d|sed /PO/d|sed /OP/d|sed /QP/d|sed /PQ/d|sed /RQ/d|sed /QR/d|sed /SR/d|sed /RS/d|sed /TS/d|sed /ST/d|sed /UT/d|sed /TU/d|sed /VU/d|sed /UV/d|sed /WV/d|sed /VW/d|sed /XW/d|sed /WX/d|sed /YX/d|sed /XY/d|sed /ZY/d|sed /YZ/d
I only tested this on a few input samples, but they all seemed to work.
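The list of adjacent-letter patterns does not have to be typed by hand. As a sketch, the snippet below builds an equivalent single sed invocation with one -e expression per pattern instead of one process per pattern; the function names adjacent_pairs and build_sed_command are invented for this example, and the upper-case A-Z alphabet is an assumption taken from the question.

```python
import string


def adjacent_pairs(alphabet=string.ascii_uppercase):
    """Return every two-letter run that ascends or descends by one, e.g. AB and BA."""
    pairs = []
    for lo, hi in zip(alphabet, alphabet[1:]):
        pairs.append(lo + hi)
        pairs.append(hi + lo)
    return pairs


def build_sed_command(alphabet=string.ascii_uppercase):
    """Build one sed call: the repeated-character rule plus one rule per pair."""
    expressions = [r"/\(.\).*\1/d"] + ["/%s/d" % p for p in adjacent_pairs(alphabet)]
    return "sed " + " ".join("-e '%s'" % e for e in expressions)


# Shortened alphabet so the generated command is easy to read.
print(build_sed_command("ABC"))
# sed -e '/\(.\).*\1/d' -e '/AB/d' -e '/BA/d' -e '/BC/d' -e '/CB/d'
```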
Note that this is quite ungainly, and would be better done by something a little more sophisticated than sed. Here's a sample in python:
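import math
def isvalid(x):
    # a set drops duplicates, so a shorter set means a repeated character
    if len(set(x)) < len(x):
        return False
    for a in range(1, len(x)):
        # neighbouring characters must not differ by exactly one
        if math.fabs(ord(x[a])-ord(x[a-1])) == 1:
            return False
    return True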
This is much more readable than the giant set of sed calls, and has the same functionality.
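A sketch of how such a check could be used as a filter, assuming one candidate string per input line. The names is_valid and keep_valid, as well as the sample list, are illustrative; the sample strings are the ones mentioned in the question.

```python
def is_valid(s):
    """True if s has no repeated character and no neighbours differing by one."""
    if len(set(s)) < len(s):
        return False
    return all(abs(ord(a) - ord(b)) != 1 for a, b in zip(s, s[1:]))


def keep_valid(lines):
    """Yield only the non-empty lines that pass is_valid."""
    for line in lines:
        candidate = line.strip()
        if candidate and is_valid(candidate):
            yield candidate


samples = ["AAAAAAAA", "ABABABAB", "ACACACAC", "ACEBDFHJ"]
print(list(keep_valid(samples)))  # ['ACEBDFHJ']
```

In a pipeline, sys.stdin could be passed to keep_valid in place of the sample list, with each surviving string printed on its own line.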
