Using sed to drop strings with repeated and incremental characters? - linux

I'm trying to use sed to drop strings containing repeated characters before appending them to a file.
So far I have this, which drops strings with consecutive repetition like 'AA' or '22', but I'm struggling with full-string repetition and incremental characters.
generic string generator | sed '/\([^A-Za-z0-9_]\|[A-Za-z0-9]\)\1\{1,\}/d' >> parsed string to file
I also want to drop strings containing any repetition, like 'ABA'.
I also want to drop strings containing any ascending or descending characters, like 'AEF' or 'AFE'.
I'm assuming it would be easier to use multiple passes of sed to drop the unwanted strings.
** A little more information to try to avoid the XY problem mentioned. **
The character strings could be from 8 to 64 characters long, but in this instance I'm focusing on 8. I've also restricted the string generation to output only upper-case alphabetic strings (A-Z). This is for a few reasons, but mainly because I don't want the generated file to have a ridiculously huge footprint.
The first pass of sed drops unnecessary outputs like 'AAAAAAAA' and 'AAAAAAAB' from the stream. This results in the file starting with the strings 'ABABABAB' and 'ABABABAC'.
In the next pass I want to check that no character differs from the one before it by exactly one. So strings like 'ABABABAB' would be dropped, but 'ACACACAC' would pass through to the stream.
In the pass after that I want to drop strings that contain any repeated character anywhere in the string. So strings like 'ACACACAC' would be dropped, but 'ACEBDFHJ' would pass through to the file.
Hope that helps.

In order to do what you're describing with sed, you'd need to run it many times. Since sed doesn't understand the concept of "this character is incremental from this other character", you need to run it across all possible combinations:
sed '/AB/d'
sed '/BC/d'
sed '/CD/d'
sed '/DE/d'
etc.
For descending characters, the same thing:
sed '/BA/d'
sed '/CB/d'
In order to then drop strings with repeated characters, you can do something like this:
sed '/\(.\).*\1/d'
The following should do the trick:
generic string generator |sed '/\(.\).*\1/d'|sed /BA/d|sed /AB/d|sed /CB/d|sed /BC/d|sed /DC/d|sed /CD/d|sed /ED/d|sed /DE/d|sed /FE/d|sed /EF/d|sed /GF/d|sed /FG/d|sed /HG/d|sed /GH/d|sed /IH/d|sed /HI/d|sed /JI/d|sed /IJ/d|sed /KJ/d|sed /JK/d|sed /LK/d|sed /KL/d|sed /ML/d|sed /LM/d|sed /NM/d|sed /MN/d|sed /ON/d|sed /NO/d|sed /PO/d|sed /OP/d|sed /QP/d|sed /PQ/d|sed /RQ/d|sed /QR/d|sed /SR/d|sed /RS/d|sed /TS/d|sed /ST/d|sed /UT/d|sed /TU/d|sed /VU/d|sed /UV/d|sed /WV/d|sed /VW/d|sed /XW/d|sed /WX/d|sed /YX/d|sed /XY/d|sed /ZY/d|sed /YZ/d
I only tested this on a few input samples, but they all seemed to work.
Note that this is quite ungainly, and would be better done by something a little more sophisticated than sed. Here's a sample in Python:
def isvalid(x):
    # Reject strings in which any character occurs more than once.
    if len(set(x)) < len(x):
        return False
    # Reject strings in which two neighbouring characters differ by exactly one.
    for a in range(1, len(x)):
        if abs(ord(x[a]) - ord(x[a-1])) == 1:
            return False
    return True
This is much more readable than the giant set of sed calls, and has the same functionality.
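As a rough sketch of how such a check could replace the sed pipeline, the function can be wrapped in a small filter over the generated strings. The names `is_valid` and `keep_valid` and the sample list are illustrative assumptions, not part of the original answer; the samples are the strings mentioned in the question.

```python
def is_valid(s):
    """Return True if s has no repeated character and no adjacent +/-1 step."""
    # A set drops duplicates, so a shorter set means some character repeats.
    if len(set(s)) < len(s):
        return False
    # Compare every character with its right-hand neighbour.
    return all(abs(ord(a) - ord(b)) != 1 for a, b in zip(s, s[1:]))


def keep_valid(lines):
    """Yield only the candidate strings that pass is_valid (hypothetical helper)."""
    for line in lines:
        candidate = line.strip()
        if candidate and is_valid(candidate):
            yield candidate


# Sample strings taken from the question.
samples = ["AAAAAAAB", "ABABABAB", "ACACACAC", "ACEBDFHJ"]
print(list(keep_valid(samples)))
```

In practice the list would be replaced by the generator's output, for example by iterating over standard input and printing each surviving string.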

Related

Extracting string from Regex in Pandas for large dataset

We have a csv file which contains log entries in each row.
We need to extract the thread names from each log entry into a separate column.
What would be the fastest way to do this?
The approach below (string functions) also seems to take a lot of time for large datasets.
We have CSV files with a minimum of 100K entries each.
This is the piece of code that does the extraction:
df['thread'] = df.message.str.extract(pat = '(\[(\w+.)+?\]|$)')[0]
The below is a sample log entry, we are picking out:
[c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
from the regex above.
2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]
Does anyone know a better/faster method to do the same?
The \[(\w+.)+?\] is a very inefficient pattern that may cause catastrophic backtracking due to the nested quantifiers with an unescaped . that matches any char, and thus also matches what \w does.
You can use
df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")
See this regex demo. Note there is no need to add $ as an alternative, since .fillna("") will replace the NA with an empty string.
The regex matches
\[ - a [ char
(\w+(?:\.\w+)*) - Capturing group 1: one or more word chars followed with zero or more sequences of a . and one or more word chars
] - a ] char.
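To see the pattern at work outside pandas, here is a minimal sketch using only the standard `re` module. The helper name `extract_thread` is an assumption for illustration, and the log line is a shortened version of the sample entry from the question.

```python
import re

# Same pattern as in the answer: "[", dotted word characters, "]".
THREAD_RE = re.compile(r"\[(\w+(?:\.\w+)*)]")


def extract_thread(message):
    """Return the first bracketed dotted name in message, or "" if none is found."""
    match = THREAD_RE.search(message)
    return match.group(1) if match else ""


# Shortened version of the sample log entry from the question.
log = (
    "2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN "
    "/browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] "
    "Event processed: com.atlassian.jira.event.issue.IssueEvent#5c8703d0[issue=ABC-61381]"
)
print(extract_thread(log))
```

Because the capturing group excludes the brackets, the result is the bare thread name; a line with no match yields an empty string, mirroring the `.fillna("")` step.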
Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1
You can use this regex to significantly cut down the regex processing to 4 steps:
\[[^\]]+\]
Do notice the absence of the /g modifier
https://regex101.com/r/6522P8/1

Looking for the best way in bash shell to extract a string

I have the following string being exported from a program that is analyzing the certificate on a website which will be part of a bugfix analysis
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:170902005715Z:270831005715Z:self signed certificate
(consider output above to be a single line)
What I need is the best way in a bash shell to extract the sha256WithRSAEncryption. This could be anything like sha384withRSAEncryption or something else.
After CERT_SUMMARY it will always be 127.0.0.1:127.0.0.1:portnum. In the example above the port is 631, but it could be anything.
This runs internally on a system and returns this string along with SSL or TLS (not pictured)
Here is another example of a return
CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/
ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=
Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:
170702212459Z:270630212459Z:unable to get local issuer certificate
Thanks in advance.
Novice at shell programming, but learning!!
You need the best way and yet do not seem to provide the best description - "This could be anything like sha384withRSAEncryption or something else."
Given the examples, the string you are looking for is the 5th field when : is the separator, so this command should be OK:
cut -f5 -d":"
If the output string has a strict length format, one easy option is the 'cut' command with -c. This is not the case though since there is a port number.
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
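The same field-splitting idea can be sketched in Python. The function name `cert_algorithm` is hypothetical, and the sample line is a shortened version of the first example; counting from CERT_SUMMARY as field one, the algorithm name is the fifth colon-separated field.

```python
def cert_algorithm(summary):
    """Return the fifth colon-separated field of a CERT_SUMMARY line, or None."""
    fields = summary.split(":")
    # Guard against lines that are not CERT_SUMMARY records or are too short.
    if len(fields) < 5 or fields[0] != "CERT_SUMMARY":
        return None
    return fields[4]  # zero-based index 4 is the fifth field


# Shortened version of the first example line from the question.
line = (
    "CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:"
    "/O=bfcentos7-test/CN=bfcentos7-test"
)
print(cert_algorithm(line))
```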
As @cyrus pointed out, this was as simple as picking the right column with awk... I am learning.
This worked
awk -F ":" '/CERT_SUMMARY/ {print $5}'
Thanks for the help!!
| sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
Regular expressions are your friend. If there is one thing one really should be familiar with if one needs to do a lot of string parsing or string processing, it's definitely regular expressions.
echo 'CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/ST=California
/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=Hash`sha1
/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999
/name=CustomRetrievedProperties
`Enable:170702212459Z:270630212459Z:unable to get local issuer
certificate'
| sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
prints
sha256WithRSAEncryption
It's probably a bit overkill here, but there is almost nothing that cannot be done with regular expressions and as you have also built-in regex support in many languages today, knowing regex is never going to be a waste of time.
See also here to get a nice explanation of what each regex expression actually means, including an interactive editing view. Basically I'm telling the regex parser to skip the first 4 groups consisting of any number of characters that are not :, followed by a single : and then capture the 5th group that consists of any number of characters that are not : and finally match anything else (no matter what) to the end of the string. The whole regex is part of a sed "replace" operation, where I replace the whole string by just the content that has been captured by the second capture group (everything in round parenthesis is a capture group).
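The skip-four-groups-then-capture logic described above carries over to other regex engines. As a hedged sketch, here is the same idea with Python's `re` module; the name `fifth_field` is an assumption, and the input is a shortened version of the second example.

```python
import re

# Four "anything but a colon, then a colon" groups, then capture the fifth field.
FIFTH_FIELD_RE = re.compile(r"^(?:[^:]*:){4}([^:]*):")


def fifth_field(line):
    """Return the fifth colon-delimited field if the line has one, else None."""
    match = FIFTH_FIELD_RE.match(line)
    return match.group(1) if match else None


# Shortened version of the second example line from the question.
sample = (
    "CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:"
    "/CN=ServerSigningCertificate_0/name=Type`Administrator"
)
print(fifth_field(sample))
```

Unlike the sed substitution, which leaves a non-matching line unchanged, this version returns None when the line has fewer than five colon-terminated fields.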
You could also try the following, which does not rely on the field number, so it may be more helpful if the position of the sha256 string in your Input_file differs from the one shown.
awk '{match($0,/sha.*Encryption:/);if(substr($0,RSTART,RLENGTH)){print substr($0,RSTART,RLENGTH-1)}}' Input_file
Pipe the output to:
awk 'BEGIN{FS=":"} {print $5}'
You could also take a step back to the openssl x509 command 'name options'. Using sep_comma_plus avoids the slashes in the output and therefore your regex will be simpler.

Substitute string by index without using regular expressions

It should be very easy, but I am looking for an efficient way to perform it.
I know that I could split the string into two parts and insert the new value, but I have tried to substitute each line between the indexes 22-26 as follows:
line.replace(line[22:26],new_value)
The Problem
However, that function substitutes everything in the line that is similar to the pattern in line[22:26].
In the example below, I want to replace the marked number 1 with number 17:
Here are the results. Note the replacement of 1 with 17 in several places:
Thus I don't understand the behavior of replace command. Is there a simple explanation of what I'm doing wrong?
Why I don't want RE
The values between index 22-26 are not unified in form.
Note: I am using python 3.5 on Unix/Linux machines.
str.replace replaces 1 sub-string pattern with another everywhere in the string.
e.g.
'ab cd ab ab'.replace('ab', 'xy')
# produces output 'xy cd xy xy'
similarly,
mystr = 'ab cd ab ab'
mystr.replace(mystr[0:2], 'xy')
# also produces output 'xy cd xy xy'
What you could do instead, to replace just the characters in positions 22-26, is:
line = line[0:22] + new_value + line[26:]
Also, looking at your data, it seems to me to be a fixed-width text file. While my suggestion will work, a more robust way to process this data would be to read it & separate the different fields in the record first, before processing the data.
If you have access to the pandas library, it provides a useful function just for reading fixed-width files: pandas.read_fwf.
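The difference between the two approaches can be seen in a short, self-contained sketch. The helper `replace_span` and the sample line are made up for illustration; the indexes are chosen for this sample, not for the 22-26 range in the question.

```python
def replace_span(line, start, end, new_value):
    """Replace line[start:end] with new_value, leaving the rest untouched."""
    return line[:start] + new_value + line[end:]


# Hypothetical record; the "1" after "count=" sits at index 11.
line = "id=1 count=1 flag=1"

# str.replace searches for the *text* "1" and swaps every occurrence.
everywhere = line.replace(line[11:12], "17")

# Slicing swaps only the characters at the given position.
by_position = replace_span(line, 11, 12, "17")

print(everywhere)
print(by_position)
```

Note that the replacement may be longer or shorter than the slice it replaces, which shifts every later column; that is one reason parsing the fields first tends to be more robust for fixed-width data.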

Having some issues with Perl Splitting and Merging Functions

First and foremost, I'm not familiar with Perl at all. I've been studying C++ primarily for the last half year. I'm in a class now that is teaching Linux commands, and we have short little topics on languages used in Linux, including Perl, which is totally throwing me for a loop (no pun intended). I have a text file that contains a bunch of random numbers separated by spaces and tabs, maybe even newlines, that gets read into the program via a filehandle. I'm supposed to write 2 lines of code that split the lines of numbers and merge them into one array, inside of a foreach loop. I'm not looking for an answer, just a nudge in the right direction. I've been trying different things for multiple hours and feel totally silly that I can't get it; I'm totally lost with the syntax. It's just a bit odd not working inside a compiler, and I'm out of my comfort zone working outside of C++. I really appreciate it. I've included a few photos. Basically, the code we are writing is just to store the numbers, and the rest of the program will determine the smallest number and the sum of all numbers. Mine is currently incorrect because I'm not sure what to do. In the output photo, it will display all the numbers being entered via the text file, so you can see them.
Several things to fix here. First of all, please don't post screenshots of your sample data or code, as it makes it impossible to copy and paste to test your code or data. Post your code/data by indenting it with four spaces and a newline preceding the code block.
Add use strict; in your script. This should be lesson 0 in your class. After that add my to all variable declarations.
To populate @all_numbers with the numbers from each line, without using push, you can use something like this:
foreach my $line (@output_lines)
{
    my @numbers = split /\s/, $line;
    @all_numbers = (@all_numbers, @numbers);
}
You say you're "not looking for an answer," so here's your nudge:
You're almost there. You split each line well (using split/\s/) and store the numeric values in @all_numbers. However, notice that each time around in the loop, you replace (using the assignment, @all_numbers = ...) the whole contents of @all_numbers with the numbers you found in the current line. Effectively, you're throwing away everything you've stored from the previous lines.
Instead, you want to add to @all_numbers, not replace @all_numbers. Have a look at the push() function for how to do this.
NB: Your split() call is fine, but it's more customary to use split(' ', $line) in this case. (See split(): you can use a single space, ' ', instead of the pattern, /\s/, when you want to split on any whitespace.)
I assume you need to store all the split elements in one array, so you are looking for the push function.
foreach $line (@input_lines)
{
    push(@all_numbers,split(/\s/,$line));
}
Your problem is that in every iteration the split values overwrite the array instead of being appended to it. For example,
@array = qw(one two three);
@array = qw(five four seven);
print "@array";
The output is five four seven, not one two three five four seven, because the array is reinitialized with the new values. To add the new values at the front or the back of the array, use unshift or push,
for example
@array = qw(one two three);
push(@array,qw(five four seven));
Another way:
my @all_numbers = map { split ' ', $_ } @output_lines;
See http://perldoc.perl.org/functions/map.html

How could I use "tr" to translate every byte?

My goal is to have every byte from an input file xor with 42 using tr.
But I was stuck at this point:
tr '\0-\377' '?'
Anyone can help? Thanks a lot..
Some restrictions:
translation has to be done by tr.
we are allowed to use bash script, but it shouldn’t use any temporary files. (only pipeline in other words)
This isn't possible with tr alone since it - as the name says - simply translates from one codeset to another. Math calculations or logical operations are not supported.
Btw, if you want to address the whole ascii range using numeric values, you are bound to octal numbers. The range would be \0-\377 in that case. But anyway, calculating the xor value is not possible.
What you can do is prepare a table of the byte values xor'ed with 42 and use it as SET2. I'm using Python to create that list:
xor.py
v = []
for i in range(256):
    # tr reads at most three octal digits after a backslash, so pad to exactly three
    v.append("\\%03o" % (i ^ 42))
print("".join(v))
Or simply:
print("".join("\\%03o" % (i ^ 42) for i in range(256)))
Then use that to create SET2 for tr:
tr '\000-\377' "$(python xor.py)" < input.file
Note: When Python is already required, why not use Python for the whole solution?
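For comparison, a whole-Python version can follow the same lookup-table idea, since `bytes.translate` maps each byte through a 256-entry table much as tr does with SET1 and SET2. This is only a sketch: the names `XOR_TABLE` and `xor42` are assumptions, and it does not satisfy the question's "must use tr" restriction.

```python
# Entry i of the table holds i XOR 42, for every possible byte value.
XOR_TABLE = bytes(b ^ 42 for b in range(256))


def xor42(data):
    """Return a copy of data with every byte XOR'ed with 42."""
    return data.translate(XOR_TABLE)


sample = b"\x00*\xff"
print(list(xor42(sample)))
```

In a pipeline, the same function could be applied to the bytes read from `sys.stdin.buffer` and the result written to `sys.stdout.buffer`, which avoids temporary files. Since XOR with a constant is its own inverse, applying it twice returns the original data.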
Edited by Mark Setchell
As Peter pointed out, this list can of course be generated in advance and then passed as a string for SET2. This would not require a runtime Python call. Each escape uses exactly three octal digits, because tr reads at most three digits after a backslash. Like this:
tr '\000-\377' '\052\053\050\051\056\057\054\055\042\043\040\041\046\047\044\045\072\073\070\071\076\077\074\075\062\063\060\061\066\067\064\065\012\013\010\011\016\017\014\015\002\003\000\001\006\007\004\005\032\033\030\031\036\037\034\035\022\023\020\021\026\027\024\025\152\153\150\151\156\157\154\155\142\143\140\141\146\147\144\145\172\173\170\171\176\177\174\175\162\163\160\161\166\167\164\165\112\113\110\111\116\117\114\115\102\103\100\101\106\107\104\105\132\133\130\131\136\137\134\135\122\123\120\121\126\127\124\125\252\253\250\251\256\257\254\255\242\243\240\241\246\247\244\245\272\273\270\271\276\277\274\275\262\263\260\261\266\267\264\265\212\213\210\211\216\217\214\215\202\203\200\201\206\207\204\205\232\233\230\231\236\237\234\235\222\223\220\221\226\227\224\225\352\353\350\351\356\357\354\355\342\343\340\341\346\347\344\345\372\373\370\371\376\377\374\375\362\363\360\361\366\367\364\365\312\313\310\311\316\317\314\315\302\303\300\301\306\307\304\305\332\333\330\331\336\337\334\335\322\323\320\321\326\327\324\325' < inputFile > outputFile
