I have a large file that contains a single example string
ABCDEFGHI (example length 9 characters).
The actual file length could be millions of characters.
I would like to split the string into multiple lines of a predetermined length, but while splitting, the starting character is shifted by 1 each time. This means that after the split,
no. of lines = string length - split size + 1
For example, if I split it 3 characters at a time, then the desired output is:
ABC
BCD
CDE
DEF
...
If I split by 4 characters, then:
ABCD
BCDE
CDEF
DEFG
What is the best way to do this split using shell commands or scripting?
Thanks for any hints.
You can try something like this:
gawk -v FS="" '{
    r=3    # Set the length
    s=1    # Set the start point
    while (s <= NF-r+1) {
        for (i=s; i<r+s; i++) {
            printf $i
        }
        s++
        print ""
    }
}'
Test:
$ echo "ABCDEFGHI" | gawk -v FS="" '{r=4; s=1; while(s<=NF-r+1) { for (i=s;i<r+s;i++) printf $i ; s++; print ""}}'
ABCD
BCDE
CDEF
DEFG
EFGH
FGHI
$ echo "ABCDEFGHI" | gawk -v FS="" '{r=3; s=1; while(s<=NF-r+1) { for (i=s;i<r+s;i++) printf $i ; s++; print ""}}'
ABC
BCD
CDE
DEF
EFG
FGH
GHI
Here is a way with sed (in bash):
GNU sed:
sed -r ':a;s/([^\n])([^\n]{'$(( n-1 ))'})([^\n])/\1\2\n\2\3/;ta' filename
or POSIX sed (I think):
sed ':a;s/\([^\n]\)\([^\n]\{'$(( n-1 ))'\}\)\([^\n]\)/\1\2\n\2\3/;ta' filename
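Note that n must be set in the shell before either command runs, since the shell expands $(( n-1 )) inside the command line, e.g. with n=4:
n=4
sed -r ':a;s/([^\n])([^\n]{'$(( n-1 ))'})([^\n])/\1\2\n\2\3/;ta' filename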
Output:
with n=3:
ABC
BCD
CDE
DEF
EFG
FGH
GHI
with n=4:
ABCD
BCDE
CDEF
DEFG
EFGH
FGHI
Another awk-based option, involving substr
echo 'abcdefghi' |
awk -v limit=3 'BEGIN{FS=""};
{value=$0; for (i=1; i<= NF-limit +1; ++i) print substr(value, i, limit)}'
abc
bcd
cde
def
efg
fgh
ghi
While I generally dislike bringing in heavyweight scripting languages like this, Python makes this pretty much trivial:
$ cat test.py
#!/usr/bin/env python3
import sys

n = int(sys.argv[1])
s = sys.argv[2]
while len(s) > 0:
    print(s[:n])
    s = s[1:]
$ python test.py 3 abcdef
abc
bcd
cde
def
ef
f
$ python test.py 4 abcdef
abcd
bcde
cdef
def
ef
f
$
If you want to stop once you run out of characters, you can change the while condition to len(s) >= n.
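With that change, the loop stops at the last full window (a minimal sketch of that modification):
while len(s) >= n:
    print(s[:n])
    s = s[1:]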
Using Python, you could write something like this:
import itertools

filename = "myfile"
length = 4

with open(filename, 'r') as f:
    out = ''
    # get your input character by character
    for c in itertools.chain.from_iterable(f):
        # append it to your output buffer
        out += c
        # if your buffer is more than N characters, remove the first char
        if len(out) > length:
            out = out[1:]
        # if your buffer is exactly N characters, print it out (or do something else)
        if len(out) == length:
            print(out)

# if the last iteration left fewer than N characters, print them out (or do something else)
if len(out) < length:
    print(out)
where filename is a string containing the full path of your input file. You could also use input() instead of open()/read(). There surely is a neat solution using awk, but I would need to RTFM to tell you how to do it.
Whatever your solution is, this algorithm is a good way to do it, as you only ever keep up to N characters in the buffer, plus one character for the new read. So the complexity of this algorithm is linear, O(n), in the length of the input character stream.
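For reference, here is a rough bash rendering of the same sliding-buffer idea (a sketch only; it assumes a single-line input file named myfile, a hypothetical name, and the window length in len):
len=4
buf=
while IFS= read -r -n1 c; do
    buf+=$c
    (( ${#buf} > len )) && buf=${buf:1}           # keep at most len characters
    (( ${#buf} == len )) && printf '%s\n' "$buf"  # full window: print it
done < myfile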
What I want to do is print a random line from text file A into text file B WITHOUT it choosing the same line twice. So if text file B has a line with the number 25 in it, it will not choose that line from text file A
I have figured out how to print a random line from text file A to text file B, however, I am not sure how to make sure it does not choose the same line twice.
echo "$(printf $(cat A.txt | shuf -n 1))" > /home/B.txt
grep -Fxv -f B A | shuf -n 1 >> B
First part (grep) prints difference of A and B to stdout, i.e. lines present in A but absent in B:
-F — Interpret PATTERNS as fixed strings, not regular expressions.
-x — Select only those matches that exactly match the whole line.
-v — Invert the sense of matching.
-f FILE — Obtain patterns from FILE.
Second part (shuf -n 1) prints random line from stdin. Output is appended to B.
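For example, to move several unique random lines, you can simply repeat the pipeline (a sketch using the question's file names):
for i in 1 2 3; do
    grep -Fxv -f B A | shuf -n 1 >> B
done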
That's not really "random", then. Never mind.
Please try the following awk solution - I think it does what you're trying to achieve.
$ cat A
11758
1368
26149
2666
27666
11155
31832
11274
21743
25
$ cat B
18518
8933
941
32286
1234
25
1608
5284
23040
19028
$ cat pseudo
BEGIN{
    "bash -c 'echo ${RANDOM}'"|getline seed  # Generate a random seed
    srand(seed)  # use random seed, otherwise each repeated run will generate the same random sequence
    count=0      # set a counter
}
NR==FNR{         # while on the first file, remember every number; note this will weed out duplicates!
    b[$1]=1
}
!($1 in b) {     # for numbers we haven't seen yet (so on the second file, ignoring ones present in file B)
    a[count]=$1  # remember new numbers in an associative array with an integer index
    count++
}
END{
    r=(int(rand() * count))  # generate a random number in the range of our secondary array's index values
    print a[r] >> "B"        # print that randomly chosen element to the last line of file B
}
$ awk -f pseudo B A
$ cat B
18518
8933
941
32286
1234
25
1608
5284
23040
19028
27666
$
$ awk -f pseudo B A
$ cat B
18518
8933
941
32286
1234
25
1608
5284
23040
19028
27666
31832
Is there a way to convert the carriage returns to actual overwrite in a string so that 000000000000\r1010 is transformed to 101000000000?
Context
1. Initial objective:
Having a number x (between 0 and 255) in base 10, I want to convert this number to base 2, add trailing zeros to get a 12-digit-long binary representation, generate 12 different numbers (each of them made of the first n digits of that representation, with n between 1 and 12), and print the base-10 representation of these 12 numbers.
2. Example:
With x = 10
Base 2 is 1010
With trailing zeros 101000000000
Extract the 12 "leading" numbers: 1, 10, 101, 1010, 10100, 101000, ...
Convert to base 10: 1, 2, 5, 10, 20, 40, ...
3. What I have done (it does not work):
x=10
x_base2="$(echo "obase=2;ibase=10;${x}" | bc)"
x_base2_padded="$(printf '%012d\r%s' 0 "${x_base2}")"
for i in {1..12}
do
    t=$(echo ${x_base2_padded:0:${i}})
    echo "obase=10;ibase=2;${t}" | bc
done
4. Why it does not work
Because the variable x_base2_padded contains the whole sequence 000000000000\r1010. This can be confirmed using hexdump for instance. In the for loop, when I extract the first 12 characters, I only get zeros.
5. Alternatives
I know I can find an alternative by literally adding zeros to the variable as follows:
x_base2=1010
x_base2_padded="$(printf '%s%0.*d' "${x_base2}" $((12-${#x_base2})) 0)"
Or by padding with zeros using printf and rev
x_base2=1010
x_base2_padded="$(printf '%012s' "$(printf "${x_base2}" | rev)" | rev)"
Although these alternatives solve my problem for now and let me continue my work, they do not really answer my question.
Related issue
The same problem may be observed in different contexts. For instance if one tries to concatenate multiple strings containing carriage returns. The result may be hard to predict.
str=$'bar\rfoo'
echo "${str}"
echo "${str}${str}"
echo "${str}${str}${str}"
echo "${str}${str}${str}${str}"
echo "${str}${str}${str}${str}${str}"
The first echo will output foo. Although you might expect the other echos to output foofoo, foofoofoo, and so on, they all output foobar.
The following function overwrite transforms its argument such that after each carriage return \r the beginning of the string is actually overwritten:
overwrite() {
    local segment result=
    while IFS= read -rd $'\r' segment; do
        result="$segment${result:${#segment}}"
    done < <(printf '%s\r' "$@")
    printf %s "$result"
}
Example
$ overwrite $'abcdef\r0123\rxy'
xy23ef
Note that the printed string is actually xy23ef, unlike echo $'abcdef\r0123\rxy', which only seems to print the same string: it still prints the \r characters, which are then interpreted by your terminal so that the result merely looks the same. You can confirm this with hexdump:
$ echo $'abcdef\r0123\rxy' | hexdump -c
0000000 a b c d e f \r 0 1 2 3 \r x y \n
000000f
$ overwrite $'abcdef\r0123\rxy' | hexdump -c
0000000 x y 2 3 e f
0000006
The function overwrite also supports overwriting by arguments instead of \r-delimited segments:
$ overwrite abcdef 0123 xy
xy23ef
To convert a variable in place, use command substitution: myvar=$(overwrite "$myvar")
With awk, you'd set the field delimiter to \r and iterate through fields printing only the visible portions of them.
awk -F'\r' '{
    offset = 1
    for (i=NF; i>0; i--) {
        if (offset <= length($i)) {
            printf "%s", substr($i, offset)
            offset = length($i) + 1
        }
    }
    print ""
}'
This is indeed too long to put into a command substitution, so you had better wrap it in a function and pipe the lines to be resolved to it.
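For example, applied to the string from the question:
$ printf '%s\n' $'000000000000\r1010' | awk -F'\r' '{
    offset = 1
    for (i=NF; i>0; i--) {
        if (offset <= length($i)) {
            printf "%s", substr($i, offset)
            offset = length($i) + 1
        }
    }
    print ""
}'
101000000000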
To answer the specific question, how to convert 000000000000\r1010 to 101000000000, refer to Socowi's answer.
However, I wouldn't introduce the carriage return in the first place and solve the problem like this:
#!/usr/bin/env bash
x=$1
# Start with 12 zeroes
var='000000000000'
# Convert input to binary
binary=$(bc <<< "obase = 2; $x")
# Rightpad with zeroes: ${#binary} is the number of characters in $binary,
# and ${var:x} removes the first x characters from $var
var=$binary${var:${#binary}}
# Print 12 substrings, convert to decimal: ${var:0:i} extracts the first
# i characters from $var, and $((x#$var)) interprets $var in base x
for ((i = 1; i <= ${#var}; ++i)); do
    echo "$((2#${var:0:i}))"
done
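A sample run, assuming the script above is saved as leading.sh (a hypothetical name):
$ bash leading.sh 10
1
2
5
10
20
40
80
160
320
640
1280
2560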
I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to find the most frequently occurring sequences of 3 letters within each word.
The output should be:
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how I can accomplish it with awk, sed, or grep.
Any clue how it's possible...
PS: I show no attempted output because I have no idea how to do this; I don't want to write unnecessary awk -F, xyz abc... commands that won't help anywhere.
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative method to the solution of Ed Morton. It does less looping but needs a bit more memory. The idea is not to care about spaces or any non-alphabetic characters; we filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
When you use GNU awk, you can do this a bit differently, and more optimized, by setting each record to be a word. This way the final filtering step is not needed:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
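As with the first script, you can pipe the result through sort to rank the counts (a sketch; the sort invocation mirrors the one used with tst.awk above):
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file |
sort -k2,2nr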
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
If you only want entries with duplicates, in alphabetical order and case-sensitive, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
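Run on the question's sample file, the first pipeline produces a single line in the desired format (output truncated here; counts match the awk answers above):
$ sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file | sort | uniq -c | sort -s -k1,1rn | sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
acc = 5 cou = 5 cco = 4 ing = 4 ...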
I have a string that looks like this
807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
I need output like this:
S:S6S11,07001,23668732,1,1496851208,807262,7482
I need the string split into columns like this:
the first column is S:S6 plus the next 3 characters,
in this case S:S6S11. This works:
echo 807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482 |
grep -P -o 'S:S6.{1,3}'
Output:
S:S6S11
This gets me close, getting just the numbers
echo 807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482 |
grep -o '[0-9]\+' | tr '\n' ','
Output:
807001,6,11,23668732,1,1496851208,807262,7482,
How can I get S:S6S11 in the beginning of my output and avoid 6,11 after that?
If this can be done better with sed or awk I don't mind.
Edit - clarification of structure
The rest of the string is:
LETTERS NUMBERS
BB 23668732
CC 1
DD 1496851208.807262
EE 7482
I need just the numbers but they have to correspond to the letters.
awk to the rescue!
$ echo "807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482" |
awk '{pre=gensub(".*(S:S6...).*","\\1","g"); ## extract prefix
sub(/./,","); ## replace first char with comma
gsub(/[^0-9]+/,","); ## replace non-numeric values with comma
print pre $0}' ## print prefix and replaced line
S:S6S11,07001,6,11,23668732,1,1496851208,807262,7482
... or sed:
$ echo "807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482" | sed -re 's/^.([0-9]+)(S:S6...)ABB([0-9]+)CC([0-9]+)DD([0-9]+)\.([0-9]+)EE([0-9]*)$/\2,\1,\3,\4,\5,\6,\7/'
S:S6S11,07001,23668732,1,1496851208,807262,7482
That is, if your line format is fixed.
If you use GNU awk, you can simplify the task by defining RS as the desired pattern, e.g.:
parse.awk
BEGIN { RS = "S:S6...|\n" }

# Start of the string
RT != "\n" {
    sub(".", ",")  # Replace first char by a comma
    pst = $0       # Remember the rest of the string
    pre = RT       # Remember the S:S6 pattern
}

# End of string
RT == "\n" {
    gsub("[A-Z.]+", ",")  # Replace letters and dots by commas
    print pre pst $0      # Print the final result
}
Run it e.g. like this:
s=807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
gawk -f parse.awk <<<$s
Output:
S:S6S11,07001,23668732,1,1496851208,807262,7482
Here is one way you could do it with sed:
parse.sed
h # Duplicate string to hold space
s/.*(S:S6...).*/\1/ # Extract the desired pattern
x # Swap hold and pattern space
s/S:S6...// # Remove pattern (still in hold space)
s/[A-Z.]+/,/g # Replace letters and dots with commas
s/./,/ # Replace first char with comma
G # Append hold space content
s/([^\n]+)\n(.*)/\2\1/ # Rearrange to match desired output
Run it like this:
s=807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482
sed -Ef parse.sed <<<$s
Output:
S:S6S11,07001,23668732,1,1496851208,807262,7482
It sounds like this MAY be what you're really trying to do:
$ awk -F'[A-Z]{2,}|[.]' -v OFS=',' '{$1=substr($1,7) OFS substr($1,2,5)}1' file
S:S6S11,07001,23668732,1,1496851208,807262,7482
but your requirements for how and what to match where are very unclear and just one sample input line doesn't help much.
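For the record, here is the same one-liner run against the sample string via echo instead of a file:
$ echo '807001S:S6S11ABB23668732CC1DD1496851208.807262EE7482' |
awk -F'[A-Z]{2,}|[.]' -v OFS=',' '{$1=substr($1,7) OFS substr($1,2,5)}1'
S:S6S11,07001,23668732,1,1496851208,807262,7482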
I have a large data file in text format and I want to convert it to csv by specifying each column length.
number of columns = 5
column lengths
[4 2 5 1 1]
sample observations:
aasdfh9013512
ajshdj 2445df
Expected Output
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
GNU awk (gawk) supports this directly with FIELDWIDTHS, e.g.:
gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
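Here the $1=$1 assignment just forces gawk to re-split the record according to FIELDWIDTHS and rebuild it with the new OFS; a more explicit, equivalent form of the same one-liner (a sketch) is:
gawk 'BEGIN { FIELDWIDTHS = "4 2 5 1 1"; OFS = "," } { $1 = $1; print }' infile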
I would use sed and catch the groups with the given lengths:
$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Here's a solution that works with regular awk (does not require gawk).
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
It uses awk's substr function to define each field's start position and length. OFS defines what the output field separator is (in this case, a comma).
(Side note: This only works if the source data does not have any commas. If the data has commas, then you have to escape them to be proper CSV, which is beyond the scope of this question.)
Demo:
echo 'aasdfh9013512
ajshdj 2445df' |
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Here is a generic way of handling this in awk (an alternative to the FIELDWIDTHS option), where we need not hardcode substring positions: it inserts commas wherever the user-supplied column widths dictate. Written and tested in GNU awk. To use it, put the column widths, separated by spaces, in the awk variable colLength (a sample run follows the explanation below).
awk -v colLength="4 2 5 1 1" '
BEGIN{
    num=split(colLength,arr,OFS)
}
{
    j=sum=0
    while(++j<=num){
        if(length($0)>sum+arr[j]){
            sub("^.{"arr[j]+sum"}","&,")
        }
        sum+=arr[j]+1
    }
}
1
' Input_file
Explanation: A simple explanation would be: create the awk variable colLength, in which we define the column widths at which commas need to be inserted. Then, in the BEGIN section, split that value into the array arr, which holds the widths.
In the main program section, first nullify the variables j and sum. Then run a while loop from j=1 until j exceeds num. On each pass, substitute the first arr[j]+sum characters of the current line with themselves plus a comma (the & in sub), but only if characters actually follow the insertion point (otherwise it makes no sense to add a trailing comma, hence the additional length check). E.g. the sub pattern becomes ^.{4} the first time the loop runs, then ^.{7}, because each previously inserted comma shifts the next position by one; sum keeps track of this. At last, the lone 1 in this program prints the edited/non-edited lines.
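A sample run on the question's data, assuming the program body above is saved as generic.awk (a hypothetical name):
$ awk -v colLength="4 2 5 1 1" -f generic.awk infile
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f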
If anyone is still looking for a solution, I have developed a small script in Python. It's easy to use, provided you have Python 3.5.
https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py
"""
This script will convert a fixed-width file into a delimited file. Tried on Python 3.5 only.
Sample run (order of arguments doesn't matter):
python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows:
1. Input file - Mandatory (argument -i) - the file which has the fixed-width data in it
2. Config file - Optional (argument -c; if not provided, the script will look for a Config.txt file on the same path; if that is not present either, the script will not run)
   It should have the format
   FieldName,FieldLength
   e.g.:
   FirstName,10
   SecondName,8
   Address,30
   etc.
3. Output file - Optional (argument -o; if not provided, the input file name plus "Delimited.txt" will be used)
4. Delimiter - Optional (argument -d; if not provided, the default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys

def slices(s, args):
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length

def extant_file(x):
    """
    'Type' for argparse - checks that the file exists but does not open it.
    """
    if not os.path.exists(x):
        # Argparse uses ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x

parser = ArgumentParser(description="Please provide your inputs as -i InputFile -o OutputFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True, help="Provide your input file name here; if the file is on a different path than where this script resides, then provide the full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False, help="Provide your output file name here; if the file is on a different path than where this script resides, then provide the full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False, help="Provide your config file name here; the file should have values as FieldName,FieldLength. If the file is on a different path than where this script resides, then provide the full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False, help="Provide the delimiter string you want", metavar="STRING", default="|")
args = parser.parse_args()

# Input file is mandatory
InputFile = args.InputFile
# Delimiter, by default "|"
DELIMITER = args.Delimiter
# Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print("Setting output file as " + OutputFile)
else:
    OutputFile = args.OutputFile
# Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print("There is no config file provided; exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print("Taking the Config.txt file on this path as the default config file")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()
with open(ConfigFile) as myfile:
    for line in myfile:
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)
for key, value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)
with open(OutputFile, 'w') as f1:
    fieldNames = DELIMITER.join(map(str, fieldNames))
    f1.write(fieldNames + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            rec = list(slices(line, fieldLength))
            myLine = DELIMITER.join(map(str, rec))
            f1.write(myLine + "\n")
Portable awk
Generate an awk script with the appropriate substr commands
cat cols
4
2
5
1
1
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1
Output:
substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)
Combine the lines and make it a valid awk script:
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'
Output:
{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }
Redirect the above to a file, e.g. /tmp/t.awk, and run it on the input file:
<infile awk -f /tmp/t.awk
Output:
aasd fh 90135 1 2
ajsh dj 2445 d f
Or with comma as the output separator:
<infile awk -f /tmp/t.awk OFS=,
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f