Count overlapping occurrences of a repeated string using grep/linux/bash
I'm trying to count occurrences of a repeated string, e.g.:
echo 'joebobtomtomtomjoebobmike' | grep -o 'tomtom' | wc -l
This outputs 1, but the string 'tomtom' fits twice here. How can I make it count both occurrences?
Thanks!
You can use this awk script:
{
    count = 0
    $0 = tolower($0)
    while (length() > 0) {
        m = match($0, pattern)
        if (m == 0)
            break
        count++
        $0 = substr($0, m + 1)
    }
    print count
}
Explanation
We first convert the line to lower case so that matching ignores case (the pattern should therefore be given in lower case). The script works by shortening the string after each match. It uses the function match() to find the position where the pattern matches. If m == 0, no match was found, so we break out of the loop. Otherwise we increment count and reset $0 to the substring starting at index m + 1, one character past the start of the match, so that overlapping matches are still found.
If you save this as a.awk, you can do
echo "joebobtomtomtomjoebobmike" | awk -v "pattern=tomtom" -f a.awk
And it will output 2.
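The same shorten-after-each-match idea can also be sketched in pure Bash with parameter expansion. This is only an illustration, not the script above: the function name count_overlapping is made up here, the pattern is treated as a literal string rather than a regex, and the input is not lower-cased.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count overlapping occurrences of a literal pattern.
count_overlapping() {
    local rest=$1 pattern=$2 prefix count=0
    while [[ -n $rest ]]; do
        # Text before the first occurrence of the pattern (quoted, so literal).
        prefix=${rest%%"$pattern"*}
        # Nothing was stripped: no occurrence is left.
        [[ $prefix == "$rest" ]] && break
        count=$((count + 1))
        # Continue one character past the start of this match.
        rest=${rest:${#prefix}+1}
    done
    printf '%s\n' "$count"
}

count_overlapping joebobtomtomtomjoebobmike tomtom   # prints 2
```

Advancing by one character, rather than by the length of the pattern, is what lets the second, overlapping tomtom be counted.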
This might work for you (GNU sed):
sed -r '/(tom)\1/!d;:a;s//\n\1/;ta;s/\n//'| wc -l
The repeating pattern tomtom can be written as the regexp (tom)\1. Replacing the first half of each repeat with a newline, and looping until no more matches are found, yields one newline per overlapping match. Because the result is itself printed as a line, wc -l would count one line too many, so one newline (here the first) is removed again. If there is no repeating pattern at all, the result must be zero, hence the first sed command, which deletes non-matching lines.
You could just walk the length of the string and see if the substring at the current location is the desired text:
string=joebobtomtomtomjoebobmiketomtomtom
match=tomtom
count=0
for ((i = 0; i <= ${#string} - ${#match}; i++)); do
    # Quote $match so it is compared literally, not as a glob pattern
    [[ ${string:i:${#match}} == "$match" ]] && ((count++))
done
echo "$count" # => 4
Related
How to swap the even and odd lines via bash script?
In the incoming string stream from the standard input, swap the even and odd lines. I've tried to do it like this, but reading from file and $i -lt $a.count aren't working:
$a= gc test.txt
for($i=0;$i -lt $a.count;$i++)
{
    if($i%2)
    {
        $a[$i-1]
    }
    else
    {
        $a[$i+1]
    }
}
Please help me to get this working.
Suggesting a one-line awk script:
awk '!(NR%2){print$0;print r}NR%2{r=$0}' input.txt
Explanation of the awk script:
!(NR % 2) {     # if the row number is divisible by 2 (no remainder)
    print $0    # print the current (even) row
    print r     # print the saved (odd) row
}
(NR % 2) {      # if the row number leaves a remainder when divided by 2
    r = $0      # save the current (odd) row in a variable
}
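One thing to watch: with an odd number of input lines, the one-liner above saves the final line but never prints it. A possible variant that passes the unpaired last line through is sketched below; the function name swap_pairs and the variable held are made up for this example, and passing the last line through is just one of several reasonable choices.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: swap each pair of lines, keep an unpaired last line.
swap_pairs() {
    awk '
        NR % 2 { held = $0; next }       # odd line: hold it for later
        { print; print held }            # even line: print it, then the held odd line
        END { if (NR % 2) print held }   # odd line count: pass the last line through
    ' "$@"
}

printf '%s\n' a b c | swap_pairs
```

With the three-line input shown, the first two lines come out swapped and the third is printed unchanged.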
There's already a good awk-based answer to the question here, and there are a few decent-looking sed and awk solutions in Swap every 2 lines in a file. However, the shell solution in Swap every 2 lines in a file is almost comically incompetent. The code below is an attempt at a functionally correct pure Bash solution. Bash is very slow, so it is practical to use only on small files (maybe up to 10 thousand lines).
#! /bin/bash -p
idx=0
lines=()
while IFS= read -r 'lines[idx]' || [[ -n ${lines[idx]} ]]; do
    (( idx == 1 )) && printf '%s\n%s\n' "${lines[1]}" "${lines[0]}"
    idx=$(( (idx+1)%2 ))
done
(( idx == 1 )) && printf '%s\n' "${lines[0]}"
The lines array is used to hold two consecutive lines.
IFS= prevents whitespace being stripped as lines are read.
The idx variable cycles through 0, 1, 0, 1, ... (idx=$(( (idx+1)%2 ))) so reading to lines[idx] cycles through putting input lines at indexes 0 and 1 in the lines array.
The read function returns non-zero status immediately if it encounters an unterminated final line in the input. That could cause the loop to terminate before processing the last line, thus losing it in the output. The || [[ -n ${lines[idx]} ]] avoids that by checking if the read actually read some input. (Fortunately, there's no such thing as an unterminated empty line at the end of a file.)
printf is used instead of echo to output the lines because echo doesn't work reliably for arbitrary strings. (For instance, a line containing just -n would get lost completely by echo "$line".) See Why is printf better than echo?.
The question doesn't say what to do if the input has an odd number of lines. This code ((( idx == 1 )) && printf '%s\n' "${lines[0]}") just passes the last line through, after the swapped lines. Other reasonable options would be to drop the last line or print it preceded by an extra blank line. The code can easily be modified to do one of those if desired.
The code is Shellcheck-clean.
To find the nearest smaller value based on a variable input inside bash script
I'm trying to find all the files with format: 100_Result.out, 101_Result.out, ... 104_Result.out from the subdirectories of a directory: /as/test_dir/. The structure of the subdirectories looks like: /as/test_dir/4/, /as/test_dir/5/, /as/test_dir/6/, /as/test_dir/7/, ...
So, if I have a variable num=102 in the script, then I want to check all the *_Result.out files and capture the file which is one value smaller than the num variable, i.e., I want the file 101_Result.out. Similarly, if num=101, then the file should be 100_Result.out.
But sometimes the .out files are not sequential, i.e., not all values are present. So, if num=102 but there is no 101_Result.out file, and I have a 100_Result.out file in one of the sub-directories, then that's the one I want.
I tried the script below; I believe it more or less gets there, but it doesn't look great.
#!/bin/bash
dir="/as/test_dir"
files=( $(find "$dir"/*/ -type f -name "*_Result.out" -exec basename "{}" \;) )
num=102
len=${#files[@]}
i=0
while [ $i -lt $len ]; do
    var=$(echo "${files[$i]}" | awk -F'_' '{print $1}')
    dif=$(($num - $var))
    if [[ "$dif" -ge '1' ]];then
        echo "$dif" >> tmpfile
    fi
    let i++
done
arr=( $(cat tmpfile) )
min=${arr[0]}
max=${arr[0]}
for i in "${arr[@]}"
do
    if [[ "$i" -lt "$min" ]];then
        min="$i"
    elif [[ "${#arr[@]}" -eq '1' ]] && [[ "${arr[0]}" -eq '1' ]];then
        min="$i"
        for j in "${files[@]}"
        do
            var=$(echo "${j}" | awk -F'_' '{print $1}')
            if [[ $(($num - $var)) -eq "$min" ]]; then
                file_name="${var}_Result.out"
                echo "$file_name"
            fi
        done
    fi
done
echo "$min"
#rm tmpfile
Any help is most welcome.
"and need to capture the file which is one value smaller than the num variable .... if i've num=102 in the script, then i want the file: 101_Result.out"
Just glob the file:
echo "$dir"/*/"$((num - 1))_Result.out"
"sometimes it could happen that the .out files are not in sequential, i.e if num=102, then it may happen that i've only 100_Result.out file in one the sub-directories"
Try not to store state in bash. Instead, write one long pipeline, for example:
# Get all files
find "$dir"/*/ -type f -name "*_Result.out" |
# Extract the number into a first column, separated by a space
sed 's~.*/\([0-9]*\)_Result.out$~\1 &~' |
# Keep only the files whose number is smaller than num
awk -v num="$num" '$1 < num' |
# Get the file with the largest remaining number
sort -n | tail -n1 |
# Remove the number
cut -d' ' -f2-
The question does not specify what to do if there are multiple files with the same number (i.e., a tie; e.g., /as/test_dir/4/101_Result.out and /as/test_dir/7/101_Result.out). This answer assumes that you want all of them.
One Step
This is very similar to Darkman’s answer, but it handles pathnames somewhat better. IMO, it’s clearer. It uses meaningful variable names (one-letter names are too short), looser spacing (169 characters is too many for a one-liner), and a simpler algorithm (no subtraction).
dir="/as/test_dir"
num=102
find "$dir" -type f -name '*_Result.out' |
    awk -F'[/_]' -v limit="$num" '
        {
            this_num = $(NF-1)
            if (this_num < limit) {
                numbers[$0] = this_num
                if (this_num > max) max = this_num
            }
        }
        END {
            for (i in numbers)
                if (numbers[i] == max) print i
        }
    '
You clearly already understand the find command: find all files in and under $dir whose names match *_Result.out. Pipe into awk. Each filename (pathname) becomes an input record to awk.
-F'[/_]' means use slash (/) and underscore (_) as field separators. That means that a filename (input record) of /as/test_dir/4/100_Result.out gets broken into these fields:
$1 = (blank)
$2 = as
$3 = test
$4 = dir
$5 = 4
$6 = 100
$7 = Result.out
($1 would be set to the text before the first / (or _) if there were any.)
As illustrated above, the number part of the file name is the second-to-last field in the record; i.e., $(NF-1). This depends on the fact that the file name always contains exactly one underscore, and it comes right after the number. (See Part 2 of this answer for a more flexible approach.)
If the number is less than the limit (e.g., 102), save the pathname in an array, associated with the number.
If the number is less than the limit but more than the maximum we have seen so far, update the maximum. (We don’t need to initialize max explicitly; awk automatically initializes all variables to zero; see footnote 1.)
Finally, print all the pathnames that are associated with the max value.
The above will list (all) the desired filenames on the standard output. As you know, you can put them into an array by putting arr=( before the find … | awk … pipeline, and ) after it.
________
Footnote 1: Actually, variables are initialized to null. This is treated as zero when it is used in a numeric context.
Two Steps
The above is OK for producing a human-readable, displayable list of filenames. However, filenames can contain weird characters like space, tab, newline, *, ?, etc.; processing the output from find can be problematic. A somewhat safer approach is to determine the max value, and then, as a second step, find the file(s) that match that value. You can then process those files with -exec.
max=$(find "$dir" -type f -name '*_Result.out' |
    awk -F'/' -v limit="$num" '
        {
            this_num = $NF
            sub(/_Result.out/, "", this_num)
            if (this_num + 0 < limit) {
                numbers[$0] = this_num
                if (this_num + 0 > max + 0) max = this_num
            }
        }
        END { print max }
    '
)
if [ "$max" = "" ]
then
    echo "No file(s) found."
else
    find "$dir" -type f -name "${max}_Result.out"
fi
-F'/' means use only slash (/) as field separator. That means that a filename (input record) of /as/test_dir/4/100_Result.out gets broken into these fields:
$1 = (blank)
$2 = as
$3 = test_dir
$4 = 4
$5 = 100_Result.out
Here, the last (rightmost) component of the pathname (i.e., the file name) is the last field in the record; i.e., $NF. This, of course, is equivalent to the -exec basename "{}" you’re already using.
Temporarily assign the file name to the this_num variable. Then strip off the _Result.out part (by substituting null for it), leaving just the number. Because sub() leaves this_num as a string, the comparisons add 0 to it so that awk compares numerically rather than as text (as text, 99 would not be less than 102).
Strictly speaking, the sub call should be sub(/_Result\.out/, "", this_num) to treat the . as a literal dot rather than an any-character (wildcard). But we know that the fourth-to-last character in the file name is an actual dot, because it matched the -name.
At the end, just print the maximum number, capturing the value in a shell variable, … …, and then find the files that match that number (name).
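For comparison, the "largest number below the limit" search can also be sketched without find or awk, using only Bash globbing. This is an illustration under assumptions that go beyond the thread: the function name nearest_below is made up, the files are assumed to sit exactly one directory level below the given directory, and every path that ties for the winning number is printed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the *_Result.out file(s) whose number is the
# largest one strictly below the given limit.
nearest_below() {
    local dir=$1 num=$2 path name n best=
    for path in "$dir"/*/*_Result.out; do
        [[ -f $path ]] || continue            # also skips an unmatched glob
        name=${path##*/}                      # strip the directory part
        n=${name%_Result.out}                 # strip the suffix, leaving the number
        [[ $n =~ ^[0-9]+$ ]] || continue      # ignore non-numeric prefixes
        (( 10#$n < num )) || continue         # keep only numbers below the limit
        if [[ -z $best ]] || (( 10#$n > 10#$best )); then
            best=$n
        fi
    done
    # Print every path carrying the winning number (ties included).
    [[ -n $best ]] && printf '%s\n' "$dir"/*/"${best}_Result.out"
}

# Example call (the directory is the one from the question):
# nearest_below /as/test_dir 102
```

The 10# prefix forces base 10, so a zero-padded name such as 099 is not read as octal. The function returns a non-zero status when nothing qualifies.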
Convert carriage return (\r) to actual overwrite
Questions
Is there a way to convert the carriage returns to actual overwrite in a string so that 000000000000\r1010 is transformed to 101000000000?
Context
1. Initial objective:
Having a number x (between 0 and 255) in base 10, I want to convert this number to base 2, add trailing zeros to get a 12-digit binary representation, generate 12 different numbers (each of them made of the first n digits in base 2, with n between 1 and 12) and print the base 10 representation of these 12 numbers.
2. Example:
With x = 10
Base 2 is 1010
With trailing zeros 101000000000
Extract the 12 "leading" numbers: 1, 10, 101, 1010, 10100, 101000, ...
Convert to base 10: 1, 2, 5, 10, 20, 40, ...
3. What I have done (it does not work):
x=10
x_base2="$(echo "obase=2;ibase=10;${x}" | bc)"
x_base2_padded="$(printf '%012d\r%s' 0 "${x_base2}")"
for i in {1..12}
do
    t=$(echo ${x_base2_padded:0:${i}})
    echo "obase=10;ibase=2;${t}" | bc
done
4. Why it does not work
Because the variable x_base2_padded contains the whole sequence 000000000000\r1010. This can be confirmed using hexdump for instance. In the for loop, when I extract the first 12 characters, I only get zeros.
5. Alternatives
I know I can work around this by literally adding zeros to the variable as follows:
x_base2=1010
x_base2_padded="$(printf '%s%0.*d' "${x_base2}" $((12-${#x_base2})) 0)"
Or by padding with zeros using printf and rev:
x_base2=1010
x_base2_padded="$(printf '%012s' "$(printf "${x_base2}" | rev)" | rev)"
Although these alternatives solve my problem now and let me continue my work, they do not really answer my question.
Related issue
The same problem may be observed in different contexts, for instance if one tries to concatenate multiple strings containing carriage returns. The result may be hard to predict.
str=$'bar\rfoo'
echo "${str}"
echo "${str}${str}"
echo "${str}${str}${str}"
echo "${str}${str}${str}${str}"
echo "${str}${str}${str}${str}${str}"
The first echo will output foo. Although you might expect the other echo commands to output foofoofoo..., they all output foobar.
The following function overwrite transforms its argument such that after each carriage return \r the beginning of the string is actually overwritten:
overwrite() {
    local segment result=
    while IFS= read -rd $'\r' segment; do
        result="$segment${result:${#segment}}"
    done < <(printf '%s\r' "$@")
    printf %s "$result"
}
Example
$ overwrite $'abcdef\r0123\rxy'
xy23ef
Note that the printed string is actually xy23ef, unlike echo $'abcdef\r0123\rxy' which only seems to print the same string, but still prints \r which is then interpreted by your terminal such that the result looks the same. You can confirm this with hexdump:
$ echo $'abcdef\r0123\rxy' | hexdump -c
0000000   a   b   c   d   e   f  \r   0   1   2   3  \r   x   y  \n
000000f
$ overwrite $'abcdef\r0123\rxy' | hexdump -c
0000000   x   y   2   3   e   f
0000006
The function overwrite also supports overwriting by arguments instead of \r-delimited segments:
$ overwrite abcdef 0123 xy
xy23ef
To convert variables in-place, use a command substitution:
myvar=$(overwrite "$myvar")
With awk, you'd set the field delimiter to \r and iterate through the fields from last to first, printing only the visible portion of each.
awk -F'\r' '{
    offset = 1
    for (i=NF; i>0; i--) {
        if (offset <= length($i)) {
            printf "%s", substr($i, offset)
            offset = length($i) + 1
        }
    }
    print ""
}'
This is indeed too long to put into a command substitution, so you'd better wrap it in a function and pipe the lines to be resolved into that.
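Wrapping the awk program in a function, as suggested, might look like the sketch below. The function name resolve_cr is made up for this example; the awk body is the one from the answer.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk program above: reads lines on stdin and
# prints each one as it would appear after the carriage returns are applied.
resolve_cr() {
    awk -F'\r' '{
        offset = 1
        # Walk the segments from last to first: later segments cover the
        # beginning of earlier ones, so print only what remains visible.
        for (i = NF; i > 0; i--) {
            if (offset <= length($i)) {
                printf "%s", substr($i, offset)
                offset = length($i) + 1
            }
        }
        print ""
    }'
}

printf '000000000000\r1010\n' | resolve_cr   # prints 101000000000
```

A variable can then be converted with something like padded=$(printf '%s\n' "$padded" | resolve_cr).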
To answer the specific question, how to convert 000000000000\r1010 to 101000000000, refer to Socowi's answer.
However, I wouldn't introduce the carriage return in the first place and would solve the problem like this:
#!/usr/bin/env bash
x=$1
# Start with 12 zeroes
var='000000000000'
# Convert input to binary
binary=$(bc <<< "obase = 2; $x")
# Rightpad with zeroes: ${#binary} is the number of characters in $binary,
# and ${var:x} removes the first x characters from $var
var=$binary${var:${#binary}}
# Print 12 substrings, convert to decimal: ${var:0:i} extracts the first
# i characters from $var, and $((x#$var)) interprets $var in base x
for ((i = 1; i <= ${#var}; ++i)); do
    echo "$((2#${var:0:i}))"
done
Extract orders and match to trades from two files
I have two attached files (orders1.txt and trades1.txt) I need to write a Bash script (possibly awk?) to extract orders and match them to trades. The output should produce a report that prints comma separated values containing “ClientID, OrderID, Price, Volume”. In addition to this for each client, I need to print the total volume and turnover (turnover is the subtotal of price * volume on each trade). Can someone please help me with a bash script that will do the above using the attached files? Any help would be greatly appreciated orders1.txt Entry Time, Client ID, Security ID, Order ID 25455410,DOLR,XGXUa,DOLR1435804437 25455410,XFKD,BUP3d,XFKD4746464646 25455413,QOXA,AIDl,QOXA7176202067 25455415,QOXA,IRUXb,QOXA6580494597 25455417,YXKH,OBWQs,YXKH4575139017 25455420,JBDX,BKNs,JBDX6760353333 25455428,DOLR,AOAb,DOLR9093170513 25455429,JBDX,QMP1Sh,JBDX2756804453 25455431,QOXA,QIP1Sh,QOXA6563975285 25455434,QOXA,XMUp,QOXA5569701531 25455437,XFKD,QLOJc,XFKD8793976660 25455438,YXKH,MRPp,YXKH2329856527 25455442,JBDX,YBPu,JBDX0100506066 25455450,QOXA,BUPYd,QOXA5832015401 25455451,QOXA,SIOQz,QOXA3909507967 25455451,DOLR,KID1Sh,DOLR2262067037 25455454,DOLR,JJHi,DOLR9923665017 25455461,YXKH,KBAPBa,YXKH2637373848 25455466,DOLR,EPYp,DOLR8639062962 25455468,DOLR,UQXKz,DOLR4349482234 25455474,JBDX,EFNs,JBDX7268036859 25455481,QOXA,XCB1Sh,QOXA4105943392 25455486,YXKH,XBAFp,YXKH0242733672 25455493,JBDX,BIF1Sh,JBDX2840241688 25455500,DOLR,QSOYp,DOLR6265839896 25455503,YXKH,IIYz,YXKH8505951163 25455504,YXKH,ZOIXp,YXKH2185348861 25455513,YXKH,MBOOp,YXKH4095442568 25455515,JBDX,P35p,JBDX9945514579 25455524,QOXA,YXOKz,QOXA1900595629 25455528,JBDX,XEQl,JBDX0126452783 25455528,XFKD,FJJMp,XFKD4392227425 25455535,QOXA,EZIp,QOXA4277118682 25455543,QOXA,YBPFa,QOXA6510879584 25455551,JBDX,EAMp,JBDX8924251479 25455552,QOXA,JXIQp,QOXA4360008399 25455554,DOLR,LISXPh,DOLR1853653280 25455557,XFKD,LOX14p,XFKD1759342196 25455558,JBDX,YXYb,JBDX8177118129 25455567,YXKH,MZQKl,YXKH6485420018 
25455569,JBDX,ZPIMz,JBDX2010952336 25455573,JBDX,COPe,JBDX1612537068 25455582,JBDX,HFKAp,JBDX2409813753 25455589,QOXA,XFKm,QOXA9692126523 25455593,XFKD,OFYp,XFKD8556940415 25455601,XFKD,FKQLb,XFKD4861992028 25455606,JBDX,RIASp,JBDX0262502677 25455608,DOLR,HRKKz,DOLR1739013513 25455615,DOLR,ZZXp,DOLR6727725911 25455623,JBDX,CKQPp,JBDX2587184235 25455630,YXKH,ZLQQp,YXKH6492126889 25455632,QOXA,ORPz,QOXA3594333316 25455640,XFKD,HPIXSh,XFKD6780729432 25455648,QOXA,ABOJe,QOXA6661411952 25455654,XFKD,YLIp,XFKD6374702721 25455654,DOLR,BCFp,DOLR8012564477 25455658,JBDX,ZMDKz,JBDX6885176695 25455665,JBDX,CBOe,JBDX8942732453 25455670,JBDX,FRHMl,JBDX5424320405 25455679,DOLR,YFJm,DOLR8212353717 25455680,XFKD,XAFp,XFKD4132890550 25455681,YXKH,PBIBOp,YXKH6106504736 25455684,DOLR,IFDu,DOLR8034515043 25455687,JBDX,JACe,JBDX8243949318 25455688,JBDX,ZFZKz,JBDX0866225752 25455693,QOXA,XOBm,QOXA5011416607 25455694,QOXA,IDQe,QOXA7608439570 25455698,JBDX,YBIDb,JBDX8727773702 25455705,YXKH,MXOp,YXKH7747780955 25455710,YXKH,PBZRYs,YXKH7353828884 25455719,QOXA,QFDb,QOXA2477859437 25455720,XFKD,PZARp,XFKD4995735686 25455722,JBDX,ZLKKb,JBDX3564523161 25455730,XFKD,QFH1Sh,XFKD6181225566 25455733,JBDX,KWVJYc,JBDX7013108210 25455733,YXKH,ZQI1Sh,YXKH7095815077 25455739,YXKH,XIJp,YXKH0497248757 25455739,YXKH,ZXJp,YXKH5848658513 25455747,JBDX,XASd,JBDX4986246117 25455751,XFKD,XQIKz,XFKD5919379575 25455760,JBDX,IBXPb,JBDX8168710376 25455763,XFKD,EVAOi,XFKD8175209012 25455765,XFKD,JXKp,XFKD2750952933 25455773,XFKD,PTBAXs,XFKD8139382011 25455778,QOXA,XJp,QOXA8227838196 25455783,QOXA,CYBIp,QOXA2072297264 25455792,JBDX,PZI1Sh,JBDX7022115629 25455792,XFKD,XIKQl,XFKD6434550362 25455792,DOLR,YKPm,DOLR6394606248 25455796,QOXA,JXOXPh,QOXA9672544909 25455797,YXKH,YIWm,YXKH5946342983 25455803,YXKH,JZEm,YXKH5317189370 25455810,QOXA,OBMFz,QOXA0985316706 25455810,QOXA,DAJPp,QOXA6105975858 25455810,JBDX,FBBJl,JBDX1316207043 25455819,XFKD,YXKm,XFKD6946276671 25455821,YXKH,UIAUs,YXKH6010226371 
25455828,DOLR,PTJXs,DOLR1387517499 25455836,DOLR,DCEi,DOLR3854078054 25455845,YXKH,NYQe,YXKH3727923537 25455853,XFKD,TAEc,XFKD5377097556 25455858,XFKD,LMBOXo,XFKD4452678489 25455858,XFKD,AIQXp,XFKD5727938304 trades1.txt # The first 8 characters is execution time in microseconds since midnight # The next 14 characters is the order ID # The next 8 characters is the zero padded price # The next 8 characters is the zero padded volume 25455416QOXA6580494597 0000013800001856 25455428JBDX6760353333 0000007000002458 25455434DOLR9093170513 0000000400003832 25455435QOXA6563975285 0000034700009428 25455449QOXA5569701531 0000007500009023 25455447YXKH2329856527 0000038300009947 25455451QOXA5832015401 0000039900006432 25455454QOXA3909507967 0000026900001847 25455456DOLR2262067037 0000034700002732 25455471YXKH2637373848 0000010900006105 25455480DOLR8639062962 0000027500001975 25455488JBDX7268036859 0000005200004986 25455505JBDX2840241688 0000037900002029 25455521YXKH4095442568 0000046400002150 25455515JBDX9945514579 0000040800005904 25455535QOXA1900595629 0000015200006866 25455533JBDX0126452783 0000001700006615 25455542XFKD4392227425 0000035500009948 25455570XFKD1759342196 0000025700007816 25455574JBDX8177118129 0000022400000427 25455567YXKH6485420018 0000039000008327 25455573JBDX1612537068 0000013700001422 25455584JBDX2409813753 0000016600003588 25455603XFKD4861992028 0000017600004552 25455611JBDX0262502677 0000007900003235 25455625JBDX2587184235 0000024300006723 25455658XFKD6374702721 0000046400009451 25455673JBDX6885176695 0000010900009258 25455671JBDX5424320405 0000005400003618 25455679DOLR8212353717 0000041100003633 25455697QOXA5011416607 0000018800007376 25455696QOXA7608439570 0000013000007463 25455716YXKH7747780955 0000037000006357 25455719QOXA2477859437 0000039300009840 25455723XFKD4995735686 0000045500009858 25455727JBDX3564523161 0000021300000639 25455742YXKH7095815077 0000023000003945 25455739YXKH5848658513 0000042700002084 25455766XFKD5919379575 0000022200003603 
25455777XFKD8175209012 0000033300006350 25455788XFKD8139382011 0000034500007461 25455793QOXA8227838196 0000011600007081 25455784QOXA2072297264 0000017000004429 25455800XFKD6434550362 0000030000002409 25455801QOXA9672544909 0000039600001033 25455815QOXA6105975858 0000034800008373 25455814JBDX1316207043 0000026500005237 25455831YXKH6010226371 0000011400004945 25455838DOLR1387517499 0000046200006129 25455847YXKH3727923537 0000037400008061 25455873XFKD5727938304 0000048700007298
I have the following script:
#!/bin/bash
declare -A volumes
declare -A turnovers
declare -A orders
# Read the first file, remembering for each order the client id
while read -r line
do
    # Jump over comments
    if [[ ${line:0:1} == "#" ]] ; then continue; fi;
    details=($(echo $line | tr ',' " "))
    order_id=${details[3]}
    client_id=${details[1]}
    orders[$order_id]=$client_id
done < $1
echo "ClientID,OrderID,Price,Volume"
while read -r line
do
    # Jump over comments
    if [[ ${line:0:1} == "#" ]] ; then continue; fi;
    order_id=$(echo ${line:8:20} | tr -d '[:space:]')
    client_id=${orders[$order_id]}
    price=${line:28:8}
    volume=${line: -8}
    echo "$client_id,$order_id,$price,$volume"
    price=$(echo $price | awk '{printf "%d", $0}')
    volume=$(echo $volume | awk '{printf "%d", $0}')
    order_turnover=$(($price*$volume))
    old_turnover=${turnovers[$client_id]}
    [[ -z "$old_turnover" ]] && old_turnover=0
    total_turnover=$(($old_turnover+$order_turnover))
    turnovers[$client_id]=$total_turnover
    old_volumes=${volumes[$client_id]}
    [[ -z "$old_volumes" ]] && old_volumes=0
    total_volume=$((old_volumes+volume))
    volumes[$client_id]=$total_volume
done < $2
echo "ClientID,Volume,Turnover"
for client_id in ${!volumes[@]}
do
    volume=${volumes[$client_id]}
    turnover=${turnovers[$client_id]}
    echo "$client_id,$volume,$turnover"
done
Can anyone think of anything more elegant? Thanks in advance. C
Assumption 1: the two files are ordered, so line x represents an action that is older than line x+1. If not, then further work is needed. The assumption makes our work easier.
Let's first change the delimiter of traders into a comma:
sed -i 's/ /,/g' traders.txt
This is done in place for the sake of simplicity. So you now have traders, which is comma separated, as is orders. This is Assumption 2.
Keep working on traders: split all columns and add titles (see footnote 1). More on the reasons why in a moment.
gawk -i inplace -v INPLACE_SUFFIX=.bak 'BEGINFILE{FS=",";OFS=",";print "execution time,order ID,price,volume";}{print substr($1,1,8),substr($1,9),substr($2,1,9),substr($2,9)}' traders.txt
Ugly but works. Now let's process your data using the following awk script:
BEGIN {
    FS=","
    OFS=","
}
{
    if (1 == NR) {
        getline line < TRADERS # consume title line
        print "Client ID,Order ID,Price,Volume,Turnover"; # print the title line; remove this print to omit it
        getline line < TRADERS # reads first data line
        split(line, transaction, ",")
        next
    }
    if (transaction[2] == $4) {
        print $2, $4, transaction[3], transaction[4], transaction[3]*transaction[4]
        getline line < TRADERS # reads new data line
        split(line, transaction, ",")
    }
}
called by:
gawk -f script -v TRADERS=traders.txt orders.txt
And there you have it. Some caveats:
check the numbers, as implicit gawk number conversion might not be correct with zero-padded numbers. There is a fix for that if needed;
getline might explode if we run out of lines from traders. I haven't put in any check, that's up to you;
no control over timestamps. The match is based on Order ID.
Output file: Client ID,Order ID,Price,Volume,Turnover QOXA,QOXA6580494597,000001380,00001856,2561280 JBDX,JBDX6760353333,000000700,00002458,1720600 DOLR,DOLR9093170513,000000040,00003832,153280 QOXA,QOXA6563975285,000003470,00009428,32715160 QOXA,QOXA5569701531,000000750,00009023,6767250 YXKH,YXKH2329856527,000003830,00009947,38097010 QOXA,QOXA5832015401,000003990,00006432,25663680 QOXA,QOXA3909507967,000002690,00001847,4968430 DOLR,DOLR2262067037,000003470,00002732,9480040 YXKH,YXKH2637373848,000001090,00006105,6654450 DOLR,DOLR8639062962,000002750,00001975,5431250 JBDX,JBDX7268036859,000000520,00004986,2592720 JBDX,JBDX2840241688,000003790,00002029,7689910 YXKH,YXKH4095442568,000004640,00002150,9976000 JBDX,JBDX9945514579,000004080,00005904,24088320 QOXA,QOXA1900595629,000001520,00006866,10436320 JBDX,JBDX0126452783,000000170,00006615,1124550 XFKD,XFKD4392227425,000003550,00009948,35315400 XFKD,XFKD1759342196,000002570,00007816,20087120 JBDX,JBDX8177118129,000002240,00000427,956480 YXKH,YXKH6485420018,000003900,00008327,32475300 JBDX,JBDX1612537068,000001370,00001422,1948140 JBDX,JBDX2409813753,000001660,00003588,5956080 XFKD,XFKD4861992028,000001760,00004552,8011520 JBDX,JBDX0262502677,000000790,00003235,2555650 JBDX,JBDX2587184235,000002430,00006723,16336890 XFKD,XFKD6374702721,000004640,00009451,43852640 JBDX,JBDX6885176695,000001090,00009258,10091220 JBDX,JBDX5424320405,000000540,00003618,1953720 DOLR,DOLR8212353717,000004110,00003633,14931630 QOXA,QOXA5011416607,000001880,00007376,13866880 QOXA,QOXA7608439570,000001300,00007463,9701900 YXKH,YXKH7747780955,000003700,00006357,23520900 QOXA,QOXA2477859437,000003930,00009840,38671200 XFKD,XFKD4995735686,000004550,00009858,44853900 JBDX,JBDX3564523161,000002130,00000639,1361070 YXKH,YXKH7095815077,000002300,00003945,9073500 YXKH,YXKH5848658513,000004270,00002084,8898680 XFKD,XFKD5919379575,000002220,00003603,7998660 XFKD,XFKD8175209012,000003330,00006350,21145500 XFKD,XFKD8139382011,000003450,00007461,25740450 
QOXA,QOXA8227838196,000001160,00007081,8213960 QOXA,QOXA2072297264,000001700,00004429,7529300 XFKD,XFKD6434550362,000003000,00002409,7227000 QOXA,QOXA9672544909,000003960,00001033,4090680 QOXA,QOXA6105975858,000003480,00008373,29138040 JBDX,JBDX1316207043,000002650,00005237,13878050 YXKH,YXKH6010226371,000001140,00004945,5637300 DOLR,DOLR1387517499,000004620,00006129,28315980 YXKH,YXKH3727923537,000003740,00008061,30148140 XFKD,XFKD5727938304,000004870,00007298,35541260 1: requires gawk 4.1.0 or higher
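As a rough single-pass alternative that leaves both input files untouched, the join and the per-client totals could be sketched as below. The assumptions are loud ones: the function name match_trades is made up; the trades file is assumed to have the execution time and order ID run together in the first whitespace-separated field and the 8-digit price and 8-digit volume run together in the second (as the sample rows appear here, following the widths given in the file's own comments); and the header and comment lines are assumed to sit on lines of their own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: match_trades ORDERS_FILE TRADES_FILE
match_trades() {
    awk '
        BEGIN { print "ClientID,OrderID,Price,Volume" }
        FNR == NR {                          # first file: orders (CSV)
            if (FNR > 1) {                   # skip the header line
                split($0, f, ",")
                client[f[4]] = f[2]          # order ID -> client ID
            }
            next
        }
        /^#/ || NF < 2 { next }              # trades: skip comments and odd lines
        {
            id     = substr($1, 9)           # after the 8-character time
            price  = substr($2, 1, 8) + 0    # + 0 drops the zero padding
            volume = substr($2, 9, 8) + 0
            c = client[id]
            print c "," id "," price "," volume
            vol[c]  += volume
            turn[c] += price * volume
        }
        END {
            print "ClientID,Volume,Turnover"
            for (c in vol) printf "%s,%d,%.0f\n", c, vol[c], turn[c]
        }
    ' "$1" "$2"
}

# Example call (file names from the question):
# match_trades orders1.txt trades1.txt
```

The order of the per-client summary lines is whatever awk's for-in loop yields, so pipe the result through sort if a fixed order matters. A trade whose order ID is missing from the orders file is printed with an empty client ID.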
Shell Extract Text Before Digits in a String
I've found several examples of extractions before a single character and examples of extracting numbers, but I haven't found anything about extracting characters before numbers.
My question:
Some of the strings I have look like this:
NUC320 Syllabus Template - 8wk
SLA School Template - UL
CJ101 Syllabus Template - 8wk
TECH201 Syllabus Template - 8wk
Test Clone ID17
In cases where the string doesn't contain the data I want, I need it to be skipped. The desired output would be:
NUC-320
CJ-101
TECH-201
SLA School Template - UL & Test Clone ID17 would be skipped.
I imagine the process being something to the effect of:
Extract text before " "
Condition - Check for digits in the string
Extract text before digits and assign it to a variable x
Extract digits and assign to a variable y
Concatenate $x"-"$y and assign to another variable z
More information:
The strings are extracted from a line in a couple thousand text docs using a loop. They will be used to append to a hyperlink and rename a file during the loop.
Edit:
#!/bin/sh
# my files are named 1.txt through 9999.txt i both
# increments the loop and sets the filename to be searched
i=1
while [ $i -lt 10000 ]
do
    x=$(head -n 31 $i.txt | tail -1 | cut -c 7-)
    if [ ! -z "$x" -a "$x" != " " ]; then
        # I'd like to insert the hyperlink with the output on the
        # same line (1.txt;cj101 Syllabus Template - 8wk;www.link.com/cj101)
        echo "$i.txt;$x" >> syllabus.txt
    # else
    #     rm $i.txt
    fi
    i=`expr $i + 1`
    sleep .1
done
sed for printing lines starting with capital letters followed by digits. It also adds a - between them:
sed -n 's/^\([A-Z]\+\)\([0-9]\+\) .*/\1-\2/p' input
Gives:
NUC-320
CJ-101
TECH-201
A POSIX-compliant awk solution:
awk '{ if (match($1, /[0-9]+$/)) print substr($1, 1, RSTART-1) "-" substr($1, RSTART) }' \
    file |
    while IFS= read -r token; do
        # Process token here (append to hyperlink, ...)
        echo "[$token]"
    done
awk is used to extract the reformatted tokens of interest, which are then processed in a shell while loop.
match($1, /[0-9]+$/) matches the 1st whitespace-separated field ($1) against extended regex [0-9]+$, i.e., matches only if the field ends in one or more digits.
substr($1, 1, RSTART-1) "-" substr($1, RSTART) joins the part before the first digit with the run of digits using -, via the special RSTART variable, which indicates the 1-based character position where the most recent match() invocation matched.
awk '$1 ~/[0-9]/{sub(/...$/,"-&",$1);print $1}' file
NUC-320
CJ-101
TECH-201
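The step-by-step process imagined in the question (take the first word, check it for digits, split letters from digits, join with a hyphen) can also be sketched in pure Bash with a regex match. The function name course_code is made up for this example, and it assumes the wanted token is one run of letters followed by one run of digits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print LETTERS-DIGITS for the first word of a string,
# or return non-zero so the caller can skip the string.
course_code() {
    local first=${1%% *}                      # text before the first space
    if [[ $first =~ ^([A-Za-z]+)([0-9]+)$ ]]; then
        # BASH_REMATCH[1] holds the letters, BASH_REMATCH[2] the digits.
        printf '%s-%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

for s in "NUC320 Syllabus Template - 8wk" "SLA School Template - UL" "Test Clone ID17"; do
    if code=$(course_code "$s"); then
        echo "$code"
    fi
done
```

Of the three sample strings in the loop, only the first yields a code (NUC-320); the other two are skipped because their first word contains no digits.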