I am currently working with large datasets containing file information formatted into blocks of data. I am trying to take a piece of data from the file path line and append it as a new column to certain lines. The dataset contains file information formatted like so:
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10
7a:8b:8e:20:7b:38 1982 10
b9:45:3d:f4:97:88 1849 10
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10
19:c5:b2:aa:b3:60 613 10
11:7c:7e:76:4b:d5 1272 10
36:e0:59:49:b6:4a 581 10
9c:31:bc:8a:39:94 3296 10
01:f0:56:3a:e1:a9 1140 10
Whole File Hash: 4b28b44ae03d
What I want to do is take the file type (.jar and .c in this example) and append it to the corresponding Chunk Hash lines, so the final formatting would look like:
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
I already have the awk code to pull the file type and the chunk hash lines:
awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'
awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'
I am just not sure how to connect these pieces using awk (or sed) to append the file type as a new column to each line in its respective data block. Another thing to note: I am trying to do this in a bash script, if that makes a difference.
Here is a (GNU) sed solution:
/File path:/ { # If line matches "File path:"
h # Copy pattern space to hold space
s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space
x # Swap pattern space and hold space
} # Hold space now contains extension
/Chunk Hash/ { # If line matches "Chunk Hash"
n # Get next line into pattern space
:loop # Anchor for loop
/Whole File Hash/b # If line matches "Whole File Hash", jump out of loop
G # Append extension from hold space to pattern space
s/\n/\t\t\t\t/ # Substitute newline with a bunch of tabs
n # Get next line
b loop # Jump back to ":loop" label
}
This can be stored in a separate file (say, so.sed), and has to be called like
sed -r -f so.sed infile
resulting in
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
Non-GNU seds have to jump through the usual hoops to insert tabs and can't use the -r option (but probably support -E, which should be equivalent here; -r was just used for convenience, to avoid having to escape the parentheses).
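For instance, on such systems you can let the shell produce real tab characters and splice them into the expression (a minimal sketch, assuming a POSIX shell):
tab=$(printf '\t')
printf 'a\nb\n' | sed "s/a/a$tab$tab/"   # appends two literal tabs after "a"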
Solution in TXR language:
#(repeat)
# (cases)
File path: #*path.#suff
Inode Num: #inode
#header
# (collect)
#hashline
# (last)
Whole File Hash: #wfh
# (end)
# (output)
File path: #path.#suff
Inode Num: #inode
#header
# (repeat)
#{hashline 88}.#suff
# (end)
Whole File Hash: #wfh
# (end)
# (or)
#other
# (do (put-line other))
# (end)
#(end)
Run:
$ txr suffixes.txr data
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
In awk:
$ cat script.awk
/File path/ {
match($0,/\..+/)
ext=substr($0,RSTART,RLENGTH)
}
/Chunk Hash/ {
flag=1 # flag on
print # print here to...
next # avoid printing ext
}
/Whole File Hash:/ {
flag=0 # flag off
}
flag==1 {
print $0, ext # add space here to your liking, left it short...
next # ... to show output on screen without sidescrolling
} 1 # print non-flagged records
Run:
$ awk -f script.awk data.txt
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
45:97:2a:60:e3:69 3208 10 .jar
7a:8b:8e:20:7b:38 1982 10 .jar
b9:45:3d:f4:97:88 1849 10 .jar
Whole File Hash: 865999b40fd9
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash Chunk Size (bytes) Compression Ratio (tenth)
e8:b0:cb:6f:76:ff 1344 10 .c
19:c5:b2:aa:b3:60 613 10 .c
11:7c:7e:76:4b:d5 1272 10 .c
36:e0:59:49:b6:4a 581 10 .c
9c:31:bc:8a:39:94 3296 10 .c
01:f0:56:3a:e1:a9 1140 10 .c
Whole File Hash: 4b28b44ae03d
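Since the goal is a bash script: the same program also works inline as a one-liner (a sketch of the script.awk logic above; adjust the output separator to taste):
awk '/File path/{match($0,/\..+/); ext=substr($0,RSTART,RLENGTH)} /Chunk Hash/{flag=1; print; next} /Whole File Hash:/{flag=0} flag{print $0, ext; next} 1' data.txt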
awk --re-interval '
/^File/{ # If the line begins with "File"
s=gensub("[^.]+(.*)","\\1","1",$0); # Capture the extension (".c", ".jar", ...) and assign it to s
}
/(..:){3,}/{ # If the line contains "xx:" three or more times (a chunk-hash line)
gsub("[0-9]+$","&\t\t\t\t\t" s,$0) # Append s after the trailing number
}
1' file # Print every line
In order to count the lines of my repository, I ran the command below, and found out that images and PDFs are also included in the line count.
git ls-files | xargs wc -l
When someone asks you for the scale of the repository, would you include the images/pdfs?
If not, could someone help me answer the questions below?
How to exclude the files under the /pdfs directory?
How to exclude .jpg and .png files?
You can make use of cloc. It counts blank lines, comment lines, and physical lines of source code in many programming languages. cloc can take file, directory, and/or archive names as inputs. For instance, if you want to count the number of lines of code in your repository and exclude some directories while counting, you can specify those directories separated by commas, like this:
cloc --exclude-dir=imagedir,pdfdir your_repository
cloc will show you the report like this:
387 text files.
387 unique files.
22 files ignored.
github.com/AlDanial/cloc v 1.88 T=0.97 s (376.5 files/s, 152866.0 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Go 235 17216 11769 95308
InstallShield 2 410 0 11178
XML 41 1418 159 2738
Python 5 516 523 1792
Bourne Shell 21 266 283 1512
JSON 19 24 0 1005
Markdown 23 452 0 797
AsciiDoc 4 119 0 312
Ruby 4 44 31 238
YAML 4 4 2 113
WiX source 1 19 24 112
make 3 16 25 68
DOS Batch 2 13 2 38
WiX include 1 0 0 28
Dockerfile 1 13 9 17
-------------------------------------------------------------------------------
SUM: 366 20530 12827 115256
-------------------------------------------------------------------------------
You can also use CLOC with Git like this:
cloc $(git ls-files)
which is equivalent to
git ls-files | xargs cloc
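As for excluding .jpg and .png, cloc can also filter by extension; assuming your cloc version supports the --exclude-ext option:
cloc --exclude-ext=jpg,png your_repository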
cloc sounds like it does the job. You should remove space and tab from IFS if you use command substitution, though. Note that a one-shot prefix assignment, IFS=$'\n' cloc $(git ls-files), won't do it, because the assignment doesn't affect how the shell splits the substitution; set IFS=$'\n' on its own line first.
If you just want a rough word or line count, you could bodge it together like this; it gives you the language, too: clone the repo, test each file's type, count lines in the text files, then delete the clone.
#!/bin/sh -e
# Get dir name from URL + remove trailing slashes - works for _most_ urls
url=${1:? No URL given}
url=${url%/}; url=${url%/}
repo=${url##*/}
repo=${repo%.git}
dir=./$repo
# Clone repo in tmp
cd "${TMPDIR:-/tmp}"
[ -e "$dir" ] && { echo Exists: "$dir" >&2; exit 1; }
trap 'rm -rf "$dir"' EXIT INT
git clone "$url"
# Get column 1 width, for alignment
max_path_length=$(printf '%s\n' "$dir/"* | wc -L)
# Extract and print the data
printf '\n%s\n\n' "$repo text files details:"
for file in "$dir"/*; do
mime=$(file --brief --mime-type "$file")
type=${mime%%/*}
if [ "$type" = text ]; then
lines=$(grep -c . "$file") || true
lang=${mime##*/}
printf "%-${max_path_length}s %s\n" "${file#$dir}" "[$lang, $lines lines]"
total_lines=$((total_lines + lines))
fi
done
printf '\n%s\n\n' "${dir#./} total lines: $total_lines"
Example output:
$ git-wc 'git://git.savannah.gnu.org/sed.git'
Cloning into 'sed'...
remote: Counting objects: 6276, done.
remote: Compressing objects: 100% (1134/1134), done.
remote: Total 6276 (delta 4994), reused 6276 (delta 4994)
Receiving objects: 100% (6276/6276), 2.14 MiB | 495.00 KiB/s, done.
Resolving deltas: 100% (4994/4994), done.
sed text files details:
/AUTHORS [plain, 6 lines]
/BUGS [plain, 101 lines]
/COPYING [plain, 553 lines]
/ChangeLog-2014 [plain, 2586 lines]
/Makefile.am [x-makefile, 123 lines]
/NEWS [plain, 498 lines]
/README [plain, 12 lines]
/README-hacking [plain, 58 lines]
/THANKS.in [plain, 63 lines]
/basicdefs.h [x-c, 83 lines]
/bootstrap [x-shellscript, 930 lines]
/bootstrap.conf [plain, 121 lines]
/cfg.mk [plain, 343 lines]
/configure.ac [x-m4, 294 lines]
/init.cfg [plain, 163 lines]
/thanks-gen [x-perl, 12 lines]
sed total lines: 5946
If the repo is local, you can just adjust the input methods. I'm sure the idea is clear. I know cloning the whole repo may be the dumbest way to do something like this, but sometimes you just want to know a thing. Plus you can use bash/sh pattern tests to exclude paths, e.g. [[ $file == "$dir"/<exclude-dir>/* ]]; a concrete guard is sketched below.
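For example, to apply the exclusions from the question (a sketch; the /pdfs directory and the extensions are the ones asked about, to be adapted), a guard at the top of the for loop body in the script above would do:
case $file in
"$dir"/pdfs/*|*.jpg|*.png) continue ;; # skip the excluded directory and image extensions
esac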
I need to modify the build-id in the ELF notes section. I found out that it is possible here, and that I can do it by modifying this code. What I can't figure out is the location of the data. Here is what I'm talking about.
$ eu-readelf -S myelffile
Section Headers:
[Nr] Name Type Addr Off Size ES Flags Lk Inf Al
...
[ 2] .note.ABI-tag NOTE 000000000000028c 0000028c 00000020 0 A 0 0 4
[ 3] .note.gnu.build-id NOTE 00000000000002ac 000002ac 00000024 0 A 0 0 4
...
$ eu-readelf -n myelffile
Note section [ 2] '.note.ABI-tag' of 32 bytes at offset 0x28c:
Owner Data size Type
GNU 16 GNU_ABI_TAG
OS: Linux, ABI: 3.14.0
Note section [ 3] '.note.gnu.build-id' of 36 bytes at offset 0x2ac:
Owner Data size Type
GNU 20 GNU_BUILD_ID
Build ID: d75a086c288c582036b0562908304bc3a8033235
The .note.gnu.build-id section is 36 bytes. The build id is 20 bytes. What are the other 16 bytes?
I played with the code a bit and read 36 bytes of myelffile at offset 0x2ac. I got the following: 040000001400000003000000474e5500d75a086c288c582036b0562908304bc3a8033235.
Then I decided to use the Elf64_Shdr definition, so I read the data at address 0x2ac + sizeof(Elf64_Shdr.sh_name) + sizeof(Elf64_Shdr.sh_type) + sizeof(Elf64_Shdr.sh_flags), and I got my build id, d75a086c288c582036b0562908304bc3a8033235. It does make sense why I got it, since sizeof(Elf64_Shdr.sh_name) + sizeof(Elf64_Shdr.sh_type) + sizeof(Elf64_Shdr.sh_flags) = 16 bytes, but according to the Elf64_Shdr definition I should be pointing at Elf64_Addr sh_addr, i.e. the section's virtual address.
So what is not clear to me is: what are the other 16 bytes of the section? What do they represent? I can't reconcile the Elf64_Shdr definition with the results of my experiments.
The .note.gnu.build-id section is 36 bytes. The build id is 20 bytes. What are the other 16 bytes?
Each .note.* section starts with an Elf64_Nhdr (12 bytes), followed by the (4-byte-aligned) note name of variable size (GNU\0 here), followed by the (4-byte-aligned) actual note data. Documentation.
Looking at /bin/date on my system:
eu-readelf -Wn /bin/date
Note section [ 2] '.note.ABI-tag' of 32 bytes at offset 0x2c4:
Owner Data size Type
GNU 16 GNU_ABI_TAG
OS: Linux, ABI: 3.2.0
Note section [ 3] '.note.gnu.build-id' of 36 bytes at offset 0x2e4:
Owner Data size Type
GNU 20 GNU_BUILD_ID
Build ID: 979ae4616ae71af565b123da2f994f4261748cc9
What are the bytes at offset 0x2e4?
dd bs=1 skip=$((0x2e4)) count=36 < /bin/date | xxd
00000000: 0400 0000 1400 0000 0300 0000 474e 5500 ............GNU.
00000010: 979a e461 6ae7 1af5 65b1 23da 2f99 4f42 ...aj...e.#./.OB
00000020: 6174 8cc9 at..
So we have: .n_namesz == 4, .n_descsz == 20, .n_type == 3 == NT_GNU_BUILD_ID, followed by the 4-byte GNU\0 note name, followed by the 20 actual build-id bytes 0x97, 0x9a, etc.
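Putting that together, the build-id payload starts 16 bytes into the note section (the 12-byte Elf64_Nhdr plus the 4-byte GNU\0 name), which accounts for the 16 bytes asked about. To pull out just the 20 build-id bytes (a sketch; offsets taken from the /bin/date example above, so they will differ per binary):
dd bs=1 skip=$((0x2e4 + 12 + 4)) count=20 < /bin/date | xxd -p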
I have an annoying problem with the zip and zipnote programs (both in version 3.0) on my Debian stable platform.
I wish to create a zip archive storing (not compressing) data from standard input, without extra attributes/fields, and giving a name to the resulting file inside the zip file.
My first try was
printf "foodata" | zip -X0 bar.zip -
printf "# -\n#=foofile\n" | zipnote -w bar.zip
where zip creates a bar.zip archive with a stored file "-" containing "foodata", and zipnote renames the file from "-" to "foofile".
First problem (solved): zip, as we can see from zipdetails
001E Filename '-'
001F Extra ID #0001 0001 'ZIP64'
0021 Length 0010
0023 Uncompressed Size 0000000000000007
002B Compressed Size 0000000000000007
receiving data from standard input, doesn't know the size of the resulting file, so it creates a PKZIP 4.5 compatible zip archive (one that can exceed 4 GB) using the Zip64 extension and adds a Zip64 extra field to the file.
And the -X option removes extra file attributes but doesn't remove the Zip64 extra field.
This problem is easily solved by adding the -fz- option, as stated in the zip man page:
// .................................VVVV
printf "foodata" | zip -X0 -fz- bar.zip -
Now bar.zip is a PKZIP 2 compatible file, and there is no Zip64 extra field.
Second problem (not solved): zipnote changes the name of the contained file but also adds the Zip64 field to it.
I don't know why.
According to the zip man page:
zip removes the Zip64 extensions if not needed when archive entries are copied (see the -U (--copy) option).
So I understand that
zip bar.zip --out bar-corrected.zip
should create a new bar-corrected.zip archive where the file foofile is Zip64-free (foofile is very short, so the Zip64 extension isn't needed, I presume).
Unfortunately, this doesn't work: I get the warning
copying: foofile
zip warning: Local Version Needed To Extract does not match CD: foofile
and the resulting file keeps the Zip64 extension.
And it seems it doesn't work either when naming the file explicitly or when adding the -fz- option: I've tried a lot of combinations, but (maybe it's my fault) without success.
Questions:
(1) Can I prevent zipnote from adding the Zip64 field when it renames a file (and if so, how)?
(2) Otherwise, how can I use zip (with --copy? with -fz-?) to create a new zip archive free of the Zip64 extension?
[Edit: Updated to use Store rather than Deflate]
Not sure how to achieve what you want with zip and zipnote, but here is an alternative.
echo abc | perl -MIO::Compress::Zip=zip -e ' zip "-" => "out.zip", Method => 0, Name => "member.txt" '
$ unzip -lv out.zip
Archive: out.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
4 Stored 4 0% 2019-10-10 21:54 4788814e member.txt
-------- ------- --- -------
4 4 0% 1 file
No Zip64 or extra attributes are present in the zip file.
$ zipdetails out.zip
0000 LOCAL HEADER #1 04034B50
0004 Extract Zip Spec 14 '2.0'
0005 Extract OS 00 'MS-DOS'
0006 General Purpose Flag 0008
[Bit 3] 1 'Streamed'
0008 Compression Method 0000 'Stored'
000A Last Mod Time 4F4AAECA 'Thu Oct 10 21:54:20 2019'
000E CRC 00000000
0012 Compressed Length 00000000
0016 Uncompressed Length 00000000
001A Filename Length 000A
001C Extra Length 0000
001E Filename 'member.txt'
0028 PAYLOAD abc.
002C STREAMING DATA HEADER 08074B50
0030 CRC 4788814E
0034 Compressed Length 00000004
0038 Uncompressed Length 00000004
003C CENTRAL HEADER #1 02014B50
0040 Created Zip Spec 14 '2.0'
0041 Created OS 03 'Unix'
0042 Extract Zip Spec 14 '2.0'
0043 Extract OS 00 'MS-DOS'
0044 General Purpose Flag 0008
[Bit 3] 1 'Streamed'
0046 Compression Method 0000 'Stored'
0048 Last Mod Time 4F4AAECA 'Thu Oct 10 21:54:20 2019'
004C CRC 4788814E
0050 Compressed Length 00000004
0054 Uncompressed Length 00000004
0058 Filename Length 000A
005A Extra Length 0000
005C Comment Length 0000
005E Disk Start 0000
0060 Int File Attributes 0000
[Bit 0] 0 'Binary Data'
0062 Ext File Attributes 81A40000
0066 Local Header Offset 00000000
006A Filename 'member.txt'
0074 END CENTRAL HEADER 06054B50
0078 Number of this disk 0000
007A Central Dir Disk no 0000
007C Entries in this disk 0001
007E Total Entries 0001
0080 Size of Central Dir 00000038
0084 Offset to Central Dir 0000003C
0088 Comment Length 0000
Done
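Adapted to the data and member name from the question (a sketch; same module and options as above, only the input and Name changed):
printf "foodata" | perl -MIO::Compress::Zip=zip -e 'zip "-" => "bar.zip", Method => 0, Name => "foofile"'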
Created a simple script that wraps the previous answer. See streamzip
Usage is
printf "foodata" | streamzip -method=store -member-name=foofile -zipfile=/tmp/bar.zip
This is what unzip thinks is in the zip file
unzip -lv /tmp/bar.zip
Archive: /tmp/bar.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
7 Stored 7 0% 2019-10-15 20:25 1b7dd7cd foofile
-------- ------- --- -------
7 7 0% 1 file
So I tried this:
root@kali:~/Desktop/fmk# binwalk upgrade-2.4.0.bin
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
512 0x200 LZMA compressed data, properties: 0x6D, dictionary size: 8388608 bytes, uncompressed size: 2805816 bytes
927576 0xE2758 Squashfs filesystem, little endian, version 4.0, compression:xz, size: 12316692 bytes, 2963 inodes, blocksize: 262144 bytes, created: 2015-08-04 02:40:49
And then I used the following dd:
sudo dd if=upgrade-2.4.0.bin of=pineapple.squashfs bs=1 count=12316692
And I can't unsquashfs pineapple.squashfs.
Can't find a SQUASHFS superblock on pineapple.squashfs
You have to tell dd the offset where the squashfs filesystem starts, using the skip operand:
Usage: dd [OPERAND]...
or: dd OPTION
Copy a file, converting and formatting according to the operands.
bs=BYTES read and write up to BYTES bytes at a time
cbs=BYTES convert BYTES bytes at a time
conv=CONVS convert the file as per the comma separated symbol list
count=N copy only N input blocks
ibs=BYTES read up to BYTES bytes at a time (default: 512)
if=FILE read from FILE instead of stdin
iflag=FLAGS read as per the comma separated symbol list
obs=BYTES write BYTES bytes at a time (default: 512)
of=FILE write to FILE instead of stdout
oflag=FLAGS write as per the comma separated symbol list
seek=N skip N obs-sized blocks at start of output
skip=N skip N ibs-sized blocks at start of input
status=LEVEL The LEVEL of information to print to stderr;
'none' suppresses everything but error messages,
'noxfer' suppresses the final transfer statistics,
'progress' shows periodic transfer statistics
...
So, to extract the filesystem
dd if=upgrade-2.4.0.bin of=pineapple.squashfs bs=1 skip=927576
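Note that bs=1 copies a byte at a time, which is slow on images this size; with GNU dd you can keep a large block size and still skip by bytes (assuming a coreutils version that supports iflag=skip_bytes):
dd if=upgrade-2.4.0.bin of=pineapple.squashfs bs=64K iflag=skip_bytes skip=927576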
I did it with binwalk's automatic extraction:
binwalk -Me upgrade-2.4.0.bin
(-e extracts known payloads; -M recursively scans whatever gets extracted)
I am trying to build my own Linux derivative to run on a TI-AR7 board. I took the board from an old Telekom Speedport W 501V router. To understand how firmware is flashed onto the device, I downloaded the most recent official firmware. Using the Linux file command I determined that the image is a tar archive, which can be extracted easily.
ubuntu@ip-172-31-23-210:~/reverse$ ls
fw_speedport_w501v_v_28.04.38.image
ubuntu@ip-172-31-23-210:~/reverse$ file fw*
fw_speedport_w501v_v_28.04.38.image: POSIX tar archive (GNU)
ubuntu@ip-172-31-23-210:~/reverse$ tar -xvf fw*
./var/
./var/tmp/
./var/tmp/kernel.image
./var/tmp/filesystem.image
./var/flash_update.ko
./var/flash_update.o
./var/info.txt
./var/install
./var/chksum
./var/regelex
./var/signature
ubuntu@ip-172-31-23-210:~/reverse$
According to a wiki (Firmware-Image) that I found, ./var/tmp/kernel.image contains the actual firmware. During the update process this image is written to the mtd1 device. As stated in the wiki (LZMA-Kernel), the LZMA-compressed kernel starts with the magic number 0xfeed1281. A hexdump of kernel.image contains that number at its beginning.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ hexdump -n 4 kernel.image
0000000 1281 feed
0000004
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
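The magic number only looks byte-swapped because hexdump's default output is 16-bit little-endian words; reading the same bytes as one 32-bit little-endian word shows it directly (a sketch using hexdump's -e format option on a little-endian machine):
hexdump -n 4 -e '1/4 "%08x\n"' kernel.image   # prints feed1281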
The following script given on the last wiki entry should decompress the kernel.
#! /usr/bin/perl
use Compress::unLZMA;
use Archive::Zip;
open INPUT, "<$ARGV[0]" or die "can't open $ARGV[0]: $!";
read INPUT, $buf, 4;
$magic = unpack("V", $buf); # 32-bit little-endian magic number
if ($magic != 0xfeed1281) {
die "bad magic";
}
read INPUT, $buf, 4;
$len = unpack("V", $buf); # image length
read INPUT, $buf, 4*2; # address, unknown
read INPUT, $buf, 4;
$clen = unpack("V", $buf); # compressed data length
read INPUT, $buf, 4;
$dlen = unpack("V", $buf); # decompressed data length
read INPUT, $buf, 4;
$cksum = unpack("V", $buf); # CRC32 of the compressed data
printf "Archive checksum: 0x%08x\n", $cksum;
read INPUT, $buf, 1+4; # properties, dictionary size
read INPUT, $dummy, 3; # alignment
$buf .= pack('VV', $dlen, 0); # 8 bytes of real size
#$buf .= pack('VV', -1, -1); # 8 bytes of real size (size unknown)
read INPUT, $buf2, $clen; # the compressed payload itself
$crc = Archive::Zip::computeCRC32($buf2);
printf "Input CRC32: 0x%08x\n", $crc;
if ($cksum != $crc) {
die "wrong checksum";
}
$buf .= $buf2; # LZMA properties + size header + payload
$data = Compress::unLZMA::uncompress($buf);
unless (defined $data) {
die "uncompress: $@";
}
open OUTPUT, ">$ARGV[1]" or die "can't write $ARGV[1]";
print OUTPUT $data;
#truncate OUTPUT, $dlen;
To use the script you may need to install the Compress::unLZMA and Archive::Zip Perl modules.
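If the cpan client is configured, installing from CPAN may be quicker than the manual build shown below (an alternative sketch, assuming network access):
cpan Compress::unLZMA Archive::Zip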
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ tar -xvf Compress*
Compress-unLZMA-0.04/
Compress-unLZMA-0.04/Makefile.PL
Compress-unLZMA-0.04/ppport.h
Compress-unLZMA-0.04/Changes
Compress-unLZMA-0.04/lzma_sdk/
[...]
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ cd Compress*
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Compress::unLZMA
Writing MYMETA.yml and MYMETA.json
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ make
cp lib/Compress/unLZMA.pm blib/lib/Compress/unLZMA.pm
/usr/bin/perl /usr/share/perl/5.18/ExtUtils/xsubpp -typemap /usr/share/perl/5.18/ExtUtils/typemap unLZMA.xs > unLZMA.xsc && mv unLZMA.xsc unLZMA.c
cc -c -I. -Ilzma_sdk/Source -D_REENTRANT -D_GNU_SOURCE
[...]
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ sudo make install
Files found in blib/arch: installing files in blib/lib into architecture dependent library tree
Installing /usr/local/lib/perl/5.18.2/auto/Compress/unLZMA/unLZMA.bs
Installing /usr/local/lib/perl/5.18.2/auto/Compress/unLZMA/unLZMA.so
Installing /usr/local/lib/perl/5.18.2/Compress/unLZMA.pm
Installing /usr/local/man/man3/Compress::unLZMA.3pm
Appending installation info to /usr/local/lib/perl/5.18.2/perllocal.pod
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ # same for Archive::Zip module
After installing these dependencies the script decompressed the kernel successfully.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ ./decompress.pl kernel.image kernel.decompressed
Archive checksum: 0x29176e12
Input CRC32: 0x29176e12
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
But what kind of file is kernel.decompressed and how do I generate a similar file from my Linux kernel source? I continued analyzing it using file and binwalk.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ file kernel.decompressed
kernel.decompressed: data
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ binwalk kernel.decompressed
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
1509632 0x170900 Linux kernel version "2.6.13.1-ohio (686) (gcc version 3.4.6) #9 Wed Apr 4 13:48:08 CEST 2007"
1516240 0x1722D0 CRC32 polynomial table, little endian
1517535 0x1727DF Copyright string: "Copyright 1995-1998 Mark Adler "
1549488 0x17A4B0 Unix path: /usr/gnemul/irix/
1550920 0x17AA48 Unix path: /usr/lib/libc.so.1
1618031 0x18B06F Neighborly text, "neighbor %.2x%.2x.%.2x:%.2x:%.2x:%.2x:%.2x:%.2x lost on port %d(%s)(%s)"
1966080 0x1E0000 gzip compressed data, maximum compression, from Unix, last modified: 2007-04-04 11:45:13
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
So the Linux kernel starts at 1509632 and ends at 1516240. What kind of data is stored in front of the Linux kernel (0 to 1509632)? I extracted the kernel and that piece of unknown data using dd.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ dd if=kernel.decompressed of=unknown.data bs=1 count=1509632
1509632+0 records in
1509632+0 records out
1509632 bytes (1.5 MB) copied, 1.62137 s, 931 kB/s
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ dd if=kernel.decompressed of=kernel bs=1 skip=1509632 count=6608
6608+0 records in
6608+0 records out
6608 bytes (6.6 kB) copied, 0.0072771 s, 908 kB/s
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
I need to ask again: what kind of file is kernel, and how do I generate a similar file from my Linux kernel source? I used xxd and strings to look at the file more closely.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ xxd -l 100 kernel
0000000: 4c69 6e75 7820 7665 7273 696f 6e20 322e Linux version 2.
0000010: 362e 3133 2e31 2d6f 6869 6f20 2836 3836 6.13.1-ohio (686
0000020: 2920 2867 6363 2076 6572 7369 6f6e 2033 ) (gcc version 3
0000030: 2e34 2e36 2920 2339 2057 6564 2041 7072 .4.6) #9 Wed Apr
0000040: 2034 2031 333a 3438 3a30 3820 4345 5354 4 13:48:08 CEST
0000050: 2032 3030 370a 0000 0000 0000 0000 0000 2007...........
0000060: 0000 0000 ....
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ strings kernel
Linux version 2.6.13.1-ohio (686) (gcc version 3.4.6) #9 Wed Apr 4 13:48:08 CEST 2007
do_be
do_bp
do_tr
do_ri
do_cpu
nmi_exception_handler
do_ade
emulate_load_store_insn
do_page_fault
context_switch
__put_task_struct
do_exit
local_bh_enable
run_workqueue
2.6.13.1-ohio gcc-3.4
enable_irq
__free_pages_ok
free_hot_cold_page
prep_new_page
kmem_cache_destroy
kmem_cache_create
pageout
vunmap_pte_range
vmap_pte_range
__vunmap
__brelse
sync_dirty_buffer
bio_endio
queue_kicked_iocb
proc_get_inode
remove_proc_entry
sysfs_get
sysfs_fill_super
kref_get
kref_put
0123456789abcdefghijklmnopqrstuvwxyz
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
vsnprintf
{zt^f
pw0Gm
0cIZ-
68BG+
QC]S%
v,;Zk
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
This GitHub repository contains the extracted files for further analysis.