Recovering data from a corrupted, possibly partial zip

I'm working with some old legacy code and getting some build errors. I have a zip file called vocab100k.zip, and the code says that it should unzip to include 2 files: vocab.100k.utf8 and vectors.100k.utf8.
When I try to run System.IO.Compression.ZipFile.OpenRead(zipFileFullPath), I get System.IO.InvalidDataException: 'End of Central Directory record could not be found.' When I try to manually unzip through the File Explorer using WinRAR, I get "Unexpected end of archive".
Double clicking to preview the contents shows me that one of my two files is present inside.
I used WinRAR's repair function, but extracting the repaired zip gets to about 90% before it throws errors.
I suspect that this may have been one part of a multi-part zip at some point, and the later parts have been lost. Is there any way to extract even part of the vectors.100k.utf8 that I see there? Are there other ways the zip could have been corrupted?

Recovering Data from a Truncated Zip File
Assuming the file is simply truncated in the middle of vectors.100k.utf8 and the corruption isn't more serious, you should be able to recover part of the data. The output you've shown does suggest that this is a truncation issue, but I won't know for sure without the zipdetails output I requested.
If this is just a truncation issue, you may be able to uncompress what is present with the perl script, recoverzip, below. This should work on Windows, macOS or Linux; the only prerequisite is that perl is installed.
use strict ;
use warnings ;
use IO::Uncompress::Unzip qw( unzip $UnzipError );
die "Usage: recoverzip zipfile member outfile\n"
    if @ARGV != 3;
my $filename = shift;
my $name = shift;
my $outfile = shift;
unzip $filename => $outfile,
Name => $name,
or die "Cannot uncompress '$filename': $UnzipError\n" ;
The script takes three parameters:
1. the name of the zip file to process
2. the name of the zip member to read
3. the output filename to store the recovered data
This script isn't guaranteed to get any data from a truncated zip file, but it can in some cases. It depends on where the truncation falls.
Create a truncated zip file
Here is a worked example to show how it works. Note that I'm using Linux tools to generate the truncated zip file. The recovery part does not depend on Linux; all you need is perl installed on your system.
First pick an input file to add to a zip file
$ cat lorem.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor
in reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
Add lorem.txt to a zip file called try.zip
$ zip try.zip lorem.txt
$ unzip -l try.zip
Archive: try.zip
Length Date Time Name
--------- ---------- ----- ----
446 2022-09-09 09:17 lorem.txt
--------- -------
446 1 file
Now we need to truncate try.zip in the middle of the lorem.txt member. To do that we need to know where the compressed data lives in the zip file. We can use zipdetails to get that information.
$ perl zipdetails try.zip
0000 LOCAL HEADER #1 04034B50
0004 Extract Zip Spec 14 '2.0'
0005 Extract OS 00 'MS-DOS'
0006 General Purpose Flag 0000
[Bits 1-2] 0 'Normal Compression'
0008 Compression Method 0008 'Deflated'
000A Last Mod Time 55294A2E 'Fri Sep 9 10:17:28 2022'
000E CRC F90EE7FF
0012 Compressed Length 0000010E
0016 Uncompressed Length 000001BE
001A Filename Length 0009
001C Extra Length 001C
001E Filename 'lorem.txt'
0027 Extra ID #0001 5455 'UT: Extended Timestamp'
0029 Length 0009
002B Flags '03 mod access'
002C Mod Time 631AF698 'Fri Sep 9 09:17:28 2022'
0030 Access Time 631AF698 'Fri Sep 9 09:17:28 2022'
0034 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0036 Length 000B
0038 Version 01
0039 UID Size 04
003A UID 000003E8
003E GID Size 04
003F GID 000003E8
0043 PAYLOAD
0151 CENTRAL HEADER #1 02014B50
0155 Created Zip Spec 1E '3.0'
0156 Created OS 03 'Unix'
0157 Extract Zip Spec 14 '2.0'
0158 Extract OS 00 'MS-DOS'
0159 General Purpose Flag 0000
[Bits 1-2] 0 'Normal Compression'
015B Compression Method 0008 'Deflated'
015D Last Mod Time 55294A2E 'Fri Sep 9 10:17:28 2022'
0161 CRC F90EE7FF
0165 Compressed Length 0000010E
0169 Uncompressed Length 000001BE
016D Filename Length 0009
016F Extra Length 0018
0171 Comment Length 0000
0173 Disk Start 0000
0175 Int File Attributes 0001
[Bit 0] 1 Text Data
0177 Ext File Attributes 81ED0000
017B Local Header Offset 00000000
017F Filename 'lorem.txt'
0188 Extra ID #0001 5455 'UT: Extended Timestamp'
018A Length 0005
018C Flags '03 mod access'
018D Mod Time 631AF698 'Fri Sep 9 09:17:28 2022'
0191 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0193 Length 000B
0195 Version 01
0196 UID Size 04
0197 UID 000003E8
019B GID Size 04
019C GID 000003E8
01A0 END CENTRAL HEADER 06054B50
01A4 Number of this disk 0000
01A6 Central Dir Disk no 0000
01A8 Entries in this disk 0001
01AA Total Entries 0001
01AC Size of Central Dir 0000004F
01B0 Offset to Central Dir 00000151
01B4 Comment Length 0000
Done
There is quite a lot of output from zipdetails, but for our purposes we need to look at the PAYLOAD line -- that shows the offset where the compressed data for lorem.txt starts. In this case it is hex 043. The next field is the CENTRAL HEADER at offset hex 0151. So that means the compressed payload starts at offset 0x43 and ends at 0x150.
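The same offsets can be worked out by hand from the local header fields, which may help when zipdetails isn't available. A minimal sketch, assuming the member's local header sits at offset 0 of the archive (as it does for lorem.txt here); the function name payload_span is made up for illustration, and the 30-byte figure is the fixed part of a zip local file header.

```python
# Fixed-size part of a zip local file header, in bytes (signature through
# the "Extra Length" field, i.e. offsets 0x00-0x1D in the listing above).
LOCAL_HEADER_FIXED = 30


def payload_span(header_offset, name_len, extra_len, compressed_len):
    """Return (first, last) byte offsets of a member's compressed payload."""
    start = header_offset + LOCAL_HEADER_FIXED + name_len + extra_len
    end = start + compressed_len - 1
    return start, end


# Values copied from the zipdetails output for lorem.txt.
start, end = payload_span(0x0000, name_len=0x0009, extra_len=0x001C,
                          compressed_len=0x010E)
print(hex(start), hex(end))  # 0x43 0x150
```

The result matches the PAYLOAD offset and the byte just before the CENTRAL HEADER in the listing.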
Now truncate the zip file in the middle of the lorem.txt compressed data at offset 0x100 and write the truncated zip file to trunc.zip
$ head -c $((0x100)) try.zip >trunc.zip
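On a system without head, the same truncation can be done in a few lines of Python. This is only a sketch; the helper name truncate_copy is hypothetical, and the file names are the ones used in this example.

```python
def truncate_copy(src, dst, size):
    """Copy the first `size` bytes of src to dst, like `head -c size`."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read(size))


# Example call, equivalent to the head command above:
# truncate_copy("try.zip", "trunc.zip", 0x100)
```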
We now have a sample truncated zip file to test. First check what unzip thinks of the truncated file - it shows a very similar error to yours
$ unzip -t trunc.zip
Archive: trunc.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of trunc.zip or
trunc.zip.zip, and cannot find trunc.zip.ZIP, period.
Recover data from the truncated zip file
Now run the recoverzip script to see if we can get any data from the zip file.
$ perl recoverzip trunc.zip lorem.txt recovered.txt
Cannot uncompress 'trunc.zip': unexpected end of file
The unexpected end of file error is to be expected in this use-case.
Finally, let's see what data was recovered
$ cat recovered.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor
in reprehenderit in voluptate velit e
Success! In this instance we have recovered some of the data from lorem.txt.
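For readers without perl, the idea behind recoverzip can be sketched in Python: skip the member's local header and feed whatever compressed bytes remain to a raw deflate decoder, keeping the output produced before the data runs out. This is an illustration, not a port of IO::Uncompress::Unzip. It assumes the wanted member is the first one in the archive and is either stored or deflated, and the function name recover_first_member is hypothetical.

```python
import struct
import zlib


def recover_first_member(data):
    """Return (name, recovered_bytes) for the first member of a zip image.

    `data` holds the (possibly truncated) archive bytes. Whatever can be
    decompressed before the data runs out is returned.
    """
    if len(data) < 30:
        raise ValueError("too short to hold a local file header")
    # Fixed 30-byte local header: signature, five 16-bit fields,
    # three 32-bit fields, then filename length and extra length.
    (sig, _version, _flags, method, _mtime, _mdate, _crc,
     csize, _usize, name_len, extra_len) = struct.unpack("<4s5H3I2H", data[:30])
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at offset 0")

    name = data[30:30 + name_len].decode("utf-8", "replace")
    payload = data[30 + name_len + extra_len:]

    if method == 0:  # stored: the payload is the file content itself
        return name, (payload[:csize] if csize else payload)
    if method != 8:
        raise ValueError("unsupported compression method %d" % method)

    # Negative wbits selects a raw deflate stream, which is what zip uses.
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    return name, inflater.decompress(payload)
```

To try it on the example, read trunc.zip in binary mode, pass the bytes to recover_first_member, and write the second element of the result to a file. It loads the whole archive into memory, so for a very large archive a chunked read loop around inflater.decompress would be the better shape.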

If you have access to Linux, you can try using the zip tool to create a fixed version of the archive:
zip -FF vocab100k.zip --out vocab100k_fixed.zip
But this works only if the file you want to extract is not missing any parts.

Related

Problem, related to Zip4 extension, directing Standard Input Content into a zip archive, using zip, and using zipnotes to change the zipped file name

I have an annoying problem with the zip and zipnote programs (both version 3.0) on my Debian stable platform.
I wish to create a zip archive that stores (without compressing) data from standard input, without extra attributes/fields, and gives a name to the resulting file inside the zip file.
My first try was
printf "foodata" | zip -X0 bar.zip -
printf "# -\n#=foofile\n" | zipnote -w bar.zip
where zip creates a bar.zip archive with a stored file "-" containing "foodata", and zipnote renames the file from "-" to "foofile".
First problem (solved): zip, as we can see from zipdetails
001E Filename '-'
001F Extra ID #0001 0001 'ZIP64'
0021 Length 0010
0023 Uncompressed Size 0000000000000007
002B Compressed Size 0000000000000007
when receiving data from standard input, doesn't know the size of the resulting file, so it creates a PKZIP 4.5 compatible zip archive (one that can exceed 4 GB) using the Zip64 extension and adds a Zip64 extra field to the file.
The -X option removes extra file attributes but doesn't remove the Zip64 extra field.
This problem is easily solved by adding the -fz- option, as stated in the zip man page
// .................................VVVV
printf "foodata" | zip -X0 -fz- bar.zip -
Now bar.zip is a PKZIP 2 compatible file and there isn't the Zip64 extra field.
Second problem (not solved): zipnote changes the name of the contained file but also adds the Zip64 field to the file.
I don't know why.
According to the zip man page
zip removes the Zip64 extensions if not needed when archive entries are copied (see the -U (--copy) option).
So I understand that
zip bar.zip --out bar-corrected.zip
should create a new bar-corrected.zip archive where the file foofile is Zip64 free (foofile is very short, so the Zip64 extension isn't needed, I presume).
Unfortunately, this doesn't work: I get the warning
copying: foofile
zip warning: Local Version Needed To Extract does not match CD: foofile
and the resulting file still has the Zip64 extension.
It also doesn't seem to work when naming the file explicitly or adding the -fz- option: I've tried a lot of combinations but (maybe it is my fault) without success.
Questions:
(1) can I avoid (and how) that zipnote, changing the name of a file, add the Zip64 fields to it?
(2) otherwise, how can I use zip (with --copy? with -fz-?) to create a new zip archive Zip64 extension free?

[Edit: Updated to use Store rather than Deflate]
Not sure how to achieve what you want with zip and zipnote, but here is an alternative.
echo abc | perl -MIO::Compress::Zip=zip -e ' zip "-" => "out.zip", Method => 0, Name => "member.txt" '
$ unzip -lv out.zip
Archive: out.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
4 Stored 4 0% 2019-10-10 21:54 4788814e member.txt
-------- ------- --- -------
4 4 0% 1 file
No Zip64 or extra attributes are present in the zip file.
$ zipdetails out.zip
0000 LOCAL HEADER #1 04034B50
0004 Extract Zip Spec 14 '2.0'
0005 Extract OS 00 'MS-DOS'
0006 General Purpose Flag 0008
[Bit 3] 1 'Streamed'
0008 Compression Method 0000 'Stored'
000A Last Mod Time 4F4AAECA 'Thu Oct 10 21:54:20 2019'
000E CRC 00000000
0012 Compressed Length 00000000
0016 Uncompressed Length 00000000
001A Filename Length 000A
001C Extra Length 0000
001E Filename 'member.txt'
0028 PAYLOAD abc.
002C STREAMING DATA HEADER 08074B50
0030 CRC 4788814E
0034 Compressed Length 00000004
0038 Uncompressed Length 00000004
003C CENTRAL HEADER #1 02014B50
0040 Created Zip Spec 14 '2.0'
0041 Created OS 03 'Unix'
0042 Extract Zip Spec 14 '2.0'
0043 Extract OS 00 'MS-DOS'
0044 General Purpose Flag 0008
[Bit 3] 1 'Streamed'
0046 Compression Method 0000 'Stored'
0048 Last Mod Time 4F4AAECA 'Thu Oct 10 21:54:20 2019'
004C CRC 4788814E
0050 Compressed Length 00000004
0054 Uncompressed Length 00000004
0058 Filename Length 000A
005A Extra Length 0000
005C Comment Length 0000
005E Disk Start 0000
0060 Int File Attributes 0000
[Bit 0] 0 'Binary Data'
0062 Ext File Attributes 81A40000
0066 Local Header Offset 00000000
006A Filename 'member.txt'
0074 END CENTRAL HEADER 06054B50
0078 Number of this disk 0000
007A Central Dir Disk no 0000
007C Entries in this disk 0001
007E Total Entries 0001
0080 Size of Central Dir 00000038
0084 Offset to Central Dir 0000003C
0088 Comment Length 0000
Done
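If perl is not an option either, Python's standard zipfile module can produce a similar archive. A hedged sketch: it builds the archive in memory from bytes you have already read (for example from sys.stdin.buffer), stores them uncompressed, and names the member directly, so no rename step is needed. The function name store_single_member is made up for illustration. As far as I know zipfile only adds a Zip64 extra field when the sizes require it, and the last line prints the member's extra field so you can check that for your own input.

```python
import io
import zipfile


def store_single_member(data, member_name):
    """Return the bytes of a zip archive holding `data` as one stored member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # A bare ZipInfo carries no extra fields; its timestamp defaults to
        # 1980-01-01 unless date_time is set explicitly.
        zf.writestr(zipfile.ZipInfo(member_name), data)
    return buf.getvalue()


archive = store_single_member(b"foodata", "foofile")
info = zipfile.ZipFile(io.BytesIO(archive)).infolist()[0]
print(info.filename, info.compress_type, info.file_size, info.extra)
```

Writing the returned bytes to bar.zip gives a file that can be inspected with zipdetails in the same way as above.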

Created a simple script that wraps the previous answer. See streamzip
Usage is
printf "foodata" | streamzip -method=store -member-name=foofile -zipfile=/tmp/bar.zip
This is what unzip thinks is in the zip file
unzip -lv /tmp/bar.zip
Archive: /tmp/bar.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
7 Stored 7 0% 2019-10-15 20:25 1b7dd7cd foofile
-------- ------- --- -------
7 7 0% 1 file

Bash: how to verify that the words in an array are contained in a variable

Hi friends and colleagues,
I wrote the following script in order to verify that the words in the array are contained in the $list variable
#!/bin/bash
list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
array=( sdb sdd sde sdf sdg )
function contain_word
{
contain=false
[[ -z "${list// }" ]] && return
for arr in ${array[*]}
do
echo "$list" | grep -q $arr
[[ $? -eq 0 ]] && (( count ++ ))
done
[[ ${#array[@]} -eq $count ]] && export contain=true
}
contain_word
echo $contain
This script does the job, but it is long and ugly code for this purpose.
I would be happy to get a good idea for how to do it better (in bash / awk / perl one-liner etc.)
Example1
For
list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
array=( sdb sdd sde sdf sdg )
it will print true
Example2
For
list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
array=( sdw sdd sde sdf sdg )
it will print false

$ cat tst.sh
contain_word() {
list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
printf '%s\n' "${array[@]}" |
awk -v list="$list" '
BEGIN {
split(list,tmpArr)
for (idx in tmpArr) {
wordSet[tmpArr[idx]]
}
}
!($0 in wordSet) {
exit 1
}
'
}
array=( sdb sdd sde sdf sdg )
contain_word
printf '%s -> %s\n' "${array[*]}" "$?"
array=( sdw sdd sde sdf sdg )
contain_word
printf '%s -> %s\n' "${array[*]}" "$?"
$ ./tst.sh
sdb sdd sde sdf sdg -> 0
sdw sdd sde sdf sdg -> 1
The above uses full string comparison so no partial matches nor false regexp matches are possible. It also won't fail due to globbing and will work using any awk in any shell (that supports arrays using the syntax you provided) on any UNIX box. You can of course tweak the awk code or the calling shell code to print true or false rather than just the awk exit status as appropriate.
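The partial-match point is easy to see with a small example. The sketch below (Python, with made-up device names; both helper functions are hypothetical) contrasts a substring test, which is roughly what grep -q does in the original script, with an exact whole-word test like the awk lookup above.

```python
def contains_all_substring(words, text):
    """True if every word occurs somewhere in text, even inside another word."""
    return all(word in text for word in words)


def contains_all_exact(words, text):
    """True only if every word equals one of the whitespace-separated fields."""
    fields = set(text.split())
    return all(word in fields for word in words)


text = "sdb1 sdc sdd"
print(contains_all_substring(["sdb", "sdc"], text))  # True: "sdb" matches inside "sdb1"
print(contains_all_exact(["sdb", "sdc"], text))      # False: no field is exactly "sdb"
```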

Here's a few lines of python that would do it.
[user@local ~/tmp/b] python
>>> list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
>>> array="sdb sdd sde sdf sdg"
>>> set(array.split(" ")).issubset(set(list.split(" ")))
True
>>> list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
>>> array="sdw sdd sde sdf sdg"
>>> set(array.split(" ")).issubset(set(list.split(" ")))
False

You can convert the scalar variable into an array variable and compare both arrays (not a one-liner, and there are probably many ways to do it).
my $list = "sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo";
my @listary = split / /, $list;
my @myarray = qw(sdb sdd sde sdf sdg);
Then compare both arrays.

My solution is awk based and it turns the list into a regular expression in order to remove every word in the list from the array. If the result is empty, true is printed; otherwise false is printed.
awk -v list="$list" '
{
gsub(" +","|",list)
gsub(" *("list") *","")
print ($0) ? "false" : "true"
}
' <<<"${array[*]}"

Edit2: Using Perl is nice, simple and very efficient.
perl -e'@h{split/ /,shift}=();exists$h{$_}||exit 1 for@ARGV' "$list" "${array[@]}" && echo "true" || echo "false"
Original:
It would be much simpler if array is in file but anyway
list="sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo"
array=( sdb sdd sde sdf sdg )
[[ $(echo $list | sed 's/ /\n/g' | sort -u | grep -Ff <(echo ${array[@]} | sed 's/ /\n/g') | wc -l) -eq ${#array[@]} ]] && echo "true" || echo "false"
or shorter
[[ $(sed 's/ /\n/g' <<<$list | sort -u | grep -Ff <(sed 's/ /\n/g' <<<${array[@]}) | wc -l) -eq ${#array[@]} ]] && echo "true" || echo "false"
Why would anybody loop over the array? There is a difference between O(N*M) and O(N+M).
Edit: It seems that understanding how computers work and what O notation means is less common than I expected, so here is a small demonstration.
#!/bin/bash
list="aaa aab aac aad aae aaf aag aah aai aaj aak aal aam aan aao aap aaq aar aas aat aau aav aaw aax aay aaz aba abb abc abd abe abf abg abh abi abj abk abl abm abn abo abp abq abr abs abt abu abv abw abx aby abz aca acb acc acd ace acf acg ach aci acj ack acl acm acn aco acp acq acr acs act acu acv acw acx acy acz ada adb adc add ade adf adg adh adi adj adk adl adm adn ado adp adq adr ads adt adu adv adw adx ady adz aea aeb aec aed aee aef aeg aeh aei aej aek ael aem aen aeo aep aeq aer aes aet aeu aev aew aex aey aez afa afb afc afd afe aff afg afh afi afj afk afl afm afn afo afp afq afr afs aft afu afv afw afx afy afz aga agb agc agd age agf agg agh agi agj agk agl agm agn ago agp agq agr ags agt agu agv agw agx agy agz aha ahb ahc ahd ahe ahf ahg ahh ahi ahj ahk ahl ahm ahn aho ahp ahq ahr ahs aht ahu ahv ahw ahx ahy ahz aia aib aic aid aie aif aig aih aii aij aik ail aim ain aio aip aiq air ais ait aiu aiv aiw aix aiy aiz aja ajb ajc ajd aje ajf ajg ajh aji ajj ajk ajl ajm ajn ajo ajp ajq ajr ajs ajt aju ajv ajw ajx ajy ajz aka akb akc akd ake akf akg akh aki akj akk akl akm akn ako akp akq akr aks akt aku akv akw akx aky akz ala alb alc ald ale alf alg alh ali alj alk all alm aln alo alp alq alr als alt alu alv alw alx aly alz ama amb amc amd ame amf amg amh ami amj amk aml amm amn amo amp amq amr ams amt amu amv amw amx amy amz ana anb anc and ane anf ang anh ani anj ank anl anm ann ano anp anq anr ans ant anu anv anw anx any anz aoa aob aoc aod aoe aof aog aoh aoi aoj aok aol aom aon aoo aop aoq aor aos aot aou aov aow aox aoy aoz apa apb apc apd ape apf apg aph api apj apk apl apm apn apo app apq apr aps apt apu apv apw apx apy apz aqa aqb aqc aqd aqe aqf aqg aqh aqi aqj aqk aql aqm aqn aqo aqp aqq aqr aqs aqt aqu aqv aqw aqx aqy aqz ara arb arc ard are arf arg arh ari arj ark arl arm arn aro arp arq arr ars art aru arv arw arx ary arz asa asb asc asd ase asf asg ash asi asj ask asl asm asn aso asp asq asr ass ast asu asv asw asx asy asz ata atb atc atd 
ate atf atg ath ati atj atk atl atm atn ato atp atq atr ats att atu atv atw atx aty atz aua aub auc aud aue auf aug auh aui auj auk aul aum aun auo aup auq aur aus aut auu auv auw aux auy auz ava avb avc avd ave avf avg avh avi avj avk avl avm avn avo avp avq avr avs avt avu avv avw avx avy avz awa awb awc awd awe awf awg awh awi awj awk awl awm awn awo awp awq awr aws awt awu awv aww awx awy awz axa axb axc axd axe axf axg axh axi axj axk axl axm axn axo axp axq axr axs axt axu axv axw axx axy axz aya ayb ayc ayd aye ayf ayg ayh ayi ayj ayk ayl aym ayn ayo ayp ayq ayr ays ayt ayu ayv ayw ayx ayy ayz aza azb azc azd aze azf azg azh azi azj azk azl azm azn azo azp azq azr azs azt azu azv azw azx azy azz baa bab bac bad bae baf bag bah bai baj bak bal bam ban bao bap baq bar bas bat bau bav baw bax bay baz bba bbb bbc bbd bbe bbf bbg bbh bbi bbj bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn bco bcp bcq bcr bcs bct bcu bcv bcw bcx bcy bcz bda bdb bdc bdd bde bdf bdg bdh bdi bdj bdk bdl bdm bdn bdo bdp bdq bdr bds bdt bdu bdv bdw bdx bdy bdz bea beb bec bed bee bef beg beh bei bej bek bel bem ben beo bep beq ber bes bet beu bev bew bex bey bez bfa bfb bfc bfd bfe bff bfg bfh bfi bfj bfk bfl bfm bfn bfo bfp bfq bfr bfs bft bfu bfv bfw bfx bfy bfz bga bgb bgc bgd bge bgf bgg bgh bgi bgj bgk bgl bgm bgn bgo bgp bgq bgr bgs bgt bgu bgv bgw bgx bgy bgz bha bhb bhc bhd bhe bhf bhg bhh bhi bhj bhk bhl bhm bhn bho bhp bhq bhr bhs bht bhu bhv bhw bhx bhy bhz bia bib bic bid bie bif big bih bii bij bik bil bim bin bio bip biq bir bis bit biu biv biw bix biy biz bja bjb bjc bjd bje bjf bjg bjh bji bjj bjk bjl bjm bjn bjo bjp bjq bjr bjs bjt bju bjv bjw bjx bjy bjz bka bkb bkc bkd bke bkf bkg bkh bki bkj bkk bkl bkm bkn bko bkp bkq bkr bks bkt bku bkv bkw bkx bky bkz bla blb blc bld ble blf blg blh bli blj blk bll blm bln blo blp blq blr bls blt blu blv blw blx bly blz bma bmb bmc bmd bme bmf bmg bmh bmi bmj 
bmk bml"
array=(
aaa aab aac aad aae aaf aag aah aai aaj aak aal aam aan aao aap aaq aar aas
aat aau aav aaw aax aay aaz aba abb abc abd abe abf abg abh abi abj abk abl
abm abn abo abp abq abr abs abt abu abv abw abx aby abz aca acb acc acd ace
acf acg ach aci acj ack acl acm acn aco acp acq acr acs act acu acv acw acx
acy acz ada adb adc add ade adf adg adh adi adj adk adl adm adn ado adp adq
adr ads adt adu adv adw adx ady adz aea aeb aec aed aee aef aeg aeh aei aej
aek ael aem aen aeo aep aeq aer aes aet aeu aev aew aex aey aez afa afb afc
afd afe aff afg afh afi afj afk afl afm afn afo afp afq afr afs aft afu afv
afw afx afy afz aga agb agc agd age agf agg agh agi agj agk agl agm agn ago
agp agq agr ags agt agu agv agw agx agy agz aha ahb ahc ahd ahe ahf ahg ahh
ahi ahj ahk ahl ahm ahn aho ahp ahq ahr ahs aht ahu ahv ahw ahx ahy ahz aia
aib aic aid aie aif aig aih aii aij aik ail aim ain aio aip aiq air ais ait
aiu aiv aiw aix aiy aiz aja ajb ajc ajd aje ajf ajg ajh aji ajj ajk ajl ajm
ajn ajo ajp ajq ajr ajs ajt aju ajv ajw ajx ajy ajz aka akb akc akd ake akf
akg akh aki akj akk akl akm akn ako akp akq akr aks akt aku akv akw akx aky
akz ala alb alc ald ale alf alg alh ali alj alk all alm aln alo alp alq alr
als alt alu alv alw alx aly alz ama amb amc amd ame amf amg amh ami amj amk
aml amm amn amo amp amq amr ams amt amu amv amw amx amy amz ana anb anc and
ane anf ang anh ani anj ank anl anm ann ano anp anq anr ans ant anu anv anw
anx any anz aoa aob aoc aod aoe aof aog aoh aoi aoj aok aol aom aon aoo aop
aoq aor aos aot aou aov aow aox aoy aoz apa apb apc apd ape apf apg aph api
apj apk apl apm apn apo app apq apr aps apt apu apv apw apx apy apz aqa aqb
aqc aqd aqe aqf aqg aqh aqi aqj aqk aql aqm aqn aqo aqp aqq aqr aqs aqt aqu
aqv aqw aqx aqy aqz ara arb arc ard are arf arg arh ari arj ark arl arm arn
aro arp arq arr ars art aru arv arw arx ary arz asa asb asc asd ase asf asg
ash asi asj ask asl asm asn aso asp asq asr ass ast asu asv asw asx asy asz
ata atb atc atd ate atf atg ath ati atj atk atl atm atn ato atp atq atr ats
att atu atv atw atx aty atz aua aub auc aud aue auf aug auh aui auj auk aul
aum aun auo aup auq aur aus aut auu auv auw aux auy auz ava avb avc avd ave
avf avg avh avi avj avk avl avm avn avo avp avq avr avs avt avu avv avw avx
avy avz awa awb awc awd awe awf awg awh awi awj awk awl awm awn awo awp awq
awr aws awt awu awv aww awx awy awz axa axb axc axd axe axf axg axh axi axj
axk axl axm axn axo axp axq axr axs axt axu axv axw axx axy axz aya ayb ayc
ayd aye ayf ayg ayh ayi ayj ayk ayl aym ayn ayo ayp ayq ayr ays ayt ayu ayv
ayw ayx ayy ayz aza azb azc azd aze azf azg azh azi azj azk azl azm azn azo
azp azq azr azs azt azu azv azw azx azy azz baa bab bac bad bae baf bag bah
bai baj bak bal bam ban bao bap baq bar bas bat bau bav baw bax bay baz bba
bbb bbc bbd bbe bbf bbg bbh bbi bbj bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt
bbu bbv bbw bbx bby bbz bca bcb bcc bcd bce bcf bcg bch bci bcj bck bcl bcm
bcn bco bcp bcq bcr bcs bct bcu bcv bcw bcx bcy bcz bda bdb bdc bdd bde bdf
bdg bdh bdi bdj bdk bdl bdm bdn bdo bdp bdq bdr bds bdt bdu bdv bdw bdx bdy
bdz bea beb bec bed bee bef beg beh bei bej bek bel bem ben beo bep beq ber
bes bet beu bev bew bex bey bez bfa bfb bfc bfd bfe bff bfg bfh bfi bfj bfk
bfl bfm bfn bfo bfp bfq bfr bfs bft bfu bfv bfw bfx bfy bfz bga bgb bgc bgd
bge bgf bgg bgh bgi bgj bgk bgl bgm bgn bgo bgp bgq bgr bgs bgt bgu bgv bgw
bgx bgy bgz bha bhb bhc bhd bhe bhf bhg bhh bhi bhj bhk bhl bhm bhn bho bhp
bhq bhr bhs bht bhu bhv bhw bhx bhy bhz bia bib bic bid bie bif big bih bii
bij bik bil bim bin bio bip biq bir bis bit biu biv biw bix biy biz bja bjb
bjc bjd bje bjf bjg bjh bji bjj bjk bjl bjm bjn bjo bjp bjq bjr bjs bjt bju
bjv bjw bjx bjy bjz bka bkb bkc bkd bke bkf bkg bkh bki bkj bkk bkl bkm bkn
bko bkp bkq bkr bks bkt bku bkv bkw bkx bky bkz bla blb blc bld ble blf blg
blh bli blj blk bll blm bln blo blp blq blr bls blt blu blv blw blx bly blz
bma bmb bmc bmd bme bmf bmg bmh bmi bmj bmk bml
);
function contain_word
{
[[ -z "${list// }" ]] && return 1
for arr in ${array[*]}
do
echo "$list" | grep -q $arr
[[ $? -eq 0 ]] && (( count ++ ))
done
[[ ${#array[@]} -eq $count ]]
}
function contain_word2
{
[[ $(sed 's/ /\n/g' <<<$list | sort -u | grep -Ff <(sed 's/ /\n/g' <<<${array[@]}) | wc -l) -eq ${#array[@]} ]]
}
contain_word$1 && echo "true" || echo "false"
And simple demonstration what O(M*N) vs O(M+N) means for M=N=1000 which is not too much for modern HW, isn't it?
$ time ./test.sh
true
real 0m0.989s
user 0m1.040s
sys 0m0.319s
$ time ./test.sh 2
true
real 0m0.011s
user 0m0.012s
sys 0m0.000s
Even for M=N=100
list="aaa aab aac aad aae aaf aag aah aai aaj aak aal aam aan aao aap aaq aar aas aat aau aav aaw aax aay aaz aba abb abc abd abe abf abg abh abi abj abk abl abm abn abo abp abq abr abs abt abu abv abw abx aby abz aca acb acc acd ace acf acg ach aci acj ack acl acm acn aco acp acq acr acs act acu acv acw acx acy acz ada adb adc add ade adf adg adh adi adj adk adl adm adn ado adp adq adr ads adt adu adv"
array=(
aaa aab aac aad aae aaf aag aah aai aaj aak aal aam aan aao aap aaq aar aas
aat aau aav aaw aax aay aaz aba abb abc abd abe abf abg abh abi abj abk abl
abm abn abo abp abq abr abs abt abu abv abw abx aby abz aca acb acc acd ace
acf acg ach aci acj ack acl acm acn aco acp acq acr acs act acu acv acw acx
acy acz ada adb adc add ade adf adg adh adi adj adk adl adm adn ado adp adq
adr ads adt adu adv
)
$ time ./test.sh
true
real 0m0.117s
user 0m0.105s
sys 0m0.042s
$ time ./test.sh 2
true
real 0m0.008s
user 0m0.008s
sys 0m0.001s
That's it for how inefficient it is.
BTW using Perl would be more elegant than AWK
function contain_word3
{
perl -e'@h{split/ /,shift}=();exists$h{$_}||exit 1 for@ARGV' "$list" "${array[@]}"
}
and fast (8ms).
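The complexity argument can also be shown without spawning any processes. A rough Python sketch, assuming the words are separated by whitespace: the first function rescans the whole list for every array element, which is the O(N*M) shape, while the second builds a hash set once and then does one lookup per element (constant time on average), which is the O(N+M) shape. The function names are hypothetical and no timings are claimed here.

```python
def contains_all_scan(array, words):
    """O(N*M): walk the word list again for every array element."""
    for item in array:
        found = False
        for word in words:
            if word == item:
                found = True
                break
        if not found:
            return False
    return True


def contains_all_hashed(array, words):
    """O(N+M): build a set once, then do one lookup per array element."""
    lookup = set(words)
    return all(item in lookup for item in array)


words = "sdb sdc sdd sde sdf sdg sdh sdi sdk sdj sdo".split()
for array in (["sdb", "sdd", "sde", "sdf", "sdg"],
              ["sdw", "sdd", "sde", "sdf", "sdg"]):
    print(contains_all_scan(array, words), contains_all_hashed(array, words))
```

Both functions give the same answers; only the amount of work differs as the inputs grow.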

A perl one-liner
perl -ape 'BEGIN{$H{$_}=1while$_=shift}$_=!grep!$H{$_},@F' $list <<<"${array[*]}"
prints
1
and
perl -ape 'BEGIN{$H{$_}=1while$_=shift}$_=!grep!$H{$_},@F' $list <<<"${array[*]} not"
prints nothing because not is not in $list
How it works (see perl -h for the command-line switches):
BEGIN{$H{$_}=1while$_=shift} : fills a hash with the elements of @ARGV as keys and 1 as values, emptying @ARGV in the process
$_=!grep!$H{$_},@F : grep selects the elements of @F that are not found in the hash; because ! imposes scalar context, grep returns the number of such elements, and ! yields 1 if that number is 0 and nothing (the empty string) if it is >0.
Otherwise it can also be done in bash >=4.0 with associative arrays:
declare -A hashlist=([sdg]="1" [sdf]="1" [sde]="1" [sdd]="1" [sdc]="1" [sdb]="1" [sdo]="1" [sdk]="1" [sdj]="1" [sdi]="1" [sdh]="1" )
array=( sdb sdd sde sdf sdg )
r=0; for a in "${array[@]}"; do ((r|=\!hashlist[$a])); done ;((r=\!r))
This follows the same logic as the perl version, but it can be simplified by inverting the logic:
r=1; for a in "${array[@]}"; do ((r&=hashlist[$a])); done
also can be optimized to break the loop when first item is not found
r=1; for a in "${array[@]}"; do ((r&=hashlist[$a])) || break; done
then r=1 if all entries were found, 0 otherwise.
Note that ! must be escaped as \! only on the command line, because of history expansion (the -H switch); in a script the \ must be removed.

How to convert decimal short output?

With od -N 64 -i mpich
on Ubuntu 14.04 I have
0000000 1135000353 1135000810 1135005924 1135016843
0000020 1135027542 1135036186 1135041461 1135041331
0000040 1135043045 1135052773 1135063618 1135067789
0000060 1135064934 1135052521 1135033974 1135019865
0000100
How to convert these decimal shorts into ascii?

To show "these" decimals:
perl -ane 'shift @F; print map {pack "l",$_ } @F' <<EOS | od -c
0000000 1135000353 1135000810 1135005924 1135016843
0000020 1135027542 1135036186 1135041461 1135041331
0000040 1135043045 1135052773 1135063618 1135067789
0000060 1135064934 1135052521 1135033974 1135019865
0000100
EOS
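The same conversion can be sketched in Python. Note that od -i prints 4-byte integers rather than shorts, which is why a 32-bit format is used. Assumptions: the dump was produced on a little-endian machine (perl's "l" uses the native byte order, while "<i" below pins it to little-endian), and the helper name od_ints_to_bytes is made up for illustration.

```python
import struct

OD_OUTPUT = """\
0000000 1135000353 1135000810 1135005924 1135016843
0000020 1135027542 1135036186 1135041461 1135041331
0000040 1135043045 1135052773 1135063618 1135067789
0000060 1135064934 1135052521 1135033974 1135019865
0000100
"""


def od_ints_to_bytes(text):
    """Rebuild the raw bytes from `od -i` output (offset column dropped)."""
    out = bytearray()
    for line in text.splitlines():
        fields = line.split()
        for value in fields[1:]:  # fields[0] is the octal offset
            out += struct.pack("<i", int(value))
    return bytes(out)


raw = od_ints_to_bytes(OD_OUTPUT)
print(len(raw), raw[:4].hex())  # 64 21bba643
```

The 64 bytes match the -N 64 passed to od; whether they are readable text depends on what the file actually contains.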

Reverse engineer firmware image and rebuild Linux kernel for TI-AR7

I am trying to build my own Linux derivative to run on a TI-AR7 board. I took the board from an old Telekom Speedport W 501V router. To understand how firmware is flashed onto the device I have downloaded the most recent official firmware. Using the Linux file command I determined the image is a tar archive, which can be extracted easily.
ubuntu@ip-172-31-23-210:~/reverse$ ls
fw_speedport_w501v_v_28.04.38.image
ubuntu@ip-172-31-23-210:~/reverse$ file fw*
fw_speedport_w501v_v_28.04.38.image: POSIX tar archive (GNU)
ubuntu@ip-172-31-23-210:~/reverse$ tar -xvf fw*
./var/
./var/tmp/
./var/tmp/kernel.image
./var/tmp/filesystem.image
./var/flash_update.ko
./var/flash_update.o
./var/info.txt
./var/install
./var/chksum
./var/regelex
./var/signature
ubuntu@ip-172-31-23-210:~/reverse$
According to a wiki (Firmware-Image) that I have found, ./var/tmp/kernel.image contains the actual firmware. During the update process this image is written to the mtd1 device. As stated in the wiki (LZMA-Kernel) the lzma compressed kernel starts with the magic number 0xfeed1281. A hexdump of kernel.image contains that number at its beginning.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ hexdump -n 4 kernel.image
0000000 1281 feed
0000004
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
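hexdump without options prints 16-bit words in host byte order, so on a little-endian machine 1281 feed corresponds to the bytes 81 12 ed fe on disk. A small Python sketch of the magic-number check; the byte string is reconstructed from the hexdump output under that little-endian assumption, not read from the real file.

```python
import struct

MAGIC = 0xFEED1281

# First four bytes of kernel.image, reconstructed from the hexdump output
# (assumes the dump was taken on a little-endian host).
first_four = bytes.fromhex("8112edfe")

# "<I" reads a little-endian unsigned 32-bit integer, like perl's unpack("V").
(magic,) = struct.unpack("<I", first_four)
print(hex(magic), magic == MAGIC)  # 0xfeed1281 True
```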
The following script given on the last wiki entry should decompress the kernel.
#! /usr/bin/perl
use Compress::unLZMA;
use Archive::Zip;
open INPUT, "<$ARGV[0]" or die "can't open $ARGV[0]: $!";
read INPUT, $buf, 4;
$magic = unpack("V", $buf);
if ($magic != 0xfeed1281) {
die "bad magic";
}
read INPUT, $buf, 4;
$len = unpack("V", $buf);
read INPUT, $buf, 4*2; # address, unknown
read INPUT, $buf, 4;
$clen = unpack("V", $buf);
read INPUT, $buf, 4;
$dlen = unpack("V", $buf);
read INPUT, $buf, 4;
$cksum = unpack("V", $buf);
printf "Archive checksum: 0x%08x\n", $cksum;
read INPUT, $buf, 1+4; # properties, dictionary size
read INPUT, $dummy, 3; # alignment
$buf .= pack('VV', $dlen, 0); # 8 bytes of real size
#$buf .= pack('VV', -1, -1); # 8 bytes of real size
read INPUT, $buf2, $clen;
$crc = Archive::Zip::computeCRC32($buf2);
printf "Input CRC32: 0x%08x\n", $crc;
if ($cksum != $crc) {
die "wrong checksum";
}
$buf .= $buf2;
$data = Compress::unLZMA::uncompress($buf);
unless (defined $data) {
die "uncompress: $@";
}
open OUTPUT, ">$ARGV[1]" or die "can't write $ARGV[1]";
print OUTPUT $data;
#truncate OUTPUT, $dlen;
To use the script you may need to install Compress::unLZMA and Archive::Zip perl modules.
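For reference, the header layout that the script walks through can be written down as a single struct format. The Python sketch below only parses the header and verifies the checksum; it does not decompress anything. The field names follow the perl variables, the two fields after len are labelled address and unknown as in the script's comment, and parse_kernel_header is a hypothetical name. It assumes Archive::Zip::computeCRC32 is the standard CRC-32 that zlib.crc32 also computes.

```python
import struct
import zlib

MAGIC = 0xFEED1281
# magic, len, address, unknown, clen, dlen, cksum: seven little-endian uint32
HEADER = struct.Struct("<7I")


def parse_kernel_header(image):
    """Return (dlen, lzma_props, payload) after checking magic and CRC."""
    magic, _len, _addr, _unknown, clen, dlen, cksum = HEADER.unpack_from(image, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    props = image[28:33]           # 1 byte properties + 4 bytes dictionary size
    payload = image[36:36 + clen]  # the 3 alignment bytes at 33..35 are skipped
    if zlib.crc32(payload) & 0xFFFFFFFF != cksum:
        raise ValueError("wrong checksum")
    return dlen, props, payload
```

Decompressing the payload would additionally need an LZMA decoder fed with props, dlen and payload, which is the part the perl script delegates to Compress::unLZMA.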
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ tar -xvf Compress*
Compress-unLZMA-0.04/
Compress-unLZMA-0.04/Makefile.PL
Compress-unLZMA-0.04/ppport.h
Compress-unLZMA-0.04/Changes
Compress-unLZMA-0.04/lzma_sdk/
[...]
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ cd Compress*
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for Compress::unLZMA
Writing MYMETA.yml and MYMETA.json
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ make
cp lib/Compress/unLZMA.pm blib/lib/Compress/unLZMA.pm
/usr/bin/perl /usr/share/perl/5.18/ExtUtils/xsubpp -typemap /usr/share/perl/5.18/ExtUtils/typemap unLZMA.xs > unLZMA.xsc && mv unLZMA.xsc unLZMA.c
cc -c -I. -Ilzma_sdk/Source -D_REENTRANT -D_GNU_SOURCE
[...]
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ sudo make install
Files found in blib/arch: installing files in blib/lib into architecture dependent library tree
Installing /usr/local/lib/perl/5.18.2/auto/Compress/unLZMA/unLZMA.bs
Installing /usr/local/lib/perl/5.18.2/auto/Compress/unLZMA/unLZMA.so
Installing /usr/local/lib/perl/5.18.2/Compress/unLZMA.pm
Installing /usr/local/man/man3/Compress::unLZMA.3pm
Appending installation info to /usr/local/lib/perl/5.18.2/perllocal.pod
ubuntu@ip-172-31-23-210:~/reverse/var/tmp/Compress-unLZMA-0.04$ # same for Archive::Zip module
After installing these dependencies the script decompressed the kernel successfully.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ ./decompress.pl kernel.image kernel.decompressed
Archive checksum: 0x29176e12
Input CRC32: 0x29176e12
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
But what kind of file is kernel.decompressed, and how do I generate a similar file from my Linux kernel source? I continued analyzing it using file and binwalk.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ file kernel.decompressed
kernel.decompressed: data
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ binwalk kernel.decompressed
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
1509632 0x170900 Linux kernel version "2.6.13.1-ohio (686) (gcc version 3.4.6) #9 Wed Apr 4 13:48:08 CEST 2007"
1516240 0x1722D0 CRC32 polynomial table, little endian
1517535 0x1727DF Copyright string: "Copyright 1995-1998 Mark Adler "
1549488 0x17A4B0 Unix path: /usr/gnemul/irix/
1550920 0x17AA48 Unix path: /usr/lib/libc.so.1
1618031 0x18B06F Neighborly text, "neighbor %.2x%.2x.%.2x:%.2x:%.2x:%.2x:%.2x:%.2x lost on port %d(%s)(%s)"
1966080 0x1E0000 gzip compressed data, maximum compression, from Unix, last modified: 2007-04-04 11:45:13
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
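For readers without binwalk, the idea behind a listing like this is a scan for known byte signatures. Below is a much simplified sketch in Python. It only knows two signatures (the "Linux version " banner and the gzip magic bytes 1f 8b 08), and the names SIGNATURES, find_all and scan are made up for this example. It is not how binwalk itself is implemented.

```python
def find_all(data: bytes, needle: bytes) -> list:
    """Return every offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets

# Hypothetical, deliberately tiny signature table.
SIGNATURES = {
    "Linux version string": b"Linux version ",
    "gzip header (deflate)": b"\x1f\x8b\x08",
}

def scan(data: bytes) -> list:
    """Return sorted (offset, label) pairs for every signature hit."""
    hits = []
    for label, magic in SIGNATURES.items():
        hits.extend((off, label) for off in find_all(data, magic))
    return sorted(hits)

# Synthetic image: 16 bytes of padding, a version banner, then a gzip header.
blob = b"\x00" * 16 + b"Linux version 2.6" + b"\x00" * 4 + b"\x1f\x8b\x08\x00"
for off, label in scan(blob):
    print(f"{off:<10} 0x{off:<10X} {label}")
```

A hit only means the signature bytes occur at that offset. It says nothing about where the surrounding structure begins or ends.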
So the Linux kernel appears to start at offset 1509632 and end at 1516240. What kind of data is stored in front of the Linux kernel (offsets 0 to 1509632)? I extracted the kernel and that piece of unknown data using dd.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ dd if=kernel.decompressed of=unknown.data bs=1 count=1509632
1509632+0 records in
1509632+0 records out
1509632 bytes (1.5 MB) copied, 1.62137 s, 931 kB/s
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ dd if=kernel.decompressed of=kernel bs=1 skip=1509632 count=6608
6608+0 records in
6608+0 records out
6608 bytes (6.6 kB) copied, 0.0072771 s, 908 kB/s
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
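The same carving can be done without copying one byte at a time. Here is a small Python sketch that seeks to an offset and copies a fixed number of bytes. The helper name carve is hypothetical, and the demo runs on a throwaway file so it works without the real image.

```python
import os
import tempfile

def carve(src_path, dst_path, skip, count):
    """Copy count bytes starting at byte offset skip from src to dst.

    Returns the number of bytes actually copied, which can be smaller
    than count if the source file ends early.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(skip)
        chunk = src.read(count)
        dst.write(chunk)
    return len(chunk)

# Demo: 32 bytes of padding, a 22-byte banner, 8 bytes of trailing data.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "image.bin")
    with open(image, "wb") as fh:
        fh.write(b"\x00" * 32 + b"Linux version 2.6.13.1" + b"\xff" * 8)
    out = os.path.join(tmp, "kernel")
    copied = carve(image, out, skip=32, count=22)
    with open(out, "rb") as fh:
        carved = fh.read()

print(copied, carved)
```

For the files in this question, the equivalent of the second dd call would be carve("kernel.decompressed", "kernel", 1509632, 6608), since 1516240 - 1509632 = 6608.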
I need to ask again: what kind of file is kernel, and how do I generate a similar file from my Linux kernel source? I used xxd and strings to look at the file more closely.
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ xxd -l 100 kernel
0000000: 4c69 6e75 7820 7665 7273 696f 6e20 322e Linux version 2.
0000010: 362e 3133 2e31 2d6f 6869 6f20 2836 3836 6.13.1-ohio (686
0000020: 2920 2867 6363 2076 6572 7369 6f6e 2033 ) (gcc version 3
0000030: 2e34 2e36 2920 2339 2057 6564 2041 7072 .4.6) #9 Wed Apr
0000040: 2034 2031 333a 3438 3a30 3820 4345 5354 4 13:48:08 CEST
0000050: 2032 3030 370a 0000 0000 0000 0000 0000 2007...........
0000060: 0000 0000 ....
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$ strings kernel
Linux version 2.6.13.1-ohio (686) (gcc version 3.4.6) #9 Wed Apr 4 13:48:08 CEST 2007
do_be
do_bp
do_tr
do_ri
do_cpu
nmi_exception_handler
do_ade
emulate_load_store_insn
do_page_fault
context_switch
__put_task_struct
do_exit
local_bh_enable
run_workqueue
2.6.13.1-ohio gcc-3.4
enable_irq
__free_pages_ok
free_hot_cold_page
prep_new_page
kmem_cache_destroy
kmem_cache_create
pageout
vunmap_pte_range
vmap_pte_range
__vunmap
__brelse
sync_dirty_buffer
bio_endio
queue_kicked_iocb
proc_get_inode
remove_proc_entry
sysfs_get
sysfs_fill_super
kref_get
kref_put
0123456789abcdefghijklmnopqrstuvwxyz
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
vsnprintf
{zt^f
pw0Gm
0cIZ-
68BG+
QC]S%
v,;Zk
ubuntu@ip-172-31-23-210:~/reverse/var/tmp$
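The last few lines of that output (such as {zt^f) are most likely just random bytes that happen to be printable. To see why that happens, here is a rough Python approximation of what strings does: it reports every run of at least four printable ASCII bytes. The function name printable_strings is an assumption for this sketch, and the real tool handles more cases (tabs and other encodings, for example).

```python
import re

def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes (0x20-0x7e)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Symbol names separated by NUL bytes, plus one run that is too short.
sample = b"\x00\x01do_exit\x00ab\x00vsnprintf\xff"
print(printable_strings(sample))
```

Any five random bytes that fall in the printable range will show up the same way, so short hits near the end of a binary are usually noise.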
This GitHub repository contains the extracted files to use for further analysis.

Invalid character (0xe2) in mnemonic

I have trouble compiling my assembly code.
gcc returns: func_select.s:5: Error: invalid character (0xe2) in mnemonic
func_select.s:7: Error: invalid character (0xe2) in mnemonic
Here is the code (lines 5-7):
secondStringLength: ‫‪.string " second pstring length: %d‬‬\n"
OldChar: .string "‫‪old char: %c,‬‬"
NewChar: ‫‪.string " new char: %c,‬‬"
How can I fix this?

Remove the formatting characters embedded in the text.
$ charinfo 'secondStringLength:‫‪.string " second pstring length: %d‬‬\n"'
U+0073 LATIN SMALL LETTER S [Ll]
U+0065 LATIN SMALL LETTER E [Ll]
...
U+0068 LATIN SMALL LETTER H [Ll]
U+003A COLON [Po]
U+202B RIGHT-TO-LEFT EMBEDDING [Cf]
U+202A LEFT-TO-RIGHT EMBEDDING [Cf]
U+002E FULL STOP [Po]
U+0073 LATIN SMALL LETTER S [Ll]
...
U+0025 PERCENT SIGN [Po]
U+0064 LATIN SMALL LETTER D [Ll]
U+202C POP DIRECTIONAL FORMATTING [Cf]
U+202C POP DIRECTIONAL FORMATTING [Cf]
U+005C REVERSE SOLIDUS [Po]
U+006E LATIN SMALL LETTER N [Ll]
U+0022 QUOTATION MARK [Po]
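If charinfo isn't available, a few lines of Python with the standard unicodedata module can produce a similar listing. This sketch reports only characters in the Unicode "Cf" (format) category, which is the category shown for the offending characters above, and then strips them. The function names are made up for this example.

```python
import unicodedata

def format_chars(text: str) -> list:
    """Return (index, code point, name) for every Cf (format) character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

def strip_format_chars(text: str) -> str:
    """Drop every Cf character and keep everything else unchanged."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Same pattern as line 7 of the question, with the marks written as escapes.
line = 'NewChar: \u202b\u202a.string " new char: %c,\u202c\u202c"'
for index, code, name in format_chars(line):
    print(index, code, name)
print(strip_format_chars(line))
```

Note that this removes every format character, not only the bidirectional marks. That should be harmless in an assembly source file, but it may be too aggressive for text that legitimately uses them.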

Ignacio Vazquez-Abrams is right. To provide more detail, according to xxd this is your first line:
$ cat b | xxd
00000000: 7365 636f 6e64 5374 7269 6e67 4c65 6e67 secondStringLeng
00000010: 7468 3a20 2020 2020 e280 abe2 80aa 2e73 th:     .......s
00000020: 7472 696e 6720 2220 7365 636f 6e64 2070 tring " second p
00000030: 7374 7269 6e67 206c 656e 6774 683a 2025 string length: %
00000040: 64e2 80ac e280 ac5c 6e22 0a0a d......\n"..
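Note the byte sequences e2 80 ab and e2 80 aa just before .string. These are the UTF-8 encodings of U+202B and U+202A mentioned earlier, and the leading 0xe2 byte is what the assembler complains about. Remove them, as well as the two U+202C characters (e2 80 ac) that follow %d.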
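To double-check the byte sequences and clean the file at the byte level, something like the following Python sketch should work. It assumes the source file is UTF-8 and removes only the three bidirectional marks seen here. BIDI_MARKS and clean_bytes are names invented for this example.

```python
# The three marks found in the question: LRE, RLE and PDF.
BIDI_MARKS = ("\u202a", "\u202b", "\u202c")

def clean_bytes(raw: bytes) -> bytes:
    """Remove the UTF-8 encodings of the marks in BIDI_MARKS from raw."""
    for mark in BIDI_MARKS:
        raw = raw.replace(mark.encode("utf-8"), b"")
    return raw

# Show the UTF-8 bytes of each mark, for comparison with the xxd dump.
for mark in BIDI_MARKS:
    print(f"U+{ord(mark):04X}", mark.encode("utf-8").hex(" "))

# A fragment of the dump above: "th: ", the two marks, then ".s".
fragment = bytes.fromhex("74683a20 e280ab e280aa 2e73")
print(clean_bytes(fragment))
```

To fix the real file, read func_select.s in binary mode, pass the contents through clean_bytes, and write the result back, keeping a backup of the original first.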
