View images from blob - sqlite - linux

I have a sqlite3 database. One column has the TEXT type and contains blobs that I would like to save as files. They are gzipped files.
The output of the command sqlite3 db.sqlite3 ".dump" is:
INSERT INTO "data" VALUES(1,'objects','object0.gz',X'1F8B080000000000000 [.. a few thousands of hexadecimal characters ..] F3F5EF')
How can I extract the binary data from the sqlite file to a file using the command line?

sqlite3 cannot output binary data directly, so you have to convert the data to a hexdump, use cut to extract the hex digits from the blob literal, and use xxd (part of the vim package) to convert the hexdump back into binary:
sqlite3 my.db "SELECT quote(MyBlob) FROM MyTable WHERE id = 1;" \
| cut -d\' -f2 \
| xxd -r -p \
> object0.gz
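Since the blobs in this question are gzipped files, you can sanity-check the result afterwards (assuming file and gzip are installed):
file object0.gz       # should report: gzip compressed data
gunzip -t object0.gz  # tests the archive; a non-zero exit status means the data is corrupt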
With SQLite 3.8.6 or later, the command-line shell includes the fileio extension, which implements the writefile function:
sqlite3 my.db "SELECT writefile('object0.gz', MyBlob) FROM MyTable WHERE id = 1"
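If you want to dump every blob in the table rather than a single row, writefile can build the file name from another column; a minimal sketch, assuming a hypothetical integer id column:
sqlite3 my.db "SELECT writefile('object_' || id || '.gz', MyBlob) FROM MyTable"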

You can use writefile if you are using the sqlite3 command-line tool. Example usage:
select writefile('blob.bin', blob_column) from table where key='12345';

In my case, I use "hex" instead of "quote" to retrieve an image from the database, and there is no need for "cut" in the pipeline. For example:
sqlite3 fr.db "select hex(bmp) from reg where id=1" | xxd -r -p > 2.png

I had to make some minor changes to CL's answer in order to make it work for me:
The command structure he is using does not include the database name; the syntax I am using is something like:
sqlite3 mydatabase.sqlite3 "Select quote(BlobField) From TableWithBlob Where StringKey = '1';" | ...
The way he is using the cut command does not work on my machine. The correct way for me is:
cut -d "'" -f2
So the final command would be something like:
sqlite3 mydatabase.sqlite3 "Select quote(BlobField) From TableWithBlob Where StringKey = '1';" | cut -d "'" -f2 | xxd -r -p > myfile.extension
And in my case:
sqlite3 osm-carto_z14_m8_m.mbtiles "select quote(images.tile_data) from images where images.tile_id = '1';" | cut -d "'" -f2 | xxd -r -p > image.png

Related

How to make version-sort command work in a sh file?

I'm trying to use the "sort -V" command (a.k.a. version sort) in a sh file.
Specifically, I have the following line of code in a sh file:
SOME_PATH="$(ls dir_1/dir_2/v*/filename.txt | sort -V | tail -n1)"
What I'm trying to accomplish through the above command is that given a list of file paths with different version numbers, I want to get the file path with the greatest version number.
For example, let's assume that I have the following list of file paths:
dir_1/dir_2/v1/filename.txt,
dir_1/dir_2/v2/filename.txt,
dir_1/dir_2/v11/filename.txt
Then, I want the command to return dir_1/dir_2/v11/filename.txt instead of dir_1/dir_2/v2/filename.txt since the former has the greatest version value, "11".
From my understanding, the above Linux command accomplishes precisely this.
I confirmed that it works in an interactive bash terminal.
However, when I run a sh file with the above command in it, I get an
"ERROR: Unknown command line flag 'V'" error message.
Is there a way to make version-sort work in a sh file?
If not, is there a way to implement it without the -V flag?
Thank you.
Using the shell's printf and GNU awk (the three-argument match() is a gawk extension):
SOME_PATH=$(printf '%s\0' dir_1/dir_2/v*/filename.txt |
  awk 'BEGIN{FS="/";RS="\0";v=0}{match($3,/v([[:digit:]]+)/,m);if(m[1]+0>v){v=m[1]+0;l=$0}}END{print l}')
Using GNU awk only:
SOME_PATH=$(awk 'BEGIN{delete ARGV[0];v=0;for(i in ARGV){split(ARGV[i],s,"/");match(s[3],/v([[:digit:]]+)/,m);if(m[1]+0>v){v=m[1]+0;l=ARGV[i]}}}END{print l}' dir_1/dir_2/v*/filename.txt)
Formatted awk script:
#!/usr/bin/env -S awk -f
BEGIN {
  delete ARGV[0]
  v = 0
  for (i in ARGV) {
    split(ARGV[i], s, "/")
    match(s[3], /v([[:digit:]]+)/, m)
    if (m[1] + 0 > v) {
      v = m[1] + 0
      l = ARGV[i]
    }
  }
}
END {
  print l
}
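If GNU awk is not available, a plain-bash sketch of the same idea is possible, assuming the fixed dir_1/dir_2/vN layout from the question:
SOME_PATH=
best=0
for p in dir_1/dir_2/v*/filename.txt; do
  v=${p#dir_1/dir_2/v}  # strip everything up to the version number
  v=${v%%/*}            # strip everything after the version number
  if (( v > best )); then
    best=$v
    SOME_PATH=$p
  fi
done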
Using a null-delimited list stream, and not parsing the output of ls (see the StackExchange link below):
SOME_PATH=$(
printf '%s\0' dir_1/dir_2/v*/filename.txt |
sort -z -t'/' -k3V |
tail -zn1 |
tr -d '\0'
)
How it works:
printf '%s\0' dir_1/dir_2/v*/filename.txt: Expands the glob and prints the paths as a null-delimited stream.
sort -z -t'/' -k3V: Sorts the null-delimited input stream by version number (-k3V: version sort on the 3rd field, with -t'/' setting / as the field delimiter).
tail -zn1: Outputs the last null-delimited entry from the input stream, i.e. the path with the greatest version.
tr -d '\0': Trims out the trailing null byte to prevent the shell from complaining with: warning: command substitution: ignored null byte in input.
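A quick way to try the pipeline out, using throwaway directories that mirror the question's example:
mkdir -p dir_1/dir_2/v{1,2,11}
touch dir_1/dir_2/v{1,2,11}/filename.txt
printf '%s\0' dir_1/dir_2/v*/filename.txt |
  sort -z -t'/' -k3V |
  tail -zn1 |
  tr -d '\0'
# prints: dir_1/dir_2/v11/filename.txt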
StackExchange: Why not parse ls (and what to do instead)?

How to extract value from json contained in a variable using jq in bash

I am writing a bash script that has a JSON value stored in a variable, and now I want to extract values from that JSON using jq. The code used is:
json_val={"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"}
code_val= echo"$json_val" | jq '.code'
This throws a "no such file or directory" error.
If I change this to
json_val={"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"}
code_val=echo" $json_val " | jq '.code'
This does not throw any error, but the value in code_val is null.
If I try to do it manually, echo {"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"} | jq '.code' throws a parse error (Invalid numeric literal).
How can I make the first case work?
You may use this:
json_val='{"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"}'
code_val=$(jq -r '.code' <<< "$json_val")
echo "$code_val"
lyz1To6ZTWClDHSiaeXyxg
Note the following changes:
Wrap the complete JSON string in single quotes
Use of $(...) for command substitution
Use of <<< (here-string) to avoid creating a pipeline sub-shell
PS: If you're getting json text from a curl command and want to store multiple fields in shell variables then use:
read -r code_val redirect_to < <(curl ... | jq -r '.code + "\t" + .redirect_to')
Where ... is your curl command.
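The same multi-field trick works without curl; for example, feeding the json_val variable from above through a here-string:
read -r code_val redirect_to < <(jq -r '.code + "\t" + .redirect_to' <<< "$json_val")
echo "$code_val"     # lyz1To6ZTWClDHSiaeXyxg
echo "$redirect_to"  # http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg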
If you try to do it manually:
$ echo {"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"} | jq '.code'
...it throws a parse error (Invalid numeric literal).
It seems you did not quote the string given to echo. In your case, quoting with single quotes (apostrophes ') will do, the same as you did with the jq filter argument ('.code'):
$ echo '{"code":"lyz1To6ZTWClDHSiaeXyxg","redirect_to":"http://example.com/client-redirect-uri?code=lyz1To6ZTWClDHSiaeXyxg"}' | jq '.code'
"lyz1To6ZTWClDHSiaeXyxg"

Generate text from bash script with literal double quotes

I am trying to automate a string-replace task for my project ...
I need output like this in the file:
insert into libraries values("Schema_name", "table_name", "table_name", "/data/Projects/Ope/ACT/Domain/Code/Files");
and what I am getting in the file is
insert into libraries values(Schema_name, table_name, table_name, /data/Projects/Ope/ACT/Domain/Code/Files);
replace_script.sh
#!/bin/bash
while read line
do
  param1=`echo $line | awk -F ' ' '{print $1}'`
  param2=`echo $line | awk -F ' ' '{print $2}'`
  echo "insert into libraries values(\"$param1\",\"$param2\",\"$param2\",\"/data/Projects/Ope/ACT/Domain/Code/Files\");" >> input_queries.hql
done <<EOF
schema_name table_name
schema_name table_name
EOF
The exact code given in your question emits as output:
insert into libraries values("schema_name","table_name","table_name","/data/Projects/Ope/ACT/Domain/Code/Files");
insert into libraries values("schema_name","table_name","table_name","/data/Projects/Ope/ACT/Domain/Code/Files");
This is, as I understand it, exactly what you claim to want.
However, SQL doesn't use double quotes for data -- it uses single quotes for that.
escape_sql() {
  local val
  val=${1//\\/\\\\}
  val=${val//\'/\\\'}
  val=${val//\"/\\\"}
  printf '%s' "$val"
}
while read -r param1 param2; do
  printf $'insert into libraries values(\'%s\', \'%s\', \'%s\', \'/data/Projects/Ope/ACT/Domain/Code/Files\');\\n' \
    "$(escape_sql "$param1")" \
    "$(escape_sql "$param2")" \
    "$(escape_sql "$param2")"
done <<EOF
schema_name table_name
schema_name table_name
EOF
The above makes a rudimentary attempt to prevent malicious values from escaping their quotes -- though you should really use a language with native SQL bindings for your database for the purpose!
That said -- this is not safe escaping against malicious data (for instance, data containing literal quotes). For that, use a language built-to-purpose.
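For reference, with the two sample here-document lines, the loop above should print:
insert into libraries values('schema_name', 'table_name', 'table_name', '/data/Projects/Ope/ACT/Domain/Code/Files');
insert into libraries values('schema_name', 'table_name', 'table_name', '/data/Projects/Ope/ACT/Domain/Code/Files');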

sort fasta by sequence size

I currently want to sort a huge fasta file (more than 10**8 lines and sequences) by sequence size. fasta is a clearly defined format in biology used to store sequences (nucleotide or protein):
>id1
sequence 1 # could span several lines
>id2
sequence 2
...
I have run a tool that gives me, in TSV format:
the identifier, the length, and the position in bytes of the identifier.
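For example, one line of that TSV might look like this (hypothetical values):
id1	11	5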
For now, what I am doing is sorting this file by the length column, then parsing it, using seek to retrieve each corresponding sequence, and appending it to a new file.
# this function will get the sequence using seek
def get_seq(file, bites):
    with open(file) as f_:
        f_.seek(bites, 0)  # go to the position of interest
        line = f_.readline().strip()  # this line is the beginning of the sequence
        to_return = ""  # init the string which will contain the sequence
        # read until we encounter another identifier or the end of the file
        while line and not line.startswith('>'):
            to_return += line
            line = f_.readline().strip()
        return to_return
# simply append the id and the sequence to a file
def write_seq(out_file, id_, sequence):
    with open(out_file, 'a') as out:
        out.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop: parse the index file and call the functions defined above
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)
My problem is the following: this is really slow (it takes several days). Is that normal? Is there another way to do it? I am not a computer scientist by training, so I may be missing some point, but I believed that indexing the file and using seek was the fastest way to achieve this. Am I wrong?
It seems like opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:
import subprocess

subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref, open(args.out, "wb") as out:
    for line in ref:
        id_ = line.split()[0]
        # fetch one record by name and append it to the output file
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)
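The same idea can also be expressed entirely in the shell, since samtools faidx writes a .fai index whose second column is the sequence length; a hedged sketch, assuming samtools is installed and input.fasta is the file to sort (shortest sequences first):
samtools faidx input.fasta                 # writes input.fasta.fai (name, length, offset, ...)
sort -k2,2n input.fasta.fai | cut -f1 |
  xargs samtools faidx input.fasta > sorted.fasta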
What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use an index file, but nevertheless may run faster.
csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'
for file in tmp_fasta_file_*
do
  TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
  FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
cat "$filename" >> $2
done
rm tmp_fasta_file*
The first positional argument is the filepath to your fasta file, the second one is the filepath for the output, i.e. ./script.sh input.fasta output.fasta
Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r option to fastq-sort.
Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.
You can also do it very conveniently with awk; check the code below:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"
This and other methods have been posted on Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend that community for questions like this one.

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all appearances of a specific HEX value "355A3C2F74696D653E" AND collect the following 28 characters.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?
As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
#!/usr/bin/python
import mmap

fd = open("file_to_search", "rb")
needle = b"\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"
haystack = mmap.mmap(fd.fileno(), length=0, access=mmap.ACCESS_READ)

i = haystack.find(needle)
while i >= 0:
    i += len(needle)
    print(haystack[i : i + 28])
    i = haystack.find(needle, i)
If your grep supports the -P option, then you could simply use the command below.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 chars,
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
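To search the raw 2GB file and decode the result in one go, one possibility is to hexdump the whole file first; note that tr joins everything into a single line, so grep's memory use will be high (a sketch, untested at that size; xxd -p emits lowercase hex, hence the lowercase pattern):
xxd -p InputFile | tr -d '\n' |
  grep -oP '355a3c2f74696d653e\K.{56}' |
  xxd -r -p
# 2011-04-26T14:47:02-14:47:15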
Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z\<\/time\>/ {
  match($0, /5Z\<\/time\>/); print substr($0, RSTART + 9, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression and the awk logic will need to be tweaked to handle multiple occurrences, as you may need them. Perhaps some of the folks more up on awk can chime in with improvements, as I am really rusty. An approach to consider anyway; a hedged sketch follows.
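For instance, one way to report every occurrence on a line is to loop with match(), advancing past each hit (a sketch, untested against the original raw file):
awk '{
  s = $0
  while (match(s, /5Z<\/time>/)) {
    print substr(s, RSTART + RLENGTH, 28)
    s = substr(s, RSTART + RLENGTH)
  }
}' data.dat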
