Working with Hive 0.13.0, I would like to evaluate variables against a template and then immediately execute the resulting Hive code (avoiding a temporary intermediate file is preferable).
Here is a (non-working) example of what I'd like to do:
template.hql
SELECT COUNT(*) FROM ${TABLE};
In the shell:
export TABLE=DEFAULT.FOOTABLE
envsubst < template.hql | hive
Is there a particular reason this does not work, and is there a proper way to achieve it?
The substitution works as expected:
$ cat template.hql
SELECT COUNT(*) FROM ${TABLE};
$ export TABLE=DEFAULT.FOOTABLE
$ envsubst < template.hql
SELECT COUNT(*) FROM DEFAULT.FOOTABLE;
So I suspect hive does not read queries from standard input. I see from an online manual that it supports the -f parameter, so you can create the file manually:
TMPFILE=$(mktemp)
envsubst < template.hql > "$TMPFILE"
hive -f "$TMPFILE"
rm "$TMPFILE"
If you're on a newish version of bash, you can avoid an intermediate file:
hive -f <( envsubst < template.hql )
I'm not sure, but also check if hive -f - might read from stdin.
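For completeness, here are two untested one-liner variants; they assume the Hive CLI accepts -e for an inline query string, and that -f - is treated as stdin (which may not hold for 0.13):
export TABLE=DEFAULT.FOOTABLE
hive -e "$(envsubst < template.hql)"    # pass the substituted query inline
envsubst < template.hql | hive -f -     # only if "-f -" really means stdin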
Related
I have a SQL script and a sh executable to run a script doing some operations on my database.
My principal problem is I'm searching how I could do the following thing:
Send an array of parameter from my bash parameters when launching the script, my actual command is:
./myscript.sh databaseName user thirdParameterToPassAsAString
'fourthParameterElement1','fourthParameterElement2','fourthParameterElement3'
the content of my script:
#!/bin/bash
set -e
psql $1 $2 <<EOF
set search_path = search_path;
set firstParameterusedinSQLScript = $3;
set Param_List = $4;
\i my_script.sql
EOF
and the sql part where I have the problem:
where ae.example in (:Param_List)
I have of course some issues with this where clause.
So the question is: how could I do this?
Have you considered changing the SQL itself (not changing the original SQL file that contains it) before executing it, replacing the parameter via sed?
If that is an option for you, you could define a helper function like:
function prepare_script() {
    cat <<EOF
set search_path = search_path;
EOF
    sed -e "s|:Param_List|$3|g" -e "s|firstParameterusedinSQLScript|$2|g" Requetes_retour_arriere_fiab_x_siret.sql
}
You could then call it like:
prepare_script "$1" "$2" "$3" | psql $1 $2
Note that you do not change the file on disk itself; you just read it with sed, output the altered SQL on stdout, and pipe it to psql.
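For illustration, this is roughly what that substitution does to the problematic WHERE clause (the IN-list here is just the one from your example call):
$ echo "where ae.example in (:Param_List)" \
    | sed -e "s|:Param_List|'fourthParameterElement1','fourthParameterElement2'|g"
where ae.example in ('fourthParameterElement1','fourthParameterElement2')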
I am trying to cat a file.txt, loop through its whole content twice, and copy it to a new file, file_new.txt. The bash command I am using is as follows:
for i in {1..3}; do cat file.txt > file_new.txt; done
The above command is just giving me the same file contents as file.txt. Hence file_new.txt is also of the same size (1 GB).
Basically, if file.txt is a 1GB file, then I want file_new.txt to be a 2GB file, double the contents of file.txt. Please, can someone help here? Thank you.
Simply apply the redirection to the for loop as a whole:
for i in {1..3}; do cat file.txt; done > file_new.txt
The advantage of this over using >> (aside from not having to open and close the file multiple times) is that you needn't ensure that a preexisting output file is truncated first.
Note that the generalization of this approach is to use a group command ({ ...; ...; }) to apply redirections to multiple commands; e.g.:
$ { echo hi; echo there; } > out.txt; cat out.txt
hi
there
Given that whole files are being output, the cost of invoking cat for each repetition will probably not matter that much, but here's a robust way to invoke cat only once:[1]
# Create an array of repetitions of filename 'file' as needed.
files=(); for ((i=0; i<3; ++i)); do files[i]='file'; done
# Pass all repetitions *at once* as arguments to `cat`.
cat "${files[#]}" > file_new.txt
[1] Note that, hypothetically, you could run into your platform's command-line length limit, as reported by getconf ARG_MAX - given that on Linux that limit is 2,097,152 bytes (2MB) that's not likely, though.
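You can check the limit on your own system; the value shown below is just what a typical Linux box reports:
$ getconf ARG_MAX
2097152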
You could use the append operator, >>, instead of >. Then adjust your loop count as needed to get the output size desired.
You should adjust your code so it is as follows:
for i in {1..3}; do cat file.txt >> file_new.txt; done
The >> operator appends data to a file rather than writing over it (>)
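A small sketch of the loop-count adjustment mentioned above: truncate the output first so reruns don't keep growing it, then append as many copies as you need (two here, for a 2GB result from a 1GB input):
: > file_new.txt                                        # truncate any previous output
for i in {1..2}; do cat file.txt >> file_new.txt; done  # append two copies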
if file.txt is a 1GB file,
cat file.txt > file_new.txt
cat file.txt >> file_new.txt
The > operator will create file_new.txt (1 GB),
while the >> operator will append to it (giving 2 GB).
for i in {1..3}; do cat file.txt >> file_new.txt; done
This command will make file_new.txt 3 GB, because for i in {1..3} runs three times.
As others have mentioned, you can use >> to append. But, you could also just invoke cat once and have it read the file 3 times. For instance:
n=3; cat $( yes file.txt | sed ${n}q ) > file_new.txt
Note that this solution exhibits a common anti-pattern and fails to properly quote the arguments, which will cause issues if the filename contains whitespace. See mklement's answer for a more robust approach.
I'm trying to search through HDFS for parquet files and list them out. I'm using this, which works great. It looks through all of the subdirectories in /sources.works_dbo and gives me all the parquet files:
hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"
However, I just want to return the first file it encounters per subdirectory, so that each subdirectory appears on only a single line in my output. Say I had this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
When I run my command I expect the output to look like this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
... | awk '!seen[gensub(/[^/]+$/,"",1)]++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
The above uses GNU awk for gensub(), with other awks you'd use a variable and sub():
awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'
It will work for any mixture of path depths.
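For example, feeding it paths of different depths (illustrative input rather than real hdfs output) keeps only the first file per directory:
$ printf '%s\n' a/b/f1.parquet a/b/f2.parquet a/b/c/f3.parquet \
    | awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'
a/b/f1.parquet
a/b/c/f3.parquet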
You can use sort -u (unique) with / as the delimiter, using the first three fields as the key (this assumes a fixed directory depth). The -s option ("stable") makes sure that the file retained is the first one encountered in each subdirectory.
For this input
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
the result is
$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
If the subdirectories are of variable length, this awk solution may come in handy:
hdfs dfs -ls -R /sources/works_dbo | awk '
BEGIN { FS="/"; OFS="/" }
{
  file=$NF;                    # the file name is always the last field
  $NF=""; folder=$0;           # chop off the last field to cache the folder
  if (!(folder in seen_dirs))  # cache the first file per folder
    seen_dirs[folder]=file;
}
END {
  for (f in seen_dirs)         # after all rows are processed, print the cache
    print f seen_dirs[f];      # folder already ends in "/", so concatenate
}'
Using Perl:
hdfs dfs -ls -R /sources/works_dbo | grep '.*\.parquet$' | \
perl -MFile::Basename -nle 'print unless $h{ dirname($_) }++'
In the perl command above:
-M loads File::Basename module;
-n causes Perl to apply the expression passed via -e for each input line;
-l preserves the line terminator;
$_ is the default variable keeping the currently read line;
dirname($_) returns the directory part for the path specified by $_;
$h is a hash where keys are directory names, and values are integers 0, 1, 2 etc;
the line is printed to standard output unless the directory name was seen in a previous iteration, i.e. the hash value $h{ dirname($_) } is non-zero.
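Putting it together on the sample paths (an illustrative transcript, without the extra columns that hdfs dfs -ls prints):
$ printf '%s\n' sources/works_dbo/test1/file1.parquet \
    sources/works_dbo/test1/file2.parquet \
    sources/works_dbo/test2/file3.parquet \
    | perl -MFile::Basename -nle 'print unless $h{ dirname($_) }++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet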
By the way, instead of piping the result of hdfs dfs -ls -R via grep, you can use the find command:
hdfs dfs -find /sources/works_dbo -name '*.parquet'
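Note that -find still lists every matching file, so one of the per-directory deduplication steps above is still needed; a sketch combining it with the portable awk variant:
hdfs dfs -find /sources/works_dbo -name '*.parquet' \
    | awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'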
I have a shell script of more than 1000 lines, and I would like to check whether all the commands used in the script are installed on my Linux operating system.
Is there any tool to get the list of Linux commands used in the shell script?
Or how can I write a small script which can do this for me?
The script runs successfully on an Ubuntu machine, where it is invoked as part of a C++ application. We need to run the same script on a device that runs a Linux with limited capability. I have manually identified a few commands which the script runs that are not present on the device OS. Before we try installing these commands, I would like to check all the other commands and install them all at once.
Thanks in advance
I already tried this in the past and came to the conclusion that it is very difficult to provide a solution which would work for all scripts. The reason is that each script with complex commands has a different approach to using the shell's features.
In case of a simple linear script, it might be as easy as using debug mode.
For example: bash -x script.sh 2>&1 | grep ^+ | awk '{print $2}' | sort -u
In case the script has some decisions, you might use the same approach and consider that for the "else" cases the commands would still be the same, just with different arguments, or would be something trivial (echo + exit).
In case of a complex script, I attempted to write a script that would just look for commands in the same places I would look myself. The challenge is to create expressions that would help identify all the used possibilities; I would say this is doable for about 80-90% of the script, and the output should only be used as a reference since it will contain invalid data (~20%).
Here is an example script that would parse itself using a very simple approach (separate commands on different lines, 1st word will be the command):
# 1. Eliminate all quoted text
# 2. Eliminate all comments
# 3. Replace all delimiters between commands with new lines ( ; | && || )
# 4. extract the command from 1st column and print it once
cat $0 \
| sed -e 's/\"/./g' -e "s/'[^']*'//g" -e 's/"[^"]*"//g' \
| sed -e "s/^[[:space:]]*#.*$//" -e "s/\([^\\]\)#[^\"']*$/\1/" \
| sed -e "s/&&/;/g" -e "s/||/;/g" | tr ";|" "\n\n" \
| awk '{print $1}' | sort -u
the output is:
.
/
/g.
awk
cat
sed
sort
tr
There are many more cases to consider (command substitutions, aliases etc.), 1, 2 and 3 are just beginning, but they would still cover 80% of most complex scripts.
The regular expressions used would need to be adjusted or extended to increase precision and cover special cases.
In conclusion if you really need something like this, then you can write a script as above, but don't trust the output until you verify it yourself.
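As a follow-up to the original question (are the commands installed?), the extracted names can be checked against command -v. This assumes the parser above is saved as an executable script, here called list_commands.sh (a hypothetical name), and adapted to read "$1" instead of $0:
# junk tokens from the parser (".", "/", etc.) will simply show up as "missing" and can be ignored
./list_commands.sh your_script.sh | while read -r cmd; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done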
Add export PATH='' to the second line of your script.
Execute your_script.sh 2>&1 > /dev/null | grep 'No such file or directory' | awk '{print $4;}' | grep -v '/' | sort | uniq | sed 's/.$//'.
If you have a Fedora/Red Hat based system, bash has been patched with the --rpm-requires flag:
--rpm-requires: Produce the list of files that are required for the shell script to run. This implies -n and is subject to the same limitations as compile-time error checking; command substitutions, conditional expressions and the eval builtin are not parsed, so some dependencies may be missed.
So when you run the following:
$ bash --rpm-requires script.sh
executable(command1)
function(function1)
function(function2)
executable(command2)
function(function3)
There are some limitations here:
command and process substitutions and conditional expressions are not picked up. So the following are ignored:
$(command)
<(command)
>(command)
command1 && command2 || command3
commands as strings are not picked up. So the following line will be ignored
"/path/to/my/command"
commands that contain shell variables are not listed. This generally makes sense since
some might be the result of some script logic, but even the following is ignored
$HOME/bin/command
This point can however be bypassed by using envsubst and running it as
$ bash --rpm-requires <(<script envsubst)
However, if you use shellcheck, you most likely quoted this, and it will then still be ignored because of the second limitation (commands as strings).
So if you want to check whether the executables your script needs are all there, you can do something like:
while IFS='' read -r app; do
    [ "${app%%(*}" == "executable" ] || continue
    app="${app#*(}"; app="${app%)}";
    if [ "$(type -t "${app}")" != "builtin" ] && \
       ! [ -x "$(command -v "${app}")" ]
    then
        echo "${app}: missing application"
    fi
done < <(bash --rpm-requires <(<"$0" envsubst) )
If your script contains files that are sourced that might contain various functions and other important definitions, you might want to do something like
bash --rpm-requires <(cat source1 source2 ... <(<script.sh envsubst))
Based on @czvtools’ answer, I added some extra checks to filter out bad values:
#!/usr/bin/fish
if test "$argv[1]" = ""
    echo "Give path to command to be tested"
    exit 1
end
set commands (cat $argv \
    | sed -e 's/\"/./g' -e "s/'[^']*'//g" -e 's/"[^"]*"//g' \
    | sed -e "s/^[[:space:]]*#.*\$//" -e "s/\([^\\]\)#[^\"']*\$/\1/" \
    | sed -e "s/&&/;/g" -e "s/||/;/g" | tr ";|" "\n\n" \
    | awk '{print $1}' | sort -u)
for command in $commands
    if command -q -- $command
        set -a resolved (realpath (which $command))
    end
end
set resolved (string join0 $resolved | sort -z -u | string split0)
for command in $resolved
    echo $command
end
I am trying to learn more about shell scripts and have pieced this together; I would appreciate it if someone would educate/critique the use of awk in particular. Because it is run against each value returned by the executed query, is this a viable option? Is there a better method, i.e. a for or while loop?
timestamp=$(date "+%Y-%m-%d %H:%M:%S")
user_dir="server_dir_name"
backup_path="/home1/$user_dir/public_html"
local_bckup_store="/home1/$user_dir/backups"
db_prefix="database_prefix"
db_user="database_username"
db_pwd="database_user_password" # used here for simplicity, not security
tar -zcvf "$local_bckup_store/$user_dir-public_html-$timestamp.tar.gz" $backup_path
mysql -NBr -u$db_user -p$db_pwd -Dinformation_schema -e "SELECT schema_name FROM schemata WHERE schema_name LIKE '/$db_prefix_%/"';"
| awk '{ system("mysqldump -u$db_user -p$db_pwd $1 > $local_bckup_store/$1.$timestamp.sql | gzip" $1) }'
I am aware of the security issue of storing $db_pwd in the script, even though this will be stored and run locally on the server, but I would appreciate some input on best practices.
Thanks.
You're really just using awk to obtain the first parameter returned from your mysql statement.
I would recommend saving that param in a variable and then doing what you want with it.
UPDATED example (I think this is what you want):
while read schema_name ; do
    backup_file=${local_bckup_store}/${schema_name}.${timestamp}.sql
    mysqldump -u${db_user} -p${db_pwd} ${schema_name} > ${backup_file}
    gzip ${backup_file}
done < <(mysql -NBr -u${db_user} -p${db_pwd} -Dinformation_schema -e "SELECT schema_name FROM schemata WHERE schema_name LIKE 'db_prefix_%';" | awk '{ print $1 }')
Note the process substitution (< <(...)): it runs the command and feeds its STDOUT into the STDIN of the 'while read' loop. Also, I've put the shell variable names in curly braces; this is a good practice that helps keep them from being incorrectly interpolated.
This way, you avoid awk having to exec another shell just to run the mysqldump command and your code is [slightly] easier to maintain. As an added bonus, you can add error checking to be sure the mysql command succeeded before calling mysqldump.
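A sketch of that error-checking idea: capture the schema list first, so the script can abort before any dump is attempted if the query itself fails:
# abort early if the schema query fails
schemas=$(mysql -NBr -u${db_user} -p${db_pwd} -Dinformation_schema \
    -e "SELECT schema_name FROM schemata WHERE schema_name LIKE 'db_prefix_%';") || exit 1
while read -r schema_name; do
    # ... mysqldump/gzip as above ...
done <<< "$schemas"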
Additionally, I would recommend using 'cut' instead of 'awk' for this; it's a bit more efficient for what you're trying to do.
FYI, you could also send the STDOUT of your mysqldump directly into the STDIN of gzip like such:
mysqldump -u${db_user} -p${db_pwd} ${schema_name} | gzip > ${backup_file}.gz
This awk command is problematic:
awk '{ system("mysqldump -u$db_user -p$db_pwd $1 > $local_bckup_store/$1.$timestamp.sql | gzip" $1) }'
awk cannot use shell variables such as $db_user like that inside single quotes. To pass shell variables to awk, use this syntax:
awk -v db_user="$db_user" -v db_pwd="$db_pwd" -v ts="$timestamp" -v dir="$local_bckup_store" '{
    system("mysqldump -u" db_user " -p" db_pwd " " $1 " > " dir "/" $1 "." ts ".sql && gzip " dir "/" $1 "." ts ".sql") }'
PS: Your mysqldump | awk pipeline is still untested here.
Another option is to do it directly in shell as:
schema=$(mysql -NBr -u$db_user -p$db_pwd -Dinformation_schema -e "SELECT schema_name FROM schemata WHERE schema_name LIKE 'db_prefix_%';")
mysqldump -u$db_user -p$db_pwd "$schema" > "$local_bckup_store/$schema.$timestamp.sql"
It's a good idea, but you really aren't doing enough text parsing to warrant awk; you can do it all in bash like this.
query = "SELECT schema_name FROM schemata WHERE schema_name LIKE 'db_prefix_%';"
mysql -NBr -u$db_user -p$db_pwd -Dinformation_schema -e $query | while read line
do
arr=($line)
mysqldump -u$db_user -p$db_pwd ${arr[0]} > $local_bckup_store/${arr[0]}.$timestamp.sql && gzip $local_bckup_store/${arr[0]}.$timestamp.sql
done
bash will create an array by splitting the string on whitespace for you (this is the point of arr=($line)).
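A quick illustration of that splitting behavior (the values here are just for the demo):
line="db_one   some_other_column"
arr=($line)        # word-splits on whitespace
echo "${arr[0]}"   # prints: db_one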