join 2 csv files into third file with "|" as separator - linux

I want to join two files
a.csv
customer|BillTo
100|3437146
103|3436977
b.csv
Customer|Parent
100|ANHEUSER-BUSCH INBEV
1025|INTRASTATE DISTRIBUTORS INC.
The joined file should be like this
Parent|BillTo
ANHEUSER-BUSCH INBEV|3437146
I tried to use awk but can't seem to get the result. Any help would be appreciated.

$ join -j1 --header -o 2.2,1.2 -t'|' \
<(head -n 1 a.csv; tail -n +2 a.csv | sort) \
<(head -n 1 b.csv; tail -n +2 b.csv | sort)
Parent|BillTo
ANHEUSER-BUSCH INBEV|3437146
This assumes GNU join, which seems a safe bet since you have this tagged linux, and a shell like bash, zsh, ksh93, etc. that supports <(command) style process substitution (/bin/sh usually doesn't).
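Since you mentioned awk: here is a minimal awk sketch that produces the same result without needing sorted input, assuming the customer IDs in a.csv are unique (file and column names are taken from your example):
awk -F'|' -v OFS='|' '
    NR == FNR { if (FNR > 1) bill[$1] = $2; next }   # a.csv: remember BillTo per customer
    FNR == 1  { print "Parent", "BillTo"; next }     # replace the b.csv header with the new one
    $1 in bill { print $2, bill[$1] }                # b.csv: print Parent|BillTo for matching customers
' a.csv b.csv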

Related

Bash script to select filename with latest date and without substring

If a directory /dat contains files with filenames such as
base-2020-01-01.dat
base-2020-01-01-incremental-2020-01-02.dat
base-2020-01-01-incremental-2020-01-03.dat
base-2020-01-03.dat
base-2020-01-03-incremental-2020-01-04.dat
base-2020-01-03-incremental-2020-01-05.dat
how can we write a bash script that selects
- the base-*.dat filename without the -incremental-* substring, whose date in the filename is the most recent one, and
- the date string in this filename?
In this example, we want to select base-2020-01-03.dat and 2020-01-03.
If a different file naming convention makes it easier to parse, that will be even better!
With modern GNU tools: Using find, grep -P and bash (with error handling)
read file < <(
find /data -maxdepth 1 -name 'base*.dat' ! -name '*incremental*' \
-printf '%f\n' | sort -nr | head -n1
)
set -e
echo "${file:?$(tput setaf 1)no match$(tput sgr0;exit 1)}"
date=$(grep -oP "\d{4}-\d{2}-\d{2}" <<< "$file")
echo "$date"
Output
base-2020-01-03.dat
2020-01-03
Using pure bash extended globs and parameter expansion, with no external programs:
#!/usr/bin/env bash
shopt -s extglob
declare -a files=(base-!(*incremental*).dat)
echo "Non-incremental files: ${files[*]}"
justdate="${files[-1]#base-}"
justdate="${justdate%.dat}"
echo "Most recent file: ${files[-1]} from $justdate"
Usage:
$ ls
base-2020-06-29.dat base-2020-06-30-incremental-2020-07-01.dat base-2020-07-01.dat demo.sh
$ bash demo.sh
Non-incremental files: base-2020-06-29.dat base-2020-07-01.dat
Most recent file: base-2020-07-01.dat from 2020-07-01
Here is another possible solution:
date=$(find /dat -name 'base-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].dat' -printf '%f\n' | sort -r | grep -o '[0-9][^.]*' -m1)
[ -n "$date" ] && echo "$date base-$date.dat"
find outputs the list of all the files named base-YYYY-MM-DD.dat
sort -r sorts the list with the most recent date first
grep -o extracts the date part and with -m1 stops at the first line
Using GNU tools with a NUL delimiter for the sake of paranoia (so a filename with literal newlines can't introduce garbage into your logic -- or at least, won't do anything untoward unless it sorts last):
printf '%s\0' /data/base-*.dat |
sort --zero-terminated |
grep --null-data -v incremental |
tail --zero-terminated -n 1
If you aren't on a GNU platform (you don't have the --zero-terminated or --null-data options), change the \0 to \n in the printf format string, and remove the aforementioned options.
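If you also want the date (as the question asks), one way to finish the job is to capture the surviving record and strip it with parameter expansion. This is just a sketch under the same assumptions as above (GNU tools, files under /data named base-YYYY-MM-DD.dat):
file=$(printf '%s\0' /data/base-*.dat |
       sort --zero-terminated |
       grep --null-data -v incremental |
       tail --zero-terminated -n 1 |
       tr -d '\0')            # drop the trailing NUL so the shell can store the result
file=${file##*/}             # strip the leading directory
date=${file#base-}           # strip the "base-" prefix ...
date=${date%.dat}            # ... and the ".dat" suffix
echo "$file $date"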

cat command merging two *.txt with missing columns mac osx

I'd like to use the cat command to join several *.txt files under Mac OS X.
my first file1.txt looks like:
a;b;c;d
1;2;3;4
second file2.txt:
a;b
5;6
7;8
what I want:
a;b;c;d
1;2;3;4
5;6;;
7;8;;
My question: can I skip the header from the second file in the output file? And how does cat deal with the missing columns? Does it write NaNs?
maybe this command could do it?
head -1 file1.txt > all.txt;
tail -n +2 -q file*.txt >> all.txt
I don't think the cat command alone will deal with removing the headers or mark any missing columns, since all it does is concatenate files. But if you know the highest possible number of columns, you can do something like this:
cat file1.txt <( tail -n+2 file2.txt ) | gawk -F';' -v OFS=';' '{NF=4}1'
Where NF=4 is the highest number of columns (in your example, 4).
The command above is concatenating file1.txt with a header-less version of file2.txt, using the output of a subcommand as input (the <( ) operator). You can use <( ) as many times as you want, once for each file you want to concatenate. The final command, gawk, was adapted from this answer and it's padding out the column delimiters for you.
(note: use brew install gawk if gawk isn't found; Mac OS X's awk won't work)
If not having the first header doesn't bother you and you don't want to use cat, you could do:
gawk -F';' -v OFS=';' '{NF=4}1' file*.txt | egrep -v '^a;b'
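If you would rather skip the process substitution entirely, the same padding idea works as a single gawk invocation that keeps the header of the first file and drops the headers of the remaining ones (a sketch, still assuming at most 4 columns):
gawk -F';' -v OFS=';' 'FNR == 1 && NR > 1 { next } { NF = 4; print }' file1.txt file2.txt
FNR == 1 && NR > 1 matches the first line of every file except the first, and NF = 4 pads short rows with empty fields, so the output matches the example above.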

Problems with tail -f and awk? [duplicate]

Is it possible to use grep on a continuous stream?
What I mean is sort of a tail -f <file> command, but with grep on the output in order to keep only the lines that interest me.
I've tried tail -f <file> | grep pattern but it seems that grep can only be executed once tail finishes, that is to say never.
Turn on grep's line buffering mode when using BSD grep (FreeBSD, Mac OS X etc.)
tail -f file | grep --line-buffered my_pattern
It looks like a while ago --line-buffered didn't matter for GNU grep (used on pretty much any Linux), as it flushed by default (YMMV for other Unix-likes such as SmartOS, AIX or QNX). However, as of November 2020, --line-buffered is needed (at least with GNU grep 3.5 in openSUSE, but it seems to be generally needed).
I use the tail -f <file> | grep <pattern> all the time.
It will wait till grep flushes, not till it finishes (I'm using Ubuntu).
I think that your problem is that grep uses some output buffering. Try
tail -f file | stdbuf -o0 grep my_pattern
It will set grep's output buffering mode to unbuffered.
If you want to find matches in the entire file (not just the tail), and you want it to sit and wait for any new matches, this works nicely:
tail -c +0 -f <file> | grep --line-buffered <pattern>
The -c +0 flag says that the output should start 0 bytes (-c) from the beginning (+) of the file.
In most cases, you can tail -f /var/log/some.log |grep foo and it will work just fine.
If you need to use multiple greps on a running log file and you find that you get no output, you may need to stick the --line-buffered switch into your middle grep(s), like so:
tail -f /var/log/some.log | grep --line-buffered foo | grep bar
You may consider this answer as an enhancement. Usually I am using
tail -F <fileName> | grep --line-buffered <pattern> -A 3 -B 5
-F is better in case of file rotation (-f will not work properly if the file is rotated)
-A and -B are useful to get the lines just before and after the pattern occurrence; these blocks will appear between dashed-line separators
But as for me, I prefer doing the following:
tail -F <file> | less
This is very useful if you want to search inside streamed logs, i.e. go back and forth and look deeply.
Didn't see anyone offer my usual go-to for this:
less +F <file>
ctrl + c
/<search term>
<enter>
shift + f
I prefer this, because you can use ctrl + c to stop and navigate through the file whenever, and then just hit shift + f to return to the live, streaming search.
sed would be a better choice (stream editor)
tail -n0 -f <file> | sed -n '/search string/p'
and then if you wanted the tail command to exit once you found a particular string:
tail --pid=$(($BASHPID+1)) -n0 -f <file> | sed -n '/search string/{p; q}'
Obviously a bashism: $BASHPID will be the process id of the tail command. The sed command is next after tail in the pipe, so the sed process id will be $BASHPID+1.
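The same trick works with grep if you prefer its syntax, since grep -m1 also exits after the first match (with the same caveat about the +1 pid assumption):
tail --pid=$(($BASHPID+1)) -n0 -f <file> | grep -m1 'search string'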
Yes, this will actually work just fine. Grep and most Unix commands operate on streams one line at a time. Each line that comes out of tail will be analyzed and passed on if it matches.
This one command works for me (SUSE):
mail-srv:/var/log # tail -f /var/log/mail.info |grep --line-buffered LOGIN >> logins_to_mail
collecting logins to the mail service
Coming somewhat late to this question, and considering this kind of work an important part of the monitoring job, here is my (not so short) answer...
Following logs using bash
1. Command tail
This command is a little more powerful than the already published answers suggest.
The difference between the follow options tail -f and tail -F, from the man page:
-f, --follow[={name|descriptor}]
output appended data as the file grows;
...
-F same as --follow=name --retry
...
--retry
keep trying to open a file if it is inaccessible
This means: by using -F instead of -f, tail will re-open the file(s) when they are removed (on log rotation, for example).
This is useful for watching log files over many days.
Ability to follow more than one file simultaneously
I've already used:
tail -F /var/www/clients/client*/web*/log/{error,access}.log /var/log/{mail,auth}.log \
/var/log/apache2/{,ssl_,other_vhosts_}access.log \
/var/log/pure-ftpd/transfer.log
For following events through hundreds of files... (consider the rest of this answer to understand how to make it readable... ;)
Using the -n switch (don't use -c for line buffering!). By default tail will show the last 10 lines. This can be tuned:
tail -n 0 -F file
Will follow the file, but only new lines will be printed.
tail -n +0 -F file
Will print the whole file before following its progression.
2. Buffer issues when piping:
If you plan to filter the output, consider buffering! See the -u option for sed, --line-buffered for grep, or the stdbuf command:
tail -F /some/files | sed -une '/Regular Expression/p'
Is (besides being a lot more efficient than using grep) a lot more reactive than if you don't use the -u switch in the sed command.
tail -F /some/files |
sed -une '/Regular Expression/p' |
stdbuf -i0 -o0 tee /some/resultfile
3. Recent journaling system
On recent systems, instead of tail -f /var/log/syslog you have to run journalctl -xf, in nearly the same way...
journalctl -axf | sed -une '/Regular Expression/p'
But read the man page: this tool was built for log analysis!
4. Integrating this in a bash script
Colored output of two files (or more)
Here is a sample script watching many files, coloring the output of the 1st file differently from the others:
#!/bin/bash
tail -F "$#" |
sed -une "
/^==> /{h;};
//!{
G;
s/^\\(.*\\)\\n==>.*${1//\//\\\/}.*<==/\\o33[47m\\1\\o33[0m/;
s/^\\(.*\\)\\n==> .* <==/\\o33[47;31m\\1\\o33[0m/;
p;}"
This works fine on my host, running:
sudo ./myColoredTail /var/log/{kern.,sys}log
Interactive script
You may be watching logs in order to react to events?
Here is a little script that plays a sound when a USB device appears or disappears, but the same script could send mail or do any other interaction, like powering on the coffee machine...
#!/bin/bash
exec {tailF}< <(tail -F /var/log/kern.log)
tailPid=$!
while :;do
read -rsn 1 -t .3 keyboard
[ "${keyboard,}" = "q" ] && break
if read -ru $tailF -t 0 _ ;then
read -ru $tailF line
case $line in
*New\ USB\ device\ found* ) play /some/sound.ogg ;;
*USB\ disconnect* ) play /some/othersound.ogg ;;
esac
printf "\r%s\e[K" "$line"
fi
done
echo
exec {tailF}<&-
kill $tailPid
You can quit by pressing the Q key.
you certainly won't succeed with
tail -f /var/log/foo.log |grep --line-buffered string2search
when you use "colortail" as an alias for tail, eg. in bash
alias tail='colortail -n 30'
you can check by
type tail
if this outputs something like
tail is aliased to `colortail -n 30'
then you have your culprit :)
Solution:
remove the alias with
unalias tail
ensure that you're using the 'real' tail binary by this command
type tail
which should output something like:
tail is /usr/bin/tail
and then you can run your command
tail -f foo.log |grep --line-buffered something
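If you only need to bypass the alias for a single invocation rather than remove it, prefixing the command also works (both forms skip alias expansion in bash):
command tail -f foo.log | grep --line-buffered something
\tail -f foo.log | grep --line-buffered something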
Good luck.
Use awk (another great utility) instead of grep where you don't have the --line-buffered option! It will continuously stream your data from tail.
This is how you would use grep:
tail -f <file> | grep pattern
This is how you would use awk
tail -f <file> | awk '/pattern/{print $0}'
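One caveat: depending on the awk implementation, output to a pipe may be block-buffered rather than line-buffered (mawk is a common example), so if matches don't show up promptly, flush explicitly:
tail -f <file> | awk '/pattern/ { print; fflush() }'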

how to compare output of two ls in linux

So here is the task, which I can't solve. I have a directory with .h files and a directory with .i files, which have the same names as the .h files. I want, just by typing a command, to get all .h files which are not found as .i files. It's not a hard problem; I could do it in some programming language, but I'm just curious what it would look like on the command line :). To be more specific, here is the algo:
1. get file names without extensions from ls *.h
2. get file names without extensions from ls *.i
3. compare them
4. print all names from 1 that are not found in 2
Good luck!
diff \
<(ls dir.with.h | sed 's/\.h$//') \
<(ls dir.with.i | sed 's/\.i$//') \
| grep '^<' \
| cut -c3-
diff <(ls dir.with.h | sed 's/\.h$//') <(ls dir.with.i | sed 's/\.i$//') executes ls on the two directories, cuts off the extensions, and compares the two lists. Then grep '^<' finds the files that are only in the first listing, and cut -c3- cuts off the "< " characters that diff inserted.
ls ./dir_h/*.h | sed -r -n 's:.*dir_h/([^.]*).h$:dir_i/\1.i:p' | xargs ls 2>&1 | \
grep "No such file or directory" | awk '{print $4}' | sed -n -r 's:dir_i/([^:]*).*:dir_h/\1:p'
ls -1 dir1/*.hh dir2/*.ii | awk -F"/" '{print $NF}' |awk -F"." '{a[$1]++;b[$0]}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
explanation:
ls -1 dir1/*.hh dir2/*.ii
above will list all the files *.hh and *.ii files in both the directories.
awk -F"/" '{print $NF}'
above will just print the file name excluding the complete path of the file.
awk -F"." '{a[$1]++;b[$0]}END{for(i in a)if(a[i]==1 && b[i".hh"]) print i}'
the above creates two associative arrays: a, keyed by the file name without its extension, and b, keyed by the full file name.
If both the .hh and .ii files exist, the value in array a will be 2; if there is only one file, the value will be 1. So we need the array items whose value is 1 and which correspond to a header file (.hh).
This check against array b is done in the END block.
Assuming bash is your shell:
for file in $( ls dir_with_h/*.h ); do
name=${file%\.h}; # trim trailing ".h" file extension
name=${name#dir_with_h/}; # trim leading folder name
if [ ! -e dir_with_i/${name}.i ]; then
echo ${name};
fi
done
Undoubtedly this can be ported to virtually all other shells. I find it less cryptic than some other approaches (although this is surely my problem), but it is a little wordy. As such, a shell script might help recall it.
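For what it's worth, comm gives a compact variant of the diff approach above; this sketch assumes the filenames contain no newlines and reuses the directory names from the loop example. comm -23 prints only the lines unique to its first, sorted input:
comm -23 <(ls dir_with_h | sed 's/\.h$//' | sort) \
         <(ls dir_with_i | sed 's/\.i$//' | sort)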

diff files comparing only first n characters of each line

I have got 2 files. Let us call them md5s1.txt and md5s2.txt. Both contain the output of a
find -type f -print0 | xargs -0 md5sum | sort > md5s.txt
command in different directories. Many files were renamed, but the content stayed the same. Hence, they should have the same md5sum. I want to generate a diff like
diff md5s1.txt md5s2.txt
but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sum should be considered equal. The output should be in normal diff format.
Easy starter:
diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)
Also, consider just
diff -EwburqN folder1/ folder2/
Compare only the md5 column using diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old/new-line-format='%dn'$'\n'. Pipe this into ed md5sums.sort.XXX so it will print only those lines from the md5sums.sort.XXX file.
diff \
--new-line-format='%dn'$'\n' \
--old-line-format='' \
--unchanged-line-format='' \
<(cut -c -32 md5sums.sort.old) \
<(cut -c -32 md5sums.sort.new) \
| ed md5sums.sort.new \
> files-added
diff \
--new-line-format='' \
--old-line-format='%dn'$'\n' \
--unchanged-line-format='' \
<(cut -c -32 md5sums.sort.old) \
<(cut -c -32 md5sums.sort.new) \
| ed md5sums.sort.old \
> files-removed
The problem with ed is that it will load the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which will use much less memory.
diff … | (
lnum=0;
while read lprint; do
while [ $lnum -lt $lprint ]; do read line <&3; ((lnum++)); done;
echo $line;
done
) 3<md5sums.sort.XXX
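An awk variant of the same idea, reading the diff output from stdin ("-") before the checksum file, also keeps only the wanted line numbers in memory (a sketch, with md5sums.sort.XXX standing for .old or .new as above):
diff … | awk 'NR == FNR { want[$1]; next } FNR in want' - md5sums.sort.XXX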
If you are looking for duplicate files, fdupes can do this for you:
$ fdupes --recurse .
On Ubuntu you can install it by doing
$ apt-get install fdupes
