bash/awk/unix detect changes in lines of csv files - linux

I have a timestamp in this format:
(normal_file.csv)
timestamp
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
The dates are usually uniform; however, there are files with an irregular date pattern, such as this example:
(abnormal_file.csv)
timestamp
19/02/2002
19/02/2003
19/02/2005
19/02/2006
In my directory, there are hundreds of files, a mix of normal and abnormal ones like the two above.
I want to write a bash or awk script that detects the date pattern in all files of a directory. Abnormal files should be moved automatically to a new, separate directory (let's say dir_different/).
Currently, I have tried the following:
#!/bin/bash
mkdir dir_different
for FILE in *.csv;
do
# pipe 1: detect the changes in the line
# pipe 2: print the timestamp column (first column, columns are comma-separated)
awk '$1 != prev {print ; prev = $1}' < $FILE | awk -F , '{print $1}'
done
If the timestamps in a given file are uniform, then only a single timestamp is printed; for abnormal files, multiple dates are printed.
I am not sure how to separate the abnormal files from the normal files, and I have tried the following:
do
output=$(awk 'FNR==3{print $0}' $FILE)
echo ${output}
if [[ ${output} =~ ([[:space:]]) ]]
then
mv $FILE dir_different/
fi
done
Or is there an easier method to detect changes in lines and separate files that have different lines? Thank you for any suggestions :)

Assuming that none of your "normal" CSV files have trailing blank lines, this should do the separation just fine:
#!/bin/bash
mkdir -p dir_different
for FILE in *.csv;
do
if awk '{a[$1]++}END{if(length(a)<=2){exit 1}}' "$FILE" ; then
echo mv "$FILE" dir_different
fi
done
After a dry-run just get rid of the echo :)
Edit:
{a[$1]++} This bit creates an array a that gets the first field of each line as an index, and that gets incremented every time the same value is seen.
END{if(length(a)<=2){exit 1}} This checks how many elements are in the array. If there are fewer than 3 (which should be the case if there's always the same date and we only get 1 header and 1 date), exit the processing with status 1.
"$FILE" is part of the bash script, not awk, and I quoted your variable out of habit, should you ever have files w/ spaces in their names you'll see why :)

So, a "normal" file contains only two different lines:
timestamp
dd/mm/yyyy
Testing if a file is normal is thus as simple as:
[ $(sort -u file.csv | wc -l) -eq 2 ]
This leads to the following possible solution:
#!/usr/bin/env bash
mkdir -p dir_different
for FILE in *.csv;
do
if [ $(sort -u "$FILE" | wc -l) -ne 2 ] ; then
echo mv "$FILE" dir_different
fi
done
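Applied to the sample files from the question, the test behaves as expected (the exact whitespace in the wc output may differ between platforms):
$ sort -u normal_file.csv | wc -l
2
$ sort -u abnormal_file.csv | wc -l
5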

Related

Space characters in arguments are not handled right in Bash script

I have the following Bash script,
#!/bin/bash
if [ $# -eq 0 ]
then
echo "Error: No arguments supplied. Please provide two files."
exit 100
elif [ $# -lt 2 ]
then
echo "Error: You provided only one file. Please provide exactly two files."
exit 101
elif [ $# -gt 2 ]
then
echo "Error: You provided more than two files. Please provide exactly two files."
exit 102
else
file1="$1"
file2="$2"
fi
if [ $(wc -l "$file1" | awk -F' ' '{print $1}') -ne $(wc -l "$file2" | awk -F' ' '{print $1}') ]
then
echo "Error: Files $file1 and $file2 should have had the same number of entries."
exit 200
else
entriesNum=$(wc -l "$file1" | awk -F' ' '{print $1}')
fi
for entry in $(seq $entriesNum)
do
path1=$(head -n$entry "$file1" | tail -n1)
path2=$(head -n$entry "$file2" | tail -n1)
diff "$path1" "$path2"
if [ $? -ne 0 ]
then
echo "Error: $path1 and $path2 do not much."
exit 300
fi
done
echo "All files in both file lists match 100%."
done
which I execute giving two file paths as arguments:
./compare2files.sh /path/to/my\ first\ file\ list.txt /path/to/my\ second\ file\ list.txt
As you can see, the names of the above two files contain spaces, and each file itself contains a list of other file paths, which I want to compare line by line, e.g. the first line of one file with the first line of the other, the second with the second, and so on.
The paths listed in the above two files contain spaces too, but I have escaped them using backslashes. For example, the file /Volumes/WD/backup photos temp/myPhoto.jpg is written as /Volumes/WD/backup\ photos\ temp/myPhoto.jpg.
The problem is that the script fails at the diff command:
diff: /Volumes/WD/backup\ photos\ temp/myPhoto.jpg: No such file or directory
diff: /Volumes/WD/backup\ photos\ 2022/IMG_20220326_1000.jpg: No such file or directory
Error: /Volumes/WD/backup\ photos\ temp/myPhoto.jpg and /Volumes/WD/backup\ photos\ 2022/IMG_20220326_1000.jpg do not match.
When I modify the diff invocation to diff $path1 $path2 (without double quotes), I get another kind of error:
diff: extra operand `temp'
diff: Try `diff --help' for more information
Error: /Volumes/WD/backup\ photos\ temp/myPhoto.jpg and /Volumes/WD/backup\ photos\ 2022/IMG_20220326_1000.jpg do not match.
Apparently the files exist and the paths are valid, but the spaces in the path names are not handled right. What is wrong with my code and how can it be fixed (apart from renaming directories and files)?
The title is incorrect: Spaces in your script's arguments are handled correctly. The backslash/space sequences in your input files (as returned in the stdout from head -n 1), by contrast, are not processed as you expect.
Your input files should not contain literal backslashes. Backslashes are only meaningful when they're part of your script itself, parsed as syntax; not when they're part of your data. (That is to say, the string hello\ world in your script or in the command-line arguments given to the shell that calls your script becomes a single string hello world in memory; the backslash guides its parsing, but is not part of the value itself).
Command substitution results do not go through this parsing phase, so the output from head is stored in path1 and path2 exactly as it is (other than removal of the final trailing newline), without backslashes being removed.
If you must process input that contains quote and escape characters, xargs or the Python shlex module can be used to split that input into an array, as demonstrated in Reading quoted/escaped arguments correctly from a string.
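If instead the list files are stored without backslash escaping (plain paths, one per line, spaces kept as-is), the comparison loop can simply read both lists in lockstep and quote the variables when calling diff. A minimal sketch under that assumption, reusing file1 and file2 from the script above:
while IFS= read -r path1 <&3 && IFS= read -r path2 <&4; do
if ! diff -- "$path1" "$path2"; then
echo "Error: $path1 and $path2 do not match."
exit 300
fi
done 3<"$file1" 4<"$file2"
echo "All files in both file lists match 100%."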

Separating lines of a huge file into two files depending on the date

I'm gathering tons of data in a stream on an Ubuntu machine; the data is stored in daily packages (each day_file is somewhere between 1 and 5 GB). I'm not an experienced Linux/bash/awk user, but the data looks something like this (all lines start with a date):
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows; it works, but it takes several hours to run, and I think there must be a faster way of doing this operation.
#!/bin/bash
dateDiff (){
line_str="$1"
dte1="2020-09-01"
dte2=${line_str:0:10}
if [[ "$dte1" == "$dte2" ]]; then
echo $line_str >> correct_date.txt;
else
echo $line_str >> wrong_date.txt;
fi
}
IFS=$'\n'
for line in $(cat massive_file.txt)
do
dateDiff "$line"
done
unset IFS
Using this awk script I'm able to process a 10 GB file in approximately 1 minute on my machine.
awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" } }' input_file_name.txt
Each line is checked against a regular expression containing your date, then the whole line is printed to one of the two files based on the match.
Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.
$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
$ awk -FT '{ print > ($1 ".txt") }' file
$ ls 20*.txt
2020-08-31.txt 2020-09-01.txt
$ cat 2020-09-01.txt
2020-09-01T00:00:00Z !In a unreadable way
$ cat 2020-08-31.txt
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
Some notes:
Using a bash loop to read logs would be very slow.
Using awk, sed, grep or similar is very good, but you will still have to read and write whole files line by line, and this has a performance ceiling.
For your specific case, you could identify only the split points (there can be up to 3 days involved, not just 2, since previous, current and next day logs can co-exist in a file) with something like grep -nm1 "^$day", and then split the log file with a combination of head and tail, as sketched below. Then append or prepend the pieces to the existing files. This would be a very fast solution because you would write the files in bulk, not line by line.
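A minimal sketch of that split-point idea, assuming massive_file.txt contains at most one transition into the current day (the file names and the date are placeholders):
day="2020-09-01"
n=$(grep -nm1 "^$day" massive_file.txt | cut -d: -f1)
if [ -n "$n" ]; then
head -n "$((n - 1))" massive_file.txt >> wrong_date.txt
tail -n "+$n" massive_file.txt >> correct_date.txt
else
cat massive_file.txt >> wrong_date.txt
fi
grep -n gives the line number of the first match; everything before it is written in one go to wrong_date.txt and the rest to correct_date.txt.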
Here is a simple solution with grep, as you need to test only the first 10 characters of the log lines, and for this job grep is faster than awk.
Assuming that you store logs in a destination directory, every incoming file should pass through something like this script. The order of processing is important: you have to follow the date order of the files, e.g. you can see that I append to existing files. This is just a demo solution for guidance.
#!/bin/bash
[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }
dest_dir=archive/
suffix="_file.log"
curr=${f:0:10}
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )
for d in $prev $curr $next; do
grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"
done
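A hypothetical invocation, assuming the script is saved as split_by_day.sh and the incoming file is named so that its first 10 characters are the date:
./split_by_day.sh 2020-09-01_incoming.log
Note that curr=${f:0:10} reads the date from the start of the argument itself, so pass a bare file name (or adapt the slicing) rather than a path with a directory prefix.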

Writing Output to Middle of File

I am currently working on a script that loops through an input file; the input file has a format similar to an /etc/hosts file in Linux. For example, the input file would look something like this:
192.168.1.21 host1
192.168.1.17 host5
192.168.1.168 host9
192.168.1.3 host3
192.168.1.37 host4
The data from the input file would need to be added to another file (an already existing hosts file) but it would need to be sorted in alphabetical order.
The following code snippet shows how I coded my script.
#Assign parameters to variables
inputFile=$1
hostsFile=$2
while read line
do
#Extract data from input file
ipAddress=`echo $line | awk '{print $1}'`
hostName=`echo $line | awk '{print $2}'`
#Loop to add host to hostname file, maintaining alphabetical order
while read line
do
addHost=`echo $line | awk '{print $2}'`
if [[ $hostName < $addHost ]]; then
#Add device in sorted alphabetical order
echo $ipAddress "\t" $hostName >> $hostsFile
break;
fi
done < $hostsFile
done < $inputFile
I am doing the string comparison, but when I write to the file using the >> or > operators, the output is appended to the end of the file. Is there any Linux ksh construct or other method that can be used in shell scripts that will allow me to insert lines of text in alphabetical order, as opposed to appending text to the end of a file?
What you're asking for is impossible: The OS itself has no way to do insertions (as opposed to replacements) of content in the middle of a file. Other than when O_APPEND is in effect, write() calls overwrite and replace any content immediately after the current position of the file pointer.
Write a new file from scratch, and rename it over the original you're trying to replace; this has the benefit of being an atomic operation:
#!/usr/bin/env ksh
inputFile=$1
hostsFile=$2
tempFile=$(mktemp "${hostsFile}.XXXXXX")
if sort -u -- "$inputFile" "$hostsFile" >"$tempFile"; then
mv -- "$tempFile" "$hostsFile" && exit
fi
rm -f -- "$tempFile"
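If the goal is to order the merged hosts file by hostname rather than by the whole line (an assumption about the desired result), the sort key can be limited to the second field; note that keeping -u in that case would deduplicate on the hostname alone, so it is dropped here:
sort -k2,2 -- "$inputFile" "$hostsFile" >"$tempFile"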

Linux : Move files that have more than 100 commas in one line

I have 100 files in a specific directory that contain several records with fields delimited by commas.
I need a Linux command that checks the lines in each file,
and if any line contains more than 100 commas, moves the file to another directory.
Is it possible?
Updated Answer
Although my original answer below is functional, Glenn's (#glennjackman) suggestion in the comments is far more concise, idiomatic, eloquent and preferable - as follows:
#!/bin/bash
mkdir subdir
for f in file*; do
# a line with more than 100 commas has at least 102 comma-separated fields
awk -F, 'NF>101{exit 1}' "$f" || mv "$f" subdir
done
It basically relies on awk's exit status generally being 0, and then only setting it to 1 when encountering files that need moving.
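As a quick illustration of that behaviour (file.csv is just a placeholder name), the exit status can be inspected directly:
awk -F, 'NF>101{exit 1}' file.csv; echo "exit status: $?"
This prints 1 when some line has more than 100 commas and 0 otherwise.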
Original Answer
This will tell you if a file has more than 100 commas on any line:
awk -F, 'NF>101{print 1; found=1; exit} END{if (!found) print 0}' someFile
It will print 1 and exit without parsing the remainder of the file if the file has any line with more than 100 commas, and print 0 at the end if it doesn't (the found flag is needed because awk still runs the END block after exit).
If you want to move them as well, use
#!/bin/bash
mkdir subdir
for f in file*; do
if [[ $(awk -F, 'NF>101{print 1; found=1; exit} END{if (!found) print 0}' "$f") != "0" ]]; then
echo mv "$f" subdir
fi
done
Try this and see if it selects the correct files, and, if you like it, remove the word echo and run it again so it actually moves them. Back up first!

Rename part of file name based on exact match in contents of another file

I would like to rename a bunch of files by changing only one part of the file name and doing that based on an exact match in a list in another file. For example, if I have these file names:
sample_ACGTA.txt
sample_ACGTA.fq.abc
sample_ACGT.txt
sample_TTTTTC.tsv
sample_ACCCGGG.fq
sample_ACCCGGG.txt
otherfile.txt
and I want to find and replace based on these exact matches, which are found in another file called replacements.txt:
ACGT name1
TTTTTC longername12
ACCCGGG nam7
ACGTA another4
So that the desired resulting file names would be
sample_another4.txt
sample_another4.fq.abc
sample_name1.txt
sample_longername12.tsv
sample_nam7.fq
sample_nam7.txt
otherfile.txt
I do not want to change the contents. So far I have tried sed and mv based on my search results on this website. With sed I found out how to replace the contents of the file using my list:
while read from to; do
sed -i "s/$from/$to/" infile ;
done < replacements.txt
and with mv I have found a way to rename files if there is one simple replacement:
for files in sample_*; do
mv "$files" "${files/ACGTA/another4}"
done
But how can I put them together to do what I would like?
Thank you for your help!
You can combine your for and while loops and use only mv:
while read -r from to ; do
for i in sample_* ; do
if [ "$i" != "${i/$from/$to}" ] ; then
mv "$i" "${i/$from/$to}"
fi
done
done < replacements.txt
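For reference, the ${i/$from/$to} expansion used above replaces the first occurrence of $from in $i, e.g.:
i=sample_TTTTTC.tsv; from=TTTTTC; to=longername12
echo "${i/$from/$to}"    # prints sample_longername12.tsv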
An alternative solution with sed is to use the e flag of the s command (a GNU sed extension), which executes the result of the substitution (use with caution! Try without the trailing e first to print the commands that would be executed).
Hence:
sed 's/\(\w\+\)\s\+\(\w\+\)/mv sample_\1\.txt sample_\2\.txt/e' replacements.txt
would parse your replacements.txt file and rename all your .txt files as desired.
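For example, without the trailing e flag the command above just prints the mv commands it would run:
mv sample_ACGT.txt sample_name1.txt
mv sample_TTTTTC.txt sample_longername12.txt
mv sample_ACCCGGG.txt sample_nam7.txt
mv sample_ACGTA.txt sample_another4.txt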
We just have to add a loop to deal with the other extensions:
for j in .txt .bak .tsv .fq .fq.abc ; do
sed "s/\(\w\+\)\s\+\(\w\+\)/mv 'sample_\1$j' 'sample_\2$j'/e" replacements.txt
done
(Note that you should get error messages when it tries to rename non-existing files, for example when it tries to execute mv sample_ACGT.fq sample_name1.fq but file sample_ACGT.fq does not exist)
You could use awk to generate commands:
% awk '{print "for files in sample_*; do mv $files ${files/" $1 "/" $2 "}; done" }' replacements.txt
for files in sample_*; do mv $files ${files/ACGT/name1}; done
for files in sample_*; do mv $files ${files/TTTTTC/longername12}; done
for files in sample_*; do mv $files ${files/ACCCGGG/nam7}; done
for files in sample_*; do mv $files ${files/ACGTA/another4}; done
Then either copy/paste or pipe the output directly to your shell:
% awk '{print "for files in sample_*; do mv $files ${files/" $1 "/" $2 "}; done" }' replacements.txt | bash
If you want longer match strings to be used first (so that ACGTA is replaced before its prefix ACGT), sort the replacements in reverse order first:
% sort -r replacements.txt | awk '{print "for files in sample_*; do mv $files ${files/" $1 "/" $2 "}; done" }' | bash
