SHELL: Sort directory names in descending order - Linux

I have 3 folders on my server.
Assuming the folder names are
workbook_20220217
workbook_20220407
workbook_20220105
Each folder contains its respective files.
I only want to print the latest folder based on the date in its name. There are 2 methods I have tried so far.
The first method I tried:
Variable declared:
TABLEAU_REPORTING_FOLDER=/farid/reporting/workbook
#First Method
ls $TABLEAU_REPORTING_FOLDER *_* | sort -t_ -n -k2 | sed ':0 N;s/\n/, /;t0'
# The first method will also return all of the contents inside the folders
# The second method I have tried
$(ls -td ${TABLEAU_REPORTING_FOLDER}/workbook/* | head -1)
# This will return the folders based on ascending order
The target output should be workbook_20220407.
What is the best approach I should look into? I can't think of any other logic than using the date as the largest value to determine whether it is the latest.
*PS: I could not use the folders' modified dates, because once the folders have been transferred to my server, all 3 folders have the same date.
UPDATE
I found a way to get the latest folder based on the folder name, using this reference: https://www.unix.com/shell-programming-and-scripting/174140-how-sort-files-based-file-name-having-numbers.html
ls | sort -t'-' -nk2.3 | tail -1
This returns the latest folder based on the folder name. Will this be safe to use?
Also, what does -nk2.3 do and mean?

You can list the files in a directory in reverse order with the -r option (regardless of which sort order you have selected). See the man page of the ls(1) command for details.
The options -n and -k2.3 of the sort(1) command mean, respectively (see also the sort(1) man page for details):
sort numerically, meaning that the keys are treated as numbers and sorted accordingly.
start the sort key at character 3 of field 2 (and run to the end of the line, since no stop position is given); note that to sort on fields 2 through 3 you would write -k2,3 with a comma, not a dot.
Read the man pages of both commands; they are your friends.
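For the folder layout in the question, a sort on the name alone is enough, because YYYYMMDD dates order the same way as strings and as numbers. A minimal sketch, assuming the workbook_* directories sit directly under $TABLEAU_REPORTING_FOLDER (adjust the glob if they are one level deeper):
TABLEAU_REPORTING_FOLDER=/farid/reporting/workbook
# list only the workbook_* directories, sort by the date part of the name
# (field 2 when splitting on '_') and keep the last, i.e. newest, entry
latest=$(ls -d "$TABLEAU_REPORTING_FOLDER"/workbook_* | sort -t_ -k2,2n | tail -n1)
echo "$latest"    # e.g. .../workbook_20220407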

Related

Compare two files and extract the values they have in common into a third file

Basically I need to compare file A, which has 100 records (all of them numerical), with file B, which also has numerical records. The idea is to compare both files and generate a third file that gives me, as output, the numbers that appear in both file A and file B. That is, comparing A with B, I want to generate a file C with the values the two files have in common (see the expected output below).
Example File A:
334030004141665
334030227891112
334030870429938
334030870429939
334030241924239
334030870429932
334030870429933
334030870429930
334030870429931
334030870429936
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
334030870429937
334030870429934
334030870429935
334030062679707
334030062679706
Example File B:
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
334030010781511
334030010783039
334030204710123
334030203456292
334030203292057
334030010807268
334030010455298
334030240658153
334030218450890
334030023035316
334030010807456
334030010457538
334030071689268
334030204710136
Expected File C:
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
I have already tried with comm, diff, and grep, but nothing works for me. Ideally I would not have to sort the files; the ones I want to compare now only have 100 records, but the next ones will have more than a million records.
Thank you for your contributions.
I'm going to look through my fingers at "I have already tried with comm, diff, grep" this time, but next time please post your actual attempts.
To extract the common information in both files the obvious would be to use grep for it:
$ grep -f A B
Output:
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
but grep in that form would accept partial matches as well, so, being lazy, I wouldn't check man grep (well, I did; it's grep -w -f A B) but would use awk instead:
$ awk 'NR==FNR{a[$0];next}($0 in a)' A B
Explained:
$ awk '
NR==FNR {       # process the first file in the list
    a[$0]       # store the record as a key in hash a
    next        # move on to the next record of the first file
}               # after this point, process all the files after the first
($0 in a)       # if the record is found in the hash, output it
' A B           # put the smaller file first, as it is the one held in memory
Once you get to the million-line files, please time both solutions (time grep ... and time awk ...) and post the difference in the comments.
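For completeness: if sorting the files turns out to be acceptable after all, comm produces the same intersection. A sketch (-12 suppresses the lines unique to each file, leaving only the common ones):
comm -12 <(sort A) <(sort B) > C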

Linux join utility complains about input file not being sorted

I have two files:
file1 has the format:
field1;field2;field3;field4
(file1 is initially unsorted)
file2 has the format:
field1
(file2 is sorted)
I run the 2 following commands:
sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2
I get the following message:
join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order
Why is this happening ?
(I also tried to sort file1 using the entire line, not only the first field of the line, but with no success.)
sort -t\; -c file1 doesn't output anything. Around line 27497, though, the situation is indeed strange, which suggests that sort isn't doing its job correctly:
XYZ113017;...
line 27497--> XYZ11301;...
XYZ11301;...
To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem).
When using join, the input files must be sorted by the join field ONLY, otherwise you may see the warning reported by the OP.
There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:
If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:
sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
sort -t, -k1,1 ... # Field 1 only
If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.
However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
sort ... # NOT always the same as 'sort -k1,1'! see below for example
Pitfall example:
#!/usr/bin/env bash
# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF
# NOTE: "WRONG" below refers to deviation from the expected outcome
# of sorting by field 1 only, based on mistaken assumptions.
# The commands do work correctly in a technical sense.
echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first
Explanation:
When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words, the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after A, resulting in the line starting with nameAA^ sorting BEFORE the one starting with nameA^.
Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:
sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before ,[1]
[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.
sort -t\; -k1,1
... or GNU sort is just as buggy as every other GNU command.
Try to sort Gi1/0/11 vs Gi1/0/1 and you'll never get a plain, regular textual sort suitable for join input, because someone added extra intelligence to sort that will happily use numeric or human-numeric sorting automagically in such cases, without even bothering to add a flag to force the regular behavior.
What is suitable for humans is seldom suitable for scripting.
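Putting the fix together for the files in the question, a minimal sketch (sorting both inputs by the join field only, then joining):
sort -t\; -k1,1 file1 -o file1
sort -t\; -k1,1 file2 -o file2      # file2 was already sorted, but this makes it explicit
join -t\; -1 1 -2 1 -o 1.1,1.2,1.3,1.4 file1 file2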

Sort hyphenated names alphabetically, then numerically

I have an unsorted server list like the following:
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
dfsqlnp-z108
dfsqlnp-z4
bgsqlnp-z1
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
Using sort to read in the file, I'm trying to get the following:
bgsqlnp-z1
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
dfsqlnp-z4
dfsqlnp-z108
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
I'm just not able to find the right keydef for my -k option.
Here's the closest I've been able to get:
sort -k2n -t"z"
bgsqlnp-z1
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
dfsqlnp-z108
dfsqlnp-z4
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
The numbers are in the right order, but the server names aren't sorted.
Attempts using a multi-field keydef (-k1,2n) seem to have zero effect (I get no sorting at all).
Here's some extra info about the server names:
1) All of the names have a "-z[1-200]" suffix; some numbers repeat.
2) Server names have differing lengths (4 to 16 characters), so using 'cut' is out of the question.
You can use sed to get around having a multi-character separator. You can switch between numeric and dictionary order after each sort key definition. Note that you need multiple -k options for multiple keys; check the man page for details on this.
Something like this:
sed 's/-z/ /' file | sort -k1,1d -k2,2n | sed 's/ /-z/'
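If GNU sort is available, version sort gives the same grouping for names like these without the sed round-trip. A sketch (note that -V is a GNU extension, so check your sort first):
sort -V file    # groups by the alphabetic prefix, then orders the trailing numbers numerically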

Bash to get timestamp from file list and compare it to filename

We are implementing a Git repository for a project and are including the DB structure by generating a dump in the post-commit hook on deployment.
What I would like is a simple versioning system for that file, based on the timestamp of the last change to the table structures.
After finding this post with the suggestion to check the dates of the *.frm files in the MySQL data dir, I thought the solution would be to base it on that last date as part of the generated file. That is:
Find the latest date-time of the DB's files (i.e. in /var/lib/mysql/databaseX/) via an ls command (of the form ls -la *.frm).
Compare that value (the last changed file) with that of a certain file (i.e. /project/dump_2012102620001.sql), where the numbers correspond to the last generated dump.
If the files' timestamp is after that date, run the mysqldump command; otherwise skip it, so the dump does not get generated and committed as a change to Git.
Unfortunately my Linux console/bash knowledge is not up to this, and I have not found any similar script to use.
You can use [[ file1 -ot file2 ]] to test whether file1 is older than file2.
last=$(ls -tr /path/to/db/files/*.frm | tail -n1)
if [[ dump -ot $last ]] ; then
    create_new_dump
fi
You can save yourself a lot of grief by just dumping the table structure every time with the appropriate mysqldump command; this is relatively lightweight since it won't include table contents. Strip out the variable timestamp information at the top, compare with the previous file, and store it if it differs.
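A minimal sketch of that second suggestion; the database name (mydb) and the target path (/project/schema.sql) are assumptions:
# dump the structure only, without the trailing dump-date comment, so that
# identical schemas produce byte-identical files
mysqldump --no-data --skip-dump-date mydb > /tmp/schema_new.sql
# store (and stage for commit) only if the structure actually changed
if ! cmp -s /tmp/schema_new.sql /project/schema.sql ; then
    mv /tmp/schema_new.sql /project/schema.sql
    git add /project/schema.sql
fi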

How to create a txt file with columns being the descending sub-directories in Linux?

My data follow the structure:
../data/study_ID/FF_Number/Exam_Number/date
where the data dir contains 176 participants' sub-directories. The ID number represents the participant's ID, and each of the following sub-directories represents some experiment number.
I want to create a txt file with one line per participant and the following columns: study_ID, FF_Number, Exam_Number and date.
However, it gets a bit more complicated, as I want to divide the participants into chunks of ~15-20 participants per chunk for the subsequent analysis.
Any suggestions?
Cheers.
Hmm, nobody?
You should redirect the output of the "find" command (consider the -type d and -maxdepth switches) and probably parse it with sed, replacing "/" with spaces. Piping through the "cut" and "column -t" commands, and "sort" and "uniq", may also be useful. Do any names, apart from FF and ID, contain spaces or special characters, e.g. related to the participants' names?
It should be possible to get the TXT with a one-liner and a few pipes.
You should try, and post the first results of your work on this :)
EDIT: Alright, I created a structure for myself with several thousand directories and subdirectories numbered by participant, by exam number, etc., which looks like this (maybe it's not identical to what you have, but don't worry). Studies are numbered from 5 to 150, FF from 45 to 75, and dates from 2012_01_00 to 2012_01_30, which makes a really huge quantity of directories in total.
/Users/pwadas/bzz/data
/Users/pwadas/bzz/data/study_005
/Users/pwadas/bzz/data/study_005/05_Num
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_00
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_01
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_02
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_03
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_04
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_05
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_06
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_07
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_08
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_09
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_10
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_11
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_12
Now, I want (quoting) a "txt file with one line per participant and the following columns: study ID, FF_number, Exam_Number and date".
So I use the following one-liner:
find /Users/pwadas/bzz/data -type d | head -n 5000 |cut -d'/' -f5-7 | uniq |while read line; do echo -n "$line: " && ls -d /Users/pwadas/bzz/$line/*Exam/* | perl -0pe 's/.*2012/2012/g;s/\n/ /g' && echo ; done > out.txt
and here is the output (the first few lines from out.txt). The lines are very long, so I cut the output to the first 80-90 characters:
dtpwmbp:data pwadas$ cat out.txt |cut -c1-90
data:
data/study_005:
data/study_005/05_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/06_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/07_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/08_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
dtpwmbp:data pwadas$
I hope this will help you a little and that you'll be able to modify it according to your needs and patterns; that seems to be all I can do :) You should analyze the one-liner, especially the "cut" command and the perl regex part, which removes newlines and the full directory name from the "ls" output. This is probably far from optimal, but beautifying is not the point here, I guess :)
So, good luck :)
PS. "head" command limits output for N first lines, you'll probably want to skip out
| head .. |
part.
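An alternative sketch using find alone, assuming the tree is exactly data/study_ID/FF_Number/Exam_Number/date and that GNU find is available; the chunk_ prefix for split is just an example:
# one line per study/FF/Exam/date combination, with the four path
# components as tab-separated columns
find data -mindepth 4 -maxdepth 4 -type d | sed 's|^data/||' | tr '/' '\t' | sort > participants.txt
# split into chunks of ~15 lines each for the follow-up analysis
split -l 15 participants.txt chunk_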
