How to search a specific column in a file for a string in shell?

I am learning shell scripting. I wrote a script that indexes all the files and folders in the current working directory in the following tab separated format,
Path Filename File/Directory Last Modified date Size Textfile(Y/N)
for example, two lines in my file 'index.txt' are,
home/Desktop/C C d Thu Jan 16 01:23:57 PST 2014 4 KB N
home/Desktop/C/100novels.txt 100novels.txt f Thu Mar 14 06:04:06 PST 2013 0 KB Y
Now, I want to write another script that searches the previous file for files, taking the file name as a command-line parameter, plus optional parameters such as -dir (search only for directories), -date DATE (last modified date), -t (text files only), etc. How do I go about this?
I know the grep command searches a file for a string, but if I enter a file name, it searches the entire line for it instead of only the filename column. How do I do this? Is there another command to achieve this?

"I want to search in a specific column for a particular string... how do I do that?"
For example, to search for directories:
awk -F '\t' '$3 == "d"' filename
You can also search by regular expression or substring matching. Please post further questions as you learn.
Single quotes prevent variable substitution (and other forms of expansion) -- see the bash manual. So you have to pass the value by other means:
awk -F'\t' -v s="$search_file" '$2 == s' ./index.txt > final.txt
Or use double quotes, which is more fragile and harder to read, IMO:
awk -F'\t' "\$2 == \"$search_file\"" ./index.txt > final.txt
Also, don't mix single and double quotes. Pick one or the other, depending on the functionality you need.
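As a minimal sketch of how the optional filters from the question could be layered on top of the -v approach above: search_index is a hypothetical helper name, and the field numbers (2 = filename, 3 = file/directory, 6 = text file) are taken from the tab-separated format shown in the question. Note that awk -v interprets backslash escapes in the value, so a name containing backslashes would need extra care.

```shell
#!/bin/sh
# search_index is a hypothetical helper, not part of the original answer.
# $1 = index file, $2 = exact filename, $3 = "d"/"f" or empty, $4 = "Y"/"N" or empty
search_index() {
    awk -F '\t' -v name="$2" -v type="$3" -v text="$4" '
        $2 == name &&
        (type == "" || $3 == type) &&
        (text == "" || $6 == text)
    ' "$1"
}

# Demo with the two sample lines from the question.
index=$(mktemp)
printf 'home/Desktop/C\tC\td\tThu Jan 16 01:23:57 PST 2014\t4 KB\tN\n' > "$index"
printf 'home/Desktop/C/100novels.txt\t100novels.txt\tf\tThu Mar 14 06:04:06 PST 2013\t0 KB\tY\n' >> "$index"

dir_hits=$(search_index "$index" C d "" | cut -f1)
text_hits=$(search_index "$index" 100novels.txt "" Y | cut -f1)
no_hits=$(search_index "$index" C f "" | cut -f1)
echo "$dir_hits"
echo "$text_hits"
rm -f "$index"
```

Because the comparison is $2 == name, a name such as C matches only the filename column and not the C that also appears inside the path column.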

You can do it using the awk command. For example, to print the first column of a file using awk:
awk -F ":" '{print $1}' /etc/passwd
(Note that the -F flag sets the field separator for the columns. In your case, use -F '\t', because your file is tab-separated and the date column itself contains spaces.) Then you can use grep to filter that column. Good luck!

Related

linux shell script to fetch a <latest file> of <particular name> in a folder

I'm enhancing my linux script, I have a folder with lots of files of different dates, I want to fetch a latest file starting with a particular name.
For ex.
I have below list of files in a folder I need latest file of name Subnetwork_RAN in a folder:
Subnetwork_PCC_11Dec2022UTC0500
Subnetwork_RAN_12Dec2022UTC0500
Subnetwork_RAN_13Dec2022UTC0500
Subnetwork_PCC_13Dec2022UTC0500
Output will be file name Subnetwork_RAN_13Dec2022UTC0500
I tried to build a linux shell script to get latest file of particular name.

This problem has a rather simple awk solution:
ls -tl | awk ' $9 ~ /Subnetwork_RAN/ {print $9; exit;}'
ls -tl outputs a long listing of the current directory, sorted by time (newest first).
This output is piped to awk which (line-by-line) looks for a filename containing the required string. The first time it finds one, it prints the filename and exits.
Note, this assumes (as in your example) that the filename contains no white space. If it does, you need to modify the print statement to print the substring of the line $0 beginning with your string, to the end of the line.
If your string might be repeated in more recent filenames but not at the start, the regex condition can be modified to select only filenames where your string is at the start $9~/^Subnetwork_RAN/
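If parsing ls output is a concern (for example because of white space in names), a loop over a glob is one alternative. The sketch below assumes a shell whose test command supports the -nt (newer than) operator, which bash and dash do although it is not required by POSIX; newest_match is a hypothetical helper name.

```shell
#!/bin/sh
# newest_match is a hypothetical helper: print the most recently modified
# file in directory $1 whose name starts with prefix $2.
newest_match() {
    latest=
    for f in "$1"/"$2"*; do
        [ -e "$f" ] || continue            # glob matched nothing
        if [ -z "$latest" ] || [ "$f" -nt "$latest" ]; then
            latest=$f
        fi
    done
    if [ -n "$latest" ]; then
        basename "$latest"
    fi
}

# Demo: recreate the listing from the question with matching timestamps.
demo=$(mktemp -d)
touch -t 202212110500 "$demo/Subnetwork_PCC_11Dec2022UTC0500"
touch -t 202212120500 "$demo/Subnetwork_RAN_12Dec2022UTC0500"
touch -t 202212130500 "$demo/Subnetwork_RAN_13Dec2022UTC0500"
touch -t 202212130500 "$demo/Subnetwork_PCC_13Dec2022UTC0500"
newest=$(newest_match "$demo" Subnetwork_RAN)
echo "$newest"
rm -rf "$demo"
```

Like ls -t, this relies on the modification time of the files, not on the date embedded in the name.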

Suppose you have a file called test.txt containing the filenames you showed. Then in bash you can do this:
awk 'BEGIN {FS="_"} $0 ~/Subnetwork_RAN/ {printf "%s ",$0; system("date +%s -d " $3)}' test.txt | sort -rn -k 2 | head -1 | cut -d " " -f 1
Output:
Subnetwork_RAN_13Dec2022UTC0500
Some explanation:
$0 ~ /Subnetwork_RAN/ matches all the lines containing the sub-string "Subnetwork_RAN"
The date command (GNU date, an external utility rather than part of bash) can recognize a date written like 13Dec2022UTC0500 and convert it to a timestamp (date +%s)
sort sorts numerically in reverse order based on the second field (timestamp output of awk system call)
head gives the first line, i.e., the most recent
cut takes only the first field given a field separator equal to " ". The first field is the full filename (printf call in awk)
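The same idea, ordering by the date embedded in the name, can also be sketched without calling date at all, by building a sortable key in awk. This assumes every name ends in _DDMonYYYYUTCHHMM exactly as in the question; latest_by_name is a hypothetical helper name.

```shell
#!/bin/sh
# latest_by_name is a hypothetical helper.
# $1 = required name prefix, $2 = file holding one filename per line
latest_by_name() {
    awk -v prefix="$1" '
        BEGIN {
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
            for (i = 1; i <= n; i++) mon[m[i]] = sprintf("%02d", i)
        }
        index($0, prefix) == 1 {
            parts = split($0, part, "_")
            d = part[parts]                      # e.g. 13Dec2022UTC0500
            # Build YYYYMMDDHHMM so that string order equals time order.
            key = substr(d, 6, 4) mon[substr(d, 3, 3)] substr(d, 1, 2) substr(d, 13, 4)
            if (key > best) { best = key; name = $0 }
        }
        END { if (name != "") print name }
    ' "$2"
}

# Demo with the four names from the question.
list=$(mktemp)
printf '%s\n' Subnetwork_PCC_11Dec2022UTC0500 Subnetwork_RAN_12Dec2022UTC0500 \
    Subnetwork_RAN_13Dec2022UTC0500 Subnetwork_PCC_13Dec2022UTC0500 > "$list"
latest_ran=$(latest_by_name Subnetwork_RAN "$list")
echo "$latest_ran"
rm -f "$list"
```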

Linux help: how do you move files to folders based on a .txt file?

I have a .txt file containing a column of IDs and their ages (as integers). I've already created separate folders in my directory for each age category (ranging from 20-86). For every ID in my .txt file I would like to move their image (which is currently stored in the folder "data") to the appropriate folder, based on their age category listed in column two of my .txt file.
Any help on how to do this in Linux would be really appreciated!
Updated example with files ending in different suffixes.
Current working directory:
data/ 20/ 21/ 22/ 23/ 24/ ...
text file:
ID001 21
ID002 23
ID003 20
ID004 22
ID005 21
ls data/
ID001-XXX-2125.jpg
ID002-YYY-2370.jpg
ID003-XXX-2125.jpg
ID004-YYY-2370.jpg
ID005-XXX-2125.jpg
Desired output:
20/
ID003-XXX-2125.jpg
21/
ID001-XXX-2125.jpg
ID005-XXX-2125.jpg
22/
ID004-YYY-2370.jpg
23/
ID002-YYY-2370.jpg

As you suggest, awk can do this kind of task (though see Ed Morton's remark below). You may try the following, which is tested with GNU awk. From your working directory you can do:
awk '{system("mv data/"$1"*.jpg " $2)}' inputfile
Explanation: Here the system() function is used. The system() function allows you to execute a command supplied as an expression. In this case:
We use the mv (move) command and use the first field $1 in the input file to address the JPG file in the data directory.
Then we use the second field $2 of the input file for the destination directory.
The system() function is modeled after the standard C library function. Further reading in The GNU Awk User’s Guide

while read -r id age
do
mv -f "data/$id"-*.jpg "$age/"
done < file
Read the two fields from the file (called file in this case) in a loop and use the variables to construct and execute the mv command.
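A slightly more defensive variant of this loop might look like the sketch below. It skips IDs with no matching image and creates the age folder if it is missing; move_by_age and the demo file names are hypothetical, and the ID-*.jpg naming is assumed from the example in the question.

```shell
#!/bin/sh
# move_by_age is a hypothetical helper.
# $1 = list file (ID and age per line), $2 = source directory,
# $3 = directory that holds the age folders
move_by_age() {
    while read -r id age; do
        [ -n "$id" ] && [ -n "$age" ] || continue   # skip blank or short lines
        for img in "$2/$id"-*.jpg; do
            [ -e "$img" ] || continue               # no image for this ID
            mkdir -p "$3/$age"
            mv -- "$img" "$3/$age/"
        done
    done < "$1"
}

# Demo in a scratch directory.
work=$(mktemp -d)
mkdir "$work/data"
printf 'ID001 21\nID002 23\nID003 20\n' > "$work/ids.txt"
touch "$work/data/ID001-XXX-2125.jpg" "$work/data/ID002-YYY-2370.jpg" \
      "$work/data/ID003-XXX-2125.jpg"
move_by_age "$work/ids.txt" "$work/data" "$work"
in_21=$(ls "$work/21")
left_in_data=$(ls "$work/data" | wc -l | tr -d ' ')
echo "$in_21"
rm -rf "$work"
```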

Assuming your .txt file is in your current working directory, can you try this written and tested script?
#!/bin/sh
DIR_CWD="/path/to/current_working_directory"
cd "$DIR_CWD/data"
for x in *; do
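# Everything before the first "-" is the ID.
ID_number=${x%%-*}
# Match the ID column exactly, so that one ID cannot match as part of another.
DIR_age=$(awk -v id="$ID_number" '$1 == id {print $2; exit}' "$DIR_CWD/file.txt")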
mv -- "$DIR_CWD/data/$x" "$DIR_CWD/$DIR_age"
done
Note that DIR_CWD must be stated as the path of your current working directory.

Grep for specific numbers within a text file and output per number text file

I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example, but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_*string of numbers*_*string of numbers*.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt:
In Linux I have tried :
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately
Things I have tried :
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.

Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an agnostic approach using awk that will parse the file just once:
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt
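To see that the single pass keeps chr1 and chr10 apart, here is a small sketch run in a scratch directory. The chrX_100_200 line is made up for the demo, and the output name is stored in a variable first so that the redirection target is unambiguous in any awk.

```shell
#!/bin/sh
work=$(mktemp -d)
printf '%s\n' chr1_12334_64321 chr1_134435_77474 chr10_463252_74754 \
    chr10_54265_423435 chr13_5464565_547644567 chrX_100_200 > "$work/chunk_names.txt"

(
    cd "$work" || exit 1
    # Field 1 (everything before the first "_") names the output file.
    awk -F '_' '{ out = $1 ".txt"; print > out }' chunk_names.txt
)

chr1_content=$(cat "$work/chr1.txt")
chrX_content=$(cat "$work/chrX.txt")
file_count=$(ls "$work" | grep -c '^chr.*\.txt$')
echo "$chr1_content"
rm -rf "$work"
```

With only 24 chromosomes the number of simultaneously open output files stays small, so no close() call is needed here.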

If you include the _ following the number you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include these in the loop
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^ to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern contained spaces or characters that are special to the shell, the quoting would make a difference.
Example:
If you searched for chr1* instead of chr1_, the unquoted pattern chr${i}* would be replaced by the list of matching file names.
When you already created your output files chr1.txt etc., try these commands
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.

Cut number from string

I want to cut several numbers from a .txt file to add them up later. Here is an excerpt from the .txt file:
anonuser pts/25 127.0.0.1 Mon Nov 16 17:24 - crash (10+23:07)
I want to get the "10" before the "+" and I only want the number, nothing else. This number should be written to another .txt file. I used this code, but it only works if the number has one digit:
awk ' /^'anonuser' / {split($NF,k,"[(+0:)][0-9][0-9]");print k[1]} ' log2.txt > log3.txt

With GNU grep:
grep -Po '\(\K[^+]*' file > new_file
Output to new_file:
10
See: PCRE Regex Spotlight: \K

What if you use the match() function in awk? (The three-argument form used here is a GNU awk extension.)
$ awk '/^anonuser/ && match($NF,/^\(([0-9]*)/,a) {print a[1]}' file
10
How does this work?
/^anonuser/ && match() {print a[1]} if the line starts with anonuser and the pattern is found, print it.
match($NF,/^\(([0-9]*)/,a) in the last field ((10+23:07)), look for the string ( + digits and capture these in the array a[].
Note also that this approach allows you to store the values you capture, so that you can then sum them as you indicate in the question.
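Summing the captured values could be sketched as follows, here with index() and substr() so that it does not depend on GNU awk. sum_days is a hypothetical helper name, and every sample line except the first is made up for the demo.

```shell
#!/bin/sh
# sum_days is a hypothetical helper: add up the number before "+" in the
# last field of every line that starts with "anonuser ".
sum_days() {
    awk '/^anonuser / {
        field = $NF                         # e.g. (10+23:07)
        plus = index(field, "+")
        if (plus > 0)
            total += substr(field, 2, plus - 2)
    }
    END { print total + 0 }' "$1"
}

log=$(mktemp)
printf '%s\n' \
    'anonuser pts/25 127.0.0.1 Mon Nov 16 17:24 - crash (10+23:07)' \
    'anonuser pts/3 127.0.0.1 Tue Nov 17 09:10 - crash (3+01:15)' \
    'otheruser pts/4 127.0.0.1 Tue Nov 17 09:12 - crash (5+00:00)' \
    'anonuser pts/5 127.0.0.1 Wed Nov 18 11:00 - 11:12 (00:12)' > "$log"
total_days=$(sum_days "$log")
echo "$total_days"
rm -f "$log"
```

Lines whose last field has no + (sessions shorter than a day) contribute nothing to the total.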

The following uses the same approach as the OP, and has a couple of advantages, e.g. it does not require anything special, and it is quite robust (with respect to assumptions about the input) and maintainable:
awk '/^anonuser/ {split($NF,k,"+"); gsub(/[^0-9]/,"",k[1]); print k[1]}' log2.txt

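For anything more complex use awk, but for a simple task sed is easy enough:
sed -rn '/^anonuser/ s/.*\(([0-9]+)\+.*/\1/p' log2.txt
This finds the number between a ( and a + sign. The -n option together with the p flag prints only the lines where the substitution succeeded.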

I am not sure about the format in the file.
Can you use simple cut commands?
cut -d"(" -f2 log2.txt| cut -d"+" -f1 > log3.txt

Change date format in first column using awk/sed

I have a shell script that runs automatically each morning and appends that day's results to a text file. The file should have today's date in the first column, followed by the results separated by commas. I use the command date +%x to get the date in the required format (dd/mm/yy). However, on one computer date +%x returns mm/dd/yyyy (any idea why this is the case?). I then sort the data in the file in date order.
Here is a snippet of such a text file
29/11/12,9654.80,194.32,2.01,7.19,-7.89,7.65,7.57,3.98,9625.27,160.10,1.66,4.90,-4.79,6.83,4.84,3.54
03/12/12,5184.22,104.63,2.02,6.88,-6.49,7.87,6.67,4.10,5169.52,93.81,1.81,5.29,-5.45,7.87,5.37,4.10
04/12/12,5183.65,103.18,1.99,6.49,-6.80,8.40,6.66,4.38,5166.04,95.44,1.85,6.04,-6.49,8.40,6.28,4.38
11/07/2012,5183.65,102.15,1.97,6.78,-6.36,8.92,6.56,4.67,5169.48,96.67,1.87,5.56,-6.10,8.92,5.85,4.67
07/11/2012,5179.39,115.57,2.23,7.64,-6.61,8.83,7.09,4.62,5150.17,103.52,2.01,7.01,-6.08,8.16,6.51,4.26
11/26/2012,5182.66,103.30,1.99,7.07,-5.76,7.38,6.37,3.83,5162.81,95.47,1.85,6.34,-5.40,6.65,5.84,3.44
11/30/2012,5180.82,95.19,1.84,6.51,-5.40,7.91,5.92,4.12,5163.98,91.82,1.78,5.58,-5.07,7.05,5.31,3.65
Is it possible to change the date format for the latter four lines to the correct date format using awk or sed? I only wish to change the date format for those in the form mm/dd/yyyy to dd/mm/yy.

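It looks like you're using two different flavors (versions) of date, or two different locales: %x prints the date in the current locale's format, so the LC_TIME/LANG settings on each machine affect the result. To check which version you've got, I think GNU date accepts the --version flag, whereas other versions, like BSD/OS X, do not accept this flag.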
Since you may be using completely different systems, it's probably safest to avoid date completely and use perl to print the current date:
perl -MPOSIX -e 'print POSIX::strftime("%d/%m/%y", localtime) . "\n"'
If you are sure you have GNU awk on both machines, you could use it like this:
awk 'BEGIN { print strftime("%d/%m/%y") }'
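If date itself is available on both machines, another option worth trying is to spell the format out instead of relying on %x. The conversions %d, %m and %y are standard, so this sketch should give the same layout regardless of locale; the variable name today is only for illustration.

```shell
#!/bin/sh
# Explicit day/month/two-digit-year format instead of the locale-dependent %x.
today=$(date +%d/%m/%y)
echo "$today"
```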
To fix the file you've got, here's my take using GNU awk:
awk '{ print gensub(/^(..\/)(..\/)..(..,)/, "\\2\\1\\3", "g") }' file
Or using sed:
sed 's/^\(..\/\)\(..\/\)..\(..,\)/\2\1\3/' file
Results:
29/11/12,9654.80,194.32,2.01,7.19,-7.89,7.65,7.57,3.98,9625.27,160.10,1.66,4.90,-4.79,6.83,4.84,3.54
03/12/12,5184.22,104.63,2.02,6.88,-6.49,7.87,6.67,4.10,5169.52,93.81,1.81,5.29,-5.45,7.87,5.37,4.10
04/12/12,5183.65,103.18,1.99,6.49,-6.80,8.40,6.66,4.38,5166.04,95.44,1.85,6.04,-6.49,8.40,6.28,4.38
07/11/12,5183.65,102.15,1.97,6.78,-6.36,8.92,6.56,4.67,5169.48,96.67,1.87,5.56,-6.10,8.92,5.85,4.67
11/07/12,5179.39,115.57,2.23,7.64,-6.61,8.83,7.09,4.62,5150.17,103.52,2.01,7.01,-6.08,8.16,6.51,4.26
26/11/12,5182.66,103.30,1.99,7.07,-5.76,7.38,6.37,3.83,5162.81,95.47,1.85,6.34,-5.40,6.65,5.84,3.44
30/11/12,5180.82,95.19,1.84,6.51,-5.40,7.91,5.92,4.12,5163.98,91.82,1.78,5.58,-5.07,7.05,5.31,3.65

This should work: sed -re 's/^([0-9][0-9])\/([0-9][0-9])\/[0-9][0-9]([0-9][0-9])(.*)$/\2\/\1\/\3\4/'
It can be made smaller but I made it so it would be more obvious what it does (4 groups, just switching month/day and removing first two chars of the year).
Tip: If you don't want to cat the file, you could make the changes in place with sed -i. But be careful: if you put in a faulty expression, you might end up breaking your source file.
NOTE: This assumes that IF the year is specified with 4 digits, the month/day is reversed.
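The same assumption (a four-digit year means month and day are swapped) can be sketched in plain awk, working on the first comma-separated field only. fix_dates is a hypothetical helper name.

```shell
#!/bin/sh
# fix_dates is a hypothetical helper: rewrite a leading mm/dd/yyyy field
# as dd/mm/yy and leave every other line untouched.
fix_dates() {
    awk -F, -v OFS=, '
        $1 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/ {
            split($1, p, "/")
            $1 = p[2] "/" p[1] "/" substr(p[3], 3)   # swap, keep last 2 year digits
        }
        { print }
    ' "$1"
}

sample=$(mktemp)
printf '%s\n' '29/11/12,9654.80,194.32' '11/26/2012,5182.66,103.30' > "$sample"
fixed=$(fix_dates "$sample")
echo "$fixed"
rm -f "$sample"
```

Setting OFS to a comma matters here: assigning to $1 makes awk rebuild the line, and without it the fields would be joined with spaces.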

The command below will do it.
Note: no matter how many lines are in the file, this changes only the last 4 lines.
tail -r your_file | awk -F, -v OFS=, 'NR<5{split($1,a,"/");$1=a[2]"/"a[1]"/"substr(a[3],3)}1' | tail -r
(tail -r is a BSD option; with GNU coreutils, tac does the same job.)
I also found a way to do it with a single awk statement and no pipes, so this solution does not need a tail command:
awk -F, -v OFS=, 'BEGIN{cmd="wc -l < your_file"; cmd | getline n; close(cmd)} n-NR<4{split($1,a,"/");$1=a[2]"/"a[1]"/"substr(a[3],3)}1' your_file

Another solution:
awk -F/ 'NR<4;NR>3{a=$1;$1=$2;$2=a; print $1"/"$2"/" substr($3,3,2) substr($3,5)}' file

Using awk:
$ awk -F/ 'NR>3{x=$1;$1=$2;$2=x}1' OFS="/" file
With / as the delimiter, all you need to do is swap the 1st and 2nd fields, which is done here using a temporary variable. Note that this leaves the four-digit year unchanged.
