manipulate directories using cut or awk? - linux

I am processing some directories in Linux and I am trying to manipulate the file names. Here's my case:
grep 'string' | awk '{print $2$3}'
and I get the following:
dir1/another-directory/even-another-directory/file1.jpeg
dir1/another-directory/even-another-directory/fiiile2.jpeg
dir1/another-directory/even-another-directory/filee4.jpeg
dir1/another-directory/even-another-directory/fileee1.jpeg
I am trying to take the last part of each of these paths (anything after the last slash), so that I get a list like this, maybe in a CSV file:
file1.jpeg
fiiile2.jpeg
filee4.jpeg
fileee1.jpeg
Would awk or cut be able to do that? I know this is a very basic question, but I couldn't find anything related online so far.
Thanks,

Do everything with awk:
awk '/string/{x=$2$3;sub(/.*\//,"",x);print x}'

Drop the grep, do the match in awk, and use xargs to call basename on each file to strip the leading directories:
awk '/string/{print $2$3}' | xargs -n1 basename
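Since the question also asks about cut: cut has no "last field" shorthand, but reversing each line turns the last path component into the first field. A minimal sketch along those lines, with input.txt and filenames.csv as purely illustrative names for whatever feeds and collects the pipeline:
# reverse each line, take the first /-separated field, reverse it back
grep 'string' input.txt | awk '{print $2$3}' | rev | cut -d/ -f1 | rev > filenames.csv
Since each output line is a single value, the result is already a valid one-column CSV.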

Related

awk searching for characters recursively

I would like to search for a string within multiple files recursively.
I have used grep before, and it works fine.
grep -r SearchString .
But I hear awk is much faster. So I am using the command below, but it just prints out everything in the file?
awk 'If ($0 ~/SearchString/) {print $0} ' /path/*
Your assumptions are a bit off. Awk is a tool designed for working on a predefined set of files, so it has no notion of recursive file processing.
On a given set of files, one would do:
awk '/regex/' file1 file2
and this would print all lines of file1 and file2 that match regex. This, however, might not be what you want, as it does not tell you which line belongs to which file, in contrast to grep -E 'regex' file1 file2. To get that, you already need to start adding stuff, making the awk a bit less comfortable:
awk '/regex/{print FILENAME,":",$0}' file1 file2
If you want to go recursive, you will need to use find for this:
find . -type f -print0 | xargs -0 awk '/regex/{print FILENAME":", $0}'
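If the extra pipe bothers you, find can hand the file list to awk directly with its -exec ... + form, which batches files into as few awk invocations as possible, much like xargs does. An equivalent sketch:
find . -type f -exec awk '/regex/{print FILENAME":", $0}' {} +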

How to print only the filename part of files that contain a certain string?

I need to print out the filename (e.g. A001.txt) that contains the string "XYZ".
I tried this:
grep -H -R "XYZ" ~/directory/*.txt | cut -d':' -f1
It outputs the entire path (e.g. ~/directory/A001.txt). How can I make it output only the filename (e.g. A001.txt)?
Why oh why did the GNU guys give grep an option to recursively find files when there's a perfectly good tool designed for the job and with an extremely obvious name. Sigh...
find . -type f -exec awk '/XYZ/{print gensub(/.*\//,"",1,FILENAME); nextfile}' {} +
The above uses GNU awk which I assume you have since you were planning to use GNU grep.
grep -lr term dir/to/search/ | awk -F'/' '{print $NF}' should do the trick.
-l just lists filenames, including their directories.
-r is recursive to go through the directory tree and all files in the dir specified.
This all gets piped to awk, which is told to use / as the delimiter (/ is not allowed in file names, so this is not as brittle as it could be) and to print the last field (NF is the number of fields, so $NF is the last field).
grep -Rl "content" | xargs -d '\n' basename -a
This should do the trick and print only the filename without the path.
basename prints filename NAME with any leading directory components removed.
Reference: https://linux.die.net/man/1/basename
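A variant that avoids the GNU-specific xargs -d flag and uses only shell parameter expansion (a sketch assuming the file names contain no newlines; ~/directory/*.txt as in the question):
# ${f##*/} strips everything up to and including the last slash
grep -Rl "XYZ" ~/directory/*.txt | while IFS= read -r f; do printf '%s\n' "${f##*/}"; done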

Use grep and cut to filter a text file to display only usernames that start with ‘m’, ‘w’ or ‘s’ and their home directories

root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
sys:x:3:1:sys:/dev:/usr/sbin/nologin
games:x:5:2:games:/usr/games:/usr/sbin/nologin
mail:x:8:5:mail:/var/mail:/usr/sbin/nologin
www-data:x:33:3:www-data:/var/www:/usr/sbin/nologin
backup:x:34:2:backup:/var/backups:/usr/sbin/nologin
nobody:x:65534:1337:nobody:/nonexistent:/usr/sbin/nologin
syslog:x:101:1000::/home/syslog:/bin/false
whoopsie:x:109:99::/nonexistent:/bin/false
user:x:1000:1000:edco8700,,,,:/home/user:/bin/bash
sshd:x:116:1337::/var/run/sshd:/usr/sbin/nologin
ntp:x:117:99::/home/ntp:/bin/false
mysql:x:118:999:MySQL Server,,,:/nonexistent:/bin/false
vboxadd:x:999:1::/var/run/vboxadd:/bin/false
This is the /etc/passwd file I need to run the command on. So far I have:
cut -d: -f1,6 testPasswd.txt | grep ???
That will display all the usernames and their associated home directories, but I'm stuck on how to find only the ones that start with m, w, or s and print the whole line.
I've tried grep -o '^[mws]*' and different variations of it, but none have worked.
Any suggestions?
Try variations of
cut -d: -f1,6 testPasswd.txt | grep '^m\|^w\|^s'
Or to put it more concisely,
cut -d: -f1,6 testPasswd.txt | grep '^[mws]'
That's neater, especially if you have a lot of patterns to match. (The attempt grep -o '^[mws]*' doesn't filter anything because * allows zero repetitions, so the pattern matches an empty string at the start of every line.)
But of course the awk solution is much better if you are not constrained to grep and cut.
Easier to do with awk:
awk 'BEGIN{FS=OFS=":"} $1 ~ /^[mws]/{print $1, $6}' testPasswd.txt
sys:/dev
mail:/var/mail
www-data:/var/www
syslog:/home/syslog
whoopsie:/nonexistent
sshd:/var/run/sshd
mysql:/nonexistent
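Because the username is the first field of /etc/passwd, a line starts with m, w, or s exactly when the username does, so grep can equally well run before the cut. A small sketch on the same testPasswd.txt:
grep -E '^[mws]' testPasswd.txt | cut -d: -f1,6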

How can I get the second column of a very large csv file using a Linux command?

I was given this question during an interview. I said I could do it with Java or Python, using something like the xreadlines() function to traverse the whole file and fetch the column, but the interviewer wanted me to use just a Linux command. How can I achieve that?
You can use the command awk. Below is an example of printing out the second column of a file:
awk -F, '{print $2}' file.txt
And to store it, you redirect it into a file:
awk -F, '{print $2}' file.txt > output.txt
You can use cut:
cut -d, -f2 /path/to/csv/file
I'd add to Andreas' answer, but I can't comment yet.
With CSV, you have to give awk a field separator argument, or it will define fields bounded by whitespace instead of commas. (Obviously, CSV that uses a different field separator will need a different character to be declared.)
awk -F, '{print $2}' file.txt
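One caveat worth mentioning: both awk -F, and cut -d, split on every comma, so they misparse quoted CSV fields that contain commas. If GNU awk (gawk 4.0 or later) is available, its FPAT variable can define fields by content instead; a sketch for that simple quoted case:
# treat a field as either a run of non-commas or a double-quoted string
gawk -v FPAT='([^,]+)|("[^"]+")' '{print $2}' file.txt
For fully general CSV (embedded newlines, escaped quotes), a real CSV parser is the safer tool.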

How to extract distinct part of a string from a file in linux

I'm using the following command to extract distinct URLs that contain the .com extension and may also contain .us or some other country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts URLs in the same domain, which I don't want. Ex:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I, using grep or any other command, extract only distinct domain names?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk instead of sed, specifying "." as the delimiter and printing out only the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
Perhaps something like this should help:
egrep -o '[^.]*.com' file
