extract date from a file name in unix using shell scripting - linux

I am working on shell script. I want to extract date from a file name.
The file name is: abcd_2014-05-20.tar.gz
I want to extract date from it: 2014-05-20

echo abcd_2014-05-20.tar.gz |grep -Eo '[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}'
Output:
2014-05-20
grep got input as echo stdin or you can also use cat command if you have these strings in a file.
-E Interpret PATTERN as an extended regular expression.
-o Show only the part of a matching line that matches PATTERN.
[[:digit:]] It will fetch digit only from input.
{N} It will check N number of digits in given string, i.e.: 4 for years 2 for months and days
Most importantly it will fetch without using any separators like "_" and "." and this is why It's most flexible solution.

Using awk with custom field separator, it is quite simple:
echo 'abcd_2014-05-20.tar.gz' | awk -F '[_.]' '{print $2}'
2014-05-20

Use grep:
$ ls -1 abcd_2014-05-20.tar.gz | grep -oP '[\d]+-[\d]+-[\d]+'
2014-05-20
-o causes grep to print only the matching part
-P interprets the pattern as perl regex
[\d]+-[\d]+-[\d]+: stands for one or more digits followed by a dash (3 times) that matches your date.

Here few more examples,
Using cut command (cut gives more readability like awk command)
echo "abcd_2014-05-20.tar.gz" | cut -d "_" -f2 | cut -d "." -f1
Output is:
2014-05-20
using grep commnad
echo "abcd_2014-05-20.tar.gz" | grep -Eo "[0-9]{4}\-[0-9]{2}\-[0-9]{2}"
Output is:
2014-05-20
An another advantage of using grep command format is that, it will also help to fetch multiple dates like this:
echo "ab2014-15-12_cd_2014-05-20.tar.gz" | grep -Eo "[0-9]{4}\-[0-9]{2}\-[0-9]{2}"
Output is:
2014-15-12
2014-05-20

I will use some kind of regular expression with the "grep" command, depending on how your file name is created.
If your date is always after "_" char I will use something like this.
ls -l | grep ‘_[REGEXP]’
Where REGEXP is your regular expression according to your date format.
Take a look here http://www.linuxnix.com/2011/07/regular-expressions-linux-i.html

Multiple ways you could do it:
echo abcd_2014-05-20.tar.gz | sed -n 's/.*_\(.*\).tar.gz/\1/p'
sed will extract the date and will print it.
Another way:
filename=abcd_2014-05-20.tar.gz
temp=${filename#*_}
date=${temp%.tar.gz}
Here temp will hold string in file name post "_" i.e. 2014-05-20.tar.gz
Then you can extract date by removing .tar.gz from the end.

Related

How to strip stdout before logging into file? [duplicate]

Without using sed or awk, only cut, how do I get the last field when the number of fields are unknown or change with every line?
You could try something like this:
echo 'maps.google.com' | rev | cut -d'.' -f 1 | rev
Explanation
rev reverses "maps.google.com" to be moc.elgoog.spam
cut uses dot (ie '.') as the delimiter, and chooses the first field, which is moc
lastly, we reverse it again to get com
Use a parameter expansion. This is much more efficient than any kind of external command, cut (or grep) included.
data=foo,bar,baz,qux
last=${data##*,}
See BashFAQ #100 for an introduction to native string manipulation in bash.
It is not possible using just cut. Here is a way using grep:
grep -o '[^,]*$'
Replace the comma for other delimiters.
Explanation:
-o (--only-matching) only outputs the part of the input that matches the pattern (the default is to print the entire line if it contains a match).
[^,] is a character class that matches any character other than a comma.
* matches the preceding pattern zero or more time, so [^,]* matches zero or more non‑comma characters.
$ matches the end of the string.
Putting this together, the pattern matches zero or more non-comma characters at the end of the string.
When there are multiple possible matches, grep prefers the one that starts earliest. So the entire last field will be matched.
Full example:
If we have a file called data.csv containing
one,two,three
foo,bar
then grep -o '[^,]*$' < data.csv will output
three
bar
Without awk ?...
But it's so simple with awk:
echo 'maps.google.com' | awk -F. '{print $NF}'
AWK is a way more powerful tool to have in your pocket.
-F if for field separator
NF is the number of fields (also stands for the index of the last)
There are multiple ways. You may use this too.
echo "Your string here"| tr ' ' '\n' | tail -n1
> here
Obviously, the blank space input for tr command should be replaced with the delimiter you need.
This is the only solution possible for using nothing but cut:
echo "s.t.r.i.n.g." | cut -d'.' -f2-
[repeat_following_part_forever_or_until_out_of_memory:] | cut -d'.' -f2-
Using this solution, the number of fields can indeed be unknown and vary from time to time. However as line length must not exceed LINE_MAX characters or fields, including the new-line character, then an arbitrary number of fields can never be part as a real condition of this solution.
Yes, a very silly solution but the only one that meets the criterias I think.
If your input string doesn't contain forward slashes then you can use basename and a subshell:
$ basename "$(echo 'maps.google.com' | tr '.' '/')"
This doesn't use sed or awk but it also doesn't use cut either, so I'm not quite sure if it qualifies as an answer to the question as its worded.
This doesn't work well if processing input strings that can contain forward slashes. A workaround for that situation would be to replace forward slash with some other character that you know isn't part of a valid input string. For example, the pipe (|) character is also not allowed in filenames, so this would work:
$ basename "$(echo 'maps.google.com/some/url/things' | tr '/' '|' | tr '.' '/')" | tr '|' '/'
the following implements A friend's suggestion
#!/bin/bash
rcut(){
nu="$( echo $1 | cut -d"$DELIM" -f 2- )"
if [ "$nu" != "$1" ]
then
rcut "$nu"
else
echo "$nu"
fi
}
$ export DELIM=.
$ rcut a.b.c.d
d
An alternative using perl would be:
perl -pe 's/(.*) (.*)$/$2/' file
where you may change \t for whichever the delimiter of file is
It is better to use awk while working with tabular data. You don't have to master on command. If it can be achieved by awk, why not use that? I suggest you do not waste your precious time, and use a handful of commands to get the job done.
Example:
# $NF refers to the last column in awk
ll | awk '{print $NF}'
If you have a file named filelist.txt that is a list paths such as the following:
c:/dir1/dir2/file1.h
c:/dir1/dir2/dir3/file2.h
then you can do this:
rev filelist.txt | cut -d"/" -f1 | rev
Adding an approach to this old question just for the fun of it:
$ cat input.file # file containing input that needs to be processed
a;b;c;d;e
1;2;3;4;5
no delimiter here
124;adsf;15454
foo;bar;is;null;info
$ cat tmp.sh # showing off the script to do the job
#!/bin/bash
delim=';'
while read -r line; do
while [[ "$line" =~ "$delim" ]]; do
line=$(cut -d"$delim" -f 2- <<<"$line")
done
echo "$line"
done < input.file
$ ./tmp.sh # output of above script/processed input file
e
5
no delimiter here
15454
info
Besides bash, only cut is used.
Well, and echo, I guess.
choose -1
choose supports negative indexing (the syntax is similar to Python's slices).
I realized if we just ensure a trailing delimiter exists, it works. So in my case I have comma and whitespace delimiters. I add a space at the end;
$ ans="a, b"
$ ans+=" "; echo ${ans} | tr ',' ' ' | tr -s ' ' | cut -d' ' -f2
b

How to cut the date out of a string in Shell?

I got a list of Strings like this in a .txt file
asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)
Now I want to cut the date out of the strings like: 26.11.2076
All this have to happen in a Shell-Script so I through cut or sed would be a good idea but i didn't found an answer in the internet.
You can use GNU grep with -E with extended regEx support using the -E, --extended-regexp flag.
$ grep -Eo "[[:digit:]]{2}.[[:digit:]]{2}.[[:digit:]]{4}" <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076
(or) if you want to run it on a file with multiple such strings, do
$ grep -Eo "[[:digit:]]{2}.[[:digit:]]{2}.[[:digit:]]{4}" input-file
If the structure of the logs/lines are similar from start till the date then following could be used:
awk '{print $5}' input
Or
grep -oP '([3][0-1]|[1-2][0-9]|[0][1-9])\.([0][0-9]|[1][0-2])\.[0-9]{4}' input
Note: this may break for month of feb.
When it comes to text parsing, I almost always prefer Perl.
Multiple comma-separated matches per line:
perl -ne '#_=/((?:\d\d\.){2}\d{4})/g and print join(",", #_), "\n"' file
Multiple matches per line joined into a single column:
perl -ne 'while (/((?:\d\d\.){2}\d{4})/g) {print "$&\n";}' file
The first matches:
perl -ne '/((?:\d\d\.){2}\d{4})/ and print "$1\n"' file
If the dates are followed by time, add (?: \d\d:\d\d) to the regular expressions, e.g.
/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/
This will make the matches stricter. Note, (?:) is a non-capturing group.
I also like grep's -P option that enables Perl-compatible regular expressions:
grep -o -P '(?:\d\d\.){2}\d{4}' file
But some implementations may not support it:
This is highly experimental and grep -P may warn of unimplemented features.
(the man page for grep).

How to find the last field using 'cut'

Without using sed or awk, only cut, how do I get the last field when the number of fields are unknown or change with every line?
You could try something like this:
echo 'maps.google.com' | rev | cut -d'.' -f 1 | rev
Explanation
rev reverses "maps.google.com" to be moc.elgoog.spam
cut uses dot (ie '.') as the delimiter, and chooses the first field, which is moc
lastly, we reverse it again to get com
Use a parameter expansion. This is much more efficient than any kind of external command, cut (or grep) included.
data=foo,bar,baz,qux
last=${data##*,}
See BashFAQ #100 for an introduction to native string manipulation in bash.
It is not possible using just cut. Here is a way using grep:
grep -o '[^,]*$'
Replace the comma for other delimiters.
Explanation:
-o (--only-matching) only outputs the part of the input that matches the pattern (the default is to print the entire line if it contains a match).
[^,] is a character class that matches any character other than a comma.
* matches the preceding pattern zero or more time, so [^,]* matches zero or more non‑comma characters.
$ matches the end of the string.
Putting this together, the pattern matches zero or more non-comma characters at the end of the string.
When there are multiple possible matches, grep prefers the one that starts earliest. So the entire last field will be matched.
Full example:
If we have a file called data.csv containing
one,two,three
foo,bar
then grep -o '[^,]*$' < data.csv will output
three
bar
Without awk ?...
But it's so simple with awk:
echo 'maps.google.com' | awk -F. '{print $NF}'
AWK is a way more powerful tool to have in your pocket.
-F if for field separator
NF is the number of fields (also stands for the index of the last)
There are multiple ways. You may use this too.
echo "Your string here"| tr ' ' '\n' | tail -n1
> here
Obviously, the blank space input for tr command should be replaced with the delimiter you need.
This is the only solution possible for using nothing but cut:
echo "s.t.r.i.n.g." | cut -d'.' -f2-
[repeat_following_part_forever_or_until_out_of_memory:] | cut -d'.' -f2-
Using this solution, the number of fields can indeed be unknown and vary from time to time. However as line length must not exceed LINE_MAX characters or fields, including the new-line character, then an arbitrary number of fields can never be part as a real condition of this solution.
Yes, a very silly solution but the only one that meets the criterias I think.
If your input string doesn't contain forward slashes then you can use basename and a subshell:
$ basename "$(echo 'maps.google.com' | tr '.' '/')"
This doesn't use sed or awk but it also doesn't use cut either, so I'm not quite sure if it qualifies as an answer to the question as its worded.
This doesn't work well if processing input strings that can contain forward slashes. A workaround for that situation would be to replace forward slash with some other character that you know isn't part of a valid input string. For example, the pipe (|) character is also not allowed in filenames, so this would work:
$ basename "$(echo 'maps.google.com/some/url/things' | tr '/' '|' | tr '.' '/')" | tr '|' '/'
the following implements A friend's suggestion
#!/bin/bash
rcut(){
nu="$( echo $1 | cut -d"$DELIM" -f 2- )"
if [ "$nu" != "$1" ]
then
rcut "$nu"
else
echo "$nu"
fi
}
$ export DELIM=.
$ rcut a.b.c.d
d
An alternative using perl would be:
perl -pe 's/(.*) (.*)$/$2/' file
where you may change \t for whichever the delimiter of file is
It is better to use awk while working with tabular data. You don't have to master on command. If it can be achieved by awk, why not use that? I suggest you do not waste your precious time, and use a handful of commands to get the job done.
Example:
# $NF refers to the last column in awk
ll | awk '{print $NF}'
If you have a file named filelist.txt that is a list paths such as the following:
c:/dir1/dir2/file1.h
c:/dir1/dir2/dir3/file2.h
then you can do this:
rev filelist.txt | cut -d"/" -f1 | rev
Adding an approach to this old question just for the fun of it:
$ cat input.file # file containing input that needs to be processed
a;b;c;d;e
1;2;3;4;5
no delimiter here
124;adsf;15454
foo;bar;is;null;info
$ cat tmp.sh # showing off the script to do the job
#!/bin/bash
delim=';'
while read -r line; do
while [[ "$line" =~ "$delim" ]]; do
line=$(cut -d"$delim" -f 2- <<<"$line")
done
echo "$line"
done < input.file
$ ./tmp.sh # output of above script/processed input file
e
5
no delimiter here
15454
info
Besides bash, only cut is used.
Well, and echo, I guess.
choose -1
choose supports negative indexing (the syntax is similar to Python's slices).
I realized if we just ensure a trailing delimiter exists, it works. So in my case I have comma and whitespace delimiters. I add a space at the end;
$ ans="a, b"
$ ans+=" "; echo ${ans} | tr ',' ' ' | tr -s ' ' | cut -d' ' -f2
b

How to extract distinct part of a string from a file in linux

I'm using the following command to extract distinct urls that contain .com extension and may contain .us or whatever country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//'
> dest.txt
The problem is that, it extracts urls in the same doamin, the thing tht I don't want. Ex:
abc.yahoo.com
efg.yahoo.com
I only need the yahoo.com. How can I using grep or any other command extract distinct domain names only ?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk in instead of sed and specify "." as the delimiter and only print out the two last fields.
awk -F "." '{ print $(NF-1)"."$NF }'
Perhaps something like this should help:
egrep -o '[^.]*.com' file

How to trim specific text with grep

I am in need of trimming some text with grep, I have tried various other methods and havn't had much luck, so for example:
C:\Users\Admin\Documents\report2011.docx: My Report 2011
C:\Users\Admin\Documents\newposter.docx: Dinner Party Poster 08
How would it be possible to trim the text file, so to trim the ":" and all characters after it.
E.g. so the output would be like:
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
use awk?
awk -F: '{print $1':'$2}' inputFile > outFile
you can use grep
(note that -o returns only the matching text)
grep -oe "^C:[^:]" inputFile > outFile
That is pretty simple to do with grep -o:
$ grep -o '^C:[^:]*' input
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
If you can have other drives just replace C by .:
$ grep -o '^.:[^:]*' input
If a line can start with something different than a drive name, you can consider both the occurrence a drive name in the beginning of the line and the case where there is no such drive name:
$ grep -o '^\(.:\|\)[^:]*' input
cat inputFile | cut -f1,2 -d":"
The -d specifies your delimiter, in this case ":". The -f1,2 means you want the first and second fields.
The first part doesn't necessarily have to be cat inputFile, it's just whatever it takes to get the text that you referred to. The key part being cut -f1,2 -d":"
Your text looks like output of grep. If what you're asking is how to print filenames matching a pattern, use GNU grep option --files-with-matches
You can use this as well for your example
grep -E -o "^C\S+"| tr -d ":"
egrep -o "^C\S+"| tr -d ":"
\S here is non-space character match

Resources