How to cut the date out of a string in Shell?

I have a list of strings like this in a .txt file:
asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)
Now I want to cut the date out of each string, e.g. 26.11.2076.
All of this has to happen in a shell script, so I thought cut or sed would be a good idea, but I couldn't find an answer on the internet.

You can use grep with the -E (--extended-regexp) flag for extended regular expressions and -o to print only the matched part. The dots are escaped so that they match a literal dot rather than any character.
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076
Or, if you want to run it on a file with multiple such strings:
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" input-file
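If you need this in a script more than once, here is a minimal sketch that wraps the same grep call in a function. The name extract_dates is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
#!/usr/bin/env bash
# extract_dates is a hypothetical helper name, not part of any tool.
# It prints every DD.MM.YYYY-shaped substring found on stdin, one per line.
extract_dates() {
  grep -Eo '[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}'
}

line='asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)'
result=$(printf '%s\n' "$line" | extract_dates)
printf '%s\n' "$result"
```

To process a whole file, redirect it into the function: extract_dates < input-file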

If all lines have the same structure from the start up to the date, the following could be used:
awk '{print $5}' input
Or
grep -oP '(3[01]|[12][0-9]|0[1-9])\.(0[1-9]|1[0-2])\.[0-9]{4}' input
Note: this does not check the number of days per month, so it also accepts impossible dates such as 30.02.

When it comes to text parsing, I almost always prefer Perl.
Multiple comma-separated matches per line:
perl -ne '@_=/((?:\d\d\.){2}\d{4})/g and print join(",", @_), "\n"' file
Multiple matches per line joined into a single column:
perl -ne 'while (/((?:\d\d\.){2}\d{4})/g) {print "$&\n";}' file
Only the first match per line:
perl -ne '/((?:\d\d\.){2}\d{4})/ and print "$1\n"' file
If the dates are followed by time, add (?: \d\d:\d\d) to the regular expressions, e.g.
/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/
This will make the matches stricter. Note, (?:) is a non-capturing group.
I also like grep's -P option that enables Perl-compatible regular expressions:
grep -o -P '(?:\d\d\.){2}\d{4}' file
But some implementations may not support it. From the man page for grep:
This is highly experimental and grep -P may warn of unimplemented features.
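If neither Perl nor grep -P is available, the stricter "date followed by time" match can also be done in bash alone. This is only a sketch, assuming bash with the [[ =~ ]] operator (not plain sh); the variable names re, line and date_part are my own.

```shell
#!/usr/bin/env bash
# Pure-bash variant: no external tools, but requires bash, not POSIX sh.
# Group 1 captures the date; the time after it must be present but is not captured.
re='([0-9]{2}\.[0-9]{2}\.[0-9]{4}) [0-9]{2}:[0-9]{2}'
line='asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)'

if [[ $line =~ $re ]]; then
  date_part=${BASH_REMATCH[1]}
  printf '%s\n' "$date_part"
fi
```

Unlike grep -o, this yields only the first match in the string, so it fits lines that carry exactly one date.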

Related

How to grep full words based on partial input?

I have a file text.txt which contains the words below.
1. moon,one
2. sun,two
3. well,three
4. doll,four
If I grep this file for sun:
grep -i sun text.txt
I get the output
sun,two
But my requirement is to search with a word that merely starts with sun, not exactly sun:
grep -i sunlight text.txt
Here I need the same output as for grep -i sun text.txt.
You don't need awk or gawk, nor sed. Just do
grep -o 'sun.*' text.txt
Other more complex / elegant solutions may be available depending on the system you are using.
What you are looking for are regular expressions.
In your case, it would be
grep -i 'sun.*' text.txt
Try using -o, as shown in the documentation.
The -o option makes grep return only the matched part. You can also use regular expressions.
grep -io sun text.txt
Is this what you're looking for?
awk -F ',' '/^[Ss][Uu][Nn]/ {print $0}' text.txt
or, if you want to search for the pattern "sun" anywhere in the input file, use this (IGNORECASE is a GNU awk feature):
awk -F ',' 'BEGIN{IGNORECASE=1} /sun/ {print $0}' text.txt
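Reading the question literally (search word sunlight, expected line sun,two), the match may need to go the other way round: find lines whose first field is a prefix of the search word. A rough sketch of that reading, assuming the file holds plain word,value lines without the list numbers; the variable names are invented.

```shell
#!/usr/bin/env bash
# Sample data mirroring text.txt from the question (written to a temp file).
tmp=$(mktemp)
printf '%s\n' 'moon,one' 'sun,two' 'well,three' 'doll,four' > "$tmp"

# Print lines whose first comma-separated field is a prefix of the search word.
# index(a, b) == 1 means "a starts with b"; tolower() makes it case-insensitive.
word='sunlight'
match=$(awk -F ',' -v w="$word" '$1 != "" && index(tolower(w), tolower($1)) == 1' "$tmp")
printf '%s\n' "$match"
rm -f "$tmp"
```

With word='sun' the same command still prints sun,two, so both searches from the question give the same output.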

How to remove lines contained in file 1 from file 2 if in file 2 they are prefixed?

I have the following situation:
source.txt
ID1:email1@domain1.com
ID2:email2@domain2.com
ID3:email3@domain3.com
...
IDs are numeric strings, e.g. 1234, 23412, 897... (one or more digits).
exclude.txt
emailX@domainX.com
emailY@domainY.com
emailZ@domainZ.com
...
i.e. only emails, no IDs.
I want to remove all lines from source.txt which contain emails listed in exclude.txt, preserving the ID:email pairs for the lines which are not removed.
How can I do that with linux command line tools (or simple bash script if needed)?
You can do it easily with awk:
awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
Alternative with grep:
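grep -v -F -f exclude.txt source.txt
Use grep with care: with -F the patterns are taken as fixed strings, but they still match anywhere in the line, so an address in exclude.txt also removes lines that merely contain it as a substring. You might also need to add grep's -w option (word matching).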
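To see the awk variant on concrete data, here is a small self-contained sketch. The IDs and addresses are invented sample data; one of them (ab@y.com) is chosen so that it contains an excluded address as a substring, which the exact field comparison in awk does not treat as a match.

```shell
#!/usr/bin/env bash
dir=$(mktemp -d)
# Invented sample data in the ID:email layout from the question.
printf '%s\n' '1:a@x.com' '22:b@y.com' '333:ab@y.com' > "$dir/source.txt"
printf '%s\n' 'b@y.com' > "$dir/exclude.txt"

# First file: remember each excluded address as an array key.
# Second file: keep lines whose second colon-separated field is not a key.
kept=$(awk -F ':' 'NR==FNR { skip[$1]; next } !($2 in skip)' \
  "$dir/exclude.txt" "$dir/source.txt")
printf '%s\n' "$kept"
rm -rf "$dir"
```

Only the line with ID 22 is dropped; the line with ID 333 survives because ab@y.com is not identical to b@y.com.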

Use regex in grep while using two files

I know that you can use regex in grep and use patterns from a file to search another file. But, can you combine these two options?
For example, from the file where the patterns come from (with the -f option for use patterns from a file), I only want to use the first column to search the second file.
I tried this:
grep -E '^(*)\b' -f file_1 file_2 > file_3
To grep the first column from file_1 with the * wildcard, but it is not working. Any ideas?
Grep doesn't use wildcards for patterns, it uses regular expressions, so (*) makes little sense.
If you want to extract the first column from a file, use cut -f1 or awk '{print $1}' (or sed or perl or whatever to extract it), then pipe the result to grep, using the special - (i.e. standard input) as the pattern file:
cut -f1 file_1 | grep -f- file_2 > file_3
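A quick way to try the pipeline is with throwaway files. This is just a sketch with made-up contents, assuming file_1 is tab-separated (the default delimiter for cut) and a grep that accepts - as the pattern file, as GNU grep does.

```shell
#!/usr/bin/env bash
dir=$(mktemp -d)
# file_1: tab-separated, patterns in column 1 (sample data invented here).
printf 'apple\tignored\nkiwi\tignored\n' > "$dir/file_1"
printf '%s\n' 'apple pie' 'banana split' 'kiwi tart' > "$dir/file_2"

# cut extracts column 1; grep reads those patterns from stdin via "-f -".
hits=$(cut -f1 "$dir/file_1" | grep -f - "$dir/file_2")
printf '%s\n' "$hits"
rm -rf "$dir"
```

Each pattern line is still interpreted as a regular expression; add -F to grep if the first column should be taken literally.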

extract date from a file name in unix using shell scripting

I am working on a shell script. I want to extract the date from a file name.
The file name is: abcd_2014-05-20.tar.gz
I want to extract date from it: 2014-05-20
echo abcd_2014-05-20.tar.gz |grep -Eo '[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}'
Output:
2014-05-20
Here grep reads its input from echo via stdin; if the strings are in a file, you can pass the file name to grep instead.
-E Interpret PATTERN as an extended regular expression.
-o Show only the part of a matching line that matches PATTERN.
[[:digit:]] Matches a single digit.
{N} Matches exactly N repetitions of the preceding item, i.e. 4 digits for the year and 2 each for the month and day.
Most importantly, it does not depend on separators such as "_" and ".", which is why it is the most flexible solution.
Using awk with custom field separator, it is quite simple:
echo 'abcd_2014-05-20.tar.gz' | awk -F '[_.]' '{print $2}'
2014-05-20
Use grep:
$ ls -1 abcd_2014-05-20.tar.gz | grep -oP '[\d]+-[\d]+-[\d]+'
2014-05-20
-o causes grep to print only the matching part
-P interprets the pattern as perl regex
[\d]+-[\d]+-[\d]+: stands for three groups of one or more digits separated by dashes, which matches your date.
Here are a few more examples.
Using the cut command (like awk, cut keeps the command readable):
echo "abcd_2014-05-20.tar.gz" | cut -d "_" -f2 | cut -d "." -f1
Output is:
2014-05-20
Using the grep command (the dash needs no backslash in an extended regular expression):
echo "abcd_2014-05-20.tar.gz" | grep -Eo "[0-9]{4}-[0-9]{2}-[0-9]{2}"
Output is:
2014-05-20
Another advantage of the grep approach is that it also fetches multiple dates, like this:
echo "ab2014-15-12_cd_2014-05-20.tar.gz" | grep -Eo "[0-9]{4}-[0-9]{2}-[0-9]{2}"
Output is:
2014-15-12
2014-05-20
I would use some kind of regular expression with the grep command, depending on how your file name is constructed.
If your date always comes after the "_" character, I would use something like this:
ls -l | grep '_[REGEXP]'
Where REGEXP is your regular expression according to your date format.
Take a look here: http://www.linuxnix.com/2011/07/regular-expressions-linux-i.html
Multiple ways you could do it:
echo abcd_2014-05-20.tar.gz | sed -n 's/.*_\(.*\)\.tar\.gz/\1/p'
sed extracts the date and prints it; the dots are escaped so that they match only a literal dot.
Another way:
filename=abcd_2014-05-20.tar.gz
temp=${filename#*_}
date=${temp%.tar.gz}
Here temp holds the part of the file name after the "_", i.e. 2014-05-20.tar.gz.
You then get the date by removing .tar.gz from the end.
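The two parameter expansions can be combined into a small function that also checks the result. This is a sketch only: date_from_name is a hypothetical name, it assumes bash (for local and [[ =~ ]]) and file names of the form PREFIX_YYYY-MM-DD.tar.gz.

```shell
#!/usr/bin/env bash
# date_from_name is a hypothetical helper; it assumes names like
# PREFIX_YYYY-MM-DD.tar.gz and returns non-zero for anything else.
date_from_name() {
  local name=$1
  local rest=${name#*_}        # drop everything up to the first "_"
  local stamp=${rest%.tar.gz}  # drop the trailing ".tar.gz"
  local re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
  # Validate the shape before printing it.
  [[ $stamp =~ $re ]] || return 1
  printf '%s\n' "$stamp"
}

extracted=$(date_from_name 'abcd_2014-05-20.tar.gz')
printf '%s\n' "$extracted"
```

The shape check only confirms digits and dashes in the right places; it does not verify that the value is a real calendar date.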

Grep Usage help

I want to use grep to find all of the headers in a corpus. I want to find everything up to the : and ignore everything after that. Does anyone know how to do that? (Could I get a complete line of code?)
Use sed or awk.
A sed example:
sed -e '/^[^:]*$/d' -e 's/^\([^:]*\):.*/\1/' filename
If all you want to do is display the first portion of the matched line then you can say
grep your_pattern | cut -d: -f 1
but if you do not want to match against the data after the colon, you need a different tool. There are many tools available: sed, awk, perl, python, etc. For instance, the Perl code would look something like this
perl -nle '($s) = split /:/; print $s if $s =~ /your_pattern/'
or the longer script version:
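#!/usr/bin/perl
use strict;
use warnings;
while (my $line = <>) {
    chomp $line;
    # List context keeps the first field; scalar context would return the field count.
    my ($substring) = split /:/, $line;
    next unless defined $substring;    # skip empty lines
    if ($substring =~ /your_pattern/) {
        print "$substring\n";
    }
}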
(I'm not sure I fully understand your question)
You can use grep and cut together; one solution (albeit far from perfect) would be:
$ cat file | grep ':' | cut -f 1 -d ':'
sed -n '/^$/q;/:/{s/:.*/:/;p;}'
This will stop after all the headers are processed.
Edit: a slightly improved version (in a basic regular expression the interval must be written \{1,\}):
sed -n '/^$/q;/^[^ :\t]\{1,\}:/{s/:.*/:/;p;}'
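For comparison, the same "stop at the first blank line" idea can be sketched in awk. The sample message below is invented, and unlike the sed commands above this version prints the header name without the trailing colon.

```shell
#!/usr/bin/env bash
# Sample message (invented): headers, a blank line, then a body with a colon.
msg=$(mktemp)
printf '%s\n' 'From: a' 'Subject: hi: there' '' 'body: text' > "$msg"

# Stop at the first blank line; for lines containing ":", print the text
# before the first colon (NF > 1 means the line has at least one colon).
headers=$(awk -F ':' '/^$/ { exit } NF > 1 { print $1 }' "$msg")
printf '%s\n' "$headers"
rm -f "$msg"
```

The body line is never reached because awk exits at the blank line, and the second colon in the Subject line is ignored because only the first field is printed.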

Resources