Filtering a large data file by date using the command line - Linux

I have a CSV file that contains a bunch of data, with one of the columns being a date. I am trying to extract all lines that have dates in a specific year and save them to a new file.
The format of file is like this with the date and time in the second column:
000000000,10/04/2021 02:10:15 AM,.....
So far I tried:
grep -E ^2020 data.csv >> temp.csv
But it just produced an empty temp.csv. Any ideas on how I can do this?

One potential solution is with awk:
awk -F"," '$2 ~ /\/2020 /' data.csv > temp.csv
Another potential option is with grep:
grep "\/2020 " data.csv > temp.csv
However, the grep solution may detect "/2020 " elsewhere in the file, rather than in column 2.
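For instance (a made-up row for illustration), a line whose third column happens to contain "/2020 " would slip through the plain grep but not the awk version:
line='000000001,10/04/2019 01:00:00 AM,paid 15/2020 batch'
echo "$line" | grep "/2020 "                # matches: "/2020 " appears in column 3
echo "$line" | awk -F"," '$2 ~ /\/2020 /'   # prints nothing: the date in column 2 is 2019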

Although an awk solution is the best fit here, e.g.
awk -F, 'index($2, "/2021 ")' file
grep can also be used here:
grep '^[^,]*,[^,]*/2021 ' file
Notes:
awk -F, 'index($2, "/2021 ")' splits the lines (records) into fields with a comma (see -F,), and if there is a /2021 + space in the second field ($2) the line is printed
the ^[^,]*,[^,]*/2021 pattern in the grep command matches
^ - start of string
[^,]* - zero or more non-comma chars
,[^,]* - a , and zero or more non-comma chars
/2021 - a literal substring.
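As a quick sanity check on the sample row from the question (trailing columns shortened here to rest-of-row), both commands print the row, since column 2 contains "/2021 ":
row='000000000,10/04/2021 02:10:15 AM,rest-of-row'
echo "$row" | awk -F, 'index($2, "/2021 ")'
echo "$row" | grep '^[^,]*,[^,]*/2021 '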

Change some field separators in awk

I have an input file
1.txt
joshwin_xc8#yahoo.com:1802752:2222:
ihearttofurkey#yahoo.com:1802756:111113
www.rothmany#mail.com:xxmyaduh:13#;:3A
and I want an output file:
out.txt
joshwin_xc8#yahoo.com||o||1802752||o||2222:
ihearttofurkey#yahoo.com||o||1802756||o||111113
www.rothmany#mail.com||o||xxmyaduh||o||13#;:3A
I want to replace the first two ':' in 1.txt with '||o||'. The script I am using is
awk -F: '{print $1,$2,$3}' OFS="||o||" 1.txt
but it is not giving the expected output.
Any help would be highly appreciated.
Perl solution:
perl -pe 's/:/||o||/ for $_, $_' 1.txt
-p reads the input line by line and prints each line after processing it
s/// is similar to substitution you might know from sed
for as a statement modifier (in postposition) runs the preceding substitution for every element in the following list
$_ keeps the line being processed
For higher counts, you can use for ($_) x N, where N is the number of occurrences to replace. For example, to substitute the first 7 occurrences:
perl -pe 's/:/||o||/ for ($_) x 7' 1.txt
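A quick illustration on a made-up line with eight colons; the first 7 are replaced and the last one is left alone:
printf '1:2:3:4:5:6:7:8:9\n' | perl -pe 's/:/||o||/ for ($_) x 7'
# 1||o||2||o||3||o||4||o||5||o||6||o||7||o||8:9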
The following sed may also help:
sed 's/:/||o||/;s/:/||o||/' Input_file
Explanation: the first command substitutes the 1st occurrence of the colon with ||o||; after that, what was the 2nd colon has become the 1st occurrence, so the second command substitutes it as well, as per the OP's requirement.
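For instance, applied to the last line of the question's input, it produces the expected output (the remaining colon is untouched):
printf 'www.rothmany#mail.com:xxmyaduh:13#;:3A\n' | sed 's/:/||o||/;s/:/||o||/'
# www.rothmany#mail.com||o||xxmyaduh||o||13#;:3A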
Another Perl solution, though the idea can apply to other languages too: use the limit parameter of split:
perl -nE 'print join q(||o||), split q(:), $_, 3' file
(q quotes because I'm on Windows)
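The limit of 3 means split stops after producing 3 fields, so any colons beyond the first two survive intact, e.g. on a made-up line:
printf 'a:b:c:d:e\n' | perl -nE 'print join q(||o||), split q(:), $_, 3'
# a||o||b||o||c:d:e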
If you need to replace the first 2 occurrences of :, use the loop below; to change the first 7 occurrences instead, change {1..2} to {1..7}. Note that the output is saved in the original file (sed -i edits in place), so nothing is displayed on screen.
for i in {1..2}
do
    sed -i "s/:/||o||/1" 1.txt
done

Linux Bash: extracting text from file into variable

I haven't found anything that clearly answers my question. Although very close, I think...
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4-digit number at the end and put it into a variable. This number changes from file to file, so I can't just search for "1158", but the "# Skipsdata for serienummer" part always remains the same.
I believe that either grep, sed or awk may be the answer but I'm not 100 % clear on their usage.
Using awk:
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
Print all numbers at the end of lines from file yourFile
grep -Po '\d+$' yourFile
Print all four-digit numbers at the end of lines, as described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables perl style regexes which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
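Putting it together, capturing the number into a variable (assuming the input file is named inputfile):
serial=$(grep -Po '^# Skipsdata for serienummer \K\d{4}$' inputfile)
echo "$serial"    # 1158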
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here about removing unwanted lines that contained strings matching a particular pattern:
Remove lines containing string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and be initiated by crontab. For this reason, I would like to keep a record of the lines this command removes, just so I can go back and check that the correct data is being removed - I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
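Since the awk command writes the surviving lines to stdout, redirect it just as the original grep did to reproduce the filtered file (newfile as in the question):
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile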

using awk or sed to print all columns from the n-th to the last [duplicate]

This question already has answers here:
awk to print all columns from the nth to the last with spaces (4 answers)
Using awk to print all columns from the nth to the last (27 answers)
Closed 6 years ago.
This is NOT a duplicate of another question.
All previous questions/solutions posted on Stack Overflow have the same issue: runs of multiple spaces get collapsed into a single space.
Example (1.txt)
filename Nospaces
filename One space
filename Two  spaces
filename Three   spaces
Result:
awk '{$1="";$0=$0;$1=$1}1' 1.txt
Nospaces
One space
Two spaces
Three spaces
awk '{$1=""; print substr($0,2)}' 1.txt
Nospaces
One space
Two spaces
Three spaces
Specify the field separator FS with the -F option to stop awk from collapsing multiple spaces:
awk -F "[ ]" '{$1="";$0=$0;$1=$1}1' 1.txt
awk -F "[ ]" '{$1=""; print substr($0,2)}' 1.txt
If you define a field as any number of non-space characters followed by any number of space characters, then you can remove the first N like this:
$ sed -E 's/([^[:space:]]+[[:space:]]*){1}//' file
Nospaces
One space
Two  spaces
Three   spaces
Change {1} to {N}, where N is the number of fields to remove. If you only want to remove 1 field from the start, then you can remove the {1} entirely (as well as the parentheses which are used to create a group):
sed -E 's/[^[:space:]]+[[:space:]]*//' file
Some versions of sed (e.g. GNU sed) allow you to use the shorthand:
sed -E 's/(\S+\s*){1}//' file
If there may be some white space at the start of the line, you can add a \s* (or [[:space:]]*) to the start of the pattern, outside of the group:
sed -E 's/\s*(\S+\s*){1}//' file
The problem with using awk is that whenever you touch any of the fields on given record, the entire record is reformatted, causing each field to be separated by OFS (the Output Field Separator), which is a single space by default. You could use awk with sub if you wanted but since this is a simple substitution, sed is the right tool for the job.
To preserve whitespace in awk, you'll have to use regular expression substitutions or use substrings. As soon as you start modifying individual fields, awk has to recalculate $0 using the defined (or implicit) OFS.
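A one-line demonstration of that reassembly: any field assignment, even a no-op one like $1=$1, forces awk to rejoin the record with single-space OFS:
$ echo 'a  b   c' | awk '{$1=$1} 1'
a b c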
Referencing Tom's sed answer:
awk '{sub(/^([^[:blank:]]+[[:blank:]]+){1}/, "", $0); print}' 1.txt
Use cut:
cut -d' ' -f2- a.txt
prints all columns from the second to the last and preserves whitespace.
Working awk code with no leading space, supporting multiple spaces within the columns, and printing from the n-th column (passed in as the awk variable n; note that index locates the first occurrence of $n's text, so this assumes that text doesn't also appear earlier in the line):
awk -v n=2 '{ print substr($0, index($0, $n)) }' 1.txt

Find the most common line in a file in bash

I have a file of strings:
string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123
How do I retrieve the most common line in bash (string-string-123)?
You can use sort with uniq
sort file | uniq -c | sort -n -r
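With the sample file from the question, this prints each distinct line with its count, most frequent first (exact padding of the count column may vary by implementation); pipe through head -n 1 to keep just the top entry:
$ sort file | uniq -c | sort -n -r
      3 string-string-123
      2 string-string-12345
      1 string-string-12345-123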
You could use awk to do this:
awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file
The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:
awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file
Thanks to glenn jackman for this useful suggestion.
It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:
awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
However, it may not be suitable for large files, because all distinct input lines are stored in an associative array in memory, which could be a problem if there are many non-duplicate lines; that said, it's much faster than the approaches discussed below.
Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:
all distinct lines are printed (highest-frequency count first)
output lines are prefixed by their occurrence count (which may actually be desirable).
While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.
Building on Grzegorz's answer, here's a generic solution that shows all most-frequently-occurring lines - however many there are - and only them:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'
If you don't want the output lines prefixed with the occurrence count:
sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' |
sed 's/^ *[0-9]\{1,\} //'
Explanation of Grzegorz Żur's answer:
uniq -c outputs the set of unique input lines prefixed with their respective occurrence count (-c), followed by a single space.
sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
Note that sort, if -k is not specified, will generally try to sort by the entire input line, but -n causes only the longest prefix that is recognized as an integer to be used for sorting, which is exactly what's needed here.
Explanation of my awk command:
NR==1 {prev=$1} stores the 1st whitespace-separated field ($1) in variable prev for the first input line (NR==1)
$1!=prev {exit} terminates processing, if the 1st whitespace-separated field is not the same as the previous line's - this means that a non-topmost line has been reached, and no more lines need printing.
1 is shorthand for { print } meaning that the input line at hand should be printed as is.
Explanation of my sed command:
^ *[0-9]\{1,\} matches the numeric prefix (denoting the occurrence count) of each output line, as (originally) produced by uniq -c
applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
