Subsetting a CSV by unique column values - linux

I am fairly new to Linux and feel this should be a fairly simple task, but I cannot quite figure it out. I have a large data file with millions of rows, and I want to break the file into smaller files based on date. I have a time column that contains YYMMDDHH data, and I want to create sub-files based on the DD. For each new DD, I want a new file created with all entries for that day. The file is a CSV and is already sorted by time.
From what I have read it looks like I should be able to use cat, awk and possibly grep to perform what I want.
To elaborate further, there are 14 columns per row. One column has data that contains YYMMDDHH (e.g. 14071000, 14071000, ..., 14071022, 14071022, ..., 14071100, ..., 14071200, ...).
I can manually subset with
cat trial | awk 'NR>=1 && NR<=100 {print}' >output.txt
This gives me the rows between 1 and 100. I was wondering if there is a command that allows me to extract based on the YYMMDDHH column, so that all data points on 140710 could be put in a single file. Hope that helps explain my problem a little better.

You should be able to use something like this:
awk '{ line_date = int($1 / 100); print > ("out_" line_date ".txt"); }'
BTW you might want to avoid 'useless use of cat' by not piping but using awk directly on your file.
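For example, reading the file directly (a sketch assuming the file is named trial as in the question, that it is comma-separated, and that the YYMMDDHH value is the first field):
awk -F, '{ print > ("out_" int($1 / 100) ".txt") }' trial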

YYMMDDHH 14071000
Imagine YYMMDDHH is in the 1st column.
awk '{fn = substr($1, 1, 6) ; print $0 >> fn }' 1.txt

awk '{print $0 >> "File" substr($1, 1, 6) ".txt"}' file
Assuming the date is in the first column. The logic is to append each line to the corresponding file (the name of the file is the date in YYMMDD format), so that all data corresponding to each date ends up in the corresponding "FileYYMMDD.txt". If the date is in some other column, you can just change $1 to the column number (a comma-separated variant is sketched after the sample output below).
Sample Output:
sdlcb@Goofy-Gen:~/AMD/SO$ cat file
14071000 asasaa
14071022 iosido
14071000 lsdksld
14071022 sodisdois
14071100 iwiwe
14071022 iosido
14071100 iwiwe
14071200 yqiwyq
sdlcb@Goofy-Gen:~/AMD/SO$ awk '{print $0 >> "File" substr($1, 1, 6) ".txt"}' file
sdlcb@Goofy-Gen:~/AMD/SO$ ls
file File140710.txt File140711.txt File140712.txt
sdlcb@Goofy-Gen:~/AMD/SO$ cat File140710.txt
14071000 asasaa
14071022 iosido
14071000 lsdksld
14071022 sodisdois
14071022 iosido
sdlcb@Goofy-Gen:~/AMD/SO$ cat File140711.txt
14071100 iwiwe
14071100 iwiwe
sdlcb@Goofy-Gen:~/AMD/SO$ cat File140712.txt
14071200 yqiwyq
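Since the OP's real file is a CSV with 14 columns, a comma-separated variant would look like this (a sketch; the column number 3 and the file name trial.csv are only placeholders):
awk -F, '{ print >> ("File" substr($3, 1, 6) ".txt") }' trial.csv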

Related

How to print the value in the third column of the line that comes after a line containing a specific string, using AWK, to a different file?

I have an output which contains something like this in the middle.
Stopping criterion = max iterations
Energy initial, next-to-last, final =
-83909.5503696 -86748.8150981 -86748.8512012
What I am trying to do is print the last value (3rd column) of the line after the line which contains the string "Energy" to a different file, and I have to print these values from 100 different files. Currently I have been trying with this line, which only looks at a single file:
awk -F: '/Energy/ { getline; print $0 }' inputfile > outputfile
but this gives output like:
-83909.5503696 -86748.8150981 -86748.8512012
Update: With the help of a suggestion below I was able to output the value to a file, but as it reads through different files it overwrites the output file, leaving only the value from the last file it read. What I tried was this:
#SBATCH --array=1-100
num=$SLURM_ARRAY_TASK_ID
fold=$(printf '%03d' $num)
cd $main_path/surf_$fold
awk 'f{print $3; f=0} /Energy/{f=1}' inputfile > outputfile
This would not be an appropriate job for getline, see http://awk.freeshell.org/AllAboutGetline, and idk why you're setting FS to : with -F: when your fields are space-separated as awk assumes by default.
Here's how to do what I think you're trying to do with 1 call to awk:
awk 'f{print $3; f=0} /Energy/{f=1}' "$main_path/surf_"*"/inputfile" > outputfile
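If you also need to know which file each value came from, a small variation (the output format here is just an assumption) prints awk's built-in FILENAME in front of the value:
awk 'f{print FILENAME, $3; f=0} /Energy/{f=1}' "$main_path/surf_"*"/inputfile" > outputfile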

Comparing two files and updating the second file using bash and awk, and sorting the second file

I have two files with two columns each, and I want to compare the 1st column of both files. If a value in the 1st column of the first file does not exist in the second file, I then want to append that value to the second file, e.g.
firstFile.log
1457935407,998181
1457964225,998191
1457969802,997896
secondFile.log
1457966024,1
1457967635,1
1457969802,5
1457975246,2
After, secondFile.log should look like:
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Note: Second file should be sorted by the first column after being updated.
Using awk and sort:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]; next} {delete a[$1]; print} END{
for (i in a) print i, "null"}' firstFile.log secondFile.log |
sort -t, -k1 > $$.temp && mv $$.temp secondFile.log
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
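The same logic spelled out with comments (just a readability sketch of the one-liner above; save it as, say, update.awk and pipe its output through the same sort as before):
# run as: awk -f update.awk firstFile.log secondFile.log
BEGIN { FS = OFS = "," }
FNR == NR { a[$1]; next }             # 1st file: remember every key from column 1
{ delete a[$1]; print }               # 2nd file: print as-is and drop its key from the set
END { for (i in a) print i, "null" }  # keys never seen in the 2nd file get a "null" value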
Using non-awk tools:
$ sort -t, -uk1,1 file2 <(sed 's/,.*/,null/' file1)
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2

Bash CSV sorting and unique-ing

a Linux question: I have the CSV file data.csv with the following fields and values
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. I know that the first field can have repeated values, as can the second one, but taken together they form a unique pair.
What I would like to do is to sort the file according to the column KEY and then, for each unique value under KEY, keep only the row having the highest value under LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command line tools like sort, uniq, awk and so on? Thanks in advance
try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is: because your file is already sorted, just go through the input from the top; whenever the first column (KEY) changes, print the previous line, which holds the highest LEVEL value of the previous KEY.
Try it with your real data; it should work.
Also, the whole logic (including your sort) could be done by awk in a single process, as sketched below.
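A sketch of that single-process version (assuming GNU awk for asorti(); with keys formatted like "0.000"-"9.999", sorting them as strings is good enough):
awk -F, '
    # skip the header line, then keep the row with the highest LEVEL seen for each KEY
    NR > 1 && (!($1 in max) || $2 + 0 > max[$1] + 0) { max[$1] = $2; row[$1] = $0 }
    END {
        n = asorti(row, keys)                        # GNU awk extension: sort the KEYs
        for (i = 1; i <= n; i++) print row[keys[i]]
    }' data.csv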
Use:
$> sort -r data.csv | uniq -w 5 | sort
given your floats are formatted "0.000"-"9.999". The sort -r puts the line with the highest LEVEL first within each KEY, uniq -w 5 keeps only the first line for each 5-character KEY, and the final sort restores ascending order.
Perl solution:
perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join ",", $_, @{$h{$_}} for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested: the first line contains bbb, not ccc.

Comparing two files using awk and printing the contents which match from the other file

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not, please let me know how I can do this.
Extra definitions
Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa? Yes.
If so, do you want to see the unmatched records? Yes I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
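With the question's file2.txt, the intermediate sed output should look roughly like this (shown only to illustrate how the join key is formed):
$ sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt
919167,919167000000
919594,919594000000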
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
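For what it's worth, here is a rough awk sketch of that variable-length (longest-prefix) matching; this is not the join solution above nor the Perl the answer has in mind, and the file names are just the ones from the question:
awk -F, '
    FNR == NR { info[$1] = $2 "," $3; next }       # file1.txt: prefix -> "operator,region"
    {
        for (len = length($1); len > 0; len--) {   # try successively shorter prefixes
            p = substr($1, 1, len)
            if (p in info) { print $1 "," info[p]; next }
        }
        print $1 ",nomatch,nomatch"                # no prefix matched at any length
    }' file1.txt file2.txt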
The extra information needed is now available. The key point is that the 6-digit prefix is fixed; phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
    FS = OFS = ","
}
FNR == NR {
    sub(/[ \t]+$/, "")
    line = substr($0, 1, 6)
    array[line] = $0
    next
}
{
    printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"
    dup[$1]++
}
END {
    for (i in array) {
        if (!(i in dup)) {
            printf "FILE2 no match --> %s\n", array[i]
        }
    }
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 1, 6); array[line]=$0; next } { printf ($1 in array) ? $0"\n" : "FILE1 no match --> "$0"\n"; dup[$1]++} END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
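With the extended data from the question, this should produce output along these lines (note the format differs from the nomatch,nomatch layout the question asked for):
919167,hutch,mumbai
919594,idea,mumbai
FILE1 no match --> 918888,airtel,karnataka
FILE2 no match --> 919212334323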
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if($1~"^"i) print $1","a[i]}}' file1.txt file2.txt

extracting data from two list using a shell script

I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry, it adds it to another file, and it loops through the first list until it has gone through the whole file. The data in the first file looks like this -
email@address.com;
email2@address.com;
and so on
The other file, in which I am looking for a match (the match then gets placed in the blank output file), looks like this -
12334 email@address.com;
32213 email2@address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this but only for addresses that match -
12334 email@address.com;
32213 email2@address.com;
Thank You
I quite like manipulating text SQL-style, with a join:
$ cat file1
b@address.com
a@address.com
c@address.com
d@address.com
$ cat file2
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b@address.com
22519 d@address.com
make the keys sorted (we use the email addresses as keys here)
join on the keys (file1.column1, file2.column2)
format the output (use awk to reverse the columns)
As you've learned about diff and comm, now it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates: it joins together 2 files. The way you join is based on keys embedded in the file.
The number one constraint on using join is that the data must be sorted on the join column in both files.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
You can consult the join man page for how to reduce and reorder the columns of output. For example:
$ join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
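A minimal sketch of such a wrapper (the prompts mirror the question's own script; output.txt is only a placeholder name):
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
# join needs both inputs sorted on the join field (the email address here)
join -1 1 -2 2 <(sort "$file1") <(sort -k2 "$file2") | awk '{print $2, $1}' > output.txt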
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
Read file1.txt line by line, assigning each line to the variable ADDR; grep file2.txt for the content of ADDR and append the output to file_result.txt.
while read ADDR; do grep "${ADDR}" file2.txt >> file_result.txt; done < file1.txt
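If spawning one grep per address turns out to be too slow on big files (see the note above about process-startup cost), a single-process alternative is grep's -f option, which reads all the patterns from file1.txt at once; -F treats them as fixed strings rather than regexes:
grep -F -f file1.txt file2.txt > file_result.txt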
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk's built-in variables that store the line numbers. NR does not get reset to 0 when working with two files; FNR does. So while that condition is true, we add everything to the array a. Once the first file is done, we check the second column of the second file. If a match is present in the array, we put the entire line in the file f3.txt; if not, we ignore it.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b@address.com
a@address.com
c@address.com
d@address.com
[jaypal:~/Temp] cat f2.txt
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b@address.com
22519 d@address.com
