BASH Split CSV Into Multiple Files Based on Column Value [duplicate]

I have a file named fulldata.tmp which contains pipe-delimited data (I can change it to comma if needed, but I generally like using pipe). With a Bash shell script I would like to split lines out to new files based on the value in column 1, retaining the header in each file. I'm pulling this data via SQL, so I can pre-sort if needed, but I don't have direct access to the terminal running this script, so development and debugging are difficult. I've searched dozens of examples, mostly recommending awk, but I'm not connecting the dots. This is my core need; below are a couple of quality-of-life options I'd like if they're easy, along with example data.
Nice if possible: I would like to specify which columns print to the new files (my example desired output shows I want columns 1-4 out of the initial 5 columns).
Nice if possible: I would like the new files named with a prefix, then the value being split on, followed by the extension: final_$col1.csv
GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future
final_ABC.csv
ABC|001|2022-09-15|True
final_DEF.csv
DEF|001|2022-09-16|False
final_GHI.csv
GHI|002|2022-10-17|True

Maybe awk
awk -F'|' -v OFS='|' 'NR>1{print $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp
Check the created CSV files and their contents:
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
ABC|001|2022-09-15|True
==> final_DEF.csv <==
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GHI|002|2022-10-17|True
Here is how I would do the header.
IFS= read -r head < fulldata.tmp
Then pass the variable to awk, tracking which files have been seen so the header is written only once per file:
awk -F'|' -v OFS='|' -v header="${head%|*}" 'NR>1{f="final_"$1".csv"; if (!(f in seen)) {print header > f; seen[f]=1}; print $1, $2, $3, $4 > f}' fulldata.tmp
Run tail again to check.
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
GROUPID|LABEL|DATE|ACTIVE
ABC|001|2022-09-15|True
==> final_DEF.csv <==
GROUPID|LABEL|DATE|ACTIVE
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GROUPID|LABEL|DATE|ACTIVE
GHI|002|2022-10-17|True
You did find a solution with pure awk.
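For reference, a minimal single-pass pure-awk sketch (it assumes GNU awk, which transparently manages many simultaneously open output files):
awk -F'|' -v OFS='|' '
NR==1 { header = $1 OFS $2 OFS $3 OFS $4; next }         # keep columns 1-4 of the header
{
    f = "final_" $1 ".csv"
    if (!(f in seen)) { print header > f; seen[f] = 1 }  # write the header once per file
    print $1, $2, $3, $4 > f
}' fulldata.tmp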

This works and preserves the header, which I believe was a requirement.
cut -d '|' -f 1 fulldata.tmp | grep -v GROUPID | sort -u | while read -r id; do grep -E "^(${id}|GROUPID)\|" fulldata.tmp > "final_${id}.csv"; done
I think a pure awk solution is better though, since this rereads the whole file once per group.


Comparing two csv files with different lengths but output only the line where it matches the same value in two different column

I've been trying to compare two CSV files using a simple shell script, but I think the code I was using is not doing its job. What I want to do is compare the two files using column 6 from first.csv and column 2 in second.csv, and when they match, output the line from first.csv. See below for an example.
first.csv
1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312
second.csv
14,7855712,Whie,Black
124,7855712,Green,Black
174,1197,Black,Orange
1284,98132197,Yellow,purple
35384,9811123197,purple,purple
13354,0981123131197,green,green
183434,0811912313127,white,green
Output should be from the first file:
1,0,3210820,0,536,7855712
I've been using the code below.
cat first.csv | while read line
do
cat second.csv | grep $line > output_file
done
Please help. Thank you.
Your question is not entirely clear, but here is what I think you want:
while IFS= read -r LINE; do
    VAL=$(echo "$LINE" | cut -d, -f6)
    grep -q "$VAL" second.csv && echo "$LINE"
done < first.csv
The first line in the loop extracts the 6th field from the line and stores it in VAL. The next line checks (quietly) whether VAL occurs in second.csv and, if so, outputs the line.
Note that grep will check for any occurrence in second.csv, not only in field 2. To check only against field 2, change it to:
cut -d, -f2 second.csv | grep -q "$VAL" && echo $LINE
Unrelated to your question, I would like to comment that such tasks can be solved much more efficiently in a language like Python.
Well... if you have bash with process substitution, you can treat all the 2nd fields in second.csv (with a $ appended to anchor the search at the end of the line) as input from a file. Then, using grep -f, you can match data from the 2nd column of second.csv against the end of each line in first.csv, doing what you intend.
You can use the <(process) form to redirect the 2nd field as a file using:
grep -f <(awk -F, '{print $2"$"}' second.csv) first.csv
Example Output
With the data you show in first.csv and second.csv you get:
1,0,3210820,0,536,7855712
Adding the "$" anchor as part of the 2nd field from second.csv should satisfy the match only in the 6th field (end of line) in first.csv.
The benefit here is a single call each to grep and awk, rather than an additional subshell spawned per iteration. It doesn't matter with small files like your sample input, but with millions of lines we are talking hours (or days) of processing-time difference.
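If you want to rule out substring false positives entirely (the $-anchored pattern could still match a longer number with the same trailing digits), a sketch of an exact field-to-field comparison in a single awk pass:
awk -F, 'NR==FNR { want[$2]; next } $6 in want' second.csv first.csv
The 2nd fields of second.csv are loaded as array keys, then each line of first.csv is printed only when its 6th field is exactly one of those keys.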

awk printing nothing when used in loop [duplicate]

I have a bunch of files using the format file.1.a.1.txt that look like this:
A 1
B 2
C 3
D 4
and was using the following command to add a new column containing the name of each file:
awk '{print FILENAME (NF?"\t":"") $0}' file.1.a.1.txt > file.1.a.1.txt
which ended up making them look how I want:
file.1.a.1.txt A 1
file.1.a.1.txt B 2
file.1.a.1.txt C 3
file.1.a.1.txt D 4
However, I need to do this for multiple files as a job on an HPC using sbatch submission. But when I run the following job script:
#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10
N=$SLURM_ARRAY_TASK_ID
for j in {a,b,c};
do
for i in {1,2,3}
do awk '{print FILENAME (NF?"\t":"") $0}' file.${N}."$j"."$i".txt > file.${N}."$j"."$i".txt
done
done
awk is generating empty files. I have tried using cat to read the file and then piping it to awk, but that also hasn't worked.
You don't need a loop, and you cannot redirect stdout to the same file you're reading from: the shell truncates the output file before awk ever reads it, which is why you get blank files.
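For a single file, the usual workaround is writing to a temporary file and moving it back, e.g.:
awk '{print FILENAME "\t" $0}' file.1.a.1.txt > file.1.a.1.txt.tmp &&
    mv file.1.a.1.txt.tmp file.1.a.1.txt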
Try this:
#!/bin/bash
N=$SLURM_ARRAY_TASK_ID
awk '
NF {
    print FILENAME "\t" $0 > (FILENAME ".tmp")
}
ENDFILE {                  # ENDFILE requires gawk
    close(FILENAME ".tmp")
}' file."$N".{a,b,c}.{1,2,3}.txt
for file in file*.tmp; do
    mv "$file" "${file%.tmp}"
done
Note that if you don't have GNU awk (which ENDFILE{} requires), you can remove that stanza and get away with either:
Putting a close() call just after the print statement (this reopens the file for every line, which comes with a lot of overhead; see the sketch after this list)
Not calling close() at all; as long as you don't have a lot of files, you should be fine.
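A POSIX-awk sketch of the first option; note the redirection must become an append (>>), because after each close() a plain > would truncate the file on reopen, so remove any stale .tmp files before running:
awk 'NF {
    f = FILENAME ".tmp"
    print FILENAME "\t" $0 >> f   # append: the file is reopened after every close()
    close(f)
}' file."$N".{a,b,c}.{1,2,3}.txt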

awk output to variable [duplicate]

[Dd])
echo"What is the record ID?"
read rID
numA= awk -f "%" '{print $1'}< practice.txt
I cannot figure out how to set numA to the output of the awk command in order to compare rID and numA. numA should be the first field of a txt file whose fields are separated by %. Any suggestions?
You can capture the output of any command in a variable via command substitution:
numA=$(awk -F '%' '{print $1}' < practice.txt)
Unless your file contains only one line, however, the awk command you presented (as corrected above) is unlikely to be what you want to use. If the practice.txt file contains, say, answers to multiple questions, one per line, then you probably want to structure the script altogether differently.
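For example, a sketch that compares the first %-separated field of every line against rID (the loop body is hypothetical, since the rest of your script wasn't shown):
while IFS='%' read -r first rest; do
    if [ "$first" = "$rID" ]; then
        echo "found record: $first"
    fi
done < practice.txt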
You don't need to use awk, just use parameter expansion:
numA=${rID%%\%*}
This is the correct syntax:
numA=$(awk -F'%' '{print $1}' practice.txt)
However, it will be easier to do comparisons in awk by passing the bash variable in:
awk -F'%' -v r="$rID" '$1==r{... do something ...}' practice.txt
Since you didn't specify any details, it's difficult to suggest more...
To remove the rID-matching lines from the file, do this:
awk -F'%' -v r="$rID" '$1!=r' practice.txt > output
This will print the lines where the condition is met ($1 not equal to rID), which is equivalent to deleting the ones that are equal. You can mimic in-place replacement with
awk ... practice.txt > temp && mv temp practice.txt
where you fill in ... from the line above.
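Filled in, that is:
awk -F'%' -v r="$rID" '$1!=r' practice.txt > temp && mv temp practice.txt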
Try using:
$ numA=$(awk -F'%' '{ if ($1 != $0) { print $1; exit } }' practice.txt)
From the question, "numA is equal to the first field of a txt file which is separated by %"
-F'%', meaning % is the only separator we care about
if($1 != $0), meaning ignore lines that don't have the separator
print $1; exit;, meaning exit after printing the first field that we encounter separated by %. Remove the exit if you don't want to stop after the first field.
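If all you want is the first field of the first line, an NR==1 condition is a more direct sketch:
numA=$(awk -F'%' 'NR==1 { print $1; exit }' practice.txt)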

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter from 0001 to 6000 in its place for my 6000 such files.
I also want to remove the trailing 3 characters from each file name after this is done.
After the fix the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. The list of the files is in a file named input.txt, one file per line
2. The code is running in the directory the files are in
3. bash is available
awk '{ i++; printf "mv \x27%s\x27 \x27%s%05d%s\x27\n", $0, substr($0,1,16), i, substr($0,22,47) }' input.txt | bash
From the command prompt, give the following command:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}'
%
and check the output; if it looks OK, pipe it to the shell:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to give awk the list of filenames to modify, one per line (echo would put them all on a single line); you may want something more articulated if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes to replace $line, and the idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk. The generated mv commands are single-quoted so the shell doesn't try to expand the literal $line in the old names. As a last remark, in similar cases I prefer to make at least a dry run before passing the commands to the shell.
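A pure-bash alternative sketch (it assumes bash, that the glob order gives the numbering you want, and uses a 5-digit counter to match the desired names shown in the question):
n=0
for f in *.DAT???; do
    n=$((n+1))
    new=${f/\$line/$(printf '%05d' "$n")}   # replace the literal $line with the counter
    new=${new%???}                          # drop the trailing 3 characters
    echo mv -- "$f" "$new"                  # remove the echo once the commands look right
done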

how to cut CSV file

I have the following CSV file
more file.csv
Number,machine_type,OS,Version,Mem,CPU,HW,Volatge
1,HG652,linux,23.12,256,III,LOP90,220
2,HG652,linux,23.12,256,III,LOP90,220
3,HG652,SCO,MK906G,526,1G,LW1005,220
4,HG652,solaris,1172,1024,2Core,netra,220
5,HG652,solaris,1172,1024,2Core,netra,220
Please advise how to cut a CSV file (with the cut, sed, or awk command)
in order to get a partial CSV file.
The command needs to take a value representing the number of fields we want to keep from the CSV;
according to Example 1, the value should be 6.
Example 1
In this example we keep the first 6 fields from left to right, so the CSV will look like this:
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
cut is your friend:
$ cut -d',' -f-6 file
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
Explanation
-d',' sets comma as the field separator
-f-6 prints up to field number 6 based on that delimiter. It is equivalent to -f1-6, as 1 is the default.
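Since the question asks for the number of fields to be a parameter, a trivial sketch with a hypothetical variable n:
n=6
cut -d',' -f-"$n" file.csv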
awk can also do it, if necessary:
$ awk -v FS="," 'NF{for (i=1;i<=6;i++) printf "%s%s", $i, (i==6?RS:FS)}' file
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
The cut command line is rather simple and well suited to your case:
cut -d, -f1-6 yourfile
So everybody agrees that the cut way is the best way to go in this case. But we can also talk about the awk solution. There I may point out that in fedorqui's answer, a clever trick is used to silence empty lines (NF as a selection pattern), but it has the disadvantage of removing blank lines from the original file. I propose below another solution (using, en passant, the -F option instead of passing FS as a variable) that preserves any empty line and also respects lines with fewer than 6 fields, printing them without adding extra commas:
awk -F, '{min=(NF>6?6:NF); for (i=1;i<min;i++) printf "%s,", $i; printf "%s\n", $min}' yourfile
This works nicely because printf-ing $min is never an error, even for an empty line (min is 0 and $0 is the empty string). This is true with my gawk 4.0.1, at least...
