How do I create a bash script to duplicate a file, swap first two columns and append to bottom of original file?

I would like to write a bash script to duplicate the product feed, swap the first two columns (sku, productId), and append the result to the original feed. This is what I have so far, but it does not seem to be working.
Duplicate the feed
Swap the first two columns
Append it to the original feed
1. Duplicate the feed: cd /var/ftp/JNM-01-020420/inbound/product-feed/en_US && cp ./*.csv /var/ftp/JNM-01-020420/inbound/product-feed/en_US/tmp
2. Swap the columns: awk '{t=$1; $1=$2; $2=t; print;}' ./tmp
3. Append to the original feed: ./tmp >> ./*.csv
Example of product feed for reference

Tested on a 150 MB file:
awk 'BEGIN {FS=OFS=","} {print $2, $1}' inputfile.csv > col2-1.csv
cat col2-1.csv >> inputfile.csv
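If the feed has more columns than just sku and productId, the same two-step idea can keep the rest of each row by swapping the first two fields in place, as the awk in the question already does. A minimal sketch, assuming a comma-separated feed named inputfile.csv as above (add NR>1 to the awk pattern if the feed has a header row that should not be duplicated):

# swap the first two fields of every row, keeping the remaining columns
awk 'BEGIN{FS=OFS=","} {t=$1; $1=$2; $2=t; print}' inputfile.csv > swapped.tmp
# append the swapped copy to the original feed, then clean up
cat swapped.tmp >> inputfile.csv
rm swapped.tmp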

Related

BASH Split CSV Into Multiple Files Based on Column Value [duplicate]

I have a file named fulldata.tmp which contains pipe-delimited data (I can change it to comma if needed, but I generally like using pipe). With a bash shell script I would like to split the lines out into new files based on the value in column 1, and retain the header. I'm pulling this data via SQL, so I can pre-sort if needed, but I don't have direct access to the terminal running the script, so development and debugging are difficult. I've searched dozens of examples, mostly recommending awk, but I'm not connecting the dots. This is my core need; below are a couple of quality-of-life options I'd like if they're easy, along with example data.
Nice if possible: I would like to specify which columns print to the new files (my example desired output shows I want columns 1-4 out of the initial 5 columns).
Nice if possible: I would like the new files named with a prefix then the data that is being split on followed by extension: final_$col1.csv
Example data (fulldata.tmp):
GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future
Desired output:
final_ABC.csv
ABC|001|2022-09-15|True
final_DEF.csv
DEF|001|2022-09-16|False
final_GHI.csv
GHI|002|2022-10-17|True
Maybe awk
awk -F'|' -v OFS='|' 'NR>1{print $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp
Check the created csv files and their content:
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
ABC|001|2022-09-15|True
==> final_DEF.csv <==
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GHI|002|2022-10-17|True
Here is how I would do the header.
IFS= read -r head < fulldata.tmp
Then pass the variable to awk:
awk -F'|' -v header="${head%|*}" 'NR>1{printf "%s\n%s|%s|%s|%s\n", header, $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp
Run tail again to check.
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
GROUPID|LABEL|DATE|ACTIVE
ABC|001|2022-09-15|True
==> final_DEF.csv <==
GROUPID|LABEL|DATE|ACTIVE
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GROUPID|LABEL|DATE|ACTIVE
GHI|002|2022-10-17|True
You did find a solution with pure awk.
This works and preserves the header, which I believe was a requirement:
cut -d '|' -f 1 fulldata.tmp | grep -v GROUPID | sort -u | while read -r id; do grep -E "^${id}|^GROUPID" fulldata.tmp > final_${id}.csv; done
I think a pure awk solution is better though.
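For reference, here is one way such a single-pass pure-awk solution might look. This is a sketch assuming GNU awk (or any awk that tolerates several open output files) and that a group can span more than one row, so the trimmed header is written only once per output file:

awk -F'|' -v OFS='|' '
NR==1 { header = $1 OFS $2 OFS $3 OFS $4; next }               # keep only columns 1-4 of the header
{
    out = "final_" $1 ".csv"
    if (!(out in seen)) { print header > out; seen[out] = 1 }  # write the header once per file
    print $1, $2, $3, $4 > out
}' fulldata.tmp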

Rename file as third word on it (bash)

I have several autogenerated files and I want to rename each of them according to the 3rd word in the first line (in this case, that would be 42.txt).
First line:
ligand CC##HOc3ccccc3 42 P10000001
Is there a way to do it?
Say you have file.txt containing:
ligand CC##HOc3ccccc3 42 P10000001
and you want to rename file.txt to 42.txt based on the 3rd field in the file.
Using awk
The easiest way is simply to use mv with awk in a command substitution, e.g.:
mv file.txt $(awk 'NR==1 {print $3; exit}' file.txt).txt
Where the command substitution $(...) is just the awk expression awk 'NR==1 {print $3; exit}' that outputs the 3rd field (e.g. 42). Specifying NR==1 ensures only the first line is considered, and the exit at the end of that rule ensures no more lines are processed, which avoids wasted time if file.txt is a 100,000-line file.
Confirmation
file.txt is now renamed 42.txt, e.g.
$ cat 42.txt
ligand CC##HOc3ccccc3 42 P10000001
Using read
You can also use read to read the first line, take the 3rd word as the name, and then mv the file, e.g.
$ read -r a a name a <file.txt; mv file.txt "$name".txt
The temporary variable a above is just used to read and discard the other words in the first line of the file.
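Since the question mentions several autogenerated files, either one-liner can be wrapped in a loop. A small sketch, assuming the files all end in .txt, sit in the current directory, and have a 3rd word that is safe to use as a file name:

for f in *.txt; do
    read -r _ _ name _ < "$f"     # 3rd word of the first line
    [ -n "$name" ] || continue    # skip files without a 3rd word
    mv -n -- "$f" "$name.txt"     # -n: never overwrite an existing file
done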

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here about removing unwanted lines that contained strings matching a particular pattern:
Remove lines containing a string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and will be initiated by crontab. For this reason, I would like to keep a record of the lines this command is removing, just so I can go back and check that the correct data is being removed. I guess it will be some sort of log containing the names of the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for the word boundary (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
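As with the original grep, the lines that are kept still go to standard output, so the same redirection applies if a cleaned copy of the file is wanted:

awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile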

Modification of file names

I have a list of more than 1000 files in the following format.
0521865417_roman_pottery_in_the_archaeological_record_2007.pdf
0521865476_power_politics_and_religion_in_timurid_iran_2007.pdf
0521865514_toward_a_theory_of_human_rights_religion_law_courts_2006.pdf
0521865522_i_was_wrong_the_meanings_of_apologies_2008.pdf
I am on Linux and want to change them as follows
2007_roman_pottery_in_the_archaeological_record.pdf
2007_power_politics_and_religion_in_timurid_iran.pdf
2006_toward_a_theory_of_human_rights_religion_law_courts.pdf
2008_i_was_wrong_the_meanings_of_apologies.pdf
Using rename and awk I managed to get
2007_roman_pottery_in_the_archaeological_record_2007.pdf
2007_power_politics_and_religion_in_timurid_iran_2007.pdf
2006_toward_a_theory_of_human_rights_religion_law_courts_2006.pdf
2008_i_was_wrong_the_meanings_of_apologies_2008.pdf
The remaining task is now to remove the last field that holds the year.
A solution that uses sed to generate the mv commands and then pipes them to bash (the year is the last 4-digit field before .pdf):
ls -1 | sed -r 's/^[0-9]+_([A-Za-z_]+)_([0-9]{4})\.pdf$/mv & \2_\1.pdf/' | bash
A workaround from where you left off:
echo 2007_roman_pottery_in_the_archaeological_record_2007.pdf | awk -F'_' 'BEGIN{OFS="_"} {$NF=""; print substr($0, 1, length($0)-1) ".pdf"}'
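To actually rename the files rather than just print the new name, the trailing year field can also be stripped with plain parameter expansion in a loop. A sketch, assuming the current directory holds only the already-prefixed files and each name ends in _YYYY.pdf:

for f in *_[0-9][0-9][0-9][0-9].pdf; do
    mv -n -- "$f" "${f%_*}.pdf"   # drop the trailing _YYYY.pdf, then re-append .pdf
done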

Search for lines in a file that contain the lines of a second file

So I have a first file with an ID on each line, for example:
458-12-345
466-44-3-223
578-4-58-1
599-478
854-52658
955-12-32
Then I have a second file. It has an ID on each line followed by information, for example:
111-2457-1 0.2545 0.5484 0.6914 0.4222
112-4844-487 0.7475 0.4749 0.1114 0.8413
115-44-48-5 0.4464 0.8894 0.1140 0.1044
....
The first file only has 1000 lines, with the IDs of the info I need, while the second file has more than 200,000 lines.
I used the following bash command in a fedora with good results:
cat file1.txt | while read line; do cat file2.txt | egrep "^$line\ "; done > file3.txt
However I'm now trying to replicate the results in Ubuntu, and the output is a blank file. Is there a reason for this not to work in Ubuntu?
Thanks!
You can grep for several strings at once:
grep -f id_file data_file
Assuming that id_file contains all the IDs and data_file contains the IDs and data.
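If the match should stay anchored to the start of the line, as the original egrep "^$line " was, the anchored patterns can be built on the fly. A sketch, assuming bash (for the process substitution) and that each ID in file2.txt is followed by a space:

grep -E -f <(sed 's/.*/^& /' file1.txt) file2.txt > file3.txt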
Typical job for awk:
awk 'FNR==NR{i[$1]=1;next} i[$1]{print}' file1 file2
This will print the lines from the second file whose first field (the ID) appears in the first one. For even more speed, use mawk.
This line works fine for me in Ubuntu:
cat 1.txt | while read line; do cat 2.txt | grep "$line"; done
However, this may be slow, as the second file (200,000 lines) will be grepped 1,000 times (once for each line of the first file).
