Deleting rows from csv based on input file - linux

I have a daily process that runs on Linux and returns a set of users whose updates failed, and I need to delete these bad rows from the large user CSV before importing it into a database.
My output file contains the USER_ID for each failed user.
I'm trying to create an updated file with these removed.
I have reviewed the multitude of examples available, but none seem to work correctly. I've included a sample of the error file and the user file.
The first row is a header and should be ignored.
My error file:
"USER_ID"
"CA781558"
"LN764767"
My user file:
"USER_ID","FIRSTNAME","LASTNAME","LAST_ACTIVITY","GROUD_UID"
"CA781558","Dani","Roper","2015-07-17 19:47:21","CF93DF0A-BD23AF87D20A"
"BT055163","Alexis","Richardo","2016-04-19 21:23:08","CB71F91E-7E638292ABD5"
"LN764767","Peter","Rajosz","2016-03-18 11:59:29","973C4AD2-63BA12BB91CD"
"TN479717","Jerry","Alindos","2015-06-12 07:37:56","1DA745BA-71CB88AA91EA"
"FR915163","Alexis","Richardo","2016-04-19 21:23:08","DBA8B91E-7A6B8292ABD5"
"GB135767","Peter","Rajosz","2016-03-18 11:59:29","AE3C4AD2-63BA181B91CD"
"SG839717","Jerry","Alindos","2015-06-12 07:37:56","1BA746BA-71CB88AA91EA"
Expected Output:
"USER_ID","FIRSTNAME","LASTNAME","LAST_ACTIVITY","GROUD_UID"
"BT055163","Alexis","Richardo","2016-04-19 21:23:08","CB71F91E-7E638292ABD5"
"TN479717","Jerry","Alindos","2015-06-12 07:37:56","1DA745BA-71CB88AA91EA"
"FR915163","Alexis","Richardo","2016-04-19 21:23:08","DBA8B91E-7A6B8292ABD5"
"GB135767","Peter","Rajosz","2016-03-18 11:59:29","AE3C4AD2-63BA181B91CD"
"SG839717","Jerry","Alindos","2015-06-12 07:37:56","1BA746BA-71CB88AA91EA"
Can you help? Thank you in advance

You can use awk like this:
awk -F, 'FNR==NR{del[$1]; next} FNR==1 || !($1 in del)' err.txt file.txt
"USER_ID","FIRSTNAME","LASTNAME","LAST_ACTIVITY","GROUD_UID"
"BT055163","Alexis","Richardo","2016-04-19 21:23:08","CB71F91E-7E638292ABD5"
"TN479717","Jerry","Alindos","2015-06-12 07:37:56","1DA745BA-71CB88AA91EA"
"FR915163","Alexis","Richardo","2016-04-19 21:23:08","DBA8B91E-7A6B8292ABD5"
"GB135767","Peter","Rajosz","2016-03-18 11:59:29","AE3C4AD2-63BA181B91CD"
"SG839717","Jerry","Alindos","2015-06-12 07:37:56","1BA746BA-71CB88AA91EA"

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to extract a specific SNP at a given position for a subset of the samples. I have tried using bcftools to no avail. (If anyone can identify what went wrong with that, I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz), but it returns the following error.)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option, like this: -O something. Based on the error message you are getting, it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that; bcftools will create the output file.
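With that in mind, a corrected invocation might look like the sketch below. It assumes 722g.990.SNP.INDEL.chrAll.vcf.gz is the actual input, that it has a tabix/CSI index (needed for -r on a compressed VCF), and that the output name is just an example:
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 \
    -O z -o chr9_region.vcf.gz \
    722g.990.SNP.INDEL.chrAll.vcf.gz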
I don't have that much experience with bcftools, but generically, if you want to use awk to manipulate a gzipped file, you can pipe into it so the file is only decompressed as needed. You can also pipe the result back through gzip so it too is compressed, e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also, zcat is equivalent to gzip -cd: -c is input/output to standard out, -d is decompress.
As a side note, if you are trying to perform operations on just part of a large file, you may also find the excellent tool less useful: it can be used to view your large file while loading only the parts it needs. The -S option is particularly useful for wide formats with many columns as it stops line wrapping, as is -N for showing line numbers.
less -S largefile.vcf.gz
Quit the view with q; g takes you to the top of the file.
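Putting the pipe idea together for the region in the question, an awk sketch might look like this (it assumes the usual VCF column layout with CHROM in column 1 and POS in column 2, and that the chromosome is literally named chr9 in this file):
zcat 722g.990.SNP.INDEL.chrAll.vcf.gz \
  | awk -F'\t' '/^#/ { print; next }
                $1 == "chr9" && $2 >= 55252802 && $2 <= 55252810 { print }' \
  | gzip -c > chr9_slice.vcf.gz
This keeps the header lines and only the records in the requested position range.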

capture line and post it

There is a log file in which I need to capture specific lines, and send a specific word out of each one to a URL.
This line does the job of tailing that log file and finding that word:
tail -f /var/log/mail.log | awk '/status=bounced/ { sub(/^to=</,"",$7); sub(/>,$/,"",$7); print $7}'
Now, I need the result of $7 to be sent to some URL, presumably by using curl.
Assume that this log file will only get bigger and that this script will need to run endlessly in the background.
What's the best way to write a bash script that meets those needs?
Thanks!
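One common pattern (a sketch, not the only way; the URL and its parameter name are placeholders) is to pipe the awk output into a while read loop that calls curl for each extracted address. fflush() is added so awk does not buffer its output when writing into a pipe, and tail -F (capital F) keeps following the log across rotation:
tail -F /var/log/mail.log \
  | awk '/status=bounced/ { sub(/^to=</,"",$7); sub(/>,$/,"",$7); print $7; fflush() }' \
  | while read -r addr; do
        curl -fsS "https://example.com/bounce?address=${addr}"
    done
Run it with nohup, or wrap it in a systemd service, so it keeps running in the background.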

Need command or script to fetch only specific lines from a paragraph in a file in Unix/Linux

Need command or script to fetch only specific paragraph from a file in Unix/Linux
The file's formatting is like:
=================================
THREAD NUMBER
MESSAGE NUMBER
Severity
File_LOCATION
FUNCTION_NAME
LINE_NUMBER
TIME STAMP
BLANK LINE
SINGLE LINE ERROR TEXT
=================================
Here I want to extract only the severity and message text parts for a user-supplied severity.
Try this:
awk "/Severity: $1/,/File_LOCATION/" < file_name > out.txt
Maybe you can use cat file | tail --lines=+3 | head -1 to get the third line out of the given file. But I don't know if this is what you want or need.

awk unix insert into file location directory

In Linux, I am trying to select a value from a specific column and row of a CSV file and then use this value as the end of a file location hierarchy. When I type the following into a bash terminal window, it seems to work, printing the value from the correct row and column on screen.
awk -F "," 'FNR == 2 {print $8}' /sdata/images/projects/ASD_SSD/1/ruths_data/ruth/imaging\ study/imaging\ study\ working/delete2.csv
However, when I try to do the following substitution within a script, it fails to work:
r=2
c=8
s=awk -F "," 'FNR == $r {print $c}' /sdata/images/projects/ASD_SSD/1/ruths_data/ruth/imaging\ study/imaging\ study\ working/delete2.csv
I then try to use the s output as the end of a hierarchy file location. For example, /home/ork/js/s*
I keep getting the following error, so it looks like the s variable is not being created and therefore not inserted into the actual file location:
omitting directory `/home/ork/js/'
I have spent a few hours trying to figure out what is preventing this from working and am a new user (so I am sure it is something simple, sorry).
I hope I was clear enough, please let me know if this requires further clarification.
This is a common question here. The single quotes are protecting the variables from the shell, so they never get expanded. Also, command substitution is needed when assigning to the variable s. One way to do it would be:
s=$(awk -F, 'FNR==r{print $c}' r="$r" c="$c" file)
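For example (a sketch; the long CSV path from the question is shortened to delete2.csv, and the ls is only there to show the variable expanding inside the path):
r=2
c=8
s=$(awk -F, 'FNR==r{print $c}' r="$r" c="$c" delete2.csv)
ls -d "/home/ork/js/${s}"*
Quoting the expansion ("${s}") keeps any stray spaces in the CSV value from splitting the path.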

display specific sections of log files on linux shell

I'm searching for a way to get specific information out of a log file.
This is my log file:
------
[SQL STATEMENT
MAYBE
SEVERAL
LINES
LONG
]
ERR: [01.02.2012 14:17:44] [[SOME][MORE][INFO] additional debug informations]
[corresponding source file]
------
[SQL STATEMENT
MAYBE
SEVERAL
LINES
LONG
]
ERR: [01.02.2012 14:21:42] [[SOME][MORE][INFO] additional debug informations]
[corresponding source file]
------
[SQL STATEMENT
MAYBE
SEVERAL
LINES
LONG
]
DEBUG: [23.08.2011 22:30:01] []
[corresponding source file]
------
This log file contains debug and error information for SQL statements.
What I need is to get all blocks of SQL error messages out of this log file.
These blocks are separated by lines of '------'.
As in the first entry of the file, the error messages are marked by an 'ERR:' in the message block.
How can I get these messages out of the file?
I don't want to write a special script for that kind of task.
So it would be nice if this could be done with command line tools.
Thanks for any help.
awk can do it for you:
awk 'BEGIN { RS="------" ; ORS=RS}
$0 ~ "ERR: " { print }' INPUTFILE
This will print the ERR: blocks. If you want the others instead, just replace ~ with !~.
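The same script with comments (a sketch; a multi-character RS like "------" is a gawk extension, so this assumes GNU awk):
awk 'BEGIN { RS = "------"; ORS = RS }   # treat each ------ separated chunk as one record
     $0 ~ "ERR: " { print }              # print only the chunks containing an ERR: line
    ' INPUTFILE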
You can use grep:
grep ERR: filename
