How to remove the nth occurrence of a substring from each line of four 100GB files - node.js

I have four 100GB CSV files where two fields need to be concatenated. Luckily the two fields are next to each other.
My thought is to remove the 41st occurrence of "," from each line; then my two fields will be properly united and ready to be uploaded to an analytical tool that I use.
The development machine is a Windows 10 machine with 4 x 3.6GHz and 64GB RAM, and I push the file to a server on CentOS 7 with 40 x 2.4GHz and 512GB RAM. I have sudo access on the server and can technically change the file there if someone has a solution that depends on Linux tools. The idea is to accomplish the task in the fastest/easiest way possible. I have to repeat this task monthly and would be ecstatic to automate it.
My original way of accomplishing this was to load the CSV into MySQL, concatenate the fields, and remove the old fields, then export the table as a CSV again and push it to the server. This takes two days and is laborious.
Right now I'm torn between learning to use sed or using something I'm more familiar with, like node.js, to stream the files line by line into a new file and then push those to the server.
If you recommend using sed, I've read here and here but don't know how to remove the nth occurrence from each line.
Edit: Cyrus asked for a sample input/output.
Input file formatted thusly:
"field1","field2",".........","field41","field42","......
Output file formatted like so:
"field1","field2",".........","field41field42","......

If you want to remove the 41st occurrence of "," then you can try:
sed -i 's/","//41' file
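If you would rather stream the files from a script you can automate monthly, here is a minimal line-by-line sketch, written in Python for brevity (the same pattern works with node.js readline streams). The file names are placeholders:
#!/usr/bin/env python3
# Stream a huge CSV line by line, dropping the 41st occurrence of '","'
# so that field41 and field42 are joined. File names are placeholders.
SEP = '","'
N = 41  # the occurrence to remove

with open("input.csv", "r", encoding="utf-8") as src, \
     open("output.csv", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.split(SEP, N)  # split on at most the first N separators
        if len(parts) == N + 1:
            # Rejoining the first N pieces restores separators 1..40;
            # appending the remainder directly drops separator 41.
            line = SEP.join(parts[:N]) + parts[N]
        dst.write(line)
Because only one line is held in memory at a time, this handles 100GB files without any special tuning.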

Related

Partially expand VCF bgz file in Linux

I have downloaded gnomAD files from - https://gnomad.broadinstitute.org/downloads
This is the bgz file
https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.2.vcf.bgz
When I expand using:
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz > gnomad.genomes.r2.1.1.sites.2.vcf
The output VCF file becomes more than 330GB. I do not have that kind of space available on my laptop.
Is there a way where I can just expand - say 1 GB of the bgz file OR just 100000 rows from the bgz file?
From what I've been able to determine, a bgz file is compatible with gzip, and a VCF file is a plain text file. Since it's a gzip file, and not a .tar.gz, the solution doesn't require listing any archive contents, and simplifies things a bit.
This can probably be accomplished in several ways, and I doubt this is the best way, but I've been able to successfully decompress the first 100,000 rows into a file using the following code in python3 (it should also work under earlier versions back to 2.7):
#!/usr/bin/env python3
import gzip

# A .bgz file is gzip-compatible, so GzipFile can stream-decompress it.
ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")
LINES_TO_EXTRACT = 100000
for _ in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())
ifile.close()
ofile.close()
I tried this on your example file, and the truncated file is about 1.4 GiB. It took about 1 minute, 40 seconds on a Raspberry Pi-like computer, so while it's slow, it's not unbearably so.
While this solution is somewhat slow, it's good for your application for the following reasons:
It minimizes disk and memory usage, which could otherwise be problematic with a large file like this.
It cuts the file to exactly the given number of lines, which avoids truncating your output file mid-line.
The three input parameters can be easily parsed from the command line in case you want to make a small CLI utility for parsing other files in this manner.
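For example, a hypothetical command-line wrapper along those lines (script name and argument order are illustrative):
#!/usr/bin/env python3
# Hypothetical CLI wrapper: extract the first N lines of a gzip/bgz file.
# Usage: ./truncate_bgz.py input.vcf.bgz output.vcf 100000
import gzip
import sys

def main():
    in_path, out_path, n_lines = sys.argv[1], sys.argv[2], int(sys.argv[3])
    with gzip.open(in_path, "rb") as ifile, open(out_path, "wb") as ofile:
        for _ in range(n_lines):
            line = ifile.readline()
            if not line:  # stop cleanly if the file is shorter than N lines
                break
            ofile.write(line)

if __name__ == "__main__":
    main()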

How to stream log file content from constantly changing file names in Perl?

I have a series of applications on Linux systems whose logs I need to constantly 'stream' out, or even just 'tail' out, but the challenge is that the file names are constantly rolling and changing.
They are all date-encoded (the dates being in different formats), and each then has a different increment format.
Most of them start at one and increase, but one has no extension on the first file and adds an extension from the second file onward, and another increments a number but, once it hits 99, rolls over to incrementing a letter, resets the number to 01, and climbs again, as it rolls so quickly.
I only have OS-level shell scripting, OS command-line utilities, and Perl available to handle this situation for another application that picks up and reads these logs.
A new file is always created right when writing to it starts, and groups of different logs (some I am reading, some I am not) are written to the same directory, so I cannot just pick up anything that hits the directory.
If I simply 'tail -n 1000000 -f |' them today, this works fine for the reader application I am using until the file changes. I cannot set up file-list ranges within the reader application, but I can pre-process the logs so they appear as a continuous stream to the reader instead of the reader directly invoking commands to read them. A simple Perl log reader like this also works fine for a static filename, but not for dynamic ones. It is critical that I don't re-process any log lines and only capture new lines being written to the logs.
I admit I am not any kind of Perl guru, and the best answer/clue I've been able to find so far is Perl's glob function, but the examples I've found basically reprocess all of the files on each run and then seem to stop.
Example file names I am dealing with across the multiple apps I am trying to handle:
appA_YYMMDD.log
appA_YYMMDD_0001.log
appA_YYMMDD_0002.log
WS01APPB_YYMMDD.log
WS02APPB_YYMMDD.log
WS03AppB_YYMMDD.log
APPCMMDD_A01.log
APPCMMDD_B01.log
YYYYMMDD_001_APPD.log
As the listing below shows, the files do not have the same inode, and simply monitoring the directory for changes is not possible, as a lot of things are written there. On the dev system more than 50 logs are being written to the directory, with thousands of files, and I am only trying to retrieve 5. I am seeing if multitail can be made available to try that suggestion, but it is not currently available, and installing any additional RPMs in the environment is generally a multi-month battle.
ls -i
24792 APPA_180901.log
24805 APPA__180902.log
17011 APPA__180903.log
17072 APPA__180904.log
24644 APPA__180905.log
17081 APPA__180906.log
17115 APPA__180907.log
So the root of what I am trying to do is get a continuous stream regardless of file-name changes, without running the extract command repeatedly and without big breaks in the data feed while some script figures out that the file being logged to has changed. I don't need to parse the contents (my other app does that). Is there an easy way to handle these changing file names?
How about monitoring the log directory for changes with Linux inotify, e.g. Linux::inotify2? Then you could detect when new log files are created, stop reading from the old log file and start reading from the new log file.
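A rough Python sketch of that idea, assuming the third-party inotify_simple package is available (the Perl equivalent would be Linux::inotify2); the directory and pattern are placeholders:
#!/usr/bin/env python3
# Sketch: watch a log directory with inotify, switching to each new log
# file as it is created. Directory and pattern are placeholders.
import fnmatch
import os
from inotify_simple import INotify, flags

LOG_DIR = "/var/log/myapp"
PATTERN = "appA_*.log"

inotify = INotify()
inotify.add_watch(LOG_DIR, flags.CREATE)

current = None  # handle for the log file currently being tailed
while True:
    # Switch to a newly created file if one matching the pattern appears.
    for event in inotify.read(timeout=1000):
        if fnmatch.fnmatch(event.name, PATTERN):
            if current:
                current.close()
            current = open(os.path.join(LOG_DIR, event.name), "r")
    # Drain whatever new lines are available from the current file.
    while current:
        line = current.readline()
        if not line:
            break
        print(line, end="")  # hand the line to the reader application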
Try tailswitch. I created this script to tail log files that are rotated daily and have YYYY-MM-DD in their names. To use this script, you just say:
% tailswitch '*.log'
The quoting prevents the shell from interpreting the glob pattern. The script evaluates the glob pattern from time to time and switches to a newer file based on its name.
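For comparison, here is a stdlib-only Python sketch of the same glob-and-switch idea (the pattern and polling interval are illustrative):
#!/usr/bin/env python3
# Sketch: tail whichever file currently sorts last for the pattern,
# switching when a newer name appears. Assumes names sort chronologically
# (true for YYMMDD/YYYYMMDD-style names).
import glob
import time

PATTERN = "appA_*.log"  # illustrative pattern
current_name = None
current = None

while True:
    candidates = sorted(glob.glob(PATTERN))
    if candidates and candidates[-1] != current_name:
        # A newer file exists; a production version should drain the old
        # file before switching.
        if current:
            current.close()
        current_name = candidates[-1]
        current = open(current_name, "r")
    if current:
        line = current.readline()
        if line:
            print(line, end="")  # forward to the reader application
            continue             # keep draining without sleeping
    time.sleep(1)                # nothing new yet; poll again shortly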

Is there a way, using REXX, to edit a PS dataset and insert a string after a particular line?

I am writing a REXX program which will update a PS dataset. I can edit a particular line using my REXX code, but I also want to insert a particular string after a particular line.
For example: my PS dataset has 100 lines. I want to insert the text "ABCDE" after the 44th line (as the new 45th line), which will increase the total length of the file to 101 lines. The remaining lines should remain unchanged. Is this possible using REXX?
Independent of REXX, you effectively need to read the old dataset, write it out to a new file, add your new record (string) at the right point, and then write the rest. There is no way to "insert" a record into a Physical Sequential (PS) dataset. At the end you would delete the old dataset and rename the newly created file to the old name.
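To illustrate the copy-with-insert logic only (on z/OS you would express the same thing with REXX EXECIO or a batch copy utility; the file names and line number here are placeholders):
#!/usr/bin/env python3
# Illustration of copy-with-insert: there is no in-place insert, so read
# the old file, write a new one, and add the record after line 44.
INSERT_AFTER = 44
NEW_RECORD = "ABCDE\n"

with open("old.dataset", "r") as src, open("new.dataset", "w") as dst:
    for lineno, record in enumerate(src, start=1):
        dst.write(record)
        if lineno == INSERT_AFTER:
            dst.write(NEW_RECORD)  # becomes the new line 45
# Afterwards: delete the old file and rename the new one to the old name.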
Another option would be to use a generation dataset group (GDG) and read the current (0) and create the new (+1) as the output. This way you still are referring to the same dataset name for others to reference.
What @Hogstrom suggests is a good solution to the problem you describe. In the interest of completeness, here is a solution that may be necessary under extreme circumstances.
Create an edit macro...
/*REXX*/
ADDRESS ISREDIT 'MACRO NOPROCESS'
aLine = 'ABCDE'
ADDRESS ISREDIT 'LINE_AFTER 44 = DATALINE (ALINE)'
...and run ISPF edit in batch, executing this macro.
The JCL to run ISPF in batch is shop-specific, but many shops have created a cataloged procedure to do so.
If you are willing to copy your dataset to the z/Unix file system, you could also use sed or awk to make your changes.
I'm not recommending any of this; I'm just pointing out that it can be done if @Hogstrom's solution won't work for you for some reason.

Splitting text files column-wise

So I have an invoice that I need to make a report out of. It is, on average, about 250 pages long. I'm trying to create a script that extracts specific values from the invoice and builds a report. Here are my problems:
The invoice is in PDF format and spans two columns. I want to use the Linux command 'pdftotext' to convert it into multiple text files (each txt file representing one PDF page). How do I do that?
I recognize that the 'pdftotext' command separates the left part of the page from the right part with 21 spaces in between. How do I append the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I read a keyword (let's say the keyword is "Start Invoice")?
I know this is a lot of questions, but I'm confused about what the Linux commands can do. Can you guys point me in the right direction? Thanks.
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
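If sed feels opaque, here is a Python sketch of the same idea: split each line on a run of 21 or more spaces, write all the left-column text first, then append the right-column text (file names are placeholders):
#!/usr/bin/env python3
# Sketch: split two-column pdftotext output on runs of 21+ spaces,
# writing the whole left column first, then the whole right column.
import re

SPLIT = re.compile(r" {21,}")  # 21 or more consecutive spaces

left_parts, right_parts = [], []
with open("invoice.txt", "r") as src:
    for line in src:
        pieces = SPLIT.split(line.rstrip("\n"), maxsplit=1)
        left_parts.append(pieces[0])
        right_parts.append(pieces[1] if len(pieces) > 1 else "")

with open("output.txt", "w") as dst:
    dst.write("\n".join(left_parts) + "\n")
    dst.write("\n".join(right_parts) + "\n")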

Shortening a large CSV on Debian

I have a very large CSV file and I need to write an app that will parse it, but using the >6GB file to test against is painful. Is there a simple way to extract the first hundred or two hundred lines without loading the entire file into memory?
The file resides on a Debian server.
Did you try the head command?
head -200 inputfile > outputfile
head -10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv
"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.
