Splitting text files column-wise - Linux

So I have an invoice that I need to make a report out of. It is, on average, about 250 pages long. I'm trying to create a script that would extract specific values from the invoice and make a report. Here's my problem:
The invoice is in PDF format and spans two columns. I want to use the Linux 'pdftotext' command to convert it into multiple text files (with each text file representing one PDF page). How do I do that?
I recognize that the 'pdftotext' command separates the left part of the page from the right part with 21 spaces in between. How do I move the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I reach a keyword (let's just say the keyword = Start Invoice)?
I know this is a lot of questions, but I'm confused about what Linux commands can do. Can you guys guide me in the right direction? Thanks
PS: I'm using CentOS 5.2

What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
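For the per-page split: pdftotext (from xpdf/poppler-utils) takes -f and -l to pick a page range, and pdfinfo reports the page count, so one text file per page is a short loop. A rough sketch, with YOUR.pdf and the page_N.txt names as placeholders:

pages=$(pdfinfo YOUR.pdf | awk '/^Pages:/ {print $2}')
for ((p = 1; p <= pages; p++)); do
    pdftotext -f "$p" -l "$p" YOUR.pdf "page_$p.txt"
done

And for dropping the leading pages, one option is to delete the page files in order until grep finds the keyword:

for ((p = 1; p <= pages; p++)); do
    grep -q 'Start Invoice' "page_$p.txt" && break   # stop at the keyword
    rm "page_$p.txt"                                 # delete pages before it
done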

Related

Command line tool to merge text files, keeping the common content and adding the rest of the content to it

I have a folder with notes on different topics, say ~/notes/topica.txt, ~/notes/topicb.txt, etc. I have been using them on different computers for some time now, so their content has diverged.
So I now have something like:
#computerA:
cat ~/notes/topica.txt
Lines added in Computer A
----------
Some common text
#computerB:
cat ~/notes/topica.txt
Lines added in Computer B
----------
Some common text
I want to be able to merge them in a way that I get
cat ~/notes/topica.txt
Lines added in Computer A
Lines added in Computer B
----------
Some common text
I have tried some tools like diff, but they don't achieve what I want.
I also need to do it for an entire folder with a bunch of files. (I'm alright with writing a bash script.)
Thanks for your help
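One way to attack this, assuming each file contains exactly one '----------' separator and everything below it is identical on both machines (the /mnt/computerB path and output names are only examples):

#!/bin/bash
# For every note, keep A's unique head, then B's unique head,
# then the shared tail (separator onward) from either copy.
for f in ~/notes/*.txt; do
    b="/mnt/computerB/notes/$(basename "$f")"    # hypothetical path to B's copy
    out="/tmp/merged_$(basename "$f")"
    sed '/^----------$/,$d' "$f"    >  "$out"    # lines above the separator in A
    sed '/^----------$/,$d' "$b"    >> "$out"    # lines above the separator in B
    sed -n '/^----------$/,$p' "$f" >> "$out"    # separator plus common text
done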

How to remove the nth occurrence of a substring from each line on four 100GB files

I have four 100 GB CSV files in which two fields need to be concatenated. Luckily, the two fields are next to each other.
My thought is to remove the 41st occurrence of "," from each line, after which my two fields will be properly joined and ready to upload to the analytical tool I use.
The development machine is a Windows 10 box with 4 × 3.6 GHz cores and 64 GB RAM, and I push the file to a server running CentOS 7 with 40 × 2.4 GHz cores and 512 GB RAM. I have sudo access on the server and can technically change the file there if someone has a solution that depends on Linux tools. The idea is to accomplish the task in the fastest/easiest way possible. I have to repeat this task monthly and would be ecstatic to automate it.
My original way of accomplishing this was to load the CSV into MySQL, concatenate the fields, and remove the old ones, then export the table as a CSV again and push it to the server. This takes two days and is laborious.
Right now I'm torn between learning to use sed and using something I'm more familiar with, like node.js, to stream the files line by line into a new file and then push those to the server.
If you recommend using sed, I've read here and here but don't know how to remove the nth occurrence from each line.
Edit: Cyrus asked for a sample input/output.
Input file formatted thusly:
"field1","field2",".........","field41","field42","......
Output file formatted like so:
"field1","field2",".........","field41field42","......
If you want to remove the 41st occurrence of "," then you can try:
sed -i 's/","//41' file
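One caveat at this file size: sed -i writes out a full copy of the file anyway, so streaming into a new file costs nothing extra, and forcing the C locale usually speeds up byte-oriented matching considerably. A sketch with placeholder file names:

LC_ALL=C sed 's/","//41' input.csv > output.csv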

How to copy web page content to a file via the Linux terminal?

I want to copy some text from a web page to a file in Linux. I know "wget" can be used to download files, but the data I want is not stored in files, so when I want it I have to copy and paste manually, which is very difficult for thousands of web pages.
For example, I need to have the data in the link below:
http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2017&MONTH=09&FROM=0112&TO=0112&STNM=72672
and similar links with varying YEAR, MONTH, FROM, TO, and STNM values.
Is there any command/script to copy and paste automatically?
First, make a file with all of the year, month, from, to, and stnm values, one line per request:
inputFile.txt:
2017,09,0112,0112,72672
2017,08,0112,0112,72672
In a shell script, loop through that file line by line and execute wget, replacing your hardcoded values with variables filled from each line:
#!/bin/bash
# Read the five comma-separated fields from each line of inputFile.txt
while IFS=, read -r year month from to stnm; do
    wget "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=$year&MONTH=$month&FROM=$from&TO=$to&STNM=$stnm"
done < inputFile.txt
That's the bare-bones version; I'm certain it could use some tweaks to get up and running, but it should be close.
Execute the shell script:
bash whateveryounamedthisscript.sh
In this example two new files will be generated, one for September and another for August.
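Left alone, wget names each download after the full query string; if that gets unwieldy, the -O flag lets you pick the name yourself (the naming pattern below is just an example):

wget -O "sounding_${stnm}_${year}${month}.txt" \
    "http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=$year&MONTH=$month&FROM=$from&TO=$to&STNM=$stnm"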

grab 2 numbers from file name then insert into command

I'm a bit new to programming in general, and I'm not sure how to go about accomplishing this task in my bash script.
A quick background: when importing my music library (formerly organized by iTunes) into Banshee, all of the files were duplicated to fit Banshee's numbering style (e.g. "02. " instead of "02 "). On top of that, iTunes apparently did not save the ID3 tags to the files, so many of them are blank. So now I've got a few thousand tags to fix and duplicate files to get rid of.
To automate the process, I started learning to write bash scripts. I came up with a script (which you can see here) that does four things: removes unnecessary iTunes files; takes input from the user about ID3 tag information and stores it in variables; clears any existing tag info from all files; and writes new tags with the info taken from the user, using a program called eyeD3.
Now, here's where I run into my problem. This script blindly writes info to all MP3 files in the directory. That's fine for tags all the files have in common, like artist, album, total tracks, year, etc. But I can't tag each individual track number with this method, so I'm still editing the track-number tags one at a time, manually. And that's something I really don't want to do 2,000+ times.
The file names all look like this:
01. song1.mp3
02. song2.mp3
03. song3.mp3
The command to write a track number to a tag looks like this:
$ eyeD3 -n 1 "01. song1.mp3"
So... I'm not sure how to go about automating this. I need to grab the first two digits of each file name, store them somewhere, then recall each one into a separate eyeD3 command.
You can loop over the files using globbing, and use substring expansion to capture the first two characters of the filename:
for f in *.mp3; do
    eyeD3 -n "${f:0:2}" "$f"
done
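If the directory might hold files that don't start with a two-digit track number, a small guard keeps eyeD3 from being fed a bogus value (a defensive variant, not strictly required by the question):

for f in *.mp3; do
    [[ $f =~ ^[0-9]{2}\. ]] || continue   # skip names without a leading "NN."
    eyeD3 -n "${f:0:2}" "$f"
done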

Download batch files from a website using Linux

I want to download some files (roughly 1000-2000 zip files) from a website.
I could sit around and add each file one after another, but please give me a program, script, or whatever method so I can automate the download.
The website I am talking about has download links like:
sitename.com/sometetx/date/12345/folder/12345_zip.zip
The date can be taken care of. The main concern is the number 12345 before and after "folder": they both change simultaneously, e.g.
sitename.com/sometetx/date/23456/folder/23456_zip.zip
sitename.com/sometetx/date/54321/folder/54321_zip.zip
I tried using curl:
sitename.com/sometetx/date/[12345-54321]/folder/[12345-54321]_zip.zip
but it generates too many combinations of downloads, i.e. it keeps the left 12345 as it is and scans the right one through 12345 to 54321, then increments the left 12345 by 1 and repeats the scan from [12345-54321].
I also tried wget in bash.
Here I have one variable in two places; when using a loop, the right 12345 followed by "_" is ignored by the program (the shell reads $i_zip as a single variable name).
Please help me; I don't know much about Linux or programming. Thanks.
To keep your loop variable next to _ from being ignored by the shell, put it in quotes, like this:
for ((i=10000; i < 99999; i++)); do
    wget sitename.com/sometetx/date/$i/folder/"$i"_zip.zip
done
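Braces work as well as quotes here, because ${i} marks exactly where the variable name ends, so _zip is no longer swallowed into it:

for ((i=12345; i <= 54321; i++)); do
    wget "sitename.com/sometetx/date/$i/folder/${i}_zip.zip"
done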
