Combine lots of CSV files in CentOS? - linux

I have a CentOS machine and I want to combine .csv data.
I have thousands of small documents all with the same column information.
How would I go about combining all of them into files of up to 20 MB in size?
For example, 1.csv would combine the first few files, and once the 20 MB limit is reached the data would continue into 2.csv, and so on.
Any help is greatly appreciated

If they don't have headers, something as simple as:
$ cat *.csv > combined.csv
would work, run from within the directory containing the files (assuming you want them in the order returned by ls *.csv).

You can achieve what you want with a simple tail command:
tail -q -n+2 *.csv
You only need to add the proper header line afterward.
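Building on that, here is a rough sketch of the 20 MB split (untested; it assumes every file shares one identical header line, and uses split's -C option so no CSV row gets cut in half):
files=(*.csv)
header=$(head -n 1 "${files[0]}")   # take the header from the first file
# strip the headers, concatenate, and cut into ~20 MB pieces named part_aa, part_ab, ...
tail -q -n +2 *.csv | split -C 20M - part_
# put the header back on each piece and rename them 1.csv, 2.csv, ...
i=1
for f in part_*; do
  { echo "$header"; cat "$f"; } > "$i.csv"
  rm "$f"
  i=$((i + 1))
done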

You might want to look at the join utility: https://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html#join-invocation

Related

awk/sed/grep command to compare the contents of three files

Hi, I am trying to automate some data entry, and I am using a TCP server/client to send filenames around so that another server can go into a repository and pull those files. As part of testing this, I am logging the filenames that are supposed to be sent and the ones that were received, and when a file is received I send a reply back with its filename.
so I have three text files with file names inside of them.
SupposedToSend.txt
Recieved.txt
GotReplyFor.txt
I know that awk could do what I am trying to do, but I am not sure how to set it up. I need to compare the three files for elements that do not exist in the other files, so if an entry is missing from any file I need to know which entry and from which file.
I could write a program for this, but it would take much longer to write and to run, since these files are getting 5 elements/minute dumped into them.
paste -d '\n' SupposedToSend.txt Recieved.txt GotReplyFor.txt | uniq -c | grep -v '^ 3'
It's tolerable if you have no errors, but deeply suboptimal otherwise, or if the data in the different files is out of sequence (in which case you might need to sort them somehow).
Or you could just run diff3 to compare 3 files...
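If the files can drift out of sequence, one sketch (not tested against real data) is to sort copies and compare them pairwise with comm, which needs sorted input anyway:
sort SupposedToSend.txt > sent.sorted
sort Recieved.txt > recv.sorted
sort GotReplyFor.txt > reply.sorted
# lines only in the first file: supposed to be sent but never received
comm -23 sent.sorted recv.sorted
# lines only in the first file: received but never replied to
comm -23 recv.sorted reply.sorted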

Download multiple files, with different final names

OK, what I need is fairly simple.
I want to download LOTS of different files (from a specific server) via cURL, and I want to save each one of them under a specific new filename on disk.
Is there an existing way (parameter, or whatever) to achieve that? How would you go about it?
(If there were an option to input all URL-filename pairs in a text file, one per line, and have cURL process it, that would be ideal.)
E.g.
http://www.somedomain.com/some-image-1.png --> new-image-1.png
http://www.somedomain.com/another-image.png --> new-image-2.png
...
OK, just figured a smart way to do it myself.
1) Create a text file with pairs of URL (what to download) and filename (how to save it to disk), separated by a comma (,), one pair per line, and save it as input.txt.
2) Use the following simple Bash script:
while read -r line; do
  IFS=',' read -ra PART <<< "$line"
  curl "${PART[0]}" -o "${PART[1]}"
done < input.txt
*Haven't thoroughly tested it yet, but I think it should work.
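For the record, curl itself can read URL/output pairs from a config file, which is close to the "text file of pairs" you wished for. A sketch (the file name urls.txt is just an example):
# contents of urls.txt -- one url/output pair per download
url = "http://www.somedomain.com/some-image-1.png"
output = "new-image-1.png"
url = "http://www.somedomain.com/another-image.png"
output = "new-image-2.png"
Then run curl -K urls.txt (or curl --config urls.txt) and it processes the whole list in one go.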

Concatenate PDFs while preserving rank in list

I am currently struggling to concatenate my various PDF files into one file in an automated way while at the same time preserving the order the files are provided in.
The main problem is that I include a rank for each file (they are visualizations of list items), currently ranging from 1 to 100. If I run
pdftk *.pdf cat output all.pdf
the combined PDF pages will not be ordered from 1 to 100. My PDFs are named similarly to the following example; note that "rank_XXX" obviously determines their rank in the list. However, the fact that 10 and 100 are listed before 2 in the terminal messes up my sorting. I was thinking that ls -v could be useful for piping the filenames into pdftk or a similar tool, but I could not get it working.
rank_1_XYZ_123123A.pdf
rank_1_XYZ_123123B.pdf
rank_2_XYZ_123141A.pdf
rank_2_XYZ_123141B.pdf
rank_3_ABC_394124A.pdf
rank_3_ABC_394124B.pdf
...
rank_10_XYZ_129123A.pdf
rank_10_XYZ_129123B.pdf
...
rank_100_ZZZ_929123A.pdf
rank_100_ZZZ_929123B.pdf
I managed to get at least partially what I want by using
pdftk rank_[1-9]*.pdf cat output all.pdf
Nevertheless, this somehow does not work for numbers larger than 9.
Any help is greatly appreciated.
ls -v seems to do the job:
pdftk `ls -v` cat output all.pdf
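A small refinement of that (untested): restrict the listing to the ranked files so any stray PDF in the directory doesn't sneak into the output. This assumes the filenames contain no spaces:
pdftk $(ls -v rank_*.pdf) cat output all.pdf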

splitting text files based column wise

So I have an invoice that I need to make a report out of. It is on average about 250 pages long. I'm trying to create a script that would extract specific values from the invoice and make a report. Here's my problem:
The invoice is in PDF format and spans two columns. I want to use the 'pdftotext' Linux command to convert it into multiple text files (with each txt file representing one PDF page). How do I do that?
I recognize that the 'pdftotext' command separates the left part of the page from the right part by putting 21 spaces in between. How do I move the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I read a keyword (let's just say the keyword = Start Invoice)?
I know this is a lot of questions, but I'm confused about what Linux commands can do. Can you guys point me in the right direction? Thanks
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
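For the "one text file per page" part, here is a sketch using pdftotext's -f/-l (first/last page) options, with pdfinfo supplying the page count; YOUR.pdf is a placeholder:
pages=$(pdfinfo YOUR.pdf | awk '/^Pages:/ {print $2}')
# write page_1.txt, page_2.txt, ... -- one text file per PDF page
for ((p = 1; p <= pages; p++)); do
  pdftotext -f "$p" -l "$p" -layout YOUR.pdf "page_$p.txt"
done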

Shortening large CSV on debian

I have a very large CSV file and I need to write an app that will parse it, but using the >6GB file to test against is painful. Is there a simple way to extract the first hundred or two hundred lines without having to load the entire file into memory?
The file resides on a Debian server.
Did you try the head command?
head -200 inputfile > outputfile
head -10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv.
"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.
