How to split a large file into small files with a prefix in Linux/Bash

I have a file in Linux called test. Now I want to split test into, say, 10 small files.
The test file has more than 1000 table names. I want the small files to have an equal number of lines each; the last file may end up with fewer.
What I want to know is: can we add a prefix to the split files while invoking the split command in the Linux terminal?
Sample:
test_xaa test_xab test_xac and so on..............
Is this possible in Linux?

I was able to solve my question with the following command:
split -l $(($(wc -l < test.txt) / 10 + 1)) test.txt test_x
With this I was able to get the desired result.
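To see why this works: wc -l < test.txt counts the lines, dividing by 10 and adding 1 gives a per-file line count that yields at most 10 pieces, and the final operand test_x is the output-name prefix. A quick sanity check (the 1003-line count here is hypothetical):
wc -l < test.txt                              # say it prints 1003
split -l $((1003 / 10 + 1)) test.txt test_x   # 101 lines per piece
ls test_x*                                    # test_xaa test_xab ... test_xaj (10 files)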

I would've sworn split did this on its own, but to my surprise, it does not.
To get your prefix, try something like this (the prefix has to be attached to the file name, not prepended to the whole path):
for x in /path/to/your/x*; do
    mv "$x" "$(dirname "$x")/your_prefix_$(basename "$x")"
done

Related

Shell Script to loop over files, apply command and save each output to new file

I have read most questions regarding this topic, but can't get an answer to my specific question:
I have a number of files in a directory, and I want to apply a command to each of these files and then create a new file with the output for every single file. I can only manage to write all of it together into one file. As I expect to have ~500,000 files, I would also need the script to be as efficient as possible.
for f in *.bed; do sort -k1,1 -k2,2n "$f"; done
This command sorts each file accordingly and writes the output to the shell, but I cannot manage to write to a file in the for loop without appending to it with ">>".
I'm thankful for any answer providing an approach or an already-answered question on this topic!
You can use a script like this:
for f in *.bed
do
    sort -k1,1 -k2,2n "$f" >> new_filename
done
If you want to be sure new_filename is empty before running the loop, you can clear its content with this command (before the for loop):
> new_filename
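Since the question actually asks for one new output file per input file, a minimal variation of the same loop does that instead (the .sorted.bed suffix is just an illustrative naming choice):
for f in *.bed
do
    sort -k1,1 -k2,2n "$f" > "${f%.bed}.sorted.bed"   # e.g. a.bed -> a.sorted.bed
done
Each iteration opens its own output file with ">", so nothing is appended across files and no pre-clearing is needed.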

Insert text into a .txt file using cmd

So, I want to insert text into a .txt file, but when I try
type file1.txt >> file2.txt
and sort it using Cygwin with sort file1 | uniq >> sorted, it places the text at the end of the file. But I want to write it to the start of the file. I don't know if this is possible in cmd, and if it's not, I can also do it in a Linux terminal.
Is there a special flag or operator I need to use?
Thanks in regards, Davin
Edit: the file I'm writing to is about 5 GB, so I would have to write 5 GB to the file every time I wanted to change anything.
It is not possible to write to the start of a file. You can only replace the file's content entirely or append to the end of it. So if you need to add the sorted output in front of the already-sorted file, you have to do it like this:
mv sorted sorted.old
sort file1 | uniq > sorted
cat sorted.old >> sorted
rm sorted.old
This is not a limitation of the shell but of the file APIs of pretty much every existing operating system. The size of a file can only be changed at the end: you can grow it (all content stays as it is, but there is now empty space after it) or truncate it (content is cut off at the end). It is possible to copy data around within a file, but there is no system function to do that; you have to do it yourself, and that is almost as inefficient as the solution shown above.
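If you would rather not juggle the intermediate names by hand, the same rewrite can be expressed as one grouped command into a scratch file (sorted.new is just an illustrative name):
{ sort file1 | uniq; cat sorted; } > sorted.new && mv sorted.new sorted
The group writes the new lines first, then the old content, and the final mv swaps the result into place. Note that the full 5 GB still gets rewritten; there is no way around that.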

Using diff on two files and sending the result by email [closed]

I have files like the ones below. I use crontab every 5 minutes to check the directory and see whether the system has added a new file, for example AIR_2015xxxxT0yyyyyyyy.cfg. Then I need to run the diff command automatically between the latest file and the one before it.
AIR_20151021T163514000.cfg
AIR_20151026T103845000.cfg
AIR_2015xxxxT0yyyyyyyy.cfg
I want to do this in a script like the one below:
#!/bin/bash
cd /var/opt/fds/
diff AIR_2015xxxxT0yyyyyyyy.cfg AIR_20151026T103845000.cfg > Test.txt
body(){
    cat body.txt
}
(echo -e "$(body)") | mailx -a Test.txt -s 'Comparison' user@email.com
Given a list of files in the directory /var/opt/fds with names in the format:
AIR_YYYYmmddTHHMMSSfff.cfg
where the letter Y represents digits for the year, m for month, d for day, H for hour, M for minute, S for second, and f for fraction (milliseconds), you need to establish the two most recent files in the directory in order to compare them.
One way to do this is:
cd /var/opt/fds || exit 1
old=
new=
for file in AIR_20[0-9][0-9]????T?????????.cfg
do
    old=$new
    new=$file
done
if [ -n "$old" ] && [ -n "$new" ]
then
    diff "$old" "$new" > test.txt
    mailx -a test.txt -s 'Comparison' user@example.com < body.txt
fi
Note that if the new file has a name containing letters x and y as shown in the question and comments, it will be listed after the names containing the time stamp as digits, so it will be picked up as the new file.
The script also assumes permission to write in the /var/opt/fds directory, and that the mail body file is present in that directory too. Those assumptions can be trivially fixed if necessary. The test.txt file should be deleted after it is sent, too, and you could check that it is non-empty before sending the email (just in case the two most recent files are in fact identical). You could embed a time stamp in the generated file name containing the diffs instead of using test.txt:
output="diff.$(date +'%Y%m%dT%H%M%S000').txt"
and then use $output in place of test.txt.
The test ensures that there was both an old and a new name. The pattern match is sloppier than it could be, but using [0-9] or an appropriate subrange ([01], [0-3], [0-2], [0-5]) for the question marks makes the pattern unreadably long:
for file in AIR_20[0-9][0-9][01][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9][0-9][0-9][0-9].cfg
It also probably provides very little extra in the way of protection. Of course, as shown, it imposes a Y2.1K crisis on the system, not that it is hard to fix that. You could also cut down the range of valid dates by basing it on today's date, but beware of the end of the year, etc. You might decide you only need entries from the last month or so.
Using globbing is generally better than trying to parse ls or find output. In this context, where the file names have a restricted set of characters (no newlines, blanks, tabs, quotes, dollar signs, etc.), it is feasible to use either find or ls, but if you have to deal with arbitrary names created by random end users, those tools are not suitable. (The ls command does all sorts of weird stuff with weird names and is basically hard to use reliably in the face of user cussedness. The find command and its -print0 option can be used, especially if you have a sort that recognizes -z to work with null-terminated 'lines' and an xargs that supports -0 to handle such lines too, but you have to be very careful.)
Note that this scheme does not keep a record of the last file analyzed (so if no new files appear for an hour, you might send a dozen copies of the same differences), nor does it directly report on the file names (but using diff -u or diff -c would include the file names being diffed in the output). Again, these issues can be worked around if that's appropriate (and it probably is). Keeping the record of which files have been compared is probably the hardest job; even that's not too bad:
echo "$old" "$new" >> reported.diffs
to record what's been processed; then
if grep -q "$old $new" reported.diffs
then : Already processed
else : Process $old and $new
fi
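Putting those pieces together, a minimal cron-ready sketch might look like this (assuming, as above, that body.txt and reported.diffs live in /var/opt/fds, and with a placeholder address):
#!/bin/bash
# Sketch only: consolidates the steps discussed above.
cd /var/opt/fds || exit 1
old=
new=
for file in AIR_20[0-9][0-9]????T?????????.cfg
do
    old=$new
    new=$file
done
[ -n "$old" ] && [ -n "$new" ] || exit 0
if ! grep -q "$old $new" reported.diffs 2>/dev/null
then
    output="diff.$(date +'%Y%m%dT%H%M%S000').txt"
    if ! diff -u "$old" "$new" > "$output"    # diff exits 1 when the files differ
    then
        mailx -a "$output" -s 'Comparison' user@example.com < body.txt
    fi
    rm -f "$output"
    echo "$old" "$new" >> reported.diffs
fi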

How does Linux IO redirection work internally?

When we use the IO redirection operator for a shell script, does the operator keep all the data to be written in memory and write it all at once, or does it write it to the file line by line?
Here is what I am working on.
I have about 200 small files, ~1000 lines each, in a specific format. I want to process each line in all the files (apply a regex and change the format a little) and have the new transformed lines in a single combined file.
I have a transformscript.sh that takes a single file and applies the transformation. I run it in the following manner:
sh transformscript.sh somefile.txt > newfile.txt
This works fine and fast for a single file.
How do I extend this to all the files? Would it be efficient to change transformscript.sh to take a directory as an argument instead of a filename, adding a for loop to transform all the lines of all the files together? Or should I run the above transformscript.sh for each file, creating a new file for each one, and then combine them separately?
Thanks.
The redirect operator simply opens the file for writing and hands that file descriptor to the command as its standard output. The command then writes to the file directly; nothing is held back in memory to be written all at once.
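You can watch the same mechanism from inside a script by opening a descriptor explicitly (fd 3 below is an arbitrary illustrative choice):
# roughly what the shell arranges for 'cmd > newfile.txt', spelled out:
exec 3> newfile.txt      # open the file for writing on fd 3
echo "one line" >&3      # each write goes straight to the file
echo "another line" >&3
exec 3>&-                # close the descriptor
The shell never collects the output in memory; data reaches the file as the command produces it (subject only to the command's own buffering).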
You probably do NOT want to run the script separately for each file, since you will incur the overhead of shell process creation for each pass. For example:
# don't do it this way
for somefile in somefiles*.txt; do
    newfile=${somefile//some/new}
    sh transformscript.sh "$somefile" > "$newfile"
done
The above starts one shell for every file found, which is pretty inefficient. It would be better to rewrite transformscript.sh to handle multiple files if possible. Depending on how complicated your transform is and whether you need to keep the original filenames, you might be able to use a single sed process. For example, assume you have 200 files named test1.txt through test200.txt, all with a "Hello world" line you want to change to "Hello joe". You could do something as simple as this:
sed -i.save 's/Hello world/Hello joe/' test*.txt
The -i tells sed to do an "in place" edit (edit the original file), and the optional ".save" argument to -i makes a backup copy of the original file with a .save extension before editing it. Note, this will leave the original contents in the .save files and the new content in the files with the original names, which may not be what you want.
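Since the stated goal is one combined output file, note that sed can also stream multiple inputs to standard output without touching the originals (combined.txt is just an illustrative name):
sed 's/Hello world/Hello joe/' test*.txt > combined.txt
A single sed process handles all 200 files, and the transformed lines land in one file, in glob order.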

Compare 2 files with shell script

I was trying to find a way of knowing whether two files are the same, and found this post...
Parsing result of Diff in Shell Script
I used the code in the first answer, but I think it's not working, or at least I can't get it to work properly...
I even tried making a copy of a file and comparing both (copy and original), and I still get the answer as if they were different, when they shouldn't be.
Could someone give me a hand, or explain what's happening?
Thanks so much;
peixe
Are you trying to compare if two files have the same content, or are you trying to find if they are the same file (two hard links)?
If you are just comparing two files, then try:
diff "$source_file" "$dest_file" # without -q
or
cmp "$source_file" "$dest_file" # without -s
in order to see the supposed differences.
You can also try md5sum:
md5sum "$source_file" "$dest_file"
If both files return the same checksum, then they are identical.
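In a script, cmp -s (mentioned above) is the usual building block, since it prints nothing and just sets the exit status:
if cmp -s "$source_file" "$dest_file"
then
    echo "files are identical"
else
    echo "files differ"
fi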
comm is a useful tool for comparing files.
The comm utility will read file1 and file2, which should be ordered in the current collating sequence, and produce three text columns as output: lines only in file1; lines only in file2; and lines in both files.
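For the "are these files the same?" use case, suppressing the third column leaves only the lines unique to one file or the other (the .sorted names are illustrative; comm requires sorted input):
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -3 file1.sorted file2.sorted   # prints nothing if both files hold the same lines
Empty output means the two files contain exactly the same lines.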
