Separating a joined file to original files in Linux

I know that to append or join multiple files in Linux, we can use the command: cat file1 >> file2.
But I couldn't find any command to separate file1 from file2 after joining them. In other words, I want both original file1 and file2 back again. I tried to use the split command but it just dismembers a file into multiple files with the same size.
Is there a way to do it?

There is no such command, since no information about what was file1 or file2 is retained. The new combined file is just a data stream.
In order to "split" them back up, you need rules about how to do so (such as, how many bytes file1 and file2 were).

When you perform the concatenation, the system doesn't keep track of how the resulting file was created. So it has no way of remembering where the original split was located in that file.
Can you explain what you are trying to do?

No problem, as long as you still have file1:
$ echo foobar >file1
$ echo blah >file2
$ cat file1 >> file2
$ truncate -s $(( $(stat -c '%s' file2) - $(stat -c '%s' file1) )) file2
$ cat file2
blah
Also, instead of stat -c '%s' filename you can use wc -c < filename, which is more portable. Reading the file through a redirection makes wc print only the byte count, without the file name.
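
If file1 is no longer around, the split point has to be recorded before the files are joined. Here is a minimal sketch of that idea, assuming you can note the size of file2 ahead of time. The names orig_size, file2.orig and file1.recovered are made up for illustration, and head -c is not in POSIX, although GNU and BSD systems both provide it:

```shell
printf 'foobar\n' > file1
printf 'blah\n' > file2

# Remember how long file2 was before anything was appended to it.
# The arithmetic expansion strips any padding some wc versions print.
orig_size=$(( $(wc -c < file2) ))

cat file1 >> file2

# The first orig_size bytes are the old file2; everything after is file1.
head -c "$orig_size" file2 > file2.orig
tail -c +"$(( orig_size + 1 ))" file2 > file1.recovered
```

Without a recorded size (or some marker inside the data), the boundary cannot be recovered from the joined file alone.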

Related

How to use grep in a shell script?

I am trying to write a shell script that prints the last modification dates of a few files, but somehow the script just prints an empty line.
"modified" is a file that contains the names and the modification dates of those files, one per line, with '#' as the delimiter. I create it like this:
>modified
for i in file1 file2 file3
do
echo $i#`stat --printf='%y\n' $i`>>modified
done
Having created that file, I'm trying to search it like:
for i in file1 file2 file3
do
var=`grep -w "$i" modified | cut -d'#' -f2`
echo $var
done

As mentioned by Charles, there's no reason to create that modified file for this (unless you are planning to use the file for another purpose).
Also, you can pass several files to a single stat command, as in:
stat --printf='%y\n' file1 file2 file3
This prints one modification date per file, which is exactly the output you're aiming for.
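
If the name#date file is still wanted for some other purpose, stat can write it in one go. A small sketch, assuming GNU coreutils stat (its %n and %y format sequences); the output file name file2_date is only for illustration:

```shell
touch file1 file2 file3

# %n is the file name, %y the last modification time; one line per file.
stat --printf='%n#%y\n' file1 file2 file3 > modified

# Look up a single file by comparing the name field exactly.
awk -F'#' -v name="file2" '$1 == name { print $2 }' modified > file2_date
```

Comparing the whole first field avoids the partial matches that grep -w can produce for names such as file1 and file1.bak.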

Using Sed to extract the headers in multiple files

I used head -3 to extract the headers from some files whose header data I needed to show. I did this:
head -3 file1 file2 file3
and head -3 * also works.
I thought sed 3 file1 file2 file3 would work, but it only gives the first file's output and not the others. I then tried sed -n '1,2p' file1 file2 file3. Again, only the first file produced any output. I also tried a wildcard, sed -n '1,2p' filename*, with the same result: only the first file's output.
Everything I have read suggests that sed followed by several file names should work.
Thanks in advance

Assuming GNU sed, as the question is tagged linux. From the GNU sed manual:
-s
--separate
By default, sed will consider the files specified on the command line as a single continuous long stream. This GNU sed extension allows the user to consider them as separate files: range addresses (such as ‘/abc/,/def/’) are not allowed to span several files, line numbers are relative to the start of each file, $ refers to the last line of each file, and files invoked from the R commands are rewound at the start of each file.
Example:
$ cat file1
foo
bar
$ cat file2
123
456
$ sed -n '1p' file1 file2
foo
$ sed -n '3p' file1 file2
123
$ sed -sn '1p' file1 file2
foo
123
When using -i, the -s option is implied:
$ sed -i '1chello' file1 file2
$ cat file1
hello
bar
$ cat file2
hello
456
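
Applied to the original question, an equivalent of head -3 might look like the sketch below. It assumes GNU sed for -s; the awk line is a portable alternative that relies on FNR restarting at 1 for each input file. The sample contents and the output names headers_sed and headers_awk are made up:

```shell
printf 'h1\nh2\nh3\nbody\n' > file1
printf 'H1\nH2\nH3\nbody\n' > file2

# GNU sed: -s makes line numbers restart for each file.
sed -s -n '1,3p' file1 file2 > headers_sed

# Portable alternative: FNR is the line number within the current file.
awk 'FNR <= 3' file1 file2 > headers_awk
```

Unlike head, neither command prints a "==> file <==" banner between files.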

shell script to compare two files and write the difference to third file

I want to compare two files and redirect the difference between them to a third one.
file1:
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
In case any file has # before /opt/c/c.sql, it should skip #
file2:
/opt/c/c.sql
/opt/a/a.sql
I want to get the difference between the two files. In this case, /opt/b/b.sql should be stored in a different file. Can anyone help me to achieve the above scenarios?

file1
$ cat file1 #both file1 and file2 may contain spaces which are ignored
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
/opt/h/m.sql
file2
$ cat file2
/opt/c/c.sql
/opt/a/a.sql
Do
awk 'NR==FNR{line[$1];next}
{if(!($1 in line)){if($0!=""){print}}}
' file2 file1 > file3
file3
$ cat file3
/opt/b/b.sql
/opt/h/m.sql
Notes:
The order of the files passed to awk matters here: pass the file to check against (file2 here) first, followed by the master file (file1).
Check the awk documentation on NR and FNR to understand what is done here.
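
The awk command above does not deal with the # lines mentioned in the question. A possible variant, assuming that "skip" means a commented-out line should be ignored entirely (the array name seen and the sample data are only for illustration):

```shell
printf '%s\n' '/opt/a/a.sql' '/opt/b/b.sql' '#/opt/c/c.sql' '' '/opt/h/m.sql' > file1
printf '%s\n' '/opt/c/c.sql' '/opt/a/a.sql' > file2

awk '
    /^[ \t]*(#|$)/ { next }         # ignore commented-out and blank lines
    NR == FNR { seen[$1]; next }    # first file (file2): remember each path
    !($1 in seen)                   # second file (file1): print unseen paths
' file2 file1 > file3
```

With this data file3 ends up holding /opt/b/b.sql and /opt/h/m.sql.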

You can use some tools like cat, sed, sort and uniq.
The main observation is this: a line that appears in both files is not unique in the output of cat file1 file2.
Furthermore, in cat file1 file2 | sort all duplicates are adjacent. uniq -u keeps only the lines that occur exactly once (in either file), which gives this pipe:
cat file1 file2 | sort | uniq -u
Using sed to remove leading whitespace, empty and comment lines, we get this final pipe:
cat file1 file2 | sed -r 's/^[ \t]+//; /^#/ d; /^$/ d;' | sort | uniq -u > file3
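
Because uniq -u also reports lines that exist only in file2, a different tool may fit better when only the lines missing from file2 are wanted. A sketch using comm instead, assuming the inputs are already free of comments and leading whitespace (the .sorted file names are made up):

```shell
printf '%s\n' '/opt/a/a.sql' '/opt/b/b.sql' '/opt/c/c.sql' > file1
printf '%s\n' '/opt/c/c.sql' '/opt/a/a.sql' '/opt/z/z.sql' > file2

# comm needs both inputs sorted the same way.
sort file1 > file1.sorted
sort file2 > file2.sorted

# -2 drops lines found only in file2, -3 drops lines common to both,
# leaving the lines that appear only in file1.
comm -23 file1.sorted file2.sorted > file3
```

Here /opt/z/z.sql, which exists only in file2, is not reported.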

how to subtract the two files in linux

I have two files like below:
file1
"Connect" CONNECT_ID="12"
"Connect" CONNECT_ID="11"
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
file2
"Quit" CONNECT_ID="12"
"Quit" CONNECT_ID="11"
The real file contents are not exactly the same as above, but similar, and there are at least 100,000 records.
Now I want to get the result shown below into file1 (meaning the final result should end up in file1):
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
I have used a while loop something like below:
awk '{print $2}' file2 | sed 's/CONNECT_ID=//g' > sample.txt
while read -r actual; do
grep -w -v -- "$actual" file1 > file1_tmp
mv -f file1_tmp file1
done < sample.txt
I have adapted my code to fit the example, so it may or may not work as shown.
My problem is that the loop takes more than an hour to complete.
Can anyone suggest another way to achieve the same result, using diff, comm, sed, awk or any other Linux command that will run faster?
Mainly, I want to get rid of this big while loop.

Most UNIX tools are line based and as you don't have whole line matches that means grep, comm and diff are out the window. To extract field based information like you want awk is perfect:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
To store the results back to file1 you'll need to redirect the output to a temporary file and then move that file over file1, like so:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1 > tmp && mv tmp file1
Explanation:
The awk variable NR increments for every record read, that is each line in every file. The FNR variable increments for every record but gets reset for every file.
NR==FNR # This condition is only true while reading the first file given, file2
a[$2] # Add the second field of the file2 line to the array a as a lookup key
next # Move on to the next line (skips any following blocks)
!($2 in a) # We are now reading file1: if the second field is not in the lookup
# array, execute the default action, i.e. print the line
To adapt this command you just need to change the fields that are matched. If in your real case you want to match field 1 of file2 against field 4 of file1, then you would do:
$ awk 'NR==FNR{a[$1];next}!($4 in a)' file2 file1
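
For anyone who wants to try the two-file idiom on the sample data, here is a self-contained sketch. The array name quit is only illustrative:

```shell
printf '%s\n' '"Connect" CONNECT_ID="12"' '"Connect" CONNECT_ID="11"' \
              '"Connect" CONNECT_ID="122"' '"Connect" CONNECT_ID="109"' > file1
printf '%s\n' '"Quit" CONNECT_ID="12"' '"Quit" CONNECT_ID="11"' > file2

awk '
    NR == FNR { quit[$2]; next }   # file2 is read first: store its IDs
    !($2 in quit)                  # file1: keep lines whose ID was not stored
' file2 file1 > tmp && mv tmp file1
```

One caveat: if file2 is empty, NR==FNR also holds while file1 is being read, so nothing is printed and file1 would end up empty. It may be worth checking for that case before running the command.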

This might work for you (GNU sed):
sed -r 's|\S+\s+(\S+)|/\1/d|' file2 | sed -f - -i file1

The tool best suited to this job is join(1). It joins two files based on values in a given column of each file. Normally it just outputs the lines that match across the two files, but it also has a mode to output the lines from one of the files that do not match the other file.
join requires that the files be sorted on the field(s) you are joining on, so either pre-sort the files, or use process substitution (a bash feature - as in the example below) to do it on the one command line:
$ join -j 2 -v 1 -o "1.1 1.2" <(sort -k2,2 file1) <(sort -k2,2 file2)
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
-j 2 says to join the files on the second field for both files.
-v 1 says to only output lines from file 1 that do not match any line in file 2.
-o "1.1 1.2" says to order the output with the first field of file 1 (1.1) followed by the second field of file 1 (1.2). Without this, join will output the join column first followed by the remaining columns.

You may need to go through file2 first and add every ID that appears in it to a cache (e.g. in memory).
Then scan file1 line by line and check whether each ID is in the cache.
Python code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
# Capture the text between the quotes that follow CONNECT_ID=
p = re.compile(r'CONNECT_ID="([^"]*)"')
# First pass: collect every ID that appears in file2.
quit_ids = set()
with open('file2') as quits:
    for line in quits:
        m = p.search(line)
        if m:
            quit_ids.add(m.group(1))
# Second pass: copy the lines of file1 whose ID was not seen in file2.
with open('file1') as connects, open('output_file', 'w') as output:
    for line in connects:
        m = p.search(line)
        if m and m.group(1) not in quit_ids:
            output.write(line)

The main bottleneck is not really the while loop, but the fact that you rewrite the output file thousands of times.
In your particular case, you might be able to get away with just this:
cut -f2 file2 | grep -Fwvf - file1 >tmp
mv tmp file1
(I don't think the -w option to grep is useful here, but since you had it in your example, I retained it.)
This presupposes that file2 is tab-delimited; if not, the awk '{ print $2 }' file2 you had there is fine.
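
For the space-delimited sample data, the same approach might be sketched as follows. The intermediate file name ids is made up, and -w is left out because each pattern already ends with the closing quote:

```shell
printf '%s\n' '"Connect" CONNECT_ID="12"' '"Connect" CONNECT_ID="11"' \
              '"Connect" CONNECT_ID="122"' '"Connect" CONNECT_ID="109"' > file1
printf '%s\n' '"Quit" CONNECT_ID="12"' '"Quit" CONNECT_ID="11"' > file2

# Turn file2 into a list of fixed strings, one per line.
awk '{ print $2 }' file2 > ids

# -F: patterns are fixed strings, -v: invert the match,
# -f: read the patterns from a file.
grep -F -v -f ids file1 > tmp
mv tmp file1
```

file1 is read once and written once, instead of once per ID.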

How to append one file to another in Linux from the shell?

I have two files: file1 and file2. How do I append the contents of file2 to file1 so that the existing contents of file1 are preserved?

Use bash builtin redirection (tldp):
cat file2 >> file1

cat file2 >> file1
The >> operator appends the output to the named file or creates the named file if it does not exist.
cat file1 file2 > file3
This concatenates two or more files into one. You can have as many source files as you need. For example,
cat *.txt >> newfile.txt
Update 20130902
In the comments eumiro suggests "don't try cat file1 file2 > file1." The reason this might not result in the expected outcome is that the file receiving the redirect is prepared before the command to the left of the > is executed. In this case, first file1 is truncated to zero length and opened for output, then the cat command attempts to concatenate the now zero-length file plus the contents of file2 into file1. The result is that the original contents of file1 are lost and in its place is a copy of file2 which probably isn't what was expected.
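
The effect is easy to reproduce. A small sketch, with made-up contents and the illustrative names pitfall_result and combined.tmp:

```shell
printf 'one\n' > file1
printf 'two\n' > file2

# Pitfall: the shell truncates file1 before cat ever reads it.
# (Some cat implementations also print a warning here.)
cat file1 file2 > file1 || true
cp file1 pitfall_result        # now holds only the contents of file2

# Safer: write to a temporary file, then replace the original.
printf 'one\n' > file1
cat file1 file2 > combined.tmp && mv combined.tmp file1
```

After the second variant file1 contains both lines, in order.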
Update 20160919
In the comments tpartee suggests linking to backing information/sources. For an authoritative reference, I direct the kind reader to the sh man page at linuxcommand.org which states:
Before a command is executed, its input and output may be redirected
using a special notation interpreted by the shell.
While that does tell the reader what they need to know it is easy to miss if you aren't looking for it and parsing the statement word by word. The most important word here being 'before'. The redirection is completed (or fails) before the command is executed.
In the example case of cat file1 file2 > file1 the shell performs the redirection first so that the I/O handles are in place in the environment in which the command will be executed before it is executed.
A friendlier version in which the redirection precedence is covered at length can be found at Ian Allen's web site in the form of Linux courseware. His I/O Redirection Notes page has much to say on the topic, including the observation that redirection works even without a command. Passing this to the shell:
$ >out
...creates an empty file named out. The shell first sets up the I/O redirection, then looks for a command, finds none, and completes the operation.

Note: if you need to use sudo, do this:
sudo bash -c 'cat file2 >> file1'
The usual method of simply prepending sudo to the command will fail, since the privilege escalation doesn't carry over into the output redirection.

Try this command:
cat file2 >> file1

Just for reference, using ddrescue provides an interruptible way of achieving the task if, for example, you have large files and the need to pause and then carry on at some later point:
ddrescue -o $(wc --bytes file1 | awk '{ print $1 }') file2 file1 logfile
The logfile is the important bit. You can interrupt the process with Ctrl-C and resume it by specifying the exact same command again and ddrescue will read logfile and resume from where it left off. The -o A flag tells ddrescue to start from byte A in the output file (file1). So wc --bytes file1 | awk '{ print $1 }' just extracts the size of file1 in bytes (you can just paste in the output from ls if you like).
As pointed out by ngks in the comments, the downside is that ddrescue will probably not be installed by default, so you will have to install it manually. The other complication is that there are two versions of ddrescue which might be in your repositories: see this askubuntu question for more info. The version you want is the GNU ddrescue, and on Debian-based systems is the package named gddrescue:
sudo apt install gddrescue
For other distros check your package management system for the GNU version of ddrescue.

Another solution:
tee < file1 -a file2
tee has the benefit that you can append to as many files as you like, for example:
tee < file1 -a file2 file3 file4
will append the contents of file1 to file2, file3 and file4.
From the man page:
-a, --append
append to the given FILEs, do not overwrite
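
A quick way to try this out, with made-up file contents. Since tee also copies its input to standard output, this sketch sends that copy to /dev/null:

```shell
printf 'new\n' > file1
printf 'a\n' > file2
printf 'b\n' > file3
printf 'c\n' > file4

# Append file1 to three files at once and discard the stdout copy.
tee -a file2 file3 file4 < file1 > /dev/null
```

Each of file2, file3 and file4 keeps its original line and gains the line from file1.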

Zsh specific: You can also do this without cat, though honestly cat is more readable:
>> file1 < file2
The >> appends STDIN to file1 and the < dumps file2 to STDIN.

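cat is the easy solution, but it can become slow when concatenating large files. find with -print0 piped to xargs comes to the rescue, although cat still does the actual copying (and find does not guarantee the order in which the files are returned):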
amey@xps ~/work/python/tmp $ ls -lhtr
total 969M
-rw-r--r-- 1 amey amey 485M May 24 23:54 bigFile2.txt
-rw-r--r-- 1 amey amey 485M May 24 23:55 bigFile1.txt
amey@xps ~/work/python/tmp $ time cat bigFile1.txt bigFile2.txt >> out.txt
real 0m3.084s
user 0m0.012s
sys 0m2.308s
amey@xps ~/work/python/tmp $ time find . -maxdepth 1 -type f -name 'bigFile*' -print0 | xargs -0 cat -- > outFile1
real 0m2.516s
user 0m0.028s
sys 0m2.204s
