How can I correctly parse this file through tail, without formatting errors?
I am using tail within cygwin to parse the last ten lines of two files. One file parses through correctly, the other contains a space between every character.
$ tail file2.txt -n 4
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
$ tail file1.txt -n 4
P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I read the raw contents of the file in Python I get the following, which I think is a load of Unicode characters:
In [1]: open('file1.text', 'r').read()
Out[1]: '\xff\xfeP\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00s\x00t\x00a\x00r\x00t\x00.\x00\r\x00\n\x00E\x00l\x00a\x00p\x00s\x00e\x00d\x00 \x00t\x00i\x00m\x00e\x00:\x00 \x005\x000\x00.\x001\x007\x005\x007\x005\x004\x008\x00 \x00s\x00e\x00c\x00s\x00.\x00\r\x00\n\x00.\x00.\x00.\x00P\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00E\x00x\x00e\x00c\x00u\x00t\x00e\x00d\x00.\x00\r\x00\n\x00\r\x00\n\x00R\x00e\x00s\x00u\x00l\x00t\x00:\x00 \x00S\x00u\x00c\x00c\x00e\x00s\x00s\x00\r\x00\n\x00\r\x00\n\x00'
In [2]: print open('temp.txt', 'r').read()
■P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I copy the entire content of file1.txt into a new file test.txt - the issue does not reoccur.
$ tail test.txt
Package start.
Elapsed time: 50.1757548 secs.
...Package Executed.
Result: Success
The file seems to have the characters \x00 between every character and \xff at the start.
The file is in UTF-16 format, which uses two 8-bit bytes to represent most characters (and four bytes for some characters). Each of the 128 ASCII characters is represented as 2 bytes: a byte containing the actual character value and a zero byte. The \xff\xfe sequence at the start is a Byte Order Mark (BOM); it indicates whether the remaining characters are stored with the high-order or low-order byte first. Here \xff\xfe means low-order byte first (little-endian), which is why every character in your dump is followed by \x00.
UTF-16 is one of several ways to represent Unicode text. It's most commonly used in Microsoft Windows.
I'm not sure why the null characters appear as spaces. That may be due to the way your terminal emulator behaves.
Use the iconv command to convert the file from UTF-16 to some other format.
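As a minimal sketch of that conversion, assuming the file starts with a BOM (as the \xff\xfe prefix suggests) and that UTF-8 is an acceptable target, which the answer above does not specify, iconv can feed tail directly. The function name tail_utf16 is made up for illustration, and the tr step is an extra that strips the \r of the Windows line endings visible in the Python dump.

```shell
# Hypothetical helper: print the last N lines of a UTF-16 file as UTF-8.
# Assumes the file begins with a BOM so iconv can work out the byte order.
tail_utf16() {
    # $1 = file, $2 = number of lines (defaults to 10, like tail)
    iconv -f UTF-16 -t UTF-8 "$1" |
        tr -d '\r' |              # drop the CR of each CRLF line ending
        tail -n "${2:-10}"
}

# Example: tail_utf16 file1.txt 4
```

If the file had no BOM, the byte order would probably have to be named explicitly (for instance -f UTF-16LE), since implementations differ in what they assume.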
I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command described here allows one to shuffle lines in one file, but how can I ensure that corresponding lines in the second file are also shuffled into the same order? Is there a command to shuffle lines in both files?
TL;DR
paste to combine the two files into a single file, one column per input file
shuf to shuffle the lines of that single file
cut to split the columns back into separate files
Paste
$ cat test.en
a b c
d e f
g h i
$ cat test.de
1 2 3
4 5 6
7 8 9
$ paste test.en test.de > test.en-de
$ cat test.en-de
a b c 1 2 3
d e f 4 5 6
g h i 7 8 9
Shuffle
$ shuf test.en-de > test.en-de.shuf
$ cat test.en-de.shuf
d e f 4 5 6
a b c 1 2 3
g h i 7 8 9
Cut
$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de
$ cat test.en-de.shuf.en
d e f
a b c
g h i
$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
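The three steps can be wrapped into one function. This is only a sketch under two assumptions that the answer does not state: both files have the same number of lines, and neither contains a tab character (paste joins on a tab and cut splits on it, so a tab inside a sentence would misalign the columns). The name shuffle_pair and the .shuf output suffix are made up for illustration.

```shell
# Hypothetical helper: shuffle two line-aligned files into the same random order.
# Writes "$1.shuf" and "$2.shuf"; refuses to run if either input contains a tab.
shuffle_pair() {
    tab=$(printf '\t')
    if grep -q "$tab" "$1" "$2"; then
        echo "input contains tabs; the paste/cut round trip would break" >&2
        return 1
    fi
    # Join the files line by line, shuffle the joined lines, then split again.
    paste "$1" "$2" | shuf > "$1.pair.tmp" || return 1
    cut -f1 "$1.pair.tmp" > "$1.shuf"
    cut -f2 "$1.pair.tmp" > "$2.shuf"
    rm -f "$1.pair.tmp"
}

# Example: shuffle_pair test.en test.de
```

A quick sanity check afterwards is that paste of the two shuffled files, once sorted, matches the sorted paste of the originals.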
I have two blocks of data in a file, say foo.txt like the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
I'd like to extract rows 2:4 and 6:8 and put them as the following:
b 2 f 6
c 3 g 7
d 4 h 8
I could try using auxiliary files:
sed -n '2,4p' foo.txt > tmp1; sed -n '6,8p' foo.txt > tmp2; paste tmp1 tmp2 > output; rm tmp1 tmp2
But is there a better way to do it without auxiliary files? Thanks!
Using process substitution:
$ paste <(sed -n '2,4p' foo.txt) <(sed -n '6,8p' foo.txt) > output
$ cat output
b 2 f 6
c 3 g 7
d 4 h 8
In AWK:
$ awk 'NR==2,NR==4{a[++i]=$0} NR==6,NR==8{b[++j]=$0} END {for(i=1;i<=j;i++) print a[i],b[i]}' foo.txt
b 2 f 6
c 3 g 7
d 4 h 8
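While the record number (NR) is inside the first range, lines are stored in array a; while it is inside the second range, they are stored in array b. In the END block the two arrays are printed side by side.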
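The same idea can be parameterised so the line ranges are not hard-coded. This is a sketch, not part of the original answer: the function name side_by_side and the awk variable names s1, e1, s2, e2 are made up, and it uses plain comparisons on NR instead of the range pattern.

```shell
# Hypothetical helper: print lines s1..e1 next to lines s2..e2 of one file.
side_by_side() {
    # $1 = file, $2 $3 = first range, $4 $5 = second range
    awk -v s1="$2" -v e1="$3" -v s2="$4" -v e2="$5" '
        NR >= s1 && NR <= e1 { a[++i] = $0 }   # collect the first block
        NR >= s2 && NR <= e2 { b[++j] = $0 }   # collect the second block
        END {
            n = (i > j) ? i : j                # the longer block sets the row count
            for (k = 1; k <= n; k++) print a[k], b[k]
        }
    ' "$1"
}

# Example: side_by_side foo.txt 2 4 6 8 > output
```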
How to compare 2 files? I need to compare a column of a linux file with the second column of another file and get the difference.
Let's say I have the following files.
file 1:
a 3
b 6
c 8
d 7
g 5
p 16
file 2:
a 1
b 6
c 8
d 7
g 5
I need to compare column two of file 1 with column two of file 2 and get the difference.
Desired output file 1 - file 2 :
a 2
b 0
c 0
d 0
g 0
p 16
This awk one-liner works for your example:
awk 'NR==FNR{a[$1]=$2;next}{print $1,a[$1]-$2;delete a[$1]}
END{for(x in a)print x, a[x]}' file1 file2
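For readers unfamiliar with the NR==FNR idiom, here is the same one-liner spread out with comments. It is a sketch wrapped in a made-up function name, col_diff; the logic is unchanged. Two things to keep in mind: a key that appears only in the second file is printed as 0 minus its value, and the order of the leftover keys printed in END is not guaranteed by awk.

```shell
# Hypothetical helper: column 2 of file1 minus column 2 of file2, matched on column 1.
col_diff() {
    awk '
        NR == FNR { a[$1] = $2; next }   # first file: remember the value for each key
        {
            print $1, a[$1] - $2         # second file: subtract its value
            delete a[$1]                 # mark the key as handled
        }
        END { for (x in a) print x, a[x] }   # keys found only in the first file
    ' "$1" "$2"
}

# Example: col_diff file1 file2
```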
So far, I have the following code that allows me to loop through a file and identify every third line. However, I want to delete the last column on every third line, and I'm having trouble coming up with the syntax:
#!/bin/bash
counter=1
lines=$(wc -l < 'test_file')
echo $lines
while read line; do
    if (( $counter % 3 == 0 )); then
        #Need a good one liner here to solve the problem!
    else
        echo "$line is not a multiple of three."
    fi
    ((counter = counter + 1))
done < 'test_file'
I'm open to using awk, but I'm not sure how to do that inline on a line within a file. If it can be done with parameters on the entire file that is fine.
Also, if the divisible by 3 line contains only one field, that should be deleted as well. There will never be an already existing blank line.
Sample input:
my dog smells
my cat bites
my fish swims
a b c
d e f
g h i
Sample output:
my dog smells
my cat bites
my fish
a b c
d e f
g h
awk handles this kind of task more easily:
awk 'NR%3==0{NF--}7' file
Let's do a little test:
kent$ cat f
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
kent$ awk 'NR%3==0{NF--}7' f
a b c d
a b c d
a b c
a b c d
a b c d
a b c
a b c d
a b c d
a b c
a b c d
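The one-liner is terse, so here is a spelled-out sketch of what it appears to do, wrapped in a made-up function name. Decrementing NF makes awk rebuild the line without its last field, and the trailing 7 in the original is simply an always-true pattern that triggers the default print. The NF > 0 guard is an addition, not part of the original answer. Shrinking NF works in gawk and mawk, but some older awk implementations may not support it, and the rebuilt line is joined with single spaces.

```shell
# Hypothetical helper: drop the last field of every third line.
drop_last_field_every_third() {
    awk '
        NR % 3 == 0 && NF > 0 { NF-- }   # shrinking NF rebuilds $0 without the last field
        { print }                        # same effect as the always-true pattern "7"
    ' "$1"
}

# Example: drop_last_field_every_third test_file
```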
This might work for you (GNU sed):
sed '3~3s/\s*\S\+$//' file
I have this script:
cut -d "|" -f 8,9,13,23 info.txt > result.txt
When I open result.txt it looks like this:
R S L T = R A C 2 6 8 | R A W T = r i c k a r d a d a m c e s a r t w o s i x n i n e
When it should look like this:
RSLT=RAC268|RAWT=rickard adam cesar two six nine
It works in the console, so my guess is that the problem is within the output operator.