Cygwin Output problems - cygwin

I have this command: cut -d "|" -f 8,9,13,23 info.txt > result.txt
When I open result.txt it looks like this:
R S L T = R A C 2 6 8 | R A W T = r i c k a r d a d a m c e s a r t w o s i x n i n e
When it should look like this:
RSLT=RAC268|RAWT=rickard adam cesar two six nine
It works in the console, so my guess is that the problem is within the output operator.

Related

Shuffling pairs of lines in two text files

I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command described here allows one to shuffle lines in one file, but how can I ensure that corresponding lines in the second file are also shuffled into the same order? Is there a command to shuffle lines in both files?
TL;DR
paste to create separate columns from two files into a single file
shuf on the single file
cut to split the columns
Paste
$ cat test.en
a b c
d e f
g h i
$ cat test.de
1 2 3
4 5 6
7 8 9
$ paste test.en test.de > test.en-de
$ cat test.en-de
a b c 1 2 3
d e f 4 5 6
g h i 7 8 9
Shuffle
$ shuf test.en-de > test.en-de.shuf
$ cat test.en-de.shuf
d e f 4 5 6
a b c 1 2 3
g h i 7 8 9
Cut
$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de
$ cat test.en-de.shuf.en
d e f
a b c
g h i
$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
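The three steps above can also be chained so that the combined, unshuffled file is never written to disk. This is a minimal sketch, assuming GNU coreutils (for shuf) and assuming that neither file contains tab characters, since paste joins the two columns with a tab and cut splits on it. The workdir variable and the sample contents are only for illustration.

```shell
# Illustrative working directory and sample data (not from the original post).
workdir=$(mktemp -d)
printf '%s\n' 'a b c' 'd e f' 'g h i' > "$workdir/test.en"
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' > "$workdir/test.de"

# paste joins line N of both files with a tab; shuf permutes the joined lines,
# so each English line stays attached to its German counterpart.
paste "$workdir/test.en" "$workdir/test.de" | shuf > "$workdir/test.en-de.shuf"

# cut splits on tab by default: field 1 is English, field 2 is German.
cut -f1 "$workdir/test.en-de.shuf" > "$workdir/test.en-de.shuf.en"
cut -f2 "$workdir/test.en-de.shuf" > "$workdir/test.en-de.shuf.de"
```

If the text may contain tabs, the same idea still applies with a different delimiter passed to both paste -d and cut -d, provided that character never occurs in the data.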

Extracting two ranges of lines from a file and putting them side by side as a data block with shell commands

I have two blocks of data in a file, say foo.txt like the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
I'd like to extract rows 2:4 and 6:8 and put them as the following:
b 2 f 6
c 3 g 7
d 4 h 8
I could try using auxiliary files:
sed -n '2,4p' foo.txt > tmp1; sed -n '6,8p' foo.txt > tmp2; paste tmp1 tmp2 > output; rm tmp1 tmp2
But is there a better way to do it without auxiliary files? Thanks!
Using process substitution:
$ paste <(sed -n '2,4p' foo.txt) <(sed -n '6,8p' foo.txt) > output
$ cat output
b 2 f 6
c 3 g 7
d 4 h 8
$
In AWK:
$ awk 'NR==2,NR==4{a[++i]=$0} NR==6,NR==8{b[++j]=$0} END {for(i=1;i<=j;i++) print a[i],b[i]}' file
b 2 f 6
c 3 g 7
d 4 h 8
When between the given record numbers (NR), fill up arrays a and b. In the END, print them side by side.

AWK script: Finding number of matches that each element in Col2 has in Col1

I want to compare two columns in a file, as below, using AWK. Can someone help, please?
e.g.
Col1 Col2
---- ----
2 A
2 D
3 D
3 D
3 A
7 N
7 M
1 D
1 R
Now I want to use AWK to implement the following algorithm to find matches between those columns:
list1[] <=== Col1
list2[] <=== Col2
NewList[]
for i in col2:
    d = 0
    for j in range(1,len(col2)):
        if i == list2[j]:
            d++
    NewList.append(list1[list2.index[i]])
Expected result:
A ==> 2 // means A matches two times to Col1
D ==> 4 // means D matches four times to Col1
....
So I want to write the above code in AWK script and I find it too complicated for me as I haven't used it yet.
Thank you very much for your help
It's not all that complicated: keep the count in an array indexed by the character, and print the array out at the end:
awk '{cnt[$2]++} END {for(c in cnt) print c, cnt[c]}' test.txt
# A 2
# D 4
# M 1
# N 1
# R 1
{cnt[$2]++} # For each row, get the second column and increase the
# value of the array at that position (i.e. cnt["A"]++)
END {for(c in cnt) print c, cnt[c]}
# When all rows are done (END), loop through the keys of the
# array and print key and array[key] (the value)
alternative solution
$ rev file | cut -c1 | sort | uniq -c
2 A
4 D
1 M
1 N
1 R
For the formatting, pipe to ... | sed -r 's/(\w) (\w)/\2 ==> \1/'
A ==> 2
D ==> 4
M ==> 1
N ==> 1
R ==> 1
Or, do everything in awk
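A sketch of what an awk-only version could look like, assuming the input has no header lines and the value to count is in the second column. The counts_file name is hypothetical. The iteration order of for (c in cnt) is unspecified in awk, so the output is piped through sort to get a stable order.

```shell
# Illustrative input: the data rows from the question, without the header lines.
counts_file=$(mktemp)
printf '%s\n' '2 A' '2 D' '3 D' '3 D' '3 A' '7 N' '7 M' '1 D' '1 R' > "$counts_file"

# Count each distinct value of column 2, then print "value ==> count".
result=$(awk '{ cnt[$2]++ } END { for (c in cnt) print c " ==> " cnt[c] }' "$counts_file" | sort)
printf '%s\n' "$result"
```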

Need to delete last column in every third line of a file

So far, I have the following code, which lets me loop through a file and identify every third line. However, I want to delete the last column on every third line, and I'm having trouble coming up with the syntax:
#!/bin/bash
counter=1
lines=$(wc -l < 'test_file')
echo $lines
while read line; do
if (( $counter % 3 == 0 )); then
#Need a good one liner here to solve the problem!
else
echo "$line is not a multiple of three."
fi
((counter = counter + 1))
done < 'test_file'
I'm open to using awk, but I'm not sure how to do that inline on a line within a file. If it can be done with parameters on the entire file that is fine.
Also, if the divisible by 3 line contains only one field, that should be deleted as well. There will never be an already existing blank line.
Sample input:
my dog smells
my cat bites
my fish swims
a b c
d e f
g h i
Sample output:
my dog smells
my cat bites
my fish
a b c
d e f
g h
awk is easier to handle this kind of task:
awk 'NR%3==0{NF--}7' file
let's do a little test:
kent$ cat f
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
kent$ awk 'NR%3==0{NF--}7' f
a b c d
a b c d
a b c
a b c d
a b c d
a b c
a b c d
a b c d
a b c
a b c d
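In the one-liner above, NR%3==0 selects every third record, NF-- drops its last field, and the trailing 7 is just a non-zero (true) pattern that triggers the default print action. Some awk implementations do not rebuild the line when NF is decremented, so the following is a hedged variant that edits the line text with sub() instead. It is a sketch, not the answer's method, and it assumes fields are separated by spaces or tabs. The input and output variable names are illustrative.

```shell
# Sample input from the question.
input=$(printf '%s\n' 'my dog smells' 'my cat bites' 'my fish swims' 'a b c' 'd e f' 'g h i')

# On every third line, strip the last run of non-blank characters together
# with the blanks around it; a single-field line becomes empty.
# Every line, edited or not, is then printed.
output=$(printf '%s\n' "$input" |
  awk 'NR % 3 == 0 { sub(/[ \t]*[^ \t]+[ \t]*$/, "") } { print }')
printf '%s\n' "$output"
```

Unlike NF--, this leaves the spacing between the remaining fields untouched, because the line is never re-joined with the output field separator.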
This might work for you (GNU sed):
sed '3~3s/\s\S\+$//' file

Formatting Errors with tail

How can I correctly parse this file through tail, without formatting errors?
I am using tail within cygwin to parse the last ten lines of two files. One file parses through correctly, the other contains a space between every character.
$ tail file2.txt -n 4
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
$ tail file1.txt -n 4
P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I read the raw contents of the file in Python I get the following, which I think is a load of Unicode characters:
In [1]: open('file1.text', 'r').read()
Out[1]: '\xff\xfeP\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00s\x00t\x00a\x00r\x00t\x00.\x00\r\x00\n\x00E\x00l\x00a\x00p\x00s\x00e\x00d\x00 \x00t\x00i\x00m\x00e\x00:\x00 \x005\x000\x00.\x001\x007\x005\x007\x005\x004\x008\x00 \x00s\x00e\x00c\x00s\x00.\x00\r\x00\n\x00.\x00.\x00.\x00P\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00E\x00x\x00e\x00c\x00u\x00t\x00e\x00d\x00.\x00\r\x00\n\x00\r\x00\n\x00R\x00e\x00s\x00u\x00l\x00t\x00:\x00 \x00S\x00u\x00c\x00c\x00e\x00s\x00s\x00\r\x00\n\x00\r\x00\n\x00'
In [2]: print open('temp.txt', 'r').read()
■P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I copy the entire content of file1.txt into a new file test.txt - the issue does not reoccur.
$ tail test.txt
Package start.
Elapsed time: 50.1757548 secs.
...Package Executed.
Result: Success
The file seems to have the characters \x00 between every character and \xff at the start.
The file is in UTF-16 format, which uses two 8-bit bytes to represent most characters (and four 8-bit bytes for some characters). Each of the 128 ASCII characters is represented as two bytes: a zero byte and a byte containing the actual character value. The \xff\xfe sequence at the start is a Byte Order Mark (BOM); it indicates whether the remaining characters are represented with the high-order or low-order byte first. Here \xff\xfe means low-order byte first (little-endian), which is why each character in the dump is followed by \x00.
UTF-16 is one of several ways to represent Unicode text. It's most commonly used in Microsoft Windows.
I'm not sure why the null characters appear as spaces. That may be due to the way your terminal emulator behaves.
Use the iconv command to convert the file from UTF-16 to some other format.
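A minimal sketch of that conversion, assuming an iconv that accepts the encoding names UTF-16, UTF-16LE and UTF-8 (GNU libc's iconv does). The sample file is generated here only so the example is self-contained; the utf16_file name is hypothetical. The tr step is an addition of this sketch: it drops the carriage returns from the Windows line endings visible in the Python dump.

```shell
# Build a small UTF-16 sample: BOM (\xff\xfe) followed by little-endian text
# with Windows line endings, similar to the file in the question.
utf16_file=$(mktemp)
{
  printf '\377\376'
  printf 'Package start.\r\n...Package Executed.\r\nResult: Success\r\n' |
    iconv -f UTF-8 -t UTF-16LE
} > "$utf16_file"

# Decode to UTF-8 (the BOM tells iconv the byte order), strip the carriage
# returns, and only then hand the text to tail.
decoded=$(iconv -f UTF-16 -t UTF-8 "$utf16_file" | tr -d '\r' | tail -n 2)
printf '%s\n' "$decoded"
```

To convert the file once rather than on every read, the iconv output can be redirected to a new file and tail run on that instead.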
