Put every X rows of input into a new column - linux

I have a file with 3972192 lines and two tab-separated values on each line. I would like to put every 47288 lines into a new column (which gives 84 columns). I read this other question (Put every N rows of input into a new column), which does the same as I want, but with awk I get:
awk: program limit exceeded: maximum number of fields size=32767
If I do it with pr, the column limit is 36.
To work around this, I first selected column 2 with awk:
awk '{print $2}' input_file > values_file
To get the first-column values I did:
awk '{print $1}' input_file > headers_file
head -n 47288 headers_file > headers_file2
Once I have both files, I put them together with paste:
paste headers_file2 values_file > Desired_output
Example:
INPUT:
-Line1: ABCD 12
-Line2: ASDF 3435
...
-Line47288: QWER 345466
-Line47289: ABCD 456
...
-Line94576: QWER 25
...
-Line3972192: QWER 436
DESIRED OUTPUT:
- Line1: ABCD 12 456 ....
...
- Line47288: QWER 345466 25 .... 436
Any advice? Thanks in advance.

I suppose each block follows the same pattern, i.e. the first column repeats in the same order [ABCD ASDF ... QWER] and so on.
If so, you have to take the first column of the first block (47288 lines) and write it to the target file.
Then you have to take the second column of each block and paste it onto the target file.
I tried it with this data file:
ABCD 1001
EFGH 1002
IJKL 1003
MNOP 1004
QRST 1005
UVWX 1006
ABCD 2001
EFGH 2002
IJKL 2003
MNOP 2004
QRST 2005
UVWX 2006
ABCD 3001
EFGH 3002
IJKL 3003
MNOP 3004
QRST 3005
UVWX 3006
ABCD 4001
EFGH 4002
IJKL 4003
MNOP 4004
QRST 4005
UVWX 4006
ABCD 5001
EFGH 5002
IJKL 5003
MNOP 5004
QRST 5005
UVWX 5006
And with this script:
#!/bin/bash
#target number of lines, change to 47288
LINES=6
INPUT='data.txt'
TOTALLINES=$(wc -l < "$INPUT")
TOTALBLOCKS=$((TOTALLINES / LINES))
#getting first block of target file, the first column of first LINES of data file
head -n $LINES "$INPUT" | cut -f1 > target.txt
#get second column of each line, by blocks, and paste it into target file
BLOCK=1
while [ $BLOCK -le $TOTALBLOCKS ]
do
HEADVALUE=$((BLOCK * LINES))
head -n $HEADVALUE "$INPUT" | tail -n $LINES | cut -f2 > tmpcol.txt
cp target.txt targettmp.txt
paste targettmp.txt tmpcol.txt > target.txt
BLOCK=$((BLOCK+1))
done
#removing temp files
rm -f targettmp.txt
rm -f tmpcol.txt
And I got this target file:
ABCD 1001 2001 3001 4001 5001
EFGH 1002 2002 3002 4002 5002
IJKL 1003 2003 3003 4003 5003
MNOP 1004 2004 3004 4004 5004
QRST 1005 2005 3005 4005 5005
UVWX 1006 2006 3006 4006 5006
I hope this helps you.
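As an aside, the awk field limit that started this question can be sidestepped by building each output line as a string rather than as fields. A minimal sketch with a block size of 3 on toy tab-separated data (input_file and n=3 are placeholders; substitute 47288 and the real file):

```shell
# toy data: two blocks of 3 rows, keys repeating in the same order
printf 'ABCD\t12\nASDF\t3435\nQWER\t345466\nABCD\t456\nASDF\t99\nQWER\t25\n' > input_file
awk -F'\t' -v n=3 '{
    i = (NR-1) % n            # row position inside the current block
    if (NR <= n) out[i] = $1  # the first block supplies the keys
    out[i] = out[i] "\t" $2   # append the value from the current block
} END {
    for (j = 0; j < n; j++) print out[j]
}' input_file
# -> ABCD  12      456
#    ASDF  3435    99
#    QWER  345466  25
```

Since each line is a plain string, the number of blocks is limited only by memory, not by awk's maximum field count.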

Related

Filter two big files with many columns

I have two big files (millions of lines) and I don't have access to a database. I have to use bash.
The first file is something like:
NUMBER CODE CAMP2
1222 aa132 3264
1223 ab124 4283
1224 af121 6224
1225 ag172 9235
1226 aw183 1229
.
.
.
And my second file is something like:
NUMBER NAME CAMP3
1222 Juan 1111
1223 Carlos 2222
1225 Jesus 4444
1226 Mosies 5555
.
.
.
And I need to join the files on NUMBER:
NUMBER CODE CAMP2 NAME CAMP3
1222 aa132 3264 Juan 1111
1223 ab124 4283 Carlos 2222
1225 ag172 9235 Jesus 4444
1226 aw183 1229 Mosies 5555
I tried reading line by line with a loop, but it takes a lot of time.
comm is not possible because there are many columns.
The two files do not have the same number of lines: there are lines in the first file that are not in the second, and vice versa.
My code so far is very simple:
while read -r line
do
    numer=$(echo "$line" | awk '{print $1}')
    search=$(grep "$numer" file2)
    if [ -n "$search" ]; then
        echo "$line $search" >> file_output
    fi
done < file1
The while works, but takes a long time.
The following works, but may not be efficient:
$ head f*
==> f1.txt <==
NUMBER CODE CAMP2
1222 aa132 3264
1223 ab124 4283
1224 af121 6224
1225 ag172 9235
1226 aw183 1229
==> f2.txt <==
NUMBER NAME CAMP3
1222 Juan 1111
1223 Carlos 2222
1225 Jesus 4444
1226 Mosies 5555
$ awk '{ while(getline line < "f2.txt") {split(line, a, " "); if(a[1] == $1) {print $1, $2, $3, a[2], a[3] }} close("f2.txt"); }' f1.txt
NUMBER CODE CAMP2 NAME CAMP3
1222 aa132 3264 Juan 1111
1223 ab124 4283 Carlos 2222
1225 ag172 9235 Jesus 4444
1226 aw183 1229 Mosies 5555
Realized you had added similar code (below) in your posting and were looking for an efficient way. Given the format of the files, join will work straight off here:
# join file1 file2
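Note that join(1) requires both inputs to be sorted on the join field, so a sort step may be needed first. A minimal sketch on a slice of the sample data (file names are illustrative):

```shell
printf '1222 aa132 3264\n1223 ab124 4283\n1224 af121 6224\n' > file1
printf '1222 Juan 1111\n1223 Carlos 2222\n' > file2
# join needs both files sorted on the join key (field 1 by default)
sort -k1,1 file1 > file1.sorted
sort -k1,1 file2 > file2.sorted
join file1.sorted file2.sorted
# -> 1222 aa132 3264 Juan 1111
#    1223 ab124 4283 Carlos 2222
```

By default join prints only lines whose key appears in both files; -a 1 or -a 2 would also keep unpaired lines from one side.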
Earlier response: you can perhaps extract the first field of the first file using awk and then look for it in the second file with grep. On finding a match, join the two parts. Something like the following should help, although the cut of fields 2,3 could be improved to read the rest of the line:
for name in `cat file1 | awk '{print $1}'`
do
result2=`grep $name file2`
if [ $? -eq 0 ];
then
part1=`grep $name file1`
part2=`echo $result2 | cut -d' ' -f2,3`
echo "$part1 $part2"
fi
done
This should do it:
#!/bin/bash
awk '{ if (NR==FNR) {
    # first file (filen2): remember everything after field 1, keyed on field 1
    r[$1]=substr($0, match($0,/ [^ ]/)+1)
  } else {
    # second file (filen1): print the line plus the saved remainder, then mark it used
    print($0,r[$1])
    r[$1]=""
  }
} END {
  # keys that only appeared in filen2
  for (i in r)
    if (r[i]!="")
      print(i," . "," .",r[i])
}
' filen2 filen1
With filen1 being:
NUMBER CODE CAMP2
1222 aa132 3264
1223 ab124 4283
1224 af121 6224
1225 ag172 9235
1226 aw183 1229
And filen2 being:
NUMBER NAME CAMP3
1222 Juan 1111
1223 Carlos 2222
1225 Jesus 4444
1226 Moises 5555
1248 Antonio 8888
2185 Pablo 7754
You should get this output:
NUMBER CODE CAMP2 NAME CAMP3
1222 aa132 3264 Juan 1111
1223 ab124 4283 Carlos 2222
1224 af121 6224
1225 ag172 9235 Jesus 4444
1226 aw183 1229 Moises 5555
2185 . . Pablo 7754
1248 . . Antonio 8888
Change the dots into spaces if you do not want dots in the output.

Match lines with same repeated digit, next to each other or with a space in between

I am using grep to match only lines which have two or more consecutive occurrences of the same digit, even if separated by a space.
This is the output I am currently getting:
-bash-4.2$ cat test_file5
1234 4567 7890 0984
4565 5678 8900 0767
1234 5678 9021 7654
4556 7890 9005 4432
-bash-4.2$ grep "\([0-9]\)\\1" test_file5
4565 5678 8900 0767
4556 7890 9005 4432
Expected output:
1234 4567 7890 0984
4565 5678 8900 0767
4556 7890 9005 4432
You can allow zero or more non-digit characters in between:
grep -E '([0-9])[^0-9]*\1' test_file5
Or, if you want to be more rigid, limit it to between 0 and 1 spaces:
grep -E '([0-9])[ ]{0,1}\1' test_file5
As @Sndeep pointed out in a comment, the question mark means "the preceding item is optional", so you can also write:
grep -E '([0-9]) ?\1' test_file5
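Recreating the sample file shows the backreference at work (the data is the same test_file5 content from the question):

```shell
printf '1234 4567 7890 0984\n4565 5678 8900 0767\n1234 5678 9021 7654\n4556 7890 9005 4432\n' > test_file5
# a digit, optionally one space, then the same digit again via the backreference \1
grep -E '([0-9]) ?\1' test_file5
# -> 1234 4567 7890 0984
#    4565 5678 8900 0767
#    4556 7890 9005 4432
```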

filter out content between two files

I have two files
conditions.txt
abcd
efgh
logs.txt
efgh
ijkl
mnop
qrst
I am expecting output to be:
ijkl
mnop
qrst
Actual output:
efgh
ijkl
ijkl
mnop
mnop
qrst
qrst
Here's the code I have so far:
func(){
    while read -r condition
    do
        if [[ $line = $condition ]] ; then
            :
        else
            echo "$line"
        fi
    done < conditions.txt
}
while read -r line
do
    func "$line"
done < logs.txt
Try using grep:
$ grep -v -f conditions.txt logs.txt
From the man page for GNU grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used multiple times or is combined with the -e (--regexp) option, search for all patterns
given. The empty file contains zero patterns, and therefore matches nothing.
If you don't feel like re-inventing wheels ...
grep -vf conditions.txt logs.txt
ijkl
mnop
qrst
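One caveat with -f: the lines of conditions.txt are interpreted as regular expressions and can match substrings of log lines. If the comparison should be literal and whole-line, -F and -x can be added. A sketch on the sample data:

```shell
printf 'abcd\nefgh\n' > conditions.txt
printf 'efgh\nijkl\nmnop\nqrst\n' > logs.txt
# -v invert match, -x whole-line match, -F fixed strings, -f patterns from file
grep -vxFf conditions.txt logs.txt
# -> ijkl
#    mnop
#    qrst
```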

Linux shell command to copy text data from a file to another

file_1 contents:
aaa 111 222 333
bbb 444 555 666
ccc 777 888 999
file_2 contents:
ddd
eee
fff
How do I copy only part of the text from file_1 to file_2,
so that file_2 becomes:
ddd 111 222 333
eee 444 555 666
fff 777 888 999
Try with awk:
awk 'NR==FNR{a[FNR]=$2FS$3FS$4;next} {print $0, a[FNR]}' file_1 file_2
Explanation:
NR is the overall input line number, while FNR is the line number within the current file; you can see the difference with:
$ awk '{print NR,FNR}' file_1 file_2
1 1
2 2
3 3
4 1
5 2
6 3
So, the condition NR==FNR is only true while reading the first file, and that is when the columns $2, $3, and $4 get saved in a[FNR]. After file_1 has been read, NR==FNR becomes false and the block {print $0, a[FNR]} is executed, where $0 is the whole line of file_2.
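For data shaped exactly like this, a cut/paste pipeline is a possible alternative to the awk one-liner, assuming single-space-separated fields as shown (the intermediate file name cols.tmp is illustrative):

```shell
printf 'aaa 111 222 333\nbbb 444 555 666\nccc 777 888 999\n' > file_1
printf 'ddd\neee\nfff\n' > file_2
# fields 2 to the end of each line of file_1
cut -d' ' -f2- file_1 > cols.tmp
# glue them after the corresponding line of file_2, space-separated
paste -d' ' file_2 cols.tmp
# -> ddd 111 222 333
#    eee 444 555 666
#    fff 777 888 999
```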

How can I select same sequence of lines which is present multiple times in a file?

Suppose I have a file as following:
101 abcd <time>
106 efgh <time>
107 ijkl <time>
110 pqrs <time>
105 trsf <time>
101 yrte <time>
109 tyti <time>
110 tyui <time>
I want to do some operations on the chunk of lines starting from 101 and ends at 110.
I'm able to solve it when there is only one occurrence of 101 and 110 in the file:
sed -n '/101/,/110/p' file1 > file2
With this command I can pull out the chunk of lines I want to work on.
Please help me find the logic to save the first chunk in a first file, the second matched chunk in a second file, and so on.
I'm writing the script on AIX.
You could do:
awk '/^101/ && !i { c++; i=1 } i { print > ("file" c) } /^110/ { i=0 }' input
This simply increments a counter (c) each time a line matches ^101, but only if we are not already inside a block being printed. The second clause prints to an output file named with the counter when appropriate, and the third clause turns off the flag (i) that marks whether the current line is inside a block to be printed.
Another option is to simply do:
awk '/^101/,/^110/{ print > ("output" c) } /^110/{c++}' c=1 input
You could try awk; here is a short one-liner for the job:
awk '/101/{++i;f=1} f{print $0>"file"i} /110/{f=0}' file
test with your example:
kent$ echo "101 abcd <time>
106 efgh <time>
107 ijkl <time>
110 pqrs <time>
105 trsf <time>
101 yrte <time>
109 tyti <time>
110 tyui <time>"|awk '/101/{++i;f=1} f{print $0>"file"i} /110/{f=0}'
kent$ head *
==> file1 <==
101 abcd <time>
106 efgh <time>
107 ijkl <time>
110 pqrs <time>
==> file2 <==
101 yrte <time>
109 tyti <time>
110 tyui <time>
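A caveat for the real file: each output file stays open, and non-GNU awks (including the stock awk on AIX) allow only a small number of simultaneously open files. Closing each chunk as soon as its block ends avoids the limit. A sketch (blocks.txt and the chunk names are illustrative):

```shell
printf '101 abcd t1\n106 efgh t2\n110 pqrs t3\n101 yrte t4\n110 tyui t5\n' > blocks.txt
# close() each output file once its 110 line has been written
awk '/^101/{++i; f=1} f{print > ("chunk" i)} /^110/{f=0; close("chunk" i)}' blocks.txt
```

chunk1 then holds the first block and chunk2 the second, and no more than one chunk file is open at a time.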
