Printing only 6- and 10-character words in a linux file - linux

This is my file:
$cat filename
10023a,vija45,8877au,qwer65,guru12 0099888das,baburam123,ganeshan1,feild55512
What I tried is the sed command below, to output only the 6-character words in that file:
sed -ne 's/[a-z][0-9]\{6}/&/p' filename
It displays all words and lines.
Could anyone please help me with this?
Expected output is
vija45 baburam123
8877au ganeshan1
qwer65 feild55512
guru12

Use this:
tr "," "\n" <file | grep '^.\{6\}$\|^.\{10\}$'
First, tr replaces every , with a newline, so that each comma-separated segment ends up on its own line.
Then grep looks for lines that are exactly 6 or 10 characters long and prints them.
With your given example, the output would then be:
10023a
vija45
8877au
qwer65
baburam123
feild55512
If guru12 0099888das must also be matched as a 6-character and a 10-character word, just change the tr part to also include spaces:
tr ", " "\n" <file | grep '^.\{6\}$\|^.\{10\}$'
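The same split-then-filter idea can be sketched with awk doing the length check instead of grep. This is only an illustration: the function name words_6_or_10 is hypothetical, and it assumes words are separated by commas or single spaces only.

```shell
# Hypothetical helper: print every word that is exactly 6 or 10 characters long.
# Assumes words are separated by commas or spaces (an assumption, not a given).
words_6_or_10() {
  # Put each word on its own line, then keep lines of length 6 or 10.
  tr ', ' '\n\n' | awk 'length($0) == 6 || length($0) == 10'
}

# Usage with the sample line from the question:
printf '%s\n' '10023a,vija45,8877au,qwer65,guru12 0099888das,baburam123,ganeshan1,feild55512' |
  words_6_or_10
```

Spelling out both newlines in the second tr set avoids relying on how a given tr pads a shorter second set.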

I suggest using grep for matching. With -o, each 6- or 10-character word is printed on its own line:
grep -o '\b\w\{6\}\b\|\b\w\{10\}\b' file

sed '
# turn every run of spaces/commas into a single comma and wrap the line in commas
s/[[:space:],]\{1,\}/,/g
s/.*/,&,/
# double each comma so that every word has its own pair of delimiters
s/,/,,/g
# remove words that are longer than 10, shorter than 6, or 7 to 9 characters long
s/,[^,]\{11,\},//g;s/,[^,]\{1,5\},//g;s/,[^,]\{7,9\},//g
# collapse the remaining delimiters and trim them from both ends
s/,\{1,\}/,/g;s/^,//;s/,$//
# remove empty lines
/^$/d
# 1 word per line (optional)
y/,/\n/
' YourFile
Details:
Prints every 6- or 10-character word found in each line (optionally one word per output line).
The comments in the script explain each step.
Adapted for comma-separated input.
Correction: the first version was missing some g flags, had a small bug in the removal of short words, and only kept 6-character words; 10-character words are now kept as well.

Related

How to convert an uneven tab separated file using sed?

How to convert an uneven TAB separated input file to CSV or PSV using sed command?
28828082-1 04/08/19 08:48 04/11/19 12:37 04/12/19 16:22 4/15-4/16 04/17/19 2 9 LCO W OIP 04/08/19 08:53 21 1 58.00 9 222 79 FEDX FEDXH SL3 484657064673 0410099900691041119 SMITHFIELD RI 02917 "41.890066 , -71.548680" YES
The above is one row. I tried sed -r 's/^\s+//;s/\s+/|/g', but the result was not as expected.
gawk to the rescue!
$ awk -vFPAT='([^[:space:]]+)|("[^"]+")' -v OFS='|' '$1=$1' file
28828082-1|04/08/19|08:48|04/11/19|12:37|04/12/19|16:22|4/15-4/16|04/17/19|2|9|LCO|W|OIP|04/08/19|08:53|21|1|58.00|9|222|79|FEDX|FEDXH|SL3|484657064673|0410099900691041119|SMITHFIELD|RI|02917|"41.890066 , -71.548680"|YES
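This defines the field pattern (FPAT, a gawk extension) as either a run of non-space characters or a quoted value that may include spaces (but not escaped quotes), sets the output field separator to |, and uses $1=$1 to force the record to be rebuilt with the new separator. Because that assignment is used as the pattern, only lines whose first field is non-empty and non-zero are printed.
A better version would be ... '{$1=$1; print}', which prints every line regardless of the value of the first field.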
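Where gawk is not available, the same "field is either a quoted string or a run of non-blanks" idea can be sketched in portable awk with a match() loop. This is only a sketch: the function name to_psv is hypothetical, and like the FPAT version it does not handle escaped quotes inside a quoted value.

```shell
# Hypothetical helper: convert whitespace-separated input to pipe-separated output,
# keeping double-quoted values (which may contain spaces) as single fields.
to_psv() {
  awk '{
    out = ""
    rest = $0
    # Repeatedly take the next token: a quoted string or a run of non-blanks.
    while (match(rest, /"[^"]*"|[^ \t]+/)) {
      out = out (out == "" ? "" : "|") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out
  }'
}

# Usage:
printf 'SMITHFIELD  RI\t02917 "41.890066 , -71.548680"   YES\n' | to_psv
```

The regex relies on awk's leftmost-longest matching, so a token that starts with a quote is taken up to its closing quote rather than cut at the first space.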
Of course, if all the field delimiters are tabs and the quoted strings don't include any tabs, it's much simpler.
Your question isn't clear but is this what you're trying to do?
$ printf 'now\t"is the winter"\tof\t"our discontent"\n' > file
$ cat file
now "is the winter" of "our discontent"
$ tr '\t' ',' < file
now,"is the winter",of,"our discontent"
$ tr '\t' '|' < file
now|"is the winter"|of|"our discontent"
Your initial attempt was very close:
sed 's/[[:space:]]\+/|/g' input.txt
Explanation:
[[:space:]] Match a single whitespace character such as space/tab/CR/newline.
\+ Match one or more of the preceding expression.
Update:
If you require two or more whitespace characters:
sed 's/[[:space:]]\{2,\}/|/g' input.txt
\{2,\} Match two or more of the preceding expression.
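To see the difference between "one or more" and "two or more", here is a small sketch. The helper name ws_to_pipe is hypothetical; it uses the interval form \{N,\} for both cases, since \+ is a GNU extension in basic regular expressions.

```shell
# Hypothetical helper: replace each run of at least $1 whitespace characters with "|".
ws_to_pipe() {
  sed "s/[[:space:]]\{$1,\}/|/g"
}

# Sample input: two spaces, one space, two tabs.
printf 'a  b c\t\td\n' | ws_to_pipe 1   # every run is a separator: a|b|c|d
printf 'a  b c\t\td\n' | ws_to_pipe 2   # single spaces are kept:   a|b c|d
```

Requiring two or more characters is useful when single spaces can occur inside a field, as with SMITHFIELD RI in the sample row.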

Append then delete line to another line, only if it does not contain character

In my text file, there are 6 lines in a group separated by two blank lines. I have printed the line number for each line to the text document.
365:--------------------------------------------------------------------------------
366:--------------------------------------------------------------------------------
367:--------------------------------------------------------------------------------
368:--------------------------------------------------------------------------x-----
369:--------------------4-----------------------------------------------------------
370:--0-----------------------------------------------------------------------------
371:
372:
373:--------------------------------------------------------------------|
374:--------------------------------------------------------------------|
375:------------0--------2--------3h----2h----0-----2-------------------|
376:---2-----------------------------------------------------2----------|
377:--------------------------------------------------------------------|
378:--------------------------------------------------------------------|
Currently only 80 characters are printed to a line, so the rest of the data continues in the next group. For example, Line 365 corresponds to Line 373.
For only lines that do not contain a vertical bar (i.e., lines 365-370), I am trying to 1) append the line that is 8 lines away, then 2) delete the appended line after it has been printed.
So, ideally:
365:----------------------------------------------------------------------------------------------------------------------------------------------------|
366:----------------------------------------------------------------------------------------------------------------------------------------------------|
367:--------------------------------------------------------------------------------------------0--------2--------3h----2h----0-----2-------------------|
368:--------------------------------------------------------------------------x--------2-----------------------------------------------------2----------|
369:--------------------4-------------------------------------------------------------------------------------------------------------------------------|
370:--0-------------------------------------------------------------------------------------------------------------------------------------------------|
I can isolate the lines that do not contain a vertical bar using grep
grep -vn \| song.txt
I know that SED or AWK are likely my best bet, but I'm not sure how to proceed from here.
Just massage this approach to suit:
$ seq 16 | awk 'NR>8{print a[NR%8], $0} {a[NR%8]=$0}'
1 9
2 10
3 11
4 12
5 13
6 14
7 15
8 16
e.g. assuming 2 blank lines at the end of your input to make it blocks of 8 lines:
$ awk 'NR>8{print a[NR%8] $0} {a[NR%8]=$0}' file
--------------------------------------------------------------------------------------------------------------------------------------------------|
--------------------------------------------------------------------------------------------------------------------------------------------------|
------------------------------------------------------------------------------------------0--------2--------3h----2h----0-----2-------------------|
-------------------------------------------------------------------------x-------2-----------------------------------------------------2----------|
-------------------4------------------------------------------------------------------------------------------------------------------------------|
-0------------------------------------------------------------------------------------------------------------------------------------------------|
or if you don't have those blank lines after the last block:
$ awk '!NF{next} ++cnt>6{print a[cnt%6] $0} {a[cnt%6]=$0}' file
--------------------------------------------------------------------------------------------------------------------------------------------------|
--------------------------------------------------------------------------------------------------------------------------------------------------|
------------------------------------------------------------------------------------------0--------2--------3h----2h----0-----2-------------------|
-------------------------------------------------------------------------x-------2-----------------------------------------------------2----------|
-------------------4------------------------------------------------------------------------------------------------------------------------------|
-0------------------------------------------------------------------------------------------------------------------------------------------------|
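A variation on the same buffering idea, as a sketch only: instead of assuming exactly two groups, accumulate each row until a line ending in | arrives. The function name join_wrapped and the default group size of 6 are assumptions for illustration, and the input is assumed not to carry the line-number prefixes shown in the question.

```shell
# Hypothetical helper: glue wrapped groups back together.
# $1 is the number of rows per group (defaults to 6).
join_wrapped() {
  awk -v n="${1:-6}" '
    !NF { next }                        # skip the blank separator lines
    { i = cnt++ % n; a[i] = a[i] $0 }   # append this piece to its row
    /\|$/ { print a[i]; a[i] = "" }     # a trailing bar completes the row
  '
}

# Usage with 2 rows per group:
printf 'ab\ncd\n\n\nef|\ngh|\n' | join_wrapped 2
# abef|
# cdgh|
```

Because rows are only printed once their closing | is seen, this also copes with a row that wraps across more than two groups.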
A little bit ugly, but working:
Split your input:
grep -Ev '^$|\|' song.txt >file1
grep '|' song.txt >file2
And put it together:
paste -d "" file1 file2
I usually use the vim program for this type of work. For example, assuming you have a file named file_name.txt with the following content
-------------------------8----
------------0--------2--------|
---2--------------------------|
------------------aaa---------|
---------------984asds--------|
---------t6776----------------|
with the following command
vim -c ":6y" -c ":put" -c ":1" -c ":join!" -c ":6d" -c ":wq" file_name.txt
the program opens file_name.txt on the first line, copies the sixth line, pastes the copied contents below the current line (as the second line), goes to the first line, joins the first line with the second, deletes the line that was copied (the sixth line), then saves and closes the file. This command produces the following result:
-------------------------8-------------------984asds--------|
------------0--------2--------|
---2--------------------------|
------------------aaa---------|
---------t6776----------------|
This might work for you (GNU utils):
sed '/^$/d' file |
split -nr/6 --filter 'cat'|
paste -sd'\0'|
sed 's/|/&\n/g;s/\n$//'
This removes any blank lines using sed, splits the input into 6 chunks using the round-robin method and, instead of making separate files, outputs all the chunks interleaved to stdout. The lines are then pasted into one long line, which is split back into shorter lines using | as the record terminator.

Bash Shell Script: Concatenating lines that do not end in ^M

This is again a question related to end-of-line characters in Unix and Windows.
I have a SQL extract where some fields can contain text that has line breaks.
When I take this extract to a Linux machine and open it in vi with the :se list option set, I see text like below:
1 some broken Text part 1 - Line1$
2 other broken text part 2 -line 2^M$
3 good line ^M$
I need to detect lines that do not end in a carriage return (CR, shown as ^M) and check whether they contain a value with line breaks.
In the above extract, I basically need to join line 1 and line 2 to come up with just one line:
1 some broken Text part 1 - Line1 other broken text part 2 -line 2^M$
There should be no change to line 3, which would then become line 2 of the file.
I tried to remove \n using tr but then the whole file became just 1 line in VI.
After removing \n , I tried to then replace \r with \r\n but it introduced unexpected behavior in the file.
Any help to figure out this issue will be appreciated.
You could just replace \n with a space and \r with \n:
$ printf 'some broken Text part 1 - Line1
other broken text part 2 -line 2\r
goodline\r\n' > file.txt
$ cat -vE file.txt
some broken Text part 1 - Line1$
other broken text part 2 -line 2^M$
goodline^M$
$ tr '\n\r' ' \n' < file.txt
some broken Text part 1 - Line1 other broken text part 2 -line 2
goodline
Below did the trick:
tr -d '\n' < file.txt > step1-file.txt
sed -i -e 's/\r/\r\n/g' step1-file.txt
Somehow the perl line below, which I was trying to use earlier, was introducing unexpected behavior.
perl -pi -e 's/\r/\r\n/' step1-file.txt
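A single-pass alternative, sketched in awk rather than tr plus sed: buffer lines until one ends in CR, then print the buffer. The function name join_broken_lines is hypothetical. It assumes good lines end in CR LF and broken fragments end in a bare LF, and it joins fragments with one space, as in the expected result in the question.

```shell
# Hypothetical helper: join lines that do not end in CR with the line(s) that follow.
# The CR LF endings of complete lines are preserved.
join_broken_lines() {
  awk '
    { buf = (buf == "" ? $0 : buf " " $0) }   # collect fragments, space-separated
    /\r$/ { print buf; buf = "" }             # a trailing CR completes the record
    END { if (buf != "") print buf }          # flush a final unterminated fragment
  '
}

# Usage:
printf 'some broken Text part 1 - Line1\nother broken text part 2 -line 2\r\ngood line \r\n' |
  join_broken_lines | cat -v
```

Piping through cat -v makes the preserved ^M visible, so the result can be checked without opening the file in vi.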

Is there a command in bash to get the first n words instead of n lines similar to 'head -n'?

I want to extract the first say 1M words from a large text file, can I do it in command line, instead of writing script?
Update: The data is one sentence per line, words are separated by white space, this structure should be preserved. I've done it with python with a word counter, just wondering whether it can be done with command line in a smarter way.
Yes.
tr '\n' ' ' < inputfile | cut -d' ' -f 1-1000000 > outputfile
Takes the first 1M words from inputfile (a word in this case is anything between two spaces) then outputs them on one line to outputfile. To have them on separate lines in the output (as per @triplee's comment):
tr ' ' '\n' < inputfile | head -n 1000000 > outputfile
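Neither pipeline keeps the one-sentence-per-line structure that the update asks for. A sketch that does, using awk as a word counter: the function name first_n_words is hypothetical, and words are assumed to be separated by blanks. Runs of whitespace within a line are collapsed to single spaces.

```shell
# Hypothetical helper: print the first $1 words of stdin, keeping line breaks.
first_n_words() {
  awk -v n="$1" '
    BEGIN { if (n <= 0) exit }          # nothing to print
    {
      out = ""
      # Copy words from this line until the line ends or the quota is used up.
      for (i = 1; i <= NF && c < n; i++) {
        out = out (i > 1 ? " " : "") $i
        c++
      }
      print out
      if (c >= n) exit                  # stop reading once n words are printed
    }
  '
}

# Usage:
printf 'a b c\nd e\nf\n' | first_n_words 4
# a b c
# d
```

Because awk exits as soon as the quota is reached, the rest of a large input file is never processed.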

How can I swap two lines using sed?

Does anyone know how to replace line a with line b and line b with line a in a text file using the sed editor?
I can see how to replace a line in the pattern space with a line that is in the hold space (i.e., /^Paco/x or /^Paco/g), but what if I want to take the line starting with Paco and replace it with the line starting with Vinh, and also take the line starting with Vinh and replace it with the line starting with Paco?
Let's assume for starters that there is one line with Paco and one line with Vinh, and that the line Paco occurs before the line Vinh. Then we can move to the general case.
#!/bin/sed -f
/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
bnotdone
}
After matching /^Paco/ we read into the pattern buffer until s// succeeds (or EOF: the pattern buffer will be printed unchanged). Then we start over searching for /^Paco/.
cat input | tr '\n' 'ç' | sed 's/\(ç__firstline__\)\(ç__secondline__\)/\2\1/g' | tr 'ç' '\n' > output
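Replace __firstline__ and __secondline__ with your desired regexps. Be sure to substitute any instances of . in your regexp with [^ç]. If your text actually has ç in it, substitute something else that your text doesn't have. Note that GNU tr works on single bytes, so in a UTF-8 locale, where ç is two bytes, pick a single-byte placeholder character instead.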
Try this awk script:
s1="$1"
s2="$2"
awk -vs1="$s1" -vs2="$s2" '
{ a[++d]=$0 }
$0~s1{ h=$0;ind=d}
$0~s2{
a[ind]=$0
for(i=1;i<d;i++ ){ print a[i]}
print h
delete a;d=0;
}
END{ for(i=1;i<=d;i++ ){ print a[i] } }' file
Output:
$ cat file
1
2
3
4
5
$ bash test.sh 2 3
1
3
2
4
5
$ bash test.sh 1 4
4
2
3
1
5
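The script above relies on the s1 line appearing before the s2 line. A sketch that works in either order, assuming the whole file fits in memory: the function name swap_lines is hypothetical, and only the first line matching each pattern is swapped.

```shell
# Hypothetical helper: swap the first line matching $1 with the first line matching $2.
# Patterns are awk regexes; note that awk -v interprets backslash escapes in them.
swap_lines() {
  awk -v p1="$1" -v p2="$2" '
    { line[NR] = $0 }                   # remember every line
    !i1 && $0 ~ p1 { i1 = NR }          # first line matching p1
    !i2 && $0 ~ p2 { i2 = NR }          # first line matching p2
    END {
      if (i1 && i2) { t = line[i1]; line[i1] = line[i2]; line[i2] = t }
      for (i = 1; i <= NR; i++) print line[i]
    }
  '
}

# Usage: Vinh comes first here, and the swap still works.
printf 'a\nVinh 2\nb\nPaco 1\nc\n' | swap_lines '^Paco' '^Vinh'
```

If either pattern matches nothing, the input is printed unchanged.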
Use sed only for simple substitutions (or not at all). For anything more complicated, use a programming language.
A simple example from the GNU sed texinfo doc:
Note that on implementations other than GNU `sed' this script might
easily overflow internal buffers.
#!/usr/bin/sed -nf
# reverse all lines of input, i.e. first line became last, ...
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
# on the last line we're done -- print everything
$ p
# store everything on the buffer again
h

Resources