I have a huge file composed of the following:
this is text
1234.1234567
this is another text
1234.1234567
and so on
I would like to transform it to:
this is text:1234.1234567
this is another text:1234.1234567
Is this possible using sed, or any other similar command?
Thanks
If you just want to join lines using : as separator, you could use paste:
paste -d : - - < file.txt
Or using awk:
awk -v sep=: '{ if (NR % 2 == 0) { print prev sep $0 } else prev = $0 }' file.txt
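As a quick check that the two commands above agree, here is a minimal sketch on a four-line sample. The temporary file and the variable names out_paste and out_awk are only for illustration. One difference worth knowing: with an odd number of lines, paste still prints the last line followed by an empty field, while the awk version silently drops it.

```shell
# Build a small sample file (illustrative data taken from the question).
tmp=$(mktemp)
printf 'this is text\n1234.1234567\nthis is another text\n1234.1234567\n' > "$tmp"

# Join every two lines with ":" using paste.
out_paste=$(paste -d : - - < "$tmp")

# Same result with awk: remember odd lines, print on even lines.
out_awk=$(awk -v sep=: 'NR % 2 == 0 { print prev sep $0; next } { prev = $0 }' "$tmp")

echo "$out_paste"
echo "$out_awk"
rm -f "$tmp"
```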
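If some lines contain only text and the others contain only floating-point numbers, you can do the following. The number pattern is anchored so that a text line containing a digit or a dot is not mistaken for a number line:
awk '/^[0-9.]+$/ { print; next }
{ printf "%s:", $0 }' data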
data is the filename. You can redirect the output to another file.
How to get 1st field of a file only when 2nd field matches a given string?
#cat temp.txt
Ankit pass
amit pass
aman fail
abhay pass
asha fail
ashu fail
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'*
gives no output
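Another syntax with awk:
awk '$2 ~ /^fail$/{print $1}' input_file
This drops the unnecessary cat command.
^ matches the start of the string
$ matches the end of the string
Anchoring both ends makes sure the field is exactly "fail" rather than merely containing it.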
Either:
Your fields are not tab-separated or
You have blanks at the end of the relevant lines or
You have DOS line endings, so there are CRs at the end of every line and therefore also at the end of every $2 in every line (see Why does my tool output overwrite itself and how do I fix it?)
With GNU cat you can run cat -Tev temp.txt to see tabs (^I), CRs (^M) and line endings ($).
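If the cause turns out to be trailing blanks or CRs, one option is to strip them inside awk before comparing. This is a minimal sketch, assuming tab-separated fields; the sample file is built with printf purely for illustration and deliberately contains DOS line endings and one trailing space.

```shell
# Sample data with tabs, CRLF line endings, and a trailing blank on one line.
tmp=$(mktemp)
printf 'Ankit\tpass\r\naman\tfail\r\nasha\tfail \r\n' > "$tmp"

# Remove trailing spaces/CRs from the record; modifying $0 re-splits the fields,
# so $2 is clean by the time it is compared.
out=$(awk -F'\t' '{ sub(/[ \r]+$/, "") } $2 == "fail" { print $1 }' "$tmp")

echo "$out"
rm -f "$tmp"
```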
Your code seems to work fine when I remove the * at the end
cat temp.txt | awk -F"\t" '$2 == "fail" { print $1 }'
The other thing to check is if your file is using tab or spaces. My copy/paste of your data file copied spaces, so I needed this line:
cat temp.txt | awk '$2 == "fail" { print $1 }'
The other way of doing this is with grep:
cat temp.txt | grep 'fail$' | awk '{ print $1 }'
I am able to trim and transpose the data below with sed, but it takes considerable time. I hope it would be faster with awk. Any suggestions are welcome.
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
Which will be orders of magnitude faster. It just picks out the substring (e.g. "INX_8_60L") using substr and match. n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
Which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change above as needed) If not, let me know and I'm happy to help further.
Edit Per-Changes
Including the '?' isn't difficult, and I just copied the character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1}
END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, just operating on the first field as in @JamesBrown's answer, that would reduce to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed commands, separate the sed operations with semicolon instead.
Try to process the data in a single job and avoid regex. The version below reads the statically sized first block with substr() and inserts the ? while outputting.
$ awk '{
b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5)
}
END {
print b
}' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
FS="[[_ ]" # split field with regex
}
{
printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4 # output semicolons and fields
}
END {
print ""
}' file
Performance of solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
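The pipeline above reads from standard input, so it needs the data piped or redirected into it. A minimal sketch with two sample lines fed in via printf (the sample and the variable name out are only for illustration; note that it joins with an ASCII | rather than ¦):

```shell
# Feed two sample records into the sed | tr | sed pipeline.
out=$(printf '[INX_8_60L ] :9:Y\n[INX_8_60Z ] :9:N\n' |
    sed -e 's/ .*//' -e 's/\[INX/INX?/' |   # drop everything after the first space, insert "?"
    tr '\n' '|' |                           # join the lines with "|"
    sed -e '$s/|$//')                       # remove the trailing "|"

echo "$out"
```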
Another solution using GNU awk:
awk -F'[[ ]+' '
{printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
END{print ""}
' file
The field separator is set (with the -F option) so that the wanted parameter ends up in $2.
The main statement prints that parameter with the ? character inserted.
The variable o keeps track of whether the delimiter ¦ needs to be printed.
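gensub() is specific to GNU awk. If another awk is in use, the same idea can be sketched with sub() on a copy of the field. This is an assumption-laden variant, not the answer's own code: the counter n and the temporary variable s are names introduced here for illustration.

```shell
# Three sample records from the question.
tmp=$(mktemp)
printf '[INX_8_60L ] :9:Y\n[INX_8_60L ] :9:N\n[INX_8_60Z ] :9:Y\n' > "$tmp"

# Split on runs of "[" or space so that $2 is the name; insert "?" after INX
# with sub(), and print "¦" before every item except the first.
out=$(awk -F'[[ ]+' '{ s = $2; sub(/INX/, "INX?", s); printf "%s%s", (n++ ? "¦" : ""), s }
END { print "" }' "$tmp")

echo "$out"
rm -f "$tmp"
```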
I want to split the content into multiple files whose names use a date format, as follows:
Test_<ID name><ddmmyyyy>.CSV
How can I split according to that format?
Until now I have used:
awk -F"," 'NR>1 {print > "Test_<ID name><ddmmyyyy>.CSV_"$1".csv"}' Original.CSV
Edit
I got there with
awk -v DATE="$(date +"%d%m%Y")" -F"," 'BEGIN{OFS=","}NR>1 { gsub(/"/,"",$1); print > "Assignment_"$1"_"DATE".csv"}' Test_01012020.CSV
but then I want to include my column name too. How?
You could try passing shell variables into your awk command:
_DATE=$(date '+%d%m%Y')
_ID=my_value
F_EXT="${_ID}${_DATE}"
# here the awk variable "var" is set to the value of the shell variable "F_EXT"
awk -v var="${F_EXT}" -F"," 'NR>1 {print > "Test_" var ".CSV_"$1".csv"}' Original.CSV
(I didn't get where you were taking your "ID name", so here it's my_value)
Edit
If you want to include your column name, capture it from the header line, i.e. when NR==1:
awk -v DATE="$(date +"%d%m%Y")" -F"," 'BEGIN{OFS="," } NR==1 {COLUMN_NAME=$1} NR>1 { gsub(/"/,"",$1); print > "Assignment_"$1"_"COLUMN_NAME"_"DATE".csv"}' a.txt
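If "include my column name" instead means that every output file should start with the header row, a possible sketch is below. This is an interpretation of the question, not something the asker confirmed; the sample CSV, the seen array, and the dir variable are all made up for illustration.

```shell
# Hypothetical sample input in a scratch directory.
dir=$(mktemp -d)
printf 'id,value\n"A",1\n"B",2\n"A",3\n' > "$dir/Test_01012020.CSV"
DATE=$(date +%d%m%Y)

awk -v DATE="$DATE" -v dir="$dir" -F, '
NR == 1 { header = $0; next }          # remember the header line
{
    id = $1
    gsub(/"/, "", id)                   # strip quotes for the file name only
    out = dir "/Assignment_" id "_" DATE ".csv"
    if (!(out in seen)) {               # first row for this file: write header
        print header > out
        seen[out] = 1
    }
    print > out                         # then the data row, unchanged
}' "$dir/Test_01012020.CSV"

# Capture the results before cleaning up.
file_a=$(cat "$dir/Assignment_A_$DATE.csv")
file_b=$(cat "$dir/Assignment_B_$DATE.csv")
printf '%s\n---\n%s\n' "$file_a" "$file_b"
rm -rf "$dir"
```

Unlike the command above, this sketch leaves the quotes in the data rows and only removes them for the file name.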
I have coded the following lines:
ARRAY=($(awk 'FS = ";" {print $3}' file.txt))
LINE_CREATOR=`echo "aaaa;bbbb;cccccccc" |
'{awk -F";"};
END
for (i in ARRAY)
{
print $'${ARRAY['i']}'
}
}'`
file.txt looks like this:
1;8;3
4;6;1
7;9;2
Explanation:
The array contains the values 3 1 2,
so the loop should iterate over the array and extract fields $3, $1 and $2 from "aaaa;bbbb;cccccccc" using awk,
and the final output should be this:
ccccccccaaaabbbb
I still have some errors while launching my script.
I'm making a few guesses here but I think that this does what you want:
$ echo "aaaa;bbbb;cccccccc" | awk -F\; 'NR == FNR { n = split($0, a); next }
{ printf "%s", a[$3] } END { print "" }' - file
ccccccccaaaabbbb
NR == FNR means that the block is only run for the first input. - as an argument tells awk to read first from standard input. The string is split on FS (;) into the array a. next skips the rest of the script.
The second block is only run for the second input (the text file). The values in the third field are used to print the elements in the array a.
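To make the NR == FNR test more concrete, this small sketch prints NR and FNR for each record when awk reads standard input first and a file second. The two-line sample file and the labels "first" and "second" are only for illustration.

```shell
# Two-line sample standing in for the text file.
tmp=$(mktemp)
printf '1;8;3\n4;6;1\n' > "$tmp"

# NR counts records across all inputs; FNR restarts at 1 for each input,
# so the two are equal only while the first input is being read.
out=$(echo "aaaa;bbbb;cccccccc" |
    awk '{ print NR, FNR, (NR == FNR ? "first" : "second") }' - "$tmp")

echo "$out"
rm -f "$tmp"
```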
If you want to pass the indices as an awk variable, here is another way:
$ awk -F';' -v ix="$(cut -d\; -f3 file | paste -sd\;)" '
BEGIN{n=split(ix,a)}
{for(i=1;i<n;i++) printf "%s",$a[i];
printf "%s\n",$a[n]}' <<< "aaaa;bbbb;cccccccc"
ccccccccaaaabbbb
Suppose I have a text file with 6 columns as below
a|b|c|d|e|f
I know that somewhere in the file the character 'd' exists, but I want to know its column number.
I have used the following command:
awk 's=index($0,"d"){print "position="s}' filename
but it counts the delimiters too, which I don't want. I want the output to be 4 in the case of "d".
Define "|" as the record separator and use the NR variable in awk:
awk -v RS="|" '/^d$/{print NR;}' filename
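Change ^d$ to whatever you want to match. Note that this suits a single-line file: NR keeps counting across lines, and the last field of each line stays glued to the newline that follows it, so an anchored pattern will not match it.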
Using awk you can do:
awk -v val='d' -F '|' '{for (i=1; i<=NF; i++) if ($i==val) {print i} }' file
4
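For a file with more than one line, the field loop above can also report which line the match was on. A minimal sketch, assuming the output format line:column (that format and the two-line sample are choices made here, not part of the question):

```shell
# Sample with the value in different columns on different lines.
tmp=$(mktemp)
printf 'a|b|c|d|e|f\nd|x|y\n' > "$tmp"

# Compare every field with val; print "line:column" for each exact match.
out=$(awk -v val='d' -F '|' '{ for (i = 1; i <= NF; i++) if ($i == val) print NR ":" i }' "$tmp")

echo "$out"
rm -f "$tmp"
```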