merge csv lines if not ending with a pipe - linux

I have a rather large csv file where each line should end with a pipe (|); if a line doesn't, the next line should be combined into it until a pipe is found again. This needs to be done using a shell script.
I got an answer as
awk '!/|$/{l=l""$0|next|}{print l""$0|l=""}' file
But it gives me an error, as the size of each line is quite large. I found out that I should be using perl for this and have tried something as below, but it does not produce the desired result.
perl -pe 's/^\n(|\n)/ /gs' input.csv > output.csv
My data looks like
A|1|abc|<xml/>|
|2|def|<xml
>hello world</xml>|
|3|ghi|<xml/>|
And the desired output should be
A|1|abc|<xml/>|
|2|def|<xml>hello world</xml>|
|3|ghi|<xml/>|
Obviously the line size is much larger than in the sample input here.
Any help would be highly appreciated.

awk '{printf "%s",$0} /[|][[:space:]]*$/ {print ""}'
Print every line without a newline. If the last non-whitespace character is a pipe, you have a complete line so print a newline.
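Run against the sample data from the question (a quick sanity check, not part of the original answer):

```shell
# feed the question's sample rows through the one-liner
printf 'A|1|abc|<xml/>|\n|2|def|<xml\n>hello world</xml>|\n|3|ghi|<xml/>|\n' |
  awk '{printf "%s",$0} /[|][[:space:]]*$/ {print ""}'
# → A|1|abc|<xml/>|
# → |2|def|<xml>hello world</xml>|
# → |3|ghi|<xml/>|
```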

This should work (lines are accumulated until one ends with a pipe, then printed joined, with no extra space inserted):
perl -lne 'if(/\|$/){print $line.$_;$line=""}else{$line.=$_}' your_file
If you want to do an in-place replacement, do this:
perl -i -lne 'if(/\|$/){print $line.$_;$line=""}else{$line.=$_}' your_file

This should happily handle all cases for you, and not break on any line length:
#!/bin/bash
newLine=0
IFS=
while read -r -n 1 char; do
    if [[ $char =~ ^$ ]]; then
        if [[ $newLine -eq 1 ]]; then
            newLine=0
            echo '|' # add a newline
        fi
    elif [[ $char =~ . && ( $newLine -eq 1 ) ]]; then
        newLine=0
        echo -n "|$char"
    elif [[ $char =~ [|] ]]; then
        if [[ $newLine -eq 1 ]]; then
            echo -n '|'
        fi
        newLine=1
    else
        echo -n "$char"
    fi
done < file.txt
Please note that building a lexer by hand in bash is usually a bad idea.

Related

extract substrings starting with same pattern in a file

I have a fileA.txt which contains strings:
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
and so on .....
I would like to extract all the substrings which start with RS
Output:
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
I tried something like this:
while read line
do
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            [ -z "$str" ] && str="$word"
        fi
    done
done < fileA.txt
echo "$str"
However I only get the first string RS0247 printed out when I do echo
Given the three sample lines pasted above in the file f...
Assuming a fixed format:
awk '{print $2"_"$4}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):
awk '{f=1;for (i=1;i<=NF;i++){if($i~/^RS/){a[f]=$i;f++}}print a[1]"_"a[2]}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
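If GNU grep is available, the same extraction can be sketched without awk (assuming exactly two RS tokens per line, as in the sample):

```shell
# -o prints each match on its own line; paste joins consecutive pairs with "_"
grep -o 'RS[0-9]*' f | paste -d_ - -
```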
Edit 1:
And assuming that you want your own script patched rather than an efficient solution:
#!/bin/bash
while read line
do
    str=""
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            if [[ -z $str ]]
            then
                str=$word
            else
                str+="_${word}"
            fi
        fi
    done
    echo "$str"
done < fileA.txt
Edit 2:
In terms of efficiency: I copied and pasted those 3 lines into fileA.txt 60 times (180 lines total). The runtimes for the three attempts above, in the same order, are:
real 0m0.002s
real 0m0.002s
real 0m0.011s

bash multiple if statement inside loop is not working

I am new to bash scripting. I have a file and I want to check each line, but it is not working.
My bash script code:
declare -a arr
for (( i=0; i<${len}; i++ ))
do
    if [[ ${arr[$i]} =~ ^[0-9]+$ ]]
    then
        echo ${arr[$i]}" numbr"
    else
        echo "No match"
    fi
done
Why does only the last line match? Is there a space or line break issue? Please suggest a solution.
Your file is saved in "DOS" format, with \r\n line endings.
To convert to "unix" format:
dos2unix file.txt
# or, if that's not installed
sed -i 's/\r$//' file.txt
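A minimal demonstration of why the stray carriage return breaks the match (the $'...' strings below are just for illustration):

```shell
# a DOS-style line ends with \r; the anchored regex then fails
line=$'123\r'
[[ $line =~ ^[0-9]+$ ]] && echo "numbr" || echo "No match"   # No match
# strip the \r and the same test succeeds
line=${line%$'\r'}
[[ $line =~ ^[0-9]+$ ]] && echo "numbr" || echo "No match"   # numbr
```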
Tangentially, I find the POSIX character classes can describe what you want to match:
[[ ${arr[$i]} =~ ^[[:digit:]]+$ ]] && echo "${arr[i]} numbr"
[[ ${arr[$i]} =~ ^[[:alpha:]]+$ ]] && echo "${arr[i]} ltr only"
[[ ${arr[$i]} =~ ^[[:alnum:]]+$ ]] && echo "${arr[i]} both"
Your issue is MS-DOS line endings of $'\r\n' instead of $'\n' only.
You can remove the offending $'\n' on-the-fly like this:
mapfile -t arr < <(tr -d '\r' <file.txt)
Other than that, I suggest you check your script with https://shellcheck.net/ as it has some issues.
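Putting the two fixes together, a short end-to-end sketch (with sample data inlined here rather than read from file.txt):

```shell
# simulate a DOS-format file; strip CRs while loading the array
mapfile -t arr < <(printf '123\r\nabc\r\n42x\r\n' | tr -d '\r')
for v in "${arr[@]}"; do
    if [[ $v =~ ^[0-9]+$ ]]; then
        echo "$v numbr"
    else
        echo "No match"
    fi
done
# → 123 numbr / No match / No match
```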

Unix - Replace column value inside while loop

I have a comma-separated (sometimes tab-separated) text file as below:
parameters.txt:
STD,ORDER,ORDER_START.xml,/DML/SOL,Y
STD,INSTALL_BASE,INSTALL_START.xml,/DML/IB,Y
With the below code I try to loop through the file and do something:
while read line; do
    if [[ $1 = "$(echo "$line" | cut -f 1)" ]] && [[ "$(echo "$line" | cut -f 5)" = "Y" ]] ; then
        # do something...
        if [[ $? -eq 0 ]] ; then
            # code to replace the final flag
        fi
    fi
done < <text_file_path>
I wanted to update the last column of the file to N if the above operation is successful, however below approaches are not working for me:
sed 's/$f5/N/'
'$5=="Y",$5=N;{print}'
$(echo "$line" | awk '$5=N')
Update: A few considerations which I missed at first and which give more clarity, apologies!
The parameters file may contain lines with the last field flag as "N" as well.
The final flag needs to be updated only if the "do something" code has executed successfully.
After looping through all lines, i.e. after exiting the while loop, the flags for all rows are to be set back to "Y".
Perhaps invert the operations and do the processing in awk:
$ awk -v f1="$1" 'BEGIN {FS=OFS=","}
    f1==$1 && $5=="Y" { # do something
        $5="N" }1' file
Not sure what the "do something" operation is; if you need to call another command/script, that's possible as well.
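In case the "do something" step is an external command, here is one hedged sketch using awk's system(), with true as a stand-in for the real script (which is unknown); the second sample row is invented for illustration:

```shell
# flip the flag to N only if the stand-in command exits 0
printf 'STD,ORDER,ORDER_START.xml,/DML/SOL,Y\nSTD,FOO,F.xml,/DML/F,N\n' |
  awk -v f1="STD" 'BEGIN {FS=OFS=","}
    f1==$1 && $5=="Y" { if (system("true") == 0) $5="N" } 1'
# the Y row is flipped to N; the N row passes through untouched
```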
with bash:
(
    IFS=,
    while read -ra fields; do
        if [[ ${fields[0]} == "$1" ]] && [[ ${fields[4]} == "Y" ]]; then
            # do something
            fields[4]="N"
        fi
        echo "${fields[*]}"
    done < file | sponge file
)
I run that in a subshell so the effects of altering IFS are localized.
This uses sponge to write back to the same file. You need the moreutils package to use it, otherwise use
done < file > tmp && mv tmp file
Perhaps a bit simpler, and less bash-specific:
while IFS= read -r line; do
    case $line in
        "$1",*,Y)
            # do something
            line="${line%Y}N"
            ;;
    esac
    echo "$line"
done < file
To replace ,N at the end of the line($) with ,Y:
sed 's/,N$/,Y/' file

Bash scripting: why is the last line missing from this file append?

I'm writing a bash script to read a set of files line by line and perform some edits. To begin with, I'm simply trying to move the files to backup locations and write them out as-is, to test the script is working. However, it is failing to copy the last line of each file. Here is the snippet:
while IFS= read -r line
do
    echo "Line is ***$line***"
    echo "$line" >> $POM
done < $POM.backup
I obviously want to preserve whitespace when I copy the files, which is why I have set the IFS to null. I can see from the output that the last line of each file is being read, but it never appears in the output.
I've also tried an alternative variation, which does print the last line, but adds a newline to it:
while IFS= read -r line || [ -n "$line" ]
do
    echo "Line is ***$line***"
    echo "$line" >> $POM
done < $POM.backup
What is the best way to do this read-write operation, so that the files are written exactly as they are, with the correct whitespace and no newlines added?
The command that is adding the line feed (LF) is not the read command, but the echo command. read does not return the line with the delimiter still attached to it; rather, it strips the delimiter off (that is, it strips it off if it was present in the line, IOW, if it just read a complete line).
So, to solve the problem, you have to use echo -n to avoid adding back the delimiter, but only when you have an incomplete line.
Secondly, I've found that when providing read with a NAME (in your case line), it trims leading and trailing whitespace, which I don't think you want. But this can be solved by not providing a NAME at all, and using the default return variable REPLY, which will preserve all whitespace.
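The trimming difference is easy to see in isolation (it applies with the default IFS; the OP's IFS= also prevents it):

```shell
# read into a NAME: leading/trailing IFS whitespace is stripped
printf '  hi  x  \n' | { read -r line; printf '[%s]\n' "$line"; }   # [hi  x]
# read with no NAME fills REPLY verbatim: whitespace preserved
printf '  hi  x  \n' | { read -r; printf '[%s]\n' "$REPLY"; }       # [  hi  x  ]
```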
So, this should work:
#!/bin/bash
inFile=in;
outFile=out;
rm -f "$outFile";
rc=0;
while [[ $rc -eq 0 ]]; do
    read -r;
    rc=$?;
    if [[ $rc -eq 0 ]]; then ## complete line
        echo "complete=\"$REPLY\"";
        echo "$REPLY" >>"$outFile";
    elif [[ -n "$REPLY" ]]; then ## incomplete line
        echo "incomplete=\"$REPLY\"";
        echo -n "$REPLY" >>"$outFile";
    fi;
done <"$inFile";
exit 0;
Edit: Wow! Three excellent suggestions from Charles Duffy, here's an updated script:
#!/bin/bash
inFile=in;
outFile=out;
while { read -r; rc=$?; [[ $rc -eq 0 || -n "$REPLY" ]]; }; do
    if [[ $rc -eq 0 ]]; then ## complete line
        echo "complete=\"$REPLY\"";
        printf '%s\n' "$REPLY" >&3;
    else ## incomplete line
        echo "incomplete=\"$REPLY\"";
        printf '%s' "$REPLY" >&3;
    fi;
done <"$inFile" 3>"$outFile";
exit 0;
After review, I wonder if:
{
    line=
    while IFS= read -r line
    do
        echo "$line"
        line=
    done
    echo -n "$line"
} <$INFILE >$OUTFILE
is just not enough...
Here is my initial proposal:
#!/bin/bash
INFILE=$1
if [[ -z $INFILE ]]
then
    echo "[ERROR] missing input file" >&2
    exit 2
fi
OUTFILE=$INFILE.processed
# a way to know if the last line is complete or not:
lastline=$(tail -n 1 "$INFILE" | wc -l)
if [[ $lastline == 0 ]]
then
    echo "[WARNING] last line is incomplete -" >&2
fi
# we add a newline ANYWAY; if the file was complete, the end of file will be seen as ... empty.
echo | cat $INFILE - | {
    first=1
    while IFS= read -r line
    do
        if [[ $first == 1 ]]
        then
            echo "First Line is ***$line***" >&2
            first=0
        else
            echo "Next Line is ***$line***" >&2
            echo
        fi
        echo -n "$line"
    done
} > $OUTFILE
if diff $OUTFILE $INFILE
then
    echo "[OK]"
    exit 0
else
    echo "[KO] processed file differs from input"
    exit 1
fi
The idea is to always add a newline at the end of the file and to print newlines only BETWEEN lines that are read.
This should work for pretty much all text files, provided they do not contain a 0 byte, i.e. the \0 character, in which case that byte will be lost.
The initial test can be used to decide whether an incomplete text file is acceptable or not.
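The completeness check used at the top of the script can be sketched in isolation:

```shell
tmp=$(mktemp)
printf 'a\nb' > "$tmp"        # last line lacks its newline
tail -n 1 "$tmp" | wc -l      # → 0
printf 'a\nb\n' > "$tmp"      # properly terminated file
tail -n 1 "$tmp" | wc -l      # → 1
rm -f "$tmp"
```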
Add the newline back only when the line actually had one. Since read -r strips the delimiter, you cannot test $line for a trailing newline inside the loop; instead, rely on the fact that read fails on a final incomplete line but still fills the variable:
{
    while IFS= read -r line
    do
        echo "Line is ***$line***"
        printf '%s\n' "$line" >&3
    done
    # read returned non-zero, but $line still holds a final incomplete line, if any
    [ -n "$line" ] && printf '%s' "$line" >&3
} < "$POM.backup" 3> "$POM"

find string in file using bash

I need to find strings matching some regexp pattern and represent the search result as an array, so I can iterate through it with a loop. Do I need to use sed for this? In general I want to replace some strings, but analyse them before replacing.
Using sed and diff:
sed -i.bak 's/this/that/' input
diff input input.bak
GNU sed will create a backup file before substitutions, and diff will show you those changes. However, if you are not using GNU sed:
mv input input.bak
sed 's/this/that/' input.bak > input
diff input input.bak
Another method using grep:
pattern="/X"
subst=that
while IFS='' read -r line; do
    if [[ $line = *"$pattern"* ]]; then
        echo "changing line: $line" 1>&2
        echo "${line//$pattern/$subst}"
    else
        echo "$line"
    fi
done < input > output
The best way to do this would be to use grep to get the lines, and populate an array with the result using newline as the internal field separator:
#!/bin/bash
# get just the desired lines
results=$(grep "mypattern" mysourcefile.txt)
# change the internal field separator to be a newline
IFS=$'\n'
# populate an array from the result lines
lines=($results)
# return the third result
echo "${lines[2]}"
You could build a loop to iterate through the results of the array; since it is a proper array, quote the expansion so each element is one iteration:
for line in "${lines[@]}"; do
    echo "$line"
done
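A more robust way to populate the array, assuming bash 4+, is mapfile, which avoids the word-splitting and globbing pitfalls of lines=($results). The sample file here is built inline so the sketch is self-contained:

```shell
# build a sample source file so the sketch is runnable as-is
printf 'alpha mypattern one\nno match here\nmypattern two\n' > mysourcefile.txt
# mapfile reads one array element per line; no IFS juggling needed
mapfile -t lines < <(grep "mypattern" mysourcefile.txt)
for line in "${lines[@]}"; do
    echo "$line"
done
rm -f mysourcefile.txt
```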
FYI: here is a similar concept I created for fun. I thought it would be good to show how to loop over a file with this. This is a script where I look at a Linux sudoers file and check that each line contains one of the valid words in my valid_words array list. It ignores comment ("#") and blank ("") lines with sed. In this example we would probably want to print only the invalid lines, but this script prints both.
#!/bin/bash
# -- Inspect a sudoer file, look for valid and invalid lines.
file="${1}"
declare -a valid_words=( _Alias = Defaults includedir )
actual_lines=$(cat "${file}" | wc -l)
functional_lines=$(cat "${file}" | sed '/^\s*#/d;/^\s*$/d' | wc -l)
while read line ;do
    # -- set the line to nothing "" if it has a comment or is an empty line.
    line="$(echo "${line}" | sed '/^\s*#/d;/^\s*$/d')"
    # -- if not set to nothing "", check if the line is valid from our list of valid words.
    if ! [[ -z "$line" ]] ;then
        unset found
        for each in "${valid_words[@]}" ;do
            found="$(echo "$line" | egrep -i "$each")"
            [[ -z "$found" ]] || break;
        done
        [[ -z "$found" ]] && { echo "Invalid=$line"; sleep 3; } || echo "Valid=$found"
    fi
done < "${file}"
echo "actual lines: $actual_lines functional lines: $functional_lines"
