This question already has answers here:
How to replace multiple spaces with a single space using Bash? [duplicate]
(3 answers)
Closed 3 years ago.
When I do a desc on a Hive table on a Linux terminal CLI, like below:
hive -e "desc hive_db.hive_table"
I get the following output
date                    string
entity                  string
key                     int
id                      string
direction               string
indicator               string
account_number          string
document_date           string
Now I want to redirect the output to a file, but before doing that I want to collapse the whitespace between the fields so that there is just one space between them.
The expected output is below
date string
entity string
key int
id string
direction string
indicator string
account_number string
document_date string
I have tried the following:
hive -e "desc hive_db.hive_table" | tr -s " " > abc.txt
The output I get is
date string
entity string
key int
id string
direction string
indicator string
account_number string
document_date string
The output I get is close, but I still have one space and one tab between the fields on each line.
How can I achieve what I want?
Try:
hive -e "desc hive_db.hive_table" | sed -E 's/[[:space:]]+/ /' > abc.txt
[[:space:]]+ matches one or more whitespace characters of any kind (tab, blank, or, depending on locale, other whitespace). s/[[:space:]]+/ / replaces that sequence of whitespace with a single blank. Without the g flag only the first such run on each line is replaced, which is enough here since there is only one gap per line; add g (s/[[:space:]]+/ /g) to collapse every run.
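For example, a quick sketch with a space-plus-tab gap between the two fields:
$ printf 'account_number \t string\n' | sed -E 's/[[:space:]]+/ /'
account_number string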
Related
I have a string with comma separated values, like:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
As you can see, the 3rd comma-separated value sometimes has a special character, like the dash (-), at the end. I want to use sed, or preferably a perl command (with the -i option, so it replaces in the existing file), to replace this string with the same string at the same place (i.e. the 3rd comma-separated value) but without the special character (like the dash) at the end. So the result for the example string above should be:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,
Since there are multiple such lines in the file, I am using a while loop in a shell/bash script to loop over and manipulate all lines of the file. I have assigned the above string values to variables so as to replace them using perl. My while loop is:
while read mystr
do
myNEWstr=$(echo "$mystr" | sed 's/[_.-]$//' | sed 's/[__]$//' | sed 's/[_.-]$//')
perl -pi -e "s/\b$mystr\b/$myNEWstr/g" myFinalFile.txt
done < myInputFile.txt
where:
$mystr is the "SOME-STRING_A_-BLAHBLAH_1-4MP0-"
$myNEWstr result is the "SOME-STRING_A_-BLAHBLAH_1-4MP0"
Note that myInputFile.txt is a file that contains the 3rd comma-separated values of myFinalFile.txt. Each of those EXACT string values ($mystr) is checked for a special character at the end (underscore, dash, dot, double-underscore); if one exists it is removed to form the new string ($myNEWstr), and finally that new string replaces the old one in myFinalFile.txt, so that the result looks like the example final string shown above, i.e. with the 3rd comma-separated sub-string value WITHOUT the special character at the end (the dash (-) in the above example).
Thank you.
You could use the following regex:
s/^([^,]*,[^,]*,[^,]*)-,/$1,/
This defines csv fields as a series of characters other than a comma (empty fields are allowed). We are looking for a dash at the very end of the third csv field. The regex captures everything up to there and then puts it back while omitting the dash.
$ cat t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
$ perl -p -e 's/^([^,]*,[^,]*,[^,]*)-,/$1,/' t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,
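Since the question also mentions trailing underscores, dots, and double underscores, a non-greedy variant (a sketch along the same lines, not tested against your full data) strips any run of those characters from the end of the third field; with -i it edits the file in place, so the while loop is not needed:
perl -pi -e 's/^([^,]*,[^,]*,[^,]*?)[._-]+,/$1,/' myFinalFile.txt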
I have a large (4 GB) Windows .csv text file (each line ends in "\r\n") in a Linux environment that was supposed to have been a csv-delimited file (delimiter = '|', text qualifier = '"') with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have each double quote escaped with a second double quote (i.e. "the quick "brown" fox" was supposed to have been represented as "the quick ""brown"" fox"). Unfortunately, escaping the embedded double quotes did not occur. Further, the text fields may include embedded newlines (i.e. Windows CR/LF (\r\n)) which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective of having the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
SUCCESSFUL: change all the "|" sequences to ~|~
SUCCESSFUL: change the double quote (") at the start of the first line and at the end of the last line to a tilde (~)
change the ending and starting double quotes to tildes wherever a line ends in a double quote followed by a CR (\r\n) (e.g. ..."\r\n) and the next line begins with a double quote followed by a 16-digit number and a tilde (e.g. "1234567890123456~...), i.e. where a new record starts
convert all remaining double quote characters to two successive double quotes (change " to "")
then reverse the first 3 steps above changing all ~ back to double quotes.
I started by using sed to replace all strings consisting of a double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last double quote in the file with a tilde.
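(For reference, that first substitution can be done with something like the following sketch; windows_file.txt and step1.txt are placeholder names:)
sed 's/"|"/~|~/g' windows_file.txt > step1.txt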
This is where I ran into issues, as I tried to count the number of occurrences where a line ends with a double quote (") and the start of the next line begins with a double quote followed by a 16-digit number and a "~", which would tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l, but that didn't work (grep is line-oriented, so the pattern can never match across the newline, and \d{16} is Perl syntax that plain grep does not understand).
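A line-based sketch for the count, which avoids matching across newlines: at this stage every record starts on its own line with a double quote, a 16-digit number, and a tilde, so counting those lines gives the record count directly:
grep -c '^"[0-9]\{16\}~' windows_file.txt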
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16-digit number and a "~", leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1/' windows_file.txt, but it is not working as hoped (sed also processes the input line by line, so the \r\n in the pattern never matches).
I would welcome any recommendations as to how to accomplish the above.
The script below does what you expect using awk, except for the very last record in the file, since it does not know where that record ends.
That could be fixed by counting lines in the file first, but that would be impractical since it's a big file.
Looking at the data structure, records are separated by "\r\n" (quote, DOS line ending, quote) and fields by "|" (quote, pipe, quote), so let's use that with awk.
gawk 'BEGIN{
RS="\"\r\n\"" # input record separator RS, 2 double quotes with a DOS line ending in the middle
FS="\"\\|\"" # input field separator FS, 2 double quotes with a pipe in the middle
ORS="~\r\n~" # your record separator
OFS="~|~" # your field separator
} {
$1=$1 # trick awk into believing something has changed
if (NR == 1){ # first record, replace first character
print "~" substr($0,2)
}else{
print $0
}
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: this will break if a field contains a line that starts with a double quote while the preceding line within the same field ends with "\r\n, since that sequence will match the proposed RS, as in:
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
This question already has answers here:
What's the most robust way to efficiently parse CSV using awk?
(6 answers)
Closed 3 years ago.
I have data in a .csv column that sometimes contains commas and newlines. If there is a comma in my data, I have enclosed the entire string in double quotes. How would I go about parsing the output of that column to a .txt file, taking the newlines and commas into consideration?
Sample data that doesn't work with my command:
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
Here is what I have so far:
awk -F "\"*,\"*" '{print $4}' file.csv > column_output.txt
$ cat decsv.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS="," }
{
# create strings that cannot exist in the input to map escaped quotes to
gsub(/a/,"aA")
gsub(/\\"/,"aB")
gsub(/""/,"aC")
# prepend previous incomplete record segment if any
$0 = prev $0
numq = gsub(/"/,"&")
if ( numq % 2 ) {
# this is inside double quotes so incomplete record
prev = $0 RT
next
}
prev = ""
for (i=1;i<=NF;i++) {
# map the replacement strings back to their original values
gsub(/aC/,"\"\"",$i)
gsub(/aB/,"\\\"",$i)
gsub(/aA/,"a",$i)
}
printf "Record %d:\n", ++recNr
for (i=0;i<=NF;i++) {
printf "\t$%d=<%s>\n", i, $i
}
print "#######"
}
$ awk -f decsv.awk file
Record 1:
$0=<,"This is some text with a , in it.", #data with commas are enclosed in double quotes>
$1=<>
$2=<"This is some text with a , in it.">
$3=< #data with commas are enclosed in double quotes>
#######
Record 2:
$0=<,"line 1 of data
line 2 of data", #data with a couple of newlines>
$1=<>
$2=<"line 1 of data
line 2 of data">
$3=< #data with a couple of newlines>
#######
Record 3:
$0=<,"Data that may a have , in it and
also be on a newline as well.",>
$1=<>
$2=<"Data that may a have , in it and
also be on a newline as well.">
$3=<>
#######
Record 4:
$0=<,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",>
$1=<>
$2=<"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.">
$3=<>
#######
The above uses GNU awk for FPAT and RT. I don't know of any CSV format that would allow you to have a newline in the middle of a field that's not enclosed by quotes (if it did, you'd never know where any record ended), so the script doesn't allow for that. The above was run on this input file:
$ cat file
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",
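As a minimal standalone illustration of what FPAT does (a sketch, separate from the script above): FPAT defines fields by what they match rather than by what separates them, so a quoted field can contain commas:
$ echo 'a,"b,c",d' | gawk 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"} {print $2}'
"b,c"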
I have to create a sql script from another one, altering its content. E.g.
SELECT value INTO val FROM table WHERE condition;
SELECT value2 INTO val2 FROM table WHERE condition1
OR condition2;
So I have tried
sed 's/FROM .*;/;/g'
But it returns this
SELECT value INTO val ;
SELECT value2 INTO val2 FROM table WHERE condition1
OR condition2;
instead of this, which is what I need
SELECT value INTO val ;
SELECT value2 INTO val2 ;
Any ideas? Basically what I want to do is remove all that is included among 'FROM' and the next ';'
sed ':load
# load any multiline sequence before going further
/;[[:space:]]*$/ !{ N;b load
}
# from here you have a full (multi)line to treat
s/[[:space:]]\{1,\}FROM[[:space:]].*;/ ;/
' YourFile
You need to first load the whole multiline sequence before removing the tail (the load section cycles until a terminating ; is found):
:load : address label for the 'goto' used later
/;[[:space:]]*$/ ! : when there is no ending ; on the line (possibly followed by trailing spaces)
N : append the next input line to the working buffer
b load : go to the label load (the goto)
s/[[:space:]]\{1,\}FROM[[:space:]].*;/ ;/ : change the whole current working buffer (mono- or multiline, but always ending with ;) into your new format. sed here treats the buffer, not a line; newlines are characters like any others in this case.
The last line needs to end with ; to be treated; if not, the last (incomplete) sequence is lost.
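A quick run against the sample statements (assuming they are saved as test.sql):
$ sed ':load
/;[[:space:]]*$/ !{ N;b load
}
s/[[:space:]]\{1,\}FROM[[:space:]].*;/ ;/
' test.sql
SELECT value INTO val ;
SELECT value2 INTO val2 ;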
I think you can remove the '\n' in your script and then use sed to remove the FROM clause.
Eg
cat test.sql |tr -d '\n'|sed 's/FROM [^;]*;/;\n/g'
awk is record-based, not line-based like sed, so it has no problem handling multi-line strings:
$ awk 'BEGIN{RS=ORS=";"}{gsub(/FROM .*/,"")}1' file
SELECT value INTO val ;
SELECT value2 INTO val2 ;
The above just sets the Record Separator to a ; instead of the default newline and operates on the resulting strings which can contain newlines just like any other characters.
As far as I know, you either have to get rid of the newline delimiters
tr -d '\n'
or use Python's re.M | re.DOTALL arguments in re.compile,
e.g. (roughly put, reading the whole file into a string first):
import re
pattern = re.compile(r'FROM[^;]*;', re.M | re.DOTALL)
result = pattern.sub(';', open('test.sql').read())
Usually, when I need to regex across newlines, I end up in Python. Bash is so newline-based that it's hard to bend it to do this.
But replacing the '\n' with a placeholder might suffice, if you really need to use bash.
Good evening, People
Currently I have an array called inputArray which stores an input file of 7 lines, line by line. I have a word which is 70000($s0); how do I split the word so that 70000 and ($s0) are separate?
I looked at an answer which is already on this website, but I couldn't understand it. The answer I looked at was:
s='1000($s3)'
IFS='()' read a b <<< "$s"
echo -e "a=<$a>\nb=<$b>"
giving the output a=<1000> b=<$s3>
Let me give this a shot.
In certain circumstances, the shell will perform "word splitting", where a string of text is broken up into words. The word boundaries are defined by the IFS variable. The default value of IFS is: space, tab, newline. When a string is split into words, any sequence of characters from this set is removed, and what remains between those sequences are the words.
In your example, the set of characters that delimit words is ( and ). So the words in that string that are bounded by the IFS set of characters are 1000 and $s3.
What is <<< "$s"? This is a here-string. It's used to send a string to some command's standard input. It's like doing
echo "$s" | read a b
except that that form doesn't work as expected in bash: the read runs in a subshell, so a and b vanish when the pipeline ends. read a b <<< "$s" works well.
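A quick way to see the difference (a sketch, using fresh variable names):
$ echo '1000($s3)' | IFS='()' read x y; echo "x=<$x>"
x=<>
$ IFS='()' read x y <<< '1000($s3)'; echo "x=<$x>"
x=<1000>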
Now, what are the circumstances where word splitting occurs? One is when a variable is unquoted. A demo:
IFS='()'
echo "$s" | wc # 1 line, 1 word and 10 characters
echo $s | wc # 1 line, 2 words and 9 characters
The read command also splits a string into words, in order to assign words to the named variables. The variable a gets the first word, and b gets all the rest.
The command, broken down is:
IFS='()' read a b <<< "$s"
#^^^^^^^                   1
#        ^^^^^^^^          2
#                 ^^^^^^^^ 3
only for the duration of the read command, assign the variable IFS the value ()
send the string "$s" to read's stdin
from stdin, use $IFS to split the input into words: assign the first word to variable a and the rest of the string to variable b. Trailing characters from $IFS at the end of the string are discarded.
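Putting it together for your 70000($s0) case (a quick sketch):
$ word='70000($s0)'
$ IFS='()' read a b <<< "$word"
$ echo "a=<$a> b=<$b>"
a=<70000> b=<$s0>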
Documentation:
Word splitting
Here strings
Simple command execution, describing why this assignment of IFS is only in effect for the duration of the read command.
read command
Hope that helps.