Formatting Text into separate files [closed] - linux

I have a slight problem and do not know where to start.
I have a text file that contains the following information.
MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. £5,995 ono Telephone xxxxx xxxxx
I need to populate the following template with the above information:
<advert>
<manufacturer></manufacturer>
<make></make>
<model></model>
<price></price>
<miles></miles>
<image></image>
<desc><![CDATA[]]></desc>
<expiry></expiry> // Any point in the future
<url></url> // Optional
</advert>
The output should be:
<advert>
<manufacturer>MINI</manufacturer>
<make></make>
<model></model>
<price>5,995</price>
<miles>30000</miles>
<image></image>
<desc><![CDATA[2007, British Racing Green, full service history, metallic paint, alloys. Great condition.Telephone xxxxxx xxxxxx]]></desc>
<expiry>Todays date 13/05/2013</expiry>
<url></url>
</advert>
Any help will be greatly appreciated.

Since commas are sometimes part of a field and sometimes not, you can't use commas (or anything else) as a simple field separator, so you need something like this in GNU awk (for gensub() and strftime()):
gawk '{
print "<advert>"
printf "\t<manufacturer>%s</manufacturer>\n", $1
printf "\t<make></make>\n"
printf "\t<model></model>\n"
printf "\t<price>%s</price>\n", gensub(/.*£([[:digit:],]+).*/,"\\1","")
printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1","")
printf "\t<image></image>\n"
printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1","")
printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
printf "\t<url></url>\n"
print "</advert>"
}' file
My editor seems to choke on British pound signs so here's the above script running using a # symbol instead:
$ cat file
MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx
$ gawk '{
print "<advert>"
printf "\t<manufacturer>%s</manufacturer>\n", $1
printf "\t<make></make>\n"
printf "\t<model></model>\n"
printf "\t<price>%s</price>\n", gensub(/.*#([[:digit:],]+).*/,"\\1","")
printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1","")
printf "\t<image></image>\n"
printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1","")
printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
printf "\t<url></url>\n"
print "</advert>"
}' file
<advert>
<manufacturer>MINI</manufacturer>
<make></make>
<model></model>
<price>5,995</price>
<miles>30,000</miles>
<image></image>
<desc><![CDATA[British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx]]></desc>
<expiry>Todays date 13/05/2013</expiry>
<url></url>
</advert>

Here's some example code that should get you going at least. Since the script uses strftime(), a GNU awk extension, run it with gawk:
gawk -f script.awk file.txt
Contents of script.awk:
{
for (i=1;i<=NF;i++) {
if ($i == "miles,") {
miles = $(i - 1)
$i = $(i - 1) = ""
}
if ($i ~ /£/) {
price = substr($i, 2)
$i = $(i + 1) = ""
}
}
gsub(/ +/, " ");
print "<advert>"
print "\t<manufacturer>" $1 "</manufacturer>"
print "\t<make></make>"
print "\t<model></model>"
print "\t<price>" price "</price>"
print "\t<miles>" miles "</miles>"
print "\t<image></image>"
print "\t<desc><![CDATA[" $0 "]]></desc>"
print "\t<expiry>" strftime( "%d/%m/%Y" ) "</expiry>"
print "\t<url></url>"
print "</advert>"
}
Results:
<advert>
<manufacturer>MINI</manufacturer>
<make></make>
<model></model>
<price>5,995</price>
<miles>30,000</miles>
<image></image>
<desc><![CDATA[MINI COOPER 2007, British Racing Green, full service history, metallic paint, alloys. Great condition. Telephone xxxxx xxxxx]]></desc>
<expiry>13/05/2013</expiry>
<url></url>
</advert>

Related

I need to make an awk script to parse text in a file. I am not sure if I am doing it correctly

Hi, I need to make an awk script in order to parse a csv file and sort it in bash.
I need to get a list of presidents from Wikipedia and sort their years in office by year.
When it is all sorted out, each year needs to be in a text file.
I'm not sure I am doing it correctly.
Here is a portion of my csv file:
28,Woodrow Wilson,http:..en.wikipedia.org.wiki.Woodrow_Wilson,4.03.1913,4.03.1921,Democratic ,WoodrowWilson.gif,thmb_WoodrowWilson.gif,New Jersey
29,Warren G. Harding,http:..en.wikipedia.org.wiki.Warren_G._Harding,4.03.1921,2.8.1923,Republican ,WarrenGHarding.gif,thmb_WarrenGHarding.gif,Ohio
I want to include $2, which I think is the name, and sort by $4, which I think is the date the president took office.
Here is my actual awk file:
#!/usr/bin/awk -f
-F, '{
if (substr($4,length($4)-3,2) == "17")
{ print $2 > Presidents1700 }
else if (substr($4,length($4)-3,2) == "18")
{ print $2 > Presidents1800 }
else if (substr($4,length($4)-3,2) == "19")
{ print $2 > Presidents1900 }
else if (substr($4,length($4)-3,2) == "20")
{ print $2 > Presidents2000 }
}'
Here is my function running it:
SplitFile() {
printf "Task 4: Splitting file based on century\n"
awk -f $AFILE ${custFolder}/${month}/$DFILE
}
Where $AFILE is my awk file, and the directories listed on the right lead to my actual file.
Here is a portion of my output. It's actually several hundred lines long, but in the end this is what a portion of it looks like:
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47: ^ syntax error
I know the output is not very helpful; I would rather just screenshot it, but I can't. I tried getting help, but these online classes can be really hard, and getting help at a distance is tough. The syntax errors above seem to be pointing to commas in the csv file.
After the edits, it's clear you are trying to classify the presidents by the century in which they served.
As stated in my comments above, you don't include single quotes or command-line arguments in an awk script file. You use the BEGIN {...} rule to set the field separator, FS = ",". Then there are several ways you can split things in the fourth field; split() is as easy as anything else.
That will leave the year in the third element of arr (arr[3]). Then it's just a matter of comparing against the largest century first, decreasing from there, and redirecting the output to the output file for that century.
Continuing with what you started, your awk script will look similar to:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
split ($4, arr, ".")
if (arr[3] >= 2000)
print $2 > "Presidents2000"
else if (arr[3] >= 1900)
print $2 > "Presidents1900"
else if (arr[3] >= 1800)
print $2 > "Presidents1800"
else if (arr[3] >= 1700)
print $2 > "Presidents1700"
}
Now make it executable (for convenience). Presuming the script is in the file pres.awk:
$ chmod +x pres.awk
Now simply call the awk script passing the .csv filename as the argument, e.g.
$ ./pres.awk my.csv
Now list the files named Presid* and see what is created:
$ ls -al Presid*
-rw-r--r-- 1 david david 33 Oct 8 22:28 Presidents1900
And verify the contents are what you needed:
$ cat Presidents1900
Woodrow Wilson
Warren G. Harding
Presuming that is the output you are looking for based on your attempt.
(Note: you need to quote the output file name so that, e.g., Presidents1900 isn't taken as a variable that hasn't been set yet.)
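To see why, here is a tiny standalone demo (with invented input): a quoted name is taken as a literal file, while an unquoted Presidents1900 would be an uninitialized awk variable, i.e. an empty string, and redirecting output to a null file name is a fatal error in gawk.

```shell
# Quoted: output goes to a file literally named Presidents1900.
printf '28,Woodrow Wilson\n' | awk -F, '{ print $2 > "Presidents1900" }'
cat Presidents1900    # -> Woodrow Wilson
# Unquoted (print $2 > Presidents1900) would instead abort with a
# redirection error, because the variable is unset.
```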
Let me know if you have further questions.

In-line replacement bash (replace line with new one using variables)

I'm going through and reading lines from a file. They have a ton of information that is unnecessary, and I want to reformat the lines for later use so that I can use the necessary information later.
Example line in file (file1)
Name: *name* Date: *date* Age: *age* Gender: *gender* Score: *score*
Say I want to just pull gender and age from the file and use that later
New line
*gender*, *age*
In bash:
while read line; do
<store variable for gender>
<store variable for age>
<overwrite each line in CSV - gender,age>
<use gender/age as inputs for later comparisons>
done < file1
EDIT: There is no stability in the entries. One value can be found using echo $line | cut, and the other value is found using [[ $line =~ "keyValue" ]] and then setting that value.
I was thinking of storing the combination of the two variables as such:
newLine="$val1,$val2"
Then using a sed in-line replace to replace the $line with $newLine.
Is there a better way, though? It may come down to a sed formatting issue with variables.
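For reference, the approach described above can be sketched end to end like this (field positions and values are invented to fit the example line, GNU sed is assumed for -i, and the data must contain no sed metacharacters):

```bash
#!/bin/bash
# Invented sample data matching the example line format.
printf 'Name: bob Date: 1/1/13 Age: 25 Gender: M Score: 90\n' > file1
cp file1 file1.orig          # read from a copy while editing file1 in place

while IFS= read -r line; do
    # "cut" style: splitting on single spaces, the age value is field 6 here.
    val1=$(echo "$line" | cut -d' ' -f6)

    # bash regex style: keep the pattern in a variable so [[ =~ ]] treats it
    # as a regex rather than a literal string.
    re='Gender: ([^ ]+)'
    [[ $line =~ $re ]] && val2="${BASH_REMATCH[1]}"

    newLine="$val2,$val1"
    # Double quotes let the shell expand $line and $newLine inside the sed
    # script; | as the delimiter avoids clashing with the / in the date.
    sed -i "s|^$line\$|$newLine|" file1
done < file1.orig

cat file1                    # the line is now: M,25
```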
This will produce your desired output from your posted sample input:
$ cat file
Name: *name* Date: *date* Age: *age* Gender: *gender* Score: *score*
$ awk -F'[: ]+' -v OFS=', ' '{for (i=1;i<NF;i+=2) a[$i]=$(i+1); print a["Gender"], a["Age"]}' file
*gender*, *age*
$ awk -F'[: ]+' -v OFS=', ' '{for (i=1;i<NF;i+=2) a[$i]=$(i+1); print a["Score"], a["Name"], a["Date"] }' file
*score*, *name*, *date*
and you can see above how easy it is to print whatever fields you like in whatever order you like.
If it's not what you want, post some more representative input.
Your example leaves room for interpretation, so I'm assuming that there may be whitespace in the field values, but that there are no colons in the field values and that each field key is followed by a colon. I also assume that the order is stable.
while IFS=: read _ _ _ age gender _; do
age="${age% Gender}" # Use parameter expansion to strip off the key for the *next* field.
gender="${gender% Score}"
printf '"%s","%s"\n' "$gender" "$age"
done < file1 > file1.csv
Update
Since your question now states that there is no stability, you have to iterate through the possible values to get your output:
while IFS=: read -a line; do
unset age key sex
for chunk in "${line[@]}"; do
val="${chunk% *}" # Everything but the key
case "$key" in
Age) age="$val";;
Gender) sex="$val";;
esac
# The key is for the *next* iteration.
key="${chunk##* }"
done
if [[ $age || $sex ]]; then
printf '"%s","%s"\n' "$sex" "$age"
fi
done < file1 > file1.csv
(Also I added quotes around the output values in the csv to be compliant with the actual csv format and in case sex or age happened to have commas in it. Maybe someone is 1,000,000 years old. ;)

Is There Any Way to Auto Increase Data Field Variable ($NUM) in gawk?

I have a pipe-delimited file that contains many data fields. I am trying to use gawk to separate them. The log looks like this:
United States|San Francisco|CA|...goes on...
Germany|Quebec City|Quebec|...goes on...
The output format would look like:
Country: United States
City: San Francisco
State: CA
...goes on...
The question is: is there any way I can auto-increase $1 all the way to $15, skip $16, then auto-increase again? I just don't want to hard-code my gawk script with this:
print "Country:\t", $1
print "City:\t", $2
print "State:\t", $3
You can use a for loop and an if condition:
awk -v FS='|' '
BEGIN{split("Country;City;State",a,";")}
{for (i=1;i<=NF;i++) if (i != 16) print a[i]":\t"$i}' input
It's not entirely clear, but if your truncated input looks something like:
United States|San Francisco|CA|Salem|OR...
and you are just trying to enumerate city/state pairs, you can certainly do:
for( i = 2; i <= NF ;i+=2 ) {
print "City:\t", $i
print "State:\t", $(i+1)
}
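Wrapped into a complete, runnable command (with a second invented city/state pair added for illustration), that loop looks like:

```shell
printf 'United States|San Francisco|CA|Salem|OR\n' |
awk -F'|' '{
    print "Country:\t" $1
    # Step through the remaining fields two at a time as city/state pairs.
    for (i = 2; i <= NF; i += 2) {
        print "City:\t" $i
        print "State:\t" $(i + 1)
    }
}'
```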

Max commas on one line, using bash script

I have some \n ended text:
She walks, in beauty, like the night
Of cloudless climes, and starry skies
And all that's best, of dark and bright
Meet in her aspect, and her eyes
And I want to find which line has the max number of commas and print that line too.
For example, the text above should result in
She walks, in beauty, like the night
since it has 2 commas (the max among all lines).
I have tried:
cat p.txt | grep ','
but do not know where to go now.
You could use awk:
awk -F, -vmax=0 ' NF > max { max_line = $0; max = NF; } END { print max_line; }' < poem.txt
Note that if the max is not unique this picks the first one with the max count.
try this
awk '-F,' '{if (NF > maxFlds) {maxFlds=NF; maxRec=$0}} ; END {print maxRec}' poem
Output
She walks, in beauty, like the night
Awk works with 'fields'; the -F says use ',' to separate the fields. (The default field separator is adjacent whitespace: spaces and tabs.)
NF means Number of Fields (in the current record). So we're using logic to find the record with the maximum number of fields, capturing the value of the line ($0), and at the END we print out the line with the most fields.
It is left undefined what will happen if 2 lines have the same maximum # of commas ;-)
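As a quick illustration of how -F and NF interact (a made-up one-liner, not part of the answer above):

```shell
# Two commas split the line into three fields, so NF is 3.
printf 'She walks, in beauty, like the night\n' | awk -F, '{ print NF }'
```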
I hope this helps.
FatalError's FS-based solution is nice. Another way I can think of is to remove non-comma characters from the line, then count its length:
[ghoti@pc ~]$ awk '{t=$0; gsub(/[^,]/,""); print length($0), t;}' poem
2 She walks, in beauty, like the night
1 Of cloudless climes, and starry skies
1 And all that's best, of dark and bright
1 Meet in her aspect, and her eyes
[ghoti@pc ~]$
Now we just need to keep track of it:
[ghoti@pc ~]$ awk '{t=$0;gsub(/[^,]/,"");} length($0)>max{max=length($0);line=t} END{print line;}' poem
She walks, in beauty, like the night
[ghoti@pc ~]$
Pure Bash:
declare ln=0 # actual line number
declare maxcomma=0 # max number of commas seen
declare maxline='' # corresponding line
while read line ; do
commas="${line//[^,]/}" # remove all non-commas
if [ ${#commas} -gt $maxcomma ] ; then
maxcomma=${#commas}
maxline="$line"
fi
((ln++))
done < "poem.txt"
echo "${maxline}"

Bash Shell Programming Store Variables from Text File into Arrays

My program should be able to work this way.
Below is the content of the text file named BookDB.txt
The individual fields are separated by colons (:), and every line in the text file serves as a set of information, in the order stated below.
Title:Author:Price:QtyAvailable:QtySold
Harry Potter - The Half Blood Prince:J.K Rowling:40.30:10:50
The little Red Riding Hood:Dan Lin:40.80:20:10
Harry Potter - The Phoniex:J.K Rowling:50.00:30:20
Harry Potter - The Deathly Hollow:Dan Lin:55.00:33:790
Little Prince:The Prince:15.00:188:9
Lord of The Ring:Johnny Dept:56.80:100:38
I actually intend to
1) Read the file line by line and store it in an array
2) Display it
However, I have no idea how to even start on the first one.
From doing research online, below are the codes which I have written up till now.
#!/bin/bash
function fnReadFile()
{
while read inputline
do
bTitle="$(echo $inputline | cut -d: -f1)"
bAuthor="$(echo $inputline | cut -d: -f2)"
bPrice="$(echo $inputline | cut -d: -f3)"
bQtyAvail="$(echo $inputline | cut -d: -f4)"
bQtySold="$(echo $inputline | cut -d: -f5)"
bookArray[Count]=('$bTitle', '$bAuthor', '$bPrice', '$bQtyAvail', '$bQtySold')
Count = Count + 1
done
}
function fnInventorySummaryReport()
{
fnReadFile
echo "Title Author Price Qty Avail. Qty Sold Total Sales"
for t in "${bookArray[@]}"
do
echo $t
done
echo "Done!"
}
if ! [ -f BookDB.txt ] ; then # check existence of the BookDB file; create the file if it does not exist, else continue
touch BookDB.txt
fi
"HERE IT WILL THEN BE THE MENU AND CALLING OF THE FUNCTION"
Thanks in advance to those who help!
Why would you want to read the entire thing into an array? Query the file when you need information:
#!/bin/sh
# untested code:
# print the values of any line that match the pattern given in $1
grep "$1" BookDB.txt |
while IFS=: read Title Author Price QtyAvailable QtySold; do
echo title = $Title
echo author = $Author
done
Unless your text file is very large, it is unlikely that you will need the data in an array. If it is large enough that you need that for performance reasons, you really should not be coding this in sh.
Since your goal here seems to be clear, how about using awk as an alternative to using bash arrays? Often using the right tool for the job makes things a lot easier!
The following awk script should get you something like what you want:
# This will print your headers, formatted the way you had above, but without
# the need for explicit spaces.
BEGIN {
printf "%-22s %-16s %-14s %-15s %-13s %s\n", "Title", "Author", "Price",
"Qty Avail.", "Qty Sold", "Total Sales"
}
# This is described below, and runs for every record (line) of input
{
printf "%-22s %-16s %-14.2f %-15d %-13d %0.2f\n",
substr($1, 1, 22), substr($2, 1, 16), $3, $4, $5, ($3 * $5)
}
The second section of code (between curly braces) runs for every line of input. printf is for formatted output, and uses the given format string to print out each field, denoted by $1, $2, etc. In awk, these variables are used to access the fields of your record (line, in this case). substr() is used to truncate the output, as shown below, but can easily be removed if you don't mind the fields not lining up. I assumed "Total Sales" was supposed to be Price multiplied by Qty Sold, but you can update that easily as well.
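If the printf width specifiers and substr() are unfamiliar, this small standalone awk call (not part of the script above; the numbers are taken from the first data line) shows both:

```shell
awk 'BEGIN {
    # %-10s left-justifies in a 10-character column; substr(s, start, len)
    # is 1-based, so this keeps the first 5 characters of the title.
    printf "%-10s %0.2f\n", substr("Harry Potter", 1, 5), 40.30 * 50
}'
```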
Then, you save this file as books.awk and invoke it like so:
$ awk -F: -f books.awk books
Title Author Price Qty Avail. Qty Sold Total Sales
Harry Potter - The Hal J.K Rowling 40.30 10 50 2015.00
The little Red Riding Dan Lin 40.80 20 10 408.00
Harry Potter - The Pho J.K Rowling 50.00 30 20 1000.00
Harry Potter - The Dea Dan Lin 55.00 33 790 43450.00
Little Prince The Prince 15.00 188 9 135.00
Lord of The Ring Johnny Dept 56.80 100 38 2158.40
The -F: tells awk that the fields are separated by colon (:), and -f books.awk tells awk what script to run. Your data is held in books.
Not exactly what you were asking for, but just pointing you toward a (IMO) better tool for this kind of job! awk can be intimidating at first, but it's amazing for jobs that work on records like this!
