Find and Replace Pipe delimiter from field in a pipe delimited file - linux

I asked a similar question earlier, but later I needed to add more scope to it and had no idea how to edit it and make it live again, which is why I'm posting this as a new question.
My file is a pipe delimited file.
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO|OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y|A|HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | |FACE|B|O|O|K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A||M|AZ|ON| | AFRICA | AF DOLLAR | CAPETOWN
My file really is as complicated as this. I need to remove the "|" symbols from the WEB field and replace them with a junk value like #, $, & or anything else.
The Output has to be:
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
I've tried a few awk filters to clean this mess up, but nothing seems to reach a happy ending. Thank you!
I would like to thank a few people who answered my previous question: RomanPerekhrest, Ed Morton, shellter, val rog.

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { outNf=NF; print; next }
{
    # index of the last field whose trailing separator must become "#"
    end = beg + (NF - outNf) - 1
    for (i=1; i<=NF; i++) {
        # use "#" inside the merged range, the normal OFS elsewhere
        sep = (i>=beg && i<=end ? "#" : OFS)
        printf "%s%s", $i, (i<NF ? sep : ORS)
    }
}
$ awk -v beg=3 -f tst.awk file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
How it works: the first line has the correct number of output fields, so the script saves that count as outNf and prints the line unchanged. Any subsequent line with more than outNf fields has NF-outNf extra fields, starting at field beg, that must be merged back into one column. So inside the loop it prints OFS after fields 1 through beg-1, prints "#" after fields beg through beg+(NF-outNf)-1 to glue the fields in that range into one merged output field, and then goes back to printing OFS after the remaining fields.
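As a quick sanity check (my addition, using toy data that is not from the question), here is a two-line input run through the same script with beg=2; note that a line which already has the header's field count passes through untouched, because end comes out smaller than beg:
$ printf 'A | B | C\nA | B|X | C\n' | awk -v beg=2 -f tst.awk
A | B | C
A | B#X | C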

You can use this awk command:
awk 'BEGIN{FS=OFS="|"} NR==1{n=NF} NF > n {
    # glue the extra fields onto $3 with "#" and blank them out
    s=$3; for (i=4; i<=NF-3; i++) {s = s "#" $i; $i=""} $3=s
    # collapse the runs of empty fields left behind into single delimiters
    gsub(/\|{2,}/, "|")} 1' file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN

Easy, if you do not mind using Perl.
If the file has spaces around the delimiters, then we can print the WEB column by splitting on whitespace; the pipes become fields of their own, which puts WEB at index 4:
stackoverflow ❱ perl -F'\s+' -lane 'print $F[4]' file
WEB
GO|OGLE
Y|A|HOO
|FACE|B|O|O|K
A||M|AZ|ON|
stackoverflow ❱
Since we can modify the @F array in Perl, we can do:
$F[4] =~ s/\|/#/g;
This modifies only that column, not the others.
And eventually we can print it:
stackoverflow ❱ perl -F'\s+' -lane '$F[4] =~ s/\|/#/g; print "@F"' file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
stackoverflow ❱
If your file has no spaces, as someone pointed out in the comments, then you can split the columns apart, modify only that one, and join them all back together (the -n switch is spelled out in these one-liners; on Perl 5.20+ -F alone implies both -a and -n):
stackoverflow ❱ cat file2
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO|OGLE|EUROPE|EURO|PARIS
XYZE|12|Y|A|HOO|USA|DOLLAR|SEATTLE
LMNO|17||FACE|B|O|O|K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A||M|AZ|ON||AFRICA|AFDOLLAR|CAPETOWN
stackoverflow ❱ perl -F'\|' -lane '$s=$#F;$e="@F[2..$s-3]";$e=~s/ +/#/g;print join "|", @F[0..1],$e,join "|",@F[$s-2,$s-1,$s]' file2
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO#OGLE|EUROPE|EURO|PARIS
XYZE|12|Y#A#HOO|USA|DOLLAR|SEATTLE
LMNO|17|#FACE#B#O#O#K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A#M#AZ#ON#|AFRICA|AFDOLLAR|CAPETOWN

Another awk solution can be:
awk -F'[[:space:]][|][[:space:]]' '{gsub(/\|/,"#",$3);print $1,"|",$2,"|",$3,"|",$4,"|",$5,"|",$6}' file.txt
Explanation:
-F: the field separator; here it is space|space (note this assumes exactly one whitespace character on each side of every real delimiter).
gsub: global substitution in field 3, i.e. every occurrence of | will be replaced by #.
print: just print all the columns separated by "|".
The output will be:
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN

A simple awk solution:
awk -F "|" '{printf "%s", $1}
{for(i=2; i<=NF; i++) { if(i>3 && i<NF-2) printf "#%s", $i; else printf "|%s", $i } printf "\n"}' file
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO#OGLE|EUROPE|EURO|PARIS
XYZE|12|Y#A#HOO|USA|DOLLAR|SEATTLE
LMNO|17|#FACE#B#O#O#K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A##M#AZ#ON#|AFRICA|AFDOLLAR|CAPETOWN
if(i>3 && i<NF-2): this condition picks out the extra, unwanted fields after the 3rd field and before the (NF-2)th field. When it holds, a "#" is printed before the field instead of the "|" delimiter.
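A small variation (my sketch, not part of this answer) takes the junk character as a variable, so # can be swapped for $, & or anything else without editing the script:
awk -F'|' -v ch='#' '{
    printf "%s", $1
    for (i=2; i<=NF; i++) printf "%s%s", (i>3 && i<NF-2 ? ch : "|"), $i
    printf "\n"
}' file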

I didn't try to put this in one line, but rather made it a little easier to read. Those who play perl golf will be able to reduce it considerably. The idea is to anchor the first two fields and the last three.
#!/usr/bin/perl
while (<DATA>) {
    chomp;
    if (($name, $num, $web, $location, $currency, $place) = $_ =~
        /^([^\|]+)\|([^\|]+)\|(.+)\|([^\|]+)\|([^\|]+)\|([^\|]+)$/) {
        $web =~ tr/\|/\_/;
        printf "%s\n", join('|', ($name, $num, $web, $location, $currency, $place));
    }
}
__DATA__
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO|OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y|A|HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | |FACE|B|O|O|K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A||M|AZ|ON| | AFRICA | AF DOLLAR | CAPETOWN
Output:
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO_OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y_A_HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | _FACE_B_O_O_K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A__M_AZ_ON_ | AFRICA | AF DOLLAR | CAPETOWN
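The question asked for a junk value like # rather than the underscore used here; to get that, change the tr line accordingly:
$web =~ tr/\|/#/;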

Related

Combine multiple rows into single row in Pandas Dataframe

I have got a child table here. Here is the sample data.
+----+------+----------+----------------+--------+---------+
| ID | Name | City     | Email          | Phone  | Country |
+----+------+----------+----------------+--------+---------+
| 1  | Ted  | Chicago  | abc@gmail.com  | 132321 | USA     |
| 1  | Josh | Richmond | abc@gmail.com  | 435324 | USA     |
| 2  | John | Seattle  | 123@gmail.com  | 322421 | USA     |
| 2  | John | Berkley  | 4723@gmail.com | 322421 | USA     |
| 2  | Mike | Seattle  | 4723@gmail.com | 322421 | USA     |
+----+------+----------+----------------+--------+---------+
The rows above need to be appended together. Only unique values are required.
+----+---------------+----------------------+----------------------------------+-------------------+---------+
| ID | Name          | City                 | Email                            | Phone             | Country |
+----+---------------+----------------------+----------------------------------+-------------------+---------+
| 1  | 'Ted','Josh'  | 'Chicago','Richmond' | 'abc@gmail.com'                  | '132321','435324' | 'USA'   |
| 2  | 'John','Mike' | 'Seattle','Berkley'  | '123@gmail.com','4723@gmail.com' | '322421'          | 'USA'   |
+----+---------------+----------------------+----------------------------------+-------------------+---------+
If ordering is important, use GroupBy.agg with a lambda function and remove duplicates with a dictionary:
df1=df.groupby('ID').agg(lambda x: ','.join(dict.fromkeys(x.astype(str)).keys())).reset_index()
#another alternative, but slow if large data
#df = df.groupby('ID').agg(lambda x: ','.join(x.astype(str).unique())).reset_index()
print (df1)
ID Name City Email \
0 1 Ted,Josh Chicago,Richmond abc@gmail.com
1 2 John,Mike Seattle,Berkley 123@gmail.com,4723@gmail.com
Phone Country
0 132321,435324 USA
1 322421 USA
If ordering is not important, use a similar solution with duplicates removed by sets:
df2 = df.groupby('ID').agg(lambda x: ','.join(set(x.astype(str)))).reset_index()
print (df2)
ID Name City Email \
0 1 Josh,Ted Richmond,Chicago abc@gmail.com
1 2 John,Mike Berkley,Seattle 4723@gmail.com,123@gmail.com
Phone Country
0 435324,132321 USA
1 322421 USA

Bash Printf Formatting Incorrectly

In my bash script I am printing variables and formatting the output using printf.
While most of the columns are aligned, there are some that are not (note: Sport Mídia). Here is the code for printing the data:
for((counter = 0; counter < ${#views[@]}; counter++))
{
    printf "%-40s | %-9s | %-15s" "${users[$counter]}" "${views[$counter]}" "${duration[$counter]}" #"${ids[$counter]}" "${titles[$counter]}"
    printf "\n"
}
Here is a sample of the output:
users | views | duration
Saturday Night Live | 10853524 | 9:46
Right Side Broadcasting | 346333 | 2:34:31
FOX 10 Phoenix | 319507 | 3:29
LastWeekTonight | 2997140 | 19:55
nigahiga | 6372021 | 2:56
Disney Movie Trailers | 7372656 | 1:50
RWW Blog | 125448 | 1:29
POLITICAL HUMOR | 173517 | 4:23
solangeknowlesmusic | 1613158 | 4:25
theDOMINICshow | 488995 | 4:13
TheWeekndVEVO | 1937027 | 3:59
swampgarage | 720718 | 1:43
Fox News | 164336 | 7:40
Bud Light | 224627 | 0:16
BuzzFeedVideo | 5575303 | 7:56
swampfoot | 8177252 | 9:07
Bloomberg | 349937 | 2:33
Kubau2 | 6358091 | 8:40
DOCUMENTARY TUBE | 926035 | 13:12
KLM Royal Dutch Airlines | 5796674 | 6:12
DOCUMENTARY TUBE | 3456648 | 10:51
ExtremeTV | 18846489 | 6:34
Sport Mídia | 4806074 | 8:23
Sam Chui | 6124697 | 6:47
DMKSPROD | 4111882 | 11:30
That's why the tab character was invented: to have text at the same position. (The root cause here is most likely that bash's printf counts field widths in bytes, not display columns; the í in "Sport Mídia" is two bytes in UTF-8, so that row's padding comes up one column short. A tab stop absorbs the difference.)
$ a="ExtremeTV"
$ b="Sport Mídia"
$ printf '%18s |\n' "$a" "$b"
         ExtremeTV |
      Sport Mídia |
$ printf '%18s \t|\n' "$a" "$b"
         ExtremeTV      |
      Sport Mídia       |
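If you would rather keep true column alignment instead of relying on tab stops, here is a minimal sketch (my addition, assuming bash and a UTF-8 locale): ${#s} counts characters rather than bytes, so the padding can be computed by hand:
pad() {
    # print the string, then pad with spaces up to w characters
    local s=$1 w=$2
    printf '%s' "$s"
    printf '%*s' $(( w > ${#s} ? w - ${#s} : 0 )) ''
}
pad "ExtremeTV" 18;   printf '|\n'
pad "Sport Mídia" 18; printf '|\n'
This left-justifies each value in an 18-character cell, so the pipes line up regardless of how many bytes the accented characters take.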

PowerPivot Grouped Average DAX

I'm trying to model some outbound calling data in PowerPivot. We have reps across multiple locations, and in general we break our outbound calling down into two periods of the day (before and after 12pm).
We can export from our phone system a list of every call made in a day; an example is as follows:
+------------+-------------+-------+-----------+-------------+
| Date | Call Length | Agent | Workgroup | Call Period |
+------------+-------------+-------+-----------+-------------+
| 01.01.2016 | 00:05:26 | Sam | Sydney | 1 |
| 01.01.2016 | 00:15:05 | Sam | Sydney | 1 |
| 01.01.2016 | 00:55:22 | John | Sydney | 2 |
| 01.01.2016 | 00:45:11 | Sam | Sydney | 2 |
| 01.01.2016 | 00:04:52 | John | Sydney | 1 |
| 01.01.2016 | 00:01:52 | Timmy | London | 1 |
| 01.01.2016 | 00:02:21 | Timmy | London | 2 |
| 01.01.2016 | 00:05:21 | Karen | London | 1 |
| 02.01.2016 | 00:15:21 | Sam | Sydney | 1 |
| 02.01.2016 | 00:42:44 | Sam | Sydney | 2 |
| 02.01.2016 | 01:52:22 | John | Sydney | 1 |
| 02.01.2016 | 00:53:24 | John | Sydney | 1 |
| 02.01.2016 | 00:05:53 | Kerry | Sydney | 2 |
| 02.01.2016 | 00:43:43 | Sam | Sydney | 2 |
| 02.01.2016 | 01:08:00 | John | Sydney | 2 |
| 02.01.2016 | 00:13:52 | Timmy | London | 2 |
| 02.01.2016 | 00:25:44 | Timmy | London | 1 |
| 02.01.2016 | 02:58:31 | Karen | London | 1 |
| 02.01.2016 | 00:08:37 | Timmy | London | 2 |
| 02.01.2016 | 00:12:28 | Karen | London | 2 |
+------------+-------------+-------+-----------+-------------+
What I'm trying to calculate is the average daily time spent on the phone per Workgroup, i.e. on average how long each agent is on the phone at each location.
I'm guessing the arithmetic is as follows:
Measure 1: Total talk time for each Agent (eg. sum of all talk time for the day)
Measure 2: Average agent total talk time per workgroup (eg. sum of the above grouped by workgroup, divided by number of agents in that workgroup)
The output might look something like this (but doesn't have to be):
+------------+-----------+-----------------------+-----------------+-----------------------------+
| Date | Workgroup | Total Number of Calls | Total Talk Time | Average Talk Time per Agent |
+------------+-----------+-----------------------+-----------------+-----------------------------+
| 01.01.2016 | Sydney | 11 | 03:02:42 | 1:34:53 |
| | London | 4 | 02:24:51 | 01:13:41 |
| 02.01.2016 | Sydney | 5 | 01:52:05 | 00:56:51 |
| | London | 52 | 10:11:23 | 03:51:11 |
+------------+-----------+-----------------------+-----------------+-----------------------------+
Apologies if I'm unclear in what I'm asking.
Slicing your data on a pivot table will do the calculations.
You only need the following measures:
DurationOfCall :=sum(MyTable[CallLength])
NrOfCalls :=countrows(MyTable)
AvgDuration :=DIVIDE([DurationOfCall],[NrOfCalls])
This will give the following result on your sample dataset (the result table is in the linked workbook, not reproduced here):
Workbook with test case: attachment

How to compare the lines in two files on the ID columns?

I have two files with dynamic length, from 1 to 30 lines, and this data:
[File1]
Time | Name | Name | ID1 | ID2
10:50 | Volume | Xxx | 55 | 65
12:50 | Kate | Uh | 35 | 62
15:50 | Maria | Zzz | 38 | 67
15:50 | Alex | Web | 38 | 5
...
[File2]
Time | Name | Name | ID1 | ID2
10:50 | Les | Xxx | 31 | 75
15:50 | Alex | Web | 38 | 5
...
How do I compare the two files on the ID1 and ID2 columns only, so that every line of [File1] is compared with every line of [File2]?
If the data exists in both files, the line is saved to file [File3] with a * character added.
In addition, [File3] should also get the remaining, unmatched data from [File1].
Result:
[File3]
Time | Name | Name | ID1 | ID2
15:50 | Alex | Web | * 38 | 5
10:50 | Volume | Xxx | 55 | 65
12:50 | Kate | Uh | 35 | 62
15:50 | Maria | Zzz | 38 | 67
Using awk
awk 'BEGIN{t="Time | Name | Name | ID1 | ID2"}
FNR==1{next}
NR==FNR{a[$4 FS $5];next}
{ if ($4 FS $5 in a)
{$4="*"$4;t=t RS $0}
else{s=s==""?$0:s RS $0}
}
END{print t RS s}' FS=\| OFS=\| file2 file1
Time | Name | Name | ID1 | ID2
15:50 | Alex | Web |* 38 | 5
10:50 | Volume | Xxx | 55 | 65
12:50 | Kate | Uh | 35 | 62
15:50 | Maria | Zzz | 38 | 67
Explanation
BEGIN{t="Time | Name | Name | ID1 | ID2"} # set the title
FNR==1{next} # ignore the title, FNR is the current record number in the current file.for each file
NR==FNR{a[$4 FS $5];next} # record the $4 and $5 into Associative array a
{ if ($4 FS $5 in a)
{$4="*"$4;t=t RS $0} # if found in file1, mark the $4 with start "*" and attach to var t
else{s=s==""?$0:s RS $0} # if not found, attach to var s
{print t RS s} # print the result.
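To actually produce [File3] (my addition, not part of the original answer), save the program to a file, say compare.awk (a name chosen here for illustration), and redirect the output:
awk -f compare.awk FS=\| OFS=\| file2 file1 > File3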

Script to delete all n lines starting from a word, except the last line

How do I delete all lines below a word, except the last line in a file? Suppose I have a file which contains:
| 02/04/2010 07:24:20 | 20-24 | 26 | 13 | 2.60 |
| 02/04/2010 07:24:25 | 25-29 | 6 | 3 | 0.60 |
+---------------------+-------+------------+----------+-------------+
02-04-2010-07:24 --- ER GW 03
+---------------------+-------+------------+----------+-------------+
| date | sec | BOTH_MO_MT | MO_or_MT | TPS_PER_SEC |
+---------------------+-------+------------+----------+-------------+
| 02/04/2010 07:00:00 | 00-04 | 28 | 14 | 2.80 |
| 02/04/2010 07:00:05 | 05-09 | 27 | 14 | 2.70 |
...
...
...
...
END OF TPS PER 5 REPORT
and I need to delete all contents from "02-04-2010-07:24 --- ER GW 03" onward, except "END OF TPS PER 5 REPORT", and save the file.
This has to be done for around 700 files. All files have the same format, with a date-month-day filename.
sed -ni '/ER GW/ b end; p; d; :end $p; n; b end' "$file"
$file should be the filename. E.g.:
for file in *.txt ; do
    sed -ni '/ER GW/ b end; p; d; :end $p; n; b end' "$file"
done
The following awk script will do it:
awk '
/^02-04-2010-07:24 --- ER GW 03$/ {skip=1}  # once the marker line is seen, start skipping
{ln=$0;if (skip!=1){print}}                 # always remember the current line; print only while not skipping
END {if (skip==1){print ln}}'               # print the remembered last line
as shown in the following transcript:
$ echo '| 02/04/2010 07:24:20 | 20-24 | 26 | 13 | 2.60 |
| 02/04/2010 07:24:25 | 25-29 | 6 | 3 | 0.60 |
+---------------------+-------+------------+----------+-------------+
02-04-2010-07:24 --- ER GW 03
+---------------------+-------+------------+----------+-------------+
| date | sec | BOTH_MO_MT | MO_or_MT | TPS_PER_SEC |
+---------------------+-------+------------+----------+-------------+
| 02/04/2010 07:00:00 | 00-04 | 28 | 14 | 2.80 |
| 02/04/2010 07:00:05 | 05-09 | 27 | 14 | 2.70 |
...
...
...
...
END OF TPS PER 5 REPORT' | awk '
/^02-04-2010-07:24 --- ER GW 03$/ {skip=1}
{ln=$0;if (skip!=1){print}}
END {if (skip==1){print ln}}'
which produces:
| 02/04/2010 07:24:20 | 20-24 | 26 | 13 | 2.60 |
| 02/04/2010 07:24:25 | 25-29 | 6 | 3 | 0.60 |
+---------------------+-------+------------+----------+-------------+
END OF TPS PER 5 REPORT
as requested.
Breaking it down:
skip is initially 0 (false).
if you find a line you want to start skipping from, set skip to 1 (true) - change this pattern where necessary.
if skip is false, output the line.
regardless of skip, store the last line.
at the end, if skip is true, output the last line (the skip check prevents printing it twice).
For doing it to multiple files, you can just use for:
for fspec in *.txt ; do
    awk '/^02-04-2010-07:24 --- ER GW 03$/ {skip=1} {ln=$0; if (skip!=1) print} END {if (skip==1) print ln}' <"${fspec}" >"${fspec}.new"
done
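If the 700 originals really have to be overwritten in place, a possible follow-up step (my addition; spot-check a few .new files first) is:
for fspec in *.txt ; do
    mv "${fspec}.new" "${fspec}"
done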
The command required for your update in the comment (searching for "--- ER GW 03") is:
awk '
/--- ER GW 03/ {skip=1}
{ln=$0;if (skip!=1){print}}
END {if (skip==1){print ln}}'
This might work for you:
sed -i '$q;/^02-04-2010-07:24 --- ER GW 03/,$d' *.txt
With -i each file is edited separately, so $ addresses each file's own last line: that line is printed by $q before the range delete can remove it.
