Remove Lines With Number Less Than X In Nth Field - linux

I have a file consisting of lines like this:
ExampleText | En | 1.0
ExampledText | Es | 0.9
ExamplesText | En | 0.9994
ExampleTexts | Br | 0.991
ExampledText | Es | 0.83324
ExamplerText | En | 0.4494
Using grep .*| En, I can get all the lines containing En. However, how can I also remove all lines whose value in the last column is less than 0.5?
The expected output is:
ExampleText | En | 1.0
ExamplesText | En | 0.9994
Your positive input is highly appreciated.

awk '$2 == "En" && $3 >= .5' FS=' \\| ' file
This sets the field separator to a pipe surrounded by single spaces ( | ).
A line matches, and is printed, if field 2 equals En and field 3 is greater than or equal to .5.
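For comparison, the same filter can be sketched in Python. This is only an illustration of the logic in the awk one-liner; the function name filter_lines and its parameters are made up for this example.

```python
def filter_lines(lines, lang="En", threshold=0.5):
    """Keep lines whose 2nd field equals `lang` and whose 3rd field is >= threshold."""
    kept = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # skip malformed lines
        try:
            score = float(fields[2])
        except ValueError:
            continue  # skip lines whose last column is not a number
        if fields[1] == lang and score >= threshold:
            kept.append(line)
    return kept


SAMPLE = [
    "ExampleText | En | 1.0",
    "ExampledText | Es | 0.9",
    "ExamplesText | En | 0.9994",
    "ExampleTexts | Br | 0.991",
    "ExampledText | Es | 0.83324",
    "ExamplerText | En | 0.4494",
]

for line in filter_lines(SAMPLE):
    print(line)
```

Unlike the grep approach, the comparison is numeric, so a value such as 0.4494 is rejected regardless of how it is written.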

Related

Excel search multiples rows containing substring

I have an Excel file with 2 sheets :
The first one has a list of keywords in a column.
The second one has sentences in one column along with an id in another column.
Thus the 2 sheets look like this :
Sheet 1:
| A |
|-----|
| the |
| cat |
| ... |
Sheet 2:
| A | B |
|-------|--------------------|
| 15587 | The cat is walking |
| 94683 | No one here |
| 47222 | The TV is on |
| 59378 | No cat allowed |
| ... | ... |
What I want to do is put, in column B of sheet 1, the list of ids of the sentences in which the keyword is found. So on sheet 1 I would get:
A B
| the | 15587;47222 |
| cat | 15587;59378 |
| ... | ... |
Do you know how I can achieve this using functions? I tried VLOOKUP, but it only returns the first occurrence, and I don't know how to use FILTER with a condition that checks whether the sentence contains a string.
Thanks
You could try the following, which assumes the ids are in A1:A4, the sentences in B1:B4 and the keyword in D1.
Formula in E1:
=TEXTJOIN(";",,FILTER(A$1:A$4,ISNUMBER(SEARCH(" "&D1&" "," "&B$1:B$4&" ")),""))
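The trick in that formula is to pad both the keyword and the sentence with spaces, so that only whole words match. A minimal Python sketch of the same idea follows; the function name ids_for_keyword is hypothetical, and the lower-casing mimics the case-insensitive behaviour of SEARCH.

```python
def ids_for_keyword(keyword, sentences, joiner=";"):
    """Return the ids of all sentences containing `keyword` as a whole word."""
    needle = f" {keyword.lower()} "
    matches = [
        str(sentence_id)
        for sentence_id, text in sentences
        if needle in f" {text.lower()} "  # padding makes first/last words match too
    ]
    return joiner.join(matches)


SENTENCES = [
    (15587, "The cat is walking"),
    (94683, "No one here"),
    (47222, "The TV is on"),
    (59378, "No cat allowed"),
]

for keyword in ("the", "cat"):
    print(keyword, ids_for_keyword(keyword, SENTENCES))
```

Like the formula, this only treats spaces as word boundaries, so a keyword followed by punctuation would not be found.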

Horizontal vs Vertical array delimiters - International

Following up on an earlier question I had about horizontal vs vertical arrays, I have a question about their respective delimiters.
Problem definition:
Here is an example of an incorrect way of comparing two arrays:
{=SUMPRODUCT(--({"Apple","Pear"}={"Apple","Lemon","Pear"}))}
The correct way, in the case of an application with an English country code, would be:
{=SUMPRODUCT(--({"Apple","Pear"}={"Apple";"Lemon";"Pear"}))}
Within an English version of Excel (and most likely other versions too), these delimiters are a comma , for horizontal arrays and a semicolon ; for vertical ones. Plenty of information on this can be found online.
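To picture why the orientation matters: comparing a horizontal array with a vertical one produces a grid in which every element of one array is compared with every element of the other. The Python sketch below only imitates that behaviour with nested lists; count_matches is a made-up name and nothing here is Excel API.

```python
def count_matches(horizontal, vertical):
    """Compare every element of a horizontal array with every element of a
    vertical array and count the TRUE cells, like SUMPRODUCT(--(h=v))."""
    # One grid row per vertical element, one grid column per horizontal element.
    grid = [[h == v for h in horizontal] for v in vertical]
    return sum(cell for row in grid for cell in row)


print(count_matches(["Apple", "Pear"], ["Apple", "Lemon", "Pear"]))
```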
Working on a machine whose application has a Dutch country code, however, it is a completely different story. It is frustrating that my delimiters are different: ; and \. While the semicolon was fairly simple to work out, it has proven tricky to find any documentation on these delimiters for international versions.
Workaround:
Not knowing these delimiters up front makes it tricky for anyone working across a variety of international versions of the application to use this type of formula. A rather easy workaround is to use TRANSPOSE():
{=SUMPRODUCT(--({"Apple";"Pear"}=TRANSPOSE({"Apple";"Lemon";"Pear"})))}
Stepping through the built-in formula evaluation, we can then retrieve the backslash as the column separator. Another way would be to use the Application.International property with xlColumnSeparator and xlRowSeparator.
Question
We can both find and even override the xlDecimalSeparator and xlThousandsSeparator through Excel (File > Options > Advanced), or VBA (Application.DecimalSeparator = "-") but where can we find:
A place to actually see which xlRowSeparator and xlColumnSeparator are used within your own application, other than the workarounds I described. I am looking for an interface similar to the one for the thousands and decimal separators and/or official MS documentation.
Furthermore (not specifically looking for this), is there:
A place to override them, just like the decimal and thousands separators?
If not through Excel interfaces, can we brute-force this somehow through VBA?
I'm very curious if official documentation is present, and/or if the above can be done.
Not claiming this is the right answer, but with the help from comments from other users, maybe the below can clarify things a bit:
With no sign of any official documentation on this matter, and seemingly random row and column delimiters, @Gserg showed a trick to retrieve information for any LCID by using these unique ids on the MS Office support page "Create one-dimensional and two-dimensional constants". Although this is MS Office support information, the delimiters shown there are wrong. They might come up as . , ; : \ or even |. You get these results by changing the LCID in the URL to an LCID of interest, e.g. fr-fr.
Although there are about 600 different LCIDs, they all get redirected to a default LCID. With the help of @FlorentB. we discovered that not only is the MS Office support documentation wrong, but these delimiters are not that random after all. Countries that use a decimal point use , as the column delimiter (a horizontal array) and ; as the row delimiter (a vertical array). Countries that use a decimal comma use \ as the column delimiter and ; as the row delimiter.
Changing the system country settings, checking all default LCID's in Excel, we ended up with the matrix below showing all row and column delimiters per default LCID:
| LCID | Row | Column |
|-------|-----|--------|
| ar-sa | ; | , |
| bg-bg | ; | \ |
| cs-cz | ; | \ |
| da-dk | ; | \ |
| de-de | ; | \ |
| el-gr | ; | \ |
| en-gb | ; | , |
| en-ie | ; | , |
| en-us | ; | , |
| es-es | ; | \ |
| et-ee | ; | \ |
| fi-fi | ; | \ |
| fr-fr | ; | \ |
| he-il | ; | , |
| hr-hr | ; | \ |
| hu-hu | ; | \ |
| id-id | ; | \ |
| it-it | ; | \ |
| ja-jp | ; | , |
| ko-kr | ; | , |
| lt-lt | ; | \ |
| lv-lv | ; | \ |
| nb-no | ; | \ |
| nl-nl | ; | \ |
| pl-pl | ; | \ |
| pt-br | ; | \ |
| pt-pt | ; | \ |
| ro-ro | ; | \ |
| ru-ru | ; | \ |
| sk-sk | ; | \ |
| sl-si | ; | \ |
| sv-se | ; | \ |
| th-th | ; | , |
| tr-tr | ; | \ |
| uk-ua | ; | \ |
| vi-vn | ; | \ |
| zh-cn | ; | , |
| zh-hk | ; | , |
| zh-tw | ; | , |
The apparent conclusion is that all countries use a semicolon as the row (vertical) delimiter, and that, depending on their decimal separator, countries use either a backslash or a comma as the column (horizontal) delimiter within array formulas.
So even without proper MS documentation, and without a place in the Excel interface (as the thousands and decimal separators have), it is apparent that knowing your country's decimal separator tells you whether you use \ or , as the column delimiter.
| Dec_Separator | Row | Column |
|---------------|-----|--------|
| . | ; | , |
| , | ; | \ |
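The rule in this small table can be written down as a lookup. The Python sketch below simply encodes the observed table; it does not query Excel, and the names array_delimiters and array_constant are invented for the example.

```python
def array_delimiters(decimal_separator):
    """Return (row_delimiter, column_delimiter) per the observed table above."""
    if decimal_separator == ".":
        return ";", ","
    if decimal_separator == ",":
        return ";", "\\"
    raise ValueError(f"unexpected decimal separator: {decimal_separator!r}")


def array_constant(rows, decimal_separator):
    """Build the text of an array constant from a list of rows of strings."""
    row_sep, col_sep = array_delimiters(decimal_separator)
    body = row_sep.join(
        col_sep.join(f'"{cell}"' for cell in row) for row in rows
    )
    return "{" + body + "}"


# Horizontal array on a decimal-point machine, then on a decimal-comma machine.
print(array_constant([["Apple", "Pear"]], "."))
print(array_constant([["Apple", "Pear"]], ","))
# Vertical arrays use the semicolon in both cases.
print(array_constant([["Apple"], ["Lemon"], ["Pear"]], ","))
```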
I would happily receive more information about the above, and/or pointers to any correct MS Office documentation, to add to this.
It is possible to do this through native Excel (without VBA or add-ins) by querying Excel's C API, but I don't know of anywhere this is documented.
Go into Excel's Name Manager and click 'New...'. Enter a name such as GetColumnSeparator.
In the 'RefersTo:' box, enter the following to get the column separator:
=INDEX(GET.WORKSPACE(37), 14)
In an Excel cell, you can now enter this:
=GetColumnSeparator
and the comma (in English - or whatever symbol is in use on your machine) will be shown.
For the row separator you need to change the index number to 15:
=INDEX(GET.WORKSPACE(37), 15)
On an English machine, this will be the semicolon by default.
On machines where Excel's 'display language' is not English (meaning Excel's function names are translated), you will need a translated version of the above formula. Again I don't know of any documentation on this, so my best suggestion would be to install the English language pack, enter the formula in English, save the workbook, then revert to your original Excel language and re-open the workbook; Excel will translate the formula automatically.
Note that you will need to save the workbook as macro-enabled (e.g. .xlsm rather than .xlsx).
Sorry this is nearly three years late but I hope it helps.

Excel how to filter based on ( select columnA = ValueA) OR (select columnB = ValueB)

For example, I have a very simple excel sheet which has the layout and info as following:
| Name | English | French | Chinese | Korean |
|-------------|---------|--------|---------|--------|
| Eddie | Y | | | |
| Raymond | Y | | Y | |
| Celine Dion | Y | Y | | |
| Marion | | Y | | |
I want to filter the rows so that every remaining person can speak either Chinese OR English.
The result will be Eddie, Raymond, Celine.
Assuming that your data is in a table or that you've engaged the autofilter, I would create an extra column called ChineseOrEnglish containing an IF() formula that uses OR() to place a 'Y' in the column where either English or Chinese is flagged.
You would then be able to filter on the ChineseOrEnglish column.
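The helper-column idea amounts to computing one OR flag per row and then filtering on it. Here is a small Python sketch of that logic, using the sample data from the question; the names speaks_any and filter_rows are hypothetical.

```python
ROWS = [
    {"Name": "Eddie", "English": "Y", "French": "", "Chinese": "", "Korean": ""},
    {"Name": "Raymond", "English": "Y", "French": "", "Chinese": "Y", "Korean": ""},
    {"Name": "Celine Dion", "English": "Y", "French": "Y", "Chinese": "", "Korean": ""},
    {"Name": "Marion", "English": "", "French": "Y", "Chinese": "", "Korean": ""},
]


def speaks_any(row, languages):
    """The helper column: True when any of the given language columns is 'Y'."""
    return any(row.get(language) == "Y" for language in languages)


def filter_rows(rows, languages):
    """Filter on the helper column and return the matching names."""
    return [row["Name"] for row in rows if speaks_any(row, languages)]


print(filter_rows(ROWS, ["Chinese", "English"]))
```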

Find and Replace Pipe delimiter from field in a pipe delimited file

I asked a similar question earlier. I later had to broaden its scope but had no idea how to edit it and make it live again, which is why I am posting this as a new question.
My file is a pipe delimited file.
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO|OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y|A|HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | |FACE|B|O|O|K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A||M|AZ|ON| | AFRICA | AF DOLLAR | CAPETOWN
My file is as complicated as this. We need to remove the "|" symbol from the WEB field and replace it with a junk value like #, $, & or anything else.
The output has to be:
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
I've tried a few awk filters to clear this mess up, but nothing has worked. Thank you!
I would like to thank the people who answered my previous question: RomanPerekhrest, Ed Morton, shellter, val rog.
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { outNf=NF; print; next }
{
    end = beg + (NF - outNf) - 1
    for (i=1; i<=NF; i++) {
        sep = (i>=beg && i<=end ? "#" : OFS)
        printf "%s%s", $i, (i<NF ? sep : ORS)
    }
}
$ awk -v beg=3 -f tst.awk file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
How it works: on the first line the number of fields to be output equals the number of fields on that line, so the script saves that number as outNf. Any subsequent line with more than outNf fields has NF-outNf extra fields, created by the stray pipes inside the field that starts at position beg. So inside the loop it uses OFS after each field before beg, then uses # after each of the fields from beg to beg+(NF-outNf)-1 to glue that range into one merged output field, and then goes back to using OFS between the remaining fields.
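The same counting argument can be sketched in Python: split on every pipe, work out how many fields are surplus, and glue that many extra pieces back onto the target field. The function name merge_field and its parameters are made up for this illustration, and merge_index is 0-based (so 2 corresponds to beg=3 above).

```python
def merge_field(line, expected_fields, merge_index, repl="#", sep="|"):
    """Rejoin the surplus pieces of one field using `repl` instead of `sep`."""
    parts = line.split(sep)
    extra = len(parts) - expected_fields
    if extra <= 0:
        return line  # nothing to merge on this line
    last = merge_index + extra  # index of the final piece of the broken field
    merged = repl.join(parts[merge_index:last + 1])
    return sep.join(parts[:merge_index] + [merged] + parts[last + 1:])


LINES = [
    "NAME | NUM | WEB | LOCATION | CURRENCY | PLACE",
    "ABCD | 04 | GO|OGLE | EUROPE | EURO | PARIS",
    "EDDE | 98 | A||M|AZ|ON| | AFRICA | AF DOLLAR | CAPETOWN",
]

expected = len(LINES[0].split("|"))  # the header defines the field count
for line in LINES:
    print(merge_field(line, expected, merge_index=2))
```

As with the awk script, this relies on the header being clean and on only one field per line containing stray pipes.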
You can use this awk command:
awk 'BEGIN{FS=OFS="|"} NR==1{n=NF} NF > n {
s=$3; for (i=4; i<=NF-3; i++) {s = s "#" $i; $i=""} $3=s; gsub(/\|{2,}/, "|")} 1' file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
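This is easy if you do not mind using Perl.
If the file has spaces around the delimiters, we can print the WEB column like this (the -F pattern effectively splits on whitespace, so index 5 is the WEB column only when each line begins with whitespace; otherwise it is index 4):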
stackoverflow ❱ perl -F'\s+|\s+' -a -le 'print $F[5]' file
WEB
GO|OGLE
Y|A|HOO
|FACE|B|O|O|K
A||M|AZ|ON|
stackoverflow ❱
Since we can modify the @F array in Perl, we can do:
$F[5] =~ s/\|/#/g;
This modifies only this column, not the others.
Finally, we can print it:
stackoverflow ❱ perl -F'\s+|\s+' -lae '$F[5] =~ s/\|/#/g;print "@F"' file
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
stackoverflow ❱
If your file has no spaces, as someone pointed out in a comment, then you can split off the other columns, modify only that one, and join them all back together:
stackoverflow ❱ cat file2
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO|OGLE|EUROPE|EURO|PARIS
XYZE|12|Y|A|HOO|USA|DOLLAR|SEATTLE
LMNO|17||FACE|B|O|O|K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A||M|AZ|ON||AFRICA|AFDOLLAR|CAPETOWN
stackoverflow ❱ perl -F'\|' -le '$s=$#F;$e="@F[2..$s-3]";$e=~s/ +/#/g;print join "|", @F[0..1],$e,join "|",@F[$s-2,$s-1,$s]' file2
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO#OGLE|EUROPE|EURO|PARIS
XYZE|12|Y#A#HOO|USA|DOLLAR|SEATTLE
LMNO|17|#FACE#B#O#O#K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A#M#AZ#ON#|AFRICA|AFDOLLAR|CAPETOWN
Another awk solution can be:-
awk -F'[[:space:]][|][[:space:]]' '{gsub(/\|/,"#",$3);print $1,"|",$2,"|",$3,"|",$4,"|",$5,"|",$6}' file.txt
Explanation:-
-F - sets the field separator; here it is space|space
gsub - global substitution in field 3, i.e. every occurrence of | is replaced by #.
print - just print all the columns separated by "|"
output will be:-
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO#OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y#A#HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | #FACE#B#O#O#K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A##M#AZ#ON# | AFRICA | AF DOLLAR | CAPETOWN
A simple awk solution :
awk -F "|" '{printf "%s", $1}
{for(i=2; i<=NF; i++) { if(i>3 && i<NF-2) printf "#%s", $i; else printf "|%s", $i } printf "\n"} ' file
NAME|NUM|WEB|LOCATION|CURRENCY|PLACE
ABCD|04|GO#OGLE|EUROPE|EURO|PARIS
XYZE|12|Y#A#HOO|USA|DOLLAR|SEATTLE
LMNO|17|#FACE#B#O#O#K|ASIA|ASIANDOLLAR|HONGKONG
EDDE|98|A##M#AZ#ON#|AFRICA|AFDOLLAR|CAPETOWN
if(i>3 && i<NF-2) : this condition is true for the extra, unwanted fields after the 3rd field and before field NF-2. When it is satisfied, "#" is printed in front of the field instead of "|".
I didn't try to put this in one line, but rather made it a little easier to read. Those who play perl golf will be able to reduce it considerably. The idea is to anchor the first two fields and the last three.
#!/usr/bin/perl
while(<DATA>) {
    chomp;
    if(($name, $num, $web, $location, $currency, $place) = $_ =~
        /^([^\|]+)\|([^\|]+)\|(.+)\|([^\|]+)\|([^\|]+)\|([^\|]+)$/) {
        $web =~ tr/\|/\_/;
        printf "%s\n", join('|', ($name, $num, $web, $location, $currency, $place));
    }
}
__DATA__
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO|OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y|A|HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | |FACE|B|O|O|K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A||M|AZ|ON| | AFRICA | AF DOLLAR | CAPETOWN
Output:
NAME | NUM | WEB | LOCATION | CURRENCY | PLACE
ABCD | 04 | GO_OGLE | EUROPE | EURO | PARIS
XYZE | 12 | Y_A_HOO | USA | DOLLAR | SEATTLE
LMNO | 17 | _FACE_B_O_O_K | ASIA | ASIAN DOLLAR | HONGKONG
EDDE | 98 | A__M_AZ_ON_ | AFRICA | AF DOLLAR | CAPETOWN

How to insert a new record(row or line) after the last line of input file using awk?

The marks of the students are given as a table in the following format: Name | rollno | marks in exam 1 | marks in exam 2 ... That is, there is one record per line and the columns are separated by a | (pipe) character. At the end of all the records I want to add extra lines containing information such as max, min, mean and so on. So my question is: how would one add new records at the end of the input file?
Example:
Here is a sample input
Piyush | 12345 | 5 | 5 | 4
James | 007 | 0 | 0 | 7
Knuth | 31415 | 100 | 100 | 100
For which the output is
Piyush | 12345 | 5 | 5 | 4 | 14
James | 007 | 0 | 0 | 7 | 7
Knuth | 31415 | 100 | 100 | 100 | 300
max | | 100 | 100 | 100 | 300
min | | 0 | 0 | 4 | 7
mean | | 35.00 | 35.00 | 37.00 | 107.00
sd | | 46.01 | 46.01 | 44.56 | 136.50
awk '
BEGIN { FS=OFS="|" }
{
    sum = 0
    for (i=3;i<=NF;i++) {
        tot[i] += $i
        sum += $i
        max[i] = ( (i in max) && (max[i] > $i) ? max[i] : $i )
    }
    print $0, sum
    # after the loop i == NF+1, so this tracks the max of the per-line sums
    max[i] = ( (i in max) && (max[i] > sum) ? max[i] : sum )
}
END {
    # the label, followed by an empty rollno column
    printf "max%s%s", OFS, OFS
    nf = NF+1
    for (i=3; i<=nf; i++) {
        printf "%s%s", max[i], (i<nf?OFS:ORS)
    }
}' file
Repeat this for min and whatever else you need to calculate, and check the printf formatting flags for whatever spacing you need, if any.
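For reference, here is a Python sketch that produces all four summary rows from the sample input. It is not a translation of the awk script; the function name summarize is made up, and it assumes the sd row in the expected output is the population standard deviation (which is what reproduces 46.01 for the first exam).

```python
import statistics


def summarize(lines, sep=" | "):
    """Append a per-student total, then max/min/mean/sd rows for each column."""
    rows = [[cell.strip() for cell in line.split("|")] for line in lines if line.strip()]
    out, table = [], []
    for row in rows:
        marks = [float(cell) for cell in row[2:]]
        marks.append(sum(marks))  # the total becomes an extra column
        table.append(marks)
        out.append(sep.join(row + [format(marks[-1], "g")]))
    columns = list(zip(*table))  # transpose: one tuple per mark column
    stats = [
        ("max", max, "g"),
        ("min", min, "g"),
        ("mean", statistics.fmean, ".2f"),
        ("sd", statistics.pstdev, ".2f"),  # population standard deviation
    ]
    for label, func, fmt in stats:
        cells = [format(func(column), fmt) for column in columns]
        out.append(sep.join([label, ""] + cells))  # "" is the empty rollno cell
    return out


SAMPLE = [
    "Piyush | 12345 | 5 | 5 | 4",
    "James | 007 | 0 | 0 | 7",
    "Knuth | 31415 | 100 | 100 | 100",
]

for line in summarize(SAMPLE):
    print(line)
```

The summary rows are emitted only after every input line has been read, which is the same role the END block plays in awk.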
