Extract emails from file with more than 100 users - linux

I can't quite wrap my head around this issue. I have a list of email addresses and want to produce a file containing a subset of them: if any domain in the list has more than 100 email addresses assigned to it, I need the addresses at that domain written to a file.
emaillist.txt file will have:
5000 occurrences of userID#yahoo.com
2000 occurrences of userID#aol.com
100 occurrences of userID#rr.com
10 occurrences of userID#whatever.com
cut -d '#' -f 2 emaillist.txt | sort | uniq -c | sort -rn
outputs
5000 yahoo.com
2000 aol.com
100 rr.com
10 whatever.com
Now that I know how many email addresses I have at each domain, I want the new file to contain only the addresses whose domain has more than 100 users.

This should do what you want:
cut -d '#' -f 2 emaillist.txt | sort | uniq -c | awk '$1 >= 100 {print $2}' | while read -r e; do grep "#$e$" emaillist.txt >> emailkeep.txt; done

Assuming your file contains only email addresses, the following awk command should solve your problem:
awk '{split($0, a, "#");} NR==FNR{mp[a[2]]++; next} (mp[a[2]]>=100)' emaillist.txt emaillist.txt
Change the >=100 threshold to whatever you need.
DEMO
lo#ubuntu:~$ cat emaillist.txt
userID#yahoo.com
userID1#yahoo.com
userID2#yahoo.com
userID#aol.com
userID#rr.com
userID#whatever.com
lo#ubuntu:~$ awk '{split($0, a, "#");} NR==FNR{mp[a[2]]++; next} (mp[a[2]]>1)' emaillist.txt emaillist.txt
userID#yahoo.com
userID1#yahoo.com
userID2#yahoo.com
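As a rough sketch, the two-pass awk approach above can be wrapped in a small function that takes the threshold as a parameter. The function name keep_big_domains and the awk variable min are made up for illustration, and "#" is assumed to be the separator, as in the sample data.

```shell
# Hypothetical helper: print every address whose domain occurs more than
# "min" times in the file. Assumes one address per line, "#" as separator.
keep_big_domains() {
    # $1 = input file, $2 = count that must be exceeded
    awk -F'#' -v min="$2" '
        NR == FNR { count[$2]++; next }   # first pass: count addresses per domain
        count[$2] > min                   # second pass: print lines of large domains
    ' "$1" "$1"
}

# Example: keep addresses from domains with more than 100 users.
# keep_big_domains emaillist.txt 100 > emailkeep.txt
```

The file is read twice on purpose: the counts are only complete after the first pass, so the filtering has to happen in a second one.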

Related

Extracting the user with the most amount of files in a dir

I am currently working on a script that should read a directory name from standard input and output the user who owns the most files in that directory.
I've written this so far:
#!/bin/bash
while read DIRNAME
do
ls -l $DIRNAME | awk 'NR>1 {print $4}' | uniq -c
done
and this is the output I get when I enter /etc, for instance:
26 root
1 dip
8 root
1 lp
35 root
2 shadow
81 root
1 dip
27 root
2 shadow
42 root
Now obviously root is winning in this case, but I don't just want this listing; I want to sum the file counts per user and output only the user with the most files.
Expected output for entering /etc:
root
Is there a simple way to filter the output I get now so that only the user with the highest sum is kept?
ls -l /etc | awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' | sort -g -r | head -n 1 | cut -d' ' -f2
This snippet returns the group with the highest number of files in the /etc directory.
What it does:
ls -l /etc lists all the files in /etc in long form.
awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' sums the number of occurrences of unique words in the 4th column and prints the number followed by the word.
sort -g -r sorts the output descending based on numbers.
head -n 1 takes the first line
cut -d' ' -f2 takes the second column while the delimiter is a white space.
Note: In your question, you are saying that you want the user with the highest number of files, but in your code you are referring to the 4th column which is the group. My code follows your code and groups on the 4th column. If you wish to group by user and not group, change {a[$4]+=1} to {a[$3]+=1}.
Without unreliably parsing the output of ls:
read -r dirname
# List user owner of files in dirname
stat -c '%U' "$dirname"/* |
# Sort the list of users by name
sort |
# Count occurrences of user
uniq -c |
# Sort by higher number of occurrences numerically
# (first column numerically reverse order)
sort -k1nr |
# Get first line only
head -n1 |
# Keep only starting at character 9 to get user name and discard counts
cut -c9-
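A possible variant of the same idea, sketched under the assumption that GNU find is available (the -printf action is a GNU extension). The function name top_owner is hypothetical. Unlike a shell glob, find also sees hidden entries, and awk is used to pick the name so the result does not depend on the column width that uniq -c happens to use.

```shell
# Hypothetical helper: print the user owning the most entries directly
# inside directory $1. Assumes GNU find (for -printf).
top_owner() {
    find "$1" -mindepth 1 -maxdepth 1 -printf '%u\n' |   # one owner name per entry
        sort | uniq -c |                                 # count entries per owner
        sort -k1,1nr | head -n 1 |                       # keep the largest count
        awk '{print $2}'                                 # drop the count, keep the name
}

# top_owner /etc
```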
I have an awk script to read standard input (or command line files) and sum up the unique names.
summer:
awk '
{ sum[ $2 ] += $1 }
END {
for ( v in sum ) {
print v, sum[v]
}
}
' "$@"
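Let's say we are using your example of /etc. summer expects "count name" lines, which is what your script already prints:
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer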
yields:
0
dip 2
shadow 4
root 219
lp 1
I like to keep utilities general so I can reuse them for other purposes. Now you can just use sort and head to get the maximum result output by summer (which is fed the "count name" lines your script prints):
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer | sort -r -k2,2 -n | head -1 | cut -f1 -d' '
Yields:
root
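If only the winner is needed, the summing and the maximum can also be done in one awk pass, without sort and head. This is a minimal sketch; the function name top_name is made up, and it assumes the same "count name" input as summer. Ties are resolved arbitrarily.

```shell
# Hypothetical helper: read "count name" lines on stdin, add up the counts
# per name, and print only the name with the largest total.
top_name() {
    awk '
        NF >= 2 { sum[$2] += $1 }          # skip blank or malformed lines
        END {
            best = ""
            for (name in sum)
                if (best == "" || sum[name] > sum[best]) best = name
            if (best != "") print best
        }
    '
}

# Example with some of the counts shown in the question; prints "root".
printf '%s\n' '26 root' '1 dip' '8 root' '1 lp' '35 root' | top_name
```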

awk command that compares strings for difference

I have a gz file with values in $12 and $33, both of which contain strings (e.g. $12: 33-A and $33: 33A). I am trying to write an awk command that reads the values and counts the number of times "-" is in $12 but not in $13.
I have: gzcat test.gz | awk '{if ($12!=$33 && $12~/ -/ && $33!~/ -/) wc -l; else null} | wc -l'
But that command doesn't work and doesn't give me the outcome I would like.
There is no need to check equality separately since it's implied, and no need to use wc; awk is capable of counting:
... | awk '$12~/-/ && $33!~/-/{count++} END{print count+0}'
P.S. Your script is not a valid awk script. Also, is the field 33 or 13?
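Since the question leaves open which column is meant, one option is to pass the column numbers in as parameters. A minimal sketch, where the function name count_dash_mismatch and the awk variables a and b are made up for illustration:

```shell
# Hypothetical helper: count lines on stdin where the column numbered $1
# contains "-" and the column numbered $2 does not.
count_dash_mismatch() {
    awk -v a="$1" -v b="$2" '
        $a ~ /-/ && $b !~ /-/ { n++ }   # dash in column a, none in column b
        END { print n + 0 }             # n + 0 prints 0 when nothing matched
    '
}

# Example for the columns named in the question:
# gzcat test.gz | count_dash_mismatch 12 33
```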

bash remove the same in file

I have an issue with counting the strings that differ between two files.
I have two files, for example :
file1 :
aaa1
aaa4
bbb3
ccc2
and
file2:
bbb3
ccc2
aaa4
How can I get the value 1 from these files (in this case because of the string aaa1)?
I have one command, but it does not only count the differing strings; it also takes the order of the rows into account.
diff file1 file2 | grep "<" | wc -l
Thanks.
You can use grep -v -c with other options as this:
grep -cvwFf file2 file1
1
Options used are:
-c - get the count of matches
-v - invert matches
-w - full word match (to avoid partial matches)
-F - fixed string match
-f - Use a file for matching patterns
As far as I understand your requirements, sorting the files prior to the diff is a quick solution:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep -c '^[<>]'
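Another way to express "lines in file1 that are missing from file2, regardless of order" is comm on sorted input. The sketch below is only one possible approach; the function name count_only_in_first is hypothetical, and because it uses sort -u it counts distinct lines rather than repeated ones.

```shell
# Hypothetical helper: count distinct lines that occur in file $1 but not in
# file $2, ignoring line order. Sorted copies go to temporary files.
count_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b" | wc -l    # -23 keeps only lines unique to the first file
    rm -f "$a" "$b"
}

# count_only_in_first file1 file2
```

With the file1 and file2 from the question this reports 1, for aaa1.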

Linux shell script read columns into variable and then add the attribute

I have a file test.txt looking like this:
2092 Mary
103 Tom
1239 Mary
204 Mark
1294 Tom
1092 Mary
I am trying to create a shell script that will
Read each line and put the data in two columns into variable var1 and var2
If var2 in each line is the same, then add the var1 in those lines.
output the file into a text file.
The result should be unique values in the var2 column. Here's what I have so far:
#!/bin/sh
#!/usr/bin/sh
cat test.txt| while read line;
do
$var1=$(echo $line| awk -F\; '{print $1}')
$var2=$(echo $line| awk -F\; '{print $2}')
How can I reference the variable in each line and then compare them?
The expected output would be:
4423 Mary
1397 Tom
204 Mark
Using awk it is easy:
awk '{sum[$2] += $1} END {for (i in sum) printf "%4d %s\n", sum[i], i; }'
If you want to do it with bash 4.x (not 3.x), then:
declare -A sum
while read number name
do
((sum[$name] += $number))
done
for name in "${!sum[@]}"
do
echo ${sum[$name]} $name
done
The structure here is essentially isomorphic with the awk script, but a little less notationally convenient. It will read from standard input, using the names as indexes into the associative array sum. The ${!sum[@]} notation is described in the Shell Parameter Expansion section of the manual, and not even hinted at in the section on Arrays. The information is there if you know where to look.
If you want to process an arbitrary number of input files (like the awk script would) then you need to use cat to collect the data:
cat "$@" |
{
declare -A sum
while read number name
do
((sum[$name] += $number))
done
for name in "${!sum[@]}"
do
echo ${sum[$name]} $name
done
}
This is not a useless use of cat (UUOC) because it handles no arguments (reading standard input), one argument, or many arguments.
For all the scripts, if you want to sort the output in number or name order, apply an appropriate sort to the output of the script:
script file1 file2 file3 | sort -k 1,1n # By sum increasing order
script file1 file2 file3 | sort -k 1,1nr # By sum decreasing order
script file1 file2 file3 | sort -k 2,2 # By name increasing order
script file1 file2 file3 | sort -k 2,2r # By name decreasing order
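The expected output in the question lists the names in order of first appearance (Mary, Tom, Mark), whereas awk's for (i in sum) visits keys in an unspecified order. If that order matters, one possible sketch is to record each name the first time it is seen; the function name sum_by_name and the helper array order are made up for illustration.

```shell
# Hypothetical helper: sum column 1 per name in column 2, printing names in
# the order they first appear. Reads the named files, or stdin if none given.
sum_by_name() {
    awk '
        NF >= 2 {
            if (!($2 in sum)) order[++n] = $2   # remember first appearance
            sum[$2] += $1
        }
        END { for (i = 1; i <= n; i++) print sum[order[i]], order[i] }
    ' "$@"
}

# sum_by_name test.txt
```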

Find value from one csv in another one (like vlookup) in bash (Linux)

I have already tried all options that I found online to solve my issue but without good result.
Basically I have two csv files (pipe separated):
file1.csv:
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
file2.csv:
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
I need a linux bash script to find the value of pos.3 from file2 based on the content of pos7 in file1.
Example:
file1, line1, pos 7: MAYOBAN
find MAYOBAN in file2, return pos 3 (2400)
the output should be something like this:
2400
2200
2200
etc...
Please help
Jacek
A simple approach, far from perfect:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv );
do
grep "${i}" file2.csv | cut -f 3 -d "${DELIMITER}";
done
This will work, but since the input files must be sorted, the output order will be affected:
join -t '|' -1 7 -2 1 -o 2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output would look like:
2200
2200
2400
which is useless. In order to have a useful output, include the key value:
join -t '|' -1 7 -2 1 -o 0,2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output then looks like this:
CORKCOR|2200
CORKKIN|2200
MAYOBAN|2400
Edit:
Here's an AWK version:
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
This loops through file1.csv and creates array entries for each value of field 7. Simply referring to an array element creates it (with a null value). FNR is the record number in the current file and NR is the record number across all files. When they're equal, the first file is being processed. The next instruction reads the next record, creating a loop. When FNR == NR is no longer true, the subsequent file(s) are processed.
So file2.csv is now processed and if it has a field 1 that exists in the array, then its field 3 is printed.
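The same FNR == NR idiom can be turned around so that the output follows the row order of file1.csv, which is what the question's example output shows. This is a sketch under the assumption that the lookup file is not empty (with an empty first file, FNR == NR would also be true for the second one); the function name lookup_field3 is hypothetical.

```shell
# Hypothetical helper: for each row of file $1, look up field 7 in file $2
# (key in field 1) and print the matching field 3, keeping the row order of $1.
lookup_field3() {
    awk -F'|' '
        NR == FNR { value[$1] = $3; next }   # first file read: build the lookup table
        $7 in value { print value[$7] }      # second file read: resolve each row
    ' "$2" "$1"
}

# lookup_field3 file1.csv file2.csv
```

Note that the lookup file is passed to awk first, so the table is complete before any row of the other file is processed.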
You can use Miller (https://github.com/johnkerl/miller).
Starting from input01.txt
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
and input02.txt
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
and running
mlr --csv -N --ifs "|" join -j 7 -l 7 -r 1 -f input01.txt then cut -f 3 input02.txt
you will have
2400
2200
2200
Some notes:
-N to set input and output without header;
--ifs "|" to set the input fields separator;
-l 7 -r 1 to set the join fields of the input files;
cut -f 3 to extract the field named 3 from the join output
cut -d'|' -f7 file1.csv | while read -r line
do
grep "^$line|" file2.csv | cut -d'|' -f3
done
