find different beginnings of files with bash [duplicate] - linux

I have a directory with n files in it, all starting with a date in format yyyymmdd.
Example:
20210208_bla.txt
20210208_bla2.txt
20210209_bla.txt
I want to know how many files of a certain date I have, so the output should be like:
20210208 112
20210209 96
20210210 213
...
Or at least find the different beginnings of the actual file names (i.e. the different dates) in my folder.
Thanks

A very simple solution would be to do something like:
ls | cut -f 1 -d _ | sort -n | uniq -c
With your example this gives:
2 20210208
1 20210209
Update: If you need to swap the two columns, you can follow this answer: https://stackoverflow.com/a/11967849/2001017
ls | cut -f 1 -d _ | sort -n | uniq -c | awk '{ t = $1; $1 = $2; $2 = t; print; }'
which prints:
20210208 2
20210209 1
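If you would rather not parse ls output (file names with spaces or newlines can break that), plain bash with an associative array gives the same counts; a minimal sketch (bash 4+ for the associative array), assuming every name starts with the yyyymmdd_ prefix from the question:
declare -A counts
for f in ./*_*; do
    name=${f##*/}            # strip the leading ./
    date=${name%%_*}         # keep the part before the first underscore
    (( counts[$date]++ ))    # tally one file per date prefix
done
for d in "${!counts[@]}"; do
    printf '%s %s\n' "$d" "${counts[$d]}"
done | sort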

Related

Pipe each row of csv into bash command [duplicate]

I have a single column CSV file with no header and I want to iteratively find the value of each row and count the number of times it appears in several files.
Something like this:
for i in file.csv:
zcat *json.gz | grep i | wc -l
However, I don't know how to iterate through the CSV and pass the values forward.
Imagine that file.csv is:
foo,
bar
If foo exists 20 times in *json.gz and bar exists 30 times in *json.gz, I would expect the output of my command to be:
20
30
Here is the solution I found:
while IFS=',' read -r column; do
    count=$(zgrep -o "$column" *json.gz | wc -l)
    echo "$column,$count"
done < file.csv
You can achieve that with a single grep invocation, treating file.csv as a patterns file (one pattern per line):
grep -f file.csv -oh *.json | wc -l
-o - to print only matched parts
-h - to suppress file names from the output
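If you want one count per pattern (the 20 / 30 from the question) rather than a single grand total, the matches from that same pass can be tallied afterwards; a sketch, assuming file.csv holds one bare pattern per line and the data sits in uncompressed *.json files as above:
grep -ohf file.csv *.json | sort | uniq -c
uniq -c prints the count before each pattern; swap the columns with awk, as in the other answers, if you need them the other way round.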
You can iterate through the output of cat using command substitution:
for i in $(cat file.csv)  # iterates through all the rows in file.csv
do echo "My value is $i"; done
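Note that this form splits on any whitespace (not just line breaks) and also expands glob characters; reading the file line by line is usually safer. A small sketch of that variant:
while IFS= read -r i    # one iteration per line of file.csv
do echo "My value is $i"; done < file.csv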
using chatgpt :), try this:
#!/bin/bash
# Define the name of the CSV file
csv_file="path/to/file.csv"
# Loop through each file
for file in path/to/file1 path/to/file2 path/to/file3
do
    # Count each distinct value in the first column of the CSV,
    # then read the resulting "count value" pairs line by line
    cut -d',' -f1 "$csv_file" | sort | uniq -c | while read -r count val
    do
        # Count how many times the value occurs in the current file
        file_count=$(grep -o "$val" "$file" | wc -l)
        echo "$val appears $count times in $csv_file and $file_count times in $file"
    done
done

Extracting the user with the most amount of files in a dir

I am currently working on a script that should read a directory name from standard input and output the user with the highest number of files in that directory.
I've written this so far:
#!/bin/bash
while read DIRNAME
do
ls -l $DIRNAME | awk 'NR>1 {print $4}' | uniq -c
done
and this is the output I get when I enter /etc, for instance:
26 root
1 dip
8 root
1 lp
35 root
2 shadow
81 root
1 dip
27 root
2 shadow
42 root
Now obviously root is winning in this case, but I don't want to output only this; I also want to sum the number of files and output only the user with the highest number of files.
Expected output for entering /etc:
root
Is there a simple way to filter the output I get now, so that the user with the highest sum is stored somehow?
ls -l /etc | awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' | sort -g -r | head -n 1 | cut -d' ' -f2
This snippet returns the group with the highest number of files in the /etc directory.
What it does:
ls -l /etc lists all the files in /etc in long form.
awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' sums the number of occurrences of unique words in the 4th column and prints the number followed by the word.
sort -g -r sorts the output descending based on numbers.
head -n 1 takes the first line
cut -d' ' -f2 takes the second column while the delimiter is a white space.
Note: In your question, you are saying that you want the user with the highest number of files, but in your code you are referring to the 4th column which is the group. My code follows your code and groups on the 4th column. If you wish to group by user and not group, change {a[$4]+=1} to {a[$3]+=1}.
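For reference, a sketch of the same pipeline grouped by the user column ($3) instead, with the "total" header line that ls -l prints skipped via NR>1, as your own loop already does:
ls -l /etc | awk 'NR>1 {a[$3]+=1} END {for (i in a) print a[i], i}' | sort -g -r | head -n 1 | cut -d' ' -f2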
Without unreliably parsing the output of ls:
read -r dirname
# List user owner of files in dirname
stat -c '%U' "$dirname"/* |
# Sort the list of users by name
sort |
# Count occurrences of user
uniq -c |
# Sort by higher number of occurrences numerically
# (first column numerically reverse order)
sort -k1nr |
# Get first line only
head -n1 |
# Keep only starting at character 9 to get user name and discard counts
cut -c9-
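If the fixed character offset at the end feels brittle (it depends on how uniq -c pads its counts), the last stage could be awk instead, which splits on whitespace regardless of padding; a sketch of the same idea:
read -r dirname
stat -c '%U' "$dirname"/* | sort | uniq -c | sort -k1nr | awk 'NR==1 {print $2}'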
I have an awk script that reads standard input (or files named on the command line) and sums the counts per unique name.
summer:
awk '
    { sum[ $2 ] += $1 }
    END {
        for ( v in sum ) {
            print v, sum[v]
        }
    }
' "$@"
Let's say we feed it the count/name pairs your loop already produces for /etc:
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer
yields:
dip 2
shadow 4
root 219
lp 1
I like to keep utilities general so I can reuse them for other purposes. Now you can just use sort and head to get the maximum result output by summer:
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer | sort -r -k2,2 -n | head -1 | cut -f1 -d' '
Yields:
root

Finding the Number of strings in a File

I'm trying to write a very small program that will count the number of substrings in a large text file. All it will do is take the first 2000 lines of the text file, find any "TTT" substrings, count them, and set a variable to that total. I'm a bit new to shell, so any help would be greatly appreciated!
#!/bin/bash
$counter=(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
For what it's worth, you might find awk better suited for this task:
awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename
This will split each record in your file by delimiter "ttt". Then it counts the number of fields, subtracts one, and adds that to the total.
A file like:
ttt tttttt something
1 5 ttt
tt
one more ttt record
Would be split (visualizing with pipe delim) like:
| || something
1 5 |
tt
one more | record
Counting the number of fields per record:
4
2
1
2
Subtracting one from that:
3
1
0
1
Which totals to 5, which is how many "ttt" substrings are present.
To incorporate this into your script (and to fix your other issue):
#!/bin/bash
counter=$(awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename)
echo $counter
The change here is that when we set a variable in Bash we don't include the $ sign at the front. Only in referencing the variable do we include the $.
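A tiny illustration of that rule:
counter=5          # assignment: no $ sign on the left-hand side
echo "$counter"    # reference: the $ expands the stored value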
You have some minor syntax errors there, probably you meant this:
counter=$(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
Notice the tiny changes I made there to make it work.
Btw the grep TTT in the middle is redundant, you can simply drop it, that is:
counter=$(head -2000 [file name] | grep -o TTT | wc -l)
grep can also do the counting for you: counter=$(grep -c TTT "$infile") counts matching lines rather than individual occurrences, so it agrees with the grep -o TTT | wc -l approach only when no line contains TTT more than once. You can also stop early with -m NUM, --max-count=NUM, which makes grep stop reading a file after NUM matching lines.
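A quick illustration of the difference between the two counts, using a throwaway sample.txt whose single line contains TTT twice:
printf 'TTTxTTT\n' > sample.txt
grep -c TTT sample.txt            # prints 1: one matching line
grep -o TTT sample.txt | wc -l    # prints 2: two occurrences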

recursively count lines with delimited pattern

Let's say I have a file that contains:
foo/bar
foo/bar/thing0
foo/bar/thing1
...
foo/bar/thing10
foo/bar1/thing0
foo/bar1
foo/bar2/thing2
foo/bar1/thing1/subthing1
foo/bar1/thing1/subthing2
...
I would like to generate a hierarchical view that gives me the counts of the "sublines". That is to say, the output will look like:
foo - 100
foo/bar - 20
foo/bar/thing0 - 8
...
foo/bar1 - 20
foo/bar1/thing0 - 10
foo/bar2 - 60
foo/bar2/thing2 - 1
And allow this to be configurable. For example, I could limit it to only count occurrences down to a certain depth (with path components delimited by '/').
I've done this with a perl script before but I was wondering whether there's a method using the tcsh command line and some standard unix utilities to do it.
For each level of depth you can do it with standard unix tools:
Level 1 (foo):
cut -d/ -f1 < FILE | sort | uniq -c
Level 2 (foo/bar):
cut -d/ -f-2 < FILE | sort | uniq -c
Level 3 (foo/bar/thing1):
cut -d/ -f-3 < FILE | sort | uniq -c
And so on.
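If you want every level in one pass instead of re-running cut per depth, one option (a sketch, assuming '/'-delimited paths as in the question) is to print every prefix of each line and then count:
awk -F/ '{ p = $1; print p; for (i = 2; i <= NF; i++) { p = p "/" $i; print p } }' FILE |
    sort | uniq -c | awk '{ print $2 " - " $1 }'
Each line then counts toward itself and every ancestor prefix, which is what produces the cumulative totals in the desired output.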

How to add number of identical line next to the line itself? [duplicate]

I have a file file.txt which looks like this:
a
b
b
c
c
c
I want to know the command which takes file.txt as input and produces the output:
a 1
b 2
c 3
I think uniq is the command you are looking for. The output of uniq -c is a little different from your format, but this can be fixed easily.
$ uniq -c file.txt
1 a
2 b
3 c
If you want to count the occurrences, you can use uniq with -c.
If the file is not sorted, you have to use sort first:
$ sort file.txt | uniq -c
1 a
2 b
3 c
If you really need the line first followed by the count, swap the columns with awk
$ sort file.txt | uniq -c | awk '{ print $2 " " $1}'
a 1
b 2
c 3
You can use this awk:
awk '!seen[$0]++{ print $0, (++c) }' file
a 1
b 2
c 3
seen is an array indexed by the unique items; an index is incremented to 1 the first time it is populated, so the block runs only for the first occurrence of each line. In the action we print the record and an incrementing counter.
Update: Based on the comment below, if the intent is to get the repeat count in the 2nd column, then use this awk command:
awk 'seen[$0]++{} END{ for (i in seen) print i, seen[i] }' file
a 1
b 2
c 3
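Note that for (i in seen) returns keys in no particular order; if the listing should be sorted, pipe the result through sort, e.g.:
awk '{ seen[$0]++ } END { for (i in seen) print i, seen[i] }' file | sort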
