I am learning bioinformatics.
I want to find the GC content of a FASTA file using a Bash script.
GC content is basically (number of (g + c)) / (number of (a + t + g + c)).
I tried using the wc command, but I was not able to get an answer.
Edit (17 Feb 2023):
After going through documentation and videos, I came up with a solution.
filename="$@" # collecting all the filenames passed as parameters
for f in $filename # looping over files
do
echo " $f is being processed..."
gc=$( grep -v ">" "$f" | grep -io '[gc]' | wc -l) # grep -v skips header lines containing ">". grep -io matches every g or c (case-insensitive) and outputs each match on its own line; wc -l counts those lines, i.e. the number of g and c bases. This is stored in a variable.
total=$( grep -v ">" "$f" | tr -d ' \t\n\r' | wc -c) # spaces, tabs and line endings are removed with tr, then the remaining characters are counted by wc -c
percent=$( echo "scale=2;100*$gc/$total" | bc -l) # bc -l performs the floating-point division; scale=2 sets the number of decimal places
echo " The GC content of $f is: $percent%"
echo
done
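For reference, a hypothetical invocation, assuming the loop above is saved as gc_content.sh (the script and file names are illustrative):
bash gc_content.sh sample1.fasta sample2.fasta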
Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for them, are well tested, widely used, and handle edge cases. For example, use the EMBOSS infoseq utility. EMBOSS can be installed easily, for example with conda.
Example:
Install EMBOSS package (do once):
conda create --name emboss emboss --channel iuc
Activate the conda environment and use EMBOSS infoseq, here to print the sequence name, length and percent GC:
source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate
This prints into STDOUT something like this:
Name Length %GC
seq_foo 119 60.50
seq_bar 104 39.42
seq_baz 191 46.60
...
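If you also want a single overall figure from that table, you can post-process it. Here is a minimal sketch, assuming the three-column output shown above, that computes a length-weighted mean GC with awk:
cat your_sequence_file_name.fasta \
| infoseq -auto -only -name -length -pgc stdin \
| awk 'NR > 1 { len += $2; gc += $2 * $3 } END { if (len) printf "Overall %%GC: %.2f\n", gc / len }'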
This should work:
#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680
# Fail on error
set -o errexit
# Error on undefined variable references
set -o nounset
# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE="file.fasta"
# Number of digits after decimal point
N_DIGITS=3
# ================
# LOGGER
# ================
# Fatal log message
fatal() {
printf '[FATAL] %s\n' "$@" >&2
exit 1
}
# Info log message
info() {
printf '[INFO ] %s\n' "$@"
}
# ================
# MAIN
# ================
{
# Check command 'bc' exist
command -v bc > /dev/null 2>&1 || fatal "Command 'bc' not found"
# Check file exist
[ -f "$FASTA_FILE" ] || fatal "File '$FASTA_FILE' not found"
# Count number of sequences
_n_sequences=$(grep --count '^>' "$FASTA_FILE")
info "Analyzing $_n_sequences sequences"
[ "$_n_sequences" -ne 0 ] || fatal "No sequences found"
# Remove sequence wrapping
_fasta_file_content=$(
sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
| tr --delete "\r\n" \
| sed 's/$/#/' \
| tr "#" "\n" \
| sed '/^$/d'
)
# Vars
_sequence=
_a_count_total=0
_c_count_total=0
_g_count_total=0
_t_count_total=0
# Read line by line
while IFS= read -r _line; do
# Check if header
if printf '%s\n' "$_line" | grep --quiet '^>'; then
# Save sequence and continue
_sequence=${_line#?}
continue
fi
# Count
_a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
_c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
_g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
_t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
# Add current count to total
_a_count_total=$((_a_count_total + _a_count))
_c_count_total=$((_c_count_total + _c_count))
_g_count_total=$((_g_count_total + _g_count))
_t_count_total=$((_t_count_total + _t_count))
# Calculate GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "Sequence '$_sequence' GC content: $_gc"
done << EOF
$_fasta_file_content
EOF
# Total data
info "Adenine total count: $_a_count_total"
info "Cytosine total count: $_c_count_total"
info "Guanine total count: $_g_count_total"
info "Thymine total count: $_t_count_total"
# Calculate total GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "GC content: $_gc"
}
The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680
The script uses only basic commands, except for bc, which does the precision calculation (see bc installation).
You can configure the script by modifying the variables in the CONFIGURATION section.
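For example, if you would rather pass the FASTA file on the command line than edit the script, a one-line tweak in the CONFIGURATION section does it (a sketch; it keeps the old name as the default and stays safe under nounset):
# Fasta file path (first argument, falling back to the previous hard-coded name)
FASTA_FILE="${1:-file.fasta}"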
Since you haven't indicated which one you want, the GC content is calculated both per sequence and overall, so just get rid of whatever you don't need :)
Despite my lack of bioinformatics background, the script successfully parses and analyzes a fasta file.
Related
I have an input file input.txt that contains the following values:
# time(t) Temperature Pressure Velocity(u, v, w)
t T P u v w
0 T0 P0 (u0 v0 w0)
0.0015 T1 P1 (u1 v1 w1)
0.0021 T2 P2 (u2 v2 w2)
0.0028 T3 P3 (u3 v3 w3)
0.0031 T4 P4 (u4 v4 w4)
0.0041 T5 P5 (u5 v5 w5)
... ... ... ... ...
... ... ... ... ...
1.5001 TN PN (uN vN wN)
where Ti, Pi, ui, vi, and wi for i = 0 to N are floating-point numbers.
I have on the other hand, some directories that correspond to the times:
0 # this is a directory
0.0015 # this is a directory also
0.0021 # ...etc.
0.0028
0.0031
...
...
I have a template myTemplate.txt file that looks like the following:
# This is my template file
The time of the simulation is: {%TIME%}
The Temperature is {%T%}
The pressure is {%P%}
The velocity vector is: ({%U%} {%V%} {%W%})
My goal is to create a file output.txt under each time directory using the template file myTemplate.txt and populate the values from the input file input.txt.
I have tried the following:
# assume the name of the directory perfectly matches the time in input file
inputfile="input.txt"
times = $(find . -maxdepth 1 -type d)
for eachTime in $times
do
line=$(sed -n "/^$eachTime/p" $inputfile)
T=$(echo "$line" cut -f2 ) # get temperature
P=$(echo "$line" | cut -f3 ) # get pressure
U=$(echo "$line" | cut -f4 | tr -d '(') # remove '('
V=$(echo "$line" | cut -f5 )
W=$(echo "$line" | cut -f6 | tr -d ')' ) # remove ')'
# I am stuck here, How can I generate a file output.txt from
# the template and save it under the directory.
done
I am stuck in the step where I need to populate the values in the template file and generate a file output.txt under each directory.
Any help on how to achieve that or may by suggesting an efficient way to accomplish this task using linux standard utilities such as sed, awk is very much appreciated.
I have adapted your bash script, which contains multiple typos/errors.
This is not the most efficient way to accomplish this, but I have tested it on your data and it works:
Create a script file generate.sh:
#!/bin/bash
timedir=$(find * -maxdepth 1 -type d) # use * to get rid of ./ at the beginning
templateFile='./myTemplate.txt' # the path to your template file
for eachTime in $timedir
do
# use bash substitution to replace . with \. in times
# in order to avoid unexpected matches
line="$(grep -m 1 -e '^'${eachTime//./\.} input.txt)"
if [ -z "$line" ]
then
echo "***Error***: Data at time: $eachTime were not found!" >&2
exit 1
fi
# replace tabs and spaces with a single space
line=$(echo "$line" | tr -s '[:blank:]' ' ' )
# the Time assignment below is redundant since the time is already known
Time=$(echo "$line" | cut -d' ' -f1 )
Temperature=$(echo "$line" | cut -d' ' -f2 )
Pressure=$(echo "$line" | cut -d' ' -f3 )
U=$(echo "$line" | tr -d '()' | cut -d' ' -f4 )
V=$(echo "$line" | tr -d '()' | cut -d' ' -f5 )
W=$(echo "$line" | tr -d '()' | cut -d' ' -f6 )
# Create a temporary file
buff_file="$(mktemp)"
# Copy the template to that file
cp "$templateFile" "$buff_file"
# Use sed to replace the values
sed -i "s/{%TIME%\}/$eachTime/g" "$buff_file"
sed -i "s/{%T%}/$Temperature/g" "$buff_file"
sed -i "s/{%P%}/$Pressure/g" "$buff_file"
sed -i "s/{%U%}/$U/g" "$buff_file"
sed -i "s/{%V%}/$V/g" "$buff_file"
sed -i "s/{%W%}/$W/g" "$buff_file"
# Copy that temporary file under the time directory
cp "$buff_file" "$eachTime"/output.txt
# delete the temporary file
rm "$buff_file"
done
echo "Done!"
Run the script:
chmod +x generate.sh
./generate.sh
I have checked that a file output.txt is created under each time directory and contains the correct values from input.txt. The script should also raise an error if a time is not found.
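A quick, hypothetical spot-check of the generated files (directory names taken from the question):
for d in 0 0.0015 0.0021; do
    echo "== $d =="
    cat "$d/output.txt"
done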
This is a working prototype; note that there is no error handling for missing directories, wrong input formatting, etc.
$ awk 'NR==FNR {temp=temp sep $0; sep=ORS;next}
FNR==2 {for(i=1;i<=NF;i++) h[$i]=i}
FNR>3 {text=temp;
sub("{%TIME%}", $h["t"] ,text);
# add other sub(..., text) substitutions!
print text > ($1 "/output.txt")}' template.txt input.txt
This only replaces the time, but you can repeat the same pattern for the other variables (see the complete sketch below).
It reads the template file and saves it in the variable temp, then reads the input file and captures the header names into array h for easy reference. For each data line, it does the replacements and saves the result to the corresponding directory (assumed to exist).
This should be trivial to read:
sub("{%TIME%}", $h["t"], text) substitute {%TIME%} with the value of $h["t"] in variable text.
$h["t"] means the value at index h["t"], which we put the index of t in the header line, which is 1. So instead of writing $1 we can write $h["t"] so the variable we're referring to is documented in place.
You refer to the other variables the same way, by the names "T", "P", etc.
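Putting it together, here is a sketch of the same program with all six substitutions spelled out (assumptions: the header names are t, T, P, u, v, w as shown, data rows start right after the header, and the parentheses around the velocity fields are stripped with gsub first; adjust the FNR guard to where your data actually starts):
awk 'NR==FNR {temp=temp sep $0; sep=ORS; next}   # slurp the template
     FNR==2  {for(i=1;i<=NF;i++) h[$i]=i}        # map header names to field numbers
     FNR>2   {gsub(/[()]/, "")                   # strip the velocity parentheses
              text=temp
              sub(/\{%TIME%\}/, $h["t"], text)
              sub(/\{%T%\}/,    $h["T"], text)
              sub(/\{%P%\}/,    $h["P"], text)
              sub(/\{%U%\}/,    $h["u"], text)
              sub(/\{%V%\}/,    $h["v"], text)
              sub(/\{%W%\}/,    $h["w"], text)
              out = $1 "/output.txt"
              print text > out
              close(out)                         # avoid running out of file descriptors
             }' myTemplate.txt input.txt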
How can I get the average CPU temperature from bash on Linux? Preferably in degrees Fahrenheit. The script should be able to handle different numbers of CPUs.
You do it like so:
Installation
sudo apt install lm-sensors
sudo sensors-detect --auto
get_cpu_temp.sh
#!/bin/bash
# 1. get temperature
## a. split response
## Core 0: +143.6°F (high = +186.8°F, crit = +212.0°F)
IFS=')' read -ra core_temp_arr <<< "$(sensors -f | grep '^Core\s[[:digit:]]\+:')" # echo "${core_temp_arr[0]}"
## b. find cpu usage
total_cpu_temp=0
index=0
for i in "${core_temp_arr[#]}"; do :
temp=$(echo $i | sed -n 's/°F.*//; s/.*[+-]//; p; q')
let index++
total_cpu_temp=$(echo "$total_cpu_temp + $temp" | bc)
done
avg_cpu_temp=$(echo "scale=2; $total_cpu_temp / $index" | bc)
## c. build entry
temp_status="CPU: $avg_cpu_temp F"
echo "$temp_status"
exit 0
output
CPU: 135.50 F
You can also read CPU temperatures directly from sysfs (though the path may differ between machines and OSes):
Bash:
temp_file=$(mktemp -t "temp-"$(date +'%Y%m%d#%H:%M:%S')"-XXXXXX")
ls "$temp_file"
while true; do
cat /sys/class/thermal/thermal_zone*/temp | tr '\n' ' ' >> "$temp_file"
printf "\n" >> $temp_file
sleep 2
done
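Note that these sysfs readings are in millidegrees Celsius, so for Fahrenheit (as the question asked) you need to convert. A minimal sketch, assuming the same thermal_zone path exists on your machine:
# average all thermal zones and convert millidegrees Celsius to Fahrenheit
awk '{ sum += $1; n++ } END { if (n) printf "CPU: %.2f F\n", (sum / n) / 1000 * 9 / 5 + 32 }' \
    /sys/class/thermal/thermal_zone*/temp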
If you're a fish user, you may add a function to your config dir, let's say: ~/.config/fish/functions/temp.fish
Fish
function temp
set temp_file (mktemp -t "temp-"(date +'%Y%m%d#%H:%M:%S')"-XXXXXX")
ls $temp_file
while true
cat /sys/class/thermal/thermal_zone*/temp | tr '\n' ' ' >> "$temp_file"
printf "\n" >> $temp_file
sleep 2
end
end
I'm working on a task for uni work where the aim is to count all files and directories within a given directory and then all subdirectories as well. We are forbidden from using find, locate, du or any recursive commands (e.g. ls -R).
To solve this I've tried making my own recursive command and have run into the error above; more specifically, it is line 37: testdir/.hidd1/: syntax error: operand expected (error token is ".hidd1/")
The Hierarchy I'm using
The code for this is as follows:
tgtdir=$1
visfiles=0
hidfiles=0
visdir=0
hiddir=0
function searchDirectory {
curdir=$1
echo "curdir = $curdir"
# Rather than change directory ensure that each recursive call uses the $curdir/NameOfWantedDirectory
noDir=$(ls -l -A $curdir| grep ^d | wc -l) # Work out the number of directories in the current directory
echo "noDir = $noDir"
shopt -s nullglob # Enable nullglob to prevent a null term being added to the array
directories=(*/ .*/) # Store all directories and hidden directories into the array 'directories'
shopt -u nullglob #Turn off nullglob to ensure it doesn't later interfere
echo "${directories[#]}" # Print out the array directories
y=0 # Declares a variable to act as a index value
for i in $( ls -d ${curdir}*/ ${curdir}.*/ ); do # loops through all directories both visible and hidden
if [[ "${i:(-3)}" = "../" ]]; then
echo "Found ./"
continue;
elif [[ "${i:(-2)}" = "./" ]]; then
echo "Found ../"
continue;
else # When position i is ./ or ../ the loop advances otherwise the value is added to directories and y is incremented before the loop advances
echo "Adding $i to directories"
directories[y]="$i"
let "y++"
fi
done # Adds all directories except ./ and ../ to the array directories
echo "${directories[#]}"
if [[ "${noDir}" -gt "0" ]]; then
for i in ${directories[@]}; do
echo "at position i ${directories[$i]}"
searchDirectory ${directories[$i]} #### <--- line 37 - the error line
done # Loops through subdirectories to reach the bottom of the hierarchy using recursion
fi
visfiles=$(ls -l $tgtdir | grep -v ^total | grep -v ^d | wc -l)
# Calls the ls -l command which puts each file on a new line, then removes the line which states the total and any lines starting with a 'd' which would be a directory with grep -v,
#finally counts all lines using wc -l
hiddenfiles=$(expr $(ls -l -a $tgtdir | grep -v ^total | grep -v ^d | wc -l) - $visfiles)
# Finds the total number of files including hidden ones and puts them on a line each (using -l and -a (all)), removes the line stating the total as well as any directories, and then counts them.
#Then stores the number of hidden files by expressing the complete number of files minus the visible files.
visdir=$(ls -l $tgtdir | grep ^d | wc -l)
# Counts visible directories by using ls -l then filtering it with grep to find all lines starting with a d indicating a directory. Then counts the lines with wc -l.
hiddir=$(expr $(ls -l -a $tgtdir | grep ^d | wc -l) - $visdir)
# Finds hidden directories by expressing total number of directories including hidden - total number of visible directories
#At minimum this will be 2 as it includes the directories . and ..
total=$(expr $visfiles + $hiddenfiles + $visdir + $hiddir) # Calculates total number of files and directories including hidden.
}
searchDirectory $tgtdir
echo "Total Files: $visfiles (+$hiddenfiles hidden)"
echo "Directories Found: $visdir (+$hiddir hidden)"
echo "Total files and directories: $total"
exit 0
Thanks for any help you can give
Line 37 is searchDirectory ${directories[$i]}, as I count. Yes?
Replace the for loop with for i in "${directories[@]}"; do - add double quotes. This will keep each element as its own word.
Replace line 37 with searchDirectory "$i". The for loop gives you each element of the array in i, not each index. Therefore, you don't need to go into directories again - i already has the word you need.
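Putting those two fixes together, the loop would look like this sketch:
for i in "${directories[@]}"; do    # quoted: each element stays one word
    echo "at position i: $i"
    searchDirectory "$i"            # i already holds the directory name
done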
Also, I note that the echos on lines 22 and 25 are swapped :) .
I am trying to use the Bash variable $RANDOM to create a random string that consists of 8 characters from a variable that contains integer and alphanumeric digits, e.g., var="abcd1234ABCD".
How can I do that?
Use parameter expansion. ${#chars} is the number of possible characters and % is the modulo operator, so RANDOM % ${#chars} yields a random index between 0 and ${#chars}-1. ${chars:offset:length} selects length character(s) starting at position offset; here we take exactly one character at that random offset.
chars=abcd1234ABCD
for i in {1..8} ; do
echo -n "${chars:RANDOM%${#chars}:1}"
done
echo
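To collect the result in a variable instead of printing the characters one by one, the same idea can be written as:
chars=abcd1234ABCD
str=
for ((i = 0; i < 8; i++)); do
    str+=${chars:RANDOM%${#chars}:1}   # append one random character
done
echo "$str"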
For those looking for a random alpha-numeric string in bash:
LC_ALL=C tr -dc A-Za-z0-9 </dev/urandom | head -c 64
The same as a well-documented function:
function rand-str {
# Return random alpha-numeric string of given LENGTH
#
# Usage: VALUE=$(rand-str $LENGTH)
# or: VALUE=$(rand-str)
local DEFAULT_LENGTH=64
local LENGTH=${1:-$DEFAULT_LENGTH}
LC_ALL=C tr -dc A-Za-z0-9 </dev/urandom | head -c $LENGTH
# LC_ALL=C: required for Mac OS X - https://unix.stackexchange.com/a/363194/403075
# -dc: delete complementary set == delete all except given set
}
Another way to generate a 32 bytes (for example) hexadecimal string:
xxd -l 32 -c 32 -p < /dev/random
add -u if you want uppercase characters instead.
OPTION 1 - No specific length, no openssl needed, only letters and numbers, slower than option 2
sed "s/[^a-zA-Z0-9]//g" <<< $(cat /dev/urandom | tr -dc 'a-zA-Z0-9!##$%*()-+' | fold -w 32 | head -n 1)
DEMO: x=100; while [ $x -gt 0 ]; do sed "s/[^a-zA-Z0-9]//g" <<< $(cat /dev/urandom | tr -dc 'a-zA-Z0-9!##$%*()-+' | fold -w 32 | head -n 1) <<< $(openssl rand -base64 17); x=$(($x-1)); done
Examples:
j0PYAlRI1r8zIoOSyBhh9MTtrhcI6d
nrCaiO35BWWQvHE66PjMLGVJPkZ6GBK
0WUHqiXgxLq0V0mBw2d7uafhZt2s
c1KyNeznHltcRrudYpLtDZIc1
edIUBRfttFHVM6Ru7h73StzDnG
OPTION 2 - No specific length, openssl needed, only letters and numbers, faster than option 1
openssl rand -base64 12 # only prints
rand=$(openssl rand -base64 12) # only saves to var
sed "s/[^a-zA-Z0-9]//g" <<< $(openssl rand -base64 17) # leave only letters and numbers
# The last command can go to a var too.
DEMO: x=100; while [ $x -gt 0 ]; do sed "s/[^a-zA-Z0-9]//g" <<< $(openssl rand -base64 17); x=$(($x-1)); done
Examples:
9FbVwZZRQeZSARCH
9f8869EVaUS2jA7Y
V5TJ541atfSQQwNI
V7tgXaVzmBhciXxS
Others options not necessarily related:
uuidgen or cat /proc/sys/kernel/random/uuid
After generating 1 billion UUIDs every second for the next 100 years,
the probability of creating just one duplicate would be about 50%. The
probability of one duplicate would be about 50% if every person on
earth owns 600 million UUIDs 😇 source
Not using $RANDOM, but worth mentioning.
Using shuf as a source of entropy (a.k.a. randomness), which, in turn, may use /dev/urandom as its entropy source (as in shuf -i1-10 --random-source=/dev/urandom), seems like a solution that uses fewer resources:
$ shuf -er -n8 {A..Z} {a..z} {0..9} | paste -sd ""
tf8ZDZ4U
head -1 <(fold -w 20 <(tr -dc 'a-zA-Z0-9' < /dev/urandom))
This is safe to use in a bash script even if you have the safety options turned on:
set -eou pipefail
It works around bash exit status 141 (SIGPIPE), which the plain pipe version would trigger:
tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 20 | head -1
A little bit obscure but short-to-write solution is
RANDSTR=$(mktemp XXXXX) && rm "$RANDSTR"
expecting that you have write access to the current directory ;-)
mktemp is part of coreutils
UPDATE:
As Bazi pointed out in the comment, mktemp can be used without creating the file ;-) so the command can be even shorter.
RANDSTR=$(mktemp --dry-run XXXXX)
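The short-option spelling of the same thing (-u is --dry-run in coreutils mktemp):
RANDSTR=$(mktemp -u XXXXX)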
Using a sparse array to shuffle characters.
#!/bin/bash
array=()
for i in {a..z} {A..Z} {0..9}; do
array[$RANDOM]=$i
done
printf %s ${array[@]::8} $'\n'
(Or a lot of random strings)
#!/bin/bash
b=()
while ((${#b[@]} <= 32768)); do
a=(); for i in {a..z} {A..Z} {0..9}; do a[$RANDOM]=$i; done; b+=(${a[@]})
done
tr -d ' ' <<< ${b[@]} | fold -w 8 | head -n 4096
An abbreviated safe pipe workaround based on Radu Gabriel's answer and tested with GNU bash version 4.4.20 and set -euxo pipefail:
head -c 20 <(tr -dc '[:alnum:]' < /dev/urandom)
I am trying to search a set of directories and list files where a certain string appears more than X times.
For instance I want to search /home/userX/files (and all subdirectories) and list all files where the string "uploads" occurs more than 10 times.
Ideally an output like this would be awesome:
/home/userX/files/file1:15
/home/userX/files/file2:34
/home/userX/files/file3:67
where the :xx is the string count in that file... but this final count wouldn't be necessary... only a nice-to-have.
I have figured out how to find files containing a certain string, count strings in single files, and list files where a string occurs, but I have not been able to put these all together... and now I am just totally flustered and confused...
Any help is appreciated!
Thank you in advance.
I now have something I am happy with:
grep -Hcri uploads * | awk -F ':' -e '$2>10 {print}'
This still ignores multiple occurrences of 'uploads' per line (see the grep -o variant below), but it should be reasonably fast.
-H means that the filename is preserved,
-c means that it counts the number of matching lines instead of printing them,
-r is for recursive
-i for case-insensitive
It passes the output to awk, which splits each line on the colon (-F ':'), and if the second field is larger than 10, it prints the whole line.
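If you do want every occurrence counted rather than just matching lines, here is a hedged variant using grep -o (one output line per match) plus awk to tally per file:
grep -roi uploads /home/userX/files \
| awk -F: '{ count[$1]++ } END { for (f in count) if (count[f] > 10) print f ":" count[f] }'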
A bash solution involves a small script that lets the user specify the search term, the minimum occurrences per file, and the search path. It then collects a /absolute/path/to/file:matches entry for every file where the search term occurs at least occurs times, saving all matching files in an array for whatever later use you need. For the purposes of this example it simply prints the search criteria and the matching files contained in the array:
#!/bin/bash
[ $# -eq 3 ] || { ## test for sufficient input
printf "error: insufficient input: usage: %s term occurs path\n" "${0//*\//}"
exit 1
}
[ $2 -eq $2 >/dev/null 2>&1 ] || { ## test that 'occur' is an integer value
printf "error: invalid input: occurs '%s' is not an integer value!\n" "$2"
printf "\n usage: %s term occurs path\n\n" "${0//*\//}"
exit 1
}
[ -d "$3" ] || { ## test path is a valid directory
printf "error: invalid input: path '%s' is not a valid directory!\n" "$3"
printf "\n usage: %s term occurs path\n\n" "${0//*\//}"
exit 1
}
srchterm="$1" ## assignment of arguments to variables
occur=$2
srchpath="$3"
declare -a array ## declare array to hold values
## for each file containing $srchterm
while IFS=$'\n' read -r line; do
[ "${line##*:}" -ge $occur ] && ## test it occurs >= occur
array+=( "$(realpath "${line%:*}"):${line##*:}" ) ## if so add it to array
done < <(grep -r -c "$srchterm" "$srchpath"/* ) ## grep -r -c provides file:count lines
## output search information
printf "\nsearch term : %s\noccurrances : %d\nsearch path : %s\n\n" \
"$srchterm" $occur "$srchpath"
printf "number of matching files : %d\n\n" ${#array[#]}
for i in "${array[#]}"; do ## output matching files
printf "%s\n" "$i"
done
exit 0
Use/Output
$ bash srchterminfile.sh char 10 .
search term : char
occurrences : 10
search path : .
number of matching files : 77
/home/david/dev/src-c/tmp/arginfo.c:16
/home/david/dev/src-c/tmp/bin_chs_test.c:19
/home/david/dev/src-c/tmp/binprntst.c:42
/home/david/dev/src-c/tmp/binprnverif.c:12
/home/david/dev/src-c/tmp/bookmgr.c:17
/home/david/dev/src-c/tmp/censorwds.c:11
/home/david/dev/src-c/tmp/ch+13.c:14
/home/david/dev/src-c/tmp/ch13str.c:20
/home/david/dev/src-c/tmp/chkendian.c:16
/home/david/dev/src-c/tmp/concatwords.c:17
<snip>
Note: if you do not need to save the matching files in an array for later use, you can simply remove the array and replace it with a printf or echo statement to simply output the lines. I understood you wanted to combine and save the matching filenames and occurrence data within your script.
Try this (edited based on comments):
find . -type f | xargs grep -o [STRING] | awk -F':' '{print $1}' | uniq -c | awk '$1>=X {print;}'
Replace [STRING] with the string you want to search for and X with the number of times you want it to appear. In your example:
find . -type f | xargs grep -o uploads | awk -F':' '{print $1}' | uniq -c | awk '$1>=10 {print;}'
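If any filenames contain spaces, a safer null-delimited spelling of the same pipeline (-H added to force the filename prefix even when grep receives a single file):
find . -type f -print0 | xargs -0 grep -oH uploads | awk -F':' '{print $1}' | uniq -c | awk '$1>=10 {print}'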