Making a text file grow by replicating its contents on Mac/Linux

I am doing some unit tests and I have a small text file (a few kilobytes). What I would like to do is make a new file out of this where the same text is replicated a user-specified number of times. The reason I want to do this is to ensure that my algorithm can handle large files and that the results are correct (I can extrapolate the correct results from the tests run on the smaller text file).
Is there a utility on the Mac or Linux platform that allows me to do that?

You can use a for loop and concatenate the contents of the file to a temporary file.
COUNT=10 # larger or smaller, depending on how large you want the file
FILENAME=test.txt
# remove the mv command if you do not wish the original file to be overwritten
for i in $(seq 1 "$COUNT") ; do cat "$FILENAME" >> "$FILENAME.tmp" ; done && mv "$FILENAME.tmp" "$FILENAME"
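If you need a really large file quickly, a variant of the same idea (just a sketch, not part of the answer above; big.txt and big.tmp are placeholder names) is to double the file on each pass instead of appending the original once, which grows it exponentially:

# 20 doublings turn a few kilobytes into a few gigabytes
cp test.txt big.txt
for i in $(seq 1 20) ; do cat big.txt big.txt > big.tmp && mv big.tmp big.txt ; done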

Related

Bash script to copy files by type

How do I use the file command for copying files in a directory according to their type? I know I can use file to find the type of a file, but I don't know how to use it in the if condition.
What I want to achieve is this: I need to tidy up my downloads folder. When I run the script, I want the files in that folder to be moved into a dedicated folder according to their type. For example, image files should be moved to a folder named "Images", video files to "Videos", executables to "Programs" and so on.
Something like this?
for filename in ./*; do
    case $(file -b -i "$filename") in
        inode/directory* | inode/symlink*)
            echo "$0: skip $filename" >&2
            continue;;
        application/*) dest=Random;;
        image/*) dest=Images;;
        text/html*) dest=Webpages;;
        text/plain*) dest=Documents;;
        video/*) dest=Videos;;
        *) dest=Unknown;;
    esac
    mkdir -p "$dest"
    mv "$filename" "$dest/"
done
The mapping of MIME types (the -i option) to your hierarchy of directories isn't entirely straightforward. The application MIME type hierarchy in particular covers a vast number of document types (PDF, Excel, etc.), some of which also have designated types, as well as the completely unspecified generic application/octet-stream. Using something other than MIME types is often even more problematic, as the labels that file prints are free-form human-readable text with essentially arbitrary formatting: different versions of the same file format may be detected with differently worded labels, so you might get Evil Empire Insult (tm) format 1997 from one file and Insult 2000 from another with the same extension.
Probably do a test run with file -i ./* first and examine the results you get, then update the cases above so they actually make sense for your specific files.
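A quick way to see which MIME types actually occur (just a sketch; run it from inside the downloads folder) is to summarize the output of file:

# count how many files fall under each reported MIME type
file -b -i ./* | sort | uniq -c | sort -rn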

How can I run two bash scripts simultaneously without repeating the same action?

I'm trying to write a script that automatically runs a data analysis program. The data analysis takes a file, analyzes it, and puts all the outputs into a folder. The program can be run on two terminals simultaneously (each analyzing a different subject file).
I wrote a script that can do all the inputs automatically. However, I can only run one instance of the script at a time: if I run it in two terminals simultaneously, both will analyze the same subject (which is useless).
Currently, my script looks like:
for name in `ls [file_directory]`
do
    [Data analysis commands]
done
If you run this in two terminals, both will start from the top of the directory containing all the data files. This is a problem, so I tried to add checks for duplicates, but they weren't very effective.
I tried a name comparison with the if command. It didn't work: because all the output files except one had unique names, it would compare against the first output folder at the top of the directory and say the name was different, even though an output folder further down had the same name. It looked something like this:
for name in `ls <file_directory>`
do
    for output in `ls <output directory>`
    do
        if [ "$name" == "$output" ]
        then
            echo "This file has already been analyzed."
        else
            <Data analysis commands>
        fi
    done
done
I thought this was the right method, but apparently not: I would need to check all of the names before any decision was made, rather than one by one, which is what this does.
Then I tried moving completed data files with the mv command. That didn't work either, because "name" in the for statement had already stored all the file names, so the loop went down the full list regardless of what was in the folder at that moment. I remember reading something about how shell scripts do not do things in "real time", so it makes sense that this didn't work.
My thought was to look for some sort of modification to that if statement so it does all the name checks before a decision is made (how?).
Also, are there any other commands I might be missing that I could try?
One pattern I often use is the split command.
ls <file_directory> > file_list
split -d -l 10 file_list file_list_part
This will create files like file_list_part00 through file_list_partnn.
You can then feed these file names to your script.
for file_part in file_list_part*
do
    while IFS= read -r file_name
    do
        data_analysis_command "$file_name"
    done < "$file_part"
done
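To actually get two terminals working on different subjects, one option (a sketch; analyse_part.sh and data_analysis_command are placeholder names) is to give each terminal its own part file instead of looping over all of the parts in a single script:

#!/bin/sh
# analyse_part.sh -- process exactly one part list, given as $1
while IFS= read -r file_name
do
    data_analysis_command "$file_name"
done < "$1"

Then run sh analyse_part.sh file_list_part00 in one terminal and sh analyse_part.sh file_list_part01 in the other, and the two never touch the same subject.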
Never use "ls" in a "for" (http://mywiki.wooledge.org/ParsingLs)
I think you should use a fifo (see mkfifo)
As a follow-on from the comments, you can install GNU Parallel with homebrew:
brew install parallel
Then your command becomes:
parallel analyse ::: *.dat
and it will process all your files in parallel using as many CPU cores as you have in your Mac. You can also add in:
parallel --dry-run analyse ::: *.dat
to get it to show you the commands it would run without actually running anything.
You can also add in --eta (Estimated Time of Arrival) for an estimate of when the jobs will be done, and -j 8 if you want to run, say, 8 jobs at a time. Of course, if you specifically want the 2 jobs at a time you asked for, use -j 2.
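For the 2-jobs-at-a-time case with a progress estimate, that would look something like:
parallel --eta -j 2 analyse ::: *.dat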
You can also have GNU Parallel simply distribute jobs and data to any other machines you may have available via ssh access.
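As a rough sketch of the ssh distribution (otherhost is a placeholder; this assumes passwordless ssh, GNU Parallel installed on that machine, and that analyse writes its result to an .out file next to the input, which may not match your setup), --trc transfers each input file, returns the named result and cleans up afterwards, while the : entry keeps the local machine in the pool:
parallel -S otherhost,: --trc {}.out analyse {} ::: *.dat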

Searching an entire drive for plaintext passwords

I have an encrypted database of about 6000 unique passwords, and I want to search about 1TB of data for any instance of these passwords. I am using Cygwin, but I could have the drive available in a real linux environment if I needed to.
I have a file "ClientPasswords.txt" which contains every unique password only once, one password per line. I am trying to compare every file in my T:/ drive to this list.
I am using this command:
grep -nr -F -f ClientPasswords.txt /cygdrive/t 2> SuspectFiles.txt
My goal is to generate a list of all files, "SuspectFiles.txt", that contain any known password in our password database in plaintext so that we can redact sensitive information from the drive.
Currently, it is getting a ton of false positives, including some that don't seem to match anything in the list. I have already eliminated all passwords that are fewer than 6 characters, can be found in the dictionary (or are otherwise known client names), or are just numbers.
I would like to:
Limit it to a select few filetypes (txt, csv, xls, xlsx, doc, docx, etc.)
Eliminate all compressed files (or find a way to search inside them)
Limit snippet output to prevent dumping entire binary files into the output file.
Anyone done something similar, or know of an easier way to search for these improperly documented passwords from a blacklist? I have also played around with the Windows program "Agent Ransack", but it seems much more limited than grep.
Thanks!
The first thing you want to do is make a list of all files on drive T that are of the right type, and write that to a list of target file names. Use a shell script:
for i in txt csv xls xlsx doc docx
do
    find /cygdrive/t -name "*.$i" >> target_file_names.txt
done
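A variant of that step (just a sketch; extend the extension list as needed) that also catches upper-case extensions by using -iname in a single find run:
find /cygdrive/t \( -iname '*.txt' -o -iname '*.csv' -o -iname '*.xls' -o -iname '*.xlsx' -o -iname '*.doc' -o -iname '*.docx' \) > target_file_names.txt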
Now that you have the target file names, you can search for your passwords in each of those files.
while IFS= read -r target_file
do
    while IFS= read -r pwd
    do
        grep -qF -- "$pwd" "$target_file" && echo "$target_file has password $pwd"
    done < ClientPasswords.txt
done < target_file_names.txt
Something like that should work. You might have to tweak it a little.
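If the nested loop turns out to be too slow (it runs grep once per password per file), a sketch of a faster variant is to hand grep the entire password list at once for each file; -F treats the passwords as fixed strings and -l prints only matching file names, which also keeps binary file contents out of the output:
while IFS= read -r target_file
do
    grep -lF -f ClientPasswords.txt -- "$target_file"
done < target_file_names.txt > SuspectFiles.txt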

Download multiple files with different final names

OK, what I need is fairly simple.
I want to download LOTS of different files (from a specific server) via cURL, and I want to save each one of them under a specific new filename on disk.
Is there an existing way (a parameter, or whatever) to achieve that? How would you go about it?
(If there were an option to list all URL-filename pairs in a text file, one per line, and have cURL process it, that would be ideal.)
E.g.
http://www.somedomain.com/some-image-1.png --> new-image-1.png
http://www.somedomain.com/another-image.png --> new-image-2.png
...
OK, I just figured out a simple way to do it myself.
1) Create a text file with pairs of URL (what to download) and filename (what to save it as on disk), separated by a comma (,), one pair per line, and save it as input.txt.
2) Use the following simple BASH script :
while IFS= read -r line; do
    IFS=',' read -ra PART <<< "$line"
    curl "${PART[0]}" -o "${PART[1]}"
done < input.txt
*Haven't thoroughly tested it yet, but I think it should work.
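An alternative sketch that avoids the shell loop entirely (urls.txt is a placeholder name) is to put the pairs in a curl config file and pass it with -K, letting a single curl invocation handle every download:

# urls.txt: one url/output pair per download
url = "http://www.somedomain.com/some-image-1.png"
output = "new-image-1.png"
url = "http://www.somedomain.com/another-image.png"
output = "new-image-2.png"

Then run:
curl -K urls.txt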

Shortening a large CSV on Debian

I have a very large CSV file and I need to write an app that will parse it, but using the >6 GB file to test against is painful. Is there a simple way to extract the first hundred or two hundred lines without having to load the entire file into memory?
The file resides on a Debian server.
Did you try the head command?
head -200 inputfile > outputfile
head -10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv.
"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.
