Trying to remove certain text from .txt files

I have PDF files that have security marks on the top right and left of the page. I convert them to .pdb files to read them on my phone, and the conversion puts the writing in the security marks into the .pdb file, so that every few pages has:
PDF Transform
PDF Transform
Y
Y
Y
er
Y
er
B
2
B
2
B
.0
B
.0
A
A
Click here to buy
Click here to buy
w
w
w
w
w .
w
A B B YY.com
.A B BYY.com
I've tried converting them to several other file types using Calibre, but the text shows up in all of them.
If I convert them to .txt files, can anyone make a batch file that will erase these lines of text across multiple files?

I am not sure which OS you are on, but this will work on *nix or OS X with sed installed; I'm not sure whether you can use sed on Windows:
for filename in *.txt; do sed -e '1,20d' -e '/^PDF Transform/,/^A B B YY\.com/d' "$filename" > "newfiles/$filename"; done
The first -e command removes lines 1-20, if you know the junk is static and always in the same position in the file. The second -e command removes everything between PDF Transform and A B B YY.com, including those lines. You can use one or many -e commands to get what you want. It assumes the newfiles folder already exists. I didn't test this, so the regex might be off.
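You can sanity-check the range deletion on a fabricated sample before running it over real files. Note that the end pattern has to match what actually appears in your .txt output (the junk above shows both "A B B YY.com" and ".A B BYY.com", so check carefully):

```shell
# Fabricate a small sample file: real text surrounding a junk block.
printf 'keep me\nPDF Transform\nY\ner\nA B B YY.com\nkeep me too\n' > sample.txt

# Delete everything from a line beginning "PDF Transform" through a
# line beginning "A B B YY.com", inclusive (same range as above).
sed -e '/^PDF Transform/,/^A B B YY\.com/d' sample.txt
```

This prints only the two "keep me" lines; everything inside the range, including both boundary lines, is deleted.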


Data hidden in jpg

I am currently looking for hidden data in a jpg file, but I have no clue how to proceed.
There is a jpg file containing text in a format I have never seen before:
-ne \xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01\x01\x01\x00\x60\x00\x60\x00\x00\xff\xdb\x00\x43\x00\x06\x04\x04\x05\x04\x04\x06\x05\x05\x05\x06\x06\x06\x07\x09\x0e\x09\x09\x08\x08\x09\x12\x0d\x0d\x0a\x0e\x15\x12\x16\x16\x15\x12\x14\x14\x17\x1a\x21\x1c\x17\x18\x1f\x19\x14\x14\x1d\x27\x1d\x1f\x22\x23\x25\x25\x25\x16\x1c\x29\x2c\x28\x24\x2b\x21\x24\x25\x24\xff\xdb\x00\x43\x01\x06\x06\x06\x09\x08\x09\x11\x09\x09\x11\x24\x18\x14\x18\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\xff\xc0\x00\x11\x08\x01\x8e\x03\x4e\x03\x01\x22\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1f\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01\x7d\x01\x02\x03\x00\x04\x11\x05\x12\x21\x31\x41\x06\x13\x51\x61\x07\x22\x71\x14\x32\x81\x91\xa1\x08\x23
-ne \x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82\x09\x0a\x16\x17\x18\x19\x1a\x25\x26\x27\x28\x29\x2a\x34\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49\x4a\x53\x54\x55\x56\x57\x58\x59\x5a\x63\x64\x65\x66\x67\x68\x69\x6a\x73\x74\x75\x76\x77\x78\x79\x7a\x83\x84\x85\x86\x87\x88\x89\x8a\x92\x93\x94\x95\x96\x97\x98\x99\x9a\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xff\xc4\x00\x1f\x01\x00\x03\x01\x01\x01\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x11\x00\x02\x01\x02\x04\x04\x03\x04\x07\x05\x04\x04\x00\x01\x02\x77\x00\x01\x02\x03\x11\x04\x05\x21\x31\x06\x12\x41\x51\x07\x61\x71\x13\x22\x32\x81\x08\x14\x42\x91\xa1\xb1\xc1\x09\x23\x33\x52\xf0\x15\x62\x72\xd1\x0a\x16\x24\x34\xe1\x25\xf1\x17\x18\x19\x1a\x26\x27\x28\x29\x2a\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49
This is just the beginning of the file, as there are at least a hundred lines.
The file type given by the file command: file.jpg: ASCII text, with very long lines
I tried some of the common tools to identify patterns or hidden data (exiftool, strings, xxd) but found nothing.
If you have any idea what to do, it would be very much appreciated.
If it's a CTF challenge, there are some common ways to find the flag.
First, try to find the flag in the file metadata, such as a description field.
You can also try the tool stegsolve.jar.
In more advanced cases, the stego info is hidden with some mathematical calculation; give this tool a try: zsteg
Perhaps I'm misunderstanding the problem here, but if your file actually starts with a backslash character followed by the characters x, f, f, \, x, d, 8 and so on, then what you're looking at is the binary content of a JPG file that has been converted into ASCII text.
If so, you need to convert this back into binary data. For example, in Linux or MacOS, you could do this by entering the following on the command line:
echo -ne '\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...' > img.jpg
echo -ne '\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...' >> img.jpg
(Note: > sends the results to a new file, and >> appends to the end of the file)
Or alternatively in Python:
with open("img.jpg", "wb") as f:
    f.write(b'\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...')
    f.write(b'\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...')
    # and so on for all the other lines
Either way, you should end up with a file called img.jpg containing the image you're after.
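If the file really does run to a hundred lines or more, pasting them into echo by hand is impractical, so it's worth scripting the conversion. This is a sketch with hypothetical file names (dump.txt for the ASCII dump, img.jpg for the output), assuming every line is a run of \xNN escapes, possibly prefixed with a stray "-ne":

```python
def unescape(line: str) -> bytes:
    """Turn a line of literal \\xNN escape text into the raw bytes it describes."""
    line = line.strip()
    if line.startswith("-ne"):  # drop a stray echo flag, if present
        line = line[3:].lstrip()
    # \xNN escapes -> codepoints -> raw bytes
    return line.encode("ascii").decode("unicode_escape").encode("latin-1")

print(unescape(r"-ne \xff\xd8\xff\xe0"))  # the JPEG SOI/APP0 markers

# Hypothetical usage on the real files:
# with open("dump.txt") as src, open("img.jpg", "wb") as out:
#     for line in src:
#         out.write(unescape(line))
```

The unicode_escape codec turns each \xNN sequence into the codepoint it names, and the latin-1 round-trip maps those codepoints one-to-one back to raw bytes.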

How to merge columns with colon sign

I have a tab separated file that looks like this. I need to prepare it so I can import it in R:
1 344544 rs30540
2 284783 rs34560
14 384643 rs30567
19 584643 rs31110
Genome_phase,common=1,19,genomes=hg19
11 222643 rs30543
44 544643 rs32345
Genome_phase,common=1,23,genomes=hg19
I want to keep only the rows that start with numbers and drop all the others that begin with letters. It is a huge file, a few GB in size. Is there any way to do that in Linux?
You can use awk or gawk (depending which you have installed on your Linux version).
gawk '/^[0-9]/' file > newfile
Since your aim is to read this into R, I suggest you use fread from library(data.table), as it is fast for large files. In that case you can use fread's ability to accept, as its input, a shell command that preprocesses the file.
cmd = paste("gawk '/^[0-9]/'", filename)
x = fread(cmd = cmd)
Based on the title of the question, I guess you also want to replace \t with :. Then try the following:
sed -n '/^[0-9]\+\t/s/\t/:/gp' inputfile.txt
The /^[0-9]\+\t/ part finds lines that start with digits followed by a \t.
s/\t/:/ substitutes a \t with :.
g means global (i.e., substitute every \t), and p means 'print the line'. (We need to force the print explicitly, since the -n option makes sed silent by default.)
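Since the file is a few GB, it may also be worth combining the digit filter and the tab-to-colon substitution into a single awk pass (one read, no intermediate file). A minimal sketch on two sample lines from the question:

```shell
# Keep only lines starting with a digit; replace every tab with a colon.
printf '1\t344544\trs30540\nGenome_phase,common=1,19,genomes=hg19\n' |
awk '/^[0-9]/ { gsub(/\t/, ":"); print }'
```

This prints 1:344544:rs30540 and drops the Genome_phase line; the same awk command can be used as the cmd argument to fread.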

Sed: removing the last 4 characters of text file names inside a folder

This is a Linux problem.
I need to make a folder called "Notes", and inside that folder I need to make 6 files. The files are named with ID numbers, as follows:
a00144998.txt
a00154667.txt
a00130933.txt
a00143561.txt
a00157888.txt
The first 3 characters are always "a00", and the 4th and 5th characters are the year. I now need to copy the ID number (the file name without the '.txt' extension) into a new file outside the "Notes" folder.
For example, file a00144998 shows year 14 (from the 4th and 5th characters), so I will copy a00144998 into a new file named "year14.txt" and sort it. The same goes for a00154667: it shows year 15, so I will copy a00154667 into a new file named "year15.txt". So at the end, the file "year14.txt" will contain:
a00143561
a00144998
I have found some code, and it works if the files are not in the folder. But once I create the files inside the folder, the code no longer works: it keeps copying the .txt extension. Any idea? Thanks!
ls ~/Notes/???14????.txt | sed 's/.\{4\}$//' | sort > year14.txt
No need for sed, you can do it all in a bash oneliner:
for f in a0014*.txt; do echo "${f%.txt}"; done | sort > year14.txt
This will loop over every file matching the glob a0014*.txt, put each name in the f variable, and echo it with the .txt extension stripped off (${f%.txt} removes the shortest matching suffix, so no stray extension survives).
TLDP has a great guide on string manipulation in bash.
You should also avoid parsing ls. Its output is meant to be human-readable, not machine-readable.
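Putting it together for the layout in the question (files inside a Notes folder, year files outside it), a sketch that strips both the directory and the extension, using a demo Notes directory in place of the real ~/Notes:

```shell
# Demo stand-in for the real ~/Notes folder from the question.
mkdir -p Notes
touch Notes/a00144998.txt Notes/a00143561.txt Notes/a00154667.txt

# Strip the directory and the .txt extension, pick the year from
# characters 4-5 of the ID, and append each ID to its year file.
for f in Notes/a00*.txt; do
    id=$(basename "$f" .txt)        # e.g. a00144998
    year=$(echo "$id" | cut -c4-5)  # e.g. 14
    echo "$id" >> "year$year.txt"
done
sort -o year14.txt year14.txt       # sort each year file afterwards
sort -o year15.txt year15.txt
cat year14.txt
```

basename with a second argument removes the named suffix as well as the directory part, which is exactly the piece the ls-based attempt was missing.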

How to ask Cygwin to repeat a command with different parameters, drawn from a list, and then return the mean?

In phonological corpus tools, the documentation says
The procedure for running command-line analysis scripts is essentially
the same for any analysis. First, open a Terminal window (on Mac OS X
or Linux) or a CygWin window (on Windows, can be downloaded at
https://www.cygwin.com/). Using the “cd” command, navigate to the
directory containing your corpus file. If the analysis you want to
perform requires any additional input files, then they must also be in
this directory. (Instead of running the script from the relevant file
directory, you may also run scripts from any working directory as long
as you specify the full path to any files.) You then type the analysis
command into the Terminal and press enter/return to run the analysis.
The first (positional) argument after the name of the analysis script
is always the name of the corpus file.
The command I need is
corpustools.symbolsim.edit_distance.edit_distance(word1, word2, sequence_type, max_distance=None)
The file will be as required:
The file should be set up in columns (e.g., imported from a
spreadsheet) and be delimited with some uniform character (tab, comma,
backslash, etc.). The names of most columns of information can be
anything you like, but the column representing common spelling of the
word should be called “spelling”; that with transcription should be
called “transcription”; and that with token frequency should be called
“frequency.”
And I want to return values for a "transcription" sequence_type. So, from the above, I think that means my .txt will be as follows:
spelling . transcription . frequency
PLEASE . P L IY Z . 1
HELP . HH EH L P . 1
ME . M IY . 1
Though getting all the data into .txt files of this sort will take a while.
Question
Is it possible, in Cygwin, to run the above command, but tell it to work out the value for every possible word pair in the list as it appears in the .txt file, and then return the mean of these values? (I think that means I'll have to use the transcription values as the "words", so e.g. P L IY Z with HH EH L P, P L IY Z with M IY, and HH EH L P with M IY.)
I have no experience with Cygwin, or with coding.
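For what it's worth, rather than re-running the command in Cygwin once per pair, the looping and averaging is easier to do in a few lines of Python. This is a sketch, not tested against Phonological CorpusTools: it uses a plain Levenshtein function as a stand-in for corpustools.symbolsim.edit_distance, and hard-codes the three example transcriptions; a real version would read them from the "transcription" column of the corpus file.

```python
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance over symbol sequences -- a stand-in
    for corpustools.symbolsim.edit_distance (untested assumption)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# The example transcriptions, split into their symbols.
words = [w.split() for w in ["P L IY Z", "HH EH L P", "M IY"]]
pairs = list(combinations(words, 2))
mean = sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
print(mean)  # average distance over the three unordered pairs
```

combinations generates each unordered pair exactly once, which matches "every possible word pair in the list"; whether PCT normalizes its distances is something to check against its documentation.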

Paste header line in multiple tsv (tab separated) files

I have multiple .tsv files named choochoo1.tsv, choochoo2.tsv, ... choochoo(nth).tsv, and a main.tsv file. I want to extract the header line from main.tsv and paste it over all the choochoo(nth).tsv files. Please note that there are other .tsv files in the directory whose headers I don't want to change, so I can't just select everything with *.tsv (I need to match on the choochoo string for the wanted files). This is what I have tried in a bash script, but I could not make it work. Please suggest the right way to do it.
for x in *choochoo; do
head -n1 main.tsv > $x
done
You have a problem with the file glob, as well as the redirect:
the file glob will catch things like AAchoochoo but not choochoo1.tsv and not even AAchoochoo.tsv
the redirect will overwrite the existing files instead of adding to them. The redirect command for adding to a file is >>, but that will append text to the end and you want to prepend text in the beginning.
The problem with prepending text to an existing file is that you have to open the file for both reading and writing, and then stream both the prepended text and the original text, in order. That is usually where people fail, because the shell can't open files like that with a simple redirect. (There is a slightly more complex way of doing this directly, by opening the file for both reading and writing, but I'm not going to address that further.)
You might want to use a temporary file, something like this:
for x in choochoo[0-9]*.tsv; do
    mv "$x"{,.orig}
    (head -n1 main.tsv; cat "$x.orig") > "$x"
    rm "$x.orig"
done
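If you have GNU sed, an alternative is to insert the header in place with the 1i command, which avoids the temp-file dance. This assumes the header contains no backslashes (sed would interpret them) and that -i works the GNU way; the files below are demo stand-ins for the real ones:

```shell
# Demo files standing in for the real main.tsv and choochoo files.
printf 'id\tname\n' > main.tsv
printf '1\tfoo\n' > choochoo1.tsv
printf '2\tbar\n' > choochoo2.tsv

# Prepend main.tsv's header line to each choochoo file, in place.
header=$(head -n1 main.tsv)
for x in choochoo[0-9]*.tsv; do
    sed -i "1i $header" "$x"
done
cat choochoo1.tsv
```

After the loop, choochoo1.tsv starts with the id/name header line followed by its original row; literal tabs inside the header pass through the 1i text unchanged.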
