Unicode manipulation and garbage '[]' characters - python-3.x

I have a 4 GB text file that is too large to open in a viewer, so I'm trying to split it up, but I need to manipulate the data a bit at a time.
The problem is that I'm getting garbage characters that look like white vertical rectangles. I can't look them up in a search engine because they won't paste, and I can't get rid of them either.
They look like square brackets '[]' but without the small gap in the middle.
Their Unicode values differ, so I can't just pick one value and remove it.
I want to get rid of all of these rectangles.
Two more questions.
1) Why are there any Unicode characters here (in the img below) at all? I decoded them. What am I missing? Note: Later on I get string output that looks like a normal string such as 'code1234' etc but there are those Unicode exceptions there as well.
2) Can you see why larger end values would get this exception list index out of range? This only happens towards the end of the range and it isn't constant i.e. if end is 100 then maybe the last 5 will throw that exception but if end is 1000 then ONLY the LAST let's say 10 throw that exception.
Some code:
from itertools import islice

def read_from_file(file, start, end):
    with open(file,'rb') as f:
        for line in islice(f, start, end):
            data.append(line.strip().decode("utf-8"))
    for i in range(len(data)-1):
        try:
            if '#' in data[i]:
                a = data.pop(i)
                mail.append(a)
            else:
                print(data[i], data[i].encode())
        except Exception as e:
            print(str(e))

data = []
mail = []
read_from_file('breachcompilationuniq.txt', 0, 10)
Some Output:
Image link here as it won't let me format after pasting.
There's also this stuff later on, I don't know what these are either.

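It appears that you have a text file which is not in the default encoding assumed by Python (UTF-8), but nevertheless uses byte values in the range 128-255. Latin-1 maps every byte value to a character, so decoding with it never fails, although the resulting characters may not be the ones originally intended. Try: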
f = open(file, encoding='latin_1')
content = f.read()
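On the second question: the "list index out of range" comes from calling data.pop(i) inside a loop over range(len(data)-1). The range is computed once, but every pop shortens the list, so the last few indexes no longer exist, and the more lines contain '#', the more indexes fail at the end. A minimal sketch that avoids both problems, assuming Latin-1 is an acceptable decoding and that a 4 GB file should be read lazily rather than with f.read(); the helper names read_lines and split_on_marker and the sample bytes are illustrative, not from the original code:

```python
import os
import tempfile
from itertools import islice


def read_lines(path, start, end, encoding="latin_1"):
    """Return lines start..end-1 of a text file without loading the whole file."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in islice(f, start, end)]


def split_on_marker(lines, marker="#"):
    """Partition lines into (with marker, without marker) without mutating the input."""
    with_marker = [line for line in lines if marker in line]
    without_marker = [line for line in lines if marker not in line]
    return with_marker, without_marker


# Demo on a small throwaway file that contains a non-UTF-8 byte (0xE9).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"user#example\ncaf\xe9\nplain\nunread\n")
lines = read_lines(tmp.name, 0, 3)
os.unlink(tmp.name)

mail, data = split_on_marker(lines)
print(mail, data)
```

Building two new lists instead of popping while iterating also avoids skipping the element that slides into position i after each pop.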


pandas.read_clipboard only reads whole lines not columns

I transferred all my Python 3 code from macOS to Ubuntu 18.04, and in one program I need to use pandas.read_clipboard(). At this point the clipboard holds a list with multiple lines and columns, separated by tabs, with each element in quotation marks.
After just trying
import pandas as pd
df = pd.read_clipboard()
I'm getting this error: pandas.errors.ParserError: Expected 8 fields in line 3, saw 11. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. Line 3 looks like "word1" "word2 and another" "word3" .... If you ignore the quotation marks and count the words you get 11 elements; if you count the quoted fields you get 8.
In the next step I tried
import pandas as pd
df = pd.read_clipboard(sep='\t')
and I'm getting no errors but it results only in a Series with each line of the clipboard source in one element.
Yes, maybe it's a solution to write a code for separating each element of a line after this step but because it's working very well under macOS (with just pd.read_clipboard()) I hope that there's a better solution.
Thank you for helping.
I wrote a workaround for my question. It's not an exact solution, but because I only need the elements of one column in an array, I solved it like this:
import pyperclip
# read clipboard
cb = pyperclip.paste()
# lines in array
cb_arr = cb.splitlines()
column = []
for cb_line in cb_arr:
    # words in array
    cb_words = cb_line.split("\"")
    # pick element of column 1
    word = cb_words[1]
    column.append(word)
# delete column name
column.pop(0)
print(column)
Maybe it helps someone else, too.
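The same idea can be extended to recover every column, not just the first. This is a rough sketch that takes only the text between double quotes on each line, so it works whether the fields are separated by tabs or spaces. It assumes fields never contain an embedded double quote. The sample string stands in for the text returned by pyperclip.paste(), and the name parse_quoted_rows is made up for illustration:

```python
import re

# Matches one double-quoted field and captures the text inside the quotes.
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def parse_quoted_rows(text):
    """Return one list of fields per non-empty line, using only quoted text."""
    return [QUOTED_FIELD.findall(line) for line in text.splitlines() if line.strip()]


# Hypothetical clipboard content: a header line followed by one record.
sample = '"name" "description" "tag"\n"word1" "word2 and another" "word3"\n'

rows = parse_quoted_rows(sample)
header, records = rows[0], rows[1:]
first_column = [record[0] for record in records]
print(header)
print(first_column)
```

If a DataFrame is needed, the records and header produced this way could be passed to pd.DataFrame(records, columns=header).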

Finding specific object after another one

I am creating a program that extracts the relevant information from a textfile with 500k lines.
What I've managed so far is to take the info from the textfile and make it into a list which each element being a line.
The relevant text is formatted like this:
*A title that informs that the following section will have the data I'm trying to extract *
*Valuable info in random amount of lines*
*-------------------*
and in between each relevant section of information, formatted in the same way but starting with another title i.e:
*A title that shows that this is data I don't want *
*Non-valuable info in random amount of lines *
*------------------- *
I've managed to list the indexes of the starting point with the follow code:
start = [i for i, x in enumerate(lines) if x[0:4] == searchObject1 and x[5:8] == searchObject2]
But I'm struggling to find the stopping points. I can't use the same method used when finding the starting points because the stopping line appears also after non-important info.
I'm quite the newbie to both Python and programming so the solution might be obvious.
A simple solution is to loop over the input file line by line, and keep only valuable lines. To know whether a line is valuable, we use a boolean variable that is:
set to true ("keep the lines") whenever we encounter a title marking the beginning of a section of interesting data,
set to false ("discard the lines") whenever we encounter an end of section mark. The variable is set to discard even when we encounter the end of a useless section, which doesn't change its state.
Here is the code (lines is the list of strings containing the data to parse):
keep = False
data = []
for line in lines:
    if line == useful_title:   # Adapt: placeholder for the title of a useful section
        keep = True
    elif line == section_end:  # Adapt: placeholder for the end-of-section line
        keep = False
    else:
        if keep:
            data.append(line)
If none of the cases matched, the line was one of two things:
a line of data in a useless section
the title of a useless section
So it can be discarded.
Note that the titles and end of section lines are not saved.
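To make the approach concrete, here is a small self-contained sketch. The "WANT"/"SKIP" titles, the sample lines, and the function name extract_sections are invented for illustration; in the real file the start test would be whatever identifies a useful title (for instance the slice comparison from the question).

```python
def extract_sections(lines, is_start, is_end):
    """Collect lines that lie between a useful title and the next end marker."""
    keep = False
    data = []
    for line in lines:
        if is_start(line):
            keep = True          # entering a useful section
        elif is_end(line):
            keep = False         # leaving any section, useful or not
        elif keep:
            data.append(line)    # payload line inside a useful section
    return data


# Hypothetical input with two useful sections and one useless section.
lines = [
    "WANT results",
    "value 1",
    "value 2",
    "-------------------",
    "SKIP log",
    "noise",
    "-------------------",
    "WANT more results",
    "value 3",
    "-------------------",
]

wanted = extract_sections(
    lines,
    is_start=lambda s: s.startswith("WANT"),
    is_end=lambda s: s.startswith("---"),
)
print(wanted)  # ['value 1', 'value 2', 'value 3']
```

Passing the two tests in as functions keeps the loop unchanged when the title format changes.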

Iterate over images with pattern

I have thousands of images which are labeled IMG_####_0 where the first image is IMG_0001_0.png the 22nd is IMG_0022_0.png, the 100th is IMG_0100_0.png etc. I want to perform some tasks by iterating over them.
I used this fnames = ['IMG_{}_0.png'.format(i) for i in range(150)] to iterate over the first 150 images but I get this error FileNotFoundError: [Errno 2] No such file or directory: '/Users/me/images/IMG_0_0.png' which suggests that it is not the correct way to do it. Any ideas about how to capture this pattern while being able to iterate over the specified number of images i.e in my case from IMG_0001_0.png to IMG_0150_0.png
fnames = ['IMG_{0:04d}_0.png'.format(i) for i in range(1,151)]
print(fnames)
for fn in fnames:
    try:
        # PNG files are binary, so open them in "rb" mode
        with open(fn, "rb") as reader:
            # do smth here
            pass
    except (FileNotFoundError, OSError) as err:
        print(err)
Output:
['IMG_0001_0.png', 'IMG_0002_0.png', ..., 'IMG_0149_0.png', 'IMG_0150_0.png']
Documentation: str.format() and the format specification mini-language.
'{:04d}' # format the given parameter with 0 filled to 4 digits as decimal integer
The other way to do it would be to create a normal string and fill it with 0:
print(str(22).zfill(10))
Output:
0000000022
But for your case, format language makes more sense.
You need to use a format pattern to get the format you're looking for. You don't just want the integer converted to a string, you specifically want it to always be a string with four digits, using leading 0's to fill in any empty space. The best way to do this is:
'IMG_{:04d}_0.png'.format(i)
instead of your current format string. The result looks like this:
In [2]: 'IMG_{:04d}_0.png'.format(3)
Out[2]: 'IMG_0003_0.png'
Generating a list of possible names and then checking whether each one exists is a slow and clumsy way to iterate over files.
Take a look at https://docs.python.org/3/library/glob.html
So something like:
from glob import iglob
filenames = iglob("/path/to/folder/IMG_*_0.png")
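A slightly fuller sketch of the glob approach follows. glob does not guarantee any particular order, so the results are sorted; because the numbers are zero-padded, alphabetical order matches numeric order. The temporary folder and the file names created in it are only there to make the example runnable, and matching_images is an illustrative name:

```python
import tempfile
from glob import iglob
from pathlib import Path


def matching_images(folder):
    """Return paths named IMG_####_0.png inside folder, sorted by name."""
    pattern = str(Path(folder) / "IMG_[0-9][0-9][0-9][0-9]_0.png")
    return sorted(iglob(pattern))


# Demo in a throwaway folder with three images and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for i in (2, 1, 150):
        (Path(folder) / "IMG_{:04d}_0.png".format(i)).touch()
    (Path(folder) / "notes.txt").touch()
    names = [Path(p).name for p in matching_images(folder)]

print(names)  # ['IMG_0001_0.png', 'IMG_0002_0.png', 'IMG_0150_0.png']
```

To process only the first 150 images, the sorted list can be sliced, for example matching_images(folder)[:150].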

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to create a code that would sort through a long message filled with special characters (ie. [,{,%,$,*) with only a few alphabet characters throughout the entire thing to make a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints the code.txt out but into short lists, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re
with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
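Here is the same idea as a small runnable sketch on an inline string, since the pastebin content is not reproduced here. The sample text and the name extract_letters are made up. Note that the sort() call in the question's code would scramble the order of the lines, so it is left out:

```python
import re


def extract_letters(text):
    """Keep only ASCII letters, in their original order."""
    return "".join(re.findall("[a-zA-Z]+", text))


# Hypothetical noisy input standing in for the contents of code.txt.
sample = "[{%h$*e}]#l&l(o"
message = extract_letters(sample)
print(message)  # hello
```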

Skipping over array elements of certain types

I have a csv file that gets read into my code where arrays are generated out of each row of the file. I want to ignore all the array elements with letters in them and only worry about changing the elements containing numbers into floats. How can I change code like this:
myValues = []
data = open(text_file,"r")
for line in data.readlines()[1:]:
    myValues.append([float(f) for f in line.strip('\n').strip('\r').split(',')])
so that the last line knows to only try converting numbers into floats, and to skip the letters entirely?
Put another way, given this list,
list = ['2','z','y','3','4']
what command should be given so the code knows not to try converting letters into floats?
You could use try/except:
myVal = []
for i in list:
    try:
        myVal.append(float(i))
    except ValueError:  # not a number, so skip it
        pass
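Applied to the row-by-row loop from the question, this could look like the sketch below. The inline rows stand in for the lines of the csv file, and to_floats is an illustrative helper name. One caveat: float() also accepts strings such as "nan" and "inf", so those would be kept as numbers.

```python
def to_floats(fields):
    """Convert the fields that parse as numbers and silently skip the rest."""
    values = []
    for field in fields:
        try:
            values.append(float(field))
        except ValueError:
            continue  # field contains letters or is empty
    return values


# Hypothetical file content: a header row followed by two data rows.
rows = ["id,label,x,y", "2,z,y,3,4", "1.5,abc,7"]

# Skip the header, as the question's code does with readlines()[1:].
my_values = [to_floats(line.strip().split(",")) for line in rows[1:]]
print(my_values)  # [[2.0, 3.0, 4.0], [1.5, 7.0]]
```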
