Reading a file that has several lines in PySpark - apache-spark

I have a file that has several lines like this:
(lp0
I200
aV<!DOCTYPE HTML
When I read this file in python, the file is read as it is, like this:
(lp0
I200
aV<!DOCTYPE HTML
But when I read it in PySpark, I got the following value:
(lp0\nI200\naV<!DOCTYPE HTML
How can I get the PySpark read to return the original value?
I read the file like this:
rdd = sc.wholeTextFiles("file:///home/hadoopuser/gc/data_from_gc/part-04068",use_unicode=False)
Thanks in advance.

Your system is probably reading the file correctly in both cases, and in both cases the content almost certainly contains the '\n' (newline) characters, even if you don't see them.
For example, in Python, if you use the print() function, text containing newline characters is displayed with actual line breaks: you don't see the '\n' characters themselves, only the text split across lines, as shown above.
In some tools, and PySpark may be one of them (I can't see how you display the value), if you show a result by evaluating an expression at a Python prompt on the command line instead of printing it, you get the string representation (repr) of the value, which shows the newline characters as \n.
NOTE: If you post the relevant snippets of code, we can try to see where things have gone awry and provide better solutions.
For example:
In [4]: h = 'hello\nworld!'
In [5]: h # Here we are simply evaluating the Python Statement
Out[5]: 'hello\nworld!'
In [6]: print(h) # Here we are printing the content of h
hello
world!
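The same distinction can be sketched for the value from the question. sc.wholeTextFiles yields (path, content) pairs, with each file's content as one string. The record below is a simulated stand-in with a hypothetical path, not real Spark output, so this is only an illustration of how such a string could be turned back into lines:

```python
# Simulated record shaped like what wholeTextFiles yields: (path, whole file content).
# The path is hypothetical; the content is the sample from the question.
record = ("file:///tmp/example-part", "(lp0\nI200\naV<!DOCTYPE HTML")

path, content = record

# repr() shows the newline as the two characters backslash + n.
shown_by_repr = repr(content)

# splitlines() recovers the individual lines without the newline characters.
lines = content.splitlines()

print(shown_by_repr)
for line in lines:
    print(line)
```

If one record per line is what you want, sc.textFile may be a better fit than sc.wholeTextFiles, since it reads a file line by line.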

Related

Is there an end= equivalent for inputs?

So as I'm sure you know, the print() function has a keyword argument called end.
#as an example
print('BOB', end='-')
#would output
BOB-
So is there something like this for inputs? For example, if I wanted to have an input that would look something like this:
Input here
►
-------------------------------------------------------
And have the input at the ► and be inside the dashes using something like
x = input('Input here\n►', end='-------')
Would there be some equivalent?
EDIT:
Just to be clear, everything will be printed at the same time. The input would just be on the line marked with the ►, and the ---- would be printed below it, but at the SAME time. This means that the input would be "enclosed" by the ---.
Also, there has been a comment about curses - can you please clarify on this?
Not exactly what you want, but if the --- (or ___) can also be on the same line, you could use an input prompt with \r:
input("__________\r> ")
This means: print 10 _, then go back \r to the beginning of the line, print > overwriting the first two _, then capture the input, overwriting more _. This shows the input prompt > ________. After typing some chars: > test____. Captured input: 'test'
For more complex input forms, you should consider using curses.
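A minimal sketch of the \r idea above, wrapped in a helper so the width and marker can be changed. The function name and its parameters are hypothetical, and how a terminal renders the prompt may vary:

```python
def underscore_prompt(width=10, marker="> "):
    """Build a prompt that draws `width` underscores, then returns to column 0."""
    # "\r" moves the cursor back to the start of the line without starting a new
    # one, so the marker and whatever the user types overwrite the underscores.
    return "_" * width + "\r" + marker


prompt = underscore_prompt()

# Interactive usage (commented out so the sketch runs without a terminal):
# answer = input(prompt)
```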
When using basic console IO, once a line has been ended with a newline, it's gone and can't be edited. You can't move the cursor up to print anything above that last line; you can only add a new line below.
That means that without using a specialized "console graphics" library like curses (as tobias_k suggests), you pretty much can't do what you're asking. You can mess around a little with the contents of the last line (overwriting text you've already written there), but you can't write to any line other than the last one.
To understand why console IO works this way, you should know that very early computers didn't have screens. Instead, their console output was printed directly on paper, one line at a time. While some line printers could print several characters on the same spot (to get effects like strikethrough or underline), you couldn't unprint anything once it was on the paper. Furthermore, the paper feed only worked in one direction. Once you had sent a newline character that told the printer to advance the paper, you couldn't go back to an old line again.
I believe this would solve your problem:
print(f">>> {input()} ------")
OR
print(f"{input('>>>')} ------")
F-strings are quite useful when it comes to printing text + variables.

Exact strings do not match in Jupyter Notebook

What did I want to do?
I was reading file names that end in various organ names; there are many such files, and I collect them using glob.glob('filename/**/blabla').
Later, I tried to check whether a particular string is present inside a filename using the in operator, like this:
"ADRENALGLAND(LEFT).NRRD" in "blabla/blabla/blabla/blablabla_ADRENALGLAND(LEFT).NRRD"
It worked for some filenames with the same ending, whereas it did not work for a few others.
To debug, I compared two filename endings that look identical, and programmatically they are not the same! Why?
I compared them string to string, as below, and saw a peculiar thing while comparing strings in Python.
Can anyone tell me what the difference is here?
'ADRENALGLAND(LEFT).NRRD' == 'АDRENALGLAND(LEFT).NRRD' => False !!!
I narrowed it down to the first character: the 'A's do not match, whereas the other characters match properly.
As mentioned by @canbax, I checked the underlying code point of both characters and found that they are different. One gave 65 (the ASCII code for the Latin letter 'A'), whereas the other gave 1040, which is the Cyrillic capital letter А (U+0410).
You can use ord() to get the integer Unicode code point of a character.
Although the int values are different, the two characters look the same on screen. They are distinct Unicode characters that happen to share a glyph, so this is not a rendering problem on the Jupyter Notebook side.
Final Solution: Replaced the Cyrillic А with the normal Latin A in the file.
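A small sketch of how such look-alike characters could be found programmatically, using ord() and the standard unicodedata module. The helper name describe_non_ascii is hypothetical, and the sample strings are the ones from the question:

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, char, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


latin = "ADRENALGLAND(LEFT).NRRD"
lookalike = "\u0410DRENALGLAND(LEFT).NRRD"  # first letter is Cyrillic, written as an escape

report = describe_non_ascii(lookalike)
for index, char, code_point, name in report:
    print(index, char, code_point, name)
```

Running a check like this over every filename returned by glob.glob would flag the files whose names contain unexpected characters.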

Why do apostrophes (" ' ") turn into ▒'s when reading from a file in Python?

I used Bash to open a Python file. The Python file should read a utf-8 file and display it in the terminal. It gives me a bunch of ▒'s ("Aaron▒s" instead of "Aaron's"). Here's the code:
# It reads text from a text file (done).
f = open("draft.txt", "r", encoding="utf8")
# It handles apostrophes and non-ASCII characters.
print(f.read())
I've tried different combinations of:
read formats with the open function ("r" and "rb")
strip() and rstrip() method calls
decode() method calls
text file encoding (specifically ANSI, Unicode, Unicode big endian, and UTF-8).
It still doesn't display apostrophes (" ' ") properly. How do I make it display apostrophes instead of ▒'s?
The issue is with Git Bash. If I switch to Powershell, Python displays the apostrophes (Aaron's) perfectly. The semantic read errors (Aaron▒s) appear only with Git Bash. I'll give more details if I learn more about it.
Update: @jasonharper and @entpnerd suggested that the draft.txt apostrophe might be "apostrophe-ish" and not a legitimate apostrophe. I compared the draft.txt apostrophe (copied and pasted from a Google Doc) with an apostrophe entered directly. They look different (’ vs. '). In xxd, the hex value for the apostrophe-ish character is 92. An actual apostrophe is 27. Git Bash only supports the latter (unless there's just something I need to configure, which is more likely).
Second Update: Clarified that I'm using Git Bash. I wasn't aware that there were multiple terminals (is that the right way of putting it?) that ran Bash.
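The two characters from the update can be compared directly in Python. This sketch assumes the "apostrophe-ish" character is U+2019 (the curly quote word processors usually insert) and that the file was saved in a Windows-1252 ("ANSI") encoding when xxd showed byte 92; both are assumptions, not something the post confirms:

```python
import unicodedata

straight = "'"       # typed directly on the keyboard
curly = "\u2019"     # assumed to be the character pasted from the Google Doc

# Code point and official Unicode name of each character.
info = {ch: (hex(ord(ch)), unicodedata.name(ch)) for ch in (straight, curly)}

# How the curly quote is stored on disk under two common encodings.
curly_cp1252 = curly.encode("cp1252")
curly_utf8 = curly.encode("utf-8")

print(info)
print(curly_cp1252, curly_utf8)
```

Under that assumption, the single byte 92 matches the Windows-1252 form, while a file that is really UTF-8 would store the same character as three bytes.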

Multiline input prompt indentation interfering with output indentation

I have a function that prints the number of pixels found in an image and then asks the user how they would like to proceed. As long as the interpreter hasn't moved on from the function I want all the output to be indented accordingly.
One such 'sub output' (the input prompt) needs to be multiple lines. So I kick off with the 3*quote (''') followed by two spaces to create the indentation. At the end of the question 'how would you like to proceed?' I use a hard return. An extra indentation is assumed by the text editor so I remove it causing the following list of suggestions to line up flush with the input variable command. Here's how it looks:
def returnColors():
    #
    # lots of code that does stuff...
    #
    print("The source image contains", lSize, "px.")
    print("")
    command=input('''  What would you like to do? You can say:
    get all
    get unique
    ''')
The problem with this is that the interpreter is acknowledging the indentation that separates the function body from the function statement as actual string contents, causing the output to look like this:
The source image contains 512 px.
What would you like to do? You can say...
get all
get unique
|
The only way to avoid this is by breaking indentation in the interpreter. Although I know it works, it doesn't look very good. So what options do I have?
One thing that you should keep in mind is that once you have started a multiline string declaration, all the text until it is closed is taken as is, and syntax (i.e., indentation) is no longer considered.
You can start your multiline string with an explicit new line so that everything in the multiline string can be indented together in code.
For example:
command=input('''
What would you like to do? You can say:
get all
get unique
''')
would print out the prompt with a new line on top, but the formatting of the text is more explicitly shown and should appear as seen.
OR you could use the \n for each new line in the string to get it formatted more correctly and remember to use a single \ after each new line. E.g.
instead of:
''' What would you like to do? You can say:
get all
get unique
'''
Try
' What would you like to do? You can say:\
\n\
\n get all\
\n get unique\
\n'
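Note that the indentation of the continuation lines does matter here: inside a string literal, a backslash at the end of a line removes only the newline, so any leading whitespace on the next line becomes part of the string. Keep the continuation lines flush left, so that input() receives exactly this string: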
' What would you like to do? You can say:\
\n\
\n get all\
\n get unique\
\n'
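Another option, not mentioned in the answers above, is the standard library's textwrap module, which lets the string stay indented with the function body. This is a sketch only; the helper name build_prompt is hypothetical, and the two-space output indent follows the question:

```python
import textwrap


def build_prompt():
    """Return the prompt text, indented by two spaces regardless of code indentation."""
    # dedent() strips the whitespace common to every line, so the text can be
    # indented along with the surrounding code. The backslash after the opening
    # quotes avoids an empty first line.
    prompt = textwrap.dedent('''\
        What would you like to do? You can say:
          get all
          get unique
        ''')
    # indent() then adds the two-space output indentation to every line.
    return textwrap.indent(prompt, "  ")


prompt = build_prompt()
print(prompt, end="")
# Interactive usage: command = input(prompt)
```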

How to make a dictionary that contains an Arabic diacritic as a key in python

I am trying to make a program that converts Arabic diacritics and letters into the Latin script. The letters work well in the program, but the diacritics cannot be converted: I get an error every time I run the program.
At first, I put the diacritics alone as keys, but that did not work for me. Please see the last key: it contains َ , which is a diacritic, but it does not work properly the way the letters do:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("الوَ")
However, I tried to fix the problem by using letters attached with diacritics as keys, but the program resulted in the same error:
the dictionary:
ArEn = {'ا':'A', 'ل':'L', "وَ":"Wa"}
the error:
Traceback (most recent call last):
File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 10, in <module>
convert("الوَ")
File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 5, in convert
end_word.append(ArEn[lit[i]])
KeyError: 'و'
The chances are that there is a bug in the code editor you are using to write Python, rather than in Python itself.
Since you are using Python 3.x, a diacritic is, from the running program's point of view, just a single character like any other, and there should be no issues at all.
From the code editor's point of view, there are issues such as whether or not to advance one character when displaying certain special Unicode characters, and the " character itself may be shown out of place. When one tries to manually correct the position of the ", one can put it in the wrong order, leaving the special character actually outside the quoted string.
The fact that you could solve the issue by re-editing the file suggests that this is indeed what happened.
One way to avoid this for certain special characters, especially ones that have different display rules, is to escape them with the "\uxxxx" Unicode code point escape sequence. This spares you and other people trouble when editing your file in the future: even if you get it working now, an editor may show the characters incorrectly when the file is opened again, and by trying to fix the display one might break the syntax again.
You can use a table on the web or Python 3's interactive prompt to get the Unicode code point of each character, ensuring the code part of the program is displayed in a deterministic way in any editor. (If you add the diacritical character as a comment on the same line, it will actually enhance the readability of your code, enormously so if it is ever supposed to be edited by non-Arabic speakers.)
For your declaration above, I used this snippet to extract the code points:
>>> ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
>>> [print (hex(ord(yy)), yy ) for yy in ArEn.keys()]
0x648 و
0x644 ل
0x64e َ
0x627 ا
Which allows me to declare the dictionary like this:
ArEn = {
    "\u0648": "W",  # و
    "\u0644": "L",  # ل
    "\u064e": "a",  # َ  (fatha)
    "\u0627": "A",  # ا
}
(And yes, I had trouble displaying the characters on my terminal while getting these, as I said you probably had in your editor: the fatha ("\u064e" - "a") character is tricky! :-) )
An alternative to using the code points in your code is to use Python's unicodedata module to discover and then use the actual character names. This can enhance readability further, and by exploring unicodedata you may find that you don't even have to create this dictionary manually, but can use that module instead:
In [15]: import unicodedata
In [16]: [print("\\u{:04x} - '{}' - {}".format(ord(yy), unicodedata.name(yy), yy) ) for yy in ArEn.keys()]
\u0648 - 'ARABIC LETTER WAW' - و
\u0644 - 'ARABIC LETTER LAM' - ل
\u064e - 'ARABIC FATHA' - َ
\u0627 - 'ARABIC LETTER ALEF' - ا
And from these full text names, you can get back to the character with the unicodedata.lookup function:
>>> unicodedata.lookup("ARABIC LETTER LAM")
'ل'
Notes:
1) This requires Python 3. For Python 2 one might try to prefix each string with u"", but anyone dealing with these characters is far better off using Python 3, since Unicode support is one of its big improvements.
2) This also requires a terminal with good support for Unicode characters using the "utf-8" encoding. I am on a Linux system with the "konsole" terminal. On Windows, the IDLE Python prompt might work, but not the cmd Python prompt.
You might need proper indentation in python:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")
Update: I just noticed, after years, that the letters and diacritics are put together in the first try. When I separated them, the program worked.
I just solved the problem!
I am not really sure if it is a mistake in Python or something else, but as far as I know Python does not support Arabic very well. Or maybe I made a mistake in the program above.
I kept writing the same program and suddenly it worked very well.
I even added different diacritics and they worked properly.
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word=[]
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")
The result is:
AwLWa
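The point from the update above, that a letter and its diacritic are separate characters, can be checked with a short sketch. The names AR_EN and transliterate are hypothetical, the mapping is only the four characters from the question, and the "?" fallback for unknown characters is an added assumption:

```python
import unicodedata

# Hypothetical mapping; \u escapes keep the source readable in any editor.
AR_EN = {
    "\u0627": "A",  # ARABIC LETTER ALEF
    "\u0644": "L",  # ARABIC LETTER LAM
    "\u0648": "W",  # ARABIC LETTER WAW
    "\u064e": "a",  # ARABIC FATHA (a combining diacritic)
}


def transliterate(text, table=AR_EN):
    """Map each code point through the table; unknown characters become '?'."""
    return "".join(table.get(ch, "?") for ch in text)


word = "\u0627\u0644\u0648\u064e"  # alef, lam, waw, fatha

# Iterating a str yields one code point at a time, so the waw and the fatha
# attached to it arrive as two separate items.
names = [unicodedata.name(ch) for ch in word]
result = transliterate(word)

print(len(word), names)
print(result)
```

Because the loop sees the waw and the fatha separately, a key made of the two combined never matches a single character, which is consistent with the KeyError shown earlier.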
