Pandas Column Names Not Lining Up When .dat File Read - python-3.x

I'm going through Wes McKinney's Python for Data Analysis, 2nd Edition, and in Chapter 2 he has several examples based on merging three .dat files about movie reviews.
I can get two of the three data files to work (users and reviews), but the third one (movie titles) I cannot get to work, and I can't figure out what to do.
Here's the code:
import pandas as pd

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, engine='python', names=mnames)
print(movies[:5])
And here is what the output/problem looks like: it seems the separator is not being applied correctly. I've tried recreating the file and comparing it to the other two files, which are working, but they look exactly the same.
Here's a sample of the data:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
13::Balto (1995)::Animation|Children's
14::Nixon (1995)::Drama
I'd like to be able to read this file properly so I can join it to the other two example files and keep learning Pandas :)

Try adding encoding='UTF-16' to pd.read_table().
(Sorry, not enough reputation to add a comment.)
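In code, the suggested change would look like this (a sketch, assuming the file really is UTF-16 encoded; if it still misparses, another encoding such as 'latin-1' may be worth trying, since the exact encoding is an assumption here):
import pandas as pd

mnames = ['movie_id', 'title', 'genres']
# Same call as before, with an explicit encoding added
movies = pd.read_table('movies.dat', sep='::', header=None,
                       engine='python', names=mnames, encoding='UTF-16')
print(movies[:5])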

Related

Problem with .xls file validation on e-commerce platform

You may have noticed that this is a long question; that's because I really put in an effort to explain the many WTFs I'm facing, and it may still not be that clear. Anyway, I appreciate your help!
Context
I'm doing an integration project for a client that handles a bunch of data to generate Excel files in .xls format (notice that extension!).
While developing the project I was using the xlrd and xlwt Python packages because, again, I need to create a .xls file. But at some point I had to download and extract a file that was in .csv format (though, in reality, the file contains an HTML table :c).
So I decided to use pandas to read the HTML, create a DataFrame I can manipulate, and return a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload this file to the e-commerce platform.
What happened is that the platform doesn't validate my file.
First I will briefly explain how the site works: it accepts .xls and only .xls files, probably manipulates and uses them to update the database, and I have access to nothing in the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site didn't map them correctly, I can map the Excel columns to the IDs or values that will be updated in the database.
The 'generico4' field expects type 'smallint(5) unsigned'.
An important fact is that I sent the file to my client so he could validate the data, and after many conversations between us it was discovered that if he, just by downloading my file, opening it, and saving it, the upload works fine (the second image from my slide). It's important to note that he has a MacBook and I'm on Ubuntu. I tried to do the same thing, but it didn't work.
He sent me this file and I tried to see the difference between the two, but I found nothing: the numbers have the same type, 'float', and checking in Excel with the formula =TYPE(cell) returned 1.
I've already tried many other things, but nothing works :c
The code
Here is the code so you can get an idea of the logic:
import os
from pathlib import Path

import pandas as pd
import xlwt

def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1, 2]]
    df.rename(columns={1: 'sku', 2: 'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (doing it via pandas was not working either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy the DataFrame data into the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    # BASE_DIR is defined elsewhere in the project
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, 'stock.xls')
    wb_out.save(file_path)
    return file_path
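For context, a hypothetical call (the input file name is invented for illustration) would look like:
# The input is the downloaded ".csv" (really an HTML table);
# the function returns the path of the generated .xls file.
xls_path = stock_xls('downloads/stock_export.csv')
print(xls_path)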

We have many mainframe files which are in EBCDIC format. Is there a way in Python to parse or convert a mainframe file into a CSV or text file?

I need to read the records from a mainframe file and apply some filters to the record values.
So I am looking for a solution to convert the mainframe file to CSV, text, or an Excel workbook so that I can easily perform operations on the file.
I also need to validate the record count.
If it is all text, then FTP'ing with EBCDIC-to-ASCII translation is doable, including from within Python (a sketch follows below).
If not, then either:
The extraction and conversion to CSV needs to happen on z/OS, perhaps with a COBOL program. Then the CSV can be FTP'ed down with text translation.
or
The data has to be FTP'ed BINARY and then parsed, with bits of it translated.
But, as so often is the case, we need more information.
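For the all-text case, here is a minimal sketch (the host, credentials, and dataset name are hypothetical) of a text-mode FTP transfer, where the z/OS FTP server performs the EBCDIC-to-ASCII translation before the data reaches Python:
from ftplib import FTP

ftp = FTP('zos.example.com')
ftp.login('user', 'password')
with open('records.txt', 'w') as out:
    # retrlines runs the transfer in ASCII (text) mode, so the server
    # translates EBCDIC to ASCII on the fly, record by record.
    ftp.retrlines("RETR 'HLQ.MY.DATASET'", lambda line: out.write(line + '\n'))
ftp.quit()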
I was recently processing the hardcopy log and wanted to break the records apart. I used Python to do this, as each record was effectively a fixed-position record with different data items at fixed locations. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
          1         2         3         4         5         6         7         8         9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]
    self.routing = list(message[2:9])
    self.routingCodes = []  # These are routing codes extracted from the system log.
    self.sysname = message[10:18]
    self.date = message[19:24]
    self.time = message[25:36]
    self.ident = message[37:45]
    self.msgflags = message[46:54]
    self.msg = [message[56:]]
You can then format the fields into the form you need for further processing. There are other ways to process mainframe data, but based on the question this approach should suit your needs, though there are many variations.
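As an illustration of turning those fixed-position slices into a CSV for Excel (the file names here are hypothetical), a minimal sketch might look like:
import csv

with open('hardcopy.log') as logfile, open('hardcopy.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['sysname', 'date', 'time', 'ident', 'msgflags', 'message'])
    for record in logfile:
        # Slice each fixed-position field, trimming the pad blanks
        writer.writerow([record[10:18].strip(), record[19:24].strip(),
                         record[25:36].strip(), record[37:45].strip(),
                         record[46:54].strip(), record[56:].rstrip()])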

How to save the output of text from selenium chrome (Python)

I'm using Selenium to extract the comments of a YouTube video.
Everything went well, but when I print comment.text, the output is only the last sentence.
I don't know how to save it for further analysis (cleaning and tokenization).
import time
from selenium import webdriver

# This is the path where I saved the chromedriver I downloaded
path = "/mnt/c/Users/xxx/chromedriver.exe"
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()

# Scroll down so the comments get loaded
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);')
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)

text_comment = chrome.find_element_by_xpath('//*[@id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[@id="content-text"]')
comment_ids = []
Try this approach to get the text of all comments. (The for-loop part was edited; there was no indentation in the previous code.)
for comment in comments:
    comment_ids.append(comment.get_attribute('id'))
    print(comment.text)
When I print, I can see all the texts here. But how can I open them for further study? Should I always use a for loop? I want to tokenize the texts, but the output is only the last sentence. Is there a way to save the whole text to a file and open it again? I googled a lot, but it wasn't successful.
So it sounds like you're just trying to store these comments to reference later. Your current solution is to append them to a string and use a token to create substrings? I'm not familiar with Python's data structures, but this sounds like a great job for an array or a list, depending on how you plan to reference this data.
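A minimal sketch of that idea: gather each comment's text into a list, write the list to a plain-text file, and read it back later for cleaning and tokenization (the file name is arbitrary):
# Collect the text of every comment into a list
comment_texts = [comment.text for comment in comments]

# Save one comment per line for later analysis
with open('comments.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(comment_texts))

# Read the comments back when needed
with open('comments.txt', encoding='utf-8') as f:
    saved_comments = f.read().splitlines()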

Improve speed of wikipedia.search with Python

I have a dictionary of famous people's names sorted by their initials. I want to convert these names into their respective Wikipedia title page names. These are the same for the first three given in this example, but Alexander Bell gets correctly converted to Alexander Graham Bell after running this code.
The algorithm works, although it took about an hour to do all the 'AA' names, and I am hoping for it to do this all the way up to 'ZZ'.
Is there any optimisation I can do on this? For example, I saw something about batch requests, but I am not sure whether it applies to my algorithm.
Or is there a more efficient method that I could use to get this same information?
Thanks.
import wikipedia

PeopleDictionary = {'AA': ['Amy Adams', 'Aaron Allston'], 'AB': ['Alia Bhatt', 'Alexander Bell']}

for key in PeopleDictionary:
    for i in range(len(PeopleDictionary[key])):
        # Take the top search result as the Wikipedia page title
        Name_URL_All = wikipedia.search(PeopleDictionary[key][i])
        if Name_URL_All:
            Name_URL = Name_URL_All[0]
            PeopleDictionary[key][i] = Name_URL
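One way to speed this up, sketched under the assumption that the bottleneck is network latency rather than the Wikipedia API itself, is to issue the independent searches from a thread pool:
from concurrent.futures import ThreadPoolExecutor

import wikipedia

def first_match(name):
    # Return the top search result, or the original name if nothing matches
    results = wikipedia.search(name)
    return results[0] if results else name

with ThreadPoolExecutor(max_workers=8) as pool:
    for key, names in PeopleDictionary.items():
        # pool.map preserves order, so each name is replaced in place
        PeopleDictionary[key] = list(pool.map(first_match, names))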

Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3

I have a CSV file that consists of some tweets downloaded through the API. The tweets contain some Unicode characters, and I have a pretty fair idea of how to decode them.
I put the CSV file into a DataFrame:
df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns
One of the tweets is:
b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
But when I access this tweet through the command
df['tweet'][0]
the output is returned in the format below:
"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"
I am not able to figure out why this extra backslash is getting appended to the tweet. As a result, the content is not getting decoded. Below are a few rows from the DataFrame:
time tweet
0 2018-11-02 05:55:46 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1 2018-11-02 05:46:41 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2 2018-11-02 03:44:35 b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3 2018-11-02 03:37:03 b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'
[Screenshot of 'sample.csv']
As I mentioned before, if any of these tweets is accessed directly, an extra backslash appears in the output.
Can anyone please explain why this is happening and how to avoid it?
Thanks.
You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter; that is, inside the CSV file itself you will find the literal b'\xf0...' characters.
So, when you read it from Python, although when printed it appears to be a bytes object (the kind represented with b'...'), it is actually a string whose content is that representation.
One way to get these back as proper strings is to let Python evaluate their content; they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval, as eval is too arbitrary.
So, after you have your data loaded into your dataframe, this could fix your tweets column:
import ast
df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
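For example (a small illustration, not from the original thread), the round trip works like this:
import ast

s = "b'caf\\xc3\\xa9'"        # a plain str whose content looks like a bytes literal
raw = ast.literal_eval(s)     # now a real bytes object: b'caf\xc3\xa9'
print(raw.decode('utf-8'))    # prints: café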
