Program using ast.literal_eval is too slow - python-3.x

I tried to transform strings into lists using the ast.literal_eval function for a column in a CSV file. The string looks like this: '['abbb','cddd','cdcdc']'. For some reason it is a string instead of a list, so I tried to use ast.literal_eval to turn it into a list with the elements 'abbb', 'cddd' and 'cdcdc'. The problem is that the execution is too slow (there are 1326101 rows to process). The code I use is this:
import pandas as pd
import ast
import sys
user_dataset = pd.read_csv('user.csv')
for x in range(len(user_dataset['friends'])):
    if user_dataset['friends'][x] != []:
        """Convert string to list"""
        user_dataset['friends'][x] = ast.literal_eval(user_dataset['friends'][x])
Thanks a lot!
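Converting the whole column in one pass is usually much faster than assigning element by element inside a Python loop. A minimal sketch under that assumption (the check against the string '[]' is a guess about how empty lists appear in the file):
import ast
import pandas as pd

user_dataset = pd.read_csv('user.csv')

# Convert the whole 'friends' column at once instead of assigning row by row;
# rows that hold an empty list (stored as the string '[]') are left untouched.
mask = user_dataset['friends'].notna() & (user_dataset['friends'] != '[]')
user_dataset.loc[mask, 'friends'] = user_dataset.loc[mask, 'friends'].map(ast.literal_eval)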

Related

How to manipulate this CSV in Python so that when I enter a value like psg, the result is 9

This is the example CSV. I already tried another way but it still prints useless things; I just want to get the values (Count).
You can just read it with pandas:
import pandas as pd
df = pd.read_csv(r'FILE.csv')
print(df['Count'])
If you need to print only the Counts:
print(*df['Count'].values.tolist(),sep="\n")

Python vs BigQuery FarmHash Sometimes Do Not Equal

In BigQuery, when I run
select farm_fingerprint('6823339101') as f
the result is
-889610237538610470
In Python
#pip install pyfarmhash
import farmhash
print(farmhash.hash64('6823339101'))
results in
17557133836170941146
BigQuery & Python do agree on most inputs, but there are specific ones like the one above where there is a mismatch for the same input
'6823339101'
How can I get bigquery & python to agree 100% of the time?
Links to the BigQuery and Python hash documentation:
https://pypi.org/project/pyfarmhash/
https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions
As mentioned in the comments, the function returns an unsigned int, so we need to convert it as follows:
import numpy as np
np.uint64(farmhash.fingerprint64(x)).astype('int64')
Relevant issues: https://github.com/lovell/farmhash/issues/26#issuecomment-524581600
Results:
>>> import farmhash
>>> import numpy as np
>>> np.uint64(farmhash.fingerprint64('6823339101')).astype('int64')
-889610237538610470
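If you would rather not pull in numpy for this, the same two's-complement reinterpretation can be done in plain Python. A minimal sketch (the helper name to_signed64 is hypothetical):
import farmhash

def to_signed64(u):
    # Reinterpret an unsigned 64-bit value as a signed 64-bit integer
    return u - 2**64 if u >= 2**63 else u

print(to_signed64(farmhash.fingerprint64('6823339101')))  # -889610237538610470, matching BigQuery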
Quickly scanning over the documentation that you have linked and the pyfarmhash source:
The docs for farm_fingerprint read:
Computes the fingerprint of the STRING or BYTES input using the Fingerprint64 function
But in your Python code you are using the hash64 function, which, according to the pyfarmhash source code, calls a different function from the farmhash library than fingerprint does.
Solution:
Use the same function that farm_fingerprint uses:
import farmhash
print(farmhash.fingerprint64('6823339101'))

Python translate a column with multiple languages to english

I have a dataset with multiple comment columns in multiple languages, and I want to translate these columns into English and create new columns with all the English translations.
Accountability_COMMENT is the column that has comments in different languages in every row. I want to create a new column and translate all such comments to English.
I have tried the following code :
from googletrans import Translator
from textblob import TextBlob
translator = Translator()
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(
    lambda x: TextBlob(x).translate(to='en'))
The error that I am getting is :
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
My column has object dtype, which is correct.
You most probably have some comments that consist only of a float (i.e. a decimal number); even if they are type: object according to pandas, they are still interpreted as float by TextBlob. This leads to the error:
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
One solution is to make sure that the input x of TextBlob(x) is a string. You could do this by modifying the apply line like:
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(lambda x: TextBlob(str(x)).translate(to='en'))
Unfortunately this will probably also raise an error like:
raise NotTranslated('Translation API returned the input string unchanged.')
textblob.exceptions.NotTranslated: Translation API returned the input string unchanged.
This is due to the fact that when translating a number, the translation and the original text will be exactly the same, and apparently TextBlob doesn't like that.
What you can do to avoid this is to catch that exception NotTranslated and just return the untranslated TextBlob, like this:
from textblob import TextBlob
from textblob.exceptions import NotTranslated
def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(translate_comment)
EDIT:
If you get the HTTP error Too Many Requests, it's probably because you are being kicked out by the Google Translate API. Instead of using apply, you can make your translation "extra slow" by using a for loop with some sleep in between cycles. In this case you should import another package (time) and substitute the last line:
from time import sleep
from textblob import TextBlob
from textblob.exceptions import NotTranslated
def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

for i in range(len(data_merge['Accountability_COMMENT'])):
    # Translate one comment at a time
    data_merge['Accountability_COMMENT'].iloc[i] = translate_comment(data_merge['Accountability_COMMENT'].iloc[i])
    # Sleep for a quarter of a second
    sleep(0.25)
You can then experiment with different values for the sleep function. Of course, the longer the sleep, the slower the translation! N.B. the sleep argument is in seconds.

Changing specific strings into floats in multidimensional array

I saved all the data into an array full of strings, but I want to change the strings in that array into floats without changing the header (the first row) and the first column of the array. How should I change my code?
import numpy as np
import csv
with open('MI_5MINS_INDEX.csv', encoding="utf-8") as f:
    data = list(csv.reader(f))
for line in data:
    line.remove('')
ary = np.array(data)
ary.astype(float)
Use pandas read_csv() and it will work as you wish.
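A minimal sketch of what that could look like (file name taken from the question; treating the first row as the header and keeping the first column as the index via index_col=0 is an assumption about the file's layout):
import pandas as pd

# The first row becomes the header and the first column becomes the index,
# so only the remaining cells are converted to float.
df = pd.read_csv('MI_5MINS_INDEX.csv', encoding='utf-8', index_col=0)
values = df.astype(float).to_numpy()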

python, loading a string from file

I'm trying to load a .txt file into my python project using numpy:
import numpy as np
import sys
g = np.loadtxt(sys.argv[1])
This command worked for me when the .txt file was a 0/1 matrix, but it is not working now that it is a string matrix (a 4*7 table of words like "crew"). The error says "can't convert string to float". Any help?
Take a look at the dtype parameter in the numpy.loadtxt documentation:
dtype : data-type, optional
Data-type of the resulting array; default: float. If this is a structured data-type, the resulting array will be 1-dimensional, and each row will be interpreted as an element of the array. In this case, the number of columns used must match the number of fields in the data-type.
The default is float, which results in the error you are pointing out in your question.
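For example, a minimal sketch that loads the table as words instead of numbers (assuming whitespace-separated values, which is what numpy.loadtxt expects by default):
import sys
import numpy as np

# Read every cell as a string instead of the default float dtype
g = np.loadtxt(sys.argv[1], dtype=str)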
One option is using pandas:
import numpy as np
import pandas as pd
arr = pd.read_table(filename, sep=" ", header=None).values
(Assuming the separator is whitespace and there is no header row; specify otherwise.)
