Get word definition(s) from Google Translate using Python - python-3.x

Please help: I am making my own dictionary and can't figure out how to pull translation definitions from Google Translate. My idea is that Python will open my Excel file, where every cell in column 1 contains a word. Python should take every single word, translate it from English to Slovak using Google Translate, and fetch not just the translated word but its definition(s) (if there is more than one definition, take them all) and the word class of each definition (noun, adverb, verb, ...). It should then add this data to the Excel table, either in a new cell next to the original word or, if there are several definitions, in a new row for every definition.
I'm new to this so please excuse me.

To satisfy your requirements, one way to structure your script is the following:
You can use pandas.read_excel to read your Excel file and do some data manipulation to get all the values in column 1.
Once you have the values to translate, you can use something like googletrans, which uses Google Translate on the back end, or use the paid Google Translation API to handle your translations. Based on your requirements, I suggest using the Google Translation API, since it is capable of returning all possible definitions.
When you get your translations, it is up to you to transform your data so you can add it as a new column in your original Excel file. You can use pandas.ExcelWriter for this.
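For example, a minimal sketch of that Excel round trip with pandas (the file names, the 'word' column, and the placeholder translate call are assumptions, not part of the original answer):

import pandas as pd

# read the workbook; assumes the words sit in a column named 'word'
df = pd.read_excel('dictionary.xlsx')

# placeholder transformation: replace this with a real call to your translation function
df['translated'] = df['word'].apply(lambda w: w)

# write the updated table back out with pandas.ExcelWriter
with pd.ExcelWriter('dictionary_updated.xlsx') as writer:
    df.to_excel(writer, index=False)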
I made this simple script that reads a CSV file (I don't have Excel installed on my machine), translates everything under the text column, and puts the results in a translated column. It's up to you to process the data differently if you need to.
NOTE for the script below:
I used the Google Translation API, which is the paid service.
Use pd.read_excel() to read Excel files.
Adjust the column number based on your input file.
sample_data.csv:
text,dummy_field
run,dummy1
how are you,dummy2
jump,dummy3
Sample script:
import pandas as pd
from google.cloud import translate_v2 as translate

def translate_text(text):
    translate_client = translate.Client()
    # 'tl' (Tagalog) is just the example target here; use 'sk' for Slovak
    target = 'tl'
    result = translate_client.translate(text, target_language=target)
    return result["translatedText"]

def process_data(input_file):
    # df = pd.read_excel('test.xlsx', engine='openpyxl')
    df = pd.read_csv(input_file)
    df['translated'] = df['text'].apply(translate_text)
    # move column 'translated' to the second column
    # this position will depend on your actual data
    second_col = df.pop('translated')
    df.insert(1, 'translated', second_col)
    print(df)
    df.to_csv('./updated_data.csv', index=False)
    df.to_excel('./updated_data.xlsx', index=False)

process_data('sample_data.csv')
Output: the printed dataframe, the generated CSV file, and the generated Excel file (shown as screenshots in the original answer).
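Going back to the original requirement of one row per definition: a minimal sketch of how several definitions per word could be expanded into separate rows with pandas (the column names and the nested (part of speech, definition) structure are assumptions):

import pandas as pd

# assumed structure: each word maps to a list of (part of speech, definition) pairs
data = {
    'word': ['run', 'jump'],
    'definitions': [
        [('verb', 'move at a speed faster than a walk'), ('noun', 'an act of running')],
        [('verb', 'push oneself off a surface into the air')],
    ],
}
df = pd.DataFrame(data)

# one row per definition, then split each pair into its own columns
df = df.explode('definitions').reset_index(drop=True)
df[['part_of_speech', 'definition']] = pd.DataFrame(df['definitions'].tolist())
df = df.drop(columns='definitions')
print(df)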

Related

Ideas to extract specific invoice pdf data for different formats and convert to Excel

I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following:
First of all, the data to be extracted would be the following:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information surrounded in red. This would be the CUPS, the total amount and the consumed electricity per period (P1-P6).
Once this is extracted, I would like to display this in an Excel Spreadsheet.
Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but do not know how I could extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
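For example, a minimal sketch of that crop-then-OCR idea (assuming the invoice page has already been rendered to an image and that Pillow and pytesseract are installed; the file name and pixel coordinates are placeholders):

from PIL import Image
import pytesseract

# load the invoice page rendered as an image; its resolution must match the coordinates below
page = Image.open('invoice_page1.png')

# pixel box (left, upper, right, lower) around the field of interest; placeholder values
cups_region = page.crop((100, 200, 500, 240))

# run OCR only on the cropped region
cups_text = pytesseract.image_to_string(cups_region)
print(cups_text.strip())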
I would check whether it is possible to simply extract the text using pdftotext first, then build my command-line text parsing around that output, looping from file to file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally I would treat the two parts as separate cycles. For the first pair, replace the , with a safe CSV character such as *, then inject a , for the large gap to make them a 2-column CSV (perhaps replace the € with ,€ if necessary, since your captured text may already be in euros).
For the second group I would inject , by numeric position to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the data however you want to import it into Excel.
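If you would rather drive the same extraction from Python, a rough sketch (assuming Poppler's pdftotext is on the PATH; the crop values are copied from the command above):

import subprocess

# extract only the region of the page that holds the consumption table (same -y/-W/-H values as above)
result = subprocess.run(
    ['pdftotext', '-nopgbrk', '-layout', '-y', '200', '-W', '300', '-H', '200',
     'electric.pdf', '-'],
    capture_output=True, text=True, check=True,
)

# split each non-empty line on runs of whitespace to approximate the columns
for line in result.stdout.splitlines():
    fields = line.split()
    if fields:
        print(','.join(fields))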
In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
After extracting, regex helps to filter for the specific data: e.g. look for keywords, or for the length and structure of the data item (e.g. the CUPS number), or check whether it is a currency with decimals, etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can specify the matching pattern further: e.g. starting with E, or the 5th character is X or 5, etc.)
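For example, a minimal regex sketch in Python along those lines (the sample text and the allowed code length are assumptions; tighten the pattern once you know the exact CUPS format):

import re

# assumed sample of text extracted from the invoice; the real layout will differ
text = """
CUPS: ES0031406512345678XY
Importe total: 123,45 EUR
"""

# look for a line starting with CUPS, then capture the code that follows
match = re.search(r'^CUPS:?\s*([A-Z0-9]{15,22})', text, flags=re.MULTILINE)
if match:
    print(match.group(1))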

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF to count the cells that contain the value you want. However, there is a more robust and reproducible solution that involves using SQL. You can download the free version of SQL Express here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data, here's how to do that: How to import an Excel file into SQL Server?
Then you could use the more friendly SQL database to get the answers you want. For example you can use a select statement that would say:
SELECT COUNT(e_cerv_dis_state)
FROM your_table  -- replace with the name of the table you imported
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN statement to add in the names of the diagnoses.
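Outside of SPSS or SQL, the same frequency count can also be sketched in Python with pandas (the column name and the {1/2/3} formatting are taken from the SPSS sample above):

import pandas as pd

# sample rows in the same {1/2/3} format as the SPSS example
df = pd.DataFrame({'e_cerv_dis_state': ['{1/2/3/6}', '{1/2/4}', '{2/4/5}', '{4}']})

# pull out every diagnosis code (one row per code), then count how often each occurs
codes = df['e_cerv_dis_state'].str.findall(r'\d+').explode()
print(codes.value_counts().sort_index())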

Problem when importing table from pdf to python using tabula

When importing data from a PDF using tabula with Python, in some cases I obtain two or more columns merged into one. It does not happen with all the files obtained from the same PDF.
In this case, this is the code used to read the pdf:
from tabula import wrapper

tables = wrapper.read_pdf("933884 cco Saupa 1.pdf", multiple_tables=True, pages='all')
i = 1
for table in tables:
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    i = i + 1
For example, when I print the first item of the dataframe obtained from one of these excel files, named "output_pd":
print (output_pd[0][1])
I obtain:
76) 858000015903708 77) 858000013641969 78)
The five numbers are in a single column, so I cannot treat them individually.
Is it possible to improve the data handling in these cases?
You could try manually editing the data in Excel. Text to Columns under the Data tab in Excel allows you to split one column into multiple columns without too much work, but you would need to do it for every Excel file, which could be a pain.
By iterating over each item of each column of each dataframe in the list obtained with tabula's wrapper.read_pdf(file) (in this case, tables), it is possible to obtain clean data.
In this case:
prueba = []
for table in tables:
    for columna in table.columns:
        for item in str(table[columna]).split(" "):
            if "858" in str(item):
                prueba.append(item[0:15])
print(prueba[0:5])
results in:
['858000019596025', '858000015903707', '858000013641975', '858000000610864', '858000013428853']
But tabula.wrapper.read_pdf does not read the whole initial PDF: 2 values on the last page are left out, so a little manual editing is still necessary.
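As an alternative to splitting on spaces, a regex pass over each dataframe might be a bit more robust; a sketch (the 858 prefix and the 15-digit length come from the output above, and tables is the list returned by wrapper.read_pdf):

import re

numbers = []
for table in tables:
    # render the dataframe as text and keep every 15-digit number that starts with 858
    numbers.extend(re.findall(r'858\d{12}', table.to_string()))
print(numbers[:5])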

From Excel to quantmod

I am doing some simple analysis using quantmod; my file is a CSV file exported from Excel.
The first column is a date in the format YYYY-MM-DD, and I then have ten columns containing price data, each representing a fund or index. None of the data is on Yahoo, so I cannot use getSymbols.
Could someone give me the code to bring the file into R in a format workable with quantmod, explained in a form that a non-programmer can understand?
I think the issue you have is that if you read the CSV file into R it is a dataframe object. Use the class() function to confirm.
library(tidyverse)
library(quantmod)
library(timetk)

my_data <- readr::read_csv('my excel file.csv')
class(my_data)
To use quantmod functions your data needs to be in an xts object (a time-series object); it can't be in a dataframe. You can convert a dataframe that has a date/index column into an xts object using the timetk::tk_xts() function. Then you should be able to use quantmod functions for analysis on your data.
my_xts <- timetk::tk_xts(my_data)
quantmod::monthlyReturns(my_xts)

how to search a text file in python 3

I have this text file that has lists in it. How would I search for an individual list? I have tried using loops to find it, but every time it gives me an error since I don't know what to search for.
I tried using an if statement to find it, but it returns -1.
Thanks for the help.
I was doing research on this last night. You can use pandas for this. See here: Load data from txt with pandas. One of the answers talks about lists in text files.
You can use:
import pandas as pd

data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["Name", "b", "c", "etc."]
Add sep=" " in your code, leaving a blank space between the quotes, so pandas can detect the spaces between values and sort them into columns. data.columns is for naming your columns.
With a JSON or XML format, text files become more searchable. In my research I decided to go with an XML approach. Here is a link to a blog that explains how to use Python with XML: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe.
If you want to search the data frame try:
import pandas as pd

txt_file = r'C:\path\to\your\txtfile.txt'
df = pd.read_table(txt_file, sep=",")
row = df.loc[df['Name'] == 'bob']
print(row)
Now, depending on how your text file is formatted, this will not work for every text file. The idea of a dataframe in pandas is that it gives your data a CSV-like structure, which makes the process repeatable and the results testable. Again, I recommend using a JSON or XML format before implementing pandas data frames in your solution. You can then create a consistent result that is testable too!
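Since the answer above recommends a JSON format first, here is a minimal sketch of storing the lists as JSON and searching them (the file name, keys, and the value searched for are assumptions):

import json

# assumed structure: the text file holds a JSON object mapping names to lists
with open('data.json') as f:
    records = json.load(f)

# keep every list that contains the value we are looking for
matches = {name: values for name, values in records.items() if 'bob' in values}
print(matches)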
