Issue with translating multiple languages in a single document - python-3.x

I am trying out a language translation example using the translate package, with Microsoft as the provider. The input text contains two languages, English and Russian, and my target language is English. The translated text does not change to English. Can anyone provide some input?
from translate import Translator
to_lang = "en"
translator = Translator(provider='microsoft', to_lang=to_lang, secret_access_key=secret)
translator.translate("Elapsed Task Time – время в течение, которого выполнялась задача ")
'Elapsed Task Time – время в течение, которого выполнялась задача '
Here is what I tried in order to compare the behaviour:
from googletrans import Translator
translator = Translator()
translator.translate("Elapsed Task Time – время в течение, которого выполнялась задача ", dest='en').text
"Elapsed Task Time - the time during which the task was performed"
Expected result :
"Elapsed Task Time - the time during which the task was performed"

For whatever reason, the (older) version of the Microsoft translator API used in this library won't autodetect mixed languages properly. It will work if your mixed languages include English and you specify from_lang for the other language. It always detects English. For example, if you specify from_lang='ru' and translate to 'it', the English portion will also be translated to Italian.
So, back to your scenario, this should work:
translator = Translator(provider='microsoft', to_lang=to_lang, from_lang='ru', secret_access_key=secret)
That said, I recommend you look at: https://github.com/MicrosoftTranslator/Text-Translation-API-V3-Python. In particular, Translate.py. This should work as expected and uses the latest API (well, more precisely, you get control over which API).
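If you want to call the v3 translate endpoint directly with requests, a minimal sketch might look like the following. This is a sketch rather than the repo's code: the subscription key and the optional region header are placeholders, and the standard global endpoint is an assumption.
import json
import requests

subscription_key = "YOUR_SUBSCRIPTION_KEY"  # placeholder
endpoint = "https://api.cognitive.microsofttranslator.com"
path = "/translate?api-version=3.0&to=en"  # let the v3 API auto-detect the source language(s)
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    # Depending on your Azure resource, a region header may also be required:
    # "Ocp-Apim-Subscription-Region": "YOUR_RESOURCE_REGION",
    "Content-Type": "application/json",
}
body = [{"text": "Elapsed Task Time – время в течение, которого выполнялась задача"}]
response = requests.post(endpoint + path, headers=headers, json=body)
print(json.dumps(response.json(), indent=2, ensure_ascii=False))
The v3 API detects the language of each input text, so mixed English/Russian input translated to English should come back fully in English.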

Related

Searching a lot of keywords on twitter via tweepy

I am trying to write a Python script with tweepy that will track all the tweets from a specific country, from a given date onward, that contain any of my chosen keywords. I have chosen quite a lot of keywords, around 24-25.
My keywords are vigilance anticipation interesting ecstacy joy serenity admiration trust acceptance terror fear apprehensive amazement surprize distraction grief sadness pensiveness loathing disgust boredom rage anger annoyance.
For context, my code so far is:
places = api.geo_search(query="Canada",granularity="country")
place_id = places[0].id
public_tweets = tweepy.Cursor(api.search,
                              q="place:" + place_id + " since:2020-03-01",
                              lang="en",
                              ).items(num_tweets)
Please help me with this question as soon as possible.
Thank You
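One way to search for many keywords at once is to join them with Twitter's OR operator and append the place and date filters to the same query string. Here is a sketch, assuming the same tweepy version and an already-authenticated api object as in the snippet above:
import tweepy

# Assumes `api` is an already-authenticated tweepy.API object, as in the snippet above
keywords = ["joy", "serenity", "trust", "fear", "sadness", "anger"]  # trim or extend as needed
keyword_query = " OR ".join(keywords)  # Twitter's standard search OR operator

places = api.geo_search(query="Canada", granularity="country")
place_id = places[0].id

query = "(" + keyword_query + ") place:" + place_id + " since:2020-03-01"

num_tweets = 100  # illustrative limit
for tweet in tweepy.Cursor(api.search, q=query, lang="en").items(num_tweets):
    print(tweet.text)
Note that the standard search endpoint limits query length and only covers roughly the last week of tweets, so a long keyword list may need to be split across several requests.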

Adding json data to json column is adding escape characters

I am using a Postgres database. I have a table called names which has a column named 'info' of type json. I am adding:
{ "info": "security" , description : "Sednit update: Analysis of Zebrocy: The Sednit group \u2013 also known as APT28, Fancy Bear, Sofacy or STRONTIUM \u2013 is a group of attackers operating since 2004, if not earlier, and whose main objective is to steal confidential information from specific targets.\n\nToward the end of 2015, we started seeing a new component deployed by the group; a downloader for the main Sednit backdoor, Xagent. Kaspersky mentioned this component for the first time in 2017 in their APT trend report and recently wrote an article where they quickly described it under the name Zebrocy.\n\nThis new component is a family of malware, comprising downloaders and backdoors written in Delphi and AutoIt. These components play the same role in the Sednit ecosystem as Seduploader; that of first-stage malware."}
Here I am using Node.js, with Sequelize as the ORM. When I save it in the table, I see "\\n" for "\n" and "\\u" for "\u". Can anyone help me avoid these escape characters when saving to the table?
In your JSON, description is of type string, so the newlines are converted to \n. That is the default behaviour; otherwise you would not get the newlines back when you fetch the data again.
And \u is for Unicode: you are probably saving an emoji or some special character, and those get converted to such escape sequences.
So there is no bug; this is how it works.
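To illustrate the point, here is a small Python sketch (rather than Node.js, but the behaviour is the same in any JSON library): the "\\n" and "\\u" sequences you see are just how the serialized string represents a newline and a non-ASCII character, and parsing the JSON back restores the original text exactly.
import json

original = "line one\nline two – with a non-ASCII dash"

# Serializing escapes the newline as \n and the dash as a \uXXXX sequence
serialized = json.dumps({"description": original})
print(serialized)

# Parsing restores the original characters exactly
restored = json.loads(serialized)["description"]
print(restored == original)  # True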

Finding Related Topics using Google Knowledge Graph API

I'm currently working on a behavioral targeting application and I need a considerably large keyword database/tool/provider that enables applications to reach similar keywords for a given keyword. I recently found Freebase, which had been providing a similar service before Google acquired it and integrated it into their Knowledge Graph. I was wondering if it's possible to get a list of related topics/keywords for a given entity.
import json
import urllib
api_key = 'API_KEY_HERE'
query = 'Yoga'
service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
    'query': query,
    'limit': 10,
    'indent': True,
    'key': api_key,
}
url = service_url + '?' + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())
for element in response['itemListElement']:
    print element['result']['name'] + ' (' + str(element['resultScore']) + ')'
The script above returns the results below, though I'd like to receive topics related to yoga, such as health, fitness, gym and so on, rather than things that have the word "Yoga" in their name.
Yoga Sutras of Patanjali (71.245544)
Yōga, Tokyo (28.808222)
Sri Aurobindo (28.727333)
Yoga Vasistha (28.637642)
Yoga Hosers (28.253984)
Yoga Lin (27.524054)
Patanjali (27.061115)
Yoga Journal (26.635073)
Kripalu Center (26.074436)
Yōga Station (25.10318)
I'd really appreciate any suggestions, and I'm also open to using any other API if there is any that I could make use of. Cheers.
I see your point :) So here's the script I use for that, built on Serpstat's API. Here's how it works:
Script collects the keywords from Serpstat's database
Then, collects search suggestions from Serpstat's database
Finally, collects search suggestions from Google's suggestions
Note that to make the script work correctly, it's preferable to fill in all input boxes, but not all of them are required.
Keyword — required keyword
Search Engine — a search engine for which the analysis will be carried out. For example, for the US Google, you need to set the g_us. The entire list of available search engines can be found here.
Limit sets the maximum number of phrases from the organic search results that will take part in the analysis. You cannot set more than 1000 here.
Default keys — a list of seed keywords. You should give each of them some "weight" so that you still receive some kind of result if something goes wrong.
Format: type, keyword, "weight". Every keyword should be written from a new line.
Types:
w — one word
p — two words
Examples:
"w; bottle; 50" — initial weight of word bottle is 50.
"p; plastic bottle; 30" — initial weight of phrase plastic bottle is 30.
"w; plastic bottle; 20" — incorrect. You cannot use a two-word phrase for the "w" type.
Bad words — comma-separated list of words you want the script to exclude from the results.
Token — here you need to enter your token for API access. It can be found on your profile page.
You can download the source code for the script here.
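I can't reproduce the Serpstat-specific parts here, but as an illustration of the third step (collecting Google's search suggestions), a small Python sketch using the unofficial autocomplete endpoint might look like this; the endpoint is undocumented and may change without notice:
import json
import urllib.parse
import urllib.request

def google_suggestions(seed):
    # Unofficial/undocumented autocomplete endpoint; may change without notice
    url = ("https://suggestqueries.google.com/complete/search?client=firefox&q="
           + urllib.parse.quote(seed))
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        data = json.loads(resp.read().decode(charset))
    return data[1]  # the second element is the list of suggestion strings

print(google_suggestions("yoga"))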

Labelling text using Notepad++ or any other tool

I have several .dat files containing information about hotel reviews, as shown below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the reviews into two classes, pos and neg, i.e. have two folders, pos and neg, containing the files whose reviews score above 3 (positive) and below 3 (negative).
How can I quickly and efficiently automate this process?
You could write a Python script to read the overall score. Do this by looping over the lines using readline(). See here. Find the "Overall" score with some string parsing, then move the file into the right directory. These are all very simple things to do in Python; just break the task down into steps and search for answers to those steps.
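A rough sketch of that approach, assuming one review per .dat file and using placeholder folder names; the tag layout is taken from the question, and reviews scoring exactly 3 are left unclassified:
import os
import shutil

source_dir = "reviews"  # placeholder: folder containing the .dat files
pos_dir = "pos"
neg_dir = "neg"
os.makedirs(pos_dir, exist_ok=True)
os.makedirs(neg_dir, exist_ok=True)

for name in os.listdir(source_dir):
    if not name.endswith(".dat"):
        continue
    path = os.path.join(source_dir, name)
    overall = None
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("<Overall>"):
                try:
                    overall = int(line[len("<Overall>"):].strip())
                except ValueError:
                    pass
                break
    if overall is None:
        continue  # no usable score found; skip the file
    if overall > 3:
        shutil.copy(path, pos_dir)
    elif overall < 3:
        shutil.copy(path, neg_dir)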
Notepad++ can do replacements with regular expressions and allows the definition of macros. You could use them to convert the file to an XML file; check out the help file.
Then you can read it with any scripting language and do what you want.
Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.

Strings in Erlang - what libraries and techniques should I be examining?

I am working on a project that will require internationalisation support down the track. I want to get started on the right foot with UTF support, and I was wondering what the best practice for handling UTF in Erlang is?
From my current research it seems there are a couple of issues with Erlang's built-in string handling for some use cases (JSON parsing being a good example).
I have been looking at Starling and read (somewhere) recently that it is possibly going to be rolled into the standard Erlang release as the UTF 'standard'. Is this true? Are there other libraries or approaches I should be looking at?
From the comments:
EEP (Erlang Enhancement Proposal) 10 details Representing Unicode characters in Erlang
This page:
http://erlang.org/doc/highlights.html
...lists highlights of release 5.7/OTP R13A. Note this passage:
1.2 Unicode support
Support for Unicode is implemented as described in EEP10. Formatting and reading of unicode data both from terminals and files is supported by the io and io_lib modules. Files can be opened in modes with automatic translation to and from different unicode formats. The module 'unicode' contains functions for conversion between external and internal unicode formats and the re module has support for unicode data. There is also language syntax for specifying string and character data beyond the ISO-latin-1 range.
I don't like to make pronouncements on what best practices would be, but I often find it helpful to have a minimal, complete example to start generalizing from. Here's an example of getting UTF-8 into an Erlang application and sending it out again to a different context. Assuming you had a MySQL database with a field in a table containing utf8 characters, here's one way to get it out and pipe it to a web browser as JSON:
hg clone http://bitbucket.org/justin/webmachine/ webmachine-read-only
cd webmachine-read-only
make
./scripts/new_webmachine.erl mywebdemo /tmp
svn checkout http://erlang-mysql-driver.googlecode.com/svn/trunk/ erlang-mysql-driver-read-only
cd erlang-mysql-driver-read-only/src
cp * /tmp/mywebdemo/src
svn checkout http://mochiweb.googlecode.com/svn/trunk/ mochiweb-read-only
cp mochiweb-read-only/src/mochijson2.erl /tmp/mywebdemo/src
cd /tmp/mywebdemo
Edit src/mywebdemo_resource.erl so it looks like this:
-module(mywebdemo_resource).
-export([init/1, to_html/2]).
-include_lib("webmachine/include/webmachine.hrl").

init([]) -> {ok, undefined}.

to_html(ReqData, State) ->
    mysql:start_link(pool_id, "database.host.com", 3306, "db_user", "db_password", "db_name", fun(A, B, C, D) -> ouch end, utf8), %% add your connection string info
    {data, Res} = mysql:fetch(pool_id, "select * from table where IdWhatever = 13"),
    [[_, Utf8Str, _]] = mysql:get_result_rows(Res), %% pattern will need to be altered to match your table structure
    {mochijson2:encode({struct, [{Utf8Str, 100}]}), ReqData, State}.
Build everything and start the url dispatcher:
make
./start.sh
Then execute the following in a web page (or something more convenient, like MozRepl):
var req = new XMLHttpRequest;
req.open('GET', "http://localhost:8000", false);
req.send(null);
eval("(" + req.responseText + ")");
As the previous poster mentioned, the latest release of Erlang supports Unicode natively. If you can't use the latest, though, one thing I usually do is use binaries for string data. It keeps Erlang from mangling the bytes in a list, and it has the side effect of making lists of strings easier to handle as well.
