Indexing HTML in Elasticsearch via python3 - python-3.x

I am new to Elasticsearch. I have to index many HTML files via python3. I've seen many examples of adding info into Elasticsearch, but couldn't find anything appropriate for my case. Can I index HTML files without extracting all their information into JSON format? I've seen some examples of indexing PDFs into Elasticsearch via PHP using an ingest pipeline, but could not find anything similar for Python.

What do you mean by indexing HTML files to Elasticsearch? What kind of information do you want to send to Elasticsearch?
Yes, it's definitely possible, but give a bit more detail about what you want to send to Elasticsearch (full HTML pages, only the name, certain information from the HTML files, etc.).

Here is a sample of a class that might be handy for you.
import json
import requests

# ELK credentials
ELK_HOST = "[hostname]"
ELK_USER = "[elastic_user]"
ELK_PASSWORD = "[elastic_password]"

HEADERS = {
    'host': '[put hostname again if using redirects ;)]',
    'Content-Type': 'application/json',
}

class ElasticSearch():
    def __init__(self, host, user, password):
        self._host = host
        self._user = user
        self._password = password
        self._auth = (self._user, self._password)

    def update_index(self, index, data):
        endpoint = str(index) + "/doc/"
        uri = self._host + "/" + endpoint
        _data = json.dumps(data)  # serialize the dict to a JSON string
        response = requests.post(uri, headers=HEADERS, auth=self._auth, data=_data)
        return response

es = ElasticSearch(ELK_HOST, ELK_USER, ELK_PASSWORD)
# some random data
data = {"test1": 1, "test2": 2}
# update the index (if it doesn't exist, a new one will be created)
es.update_index("testindex", data)
Hope this will help you!
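If the goal is many HTML files, Elasticsearch can also ingest them in a single `_bulk` request, and the raw HTML can simply be stored in a text field with no JSON extraction step. A minimal sketch that only builds the NDJSON payload (the index name `html-files` and the field names are my own illustration, not from the post):

```python
import json

def build_bulk_body(index, docs):
    # the _bulk API wants NDJSON: one action line, then one source line, per document
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

docs = [{"filename": "page1.html", "content": "<html><body>hello</body></html>"}]
body = build_bulk_body("html-files", docs)
# POST `body` to http://<host>:9200/_bulk with Content-Type: application/x-ndjson
print(body)
```

In practice you would read each file's text from disk and POST the payload with the same `requests` auth the class above uses.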

Related

receive the data passed in the body of the request API Django Python

I want to get the data passed in the request body, like this: https://i.stack.imgur.com/V5s1e.png
My function looks like:
def test(request):
    # I want to receive the data from the body here, like:
    data = functionRequestBody
    print(data)
    return HttpResponse()
What information are you looking for? Try request.body, request.session, etc. If your app has users, you can get the current user's info through request.user: e.g. request.user.pk returns the DB entry's primary key, and request.user.username returns the username.
https://www.javatpoint.com/django-request-and-response
The answer is request.body.
I found a better answer in another Stack Overflow post:
body_unicode = request.body.decode('utf-8')
data = json.loads(body_unicode)  # use a new variable rather than reassigning `request`
(Warning: end your endpoint URL with a "/", otherwise it won't work.)
Thanks Raphael for your comment!
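To make that concrete: request.body is a bytes object, so decode it and then json.loads it. A small helper (the function name is mine) you can call from any view:

```python
import json

def parse_json_body(raw_body):
    """Decode a JSON request body (what Django exposes as request.body,
    a bytes object) into a Python dict."""
    return json.loads(raw_body.decode('utf-8'))

# inside a Django view you would call: data = parse_json_body(request.body)
data = parse_json_body(b'{"name": "widget", "qty": 3}')
print(data["qty"])  # 3
```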

Problems loading images from bytes in Flutter/Dart

I am building an application with Flutter, using a mix of newer technologies that I am struggling to piece together.
My actual issue is that I cannot get images to load in the app when using the following methods:
Insert test data into MongoDB using Mongoengine, images are inserted into GridFS using this method.
Query GraphQL server for data and retrieve data and receive images from GridFS in form of bytes but in a string - e.g "b'/9j/4AAQSkZJRgABAQEASABIAAD/4V0X .... '"
Use that bytes string to load images in app with something like Image.memory()
But the images fail to load with an error: type String is not a subtype of type Uint8List. So I tried to convert the string of bytes from a String to raw bytes by doing:
List<int> bytesList = imageData['image']['data'].codeUnits;
Uint8List thumbImageBytes = Uint8List.fromList(bytesList);
I get the following exception:
I/flutter ( 4303): ══╡ EXCEPTION CAUGHT BY IMAGE RESOURCE SERVICE ╞══════
I/flutter ( 4303): The following _Exception was thrown while resolving an image:
I/flutter ( 4303): Exception: Could not instantiate image codec.
I have no idea how to fix this and cannot seem to find anything by googling. There seems to be no information available for this exact scenario except for this S.O. question, which is what I have tried to follow. I have also tried all the methods suggested in the comments, followed the suggested links, and tried all available combinations of answers and comments.
My set up is as follows;
Main app: Flutter/Dart
API Server: Python/Flask/Mongoengine based GraphQL/Graphene API
Backend Database: MongoDB
The Python side;
A Mongoengine Document model:
class Product(Document):
    meta = {'collection': 'product'}
    name = StringField(unique=True)
    price = FloatField()
    sale_price = FloatField()
    description = StringField()
    image = FileField()
    thumb = FileField()
    created_at = DateTimeField(default=datetime.utcnow)
    edited_at = DateTimeField()
    # user = ReferenceField(User)

    def __repr__(self):
        return f'<Product Model::name: {self.name}>'
A Graphene schema for the model:
class ProductAttribute:
    name = graphene.String()
    price = graphene.Float()
    sale_price = graphene.Float()
    description = graphene.String()
    image = Upload()
    thumb = graphene.String()
    created_at = graphene.DateTime()
    edited_at = graphene.DateTime()

class Product(MongoengineObjectType):
    """Product node."""
    class Meta:
        model = ProductModel
        interfaces = (graphene.relay.Node,)

class CreateProductInput(graphene.InputObjectType, ProductAttribute):
    """Arguments to create a product."""
    pass

class CreateProduct(graphene.Mutation):
    """Create a product."""
    product = graphene.Field(lambda: Product, description="Product created by this mutation.")

    class Arguments:
        input = CreateProductInput()
        image = Upload(required=True)

    def mutate(self, info, image, input):
        data = utils.input_to_dictionary(input)
        data['created_at'] = datetime.utcnow()
        data['edited_at'] = datetime.utcnow()
        print(data)
        product = ProductModel(**data)
        product.save()
        return CreateProduct(product=product)
My base Graphene schema:
class Query(graphene.ObjectType):
    """Query objects for GraphQL API."""
    node = graphene.relay.Node.Field()
    single_product = graphene.relay.Node.Field(schema_product.Product)
    all_products = MongoengineConnectionField(schema_product.Product)

class Mutations(graphene.ObjectType):
    createProduct = schema_product.CreateProduct.Field()

schema = graphene.Schema(query=Query, types=[schema_product.Product], mutation=Mutations)
It's very suspicious that:
Your binary data is stored in a String (this is usually wrong).
Your String happens to be composed entirely of printable characters.
It's therefore likely that you're getting back binary data that's been encoded to a printable string. A common encoding is base64, and sure enough, when I tried to base64-encode a few different types of images, I see that base64-encoding JPEG files generates a string that starts off with /9j/4AAQSk, just like the string you get.
You are definitely getting back base64-encoded data. If you aren't doing the encoding yourself, then something is automatically doing the encoding for you, and likely there's a symmetric mechanism to decode it for you. If not, you'll need to explicitly base64-decode your String to get back binary data. You can use dart:convert to decode it for you.
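You can verify the base64 diagnosis yourself. A quick sanity check (in Python, mirroring the server side, since the encoding happens there): decoding the prefix of the question's string yields FF D8 FF, which is exactly the JPEG start-of-image marker.

```python
import base64

# prefix copied from the question's string, without the b'...' wrapper
encoded = "/9j/4AAQSkZJRgABAQEASABIAAD"
# pad to a multiple of 4 so the decoder accepts the truncated prefix
padded = encoded + "=" * (-len(encoded) % 4)
raw = base64.b64decode(padded)
print(raw[:3].hex())  # ffd8ff -> the JPEG start-of-image marker
```

On the Dart side, strip the `b'...'` wrapper from the string and pass the remainder to `base64Decode` from `dart:convert`, which returns the `Uint8List` that `Image.memory()` expects.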

python3, Trying to get an output from my function I defined, need some guidance

I found a pretty cool ASN API tool that lets me supply an AS number, and it will go out and pull down the subnets that relate to that ASN.
Here is rough but partial code. I am defining a function asnfinder that takes ASNNUMBER (which I will supply through another file).
When I call the URL here, it just gives me an n...
What I'm trying to do is append str(ASNNUMBER) to the end of the ?q= parameter in the URL.
Once I do that, I'd like to display my results and write them to a file.
import requests

def asnfinder(ASNNUMBER):
    print('n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
The result I'd like is the output of the GET request I'm performing; instead, all I currently get is:
## Running ASNFinder
n
Try writing something like this:
import requests

def asnfinder(ASNNUMBER):
    print('\n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
    data = response.text
    print(data)
    with open('filename', 'w') as f:  # 'w' (write mode), not 'r', or f.write() will fail
        f.write(data)
It should work fine.
P.S. If it helped you, please make sure you mark this as the answer :)
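A variant of the same idea that returns the text instead of only printing it, and separates the file write so it is easy to test. This is a sketch using only the standard library's urllib (so there is no extra dependency); the output filename is illustrative:

```python
import urllib.request

def asnfinder(asn_number):
    """Fetch the subnets for an AS number from hackertarget's aslookup
    endpoint (the same URL used in the question) and return the text."""
    url = 'https://api.hackertarget.com/aslookup?q=' + str(asn_number)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode('utf-8')

def save_results(text, path):
    # open in write mode ('w'); read mode would make f.write fail
    with open(path, 'w') as f:
        f.write(text)

# usage sketch: save_results(asnfinder(15169), 'asn_results.txt')
```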

Google Translate API : Multiple input texts - Python

I am struggling to find a way to input multiple texts in Google Translate API.
My setup includes the following things.
Using urllib.request.build_opener (Python3)
Google Translate API https://translation.googleapis.com/language/translate/v2
I know that we can pass multiple parameters (multiple "q" values), but I don't know how to do that in Python.
I referred to the Google Translate documentation and found this.
My question:
How do I add multiple texts to the input? The following code makes no sense to me, since a Python dict cannot hold the key 'q' more than once:
data = {'q':'cat', 'q':'dog','source':source,'target':target,'format':'html'}
This is my code.
data = {'q':'This is Text1', 'q':'This is Text2', 'q':'This is Text3', 'source':source,'target':target,'format':'html'}
_req = urllib.request.Request("https://translation.googleapis.com/language/translate/v2?key="+API_KEY)
_req.add_header('Content-length', len(data))
_req.data = urllib.parse.urlencode(data).encode("utf-8")
response = Connector._get(_req,_session)
Connector._get() is in some other file and it internally calls urllib.request.build_opener with data.
Thanks!
To post multiple parameters (with the same name) in Python for an HTTP request, you can use a list for the values. They'll be added to the URL like q=dog&q=cat.
Example:
headers = { 'content-type': 'application/json; charset=utf-8' }
params = {'q': ['cat', 'dog'],'source':source,'target':target,'format':'html'}
response = requests.post(
    "https://translation.googleapis.com/language/translate/v2?key=",
    headers=headers,
    params=params,
)
Specifically, params = {'q': ['cat', 'dog']} is relevant to your question.
I have not tested this myself, but it seems you should build the data string yourself and pass it as the data argument to your urllib.request method. So something like data = "{\n 'q':{}\n 'q':{} ...".format(qstr, qstr, ...).
To make it less painful to have several q values, you could write a loop and build the string with += operations.
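Staying with the asker's urllib-based setup, no manual string building is needed: urlencode accepts a list value and, with doseq=True, expands it into repeated q= parameters. A minimal sketch (the source/target values here are just examples; the post reads them from variables):

```python
from urllib.parse import urlencode

params = {
    'q': ['This is Text1', 'This is Text2', 'This is Text3'],
    'source': 'en',   # example value
    'target': 'de',   # example value
    'format': 'html',
}
# doseq=True expands the list into q=...&q=...&q=...
body = urlencode(params, doseq=True).encode('utf-8')
print(body)
```

The resulting bytes can be assigned to `_req.data` exactly as in the question's code.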

Python 3.6 Downloading .csv files from finance.yahoo.com using requests module

I was trying to download a .csv file from this url for the history of a stock. Here's my code:
import requests
r = requests.get("https://query1.finance.yahoo.com/v7/finance/download/CHOLAFIN.BO?period1=1514562437&period2=1517240837&interval=1d&events=history&crumb=JaCfCutLNr7")
file = open(r"history_of_stock.csv", 'w')
file.write(r.text)
file.close()
But when I opened the file history_of_stock.csv, this is what I found:
{
    "finance": {
        "error": {
            "code": "Unauthorized",
            "description": "Invalid cookie"
        }
    }
}
I couldn't find anything to fix my problem. I found this thread in which someone has the same problem, except in C#: C# Download price data csv file from https instead of http
To complement the earlier answer and provide concrete, complete code, I wrote a script that fetches historical stock prices from Yahoo Finance. I tried to write it as simply as possible. In summary: for many sites you can requests.get a URL without worrying about crumbs or cookies, but Yahoo Finance requires both. Once you have the cookie, you are good to go. Make sure to set a timeout on the requests.get call.
import re
import requests
import sys

symbol = sys.argv[-1]
start_date = '1442203200'  # start date timestamp
end_date = '1531800000'    # end date timestamp

crumble_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{}?period1={}&period2={}&interval=1d&events=history&crumb={}'

link = crumble_link.format(symbol)
session = requests.Session()
response = session.get(link)

# get the crumb
text = str(response.content)
match = re.search(crumble_regex, text)
crumb = match.group(1)

# the session now holds the cookies Yahoo set
cookie = session.cookies.get_dict()

url = quote_link.format(symbol, start_date, end_date, crumb)
r = requests.get(url, cookies=cookie, timeout=5, stream=True)

filename = '{}.csv'.format(symbol)
with open(filename, 'w') as f:
    f.write(r.text)
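The hard-coded period1/period2 values in the script above are Unix timestamps. A small helper (the function name is mine) converts calendar dates into the string form the download URL expects:

```python
from datetime import datetime, timezone

def to_period(date_str):
    """Convert 'YYYY-MM-DD' to the Unix-timestamp string that the
    period1/period2 query parameters expect (midnight UTC)."""
    dt = datetime.strptime(date_str, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    return str(int(dt.timestamp()))

print(to_period('2018-07-17'))  # 1531785600
```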
There was a service for exactly this, but it was discontinued.
You can still do what you intend, but first you need to get a cookie. On this post there is an example of how to do it.
Basically, you first make a throwaway request to obtain the cookie; with that cookie in place, you can then query whatever you actually need.
There's also a post about another service which might make your life easier.
There's also a Python module to work around this inconvenience and code to show how to do it without it.
