What encoding are files after being dumped by Nutch?

I have been using the readseg command to dump data after crawling with Nutch, but I have been having encoding issues. What encoding are the files in after they are dumped by Nutch?

The HTML content is still in its original encoding. Starting with Nutch 1.17 it can optionally be converted to UTF-8, see NUTCH-2773: you need to set the property segment.reader.content.recode to true. Of course, this will not work for binary document formats.
All other data (metadata, extracted plain text) is always encoded in UTF-8 when segments are dumped.
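For example, the property can be enabled in conf/nutch-site.xml before dumping the segment (a minimal sketch using the standard Nutch/Hadoop property format; adjust to your own configuration):
<property>
  <name>segment.reader.content.recode</name>
  <value>true</value>
</property>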

Related

Go generates incorrect UTF-8 chars like ö

I'm building a Go application where I want to output a CSV string from a buffer via the HTTP server.
I'm writing it into the CSV buffer:
var buffer bytes.Buffer
resp := csv.NewWriter(&buffer)
resp.Write([]string{"Schröder"}) // csv.Writer.Write takes a []string record
Then I output it via the HTTP server:
resp.Flush() // flush the csv.Writer into the underlying buffer
w.Header().Set("Content-Type", "text/csv; charset=utf-8")
w.Write(buffer.Bytes())
When I then open my URL, a CSV file is downloaded and opened by Excel. In that Excel sheet the field value shows up as "Schröder".
Any idea? I've already been stuck on this for a week.
The problem is not in Go but in Excel. The information that the data is encoded in UTF-8 is lost when the file is saved, since there is no such thing as an encoding attribute on saved files.
Thus Excel just sees the plain data and has no information about the encoding. There are several tricks to make Excel guess correctly, like placing the proper byte order mark (BOM) at the start of the file; see "Is it possible to force Excel recognize UTF-8 CSV files automatically?". But just specifying charset=utf-8 within the HTTP Content-Type header will not help, since Excel never gets this information.
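To illustrate the BOM trick, here is a minimal sketch in Node.js (the same idea applies in Go: write the bytes 0xEF 0xBB 0xBF before the CSV data); the payload and port are just placeholders:
const http = require("http");

http.createServer((req, res) => {
  const csv = "name\nSchröder\n";
  res.setHeader("Content-Type", "text/csv; charset=utf-8");
  // Prepend the UTF-8 byte order mark so Excel detects the encoding.
  res.end("\ufeff" + csv);
}).listen(8080);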

Convert binary content of a PDF file to JSON format using Node.js

We want JSON format from the binary content of a PDF file using Node.js.
We are getting the binary content of the PDF from a third-party API response and will save it in our database, so please give me working code to convert the binary PDF format to JSON format.
In simple words:
Please let us know, "any working code so I can just pass binary data and get JSON data".
The JSON format doesn't natively support binary data.
Use Base64 or Base85.
I think the best you can do space-wise is Base85, which represents four bytes as five characters. However, this is only a ~7% improvement over Base64, it's more expensive to compute, and implementations are less common than for Base64, so it's probably not a win.
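A minimal Node.js sketch of the Base64 approach (the file path and field names are just placeholders):
const fs = require("fs");

// Binary PDF content, e.g. the body returned by the third-party API.
const pdfBuffer = fs.readFileSync("example.pdf");

// Wrap the Base64-encoded bytes in a JSON document.
const json = JSON.stringify({ filename: "example.pdf", data: pdfBuffer.toString("base64") });

// Later, restore the original binary content from the JSON.
const restored = Buffer.from(JSON.parse(json).data, "base64");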

Encoding data to ISO_8859_1 in BigQuery using pyspark

I have multi-language characters in my pyspark dataframe. After writing the data to BigQuery it shows me strange characters because of its default encoding scheme (UTF-8).
How can I change the encoding in BigQuery to ISO-8859-1 using pyspark / Dataproc?
There was an issue in the source file itself, as it comes through an API; hence I was able to resolve the issue.
First, check at the source or source system how it is sending the data and understand which encoding it uses. If it still differs, do the following investigation.
AFAIK pyspark reads the JSON with UTF-8 encoding and loads it into BigQuery as per your comments, so it's not BigQuery's fault, as its default is UTF-8.
You can change the encoding to ISO-8859-1 and load the JSON like below:
spark.read.option("encoding", "ISO-8859-1").json("your_json_path_with_latin-1_data")
and then load it into BigQuery.
Also, while writing the dataframe to BigQuery, you can test/debug with the decode function, passing the column and charset in both ISO-8859-1 and UTF-8, to understand where it is going wrong:
pyspark.sql.functions.decode(columnname, charset)
This lets you see whether the column can be decoded to UTF-8 or not, and you can also write the dataframe with pyspark.sql.functions.decode(col, charset) applied.

Creating a file in node.js, using an encoding (CP437 / IBM) which is not part of the supported standard node encodings [ascii/base64/latin1/...]

I am processing files with different encoding types.
Right now, any encoded file is transformed to UTF-8 and saved to my SQL DB.
My goal is to generate new files with the same encoding as the original data.
I am able to decode hex as CP437/IBM but unable to write the resulting string to a file while maintaining the desired encoding.
decodedString = cptable.utils.decode(437, myHexString);
fs.appendFile(filename, decodedString, options.encoding, (err) => {
  console.log("please help me");
});
The result is a file with faulty encoding.
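One way around this (a hedged sketch, assuming the iconv-lite package is installed and supports cp437): re-encode the string back to CP437 bytes and append the raw Buffer, so Node does not apply any string encoding of its own:
const fs = require("fs");
const iconv = require("iconv-lite");

// decodedString is the text previously decoded from CP437.
const cp437Buffer = iconv.encode(decodedString, "cp437");
fs.appendFile(filename, cp437Buffer, (err) => {
  if (err) console.error(err);
});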

Node.js Buffer and Encoding

I have an HTTP endpoint where the user uploads a file. I need to read the file contents and then store them in a DB. I can read the file into a Buffer and get a string from it.
The problem is, when the file content is not UTF-8 I can see "strange" symbols in the output string.
Is it possible somehow to detect the encoding of the Buffer contents and serialise it to a string correctly?
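There is no fully reliable way to know the encoding of an arbitrary byte stream, but a detection library can give a good guess. A hedged sketch, assuming the jschardet and iconv-lite packages are installed:
const jschardet = require("jschardet");
const iconv = require("iconv-lite");

function bufferToString(buf) {
  // Guess the encoding from the raw bytes, e.g. { encoding: "windows-1252", confidence: 0.9 }.
  const guess = jschardet.detect(buf);
  const encoding = guess.encoding || "utf-8";
  // Decode the bytes into a JavaScript string using the guessed encoding.
  return iconv.decode(buf, encoding);
}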
