The Problem
On Google Flights, search information is encoded in a URL parameter, presumably so users can share flight searches with each other easily.
The URL format looks like this:
https://www.google.com/travel/flights/search?tfs=CBwQAhoeagcIARIDSE5MEgoyMDIxLTA5LTEzcgcIARIDU0ZPGh5qBwgBEgNTRk8SCjIwMjEtMDktMTdyBwgBEgNITkxwAYIBCwj___________8BQAFIAZgBAQ
I am trying to write a program that can generate flight search URLs given flight information (origin, destination, flight dates, passengers, etc.). To do this I need to know how the information is encoded in the URL so I can recreate it.
What I've tried
I know that the flight info is encoded in base64 or some variant of it (I've been using base64decode.org for testing). For a round-trip flight from HNL-SFO on 2021-09-13 - 2021-09-17, Google Flights has this URL:
https://www.google.com/travel/flights/search?tfs=CBwQAhoeagcIARIDSE5MEgoyMDIxLTA5LTEzcgcIARIDU0ZPGh5qBwgBEgNTRk8SCjIwMjEtMDktMTdyBwgBEgNITkxwAYIBCwj___________8BQAFIAZgBAQ
The part of the tfs query parameter before the underscores decodes to
jHNL
2021-09-13rSFOjSFO
2021-09-17rHNLp
which contains some (but not all) of the recognizable flight info. What I don't understand is the whitespace between the recognizable information. Using a Unicode character inspector, I learned that the whitespace is a mix of characters:
U+0008 : <control> BACKSPACE [BS]
U+001C : <control> INFORMATION SEPARATOR FOUR {file separator (FS)}
U+0010 : <control> DATA LINK ESCAPE [DLE]
U+0002 : <control> START OF TEXT [STX]
U+001A : <control> SUBSTITUTE [SUB]
U+001E : <control> INFORMATION SEPARATOR TWO {record separator (RS)}
U+006A : LATIN SMALL LETTER J
U+0007 : <control> BELL [BEL]
U+0008 : <control> BACKSPACE [BS]
U+0001 : <control> START OF HEADING [SOH]
U+0012 : <control> DEVICE CONTROL TWO [DC2]
U+0003 : <control> END OF TEXT [ETX]
U+0048 : LATIN CAPITAL LETTER H
U+004E : LATIN CAPITAL LETTER N
U+004C : LATIN CAPITAL LETTER L
...
This suggests that I'm not decoding the data properly. I've tried some other variants of base64, but haven't had any luck.
Does anyone know how this info is encoded? Another thing I haven't been able to figure out is how the information after the underscores (8BQAFIAZgBAQ) is encoded. Based on the behavior of the Google Flights site, I think it encodes passenger information, but it base64 decodes to only whitespace characters.
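For reference, here is the decode I've been running in Python. The string contains - and _, so I decode it with the URL-safe base64 alphabet rather than the standard one; my guess (unconfirmed) is that the resulting bytes are protocol buffer wire format, which would explain the control characters:

import base64

tfs = "CBwQAhoeagcIARIDSE5MEgoyMDIxLTA5LTEzcgcIARIDU0ZPGh5qBwgBEgNTRk8SCjIwMjEtMDktMTdyBwgBEgNITkxwAYIBCwj___________8BQAFIAZgBAQ"

# URL-safe base64 swaps '+' and '/' for '-' and '_'; pad to a multiple of 4.
raw = base64.urlsafe_b64decode(tfs + "=" * (-len(tfs) % 4))
print(raw.hex(" "))
# Starts 08 1c 10 02 1a 1e 6a 07 08 01 12 03 48 4e 4c ("HNL") ... which looks
# consistent with protocol buffer wire-format tags, but I haven't confirmed it.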
Additional Context
Two years ago I made a working version of the program which produced URLs like
https://www.google.com/flights?hl=en#flt=ORD.MCO.2021-07-16*MCO.ORD.2021-07-19;c:USD;e:1;px:2,2,0,0;sd:1;t:f
Several months ago Google changed the format they use from the above to the encoded version. I want to figure out how to recreate the encoded URLs so I can update my program instead of retiring it.
You can have your program output flight URLs in query format using the q URL param. No need to encode/decode the URL.
For example:
https://www.google.com/travel/flights?q=Flights%20to%20SFO%20from%20HNL%20on%202022-09-13%20through%202022-09-17
Which leads to the results page: HNL <> SFO Flight Results
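Building that URL is just string formatting plus percent-encoding; a minimal sketch (flights_query_url is only an illustrative name):

from urllib.parse import quote

def flights_query_url(origin, dest, depart, ret):
    # Natural-language query, percent-encoded into the q parameter.
    q = f"Flights to {dest} from {origin} on {depart} through {ret}"
    return "https://www.google.com/travel/flights?q=" + quote(q)

print(flights_query_url("HNL", "SFO", "2022-09-13", "2022-09-17"))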
I miss having the ability to encode a query, and I have the same question. Nice work finding out it's base64.
I think reverse engineering is the only way to find out how things are encoded. For example, the stuff after the underscores is most likely binary-encoded.
See below for economy:
11101111 10111111 10111101 00010100 00000000 00010100 11101111 10111111 10111101 00011001 11101111 10111111 10111101 00010000 00010000
And the same query, but for business class:
11101111 10111111 10111101 00010100 00000000 00010100 11101111 10111111 10111101 00101001 11101111 10111111 10111101 00010000 00010000
As you can see, the 10th byte goes from 00011001 to 00101001.
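Incidentally, the repeated pattern 11101111 10111111 10111101 (EF BF BD) is the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, which means the dump above already went through a lossy text decode. Comparing the raw decoded bytes avoids that; a rough sketch (economy_tfs and business_tfs stand in for the two tfs values from the URLs):

import base64

def tfs_bytes(tfs):
    # Decode with the URL-safe alphabet and restore the stripped padding.
    return base64.urlsafe_b64decode(tfs + "=" * (-len(tfs) % 4))

# economy_tfs and business_tfs are placeholders for the two URL values.
for i, (e, b) in enumerate(zip(tfs_bytes(economy_tfs), tfs_bytes(business_tfs))):
    if e != b:
        print(f"byte {i}: {e:#04x} -> {b:#04x}")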
Related
Using Node.js and iconv-lite to create an HTTP response file in XML with charset windows-1252, the file -i command cannot identify it as windows-1252.
Server side:
const ICONVLITE = require('iconv-lite'); // iconv-lite does the encoding

r.header('Content-Disposition', 'attachment; filename=teste.xml');
r.header('Content-Type', 'text/xml; charset=iso8859-1');
r.write(ICONVLITE.encode(`<?xml version="1.0" encoding="windows-1252"?><x>€Àáção</x>`, "win1252")); // euro symbol and Portuguese accented vowels
r.end();
The browser downloads the file, and then I check it in Ubuntu 20.04 LTS:
file -i teste.xml
/tmp/teste.xml: text/xml; charset=unknown-8bit
When I use gedit to open it, the accented vowels appear fine but the euro symbol does not (all characters from 128 to 159 get messed up).
I checked in a Windows 10 VM and there everything looks fine. In both Windows and Linux web browsers it also displays fine.
So, is it a problem with the file command? How can I check the right charset of a file in Linux?
Thank you
EDIT
The result file can be downloaded here.
2nd EDIT
I found one error! The code line:
r.header('Content-Type', 'text/xml; charset=iso8859-1');
must be:
r.header('Content-Type', 'text/xml; charset=Windows-1252');
It's important to understand what a character encoding is and isn't.
A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.
For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000. That same string of bits will mean other things in other encodings - most encodings assign some meaning to all 256 possible bytes.
If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.
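You can see the same lookup-table idea in a couple of lines of Python:

# The single byte 0x80 means different things under different lookup tables:
print(b"\x80".decode("windows-1252"))   # €
print(repr(b"\x80".decode("latin-1")))  # '\x80', an invisible C1 control
print("€".encode("utf-8"))              # b'\xe2\x82\xac', three bytes in UTF-8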
Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at the file can see that it contains the string of bits 10000000, but it doesn't know which mapping table to look that up in. Nothing you do in the HTTP headers is going to change that; none of them affect how the file is saved on disk.
In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so output "unknown-8bit".
I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure - an Archive wrapper with multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but am having trouble parsing the data due to illegal characters and random CR/space/tab leading and trailing data.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record and use the Date entry to define a unique file name, then write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
Issue 1: illegal characters in XML Data records
One issue is that the elements contain characters that XML libraries like ElementTree terminate on, including control characters, formatting characters, and various Alt+XXX type characters.
I've searched online and found all manner of workarounds, regexes, and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, that doesn't even always work because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Issue 2: Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of empty Data tags (one for each data record in the file) like
Data {}
Data {}
Questions
Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc. on pre-processing the XML, and nothing seems to really work; there's always one more control character that hangs the parser.
SE suggested a post about cleaning XML which references an old Atlassian tool (Stripping Invalid XML characters in Java). I did some basic tests and it seems like it might work, but I'm open to other suggestions.
I have not used regex with Python much. Any suggestions on how to handle cleaning the leading/trailing CR/space/tab randomness in the Data tags? The actual record string I want in that Data tag also has a CR at the end and may contain tabs, so I can't just search and replace. Maybe there is a regex way to pull that out, but my regex-fu is pretty weak.
For my issues 1 and 2, I kind of solved my own problem:
Issue 1 (parsing and invalid characters)
I ran the entire set of files through the Atlassian jar referenced in (Stripping Invalid XML characters in Java) with a batch script:
for %%f in (*.xml) do (
java -jar atlassian-xml-cleaner-0.1.jar %%f > clean\%%~f
)
This utility standardized all of the XML files and made them parseable by lxml.
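For anyone who wants to stay in pure Python, an equivalent pre-processing pass (a sketch based on the XML 1.0 valid-character ranges, not the Atlassian tool's exact behavior) would be:

import re

# Anything outside the XML 1.0 valid-character ranges is illegal in a document.
ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def clean_xml_text(raw):
    # Decode permissively first, then drop the illegal characters.
    return ILLEGAL_XML_CHARS.sub("", raw.decode("utf-8", errors="replace"))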
Issue 2 (CR, spaces, tabs inside the Data element)
This configuration for lxml stripped all whitespace and handled the invalid-character issue:
from lxml import etree
parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse(filepath, parser=parser)
With these two steps I'm now able to start extracting records and writing them to individual files:
# for each date, finding the next item gives me the Data element and I can strip the tab/CR/whitespace:
for item in tree.findall('Date'):
    dt = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
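From there, writing each record to its own file could look something like this (a sketch only: parse_datestamp is my own helper, and the exact filename scheme follows the examples in the question):

for n, item in enumerate(tree.findall('Date'), start=1):
    dt = parse_datestamp(item.text.strip())   # my own date-formatting helper
    content = item.getnext().text.strip()     # the paired Data element
    # Filename scheme assumed from the examples above.
    with open(f"vendor-1-record-{n}-{dt}.txt", "w", encoding="utf-8") as out:
        out.write(content + "\n")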
I am using a Postgres database. I have a table called names with a column named 'info' of type json. I am adding:
{ "info": "security", "description": "Sednit update: Analysis of Zebrocy: The Sednit group \u2013 also known as APT28, Fancy Bear, Sofacy or STRONTIUM \u2013 is a group of attackers operating since 2004, if not earlier, and whose main objective is to steal confidential information from specific targets.\n\nToward the end of 2015, we started seeing a new component deployed by the group; a downloader for the main Sednit backdoor, Xagent. Kaspersky mentioned this component for the first time in 2017 in their APT trend report and recently wrote an article where they quickly described it under the name Zebrocy.\n\nThis new component is a family of malware, comprising downloaders and backdoors written in Delphi and AutoIt. These components play the same role in the Sednit ecosystem as Seduploader; that of first-stage malware." }
Here I am using Node.js, with Sequelize as the ORM. When I save it in the table, I see "\\n" for "\n" and "\\u" for "\u". Can anyone help me avoid escaping characters when saving to the table?
You see "\\n" for "\n" and "\\u" for "\u" because in your JSON the description is of type string, so the serializer converts each newline to the escape sequence \n. That is the default behavior; otherwise you would not get the newline back when you fetch the data again.
And \u is for Unicode: you might be saving a smiley or another special character, and it gets converted to such a sequence.
So there is no bug; this is how JSON serialization works.
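You can watch the round trip yourself; here it is with Python's json module (JSON.stringify in Node behaves the same way):

import json

record = {"description": "line one\nline two \u2013 dash"}
stored = json.dumps(record)        # the serialized text, with escapes
print(stored)                      # {"description": "line one\nline two \u2013 dash"}
print(json.loads(stored)["description"])  # the real newline and dash come back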
I am using the XML-INTO op-code to parse a web service request. Every now and then I get errors in the logs
(RNX0351 - "The XML parser detected error code 302").
The help for a 302 is
302 The parser does not support the requested CCSID value or
the first character of the XML document was not '<'
To the best of my knowledge, the first character is "<", and the request is generated from a previous web service call, so I would be very surprised if the CCSID had changed.
The error is repeatable for the specific query, so it is almost certainly data related; I am just unsure how I would go about identifying the offending item.
Any thoughts on how to determine the issue, or better yet, how to overcome it?
cheers
CCSID is an AS400/iSeries/Power System attribute, and it applies to the whole IFS. It's like a declaration of what is inside the file, or in other words, what its internal encoding "should be".
The data encoding inside the file and the file's own CCSID (the envelope) are supposed to match, and the system uses this attribute to display and handle the corresponding characters.
It sounds like you receive data in one encoding, but the file's CCSID doesn't match it.
Try changing the CCSID on your file (only the envelope), e.g. 37 (American), 500 (Latin-1), 819 (UTF-8), 850 (DOS), 1252 (Windows), and display the file afterwards. You can check first using ls -Sla yourfile in QSH or QP2TERM, or EDTF as well. CHGATTR allows you to change the CCSID, as does setccsid in QSH (again).
This approach helped me find related issues. Remember that although data may be visible on the four hundred, it may not be visible through a shared folder in Windows. That means the file's CCSID and the content encoding don't match.
Hope it helps.
Hi, I've seen this error with XML data uploaded to AS400/iSeries/IBM i via FTP with CCSID 819 (ISO 8859-1 ASCII), where the file had some binary garbage in its first few positions. Changing the encoding to CCSID 1208 (UTF-8 with IBM PUA) using FTP "quote type c 1208" cleared the problem and XML-INTO was successful.
So, my suggestion for XML parser error 302 received when using XML-INTO is to look at the file (wrklnk ...), and if the first character is not "<" but instead some binary garbage, then try CCSID 1208 for UTF-8.
The statements in this answer about what CCSID 819 is and which CCSID represents UTF-8 do not agree with the previous answer, but they are correct according to IBM documentation:
https://www-01.ibm.com/software/globalization/ccsid/ccsid819.html
https://www-01.ibm.com/software/globalization/ccsid/ccsid1208.html
I worked on this problem for a couple of hours;
for me the solution was to use the option ccsid=UCS2 when using a data structure or variable to store the XML.
Something like this:
XML-INTO customer %XML( xmlSource : 'ccsid=UCS2');
My program runs with CCSID 870, and no CCSID conversion on the xmlSource field worked.
The strange thing is that when I use a file with CCSID 850, everything works fine.
I mention this because this is the first page you find when searching for this problem.
Maybe this helps someone.
On my page I have set the meta description tag, which Google then uses. That all works normally, but for some reason Google adds a space after one character. Here is my meta description:
Igraj brezplačne online igre! Izbiraj med več kot 6.000 igrami! Vsak dan dodajamo nove igre! Igraj zdaj!
Yes, it's not in English, but that's normal :D
And here is how Google shows it:
Igraj brezplač ne online igre! Izbiraj med več kot 6.000 igrami! Vsak dan dodajamo nove igre! Igraj zdaj!
The problem is the first 'č' character: Google adds a space after it. Check Google's rendering of it:
https://www.google.si/search?q=bringler&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
I have no idea why it does this. And the funny part is that after the second 'č' everything is OK; no space is added.
Any ideas?
Your content (the actual one in the meta-description, not the one in your question) contains a hidden control character: U+008D REVERSE LINE FEED
You can see it if you analyze the characters in the string, e.g. with Rishida’s String analyser: analyze "brezplačne"
If you copy the string directly from your meta-description and search for it in Google, it converts it to brezplaÄ Â ne.
So, replace the string "brezplačne" (note that Stack Overflow removes this hidden character, so these strings are actually the same here) in your content with "brezplačne" and you should be fine (when Google visits your page again, in some time).
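If you want to strip such hidden control characters programmatically, here is a small Python sketch (Unicode category "Cc" covers the C0/C1 controls like U+008D):

import unicodedata

s = "brezpla\u010d\u008dne"  # 'č' followed by the hidden U+008D
cleaned = "".join(ch for ch in s if unicodedata.category(ch) != "Cc")
print(cleaned)  # brezplačne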