I have some data coming through a web service that was described as base64 encoded.
Example: AgAOAAAAQQEA3AcKDhIyCNwHCg4SMgyYIzSWoACP1T2TRRw1MTExMDUwMTE2ICAAAAAAAAAAAAAA3AAjU1QsKzAyMjEuMGxiDQo=
However, attempting to decode this isn't coming up with the results I would have expected:
>>> base64.b64decode('AgAOAAAAQQEA3AcKDhIyCNwHCg4SMgyYIzSWoACP1T2TRRw1MTExMDUwMTE2ICAAAAAAAAAAAAAA3AAjU1QsKzAyMjEuMGxiDQo=')
'\x02\x00\x0e\x00\x00\x00A\x01\x00\xdc\x07\n\x0e\x122\x08\xdc\x07\n\x0e\x122\x0c\x98#4\x96\xa0\x00\x8f\xd5=\x93E\x1c5111050116 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xdc\x00#ST,+0221.0lb\r\n'
It looks like the end of the decoded string is kinda-sorta along the lines of what I'm looking for. It should theoretically be transformable to something resembling MT=2012-10-14 18:50:08, TT=2012-10-14 18:50:12, BT=00:A0:96:34:23:98, SN=5111050116 , BL=6.30V, S/H=4/3, Weight=221.0lb(100.24kg) but I can't figure out what's going on with the encoding here.
Here's what I have so far. I probably need more info to decode everything, but here it goes:
>>> t = base64.b64decode('AgAOAAAAQQEA3AcKDhIyCNwHCg4SMgyYIzSWoACP1T2TRRw1MTExMDUwMTE2ICAAAAAAAAAAAAAA3AAjU1QsKzAyMjEuMGxiDQo=')
Datetime fields MT and TT in order are:
>>> print int(t[9:11][::-1].encode("hex"), 16), int(t[11].encode("hex"), 16), int(t[12].encode("hex"), 16), int(t[13].encode("hex"), 16), int(t[14].encode("hex"), 16), int(t[15].encode("hex"), 16)
2012 10 14 18 50 8
>>> print int(t[16:18][::-1].encode("hex"), 16), int(t[18].encode("hex"), 16), int(t[19].encode("hex"), 16), int(t[20].encode("hex"), 16), int(t[21].encode("hex"), 16), int(t[22].encode("hex"), 16)
2012 10 14 18 50 12
BT is below; you just have to add a ':' every two characters:
>>> t[23:29][::-1].encode("hex")
'00a096342398'
SN is:
>>> t[35:47]
'5111050116 '
Weight is:
>>> t[63:72]
'+0221.0lb'
Sorry, but I don't have any idea at the moment how the rest is stored, and since I don't know what the ranges on those fields might be either, I really have no way of decoding them. Let me know if you can disclose a bit more information about what those fields should store.
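For reference, here is the same field layout pulled out with struct in Python 3. This is only a sketch based on the offsets identified above; the remaining bytes are still undecoded:
import base64
import struct

raw = base64.b64decode('AgAOAAAAQQEA3AcKDhIyCNwHCg4SMgyYIzSWoACP1T2TRRw1MTExMDUwMTE2ICAAAAAAAAAAAAAA3AAjU1QsKzAyMjEuMGxiDQo=')

# MT and TT: little-endian uint16 year followed by single bytes for
# month, day, hour, minute, second
mt = struct.unpack_from('<H5B', raw, 9)    # (2012, 10, 14, 18, 50, 8)
tt = struct.unpack_from('<H5B', raw, 16)   # (2012, 10, 14, 18, 50, 12)

# BT: six bytes stored in reverse order
bt = ':'.join('{:02x}'.format(b) for b in raw[23:29][::-1])  # '00:a0:96:34:23:98'

# SN and the weight field are plain ASCII
sn = raw[35:47].decode('ascii').strip()    # '5111050116'
weight = raw[63:72].decode('ascii')        # '+0221.0lb'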
I have an array of Polygons. I need to convert the array into a MultiPolygon.
["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
I am using the following code to convert the polygons into a MultiPolygon using Apache Sedona:
select FID,ST_Multi(ST_GeomFromText(collect_list(polygon))) polygon_list group by 1
I am getting an error like "org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.unsafe.types.UTF8String". How can I overcome this issue? Can the same thing be achieved using GeoPandas or Shapely?
The answer given by @Antoine B is a very good attempt, but it won't work with polygons that have holes in them. There is another approach that works with such polygons, and the code is easier to comprehend.
from shapely.geometry import Polygon, MultiPolygon
from shapely import wkt
from shapely.wkt import loads
# List of strings representing polygons
poly_string = ["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
# Create a list of polygons from the list of strings
all_pgons = [loads(pgon) for pgon in poly_string]
# Create the required multipolygon
multi_pgon = MultiPolygon(all_pgons)
This is a list of strings of polygons with holes.
# List of polygons with hole
poly_string = ['POLYGON ((1 2, 1 5, 4 4, 1 2), (1.2 3, 3 4, 1.3 4, 1.2 3))',
'POLYGON ((11 12, 11 15, 14 14, 11 12), (11.2 13, 13 14, 11.3 14, 11.2 13))']
The code above also works well in this case.
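For example, you can confirm that the holes survive; a quick sketch, reusing the imports from the snippet above and the poly_string with holes:
multi_pgon = MultiPolygon([loads(pgon) for pgon in poly_string])
for pgon in multi_pgon.geoms:
    print(len(pgon.interiors))   # 1 interior ring (hole) for each example polygon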
A MultiPolygon is just a list of Polygons, so you need to reconstruct every Polygon in a list and then pass that list to MultiPolygon.
With the format of the strings you gave, I got it to work like this:
from shapely.geometry import Polygon, MultiPolygon
poly_string = ["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
polygons = []
for poly in poly_string:
    coordinates = []
    for s in poly.split('('):
        if len(s.split(')')) > 1:
            for c in s.split(')')[0].split(','):
                coordinates.append((float(c.lstrip().split(' ')[0]),
                                    float(c.lstrip().split(' ')[1])))
    polygons.append(Polygon(coordinates))
multipoly = MultiPolygon(polygons)
The resulting MultiPolygon looks like this (the original answer showed a plot of the result here).
I would try
select
FID,
ST_Multi(ST_Collect(ST_GeomFromText(polygon))) polygon_list
group by 1
I am trying to write Chinese characters to a CSV file based on their Unicode code points found in a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. For instance, one example character is U+9109.
In the example below I can get the correct output by hard coding the value (line 8), but keep getting it wrong with every permutation I've tried at generating the bytes from the code point (lines 14-16).
I'm running this in Python 3.8.3 on a Debian-based Linux distro.
Minimal working (broken) example:
 1  #!/usr/bin/env python3
 2
 3  def main():
 4
 5      output = open("test.csv", "wb")
 6
 7      # Hardcoded values work just fine
 8      output.write('\u9109'.encode("utf-8"))
 9
10      # Comma separation
11      output.write(','.encode("utf-8"))
12
13      # Problem is here
14      codepoint = '9109'
15      u_str = '\\' + 'u' + codepoint
16      output.write(u_str.encode("utf-8"))
17
18      # End with newline
19      output.write('\n'.encode("utf-8"))
20
21      output.close()
22
23  if __name__ == "__main__":
24      main()
Executing and viewing results:
example $
example $./test.py
example $
example $cat test.csv
鄉,\u9109
example $
The expected output would look like this (Chinese character occurring on both sides of the comma):
example $
example $./test.py
example $cat test.csv
鄉,鄉
example $
chr is used in Python 3 to convert an integer code point into the corresponding character. Your code could use:
output.write(chr(0x9109).encode("utf-8"))
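Since the code point in the question arrives as a hex string such as '9109' (line 14), or 'U+9109' as it appears in the Unihan files, you would first parse it with int(..., 16). A small sketch reusing the output file handle from the question:
codepoint = '9109'                      # hex string, as on line 14
output.write(chr(int(codepoint, 16)).encode("utf-8"))

# Unihan data uses the 'U+9109' form, so drop the first two characters
unihan_codepoint = 'U+9109'
output.write(chr(int(unihan_codepoint[2:], 16)).encode("utf-8"))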
But if you specify the encoding in open instead of using binary mode, you don't have to manually encode everything. print to a file handles the newlines for you as well.
with open("test.txt",'w',encoding='utf-8') as output:
for i in range(0x4e00,0x4e10):
print(f'U+{i:04X} {chr(i)}',file=output)
Output:
U+4E00 一
U+4E01 丁
U+4E02 丂
U+4E03 七
U+4E04 丄
U+4E05 丅
U+4E06 丆
U+4E07 万
U+4E08 丈
U+4E09 三
U+4E0A 上
U+4E0B 下
U+4E0C 丌
U+4E0D 不
U+4E0E 与
U+4E0F 丏
I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just the product_id is selected from the RDD using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
the output for product_id is:
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
Each row of the file has the first 10 bytes for product_id and the next 4096 bytes as image_features.
I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a proper readable format.
EDIT:
Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Changing to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
Should work.
Actually, you can find this in the code provided on the web site you downloaded the binary file from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4)))  # <-- so 4096 * 4
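Putting it together, a sketch of decoding the full record under that layout (10 bytes of ASCII product_id followed by 4096 float32 features; little-endian byte order is assumed here, and sc is the SparkContext from the question):
import struct

RECORD_LEN = 10 + 4096 * 4   # product_id + 4096 float32 features

def decode_record(record):
    # first 10 bytes: product_id, remaining 16384 bytes: 4096 floats
    product_id = record[:10].decode("utf-8", "ignore").strip()
    features = struct.unpack("<4096f", record[10:])
    return product_id, list(features)

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LEN)
decoded_embeddings = img_embedding_file.map(decode_record)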
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means the product_id string contains non-UTF-8 bytes. If you don't know the input encoding, it's difficult to convert it into a string.
However, you may want to ignore those characters by passing the "ignore" option to decode:
bytes.decode("utf-8", "ignore")
I know there are lots of Q&As about extracting a datetime from a string, for example using dateutil.parser:
import dateutil.parser as dparser
dparser.parse('something sep 28 2017 something',fuzzy=True).date()
output: datetime.date(2017, 9, 28)
but my question is how to know which part of the string produced this extraction, e.g. I want a function that also returns 'sep 28 2017':
datetime, datetime_str = get_date_str('something sep 28 2017 something')
outputs: datetime.date(2017, 9, 28), 'sep 28 2017'
Any clue or direction I could search in?
Extending the discussion with @Paul and following the solution from @alecxe, I have put together the following solution, which works on a number of test cases; I've made the problem slightly more challenging:
Step 1: get excluded tokens
import dateutil.parser as dparser
ostr = 'something sep 28 2017 something abcd'
_, excl_str = dparser.parse(ostr,fuzzy_with_tokens=True)
gives outputs of:
excl_str: ('something ', ' ', 'something abcd')
Step 2: rank tokens by length
excl_str = list(excl_str)
excl_str.sort(reverse=True,key = len)
gives a sorted token list:
excl_str: ['something abcd', 'something ', ' ']
Step 3: delete tokens and ignore space element
for i in excl_str:
    if i != ' ':
        ostr = ostr.replace(i, '')
gives a final output
ostr: 'sep 28 2017 '
Note: step 2 is required, because it will cause problems if any shorter token is a subset of a longer one. E.g., in this case, if deletion follows the order ('something ', ' ', 'something abcd'), the replacement process will remove 'something ' from 'something abcd', and 'abcd' will never get deleted, ending up with 'sep 28 2017 abcd'.
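Putting the three steps together into the get_date_str helper asked for in the question, a sketch might look like this:
import dateutil.parser as dparser

def get_date_str(text):
    # Parse fuzzily and collect the tokens the parser skipped
    parsed, skipped = dparser.parse(text, fuzzy_with_tokens=True)
    # Delete longer tokens first so that a shorter token which is a
    # subset of a longer one cannot break the replacement (step 2)
    for token in sorted(skipped, key=len, reverse=True):
        if token != ' ':
            text = text.replace(token, '')
    return parsed.date(), text.strip()

print(get_date_str('something sep 28 2017 something abcd'))
# (datetime.date(2017, 9, 28), 'sep 28 2017')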
Interesting problem! There is no direct way to get the parsed-out date string from the bigger string with dateutil. The problem is that the dateutil parser does not even have this string available as an intermediate result, as it really builds parts of the future datetime object on the fly, character by character (source).
It does, though, also collect a list of skipped tokens, which is probably your best bet. As this list is ordered, you can loop over the tokens and replace the first occurrence of each token:
from dateutil import parser
s = 'something sep 28 2017 something'
parsed_datetime, tokens = parser.parse(s, fuzzy_with_tokens=True)
for token in tokens:
    s = s.replace(token.lstrip(), "", 1)
print(s) # prints "sep 28 2017"
I am though not 100% sure if this would work in all the possible cases, especially, with the different whitespace characters (notice how I had to workaround things with .lstrip()).
I am trying to transform image geotags so that images and ground control points lie in the same coordinate system inside my software (Pix4D mapper).
The answer here says:
Exif data is standardized, and GPS data must be encoded using geographical coordinates (minutes, seconds, etc.) described above instead of a fraction. Unless it's encoded in that format in the exif tag, it won't stick.
Here is my code:
import os, piexif, pyproj
from PIL import Image
img = Image.open(os.path.join(dirPath,fn))
exif_dict = piexif.load(img.info['exif'])
breite = exif_dict['GPS'][piexif.GPSIFD.GPSLatitude]
lange = exif_dict['GPS'][piexif.GPSIFD.GPSLongitude]
breite = breite[0][0] / breite[0][1] + breite[1][0] / (breite[1][1] * 60) + breite[2][0] / (breite[2][1] * 3600)
lange = lange[0][0] / lange[0][1] + lange[1][0] / (lange[1][1] * 60) + lange[2][0] / (lange[2][1] * 3600)
print(breite) #48.81368778730952
print(lange) #9.954511162420633
x, y = pyproj.transform(wgs84, gk3, lange, breite) #from WGS84 to GaussKrüger zone 3
print(x) #3570178.732528623
print(y) #5408908.20172699
exif_dict['GPS'][piexif.GPSIFD.GPSLatitude] = [ ( (int)(round(y,6) * 1000000), 1000000 ), (0, 1), (0, 1) ]
exif_bytes = piexif.dump(exif_dict) #error here
img.save(os.path.join(outPath,fn), "jpeg", exif=exif_bytes)
I am getting struct.error: argument out of range in the dump method. The original GPSInfo tag looks like: {0: b'\x02\x03\x00\x00', 1: 'N', 2: ((48, 1), (48, 1), (3449322402, 70000000)), 3: 'E', 4: ((9, 1), (57, 1), (1136812930, 70000000)), 5: b'\x00', 6: (3659, 10)}
I am guessing I have to offset the values and encode them properly before writing, but have no idea what is to be done.
It looks like you are already using PIL and Python 3.x. I'm not sure if you want to continue using piexif, but either way you may find it easier to convert the degrees, minutes, and seconds into a decimal value first. It looks like you are trying to do that already, but putting it in a separate function may be clearer and lets you account for the direction reference.
Here's an example:
def get_decimal_from_dms(dms, ref):
    degrees = dms[0][0] / dms[0][1]
    minutes = dms[1][0] / dms[1][1] / 60.0
    seconds = dms[2][0] / dms[2][1] / 3600.0
    if ref in ['S', 'W']:
        degrees = -degrees
        minutes = -minutes
        seconds = -seconds
    return round(degrees + minutes + seconds, 5)

def get_coordinates(geotags):
    lat = get_decimal_from_dms(geotags['GPSLatitude'], geotags['GPSLatitudeRef'])
    lon = get_decimal_from_dms(geotags['GPSLongitude'], geotags['GPSLongitudeRef'])
    return (lat, lon)
The geotags argument in this example is a dictionary with the GPSTAGS as keys instead of the numeric codes, for readability. You can find more detail and the complete example in this blog post: Getting Started with Geocoding Exif Image Metadata in Python 3.
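Note also that each EXIF GPS rational is a pair of 32-bit unsigned integers, which is why writing round(y, 6) * 1000000 (a Gauss-Krüger northing in the millions of metres) overflows and raises struct.error. If you do need to write a decimal latitude or longitude back into the tag, a hedged sketch of the reverse of get_decimal_from_dms might look like this (the seconds denominator is borrowed from the original GPSInfo tag shown in the question):
def get_dms_from_decimal(value, seconds_denominator=70000000):
    # Decimal degrees back to the ((num, den), (num, den), (num, den))
    # rationals used by the GPS IFD; the hemisphere goes separately into
    # GPSLatitudeRef / GPSLongitudeRef ('N'/'S', 'E'/'W').
    value = abs(value)
    degrees = int(value)
    minutes = int((value - degrees) * 60)
    seconds = (value - degrees - minutes / 60.0) * 3600.0
    return ((degrees, 1),
            (minutes, 1),
            (int(round(seconds * seconds_denominator)), seconds_denominator))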
After much hemming and hawing I reached the pages of the py3exiv2 image metadata manipulation library. One will find exhaustive lists of the metadata tags as one reads through, but here is the list of EXIF tags just to save a few clicks.
It runs smoothly on Linux and provides many ways to edit image headers. The documentation is also quite clear. I recommend this as a solution and am interested to know whether it solves everyone else's problems as well.
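For what it's worth, a minimal read/inspect/write round trip with py3exiv2 might look roughly like the sketch below; the ImageMetadata interface and the Exif.GPSInfo tag keys are as described in the py3exiv2 documentation, and 'image.jpg' is just a placeholder filename:
import pyexiv2  # py3exiv2 installs under the module name pyexiv2

metadata = pyexiv2.ImageMetadata('image.jpg')
metadata.read()

# GPS coordinates come back as lists of Fractions (degrees, minutes, seconds)
print(metadata['Exif.GPSInfo.GPSLatitude'].value)
print(metadata['Exif.GPSInfo.GPSLatitudeRef'].value)

# After modifying tag values, persist the changes with:
metadata.write()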