Convert a Base64 LDIF file to plaintext (for import) - linux

I have an LDIF file which has a multi-value Base64-encoded attribute, and I'd like to convert it to non-Base64-encoded syntax. How can this be done?
Context
The LDIF file is as such:
dn: cn=johndoe,ou=clients,ou=management,dc=example,dc=com
changetype: modify
replace: foobarStatus
foobarStatus:: ZW5hYmxl... (Base64 string) ...ZCA9IHRydWU
where the decoded Base64 string is as such:
market = "US"
mgmt.account.mode = "X12"
foo.field = "Something"
bar.field = "Something else"
...
Problem
When I try to import this LDIF file into an LDAP server via ldapmodify, I get an error:
ldapmodify: invalid format (line 4) entry: "cn=johndoe,ou=clients,ou=management,dc=example,dc=com"
I've been trying to solve this for a while but couldn't find the error. It could be a spurious character somewhere. I therefore thought of decoding the Base64 part of the LDIF and importing it in that form. The attribute values don't contain any non-ASCII characters (e.g. accented letters), so it should work fine.
Note
This could be an XY problem, so if anyone has another suggestion, I'm eager to read it.

It turns out ldapmodify doesn't like long lines. After I split the Base64 value here
foobarStatus:: ZW5hYmxl... (Base64 string) ...ZCA9IHRydWU
into multiple lines of 79 characters or fewer, with each continuation line starting with a single space as LDIF line folding requires, ldapmodify was able to import it.
This solved my original problem. I'm leaving the solution here for future readers.
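For readers who want to script the fix, here is a minimal Python sketch of LDIF line folding. It assumes the usual LDIF convention that a continuation line starts with exactly one space. The function names fold_ldif_line and unfold_ldif_lines and the 76-column width are illustrative choices, not something ldapmodify prescribes.

```python
def fold_ldif_line(line, width=76):
    """Split one logical LDIF line into physical lines of at most `width` chars.

    Continuation lines start with a single space, which LDIF parsers strip
    when they join the pieces back together.
    """
    if len(line) <= width:
        return [line]
    pieces = [line[:width]]
    rest = line[width:]
    while rest:
        # One column is taken by the leading space of the continuation line.
        pieces.append(" " + rest[:width - 1])
        rest = rest[width - 1:]
    return pieces


def unfold_ldif_lines(pieces):
    """Inverse of fold_ldif_line: drop the leading space of continuations."""
    return pieces[0] + "".join(p[1:] for p in pieces[1:])


if __name__ == "__main__":
    # Placeholder Base64 payload, not the value from the question.
    long_line = "foobarStatus:: " + "QUJD" * 60
    for physical in fold_ldif_line(long_line):
        print(physical)
```

Base64 text is pure ASCII, so folding at any column is safe. Note also that the decoded value in the question contains line breaks, and LDIF requires values with line breaks to stay Base64-encoded, so folding the encoded value is likely the more practical route than decoding it.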

Related

Python: How to translate UTF8 String containing unicode decoded characters ("Ok\u00c9" to "Oké")

I'm trying to fix a string I'm getting in my Python script.
I'm calling an API, but it returns a UTF-8 string that still contains Unicode escape sequences.
Stuff like "Ok\u00c9" should be "Oké".
I tried converting it, but every attempt either raises an error or gives the same result. Is there someone who could fix this for me in Python 3?
print('\u00c9'.encode().decode('unicode-escape'))
>> é
print('Ok\u00c9'.encode().decode('unicode-escape'))
>> should print 'Oké'
>> but gives an error
Hope you guys know the solution. Thanks in advance!
I've found the problem. The encoding/decoding was wrong: the text came in as Windows-1252.
I used
import chardet
chardet.detect(var3.encode())
to detect the proper encoding, and then did a
var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')
conversion to eventually get it in the right format!
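A different situation that produces similar symptoms is a string that holds the escape sequence as literal text (a backslash followed by u00e9) rather than the character itself. The sketch below assumes that case, and also assumes the rest of the string is plain ASCII; decode_escapes is a hypothetical helper name. Note that U+00C9 is "É" while U+00E9 is "é".

```python
def decode_escapes(text):
    """Turn literal \\uXXXX sequences in an ASCII-only string into characters."""
    # encode("ascii") fails loudly if the input already holds non-ASCII text,
    # which unicode_escape would otherwise misread as Latin-1.
    return text.encode("ascii").decode("unicode_escape")


if __name__ == "__main__":
    raw = "Ok\\u00e9"  # 8 characters: O, k, backslash, u, 0, 0, e, 9
    print(decode_escapes(raw))
```

If the API response is JSON, parsing it with the json module resolves these escapes as part of normal decoding, so a manual step like this may not be needed at all.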

How to search specific unicode tokens in the list in python

I want to search a file for a specific string. The code is as follows:
import codecs

f1 = codecs.open('brokenhindi.txt', encoding='utf-8')
for tokens in f1:
    if u"राज्य" in tokens:
        print('done_3')
but it does not find the string (राज्य). If I replace राज्य with an English token, it finds it. I cannot find the error in the code.
Your code works fine as written.
It seems, though, that your script or your text file is not saved as UTF-8.
Try saving both with UTF-8 encoding.
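A minimal sketch of a setup in which the search does succeed: the script is saved as UTF-8 and the file is opened with an explicit encoding. The helper name file_contains and the sample sentence are made up for illustration. Under Python 2 the script would additionally need a "# -*- coding: utf-8 -*-" line at the top for the non-ASCII literal.

```python
import io
import os
import tempfile


def file_contains(path, token, encoding="utf-8"):
    """Return True if any line of the text file contains `token`."""
    with io.open(path, encoding=encoding) as handle:
        return any(token in line for line in handle)


if __name__ == "__main__":
    # Write a small UTF-8 sample file so the example is self-contained.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with io.open(fd, "w", encoding="utf-8") as handle:
        handle.write(u"भारत एक राज्य है\n")
    print(file_contains(path, u"राज्य"))
    os.remove(path)
```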

Arabic text replaced with escape sequences when creating CSV files using python

I am trying to create a CSV file that contains Arabic tweets collected using tweepy for a project I am doing. Gathering the data works fine; however, when I write to the CSV file, all Arabic results are escaped as \xXX sequences
as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions, and all I could find were suggestions for Python 2 or approaches similar to the one I am already using. When I was creating JSON files instead, I used ensure_ascii=False, but I couldn't find anything similar for CSV. Below is my code:
import codecs
import csv

with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
    fieldnames = ['tweet', 'country']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    # status comes from tweepy (not shown here)
    data = {'tweet': status.text, 'country': status.place.full_name}
    writer.writerow(data)
I tried adding .encoding='utf-8' to status.text and status.place as well, but that also didn't work. Any suggestions?
You have to make sure the Arabic text is decoded from UTF-8 bytes into a str before you write it. Assuming status.text is of type bytes, use text = status.text.decode('utf-8'). (You may have to do the same for status.place.full_name.) If it is already of type str, it won't have a decode() method and needs no conversion. Either way, a str object is what should be written to avoid escape sequences in your file.
Specifying 'utf-8' as the encoding of a bytes object (like the one you presumably have) won't work, because the text is already UTF-8 bytes. To get the characters back you must call decode() on the bytes object. That way the file receives the characters themselves rather than the b'...' representation of the bytes.
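A minimal Python 3 sketch of that advice, using the built-in open rather than codecs.open. The helper names to_text and append_row and the sample row are hypothetical; in the question the values would come from status.text and status.place.full_name.

```python
import csv
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def append_row(path, tweet, country):
    """Append one row to a UTF-8 CSV file, writing characters, not b'...' reprs."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["tweet", "country"])
        writer.writerow({"tweet": to_text(tweet), "country": to_text(country)})


if __name__ == "__main__":
    # Sample data only: a few UTF-8 bytes and a made-up place name.
    path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
    append_row(path, b"\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85", "Riyadh")
    with open(path, encoding="utf-8") as handle:
        print(handle.read())
```

Accepting both bytes and str in to_text keeps the writer working whichever type the upstream library hands over.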

Python3 utf-8 decoding/encoding problems with data hiding

I'm trying to take the text from a file (the text is Russian), hide it in an image, and then later be able to retrieve it from the image. However, I keep getting binascii.Error: Odd-length string when I try to retrieve the data from the image I hid it in.
I feel like the problem may lie within what I use to hide the text. When I do someString = file.read() on the file, and print someString everything comes out fine. But when I run:
file = open(<text file path>, 'r', encoding='utf-8')
entireText = file.read()
print(codecs.encode(entireText,'utf-8'))
I get the following:
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb2:\n\xd0\x9e\xd1\x87\xd0\xb8 \xd1\x87\xd1\x91\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb5, \xd0\xbe\xd1\x87\xd0\xb8 \xd0
That is only a piece of it, but the pattern is clear: colons, spaces, commas, and \n appear throughout the bytes object, which is the type codecs.encode returns. If I use codecs to decode it, I get the original text back perfectly.
If it helps, here are the functions I use to make it happen:
import binascii

def stringToBinary(msg):
    # UTF-8 bytes -> hex -> integer -> bit string (bin() drops leading zero bits)
    return bin(int(binascii.hexlify(msg.encode('utf-8')), 16))[2:]

def binaryToString(bNum):
    # bit string -> integer -> hex -> UTF-8 text
    return binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8')
If that is not enough, the entire file is here: http://pastebin.com/f541DpzS
EDIT: I think I'm getting this error because the image I'm trying to hide the text in doesn't have enough pixels to hold the complete message, so the code was trying to convert the binary number back to a string without all of the bits, which can leave an odd number of hex digits and thus raises binascii.Error: Odd-length string.
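One way to make this failure easier to diagnose is to use a fixed eight bits per byte and to check the carrier's capacity before hiding anything. The sketch below is an alternative to the functions above, not the code from the pastebin; text_to_bits, bits_to_text, and fits are hypothetical names, and capacity_bits stands for however many bits the image can hold.

```python
def text_to_bits(msg):
    """Encode text as UTF-8 and render every byte as exactly 8 bits."""
    return "".join(format(byte, "08b") for byte in msg.encode("utf-8"))


def bits_to_text(bits):
    """Inverse of text_to_bits; rejects input that was cut off mid-byte."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string is truncated: length is not a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def fits(msg, capacity_bits):
    """Return True if the encoded message fits into `capacity_bits` bits."""
    return len(text_to_bits(msg)) <= capacity_bits


if __name__ == "__main__":
    message = "Припев:\nОчи чёрные"
    bits = text_to_bits(message)
    print(len(bits), fits(message, 64))
    print(bits_to_text(bits) == message)
```

Padding each byte to eight bits also keeps leading zero bits, which the bin()-based approach drops.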

Python3 Base64 decode of a var containing ==

So I've got a string of:
YDNhZip1cDg1YWg4cCFoKg==
that needs to be decoded using Python's base64 module.
I've written the code
import base64
test = 'YDNhZip1cDg1YWg4cCFoKg=='
print(test)
print(base64.b64decode(test))
which gives the answer
b'`3af*up85ah8p!h*'
when, according to the website decoders I've used, it's really
`3af*up85ah8p!h*
I'm guessing that it's decoding the additional quotes.
Is there some way that I can save this variable with a delimiter, or as another type of variable, or run the b64encode on a section of the string, as slicing doesn't seem to work?
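The b'...' prefix is how Python displays a bytes object; see: What does the 'b' character do in front of a string literal?
In other words, it is decoding the string correctly; what you see printed is the bytes representation, not extra quotes in the data.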
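To get the same text the website decoders show, the bytes can be decoded into a str. A minimal sketch, assuming the payload is ASCII text as it is for the string in the question:

```python
import base64

test = 'YDNhZip1cDg1YWg4cCFoKg=='
raw = base64.b64decode(test)   # bytes object; printing it shows the b'...' form
text = raw.decode('ascii')     # str; assumes the payload is ASCII text

if __name__ == "__main__":
    print(raw)
    print(text)
```

For binary payloads (keys, images, and so on) the bytes object is the right type to keep, and no decode step applies.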

Resources