I have a large TSV-format file (30 GB). I have to load all of that data into Google BigQuery. So I split the file into smaller chunks, gzipped all of those chunk files, and moved them to Google Cloud Storage. After that I call the Google BigQuery API to load the data from GCS. But I am facing the following encoding error:
file_data.part_0022.gz: Error detected while parsing row starting at position: 0. Error: Bad character (ASCII 0) encountered. (error code: invalid)
I am using the following Unix commands in my Python code for the splitting and gzip tasks:
cmd = [
    "split",
    "-l",
    "300000",
    "-d",
    "-a",
    "4",
    "%s%s" % (<my-dir>, file_name),
    "%s/%s.part_" % (<my temp dir>, file_prefix)
]
code = subprocess.check_call(cmd)
cmd = 'gzip %s%s/%s.part*' % (<my temp dir>,file_prefix,file_prefix)
logging.info("Running shell command: %s" % cmd)
code = subprocess.Popen(cmd, shell=True)
code.communicate()
The files are successfully split and gzipped (file_data.part_0001.gz, file_data.part_0002.gz, etc.), but when I try to load these files into BigQuery it throws the above error. I understand that this is an encoding issue.
Is there any way to encode the files during the split and gzip operations? Or do we need to use a Python file object to read line by line, do the Unicode encoding, and write it to a new gzip file (the pythonic way)?
Reason:
Error: Bad character (ASCII 0) encountered
ASCII 0 is a NUL byte, which strongly suggests the file is actually UTF-16 encoded (UTF-16 stores ASCII characters with an extra zero byte), and BigQuery cannot decode that.
The BigQuery service only supports the UTF-8 and Latin-1 text encodings, so the file is supposed to be UTF-8 encoded.
Solution: I haven't tested it, but try the -a or --ascii flag with the gzip command. Then it should be decoded OK by BigQuery.
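If that doesn't help, the line-by-line approach mentioned in the question could look roughly like this. This is a minimal, untested sketch; the file names, the chunk size, and the assumption that the source is UTF-16 are illustrative:
import gzip

# Untested sketch: if the input really is UTF-16, rewriting it as UTF-8
# removes the NUL bytes that BigQuery complains about.
lines_per_chunk = 300000
part, chunk = 0, []

with open("file_data.tsv", "r", encoding="utf-16") as src:
    for line in src:
        chunk.append(line)
        if len(chunk) == lines_per_chunk:
            with gzip.open("file_data.part_%04d.gz" % part, "wt", encoding="utf-8") as out:
                out.writelines(chunk)
            part, chunk = part + 1, []

if chunk:  # write the final, smaller chunk
    with gzip.open("file_data.part_%04d.gz" % part, "wt", encoding="utf-8") as out:
        out.writelines(chunk)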
Related
I am trying to extract a file with a name in this format --> filename.tar.gz10
I have tried multiple ways, but for all of them I get an error saying it is an unknown format. It works fine for files ending with tar.gz00. I tried to change the name, but it still does not work.
Here is what I have tried:
import tarfile
file = tarfile.open('filename.tar.gz10')
file.extractall('./extracted_path')
file.close()
Another way is,
shutil.unpack_archive('./filename.tar.gz10', './extracted_path', 'tar.gz17')
Thanks for your help in advance.
This could be because the archive was split into smaller chunks; on Linux you could do so using the split -b command, so one big file is actually multiple smaller ones now, and they are named like
file.tar.gz01
file.tar.gz02
file.tar.gz03
file.tar.gz04
etc...
you won't be able to decompress these files individually, so you have to concatenate them first into one file and then decompress.
To verify whether it was split or not, run file {filename}; if it does not recognize it as a gzip compressed archive, then it was probably split (this is why you get the unknown format error).
You can try to do the following:
from glob import glob
import os

path = '/path/to/'  # location of your files
list_of_files = sorted(glob(path + '*.tar.gz*'))  # list all the split parts, in order
# first concatenate the parts into a single archive, then extract it
os.system('cat ' + ' '.join(list_of_files) + ' > combined.tar.gz')
os.system('mkdir -p ./extracted_path && tar -xzf combined.tar.gz -C ./extracted_path')
I am using subprocess in order to grab the hexdump of a .tgz file, as I require the hex string. The only problem is, hexdump is throwing a bad format error, but only when the command is issued through subprocess. I believe I have escaped everything correctly, but I can't figure out why I am not getting my intended output:
def package_plugin():
    plugin_hex = subprocess.run(["hexdump", "-v", "-e", "'1/1 \"\\\\x%%02x\"'", "package.tgz"])
This results in an error: hexdump: "'1/1 "''x%%02x"'": bad format. However, if I just run the command straight in the terminal, I receive the expected output of a hex string with '\x' separating the hex bytes.
How should I be running this to store the output in a Python variable? Is my command being mangled somehow and hence not executing correctly? Any advice is appreciated.
Thanks
EDIT: I should add that when entering it in the terminal, the command is hexdump -v -e '1/1 "\\x%02x"'. I am not sure why the extra '%' sign is shown in the error, as it should be interpreted as a single % sign.
Nevermind. I couldn't figure this out, but I solved it with hexlify:
import binascii

def package_plugin():
    plugin_hex = ""
    with open('plugin.tgz', 'rb') as f:
        for chunk in iter(lambda: f.read(32), b''):
            plugin_hex += binascii.hexlify(chunk).decode("utf-8")
    formatted_hex = '\\x' + '\\x'.join(plugin_hex[i:i+2] for i in range(0, len(plugin_hex), 2))
    return formatted_hex
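For reference, the original subprocess call most likely failed because of the extra quoting: the list form of subprocess.run does not go through a shell, so the literal single quotes and the doubled % reach hexdump unchanged. An untested sketch of what should match the terminal command:
import subprocess

# Untested sketch: pass the format string exactly as hexdump should see it,
# with no surrounding single quotes and a single % sign. The raw string keeps
# the two backslashes that the terminal command passes.
result = subprocess.run(
    ["hexdump", "-v", "-e", r'1/1 "\\x%02x"', "package.tgz"],
    capture_output=True, text=True, check=True
)
plugin_hex = result.stdout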
I am trying to compile my py file but end up with an error.
The script reads from 2 Excel files and writes back to 1.
When compiling the py file I get the error FileNotFoundError: [Errno 2] No such file or directory: 'file.xlsx', while the file is there and can be found when I execute the py file. I can't seem to fix this.
When I change the path from relative to full, this error pops up:
workbook = load_workbook(filename="C:\Users\userxdx\Desktop\Excellsupport\file.xlsx")
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
To compile I make use of py2exe (for Windows).
What am I missing here?
This is not working because \ is an escape character. For example, "\n" will create a new line in a string. To stop the backslashes from being treated as escape characters, place an r at the beginning of the string, like so:
filename=r"C:\Users\userxdx\Desktop\Excellsupport\file.xlsx"
In Linux I created a plain text file. Using "file -i" I see the file encoding is "us-ascii". After trying the commands below, it still shows the output file's encoding as "us-ascii". Could you please tell me how to change the encoding? Or is there any way to download some encoded file which I can't read?
iconv -f US-ASCII -t ISO88592//TRANSLIT -o o.txt ip.txt
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT -o op.txt ip.txt
I am expecting that either iconv changes the encoding or I can download some encoded file.
If your file contains only ASCII characters, then there's no difference between the ASCII, UTF-8 and the various ISO-8859-x encodings. So after conversion, you end up with exactly the same file.
A text file does not store any information about what encoding was used. Therefore, file applies a few heuristics, but at the end of the day it's just a guess. And as the files are identical, the result will always be the same.
To see a difference, you must use characters that are encoded differently in the different encodings, or that are not available at all in one of them, e.g. ă, € or 😊.
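A quick Python illustration of the point (the example strings are made up):
# ASCII-only text is byte-identical in ASCII, UTF-8 and ISO-8859-1, so iconv
# has nothing to change; a non-ASCII character makes the encodings diverge.
ascii_text = "plain ascii"
non_ascii_text = "Grüße"   # ü and ß exist in Latin-1 but are encoded differently than in UTF-8

print(ascii_text.encode("utf-8") == ascii_text.encode("iso-8859-1"))          # True: identical bytes
print(non_ascii_text.encode("utf-8") == non_ascii_text.encode("iso-8859-1"))  # False: bytes differ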
I work with txt files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? Wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste them into other txt files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it encodes the TXT files well. But there are too many txt files.
Why are there http://pastebin.com/raw.php?i=Bdj6J3f4 -like chars in the text files? Can they be converted to "normal chars"? I can't see e.g. the "Ì" char when I open the files with vim, only after I "work with them" (e.g. awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"Ì"
"Ã"
"é"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the string "\u00e9" ("é") encoded in UTF-8 ("\xc3\xa9") with the UTF-8 bytes reinterpreted as Latin-1.
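If that diagnosis is right, the double encoding can usually be undone in Python; a hedged sketch (untested, file names are placeholders):
# If the file is UTF-8 text that was at some point misread as Latin-1 and
# re-encoded as UTF-8, the damage can be reversed by undoing that round trip.
with open("input.txt", "r", encoding="utf-8") as f:
    garbled = f.read()          # e.g. "Ã©"

fixed = garbled.encode("latin-1").decode("utf-8")   # back to "é"

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(fixed)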
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, it assumes the file is encoded with your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection - you can check which encoding they detect (I don't know how - I don't use either of them) and use that as the -f option for iconv (or save the open file with your desired encoding using one of those text editors).
You can also use some tool for encoding detection, like the Python chardet module:
$ python -c "import chardet as c; print c.detect(open('file.txt').read(4096))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
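If the detected encoding looks plausible, the conversion could also be done directly in Python instead of iconv (a sketch under that assumption; file names are illustrative):
# Hypothetical follow-up: rewrite file.txt as UTF-8, reading it with the
# encoding reported by chardet.
with open("file.txt", "r", encoding="ISO-8859-2") as src:
    text = src.read()

with open("file_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)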
..solved!
How:
I just right-clicked on the folders containing the TXT files, and pasted them into another folder.. :O and presto.. there are no more ugly chars..