I have a .dat file exported from a mainframe system. It is EBCDIC encoded (cp037). I would like to load the contents into a pandas or Spark dataframe.
I tried using "iconv" to convert the file to ASCII, but it does not support conversion from cp037; "iconv -l" does not list cp037.
What is the best way to achieve this?
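One option, since Python's standard codec set includes cp037, is to decode the file directly in Python rather than with iconv. This is only a minimal sketch: the record layout of the .dat file isn't shown, so the file name, delimiter, and column widths below are placeholders.

import pandas as pd

# Python ships with a cp037 (EBCDIC US/Canada) codec, so pandas can decode the
# file directly; 'mainframe.dat' and the '|' delimiter are placeholders.
df = pd.read_csv('mainframe.dat', encoding='cp037', sep='|')

# If the export uses fixed-width records instead of a delimiter:
# df = pd.read_fwf('mainframe.dat', encoding='cp037', widths=[10, 20, 8])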
Related
I've read a lot of Stack Overflow and other threads that mention how to read an Excel binary file.
Reference: Read XLSB File in Pandas Python
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
However, I cannot find any solution for writing it back as an .xlsb file after processing with pandas. Can anyone please suggest a workable solution for this using Python?
Any help is much appreciated!
I haven't been able to find any solution to write into xlsb files or create xlsb files using Python.
But maybe one workaround is to save your file as xlsx using any of the many available libraries for that (such as pandas, xlsxwriter, openpyxl) and then convert that file into an xlsb using xlsb-converter: https://github.com/gibz104/xlsb-converter
CAUTION: This repository uses WIN32COM, which is why this script only supports Windows
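If you go that route, the first half of the workaround (writing the processed DataFrame back to .xlsx) might look like the sketch below; the file name is a placeholder, and the .xlsx to .xlsb conversion is then done separately with the xlsb-converter script linked above.

import pandas as pd

# 'df' stands in for your processed DataFrame; the output name is a placeholder.
df = pd.DataFrame({'col_a': [1, 2], 'col_b': ['x', 'y']})

# Write it out as .xlsx first; the .xlsx -> .xlsb step is then handled by the
# xlsb-converter script (Windows only, since it drives Excel via win32com).
df.to_excel('processed.xlsx', index=False, engine='openpyxl')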
You can read the binary file with open_workbook from pyxlsb. Please find the code below:
import pandas as pd
from pyxlsb import open_workbook

path = r'D:\path_to_file.xlsb'
df2 = []
with open_workbook(path) as wb:
    # Sheets are 1-indexed in pyxlsb
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            df2.append([item.v for item in row])

# Use the first row as the header
data = pd.DataFrame(df2[1:], columns=df2[0])
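Note that pyxlsb is not bundled with pandas and needs to be installed separately (for example with pip install pyxlsb) before either of the snippets above will run.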
I want to read a specific line of a CSV file into pandas in Python.
Here is the structure of the file :
file :
example
What would be the best way to fill the values into a DataFrame, with the correct parameter names?
Thanks for the help.
Possible methods:
The pandas.read_table method seems to be a good way to read a tabular data file (also in chunks).
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
pandas has a good, fast (compiled) CSV reader, pandas.read_csv (there may be more than one).
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Ref Link: https://codereview.stackexchange.com/questions/152194/reading-from-a-txt-file-to-a-pandas-dataframe
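Since the exact layout of the file isn't shown, here is a minimal sketch of pulling one specific line out of a CSV with pandas.read_csv; the file name and the target line number are assumptions.

import pandas as pd

# Skip the first 9 data rows (the header at row 0 is kept) and read exactly
# one row, i.e. the 10th data line; 'data.csv' and the line number are placeholders.
row = pd.read_csv('data.csv', skiprows=range(1, 10), nrows=1)
print(row)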
So I have a .mat file
It is a little over 1 GB, but I don't know how much data it contains. I want to convert this .mat file to a NumPy file in Python so I can look at the data and see what is in it. How do I do this conversion?
I think you have two options to read it.
Reading it in Python:
import scipy.io

# loadmat returns a dict mapping MATLAB variable names to NumPy arrays
mat = scipy.io.loadmat('fileName.mat')
Converting it to .csv in MATLAB in order to read it in Python later:
FileData = load('FileName.mat');
csvwrite('FileName.csv', FileData.M);
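If the end goal is a NumPy file on disk, a minimal follow-up sketch is below; the variable name 'M' is a placeholder, and note that scipy.io.loadmat cannot read MATLAB v7.3 files (those need an HDF5 reader such as h5py).

import numpy as np
import scipy.io

mat = scipy.io.loadmat('fileName.mat')

# List the variables stored in the file (keys starting with '__' are metadata)
print([k for k in mat if not k.startswith('__')])

# Save one variable as a .npy file; 'M' is a placeholder variable name
np.save('fileName.npy', mat['M'])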
I have a problem where I have to fetch data through API calls using Python. I have the data; now I have to convert it to a csv file.
We are not allowed to use numpy or pandas; we can only use "import collections" to generate the csv file.
I am a beginner in Python. Can someone help me with this, please?
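A minimal sketch using only the standard library is below; the field names and records are made up, and it assumes the built-in csv module is allowed alongside collections (if not, the rows can be joined with commas and written by hand instead).

import collections
import csv

# Hypothetical records, e.g. built from the parsed API response
rows = [
    collections.OrderedDict([('id', 1), ('name', 'alice')]),
    collections.OrderedDict([('id', 2), ('name', 'bob')]),
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)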
I would like to save a huge pyspark dataframe as a Hive table. How can I do this efficiently? I am looking to use saveAsTable(name, format=None, mode=None, partitionBy=None, **options) from pyspark.sql.DataFrameWriter.
# Let's say I have my dataframe, my_df
# Am I able to do the following?
my_df.write.saveAsTable('my_table')
My question is which formats are available for me to use and where can I find this information for myself? Is OrcSerDe an option? I am still learning about this. Thank you.
The following file formats are supported:
text
csv
jdbc
json
parquet
orc
Reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
So I was able to write the pyspark dataframe to a compressed Hive table by using a pyspark.sql.DataFrameWriter. To do this I had to do something like the following:
my_df.write.orc('my_file_path')
That did the trick.
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.write
I am using pyspark 1.6.0 btw
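For writing straight to a Hive table instead of a path, a minimal sketch of saveAsTable with an explicit format is below; the table name, mode, and partition column are assumptions, and Hive support has to be enabled on your SQLContext/SparkSession.

# Uses my_df from above; the table name, mode, and partition column are placeholders.
my_df.write.saveAsTable(
    'my_table',
    format='orc',
    mode='overwrite',
    partitionBy='load_date',
)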