Reading .data files using pandas - python-3.x

Recently I encountered a file with a .data extension. I searched Google but found only irrelevant answers, and the solutions suggested by various blogs and websites did not help. I am sharing the solution that a colleague suggested to me. Before that, I had tried reading the file with read_csv:
import pandas as pd
data = pd.read_csv("example.data")
It read the file, but the resulting data was not what I expected.
Hope it will be helpful.

A .data file whose columns are laid out at fixed widths can be read with pandas read_fwf(). See the read_fwf documentation for details.
Example:
import pandas as pd
data = pd.read_fwf("example.data")
By default the result will not have meaningful column names, because a .data file usually does not contain a header row. To get column names, pass them when reading the file.
Example:
import pandas as pd
data = pd.read_fwf("example.data", names=["col1", "col2"])
print(list(data.columns))
# ['col1', 'col2']
Hope this is useful..!!!
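A minimal sketch of the read_fwf approach above, using made-up fixed-width content held in memory instead of a real example.data file. The sample rows, the widths, and the column names are assumptions chosen for illustration only.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width content: no header row, columns aligned by position.
raw = StringIO(
    "alice    30  London\n"
    "bob      25  Paris \n"
)

# widths gives the character count of each column; header=None tells pandas
# that the first line is data, and names supplies the column labels.
data = pd.read_fwf(raw, widths=[9, 4, 6], header=None, names=["name", "age", "city"])
print(data)
```

If widths is omitted, read_fwf tries to infer the column boundaries from the first rows of the file, which works when the columns are separated by consistent runs of spaces.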

I would say treat the .data file as a CSV file (that worked for me). If the column names are missing, just specify them.
I used:
import pandas as pd
DataFrame = pd.read_csv("file.data", names=["columnName", "..." , ".." ])
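As a rough illustration of the read_csv approach, the sketch below reads headerless content from memory. The sample rows and column names are invented, and the second call assumes a whitespace-delimited variant of the same data to show the sep parameter.

```python
from io import StringIO

import pandas as pd

# Hypothetical column names for the illustration.
columns = ["length", "width", "label"]

# Comma-separated content without a header row.
comma_raw = StringIO("5.1,3.5,setosa\n4.9,3.0,setosa\n")
df_comma = pd.read_csv(comma_raw, header=None, names=columns)

# The same data separated by runs of whitespace instead of commas.
space_raw = StringIO("5.1   3.5  setosa\n4.9   3.0  setosa\n")
df_space = pd.read_csv(space_raw, sep=r"\s+", header=None, names=columns)

print(df_comma.equals(df_space))
```

Which call fits depends on how the particular .data file is delimited, so it is worth opening the file in a text editor first.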

Related

Why Pandas does not read xlsx files?

I am having trouble reading an xlsx file with Pandas. The same code used to work but does not work anymore. I have tried many approaches, to no avail.
Here is my code
import pandas as pd
from io import StringIO
df = pd.read_csv("Muzika.xlsx")
print(df)
Your file is not a .csv file; it is an Excel file, so try read_excel(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
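One way to avoid this kind of mismatch is to choose the reader from the file extension. The sketch below is only an illustration: READERS and pick_reader are hypothetical helper names, not part of pandas, and actually calling read_excel on an .xlsx file requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to the pandas reader to use.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def pick_reader(path):
    """Return the pandas reader registered for the extension of path."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"No reader registered for {suffix!r}")
    return READERS[suffix]


reader = pick_reader("Muzika.xlsx")
print(reader is pd.read_excel)
# df = reader("Muzika.xlsx")  # uncomment once the file and an Excel engine are available
```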

How to load .gds file into Pandas?

I have a .gds file. How can I read that file with pandas and do some analysis? What is the best way to do that in Python? The file can be downloaded here.
You need to change the encoding and read the data as latin1:
import pandas as pd
df = pd.read_csv('example.gds',header=27,encoding='latin1')
This loads the data. Passing header=27 skips the first 27 lines of the file and takes the column names from the following line, which is where the actual table starts.
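The effect of header and encoding can be seen on a small in-memory file. This is a sketch with invented content: three metadata lines instead of 27, comma-separated columns, and a couple of accented characters stored as latin1 bytes.

```python
from io import BytesIO

import pandas as pd

# Hypothetical file: three metadata lines, then the header row, then the data.
text = (
    "!dataset_title = example\n"
    "!dataset_type = demo\n"
    "!dataset_note = caf\u00e9\n"
    "id,name,value\n"
    "1,caf\u00e9,3.5\n"
    "2,na\u00efve,4.0\n"
)
raw = BytesIO(text.encode("latin1"))

# header=3 means: line index 3 (zero-based) holds the column names,
# and every line before it is skipped.
df = pd.read_csv(raw, header=3, encoding="latin1")
print(df)
```

Decoding the same bytes as UTF-8 would fail on the accented characters, which is why the encoding argument matters for files like this.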
The gdspy package comes handy for such applications. For example:
import numpy
import gdspy
gdsii = gdspy.GdsLibrary(infile="filename.gds")
main_cell = gdsii.top_level()[0] # Assume a single top level cell
points = main_cell.polygons[0].polygons[0]
for p in points:
    print("Points: {}".format(p))

Pandas Is Not Reading_csv Raw Data When Names Are Defined in a Second Line

I just started my first IRIS FLOWER project based on your example. After completing two projects, I will move to the next step, statistical and deep learning. Of course, before that I will get your book and study it.
However, I ran into an error in my first project. The problem is that I couldn't load the data, either from the online source or from my local computer. My computer has all the necessary modules installed (see the attachment).
I followed the same procedure you illustrated in your example. My system read the data only when I removed the name definitions from the second line, which is names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'].
When I deleted the name definitions from the code, pandas read_csv read the file both from the online source and from the local computer. However, the retrieved data has no headings (field names) at the top.
When I tried to read the data with the name definitions in the second line, I got the following error message:
NameError: name 'pandas' is not defined
How I can deal with this problem?
#Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
print(dataset)
I'm guessing that you put import pandas as pd in your imports. Use pd.read_csv() instead. If you didn't import pandas, then you need to import it at the top of your Python file with import pandas or import pandas as pd (which is what pretty much everyone else uses).
Otherwise, your code looks fine.
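A short sketch of the corrected version, assuming the module was imported as pd. To keep it runnable offline, it reads two sample rows from memory instead of the URL; the rows are illustrative stand-ins for the contents of iris.csv.

```python
from io import StringIO

import pandas as pd  # the name bound here is "pd", so "pandas.read_csv" would raise NameError

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Illustrative rows in the same layout as the iris file (no header row).
sample = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Use the alias that was actually imported; names supplies the missing headings.
dataset = pd.read_csv(sample, names=names)
print(dataset)
```

With a network connection, the url variable from the question can be passed to pd.read_csv in place of sample.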

'ValueError: No tables found': Python pd.read_html not loading input files

I am trying to import a series of HTML files with news articles that I have saved in my working directory. I developed the code using a single HTML file and it was working perfectly. However, I have since amended the code to import multiple files.
As you can see from the code below, I am using pandas and pd.read_html(). It no longer imports any files and gives me the error 'ValueError: No tables found'.
I have tried with different types of HTML files so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OSX and Python 3.6 and Pandas 0.20.3 in Anaconda Navigator.
It was working, now it's not. What am I doing wrong?
Any tips or clues would be greatly appreciated.
import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os
mypath = 'path_to_my_wd'
raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]
news = pd.DataFrame()
for htmlfile in raw_data:
    articles = pd.read_html(join(mypath, htmlfile), index_col=0) #reads file as html
    data = pd.concat([art for art in articles if 'HD' in art.index.values],
                     axis=1).T.set_index('AN')
    data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
    #selects columns to export
    news = news.append(data_export)
The HTML files were slightly different in formatting, and I needed to pass sort=False to pd.concat():
data = pd.concat([art for art in articles if 'HD' in art.index.values], sort=False, axis=1).T.set_index('AN')
The sort parameter is new in Pandas version 0.23.0. That solved the problem.
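To see what the concat step does when the tables have slightly different rows, here is a sketch with two hand-built frames standing in for the output of pd.read_html(..., index_col=0). The field codes follow the question, but the values are invented.

```python
import pandas as pd

# Stand-ins for two article tables: field codes in the index, one value column each.
# The second table has an 'LP' row instead of a 'BY' row.
articles = [
    pd.DataFrame({1: ["Headline A", "AN001", "Smith"]}, index=["HD", "AN", "BY"]),
    pd.DataFrame({1: ["Headline B", "AN002", "Lead paragraph"]}, index=["HD", "AN", "LP"]),
]

# axis=1 places the articles side by side and aligns them on the field codes;
# sort=False asks pandas not to sort the combined set of field codes.
data = pd.concat(
    [art for art in articles if 'HD' in art.index.values],
    sort=False,
    axis=1,
).T.set_index('AN')
print(data)
```

Fields that are missing from one article show up as NaN in that article's row.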

Solving csv files with quoted semicolon in Pandas data frame

So I am facing the following problem:
I have a ;-separated CSV in which some fields contain a ; enclosed in quotes, and this is corrupting the data.
For example: abide;acdet;"adds;dsss";acde
The ; inside "adds;dsss" moves "dsss" onto the next line and corrupts the results of the ETL module I am writing. My ETL takes such a CSV from the internet, transforms it (by first loading it into a Pandas data frame, pre-processing it, and then saving it), and then loads it into SQL Server. The corrupted files are breaking the SQL Server schema.
Is there a solution I can use with a Pandas data frame to fix this issue during the read (pd.read_csv), the write (to_csv), or both?
You might need to tell the reader some fields may be quoted:
pd.read_csv(your_data, sep=';', quotechar='"')
Let's try:
from io import StringIO
import pandas as pd
txt = StringIO("""abide;acdet;"adds;dsss";acde""")
df = pd.read_csv(txt,sep=';',header=None)
print(df)
Output dataframe:
       0      1          2     3
0  abide  acdet  adds;dsss  acde
The sep parameter of pd.read_csv allows you to specify which character is used as a separator in your CSV file. Its default value is ,. Does changing it to ; solve your problem?
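Since the question also asks about the writing side, the sketch below round-trips the sample line: it reads it with sep=';' and writes it back with to_csv using the same separator. It relies on the default minimal quoting of to_csv, which quotes only the fields that contain the separator.

```python
from io import StringIO

import pandas as pd

txt = StringIO('abide;acdet;"adds;dsss";acde')

# Reading: the quoted field is kept together as a single value.
df = pd.read_csv(txt, sep=';', quotechar='"', header=None)

# Writing: with the default quoting, a field containing ';' is quoted again.
out = df.to_csv(sep=';', index=False, header=False)
print(out)
```

If the file is later loaded by another tool, that tool also has to be told that " is the quote character, otherwise it will split the quoted field again.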

Resources