How to load a .gds file into pandas? - python-3.x

I have a .gds file. How can I read that file with pandas and do some analysis? What is the best way to do that in Python? The file can be downloaded here.

You need to change the encoding and read the data using latin1:
import pandas as pd
df = pd.read_csv('example.gds', header=27, encoding='latin1')
This gets you the data file: header=27 skips the first 27 lines of metadata and uses the next line as the header, which is where the real pandas meat of the file starts.

The gdspy package comes in handy for such applications. For example:
import gdspy

gdsii = gdspy.GdsLibrary(infile="filename.gds")
main_cell = gdsii.top_level()[0]  # assume a single top-level cell
points = main_cell.polygons[0].polygons[0]  # point array of the first polygon
for p in points:
    print("Points: {}".format(p))

Related

How to load large multi-file parquet files for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried petastorm, but make_reader() doesn't work because the dataset isn't in petastorm format.
with make_batch_reader('dir') as reader:
    dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it gave a 'zip not iterable' error or something along those lines.
I am not sure how to load the file into Python for ML training.
Some quick help would be greatly appreciated.
Thanks
Zash
For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
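A minimal sketch of that row-group approach (keeping the directory name 'dir' from the question):
import glob
import pyarrow.parquet as pq

for path in glob.glob('dir/*.parquet'):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # only one row group is held in memory at a time
        batch = pf.read_row_group(i).to_pandas()
        # ... feed batch to your training loop ...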
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful; but you can inspect the stack trace and investigate where in petastorm code it originates from.
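One pitfall worth ruling out (an assumption about your error, not a certain diagnosis): petastorm expects a dataset URL such as file:///..., not a bare directory path. A hedged sketch of the TensorFlow route:
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# 'file:///path/to/dir' is a placeholder URL for your parquet directory
with make_batch_reader('file:///path/to/dir') as reader:
    dataset = make_petastorm_dataset(reader)  # a tf.data.Dataset
    for batch in dataset:
        ...  # batches arrive as named tuples of columns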
You can load the entire dataset with Dask using the code below.
You can also load only chunks of the data when needed, by computing only those partitions (assuming the files have distinct indexes), as sketched after the code.
import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

@delayed
def load_chunk(pth):
    x = ParquetFile(pth).to_pandas()
    # drop the columns you don't need, to save space
    x = x.drop(['unwanted_columns_to_save_space'], axis=1)
    return x

files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()
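A sketch of that partial-load idea (the date slice assumes a sorted, meaningful index; the range shown is hypothetical):
# compute only the first partition instead of the whole frame
first_chunk = ddf.get_partition(0).compute()

# or, with a known sorted index, compute only a slice
subset = ddf.loc['2021-01-01':'2021-02-01'].compute()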

Reading .data files using pandas

Recently I encountered a file with a .data extension. I searched Google and found only irrelevant answers; I tried different solutions provided by blogs and websites, and nothing seemed to help. I am providing the solution that was suggested to me by a colleague. Before that, I had tried reading the file with read_csv:
import pandas as pd
data = pd.read_csv("example.data")
It processed the file, but the parsed data was meaningless.
Hope this will be helpful.
The way to read a .data file using pandas is read_fwf(), which parses fixed-width formatted lines. For details, refer to the read_fwf documentation.
Example:
import pandas as pd
data = pd.read_fwf("example.data")
By default the data will have no meaningful column names, because a .data file does not contain any. To get column names, pass them while reading the file.
Example:
import pandas as pd
data = pd.read_fwf("example.data", names=["col1", "col2"])
print(data.columns)
Index(['col1', 'col2'], dtype='object')
Hope this is useful!
I would say treat the .data file as a CSV file (that worked out for me). In case your column names are missing, just specify them:
I used:
import pandas as pd
df = pd.read_csv("file.data", names=["columnName", "...", ".."])

Raw output data frame manipulation in python

Using Python 3, I need to process raw qPCR sequencing output by searching for the first occurrence of a user-defined string and then building a new data frame from all lines after that string. I have been trying to find solutions in the pandas docs, so far without success.
This is a raw output .csv file that I need to process (I couldn't paste the complete csv as it exceeds the character limit; this is lines 40-50, which I hope is useful). I need to tell pandas to create a new data frame that 1. starts at the line containing the first occurrence of the string "Sample Name", uses that line as the header, and contains all following lines, and 2. includes only the columns "Sample Name", "Target Name", and "CT".
Could someone please help me use Python to analyze this biological data?
Many thanks,
Luke
40,Quantification Cycle Method,Ct,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
41,Signal Smoothing On,true,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42,Stage where Melt Analysis is performed,Stage3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
43,Stage/ Cycle where Ct Analysis is performed,"Stage2, Step2",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
44,User Name,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
46,Well,Well Position,Omit,Sample Name,Target Name,Task,Reporter,Quencher,Quantity,Quantity Mean,SE,RQ,RQ Min,RQ Max,CT,Ct Mean,Ct SD,Delta Ct,Delta Ct Mean,Delta Ct SD,Delta Ct SE,Delta Delta Ct,Automatic Ct Threshold,Ct Threshold,Automatic Baseline,Baseline Start,Baseline End,Amp Status,Comments,Cq Conf,CQCONF,HIGHSD,OUTLIERRG,Tm1,Tm2,Tm3,Tm4
47,1,A1,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.357698440551758,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9588544573203085,N,Y,N,81.40960693359375,,,
48,2,A2,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,24.05980110168457,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,15,Amp,,0.9592687354496955,N,Y,N,81.40960693359375,,,
49,3,A3,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.012556076049805,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,16,Amp,,0.9592714462250367,N,Y,N,81.40960693359375,,,
50,4,A4,False,fla11fla12-1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.803699493408203,24.419523239135742,0.5669151544570923,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9671570584141241,N,Y,N,81.40960693359375,,,
This is the code that I have so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_excel("2019-02-27_161601 AtWAKL8 different version expressions.xls", sheet_name='Results').fillna(0)
data.to_csv('df1.csv', index=True)
df1 = pd.read_csv("df1.csv")
You are having trouble with quoting. grep is a better fit for .csv files than for .xlsx.
You are forking off a shell subprocess with a filename argument without correctly quoting the spaces in the filename. It would be simplest to rename the file, turning spaces into dashes, e.g. 2019-02-27_161601-AtWAKL8-different-version-expressions.xls. As it stands, you are trying to grep the string "Position" from a file named 2019-02-27_161601, from a 2nd file named AtWAKL8, a 3rd named different, and so on, which is unlikely to work.
An .xlsx spreadsheet is not the line-oriented text format that grep expects. You will be happier if you export or Save As .csv format within Excel, or if you execute data.to_csv('expressions.csv')
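Once you have a plain .csv, a sketch of the header-scan the question describes might look like this ('expressions.csv' and equal field counts per row are assumptions):
import pandas as pd

# read everything as raw strings, with no header yet
raw = pd.read_csv('expressions.csv', header=None, dtype=str)

# index of the first row containing "Sample Name"
header_row = raw.apply(lambda r: r.eq('Sample Name').any(), axis=1).idxmax()

# re-read with that row as the header, then keep only the wanted columns
df = pd.read_csv('expressions.csv', skiprows=header_row, header=0)
df = df[['Sample Name', 'Target Name', 'CT']]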

Solving csv files with quoted semicolon in Pandas data frame

So I am facing the following problem:
I have a ;-separated csv which has ; enclosed in quotes, and this is corrupting the data.
For example: abide;acdet;"adds;dsss";acde
The ; inside "adds;dsss" pushes " dsss" into the next field, corrupting the results of the ETL module I am writing. My ETL takes such a csv from the internet, transforms it (by first loading it into a Pandas data frame, doing pre-processing, and then saving it), and then loads it into SQL Server. The corrupted files are breaking the SQL Server schema.
Is there any solution I can use together with a Pandas data frame that fixes this issue during the read (pd.read_csv), the write (to_csv), or both?
You might need to tell the reader that some fields may be quoted:
pd.read_csv(your_data, sep=';', quotechar='"')
Let's try:
from io import StringIO
import pandas as pd
txt = StringIO("""abide;acdet;"adds;dsss";acde""")
df = pd.read_csv(txt, sep=';', header=None)
print(df)
Output dataframe:
       0      1          2     3
0  abide  acdet  adds;dsss  acde
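For the write side of the ETL, a sketch of the same example round-tripped (quoting=csv.QUOTE_MINIMAL is pandas' default, spelled out here for clarity):
import csv
from io import StringIO
import pandas as pd

txt = StringIO('abide;acdet;"adds;dsss";acde')
df = pd.read_csv(txt, sep=';', header=None)

# QUOTE_MINIMAL quotes only fields that contain the separator,
# so "adds;dsss" is written back as a single quoted field
df.to_csv('fixed.csv', sep=';', index=False, header=False,
          quoting=csv.QUOTE_MINIMAL)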
The sep parameter of pd.read_csv lets you specify which character is used as the separator in your CSV file. Its default value is ','. Does changing it to ';' solve your problem?

How to import .dta via pandas and describe data?

I am new to Python and have a simple problem. As a first step, I want to load some sample data I created in Stata. As a second step, I would like to describe the data in Python; that is, I'd like a list of the imported variable names. So far I've done this:
from pandas.io.stata import StataReader
reader = StataReader('sample_data.dta')
data = reader.data()
dir()
I get the following warning:
anaconda/lib/python3.5/site-packages/pandas/io/stata.py:1375: UserWarning: 'data' is deprecated, use 'read' instead
warnings.warn("'data' is deprecated, use 'read' instead")
What does it mean and how can I resolve the issue? And, is dir() the right way to get an understanding of what variables I have in the data?
Using pandas.io.stata.StataReader.data to read from a Stata file was deprecated in pandas version 0.18.1, which is why you are getting that warning.
Instead, you should use pandas.read_stata to read the file, as shown:
import pandas as pd

df = pd.read_stata('sample_data.dta')
df.dtypes  # return the dtypes in this object
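As for dir(): it lists the names in your current Python scope, not the variables inside the data. A short sketch of inspecting the imported variables instead:
import pandas as pd

df = pd.read_stata('sample_data.dta')
print(df.columns.tolist())  # the imported variable names
df.info()                   # names, dtypes, and non-null counts together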
Sometimes this did not work for me, especially when the dataset is large. So what I propose here is a two-step approach (Stata, then Python).
In Stata, write the following command:
export excel Cevdet.xlsx, firstrow(variables)
and to copy the variable labels, write the following:
preserve
describe, replace
list
export excel using myfile.xlsx, replace first(var)
restore
This will generate two files for you, Cevdet.xlsx and myfile.xlsx.
Now go to your Jupyter notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('Cevdet.xlsx')
This will read the exported data into Jupyter (Python 3); myfile.xlsx with the variable labels can be read the same way.
My advice is to save this data frame to a pickle (especially if it is big):
df.to_pickle('Cevdet')
The next time you open Jupyter you can simply run:
df = pd.read_pickle("Cevdet")
