Solving csv files with quoted semicolon in Pandas data frame - python-3.x

I am facing the following problem:
I have a ;-separated CSV in which some fields contain a ; enclosed in quotes, and this is corrupting the data.
For example: abide;acdet;"adds;dsss";acde
The ; inside "adds;dsss" is treated as a separator, so dsss is split off from its field, which corrupts the results of the ETL module I am writing. My ETL takes such a CSV from the internet, transforms it (by first loading it into a Pandas data frame, pre-processing it and then saving it), and then loads it into SQL Server. The corrupted files are breaking the SQL Server schema.
Is there a solution I can use with a Pandas data frame to fix this issue during the read (pd.read_csv), the write (DataFrame.to_csv), or both?

You might need to tell the reader some fields may be quoted:
pd.read_csv(your_data, sep=';', quotechar='"')

Let's try it. The default quotechar of pd.read_csv is already '"', so the quoted ; stays inside its field:
from io import StringIO
import pandas as pd
txt = StringIO("""abide;acdet;"adds;dsss";acde""")
df = pd.read_csv(txt, sep=';', header=None)
print(df)
Output dataframe:
       0      1          2     3
0  abide  acdet  adds;dsss  acde

The sep parameter of pd.read_csv allows you to specify which character is used as a separator in your CSV file. Its default value is ,. Does changing it to ; solve your problem?
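Putting the answers above together, here is a minimal sketch of a read and write round trip. It assumes the data arrives as text that can be wrapped in StringIO; the variable names are illustrative only. On the write side, DataFrame.to_csv with the same separator re-quotes any field that contains it, so the file stays parseable.

```python
from io import StringIO
import csv

import pandas as pd

# Sample row from the question; in a real ETL this would be the downloaded file.
raw = 'abide;acdet;"adds;dsss";acde\n'

# Read: ';' is the separator and '"' marks quoted fields.
df = pd.read_csv(StringIO(raw), sep=";", quotechar='"', header=None)

# Write: QUOTE_MINIMAL quotes only the fields that contain the separator.
out = StringIO()
df.to_csv(out, sep=";", index=False, header=False, quoting=csv.QUOTE_MINIMAL)
written = out.getvalue()
print(written)

# Reading the written text back gives the same four fields.
roundtrip = pd.read_csv(StringIO(written), sep=";", header=None)
print(roundtrip)
```

If the downstream loader does not understand quoted fields at all, another option is to write with a separator that never occurs in the data; whether that suits the SQL Server load step is an assumption to check.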

Related

How to load .gds file into Pandas?

I have a .gds file. How can I read that file with pandas and do some analysis? What is the best way to do that in Python? The file can be downloaded here.
You need to change the encoding and read the data using latin1:
import pandas as pd
df = pd.read_csv('example.gds', header=27, encoding='latin1')
This will load the data. Note that header=27 skips the leading rows of the file, so that row 27 (zero-based) becomes the header and the actual data starts after it.
The gdspy package comes in handy for such applications. For example:
import numpy
import gdspy
gdsii = gdspy.GdsLibrary(infile="filename.gds")
main_cell = gdsii.top_level()[0]  # Assume a single top level cell
points = main_cell.polygons[0].polygons[0]
for p in points:
    print("Points: {}".format(p))

Reading .data files using pandas

Recently I encountered a file with a .data extension. I searched Google but found only irrelevant answers, and none of the solutions I tried from blogs and websites helped. I am sharing the solution that a colleague suggested to me. Before that, I had tried reading the file with read_csv:
import pandas as pd
data = pd.read_csv("example.data")
It parsed the file, but the resulting data was wrong.
Hope this will be helpful.
The solution for reading a .data file with pandas is read_fwf(). For more detail, refer to the read_fwf documentation.
Example:
import pandas as pd
data = pd.read_fwf("example.data")
A .data file typically has no header row, so by default the data will not have meaningful column names. To get column names, pass them when reading the file.
Example:
import pandas as pd
data = pd.read_fwf("example.data", names=["col1", "col2"])
print(list(data.columns))
>>> ['col1', 'col2']
Hope this is useful!
I would say treat the .data file as a CSV file (that worked for me). If your column names are missing, just specify them.
I used:
import pandas as pd
df = pd.read_csv("file.data", names=["columnName", "..." , ".." ])
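To compare the two approaches above, here is a minimal sketch that assumes a hypothetical headerless file whose columns are separated by spaces. The inline text stands in for example.data, and the column names are made up for illustration. Which reader fits depends on whether the real file is fixed-width or delimiter-separated.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a headerless .data file.
raw = (
    "5.1  3.5  setosa\n"
    "4.9  3.0  versicolor\n"
)
names = ["sepal_length", "sepal_width", "species"]  # illustrative names

# Fixed-width reader: column boundaries are inferred from the whitespace.
fixed = pd.read_fwf(StringIO(raw), names=names)

# Delimited reader: any run of whitespace is treated as one separator.
delimited = pd.read_csv(StringIO(raw), sep=r"\s+", names=names)

print(fixed)
print(delimited)
```

For a comma-separated .data file, the plain pd.read_csv(path, names=...) call from the answer above is enough.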

Raw output data frame manipulation in python

Using python 3 I need to process qPCR sequencing raw data outputs by searching for the first occurrence of a user defined string and then making a new data frame using all lines after that string. I am trying to find solutions in the pandas doc but so far unsuccessful.
This is a raw output .csv file that I need to process. (I couldn't paste the complete csv because it exceeds the character limit; these are lines 40-50, which I hope are useful.) I need to tell pandas to create a new data frame that 1. starts at the line containing the first occurrence of str("Sample Name"), with that line as the header, and contains all lines that follow, and 2. includes only the columns ("Sample Name"), ("Target Name"), ("CT").
Could someone please help me so that I can use python to help me analyze biological data?
Many thanks,
Luke
40,Quantification Cycle Method,Ct,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
41,Signal Smoothing On,true,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42,Stage where Melt Analysis is performed,Stage3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
43,Stage/ Cycle where Ct Analysis is performed,"Stage2, Step2",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
44,User Name,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
46,Well,Well Position,Omit,Sample Name,Target Name,Task,Reporter,Quencher,Quantity,Quantity Mean,SE,RQ,RQ Min,RQ Max,CT,Ct Mean,Ct SD,Delta Ct,Delta Ct Mean,Delta Ct SD,Delta Ct SE,Delta Delta Ct,Automatic Ct Threshold,Ct Threshold,Automatic Baseline,Baseline Start,Baseline End,Amp Status,Comments,Cq Conf,CQCONF,HIGHSD,OUTLIERRG,Tm1,Tm2,Tm3,Tm4
47,1,A1,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.357698440551758,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9588544573203085,N,Y,N,81.40960693359375,,,
48,2,A2,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,24.05980110168457,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,15,Amp,,0.9592687354496955,N,Y,N,81.40960693359375,,,
49,3,A3,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.012556076049805,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,16,Amp,,0.9592714462250367,N,Y,N,81.40960693359375,,,
50,4,A4,False,fla11fla12-1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.803699493408203,24.419523239135742,0.5669151544570923,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9671570584141241,N,Y,N,81.40960693359375,,,
This is the code that I have so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_excel ("2019-02-27_161601 AtWAKL8 different version expressions.xls", sheet_name='Results').fillna(0)
data.to_csv('df1' + '.csv', index=True)
df1 = pd.read_csv ("df1.csv")
You are having trouble with quoting.
grep is a better fit for .csv files rather than .xlsx
You are forking off a shell subprocess with a filename argument,
without correctly quoting the spaces in the filename.
It would be simplest to rename it, turning spaces into dashes,
e.g. 2019-02-27_161601-AtWAKL8-different-version-expressions.xls
As it stands, you are trying to grep the string "Position"
from a file named 2019-02-27_161601,
and from a 2nd file named AtWAKL8,
a 3rd named different, and so on,
which is unlikely to work.
An .xlsx spreadsheet is not the line-oriented
text format that grep expects.
You will be happier if you export or Save As .csv format within Excel,
or if you execute data.to_csv('expressions.csv')
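Once the data is in .csv form, the task in the question (start at the first line containing "Sample Name" and keep three columns) can be sketched as follows. This is only a sketch: the inline text is a shortened, hypothetical stand-in for the instrument export, and read_results_table is an illustrative helper name, not an existing function.

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical stand-in for the exported file.
raw_text = """Quantification Cycle Method,Ct,,,,
User Name,,,,,
,,,,,
Well,Well Position,Sample Name,Target Name,Task,CT
1,A1,WT1,AtTubulin,UNKNOWN,23.36
2,A2,WT1,AtTubulin,UNKNOWN,24.06
3,A3,fla11fla12-1,AtTubulin,UNKNOWN,23.80
"""


def read_results_table(text, marker="Sample Name",
                       columns=("Sample Name", "Target Name", "CT")):
    """Return the table that starts at the first line containing `marker`."""
    lines = text.splitlines()
    # Index of the first line that mentions the marker; that line is the header.
    header_idx = next(i for i, line in enumerate(lines) if marker in line)
    table = "\n".join(lines[header_idx:])
    # usecols keeps only the requested columns.
    return pd.read_csv(StringIO(table), usecols=list(columns))


results = read_results_table(raw_text)
print(results)
```

For a file on disk, the text could come from open(path).read(). The csv written in the question also carries an extra leading index column, which this approach ignores because only the named columns are kept.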

Pandas read_csv method can't get 'œ' character properly while using encoding ISO 8859-15

I am having some trouble using pandas to read a csv file that includes the special character 'œ'.
I've done some research, and it appears that this character was added in the ISO 8859-15 encoding standard.
I've tried specifying this encoding in the pandas read_csv method, but the character does not come through properly (I get a '☐' instead) in the resulting dataframe:
df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")
Does anyone know how I could get the correct 'œ' character (or, even better, the string 'oe') instead?
Thanks a lot :)
As a matter of fact, I've just tried writing out the dataframe that I get from read_csv with the ISO-8859-15 encoding (using the to_csv method with the "ISO-8859-15" encoding), and the special 'œ' character appears correctly in the resulting csv file:
df.to_csv(my_csv_full_path, sep=';', index=False, encoding="ISO-8859-15")
So it seems that pandas has read the special character in my csv file correctly but cannot display it within the dataframe...
Does anyone have a clue? I've worked around the problem by manually rewriting this special character before reading my csv with pandas, but that doesn't answer my question :(
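A minimal sketch of the 'oe' replacement asked about above, assuming the file really is ISO-8859-15 encoded. The sample bytes and words are made up for illustration. If the character still shows as a box, one possibility (not confirmed by the question) is that the display font or the actual file encoding is the issue rather than pandas; in Windows-1252, for example, 'œ' is stored as a different byte.

```python
from io import BytesIO

import pandas as pd

# Made-up sample, encoded the way the question assumes the file is encoded.
raw = "1;cœur\n2;sœur\n".encode("ISO-8859-15")

# In ISO-8859-15, 'œ' is the single byte 0xBD.
print("œ".encode("ISO-8859-15"))

df = pd.read_csv(BytesIO(raw), sep=";", header=None, encoding="ISO-8859-15")

# Replace the ligature with the plain two-letter string.
df[1] = df[1].str.replace("œ", "oe", regex=False)
print(df[1].tolist())
```

Printing [hex(ord(c)) for c in value] for a suspicious cell is a quick way to see which code point pandas actually stored.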

How to filter a CSV file without Pandas? (Best Substitute for Pandas in Pythonista)

I am trying to do some data analysis on Pythonista 3 (iOS app for python), however because of the C libraries of pandas it does not compile in the iOS device.
Is there any substitute for Pandas?
Would numpy be an option for data of type string?
The data set I have at the moment is the history of messages between my friends and me.
The whole history is in one csv file. Each row has the columns 'day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'
The goal of the analysis is to produce a report of our chat for the past year.
I want to be able to count the number of messages each friend sent. I also want to plot a histogram of the hours at which the messages were sent by each friend.
Then, I want to do some word counting individually and as a group.
In Pandas I know how to do that. For example:
import pandas as pd
df = pd.read_csv("messages.csv")
number_of_messages_friend1 = len(df[df.author_of_message == 'friend1'])
How can I filter a csv file without Pandas?
Since Pythonista does have numpy, you will want to look at recarrays, which are numpy's approach to this type of problem. The following worked out of the box in Pythonista for me:
import numpy as np
df=np.recfromcsv('messages.csv')
len(df[df.author_of_message==b'friend1'])
Depending on your data format, you may find that recfromcsv "just works", since it tries to guess data types, or you might need to customize things a bit. See genfromtxt for a number of options, such as explicitly specifying data types or using converters to turn string dates into datetime objects. recfromcsv is just a convenience wrapper around genfromtxt.
https://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html#
Once in recarray, many of the simple indexing operations work the same as in pandas. Note you may need to do string compares using b-prefixed strings (bytes objects), unless you convert to unicode strings, as shown above.
Use the csv module from the standard library to read the messages.
You could store it into a list of collections.namedtuple for easy access.
import csv
messages = []
with open('messages.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
    for row in reader:
        messages.append(row)
That gives you all the messages as a list of dictionaries.
Alternatively you could use a normal csv reader combined with a collections.namedtuple to make a list of named tuples, which are slightly easier to access.
import csv
from collections import namedtuple
Msg = namedtuple('Msg', ('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
messages = []
with open('messages.csv', newline='') as csvfile:
    msgreader = csv.reader(csvfile)
    for row in msgreader:
        messages.append(Msg(*row))
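With the messages loaded, the counts the question asks for can be sketched using only the standard library. The sample rows and the HH:MM time format are assumptions made for illustration; the real file may need different parsing.

```python
import csv
from collections import Counter
from io import StringIO

# Made-up rows standing in for messages.csv (no header row assumed).
sample = StringIO(
    "Mon,2019-01-07,09:15,friend1,hello\n"
    "Mon,2019-01-07,09:47,friend2,hi there\n"
    'Tue,2019-01-08,21:03,friend1,"see you, later"\n'
)
fields = ('day_of_the_week', 'date', 'time_of_message',
          'author_of_message', 'message_body')
messages = list(csv.DictReader(sample, fieldnames=fields))

# Number of messages per author (the pandas len(df[df.author == ...]) idea).
per_author = Counter(m['author_of_message'] for m in messages)

# Hour-of-day histogram for one author, assuming times look like HH:MM.
hours_friend1 = Counter(
    int(m['time_of_message'].split(':')[0])
    for m in messages
    if m['author_of_message'] == 'friend1'
)

print(per_author)
print(sorted(hours_friend1.items()))
```

The same Counter approach works for word counts, for example by splitting each message_body on whitespace and updating one Counter per author.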
Pythonista now has competition on iOS. The pyto app provides python 3.8 with pandas. https://apps.apple.com/us/app/pyto-python-3-8
