Parsing data from an unstructured Excel file using Python

My objective is to parse an Excel file and then find two columns, within certain rows, that have certain headings.
The file contains blocks of strings/numbers that are separated from each other by blank rows and blank columns.
I am using pandas in Python and have not yet managed to handle the empty or invalid cells between the blocks.
Using
import os
import pandas as pd
..
my_df = pd.read_excel(my_file, error_bad_lines=False)
did not resolve the problem: parsing stops as soon as it reaches the first empty zone after the first block.
Most of the tutorials I watched assume that the Excel files to be parsed are neatly filled from top to bottom, with at most some NaN cells in the middle.
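One possible approach, sketched here under assumptions rather than as a tested solution for the asker's file: load the sheet without treating any row as a header (for example with pd.read_excel(my_file, header=None)), then search the raw grid for the row that contains the wanted headings and keep the cells below it until the first blank row. The helper names find_columns and extract_block, the headings "name" and "score", and the in-memory demo grid are all hypothetical; the grid only mimics what a header-less read of such a sheet might look like.

```python
import numpy as np
import pandas as pd


def find_columns(row, headings):
    """Return the column labels where each heading occurs in `row`, or None."""
    cols = []
    for heading in headings:
        matches = row.index[(row == heading).to_numpy()]
        if len(matches) == 0:
            return None
        cols.append(matches[0])
    return cols


def extract_block(raw, headings):
    """Return the cells under the first row holding all `headings`,
    stopping at the first row that is blank in those columns."""
    for pos in range(len(raw)):
        cols = find_columns(raw.iloc[pos], headings)
        if cols is None:
            continue
        body = raw.iloc[pos + 1:][cols]
        blank = body.isna().all(axis=1).to_numpy()
        if blank.any():
            body = body.iloc[:int(np.argmax(blank))]  # cut at first blank row
        body = body.reset_index(drop=True)
        body.columns = headings
        return body
    raise ValueError("no row contains all requested headings")


# Hypothetical grid standing in for pd.read_excel(my_file, header=None):
# two blocks separated by a blank row, with a blank column in between.
raw = pd.DataFrame([
    ["Title", None, None, None],
    [None, None, None, None],
    ["id", "name", None, "score"],
    [1, "a", None, 3.5],
    [2, "b", None, 4.0],
    [None, None, None, None],
    ["other", "block", None, None],
    [7, 8, None, None],
])

block = extract_block(raw, ["name", "score"])
print(block)
```

The blank-row cutoff is a design choice: it assumes a block never contains a row that is empty in both selected columns.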

Related

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
dtype = {'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (L9 appears in a serif font, italicized, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
[Screenshot: csv column to pandas df with $$]
I cannot replicate this using the same values as you in a mock CSV file.
Are you sure that the formatting based on the $ symbol is not occurring in wherever you are rendering your dataframe values? Have you checked to see if the data in the dataframe is what you expect or are you just rendering it externally?
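A quick way to check this, as a minimal sketch: read the same values from an in-memory CSV (the csv_text string below is a stand-in for MASTER~1.CSV) and look at the raw Python strings instead of the notebook's HTML table. If the strings are intact, the symptom is most likely the notebook rendering text between $ signs as math. pandas has a display option, display.html.use_mathjax, that can be switched off; whether that fully fixes the display in a given notebook front end is an assumption to verify.

```python
import io

import pandas as pd

# Stand-in for the real file, using the values quoted in the question.
csv_text = "SCRAMBLE_ID\nAL9LLL677\nAL9$AM657\n$L9$$4440\n#L9$306A1\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"SCRAMBLE_ID": str})

# Inspect the stored strings directly, bypassing any HTML rendering.
ids = df["SCRAMBLE_ID"].tolist()
print(ids)

# Ask pandas not to mark its HTML tables for MathJax processing.
pd.set_option("display.html.use_mathjax", False)
```

If print(ids) shows every $ and #, no substitution before read_csv is needed; only the display is affected.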

How to import a column from an Excel file?

This is my excel file:
Here I read the entire column A of the Excel sheet named "Lazy_eight", but the problem is that I have different sheets in which column A has a different number of elements. So I want to import only the numbers without specifying the length of the column vector.
I use the function readmatrix with the following syntax in order to read the entire column:
p_time = readmatrix('Input_signals.xlsx','Sheet','Lazy_eight','Range','A:A')
I get this in matlab workspace:
So I would like to give readmatrix only the first element of the column I want to import and have it stop at the last element, without specifying the coordinate of the last element, in order to avoid the NaN values that you can see in the last image. I want to import only the numbers, without the NaN values.
I cannot specify the first and last elements (for example 'Range','A3:A13') because in every sheet column A (like the other columns) has a different number of elements.
I've solved the problem by using the rmmissing function, which removes NaN values from an array.

comparing multiple tab delimited csv files in python

As a start, I want to compare the first two columns of two .csv files, write what is common to both files to an output file, say common.csv, and also write the rows that differ in each file to separate output files, say f1.csv and f4.csv.
So far I have tried using set() and difflib, and also creating lists from the two files and then comparing the first two columns in each file. This gave me the output for what is common, but not for what differs in each file when compared to the other. I have tried most of the posted solutions to problems that seemed similar to mine, but I am still stuck. Can someone please assist?
These are the headers in my files. I only want to compare the first two columns, but write out the entire line to the output file.
fieldnames = (["Chromosome" ,"GenomicPosition", "ReferenceBase",
"AlternateBase", "GeneName", "GeneID",
"TrancriptID", "Varianteffect-Variantimpact",
"Biotype", "TranscriptBiotype" , "Referencebase",
"Alternatebase", "Depth coverage"])
One solution is to use pandas, which is very powerful.
To convert csv <-> pandas dataframes:
import pandas as pd
df = pd.read_csv('csv_file.csv') # csv -> pandas
df.to_csv('csv_file.csv', index=False) # pandas -> csv
To compare pandas dataframes on columns, this post should point you in the right direction: https://stackoverflow.com/a/47107164/2667536
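As a rough sketch of that comparison (not necessarily what the linked answer does): a left merge on the two key columns with indicator=True marks, for each row of one file, whether its key pair also occurs in the other file. The helper rows_shared and the tiny in-memory tab-delimited samples are hypothetical; the key column names come from the header list in the question, and the real files would be read with pd.read_csv(path, sep="\t").

```python
import io

import pandas as pd

KEYS = ["Chromosome", "GenomicPosition"]  # the first two columns


def rows_shared(left, right, keys):
    """Boolean mask: True where a row's key columns in `left` also occur in `right`."""
    right_keys = right[keys].drop_duplicates()
    merged = left[keys].merge(right_keys, on=keys, how="left", indicator=True)
    return (merged["_merge"] == "both").to_numpy()


# Hypothetical tab-delimited samples standing in for the two input files.
text1 = "Chromosome\tGenomicPosition\tReferenceBase\nchr1\t100\tA\nchr1\t200\tC\nchr2\t300\tG\n"
text2 = "Chromosome\tGenomicPosition\tReferenceBase\nchr1\t100\tA\nchr1\t200\tT\nchr3\t400\tG\n"
f1 = pd.read_csv(io.StringIO(text1), sep="\t")
f2 = pd.read_csv(io.StringIO(text2), sep="\t")

mask1 = rows_shared(f1, f2, KEYS)
mask2 = rows_shared(f2, f1, KEYS)

common = f1[mask1]    # full lines from file 1 whose keys appear in both files
only_f1 = f1[~mask1]  # full lines whose keys appear only in file 1
only_f2 = f2[~mask2]  # full lines whose keys appear only in file 2

# To write the results, something like:
# common.to_csv("common.csv", sep="\t", index=False)
print(common, only_f1, only_f2, sep="\n\n")
```

Note that common takes its full lines from the first file; rows with matching keys but different values in the other columns (such as chr1/200 above) are still treated as common.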

Problem when importing table from pdf to python using tabula

When importing data from pdf using tabula with Python, in some cases, I obtain two or more columns merged in one. It does not happen with all the files obtained from the same pdf.
In this case, this is the code used to read the pdf:
from tabula import wrapper
tables = wrapper.read_pdf("933884 cco Saupa 1.pdf",multiple_tables=True,pages='all')
i=1
for table in tables:
    table.to_excel('output'+str(i)+'.xlsx',index=False)
    i=i+1
For example, when I print the first item of the dataframe obtained from one of these excel files, named "output_pd":
print (output_pd[0][1])
I obtain:
76) 858000015903708 77) 858000013641969 78)
The five numbers are in a single column, so I cannot treat them individually.
Is it possible to improve the data handling in these cases?
You could try manually editing the data in excel. If you use text to columns under the data tab in excel it allows you to split one column into multiple columns without too much work, but you would need to do it for every excel file which could be a pain.
By iterating over each item of each column of each dataframe in the list returned by tabula's wrapper.read_pdf(file) (tables, in this case), it is possible to obtain clean data.
In this case:
prueba = []
i = 0
for table in tables:
    for columna in table.columns:
        for item in (str(table[columna]).split(" ")):
            if "858" in str(item):
                prueba.append(item[0:15])
print (prueba[0:5])
results in:
['858000019596025', '858000015903707', '858000013641975', '858000000610864', '858000013428853']
But tabula.wrapper.read_pdf does not read the whole initial PDF: 2 values are left out on the last page. So it is still necessary to make a small manual edit.
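A somewhat stricter variant of the same idea, offered as a sketch: instead of splitting on spaces and slicing the first 15 characters, a regular expression can pull the identifiers out of each merged cell. It assumes the identifiers are always 15 digits starting with 858, which is inferred from the "858" filter and the item[0:15] slice above, not confirmed by the PDF. The function extract_ids and the one-cell demo table are hypothetical; in practice the list returned by read_pdf would be passed in.

```python
import re

import pandas as pd

# Assumption: every identifier is exactly 15 digits and starts with 858.
ID_PATTERN = re.compile(r"\b858\d{12}\b")


def extract_ids(tables):
    """Collect every 15-digit 858... identifier from all cells of all tables."""
    ids = []
    for table in tables:
        for column in table.columns:
            for cell in table[column].astype(str):
                ids.extend(ID_PATTERN.findall(cell))
    return ids


# One-cell demo table holding the merged value shown in the question.
demo = pd.DataFrame({0: ["76) 858000015903708 77) 858000013641969 78)"]})
found = extract_ids([demo])
print(found)
```

Working cell by cell avoids parsing the printed representation of a whole column, which pandas may truncate for long columns.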

Writing pandas data frame to csv but no space between columns- sep'\t' argument is ignored in Python 3

I have a problem and found many related questions asked here and read them all, but I still can't solve it. So far I haven't found an answer.
I have two files; one is .csv and the other is .xlsx. They have a different number of rows and columns. I would like to merge these two according to their filenames. Very simplified, the two files look as follows:
The csv file;
the excel file;
First I converted them to pandas data frames:
import pandas as pd
import csv,xlrd
df1 = pd.read_csv('mycsv.csv')
df2=pd.read_excel('myexcel.xlsx', sheet_name=0)
To merge the two files on the same column, I remove the white space in the column names of df2 using the first line below, then I merge them and write the merged data frame to a csv file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t ")
When I check my folder, I see that merged_file.csv exists, but when I open it there is no space between columns and values. I want to see the normal csv or Excel look, like my example files above. Just to make sure I had tried everything, I also converted the Excel file to a csv file and then merged the two csv files, but the merged data is still written without spaces. Again, the above files are very simplified, but my real merged data look like this;
Finally figured it out. I am putting the answer here in case anyone else makes the same mistake as me. Just remove the sep="\t" and use the line below instead:
df.to_csv('myfolder \\merged_file.csv')
I just realized the two csv files were comma-separated, and using a tab delimiter for the merge didn't work.
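To see what the separator actually changes, here is a small sketch with made-up data frames (the column names other than filename, and all values, are hypothetical). It produces the merged output twice as text, once with the default comma and once with sep="\t", so the two can be compared directly. A likely explanation for the "no space between columns" symptom is that spreadsheet programs tend to expect commas in a file named .csv, though that depends on the program and its locale settings.

```python
import pandas as pd

# Made-up stand-ins for the two input files.
df1 = pd.DataFrame({"filename": ["a.txt", "b.txt"], "size": [10, 20]})
df2 = pd.DataFrame({"file name": ["a.txt", "b.txt"], "owner": ["ann", "bob"]})

# Same clean-up and merge as in the question.
df2.columns = df2.columns.str.replace(" ", "")
merged = pd.merge(df1, df2, on="filename")

# With no path given, to_csv returns the text instead of writing a file.
comma_text = merged.to_csv(index=False)            # default separator: ","
tab_text = merged.to_csv(sep="\t", index=False)    # tab-separated values

print(comma_text)
print(tab_text)
```

If tab-separated output is really wanted, saving it with a .tsv or .txt extension and importing it explicitly may be less confusing than naming it .csv.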
