Can't correctly read .csv file - python-3.x

When importing a .csv, I saved the result as a pandas DataFrame as follows:
csv_dataframe = pd.DataFrame(pd.read_csv(r'filepath.csv', delimiter=';', encoding='iso-8859-1', decimal=',', low_memory=False))
However, when I call a specific column that has numbers and letters, it ignores some of the characters or adds others. For example, in column 'A', there are elements similar to this:
'ABC123456789'
'123456789'
'1234567'
and when I call:
csv_dataframe['A']
The results are:
'ABC123456789'
'1234567342'
'3456475'
So some of the values are correct but, for others, it changes the values, adding or removing characters; in some cases it even alters their length.
Is there some way the .csv file itself could change how other programs read it? That is, is there some option in the .csv file that masks values and isn't noticeable when opening it? Or did I make a mistake when calling the file/functions?
Thank you very much.

Try removing the pd.DataFrame() wrapper; pd.read_csv already returns a DataFrame.
This should work:
csv_dataframe = pd.read_csv(r'filepath.csv', delimiter=';', encoding='iso-8859-1', decimal=',', low_memory=False)
It might fix your issue; other than that, I'm willing to bet the problem is in the CSV itself.
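If that alone doesn't fix it, a quick way to check whether the altered values are really in the file is to print a few raw lines and compare them against what pandas reports (a minimal sketch; 'filepath.csv' is the placeholder path from the question):
with open('filepath.csv', encoding='iso-8859-1') as f:
    for _ in range(5):
        # repr() shows the raw text of each row, bypassing pandas entirely
        print(repr(f.readline()))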

Related

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (the 'L9' appears in an italicized serif font, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with the resulting df (Jupyter notebook) below.
[screenshot: csv column to pandas df with $$]
I cannot replicate this using the same values as you in a mock CSV file.
Are you sure the formatting based on the $ symbol is not occurring wherever you are rendering your dataframe values? Have you checked whether the data in the dataframe is what you expect, or are you just rendering it externally?
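For example, Jupyter notebooks render text between $ signs as math, which would explain the italic serif 'L9' and the vanishing $ characters. A minimal check (using the file and column from the question) is to print the stored strings directly instead of displaying the dataframe:
import pandas as pd

df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})
# repr() shows each stored value verbatim, bypassing any notebook
# rendering that interprets $...$ as math markup
for value in df['SCRAMBLE_ID'].head():
    print(repr(value))
If the repr output matches the CSV, the data is intact and only the display is misleading.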

How to ingest a comma-separated CSV with unquoted commas in some columns in ADF

Hi everyone, I have not seen this particular issue pop up; I've seen a few related ones, but none address this.
I have very big CSVs (up to 8 GB) with comma as the delimiter, free text in some columns, and commas in some of that free text.
As requirements, I cannot generate or ask for the CSVs to be generated again with another delimiter, and I have to use Data Flow to achieve this.
I would like to learn how to deal with text such as:
A, some text 2132,ALL, free text 00001,2020-11-29 - 2020-12-05
A, some text 2132,ALL, free text\,more text 0002,2018-12-09 - 2018-12-15
A, some text 2132,ALL, free text\,more text 00003,2018-12-09 - 2018-12-15
Things I have tried:
I tried making both simple Data Flows and Copy Activities to see whether the parser handled the rows properly, which it did not, no matter what combination of CSV dataset configurations I tried.
I also tried reading the whole CSV as one column and writing it to a file with the "," regexed out, but this "loses" the commas from the CSV, leaving spaces as the only delimiter, so I am back to square one with no proper delimiter, since the free text itself contains spaces and would break.
Actually, Data Factory can't deal with a CSV file whose column data contains the same character as the column delimiter; it causes the schema/columns to go missing.
Even with Data Flow, Data Factory will always take the first row as the schema, inferring the columns from the number of delimiters.
As you said, you can't have the CSVs regenerated with another delimiter; then I'm afraid we can't achieve this in Data Factory.
What I did to make this work (I did it twice with different results, so I am still missing things, but one of the attempts worked):
Create a dataset with no delimiter, so the whole CSV row is read as one column. Use the Data Flow replace function there to make the problematic string disappear. Write to disk as CSV.
Then read that output as CSV with the proper delimiter, do whatever needs doing with the data, and write to disk as parquet. That appears to work.
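For reference, the same two-pass idea can be sketched in plain Python outside ADF (assuming, as in the samples above, that the stray commas are escaped as '\,'; the file names are hypothetical):
import io
import pandas as pd

# Pass 1: treat each row as one string and drop the escaped commas.
# For an 8 GB file this should be streamed line by line rather than
# read into memory at once.
with open('big_input.csv', encoding='utf-8') as src:
    cleaned = src.read().replace('\\,', ' ')

# Pass 2: parse the cleaned text with the real comma delimiter,
# then write it out as parquet.
df = pd.read_csv(io.StringIO(cleaned), header=None)
df.to_parquet('output.parquet')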

Comparing multiple tab-delimited csv files in Python

As a start, I want to compare the first two columns of two .csv files then write what is common in these files to an output file, say common.csv, then also write the differences in each file to different output files, say f1.csv and f4.csv.
So far I have tried using set(), difflib, and also taking the two files, creating lists from them, and then comparing the first two columns of each. This gave me the output for what is common, but not for the differences in each file when compared to the other. I have tried most of the posted solutions that seemed similar to my problem, but I am still stuck. Can someone please assist?
These are the headers in my files; I only want to compare the first two columns but write out the entire line to the output file.
fieldnames = (["Chromosome" ,"GenomicPosition", "ReferenceBase",
"AlternateBase", "GeneName", "GeneID",
"TrancriptID", "Varianteffect-Variantimpact",
"Biotype", "TranscriptBiotype" , "Referencebase",
"Alternatebase", "Depth coverage"])
One solution is to use pandas, which is very powerful.
To convert csv <-> pandas dataframes:
import pandas as pd
df = pd.read_csv('csv_file.csv') # csv -> pandas
df.to_csv('csv_file.csv', index=False) # pandas -> csv
To compare pandas dataframes on columns, this post should point you in the right direction: https://stackoverflow.com/a/47107164/2667536
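As a concrete sketch of that approach (assuming both inputs are tab-delimited and share the header listed above; 'file1.csv' and 'file2.csv' stand in for the real input names):
import pandas as pd

key = ['Chromosome', 'GenomicPosition']  # first two columns to compare on

df1 = pd.read_csv('file1.csv', sep='\t')
df2 = pd.read_csv('file2.csv', sep='\t')

# Rows whose first two columns appear in both files
df1.merge(df2[key].drop_duplicates(), on=key, how='inner') \
    .to_csv('common.csv', sep='\t', index=False)

# Rows unique to each file: indicator=True tags non-matching rows 'left_only'
left = df1.merge(df2[key].drop_duplicates(), on=key, how='left', indicator=True)
left[left['_merge'] == 'left_only'].drop(columns='_merge') \
    .to_csv('f1.csv', sep='\t', index=False)

right = df2.merge(df1[key].drop_duplicates(), on=key, how='left', indicator=True)
right[right['_merge'] == 'left_only'].drop(columns='_merge') \
    .to_csv('f4.csv', sep='\t', index=False)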

Writing pandas data frame to csv but no space between columns - sep='\t' argument is ignored in Python 3

I have a problem and found many related questions asked here and read them all, but still can't solve it. So far I haven't gotten any answer.
I have two files, one .csv and the other .xlsx. They have different numbers of rows and columns. I would like to merge the two according to filenames. Very simplified, the two files look as follows:
[screenshots of the csv file and the excel file]
First I converted them to pandas data frames:
import pandas as pd
import csv, xlrd
df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheet_name=0)
To merge the two files on the same column, I remove the whitespace from the column names in df2 using the first line below; then I merge them and write the merged data frame to a csv file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t ")
When I check my folder, I see that merged_file.csv exists, but when I open it there is no space between the columns and values. I want a nice, normal csv or Excel layout, like my example files above. Just to make sure I tried everything, I also converted the Excel file to a csv and then merged the two csv files, but the merged data is still written without spaces. Again, the above files are very simplified; my real merged data looks like this: [screenshot of the merged output]
Finally figured it out. I am putting the answer here just in case anyone else makes the same mistake as me. Just remove the sep="\t" and use the line below instead:
df.to_csv('myfolder \\merged_file.csv')
Just realized the two csv files were comma separated, so using a tab delimiter for the merge didn't work.
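In short, the delimiter passed to to_csv has to match the one used when reading (or opening) the file again. A minimal illustration:
import pandas as pd

df = pd.DataFrame({'filename': ['a', 'b'], 'value': [1, 2]})

df.to_csv('merged_file.csv', index=False)    # default comma delimiter
pd.read_csv('merged_file.csv')               # reads back cleanly

df.to_csv('merged_file.tsv', sep='\t', index=False)  # tab delimiter on write...
pd.read_csv('merged_file.tsv', sep='\t')             # ...needs the same sep on read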

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5/6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set and check, for each row, whether it's already in the set, and change what needs changing.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using a sqlite database for storing the data, but I'm unsure which way would be most efficient.
I would use numpy.unique():
import numpy as np
data = np.vstack((np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str),
                  np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)))
# this stacks both arrays on top of each other, creating one giant array;
# delimiter="," and dtype=str are needed because np.loadtxt defaults to
# whitespace-separated floats, which would fail on a typical CSV
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
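If performance becomes an issue, or you only want the rows that are new this month, a pandas variant of the same idea may fit better (a sketch, assuming both files share a header; 'new_entries.csv' is a hypothetical output name):
import pandas as pd

old = pd.read_csv('path/to/file1.csv').drop_duplicates()
new = pd.read_csv('path/to/file2.csv').drop_duplicates()

# Keep only this month's rows that did not appear in last month's file;
# indicator=True tags each row with where it was found
merged = new.merge(old, how='left', indicator=True)
new_entries = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
new_entries.to_csv('new_entries.csv', index=False)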
