As a start, I want to compare the first two columns of two .csv files then write what is common in these files to an output file, say common.csv, then also write the differences in each file to different output files, say f1.csv and f4.csv.
So far I have tried using set() and difflib, and also creating lists from the two files and comparing the first two columns in each. This gave me the output for what is common, but not for what the differences are in each file when compared to the other. I have tried most of the posted solutions that seemed similar to my problem, but I am still stuck. Can someone please assist?
These are the headers in my files; I only want to compare the first two columns but write out the entire line to the output file.
fieldnames = ["Chromosome", "GenomicPosition", "ReferenceBase",
              "AlternateBase", "GeneName", "GeneID",
              "TrancriptID", "Varianteffect-Variantimpact",
              "Biotype", "TranscriptBiotype", "Referencebase",
              "Alternatebase", "Depth coverage"]
One solution is to use pandas, which is very powerful.
To convert csv <-> pandas dataframes:
import pandas as pd
df = pd.read_csv('csv_file.csv') # csv -> pandas
df.to_csv('csv_file.csv', index=False) # pandas -> csv
To compare pandas dataframes on columns, this post should point you in the right direction: https://stackoverflow.com/a/47107164/2667536
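As a sketch of that approach with toy data standing in for the two files (the key columns are assumed to be the first two fields from the header above; in practice df1 and df2 would come from pd.read_csv on the two input files, and the output filenames follow the question):

```python
import pandas as pd

# Toy stand-ins for the two input files; in practice these would come from
# pd.read_csv("file1.csv") and pd.read_csv("file2.csv") (hypothetical names).
key_cols = ["Chromosome", "GenomicPosition"]
df1 = pd.DataFrame({"Chromosome": ["chr1", "chr1", "chr2"],
                    "GenomicPosition": [100, 200, 300],
                    "GeneName": ["A", "B", "C"]})
df2 = pd.DataFrame({"Chromosome": ["chr1", "chr2", "chr3"],
                    "GenomicPosition": [100, 300, 400],
                    "GeneName": ["A", "C", "D"]})

# An outer merge on the key columns with indicator=True labels each row as
# present in 'both', only the left frame, or only the right frame, while
# keeping all the other columns from each file.
merged = df1.merge(df2, on=key_cols, how="outer", indicator=True,
                   suffixes=("_f1", "_f2"))

common = merged[merged["_merge"] == "both"].drop(columns="_merge")
only_f1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
only_f2 = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

common.to_csv("common.csv", index=False)
only_f1.to_csv("f1.csv", index=False)
only_f2.to_csv("f4.csv", index=False)
```

The suffixes keep the non-key columns from the two files apart, so the entire line from each file survives into the output.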
Related
I know this sounds silly, but is it possible to read a CSV file containing multiple columns and combine all the data into one column? Let's say I have a CSV file with 6 columns and they have different delimiters. Is it possible to read these files, but spit out the first 100 rows into one column, without specifying a delimiter? My understanding is that this isn't possible if using pandas.
I don't know if this helps, but to add context to my question, I'm trying to use Treeview from Tkinter to display the first 100 rows of a CSV file. The Treeview window should display this data as 1 column if a delimiter isn't specified. Otherwise, it will automatically split the data based on a delimiter from the user input.
This is the data I have:
This should be the result:
Pandas isn't the only way to read a CSV file. There is also the built-in csv module in the Python standard library, as well as the basic built-in function open, which will work just as well. Both of these methods can generate single rows of data as your question indicates.
Using open function
filepath = "/path/to/file.csv"
with open(filepath, "rt", encoding="utf-8") as fd:
    header = next(fd)
    for row in fd:
        # row is a string of all the data for a single line, e.g.:
        # "Information,44775.4541667,MicrosoftWindowsSecurity,16384..."
        # process it here; you can break at any time to stop reading.
        ...
or using the csv module:
import csv

with open("/path/to/file.csv", "rt", encoding="utf8", newline="") as fd:
    reader = csv.reader(fd, delimiter=",")
    header = next(reader)
    for row in reader:
        # this time row is a list split by the delimiter, which defaults
        # to a comma but can be changed in the call to csv.reader
        ...
You can use

with open('file.csv') as f:
    data = f.readlines()

to read the file line by line.
As other answers have explained, there are various ways to read the first n lines of text from a file. But if you insist on using pandas, there is a trick you can use.
Find a character that will never appear in your text and pass it to read_csv() as a dummy delimiter, so that all the text is read as one column. Use the nrows parameter to control the number of lines read:
pd.read_csv("myfile.csv", sep="~", nrows=100)
I have a question about matching data.
I have two Excel files. One is an extract of the database, which is updated once in a while and doesn't hold all records because it is not linked to the source application where the information is stored.
The other is an extract of a system where everyone puts in information.
The two Excel files have a lot of ID numbers. My teacher asked me to match the data so I can see which ones are missing. He told me to use a VLOOKUP, but that doesn't make sense. Is there an easier way to match data from two Excel sheets?
Thanks for your time in advance.
I recommend using the pandas library with concat.
import glob
import pandas as pd

# specify the path to the Excel files
path = "C:/downloads"

# Excel files in the path
file_list = glob.glob(path + "/*.xlsx")

# read each Excel file we want to merge into a pandas DataFrame
excl_list = []
for file in file_list:
    excl_list.append(pd.read_excel(file))

# concatenate all DataFrames in the list into a single new DataFrame
excl_merged = pd.concat(excl_list, ignore_index=True)

# export the DataFrame to an Excel file with the specified name
excl_merged.to_excel('merged_excel.xlsx', index=False)
You can also use pd.merge() once you have your two files opened (df1 = pd.read_excel(file1); df2 = pd.read_excel(file2)) to merge them on specific column names, maybe id in your case: df1.merge(df2, left_on='lkey', right_on='rkey'). The suffixes parameter lets you label overlapping columns according to the file they came from.
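To see specifically which ID numbers are missing from one extract, a minimal sketch with isin may be enough (the column name id and the toy data are assumptions; in practice the frames would come from pd.read_excel on your two files):

```python
import pandas as pd

# Hypothetical stand-ins for the two extracts; the key column name "id"
# is an assumption about your sheets.
db_extract = pd.DataFrame({"id": [1, 2, 3, 5]})
system_extract = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6]})

# Rows of the system extract whose id does not appear in the database
# extract, i.e. the records missing from the database.
missing = system_extract[~system_extract["id"].isin(db_extract["id"])]
print(missing["id"].tolist())
```

This is essentially what a VLOOKUP would tell you, but the result is a DataFrame you can inspect or export directly.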
When importing a .csv, I saved the result as a pandas DataFrame as follows:
csv_dataframe= pd.DataFrame(pd.read_csv(r'filepath.csv', delimiter=';', encoding='iso-8859-1', decimal=',', low_memory=False))
However, when I call a specific column that has numbers and letters, it ignores some of the characters or adds others. For example, in column 'A', there are elements similar to this:
'ABC123456789'
'123456789'
'1234567'
and when I call:
csv_dataframe['A']
The results are:
'ABC123456789'
'1234567342'
'3456475'
So, some of the values are correct but, in others, it changes the values, adding or removing elements. In some cases it even alters their length.
Is there some setting in the .csv file itself that changes how other programs read it? That is, could the .csv file mask values in a way that isn't noticeable when opening it? Or did I make a mistake when calling the file/functions?
Thank you very much.
Try removing pd.DataFrame():
pd.read_csv already returns a DataFrame.
This should work:
csv_dataframe= pd.read_csv(r'filepath.csv', delimiter=';', encoding='iso-8859-1', decimal=',', low_memory=False)
It might fix your issue, other than that, I'm willing to bet the issue is in the CSV.
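If the values still come out altered, one cause worth ruling out is pandas inferring numeric types for a column that mixes digits and letters; passing dtype=str keeps every value as the literal text from the file. A self-contained sketch (the sample data mimics column 'A' from the question; the filename is made up):

```python
import csv
import pandas as pd

# Write a tiny ';'-delimited CSV so the example is self-contained; in the
# question the data comes from an existing file instead.
with open("sample.csv", "w", newline="", encoding="iso-8859-1") as fd:
    writer = csv.writer(fd, delimiter=";")
    writer.writerow(["A"])
    for value in ["ABC123456789", "123456789", "1234567"]:
        writer.writerow([value])

# dtype=str forces every value to be kept as the literal text from the
# file, so no numeric conversion can alter it.
csv_dataframe = pd.read_csv("sample.csv", delimiter=";",
                            encoding="iso-8859-1", dtype=str)
print(csv_dataframe["A"].tolist())
```

If the values are wrong even with dtype=str, the problem is in the CSV itself rather than in pandas.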
I have a problem and found many related questions asked here and read them all, but still can't solve it. So far I haven't gotten an answer.
I have two files, one .csv and the other .xlsx. They have a different number of rows and columns. I would like to merge these two according to filenames. Very simplified, the two files look as follows.
The CSV file:
The Excel file:
First I converted them to pandas DataFrames:
import pandas as pd

df1 = pd.read_csv('mycsv.csv')
df2 = pd.read_excel('myexcel.xlsx', sheet_name=0)
To merge the two files on the same column, I remove the whitespace in the column names of df2 using the first line below, then merge them and write the merged DataFrame to a CSV file.
df2.columns=df2.columns.str.replace(' ', '')
df=pd.merge(df1, df2, on="filename")
df.to_csv('myfolder \\merged_file.csv', sep="\t ")
When I check my folder, I see that merged_file.csv exists, but when I open it there is no space between the columns and values. I want a nice, normal CSV or Excel look, like my example files above. Just to make sure I had tried everything, I also converted the Excel file to a CSV file and then merged the two CSVs, but the merged data is still written without spaces. Again, the above files are very simplified, but my real merged data looks like this:
Finally figured it out. I am putting the answer here in case anyone else makes the same mistake. Just remove the sep="\t" and use the line below instead:
df.to_csv('myfolder \\merged_file.csv')
I just realized the two CSV files were comma-separated, so using a tab delimiter for the merged output didn't work.
I have this text file that has lists in it. How would I search for an individual list? I have tried using loops to find it, but every time it gives me an error since I don't know what to search for.
I tried using an if statement to find it, but it returns -1.
Thanks for the help.
I was doing research on this last night. You can use pandas for this. See here: Load data from txt with pandas. One of the answers talks about lists in text files.
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["Name", "b", "c", "etc."]
Add sep=" " in your code, leaving a blank space between the quotes, so pandas can detect spaces between values and sort the data into columns. data.columns is for naming your columns.
With a JSON or XML format, text files become more searchable. In my research I've decided to go with an XML approach. Here is a link to a blog that explains how to use Python with XML: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe.
If you want to search the data frame try:
import pandas as pd

txt_file = r'C:\path\to\your\txtfile.txt'
df = pd.read_table(txt_file, sep=",")
row = df.loc[df['Name'] == 'bob']
print(row)
Depending on how your text file is formatted, this will not work for every text file. The idea of a DataFrame in pandas gives the process a repeatable, CSV-like structure that makes the results testable. Again, I recommend converting to a JSON or XML format before bringing pandas DataFrames into your solution; you can then produce a consistent result that is testable too.