Removing duplicates between multiple large CSV files - python-3.x

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5 to 6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set, check for each row whether it's already in the set, and change what needs changing.
Now I've received the second file, and it contains a lot of duplicates that are also in the previous file.
I'm trying to find an efficient solution to remove duplicates both within a file and between the different files. In the end I want a list (table or CSV file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using an SQLite database to store the data. But I'm unsure which way would be most efficient.
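For concreteness, a minimal sketch of the set-based approach described above (my own illustration; the file names are hypothetical, and it assumes the row keys fit in memory):

import csv

seen = set()  # rows already known from previous months

# seed the set with last month's cleaned output (hypothetical file name)
with open("cleaned_previous.csv", newline="") as f:
    for row in csv.reader(f):
        seen.add(tuple(row))

# stream this month's file and keep only rows not seen before
with open("new_month.csv", newline="") as src, \
     open("new_entries.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)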

I would use numpy.unique():
import numpy as np
# loadtxt needs delimiter="," for CSV; dtype=str handles non-numeric columns
file1 = np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str)
file2 = np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)
# stack both arrays on top of each other, creating one giant array
data = np.vstack((file1, file2))
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
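An equivalent pandas sketch (my addition, not part of the original answer): it also loads everything into memory, but handles mixed column types without extra dtype handling:

import pandas as pd

# stack both monthly files and drop duplicate rows
data = pd.concat([pd.read_csv("path/to/file1.csv"),
                  pd.read_csv("path/to/file2.csv")])
data = data.drop_duplicates()
data.to_csv("deduplicated.csv", index=False)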

Related

How to grep csv documents to APPEND info and keep the old data intact?

I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it manually in multiple steps, perhaps even with some convoluted procedure such as copy-pasting to multiple files. What matters to me is reliability, and most importantly the ability to process very LARGE CSV files with more than 1M lines.
It should also be noted that the lists I will "grep" against to fill in some of the columns will usually affect at most 20% of the total CSV rows, meaning the remaining 80% must stay intact and, if possible, not even be displaced from their current order.
I would also like to note that I will be using a program called EmEditor rather than spreadsheet software like Excel, due to its speed and the fact that Excel simply cannot handle large CSV files.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, and clutching my head in frustration.
Filter all good emails with EmEditor's Filter feature. Open Advanced Filter; next to the Add button is Add Linked File. Add the good_emails.txt file, set it to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same for the bad emails, changing the column value to n. Follow the same steps with likes_banana.txt to change the last column values to the correct string.
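For those without EmEditor, a scripted equivalent (my own sketch, not part of the original answer) that streams the CSV once and never disturbs row order; file and column names follow the example above:

import csv

def load_set(path):
    # one email per line, as in the lists above
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

good = load_set("good_emails.txt")
bad = load_set("bad_emails.txt")
banana = load_set("likes_banana.txt")

with open("huge_database.csv", newline="") as src, \
     open("huge_database_out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["email"] in good:
            row["check_result"] = "y"
        elif row["email"] in bad:
            row["check_result"] = "n"
        if row["email"] in banana:
            row["favourite_fruit"] = "banana"
        writer.writerow(row)  # untouched rows pass through unchanged, in order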

How can I filter all values in a CSV spreadsheet that don't match one of hundreds of values from another spreadsheet?

I'm using Google Sheets and Google Colab together and trying to clean up the data I've downloaded as a CSV file. The problem I'm facing is that I want to filter out all results that don't match one of the 100+ possible group names, which I've grabbed from another spreadsheet and currently have stored in an array. I think there are one or two other filters I'll want to apply, but those only have four or five possible values each.
I succeeded using Pandas isin()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
Specifically I used something like the example here:
titanic[titanic["Pclass"].isin([2, 3])]
I understand isin() provides you with booleans telling you whether each value is in the array. Used like the above, it becomes a filter of sorts, keeping only the rows that match the items in the array.
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
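Applied to the situation above, the pattern looks roughly like this (my own sketch; the column name group_name and the file paths are hypothetical):

import pandas as pd

df = pd.read_csv("downloaded_data.csv")           # the data to clean up
groups = pd.read_csv("group_names.csv")["name"]   # the 100+ allowed group names

# keep only the rows whose group appears in the allowed list
filtered = df[df["group_name"].isin(groups)]
filtered.to_csv("cleaned_data.csv", index=False)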

Specify rows to read in from excel using pd.read_excel

I have a large Excel file and want to select specific rows (not a continuous block) and columns to read in. With columns this is easy; is there a way to do this for rows, or do I need to read in everything and then delete all the rows I don't want?
Consider an Excel file with this structure:
,CW18r4_aer_7,,,,,,,
,Vegetation ,,,,,,,
Date,A1,A2,B1,B2,C1,C2,C3,C4
1/7/86,3.80,8.02,7.94,9.81,9.82,4.19,3.88,0.87
2/7/86,0.50,2.02,5.26,3.70,8.59,8.61,9.86,3.27
3/7/86,4.75,3.88,0.46,5.95,9.45,9.62,4.33,1.63
4/7/86,7.64,6.93,2.71,9.96,1.25,0.35,1.84,1.02
5/7/86,3.33,8.24,7.36,7.86,0.43,2.32,2.18,1.91
6/7/86,1.96,1.78,7.45,2.28,5.27,9.94,0.22,2.94
7/7/86,4.67,8.41,1.49,5.48,5.46,1.39,1.85,7.71
8/7/86,8.07,5.60,4.23,3.93,3.92,9.09,9.90,2.15
9/7/86,7.00,5.16,6.10,8.86,7.18,9.42,8.78,5.42
10/7/86,7.53,9.81,3.33,1.50,9.45,6.96,5.41,5.25
11/7/86,0.95,3.84,3.52,5.94,8.77,1.94,5.69,8.62
12/7/86,2.94,3.07,5.13,8.10,6.52,9.93,5.85,3.91
13/7/86,9.33,7.03,5.80,2.45,2.86,7.32,5.00,0.17
14/7/86,7.39,4.85,9.15,2.23,1.70,9.42,2.72,9.32
15/7/86,3.38,4.67,6.63,2.12,5.09,7.71,0.99,9.72
16/7/86,9.85,6.68,3.09,5.05,0.34,5.44,5.99,6.19
I want to take the headers from row 3 and then read in some of the rows and columns.
import pandas as pd
# "userows" is what I wish existed; it is not a real read_excel argument
df = pd.read_excel("filename.xlsx", skiprows=2, usecols="A:C,F:I", userows="4:6,13,17:19")
Importantly, this is not a block that can be described by say [A3:C10] or the like.
The userows option does not exist. I know I can skip rows at the top and at the bottom, so presumably I could make lots of data frames and knit them together. But is there a simple way to just read in what you need in one go? My current workaround is to create lots of Excel spreadsheets that each contain just what I need for the different data frames, but this leaves things very open to me making a mistake I can't find.
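One way to get this in a single read (my own sketch, not from the original question): in recent pandas versions, skiprows also accepts a callable that is evaluated against each 0-indexed row number and skips the row when it returns True, so a non-contiguous selection can be expressed as a set:

import pandas as pd

# 0-indexed spreadsheet rows to keep: row 2 holds the headers, and the
# rest mirror the wished-for selection "4:6,13,17:19" (1-indexed Excel rows)
keep = {2} | set(range(3, 6)) | {12} | set(range(16, 19))

df = pd.read_excel(
    "filename.xlsx",
    usecols="A:C,F:I",
    skiprows=lambda i: i not in keep,  # drop every row not in the keep set
    header=0,                          # first surviving row becomes the header
)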

Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it would probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to run the transformation every time; if you are dealing with a large amount of data, that could hinder your performance.
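If the data source is a flat file, the zero-filling can be done once up front. A minimal pandas sketch (my own illustration, not part of the original answer; the file name is hypothetical):

import pandas as pd

df = pd.read_csv("source_data.csv")

# replace empty values with zeroes in every numeric column at once
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)

df.to_csv("source_data_filled.csv", index=False)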

Append new columns into Excel with MATLAB

I would like to ask how to use MATLAB to append new columns to an existing Excel file without altering the original data in the file. In my case I don't know the original number of columns and rows in the file, and in practice it is inefficient to open the files one by one and check. Another difficulty is that the new columns may have a different number of rows to the existing data, so I cannot use the trick of reading in the data, forming a new matrix, and replacing the data with the new matrix.
I have seen many posts teaching people how to add new rows, but adding a new column seems quite a different thing, since columns are named by letters instead of numbers.
Thank you.
You could try reading in the data, using size on the array to determine the number of columns, and then using xlswrite with the range that you want. Have a look here for a function to turn the column number into the Excel letter format: http://au.mathworks.com/matlabcentral/answers/54153-dynamic-ranges-using-xlswrite
Finally I solved it with the following code:
if (step == 1)
    xlswrite(filename, array, sheetname, 'A1');   % create the file
else
    [~, ~, Data] = xlsread(filename, sheetname);  % read in all the old data
    OriCol = size(Data, 2);                       % number of columns in the old data
    NewCol = OriCol + 1;                          % place the new array right next to the original data
    ColLetter = xlcolumnletter(NewCol);           % xlcolumnletter is from the link above
    StartCell = [ColLetter, '1'];
    xlswrite(filename, array, sheetname, StartCell);
end
