Split a very large csv file into many smaller files - python-3.x

I read this question but it does not seem to give a good reply to my situation (maybe I misread it):
Splitting very large csv files into smaller files
I have a large CSV file (1.0 GB) with lots of individual rows (over 1 million) and 8 columns. Two columns represent date and time, while the others hold other stock-related information (price, etc.). I would like to save individual files split by the date and time attributes. So, if there are 100 different date-time combinations, I would extract the rows for each combination and save them as a separate CSV file under a subfolder like C:/Date/Time/filename.csv.
I am currently using pandas to filter each date-time combination into a dataset with the required information for that combination, and then saving each file in a loop (iterating over a list of date-time combinations). This is taking a very long time.
Is there a better way to accomplish this?
(I will look into multithreading as well but I don't believe it will solve the speed issue significantly).
Thanks!
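One pattern that avoids re-filtering the whole DataFrame for every combination is a single groupby pass that writes each group exactly once. Below is a rough sketch; the column names 'Date' and 'Time', the input file name, and the ':'-to-'-' substitution for Windows-safe folder names are all assumptions:
import os
import pandas as pd

# Read the file once; a 1 GB CSV normally fits in memory on a machine with a few GB of RAM.
df = pd.read_csv("big_file.csv")  # hypothetical file name

# Single pass: each (Date, Time) group is written exactly once,
# instead of re-filtering the whole DataFrame per combination.
for (date, time), group in df.groupby(["Date", "Time"]):
    out_dir = os.path.join("C:/", str(date), str(time).replace(":", "-"))
    os.makedirs(out_dir, exist_ok=True)
    group.to_csv(os.path.join(out_dir, "filename.csv"), index=False)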
Adding a sample of input data:

Related

How to grep csv documents to APPEND info and keep the old data intact?

I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it in multiple steps manually and, perhaps, with some complex procedure such as copy-pasting to multiple files. What matters to me is reliability and, most importantly, the ability to process very LARGE csv files with more than 1M lines.
It must also be noted that the lists I will "grep" to add data to some of the columns will most of the time affect at most 20% of the total csv file rows, meaning the remaining 80% must remain intact and, if possible, not even be displaced from their current order.
I would also like to note that I will be using a program called EmEditor rather than spreadsheet software like Excel, due to its speed and the fact that Excel simply cannot process large csv files.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, grabbing my head in frustration.
Filter all good emails with Filter: open Advanced Filter; next to the Add button is Add Linked File. Add the good_emails.txt file, set it to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same for the bad emails, entering n in that column instead. Follow the same steps with likes_banana.txt and set the last column to the correct string.
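For comparison, the same fill logic can also be expressed as a short script. This is only a sketch in Python/pandas (not EmEditor), using the file and column names from the question; rows whose email is not in any list, and the original row order, are left untouched:
import pandas as pd

# Hypothetical helper: load an email list (one address per line) into a set.
def load_list(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

good = load_list("good_emails.txt")
bad = load_list("bad_emails.txt")
banana = load_list("likes_banana.txt")

df = pd.read_csv("huge_database.csv")

# Only rows whose email appears in one of the lists are modified.
df.loc[df["email"].isin(good), "check_result"] = "y"
df.loc[df["email"].isin(bad), "check_result"] = "n"
df.loc[df["email"].isin(banana), "favourite_fruit"] = "banana"

df.to_csv("huge_database_updated.csv", index=False)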

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5 to 6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set and check, for each row, whether it's in the set, and change what needs changing.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using an SQLite database for storing the data. But I'm unsure which way would be most efficient.
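The set-based pass described above might look roughly like the sketch below. It is an assumption that the whole row serves as the duplicate key and that last month's file is available to seed the set, so only genuinely new entries are kept; the file names are placeholders:
import csv

def row_key(row):
    # Assumption: the entire row identifies a duplicate; adjust to the relevant columns.
    return tuple(row)

seen = set()

# Seed the set with last month's file so previously seen rows are dropped.
with open("previous_month.csv", newline="") as f:
    for row in csv.reader(f):
        seen.add(row_key(row))

with open("current_month.csv", newline="") as f_in, \
     open("new_entries.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)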
I would use numpy.unique():
import numpy as np

# Read both files as CSV text (delimiter="," and dtype=str, since the rows are not purely numeric)
# and stack them on top of each other, creating one combined array.
data = np.vstack((np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str),
                  np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)))
# Keep only the unique rows.
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it compares whole rows rather than individual cells.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!

Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it would probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to apply the transformation every time; if you are dealing with a large amount of data, that could hinder your performance.
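If the data source happens to be something like a CSV file, the pre-import fix can be very small. A sketch in Python/pandas, outside Spotfire; the file name is a placeholder:
import pandas as pd

# Hypothetical source file: replace empty (null) values with 0 in every column
# before the data is ever imported into Spotfire.
df = pd.read_csv("source_data.csv")
df = df.fillna(0)
df.to_csv("source_data_filled.csv", index=False)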

The most efficient way to break down huge file into chunks of max 10MBs

There is a text file about 20 GB in size. Each line of the file is a JSON object. The odd lines describe the even lines that follow them.
The goal is to divide the big file into chunks with a maximum size of 10 MB, with the constraint that each chunk must contain an even number of lines so the formatting doesn't get lost. What would be the most efficient way to do that?
My research so far made me lean towards:
1. The split function in Linux. Is there any way to make it always export an even number of lines while splitting by size?
2. A modified version of a divide & conquer algorithm. Would this even work?
3. Estimating the average number of lines that fits within the 10 MB limit and iterating through the file, exporting a chunk whenever it reaches the limit.
I'm thinking that option 1 would be the most efficient, but I wanted to get the opinion of experts here.
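For comparison, a single-pass variant of option 3 is straightforward to express directly: instead of estimating an average line count, it tracks the byte size of each description/record pair and starts a new chunk when the next pair would exceed 10 MB. A sketch in Python; the function name and output file pattern are placeholders:
MAX_BYTES = 10 * 1024 * 1024  # 10 MB per chunk

def split_in_pairs(path, prefix="chunk"):
    chunk_index, chunk_size = 0, 0
    out = open(f"{prefix}_{chunk_index:05d}.jsonl", "w", encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        while True:
            header = f.readline()   # odd line: describes the record
            record = f.readline()   # even line: the record itself
            if not record:          # end of file (or a dangling odd line)
                break
            pair = header + record
            pair_size = len(pair.encode("utf-8"))
            # Start a new chunk if this pair would push the current one past the limit.
            if chunk_size and chunk_size + pair_size > MAX_BYTES:
                out.close()
                chunk_index += 1
                chunk_size = 0
                out = open(f"{prefix}_{chunk_index:05d}.jsonl", "w", encoding="utf-8")
            out.write(pair)
            chunk_size += pair_size
    out.close()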

Managing data sets in SPSS where multiple cases appear in one row

I'm working with a data set which has details on multiple people on one row. How I've dealt with this is to have variables like this:
P1Name P1Age P1Gender P1Ethnicity P2Name P2Age P2Gender... etc
This makes analysis very difficult. I have used multiple response variables, which are good for frequencies, but it's unwieldy, it takes time to write out the syntax (there are a lot of 'P's), and you can't do other analysis with it.
First of all, is there a way to run analyses as if the name, age, gender and so on variables for each person were on their own row (if that makes sense)? All I can think of doing is pasting the data into Excel, cutting and pasting to get them all into the same columns, then pasting back into SPSS. Any other ideas?
Or is this just a matter of having two datasets, one for the case details and one for the people details?
Any advice would be greatly appreciated!
Write the data out using SAVE TRANSLATE and then read it back in removing the P's, like this:
FILE HANDLE MyFile /NAME='/Users/rick/tmp/test.csv'.
* Small example: one wide case with three variables for each of three people.
DATA LIST FREE /p1x1 p1x2 p1x3 p2x1 p2x2 p2x3 p3x1 p3x2 p3x3.
BEGIN DATA.
1 2 3 1 2 3 1 2 3
END DATA.
LIST.
* Write the values out as plain CSV.
SAVE TRANSLATE
/OUTFILE=MyFile
/TYPE=CSV /ENCODING='UTF8' /REPLACE
/CELLS=VALUES.
* Read the file back with only three variable names, so each person becomes a separate case.
DATA LIST FREE FILE=MyFile /x1 x2 x3.
LIST.
That should do it.
