Split CSV File into two files keeping header in both files - python-3.x

I am trying to split a large CSV file into two files. I am using the code below:
import pandas as pd

#csv file name to be read in
in_csv = 'Master_file.csv'

#get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

#size of rows of data to write to the csv,
#you can change the row size according to your need
rowsize = 600000

#start looping through data writing it to a new file for each set
for i in range(0, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     nrows=rowsize,  #number of rows to read at each loop
                     skiprows=i)     #skip rows that have been read
    #csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'File_Number' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=True,
              mode='a',            #append data to csv file
              chunksize=rowsize)   #size of data to append for each loop
It splits the file, but the header is missing from the second file. How can I fix it?

pd.read_csv() returns an iterator when used with chunksize, and each chunk keeps the column names from the header, so every output file gets a header row. (In the code above, skiprows=i also skips the header line on every pass after the first, so pandas treats the first data row of each later chunk as the header instead.) The following is an example. It should also be much faster: the original code reads the entire file just to count the lines and then re-reads all previously seen rows on each chunk iteration, whereas the version below reads through the file only once:
import pandas as pd

with pd.read_csv('Master_file.csv', chunksize=60000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)
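Note that chunksize is the number of rows written to each output file, so 60000 rows per chunk will produce many small files. If the goal is exactly two files, as in the question, one possible sketch (assuming counting the lines once up front is acceptable) is to size the chunk at half the row count:

import math
import pandas as pd

total_rows = sum(1 for _ in open('Master_file.csv')) - 1   # minus the header line
half = math.ceil(total_rows / 2)                           # rows per output file

with pd.read_csv('Master_file.csv', chunksize=half) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)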

Related

How to split a large text file into records based on row size in Python or PySpark

I have a large text file, about 15 GB in size. The data inside it is effectively a single string containing some 20 million records. Each record is 5000 characters long and has 450+ columns.
I want to put each record on its own line, and then split each record according to the schema, with some delimiter, so I can load it as a DataFrame.
Here is a sample of my approach. Sample data:
HiIamRowData1HiIamRowData2HiIamRowData3HiIamRowData4HiIamRowData5HiIamRowData6HiIamRowData7HiIamRowData8
Expected output:
Hi#I#am#Row#Data#1#
Hi#I#am#Row#Data#2#
Hi#I#am#Row#Data#3#
Hi#I#am#Row#Data#4#
Hi#I#am#Row#Data#5#
Hi#I#am#Row#Data#6#
Hi#I#am#Row#Data#7#
Hi#I#am#Row#Data#8#
Code:
import pandas as pd

### Schema
schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
df = pd.DataFrame(data=schemaData, columns=['FeildName','offset','size'])
print(df.head(5))

file = 'sampleText.txt'
inputFile = open(file, 'r').read()
recordLen = 13
totFileLen = len(inputFile)
finalStr = ''

### First for loop to split each record based on record length
for i in range(0, totFileLen, recordLen):
    record = inputFile[i:i+recordLen]
    recStr = ''
    ### Second for loop to apply the schema on top of each record
    for index, row in df.iterrows():
        #print(record[row['offset']:row['offset'] + row['size']])
        recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
    recStr = recStr + '\n'
    finalStr += recStr

print(finalStr)
text_file = open("Output.txt", "w")
text_file.write(finalStr)
For the 8-row sample data above, this takes 56 iterations in total (8 for the rows plus 8 × 6 = 48 for rows times columns).
The real data set has 25 million rows and 500 columns, so it would take 25 million + 25 million × 500 iterations.
Constraints:
The data in the text file is sequential: all the records sit next to each other and the entire content is one string. I want to read the text file and write the final data to a new text file.
I don't want to split the file into fixed-size chunks (say 50 MB) while processing, because if the last record straddles the boundary between one 50 MB chunk and the next, every record from the second chunk onwards would be sliced at the wrong offset, since each record is sliced purely by its length of 5000.
Splitting into chunks that are aligned to the record length would be an acceptable approach.
I have tried the Python approach above. For smaller files it works fine, but for files larger than about 500 MB it takes hours to split each record according to the schema.
I have also tried multithreading and multiprocessing, but didn't see much improvement.
QUESTION: Is there a better approach to this problem, in either Python or PySpark, that reduces the processing time?
You can effectively process your big file iteratively by:
capturing a sequential chunk of the needed size at a time,
passing it to pandas.read_fwf with predefined column widths,
and immediately exporting the constructed dataframe to the output csv file (created if it doesn't exist), appending the line with the specified separator:
import pandas as pd
from io import StringIO

rec_len = 13
widths = [2, 1, 2, 3, 4, 1]

with open('sampleText.txt') as inp, open('output.txt', 'w+') as out:
    while (line := inp.read(rec_len).strip()):
        pd.read_fwf(StringIO(line), widths=widths, header=None) \
            .to_csv(out, sep='#', header=False, index=False, mode='a')
The output.txt contents I get:
Hi#I#am#Row#Data#1
Hi#I#am#Row#Data#2
Hi#I#am#Row#Data#3
Hi#I#am#Row#Data#4
Hi#I#am#Row#Data#5
Hi#I#am#Row#Data#6
Hi#I#am#Row#Data#7
Hi#I#am#Row#Data#8
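A note on the widths list: it is hard-coded for the 13-character sample records. For the real file, assuming the schema dataframe df from the question is available and its fields are contiguous (as they are in the sample), the widths could be derived from the size column instead of being typed by hand:

rec_len = 5000                  # real record length from the question
widths = df['size'].tolist()    # one fixed width per field, taken from the schema dataframe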
Yes, we can achieve the same result using PySpark UDF with Spark functions. Let me show you how in 5 steps:
Import the necessary modules:
import pandas as pd
from pyspark.sql.functions import udf, split, explode
Reading text file using Spark read method
sample_df = spark.read.text("path/to/file.txt")
Convert your custom function to a PySpark UDF (User Defined Function) in order to use it in Spark:
def delimit_records(value):
    recordLen = 13
    totFileLen = len(value)
    finalStr = ''
    for i in range(0, totFileLen, recordLen):
        record = value[i:i+recordLen]
        schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
        pdf = pd.DataFrame(data=schemaData, columns=['FeildName','offset','size'])
        recStr = ''
        for index, row in pdf.iterrows():
            recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
        recStr = recStr + '\n'
        finalStr += recStr
    return finalStr.rstrip()
Registering your User Defined Function
delimit_records = udf(delimit_records)
Apply your custom function to the column you want to modify:
df1 = sample_df.withColumn("value", delimit_records("value"))
Split the record based on the delimiter "\n" using the PySpark split() function:
df2 = df1.withColumn("value", split("value", "\n"))
Use the explode() function to transform a column of arrays or maps into multiple rows:
df3 = df2.withColumn("value", explode("value"))
Let's print the output
df3.show()
Output:
+-------------------+
| value|
+-------------------+
|Hi#I#am#Row#Data#1#|
|Hi#I#am#Row#Data#2#|
|Hi#I#am#Row#Data#3#|
|Hi#I#am#Row#Data#4#|
|Hi#I#am#Row#Data#5#|
|Hi#I#am#Row#Data#6#|
|Hi#I#am#Row#Data#7#|
|Hi#I#am#Row#Data#8#|
+-------------------+
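A possible refinement of the UDF above (a sketch, not part of the original answer): the schema dataframe is rebuilt with pandas inside the UDF for every input row, which is costly at 25 million records. Defining the schema once outside the UDF and slicing with a plain list avoids that overhead:

# schema defined once, outside the UDF: (name, offset, size) triples
schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]

def delimit_records(value):
    recordLen = 13
    out = []
    for i in range(0, len(value), recordLen):
        record = value[i:i+recordLen]
        out.append('#'.join(record[off:off+size] for _, off, size in schemaData) + '#')
    return '\n'.join(out)

delimit_records = udf(delimit_records)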

Consolidating CSV files into one master file

I am facing the following challenge.
I have approximately 400 files that I have to consolidate into one master file, but there is one problem: the files have different headers, so when I try to consolidate them the data ends up in different rows depending on the columns.
Example: let's say I have two files, C1 and C2.
file C1.csv
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
When I consolidate these two files, I want the output to look like this:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
For consolidation I am using the following code:
import glob
import pandas as pd

directory = 'C:/Test' # specify the directory containing the 300 files
filelist = sorted(glob.glob(directory + '/*.csv')) # reads all 300 files in the directory and stores as a list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation

for file in filelist:              # Iterate through each of the 300 files
    df1 = pd.read_csv(file)        # create df using the file
    df1col = list(df1.columns)     # save columns to a list
    df2 = consolidated             # set the consolidated as your df2
    df2col = list(df2.columns)     # save columns from consolidated result as list
    commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
    # print(commoncol)
    if commoncol == []: # In first iteration, consolidated file is empty, which will return in a blank df
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0) # concatenate (outer join) with no common columns replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0) # merge both df specifying the common column and replace null values with 0
    # print(consolidated) << Optionally, check the consolidated df at each iteration

# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
But it can't merge columns that hold the same kind of data, as in the desired output shown earlier.
From your two-file example, you know the final (least common) header for the output, and you know what one of the bigger headers looks like.
My take on that is to think of every "other" kind of header as needing a mapping to the final header, like concatenating add-lines 1-3 into a single address field. We can use the csv module to read and write row-by-row and send the rows to the appropriate consolidator (mapping) based on the header of the input file.
The csv module provides a DictReader and DictWriter which makes dealing with fields you know by name very handy; especially, the DictWriter() constructor has the extrasaction="ignore" option which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass a dict like:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and only write Cols 1-3:
import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
import csv

def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row

# Add other consolidators here...
# ...

Final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()

f_out.close()
If there is more than one kind of other header, you'll need to seek those out, figure out the extra consolidators to write, and add the conditions that trigger them inside the for row in reader loop.
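For instance, the expected output in the question also folds last-name into the name field, so a second consolidator along these lines might be needed (a sketch; adjust it to your real headers):

def consolidate_last_name(row):
    # merge the separate last-name column into the name field, e.g. "jorge" + "aggarwal"
    row["name"] = row["name"] + " " + row["last-name"]
    return row

Then, inside the for row in reader loop, trigger it with a matching condition such as if "last-name" in row: row = consolidate_last_name(row).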

Live updating graph from an increasing number of CSV files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data, and the code works just fine: I create a dataframe where each csv file represents a column. The problem is that with several thousand csv files the import becomes very slow, and creating a dataframe out of all of them usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files:
import glob
import os
from datetime import datetime
import pandas as pd

def import_spectra(path, filter):  # (function name is arbitrary)
    ''' import, append and concat files into one dataframe '''
    all_files = glob.glob(os.path.join(path, filter + "*.txt")) # path to the files by joining path and file name
    all_files.sort(key=os.path.getmtime)
    data_frame = []
    name = []
    for file in all_files:
        creation_time = os.path.getmtime(file)
        readible_date = datetime.fromtimestamp(creation_time)
        df = pd.read_csv(file, index_col=0, header=None, sep='\t', engine='python', decimal=",", skiprows=15)
        df.rename(columns={1: readible_date}, inplace=True)
        data_frame.append(df)
    full_spectra = pd.concat(data_frame, axis=1)
    for column in full_spectra.columns:
        time_step = column - full_spectra.columns[0]
        minutes = time_step.total_seconds()/60
        name.append(minutes)
    full_spectra.columns = name
    return full_spectra
The solution I thought of was using the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe and the updated dataframe is plotted. That way I do not need to loop over all csv files every time.
I found a very nice example on how to use watchdog here
My problem is that I could not find a solution for how, after detecting the new file with watchdog, to read it and append it to the existing dataframe.
A minimalistic example code should look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()                          # create a dataframe
newdata = pd.read_csv(latest_filename())     # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]   # append the new data as a column
df.plot()                                    # plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
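For what it's worth, here is a rough, untested sketch of the watchdog idea described above: a handler that, whenever a new text file appears, reads it with the same read_csv arguments used earlier and appends it as a new column (the handler and column names are illustrative):

import pandas as pd
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

full_spectra = pd.DataFrame()   # dataframe that grows by one column per new file

class NewSpectrumHandler(PatternMatchingEventHandler):
    def on_created(self, event):
        # read the freshly created file and attach it as a new column
        newdata = pd.read_csv(event.src_path, index_col=0, header=None,
                              sep='\t', engine='python', decimal=",", skiprows=15)
        full_spectra[event.src_path] = newdata[1]
        # trigger a re-draw of the plot here

observer = Observer()
observer.schedule(NewSpectrumHandler(patterns=["*.txt"]), path=".", recursive=False)
observer.start()
# keep the script alive (e.g. inside the plotting loop), then observer.stop(); observer.join()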

Comparing two CSV files in Python that have different data sets

Using Python, I want to compare two CSV files, comparing only column 2 of the first CSV against column 0 of the second CSV, and write out to a new CSV file only the lines where there is no match for the compared columns.
Example....
currentstudents.csv contains the following information
Susan,Smith,susan.smith#mydomain.com,8
John,Doe,john.doe#mydomain.com,9
Cool,Guy,cool.guy#mydomain.com,3
Test,User,test.user#mydomain.com,5
previousstudents.csv contains the following information
susan.smith#mydomain.com
john.doe#mydomain.com
test.user#mydomain.com
After comparing the two csv files, a new csv called NewStudents.csv should be written with the following information:
Cool,Guy,cool.guy#mydomain.com,3
Here is what I have, but it fails to produce what I need. The code below works if I strip everything except the email address from the original currentstudents.csv, but then I don't end up with the data I need in the final CSV file.
def newusers():
    for line in fileinput.input(r'C:\work\currentstudents.csv', inplace=1):
        print(line.lower(), end='')
    with open(r'C:\work\previousstudents.csv', 'r') as t1, open(r'C:\work\currentstudents.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open(r'C:\work\NewStudents.csv', 'w') as outFile:
        for (line[0]) in filetwo:
            if (line[0]) not in fileone:
                outFile.write(line)
Thanks in advance!
This script writes NewStudents.csv:
import csv

# here sample.csv corresponds to currentstudents.csv and sample2.csv to previousstudents.csv
with open('sample.csv', newline='') as csvfile1, \
     open('sample2.csv', newline='') as csvfile2, \
     open('NewStudents.csv', 'w', newline='') as csvfile3:

    reader1 = csv.reader(csvfile1)
    reader2 = csv.reader(csvfile2)
    csvwriter = csv.writer(csvfile3)

    emails = set(row[0] for row in reader2)

    for row in reader1:
        if row[2] not in emails:
            csvwriter.writerow(row)
The content of NewStudents.csv:
Cool,Guy,cool.guy#mydomain.com,3
With a pandas option
For small files it's not going to matter, but for larger files, the vectorized operations of pandas will be significantly faster than iterating through emails (multiple times) with csv.
Read the data with pd.read_csv
Merge the data with pandas.DataFrame.merge
The columns do not have names in the question, so columns are selected by column index.
Select the desired new students with Boolean indexing with [all_students._merge == 'left_only'].
.iloc[:, :-2] selects all rows, and all but last two columns.
import pandas as pd
# read the two csv files
cs = pd.read_csv('currentstudents.csv', header=None)
ps = pd.read_csv('previousstudents.csv', header=None)
# merge the data
all_students = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)
# select only data from left_only
new_students = all_students.iloc[:, :-2][all_students._merge == 'left_only']
# save the data without the index or header
new_students.to_csv('NewStudents.csv', header=False, index=False)
# NewStudents.csv
Cool,Guy,cool.guy#mydomain.com,3

How to read a big CSV file and read each column separately?

I have a big CSV file that I read as a dataframe, but I cannot figure out how to read it into separate columns.
I have tried to use sep = '\' but it gives me an error.
I read my file with this code:
filename = "D:\\DLR DATA\\status maritime\\nperf-data_2019-01-01_2019-09-10_bC5HIhAfJWqdQckz.csv"
#Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv(filename)
df1 = df.head()
Both when I print the dataframe head and in the variable explorer, the dataframe consists of only one column with all the data inside.
I tried setting sep to a space and to a comma, but it didn't work.
How can I read the data into the appropriate separate columns?
I would appreciate any help.
You have a tab-separated file, so use \t:
df = dd.read_csv(filename, sep='\t')
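If you are not sure which delimiter a file uses, a small sketch with the standard library's csv.Sniffer can guess it from a sample of the text (not part of the original answer):

import csv

with open(filename, newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(4096))   # inspect a sample of the file
print(repr(dialect.delimiter))                    # e.g. '\t' for a tab-separated file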
