Problem with processing large (>1 GB) CSV file - python-3.x

I have a large CSV file that I have to sort and then write the sorted data to another CSV file. The CSV file has 10 columns. Here is my code for sorting:
data = [x.strip().split(',') for x in open(filename + '.csv', 'r').readlines() if x[0] != 'I']
data = sorted(data, key=lambda x: (x[6], x[7], x[8], int(x[2])))
with open(filename + '_sorted.csv', 'w') as fout:
    for x in data:
        print(','.join(x), file=fout)
It works fine with files below 500 MB but cannot process files larger than 1 GB. Is there any way I can make this process memory efficient? I am running this code on Google Colab.

Here is a link to a blog about using pandas for large datasets. The examples in the link analyze datasets around 1 GB in size.
Simply type the following to import your CSV data into Python:
import pandas as pd
gl = pd.read_csv('game_logs.csv', sep = ',')
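If the whole file does not fit in memory, another option is an external merge sort: sort the data in chunks that do fit, spill each sorted chunk to a temporary file, and then merge the sorted files. Below is a minimal sketch of that idea (not from the blog post); it reuses the question's file naming and sort key, while chunk_size and the row filter are assumptions you may need to adjust.
import csv
import heapq
import tempfile

def sort_key(row):
    # same key as the in-memory version: columns 6, 7, 8, then column 2 as an int
    return (row[6], row[7], row[8], int(row[2]))

def write_sorted_chunk(chunk):
    # sort one in-memory chunk and spill it to a temporary CSV file
    chunk.sort(key=sort_key)
    tmp = tempfile.NamedTemporaryFile('w', newline='', suffix='.csv', delete=False)
    csv.writer(tmp).writerows(chunk)
    tmp.close()
    return tmp.name

def sort_large_csv(filename, chunk_size=500_000):
    chunk_paths = []
    chunk = []
    with open(filename + '.csv', newline='') as fin:
        for row in csv.reader(fin):
            if row and row[0].startswith('I'):   # mirrors the question's x[0] != 'I' filter
                continue
            chunk.append(row)
            if len(chunk) >= chunk_size:
                chunk_paths.append(write_sorted_chunk(chunk))
                chunk = []
    if chunk:
        chunk_paths.append(write_sorted_chunk(chunk))
    # heapq.merge keeps only one row per chunk file in memory at a time
    readers = [csv.reader(open(path, newline='')) for path in chunk_paths]
    with open(filename + '_sorted.csv', 'w', newline='') as fout:
        csv.writer(fout).writerows(heapq.merge(*readers, key=sort_key))
Peak memory is bounded by roughly chunk_size rows, so the approach scales to files much larger than the available RAM.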

Related

How to get specific data from excel

Any idea how I can access or get the box data (see image) under the TI_Binning tab of an Excel file using Python? What module or sample code can you recommend? I just need that specific data so I can append it to another file such as a .txt file.
Getting the data you circled:
import pandas as pd
df = pd.read_excel('yourfilepath', 'TI_Binning', skiprows=2)
df = df[['Number', 'Name']]
To append to an existing text file:
import numpy as np
with open("filetoappenddata.txt", "ab") as f:
    # fmt='%s' so the mixed text/number columns are written as plain strings
    np.savetxt(f, df.values, fmt='%s')
More info here on np.savetxt and the format options that fit your output needs.
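If you would rather stay in pandas, appending the selected columns with to_csv in append mode is an alternative (a sketch using the df built above; the tab separator and the header/index settings are assumptions):
# append the two columns to the existing text file, tab-separated, without header or index
df.to_csv("filetoappenddata.txt", mode='a', sep='\t', index=False, header=False)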

Rasterio for reading large Tiff files

I am trying to convert a large TIFF file into GeoJSON. The problem I am facing is that since my TIFF file is several gigabytes (3.5 GB), the program runs out of memory.
Is there any way to process this in chunks and write a chunk to the output file every time?
import json
import rasterio
import rasterio.features
import rasterio.warp

with rasterio.open('MyBigFile.tif') as dataset, open('output.json', 'w') as f:
    # Read the dataset's valid data mask as an ndarray.
    mask = dataset.dataset_mask()
    # Extract feature shapes and values from the array.
    for geom, val in rasterio.features.shapes(mask, transform=dataset.transform):
        # Transform shapes from the dataset's own coordinate
        # reference system to CRS84 (EPSG:4326).
        geom = rasterio.warp.transform_geom(
            dataset.crs, 'EPSG:4326', geom, precision=6)
        # write one GeoJSON geometry per line
        f.write(json.dumps(geom) + '\n')
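If the full-resolution mask does not fit in memory, rasterio can read the raster block by block. The sketch below is an untested outline, not a verified recipe: it iterates over the dataset's native block windows, extracts shapes per window using that window's transform, and appends one GeoJSON geometry per line; it assumes dataset_mask() accepts the same window keyword as read().
import json
import rasterio
import rasterio.features
import rasterio.warp

with rasterio.open('MyBigFile.tif') as dataset, open('output.json', 'w') as f:
    for _, window in dataset.block_windows(1):
        # valid-data mask for just this window (assumes window= is accepted here)
        mask = dataset.dataset_mask(window=window)
        # affine transform mapping this window's pixels to dataset coordinates
        transform = dataset.window_transform(window)
        for geom, val in rasterio.features.shapes(mask, transform=transform):
            geom = rasterio.warp.transform_geom(
                dataset.crs, 'EPSG:4326', geom, precision=6)
            # newline-delimited JSON: one geometry per line
            f.write(json.dumps(geom) + '\n')
Note that polygons spanning a window boundary will be split at the edge, so a post-processing merge step may be needed if whole polygons matter.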

Reading Large CSV file using pandas

I have to read and analyse a CAN logging file which is in CSV format. It has 161,180 rows and 566 semicolon-separated columns. This is the code I have:
import csv
import dtale
import pandas as pd
path = 'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'
raw_data = pd.read_csv(path, engine="python", chunksize=1000000, sep=";")
df = pd.DataFrame(raw_data)
#df
dtale.show(df)
When I run the code in Jupyter Notebook, it fails with the error message below. Please help me with this. Thanks in advance!
MemoryError: Unable to allocate 348. MiB for an array with shape (161180, 566) and data type object
import time
import pandas as pd
import csv
import dtale

chunk_size = 1000000
batch_no = 1
for chunk in pd.read_csv("C:\Thesis\Log_Files\InputOutput\Input\Book2.csv", chunksize=chunk_size, sep=";"):
    chunk.to_csv('chunk' + str(batch_no) + '.csv', index=False)
    batch_no += 1

df1 = pd.read_csv('chunk1.csv')
df1
dtale.show(df1)
I used the above code with only 10 rows and 566 columns, and it works. If I use all the rows (161,180), it doesn't. Could anyone help me with this? Thanks in advance!
I have attached the output here
You are running out of RAM when loading the data file. The best option is to split the file into chunks and read it in chunk by chunk.
To read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
You'll probably also want to use chunksize:
This returns a TextFileReader object for iteration:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
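For the dtale use case above, process() could reduce each chunk to only the columns you actually analyse before concatenating, so the full 566-column frame never has to exist in memory at once. A rough sketch; the column names are placeholders:
import dtale
import pandas as pd

path = r'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'
pieces = []

def process(chunk):
    # keep only the signals you need (placeholder column names) to shrink memory use
    pieces.append(chunk[['Timestamp', 'Signal_A', 'Signal_B']])

for chunk in pd.read_csv(path, sep=';', chunksize=10 ** 6):
    process(chunk)

df = pd.concat(pieces, ignore_index=True)
dtale.show(df)   # the reduced frame is far smaller than the raw 566 columns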

Merge multiple txt files and remove duplicates from the resulting big file

I've been trying to get this one to work, without success, because:
The files to be merged are large (up to 20 MB each);
Duplicate lines appear in separate files, which is why I need to remove them from the resulting merged file;
Right now the code runs, but nothing shows up, and it basically merges the files without dealing with duplicates.
import os
import io
import pandas as pd

merged_df = pd.DataFrame()
for file in os.listdir(r"C:\Users\username\Desktop\txt"):
    if file.endswith(".txt"):
        file_path = os.path.join(r"C:\Users\username\Desktop\txt", file)
        bytes = open(file_path, 'rb').read()
        merged_df = merged_df.append(pd.read_csv(io.StringIO(
            bytes.decode('latin-1')), sep=";", parse_dates=['Data']))

SellOutCombined = open('test.txt', 'a')
SellOutCombined.write(merged_df.to_string())
SellOutCombined.close()
print(len(merged_df))
Any help is appreciated.
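One way to handle the duplicates is to concatenate everything first and let pandas drop identical rows. A minimal sketch, assuming the goal is to remove rows that are identical across files; the folder path, separator, encoding and the 'Data' column come from the question, everything else is an assumption:
import glob
import os
import pandas as pd

folder = r"C:\Users\username\Desktop\txt"
frames = [
    pd.read_csv(path, sep=';', encoding='latin-1', parse_dates=['Data'])
    for path in glob.glob(os.path.join(folder, '*.txt'))
]

# concatenate all files, then drop rows that appear more than once
merged_df = pd.concat(frames, ignore_index=True).drop_duplicates()
merged_df.to_csv('test.txt', sep=';', index=False, encoding='latin-1')
print(len(merged_df))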

Pyspark - How can I convert parquet file to text file with delimiter

I have a parquet file with the following schema:
|DATE|ID|
I would like to convert it into a text file with tab delimiters as follows:
20170403 15284503
How can I do this in pyspark?
In Spark 2.0+, use
df = spark.read.parquet(input_path)
to read the parquet file into a DataFrame (DataFrameReader), and
df.write.csv(output_path, sep='\t')
to write the DataFrame out as tab-delimited text (DataFrameWriter).
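Put together, a fuller version of the same approach might look like this (a sketch; the paths and app name are placeholders, and note that the CSV writer produces a directory of part files rather than a single file):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-to-tsv').getOrCreate()

df = spark.read.parquet('/path/to/input.parquet')   # columns: DATE, ID
(df.select('DATE', 'ID')
   .write
   .option('sep', '\t')
   .mode('overwrite')
   .csv('/path/to/output_tsv'))                     # writes a directory of part files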
You can read your .parquet file in Python into a DataFrame and, using a list data structure, save it to a text file. The sample code is here:
This code reads word2vec (word-to-vector) output of the Spark MLlib WordEmbeddings class from a .parquet file and converts it to a tab-delimited .txt file.
import pandas as pd
import pyarrow.parquet as pq
import csv

data = pq.read_pandas('C://...//parquetFile.parquet', columns=['word', 'vector']).to_pandas()
df = pd.DataFrame(data)
vector = df['vector'].tolist()
word = df['word'].tolist()

# build one record per word: [word, v1, v2, ...]
k = []
for i in range(len(word)):
    l = [word[i]]
    l.extend(vector[i])
    k.append(l)

# you cannot save a data frame directly to a .txt file,
# so write the records to a .csv file first
with open('C://...//csvFile.csv', "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    for row in k:
        writer.writerow(row)

# then write the same records as a tab-delimited .txt file
outputTextFile = 'C://...//textFile.txt'
with open(outputTextFile, 'w') as f:
    for record in k:
        if len(record) > 0:
            # tab-delimit the elements of each record
            f.write('\t'.join(str(element) for element in record))
            # add a newline after each record
            f.write('\n')
I hope it helps :)
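For the original DATE|ID schema, the same conversion can also be done without Spark in a couple of lines of pandas (a sketch; the paths are placeholders and pyarrow or fastparquet must be installed):
import pandas as pd

df = pd.read_parquet('input.parquet', columns=['DATE', 'ID'])
df.to_csv('output.txt', sep='\t', index=False, header=False)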
