Loading a 12GB CSV into Python and converting it into a DataFrame - python-3.x

I want to load a 12GB CSV file into Python and then do analysis.
I attempted this method:
file_input_to_system = pd.read_csv(usrinput)
but it failed because it consumed all my RAM.
My goal now is to read the file from disk piece by piece rather than pulling it all into RAM at once. I googled and found this sample:
import csv
import pandas as pd

f = open("file_path", "r")
for row in csv.reader(f):
    df = pd.DataFrame(row)
    print(df)
f.close()
But I am not sure how to modify it so that it reads the CSV and parses it into a DataFrame.
When I try the following, it can read the file without consuming all my memory.
However, when I concatenate the chunks into a single DataFrame, all my memory is consumed again.
chunksize = 100
df = pd.read_csv("C:/Users/user/Documents/GitHub/MyfirstRep/export_lage.csv",iterator=True,chunksize=chunksize)
df = pd.concat(df, ignore_index=True)
print(df)
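One way to keep memory bounded, assuming the analysis can be expressed per chunk, is to process each chunk as it is read and keep only the aggregated results instead of concatenating everything. A minimal sketch (the column name is just a placeholder):
import pandas as pd

chunksize = 100000  # rows per chunk; tune to the available RAM
total = 0
rows = 0

# only one chunk is held in memory at a time
for chunk in pd.read_csv("C:/Users/user/Documents/GitHub/MyfirstRep/export_lage.csv", chunksize=chunksize):
    total += chunk["some_numeric_column"].sum()  # "some_numeric_column" is a placeholder
    rows += len(chunk)

print("mean:", total / rows)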

Related

Spark s3 csv files read order

Say there are three files in an S3 folder. Does spark.read.csv(s3:bucketname/folder1/*.csv) read the files in order or not?
If not, is there a way to order the files while reading the whole folder, given that the files arrived at different times?
File name                          s3 file uploaded / Last modified time
s3:bucketname/folder1/file1.csv    01:00:00
s3:bucketname/folder1/file2.csv    01:10:00
s3:bucketname/folder1/file3.csv    01:20:00
You can achieve this as follows.
Iterate over all the files in the bucket and load each CSV, adding a new column with the file's last-modified time. Keep all the resulting DataFrames in a list, dfs_list. Since PySpark does lazy evaluation, it will not load the data immediately.
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')

dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    # read each CSV (the question is about CSV files) and tag it with its last-modified time
    path = 's3a://' + my_bucket.name + '/' + file_object.key
    df = spark.read.csv(path).withColumn("modified_date", lit(str(file_object.last_modified)))
    dfs_list.append(df)
Now take the union of all the DataFrames using the PySpark unionAll function and then sort the data by modified_date.
from functools import reduce
from pyspark.sql import DataFrame
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('modified_date')

Load a big file with pandas and filter columns before loading

I'm trying to load a CSV with pandas, but my computer keeps freezing. The CSV has more than 20k columns and 500k rows, but I only need a few of them.
df = pd.read_csv('e:/teste/teste/file.txt', sep="|", header=None, encoding="latin1", error_bad_lines=False, engine='python', dtype='unicode', usecols=[4,6,7,8,9,10,11,12,13,14,15,16,21,22,23,24,25,26,56,93])
The computer hangs even though I load just a few of the columns.
Does anyone know how I can fix it?
When you want to read a big file, the best way to do it is in chunks:
mylist = []
for chunk in pd.read_csv(file, chunksize=Number):
    mylist.append(chunk)
df = pd.concat(mylist, axis=0)
Put each chunk in a list and concatenate all the chunks at the end. This is more memory efficient than concatenating on every iteration.
And for the columns, the pandas read_csv docs describe a usecols parameter. I hope it helps.
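A minimal sketch that combines both suggestions, reusing the path and column indices from the question (the chunksize of 50000 is only an illustrative value):
import pandas as pd

wanted = [4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 25, 26, 56, 93]

chunks = []
# usecols limits parsing to the needed columns; chunksize bounds memory per read
for chunk in pd.read_csv('e:/teste/teste/file.txt', sep='|', header=None,
                         encoding='latin1', dtype='unicode',
                         usecols=wanted, chunksize=50000):
    chunks.append(chunk)

df = pd.concat(chunks, axis=0, ignore_index=True)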

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my PySpark SQL DataFrame to JSON and then save it as a file.
df_final = df_final.union(join_df)
df_final contains values like this:
I tried something like this, but it created invalid JSON.
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
  {"Variable":"Col1",
   "Min":"20",
   "Max":"30"},
  {"Variable":"Col2",
   "Min":"25",
   "Max":"40"}
]
With PySpark you can store your DataFrame directly as a JSON file; there is no need to convert the DataFrame to JSON first.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your DataFrame to JSON, you can use
df_final.toJSON()
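For reference, toJSON() returns an RDD of JSON strings, one per row; a quick way to inspect it (the output shown is only an illustration based on the question's columns):
json_rdd = df_final.toJSON()
print(json_rdd.take(2))
# e.g. ['{"Variable":"Col1","Min":"20","Max":"30"}', '{"Variable":"Col2","Min":"25","Max":"40"}']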
A solution can be to use collect and then json.dump:
import json

collected_df = df_final.collect()
# Row objects are not directly JSON serializable, so convert each one to a dict first
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump([row.asDict() for row in collected_df], outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will pull the whole dataframe into driver memory, so it is only recommended for small dataframes.
If you want to use Spark to process the result as JSON files, then the output you got in HDFS has the right schema (one JSON object per line).
I assume the issue you ran into is that you cannot read that data smoothly from a normal Python script using:
with open('data.json') as f:
    data = json.load(f)
You should read the data line by line instead:
import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and then you can use pandas to create a DataFrame:
import pandas as pd

df = pd.DataFrame(data)
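As a shortcut, pandas can also read newline-delimited JSON directly via the lines argument of read_json, which replaces the manual loop above:
import pandas as pd

# lines=True tells pandas that the file holds one JSON object per line
df = pd.read_json("data.json", lines=True)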

Reading Unzipped Shapefiles stored in AWS S3 from AWS EMR Cluster using PySpark in Jupyter Notebook

I'm completely new to AWS EMR and Apache Spark. I'm trying to assign GeoIDs to residential properties using shapefiles, but I'm not able to read the shapefiles from my S3 bucket. Please help me understand what is going on, as I couldn't find any answer on the internet that explains this exact problem.
import shapefile
import pandas as pd
from shapely.geometry import shape  # the shape() call below presumably comes from shapely

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """
    # read the file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df
read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
Files that I want to read (screenshot not reproduced here)
The error that I'm getting while reading from the bucket (screenshot not reproduced here)
I really want to read these shapefiles on the AWS EMR cluster, as it's not possible for me to work on them individually on my local machine. Any kind of help is appreciated.
I was able to read my shapefiles from the S3 bucket as binary objects first, then build a wrapper function around that, and finally pass the individual file objects to the shapefile.Reader() method as .shp, .shx and .dbf streams separately.
This was happening because PySpark cannot read formats that are not provided to the SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
My solution
def read_shapefile(shp_path):
    import io
    import shapefile
    import pandas as pd
    from shapely.geometry import shape  # the shape() call below presumably comes from shapely

    # read every file matching the prefix as (path, bytes) pairs
    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())

    # hand the .shp, .shx and .dbf byte streams to pyshp separately
    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]))

    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]

    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df
block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.
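A quick sanity check on the returned pandas DataFrame, assuming the call above succeeded:
print(block_shapes.shape)   # rows = census blocks, columns = shapefile attributes plus coords and centroid
print(block_shapes.head())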

Generate single json file for pyspark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
The following is the code snippet for saving the JSON file.
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True).saveAsTextFile('examples/src/main/resources/demo.json')
But I need the JSON data written to a single file instead of being distributed across several part files.
Please suggest an appropriate solution.
Without additional libraries like pandas, you could save your RDD of several JSON objects by reducing them to one big string, with one JSON object per line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map the jsons back to strings
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write the string to a file (binary mode, since the string is encoded explicitly)
with open("path/to/your.json", "wb") as f:
    f.write(json_string.encode("utf-8"))
I have had issues with PySpark saving off JSON files once I have them in an RDD or DataFrame, so what I do is convert them to a pandas DataFrame and save them to a non-distributed directory.
import pandas

df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
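If you want the same one-JSON-object-per-line layout that Spark's JSON writer produces, to_json also accepts orient and lines arguments; a small sketch:
# writes newline-delimited JSON, one record per line, to a single local file
df2.to_json(yourpath, orient='records', lines=True)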
