Processing bzipped json file in Spark? - apache-spark

I have about 200 files in S3, e.g., a_file.json.bz2; each line of these files is a record in JSON format, but some fields were serialised by pickle.dumps, e.g. a datetime field. Each file is about 1 GB after bzip2 compression. Now I need to process these files in Spark (PySpark, actually), but I couldn't even get each record out. So what would be the best practice here?
The ds.take(10) gives
[(0, u'(I551'),
(6, u'(dp0'),
(11, u'Vadv_id'),
(19, u'p1'),
(22, u'V479883'),
(30, u'p2'),
(33, u'sVcpg_id'),
(42, u'p3'),
(45, u'V1913398'),
(54, u'p4')]
Apparently the splitting is not done per record.
Thank you.

I had this issue reading gpg-encrypted files. You can use wholeTextFiles as Daniel suggests, but you have to be careful when reading large files, as the entire file will be loaded into memory before processing. If the file is too large, it can crash the executor. I used parallelize and flatMap. Maybe something along the lines of
import bz2
import glob

def read_fun_generator(filename):
    with bz2.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

bz2_filelist = glob.glob("/path/to/files/*.bz2")
rdd_from_bz2 = sc.parallelize(bz2_filelist).flatMap(read_fun_generator)

You can access the input file-by-file (instead of line-by-line) via SparkContext.wholeTextFiles. You can then use flatMap to uncompress and parse the lines in your own code.
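For example, a minimal sketch of that idea, here using sc.binaryFiles instead of wholeTextFiles so the compressed bytes arrive intact; the S3 path and the per-line JSON parsing are assumptions about the layout:
import bz2
import json

def parse_archive(pair):
    path, raw = pair                        # (file name, raw compressed bytes)
    text = bz2.decompress(raw).decode('utf-8')
    for line in text.splitlines():
        if line:
            yield json.loads(line)          # pickled fields still need pickle.loads

# one (file name, bytes) pair per file, so each file must fit in a task's memory
records = sc.binaryFiles("s3://<bucket>/path/*.json.bz2").flatMap(parse_archive)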

In fact the problem is caused by pickle. Looking at the file content after decompression, it is indeed
(I551
(dp0
Vadv_id
p1
V479883
p2
sVcpg_id
p3
V1913398
p4
which is troublesome to parse. I know I can just call pickle.load(file) multiple times to get the objects out, but I cannot find a quick solution in Spark, where I can only access the loaded files line by line. Also, the records in this file have variable fields and lengths, which makes it more difficult to hack around.
I ended up re-generating these bz2 files from the source because it was actually easier and faster. I also learnt that Spark and Hadoop support bzip2 compression natively, so no additional action is required.
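For reference, once each line of the regenerated bz2 files is a standalone JSON record, reading them is a one-liner (the path is a placeholder):
import json

records = sc.textFile("s3://<bucket>/path/*.json.bz2").map(json.loads)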

Related

Best Way to Save Sensor Data, Split Every x Megabytes in Python

I'm saving sensor data at 64 samples per second into a CSV file. The file is about 150 MB at the end of 24 hours. It takes a bit longer than I'd like to process, and I need to do some processing in real time.
value = str(milivolts)
logFile.write(str(datet) + ',' + value + "\n")
So I end up with single lines of date and millivolts, up to 150 MB. At the end of 24 hours it makes a new file and starts saving to it.
I'd like to know if there is a better way to do this. I have searched but can't find any good information on a compression to use while saving sensor data. Is there a way to compress while streaming / saving? What format is best for this?
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
Thanks for any input.
I'd like to know if there is a better way to do this.
One of the simplest ways is to use a logging framework; it will allow you to configure which compressor to use (if any), the approximate size of a file, and when to rotate logs. You could start with this question. Try experimenting with several different compressors to see if the speed/size trade-off is OK for your app.
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
A logging framework would do this for you based on the configuration. You could combine several different options: have fixed-size logs and rotate at least once a day, for example.
Generally, this is accurate up to the size of a logged line, so if the data is split into lines of reasonable size, this makes life super easy: one line ends up in one file, and the next is written into a new file.
Files also rotate, so you can have order of the data encoded in the file names:
raw_data_<date>.gz
raw_data_<date>.gz.1
raw_data_<date>.gz.2
In pseudo-code it will look like this:
# Parse where to save data, should we compress data,
# what's the log pattern, how to rotate logs etc
loadLogConfig(...)
# any compression, rotation, flushing etc happens here
# but we don't care, and just write to file
logger.trace(data)
# on shutdown, save any temporary buffer to the files
logger.flush()
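As a concrete example with Python's standard logging module (a sketch; the 10 MB threshold, file names, and backup count are arbitrary, and datet/milivolts come from the question's code):
import gzip
import logging
import os
import shutil
from logging.handlers import RotatingFileHandler

def gzip_rotator(source, dest):
    # compress the just-rotated file and drop the uncompressed original
    with open(source, "rb") as sf, gzip.open(dest, "wb") as df:
        shutil.copyfileobj(sf, df)
    os.remove(source)

# rotate roughly every 10 MB; pick whatever "x megabytes" you need
handler = RotatingFileHandler("raw_data.csv", maxBytes=10 * 1024 * 1024,
                              backupCount=1000)
handler.rotator = gzip_rotator
handler.namer = lambda name: name + ".gz"   # raw_data.csv.1.gz, .2.gz, ...
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("sensor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# inside the 64-samples-per-second loop, instead of logFile.write(...):
#     logger.info("%s,%s", datet, milivolts)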

How can I shuffle the rows of a large csv file and write the result to a new csv file without using too much memory?

So if I have a csv file as follows:
User Gender
A M
B F
C F
Then I want to write another csv file with rows shuffled like so (as an example):
User Gender
C F
A M
B F
My problem is that I don't know how to randomly select rows and ensure that I get every row from the original csv file. For reference, my csv file is around 3 GB. If I load my entire dataset into a dataframe and use the random package to shuffle it, my PC crashes due to RAM use.
Probably the easiest (and fastest) is to use shuf in bash!
shuf words.txt > shuffled_words.txt
(I know you asked for a Python solution, but I am going to assume this is still a better answer)
To programmatically do it from Python:
import sh
sh.shuf("words.txt", out="shuffled_words.txt")
Create an array of file positions of line starts by reading the file once, either with random access or as a memory-mapped file. The array has one extra entry holding the file length, so line i occupies the bytes [array[i], array[i+1]).
Shuffle the indices 0 .. number of lines - 1.
Now you can use random-access positioning (seek) to read each line into a buffer.
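A sketch of that approach in Python (the file names are placeholders, and a header line is assumed):
import random

offsets = []
with open("input.csv", "rb") as f:
    header = f.readline()              # keep the header where it is
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)

random.shuffle(offsets)

with open("input.csv", "rb") as src, open("shuffled.csv", "wb") as dst:
    dst.write(header)
    for pos in offsets:
        src.seek(pos)
        dst.write(src.readline())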
You can use the chunksize argument to read the csv in chunks:
df_chunks = pandas.read_csv("your_csv_name.csv", chunksize=10)
Then you can shuffle the chunks one at a time so it takes less memory:
new_chunks = []
for chunk in df_chunks:
    new_chunks.append(chunk.sample(frac=1))  # e.g. shuffle the rows of this chunk
Then you can concat them and save the result into another csv:
new_df = pandas.concat(new_chunks)
new_df.to_csv("your_new_csv_name.csv")
If you have memory issues while you create new_chunks, don't forget to erase the old ones, as you don't want them to be left in RAM for no reason. You can do it with
chunk = None

How to read a csv file with huge data from a .7z archive?

I have a csv file containing 162 GB of data that I had to compress using 7-Zip to save space. I have been using libarchive to read from .7z files and append the blocks read to get the final result at the end. But the problem with this file is that it's so huge that I cannot append the blocks to create a single string or dataframe, since my main memory is limited to 8 GB. Furthermore, I cannot perform any operation on each block, since the blocks read are inconsistent each time: the last line clips off some of the columns.
Following is the snippet that I am using to read the csv file:
import libarchive

with libarchive.file_reader(r'D:\Features\four_grams.7z') as e:
    for entry in e:
        for b in entry.get_blocks():
            print(b.decode('utf-8'))
Following is a pastebin of a single block of output:
https://pastebin.com/7agwAAds
Notice the clipping of the final row.
I would appreciate any help with reading complete chunks of rows from a huge csv file inside an archive.
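One way to work around the clipping is to carry the incomplete tail of each block over into the next one; a sketch, assuming UTF-8 text with newline-terminated rows:
import libarchive

def iter_lines(archive_path):
    with libarchive.file_reader(archive_path) as archive:
        for entry in archive:
            remainder = b""
            for block in entry.get_blocks():
                data = remainder + block
                lines = data.split(b"\n")
                remainder = lines.pop()        # last piece may be a clipped row
                for line in lines:
                    yield line.decode('utf-8')
            if remainder:
                yield remainder.decode('utf-8')

for row in iter_lines(r'D:\Features\four_grams.7z'):
    pass  # parse one complete csv row at a time here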

How to access this kind of data in Spark

The data is stored in the following forms:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, and each row is a sample. Each data/file*_labels.txt stores the labels of those samples, and each row is a number (e.g., 1, 2, 3, ...). Across the 100 files there are about 80 million samples in total.
In Spark, how to access this data set?
I have checked the spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py. It has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
I ran this example in ./bin/pyspark, and it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file. In my case, there are many files. Is there any RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale over the data set (mean-std normalization or min-max normalization).
Simply point Spark at the directory:
dir = "<path_to_data>/data"
sc.textFile(dir)
Spark automatically picks up all of the files inside that directory.
If you want to load a specific file type for processing, you can use a wildcard pattern to load those files into an RDD:
dir = "data/*.txt"
sc.textFile(dir)
Spark will load all files ending with the txt extension.
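sc.textFile treats the .mat files as text, though; if they need to be parsed, one option (a sketch, assuming the files are readable by scipy.io.loadmat and store the samples under a "features" key) is sc.binaryFiles:
import io

from scipy.io import loadmat

def parse_mat(pair):
    path, content = pair                   # (file name, raw bytes)
    mat = loadmat(io.BytesIO(content))
    for row in mat["features"]:            # "features" key is an assumption
        yield (path, row)

features = sc.binaryFiles("data/*_features.mat").flatMap(parse_mat)
labels = sc.textFile("data/*_labels.txt").map(int)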

how to skip corrupted gzips with pyspark?

I need to read a lot of gzips from HDFS, like this:
sc.textFile('*.gz')
but some of these gzips are corrupted and raise
java.io.IOException: gzip stream CRC failure
which stops the whole processing from running.
I read the debate here, where someone has the same need but got no clear solution. Since it's not appropriate to implement this function within Spark (according to the link), is there any way to just brutally skip corrupted files? There seem to be hints for Scala users, but I have no idea how to deal with it in Python.
Or can I only detect corrupted files first, and delete them?
What if I have a large number of gzips, and after a day of running, find out that the last one of them is corrupted? The whole day would be wasted. And having corrupted gzips is quite common.
You could manually list all of the files and then read over the files in a map UDF. The UDF could then have try/except blocks to handle corrupted files.
The code would look something like this:
import gzip
from os import listdir
from os.path import isfile, join

from pyspark.sql import Row

def readGzips(fileLoc):
    try:
        # ...
        # code to read the file and build a Row goes here
        # ...
        return record
    except:
        return Row(failed=fileLoc)

fileList = [f for f in listdir(mypath) if isfile(join(mypath, f))]
pFileList = sc.parallelize(fileList)
dataRdd = pFileList.map(readGzips).filter(lambda x: 'failed' not in x.asDict())
