Compress dataframe to one JSON string in Apache Spark - python-3.x

I have a dataframe that, when written to JSON, produces several hundred lines of JSON that are all exactly the same. I am trying to compress it to a single JSON line. Is there an out-of-the-box way to accomplish this?
import pyspark.sql
from pyspark.sql import functions as F

def collect_metrics(df) -> pyspark.sql.DataFrame:
    # count the rows whose "count" column is negative
    neg_value = df.where(F.col("count") < 0).count()
    return df.withColumn("loader_neg_values", F.lit(neg_value))

def main(args):
    # df is assumed to come from earlier in the pipeline
    df_metrics = collect_metrics(df)
    df_metrics.write.json(args.metrics)
In the end, the goal is to write one JSON line, and the output has to be a plain JSON file, not a compressed one.

It seems like you have hundreds of (duplicated) lines but you only want to keep one. You can use limit(1) in that case:
df_metrics.limit(1).write.json(args.metrics)

You want something like this:
df_metrics.limit(1).repartition(1).write.json(args.metrics)
.repartition(1) guarantees a single output file, and .limit(1) guarantees a single output row.
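Note that even with .repartition(1), Spark writes a directory containing a single part-*.json file rather than a bare file. If a standalone .json file is strictly required, a minimal sketch (assuming the one metrics row fits on the driver; "metrics.json" is a hypothetical local path standing in for args.metrics):
import json

# Collect the single row to the driver and write it as one JSON line.
row = df_metrics.limit(1).collect()[0]
with open("metrics.json", "w") as f:  # hypothetical local path
    f.write(json.dumps(row.asDict()) + "\n")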

Related

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some part of the data has '\xa0' prefixed to every word, while the other part doesn't have that special character. I am attaching two screenshots, one with '\xa0' and one without; the content shown in both belongs to the same file. Only some of the data from that file is read this way by Spark. I have checked the original data file in HDFS, and it was problem-free.
I feel that it has something to do with encoding. I tried using replace in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none of them worked for me. This question might sound dumb, but I am a newbie to Apache Spark and I need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8 encoded file.
One solution is to read the file with sc.binaryFiles and then apply the expected decoding yourself.
sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is its content as bytes.
If you only need the text, apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
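Putting it together, a minimal sketch (the 'iso-8859-1' below is only a placeholder; substitute whatever encoding the file actually uses):
rdd = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
lines = (rdd
         .map(lambda kv: kv[1].decode("iso-8859-1"))  # kv = (path, bytes)
         .flatMap(lambda text: text.splitlines()))    # back to one record per line
print(lines.take(5))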

Python KafkaConsumer start consuming messages from a timestamp

I'm planning to skip the start of the topic and only read messages from a certain timestamp to the end. Any hints on how to achieve this?
I'm guessing you are using kafka-python (https://github.com/dpkp/kafka-python), since you mentioned KafkaConsumer.
You can use the offsets_for_times() method to retrieve the offset that matches a timestamp: https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.offsets_for_times
After that, just seek to that offset using seek(): https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.seek
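A minimal sketch of that flow with kafka-python (the broker address, topic name, and timestamp below are placeholder assumptions):
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # placeholder broker
partitions = [TopicPartition("my-topic", p)                   # placeholder topic
              for p in consumer.partitions_for_topic("my-topic")]
consumer.assign(partitions)

ts_ms = 1609459200000  # milliseconds since the epoch (here: 2021-01-01 UTC)
offsets = consumer.offsets_for_times({tp: ts_ms for tp in partitions})
for tp, offset_ts in offsets.items():
    if offset_ts is not None:  # None when no message at/after the timestamp
        consumer.seek(tp, offset_ts.offset)

for message in consumer:
    print(message.offset, message.value)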
Hope this helps!
I got it partially working, but I'm not sure about the values I got back from the method.
I have a KafkaConsumer (ck), and I retrieved the partitions for the topic with the assignment() method. With those, I can build a dictionary mapping each partition to the timestamp I'm interested in (in this case 100).
Side question: should I use 0 in order to get all the messages?
I can pass that dictionary as the argument to offsets_for_times(). However, the values I get back are all None:
zz = dict(zip(ck.assignment(), [100] * len(ck.assignment())))
z = ck.offsets_for_times(zz)
z.values()
dict_values([None, None, None])
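Note that offsets_for_times() expects timestamps in milliseconds since the epoch, so a value like 100 means 100 ms after 1970-01-01, not "the 100th message". A sketch with a realistic timestamp (the date is a placeholder):
from datetime import datetime, timezone

# Build an epoch-milliseconds timestamp from a real date.
ts_ms = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
zz = {tp: ts_ms for tp in ck.assignment()}
z = ck.offsets_for_times(zz)  # a value is None when no offset matches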

Parse a huge JSON file

I have a very large JSON file (about a gigabyte) which I want to parse.
I tried JsonSlurper, but it looks like it tries to load the whole file into memory, which causes an out-of-memory exception.
Here is a piece of code I have:
import groovy.json.JsonSlurper
import groovy.json.JsonParserType

def parser = new JsonSlurper().setType(JsonParserType.CHARACTER_SOURCE)
def result = parser.parse(new File("equity_listing_full_201604160411.json"))
result.each {
    println it.Listing.ID
}
And the JSON is something like this, but much longer and with more columns and rows:
[
  {"Listing": {"ID":"2013056","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927445"]}},
  {"Listing": {"ID":"2013057","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927446"]}},
  {"Listing": {"ID":"2013058","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927447"]}}
]
I want to be able to read it row by row. I can probably just parse each row separately, but I was thinking there might be something for parsing as you read.
I suggest using GSON by Google.
It has a streaming parsing option: https://sites.google.com/site/gson/streaming

Overwriting specific lines in Python

I have a simple program that manipulates some data stored in text files. However, I have to store the name and the password in different files for Python to read.
I was wondering if I could put these two words (the name and the password) on two separate lines of one file, and get Python to overwrite just one of the lines, depending on what I choose to overwrite (either the password or the name).
I can get Python to read a specific line with:
linenumber = linecache.getline("example.txt", 4)
Ideally I'd like something like this:
linenumber = linecache.writeline("example.txt", "Hello", 4)
So this would just write "Hello" to "example.txt", only on line 4.
But unfortunately it doesn't seem to be as simple as that. I can get the words stored in separate files, but on a larger scale I'm going to have a lot of text files, all named differently and with different words in them.
If anyone would be able to help, it would be much appreciated!
Thanks, James.
You can try it with the built-in open() function:
def overwrite(filename, newline, linenumber):
    try:
        with open(filename, 'r') as reading:
            lines = reading.readlines()
        lines[linenumber] = newline + '\n'  # note: the index is 0-based
        with open(filename, 'w') as writing:
            for i in lines:
                writing.write(i)
        return 0
    except:
        return 1  # reading/writing went wrong, e.g. no such file
Be careful! It rewrites all the lines in a loop, and if an exception strikes mid-write, example.txt may already be blank. You may want to keep all the lines in a list the whole time so you can write them back in the exception handler, or keep a backup of your old files.
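A hypothetical usage example (the file name and contents are placeholders):
# Replace line 4 of "example.txt" with "Hello".
# Pass 3 because the function indexes lines from 0.
status = overwrite("example.txt", "Hello", 3)
print("updated" if status == 0 else "failed")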

CSV to CSV (XSLT)

We have to transform a CSV file into another CSV (1 file to 1 file). We are looking for a cheap solution. The first idea that popped into my mind was Excel, but the file will be too big.
1) Is it possible to do a CSV-to-CSV conversion through XSLT? I can't seem to find a tool or Google result that tells me how I could possibly do it.
2) Is there a better approach to CSV transformations?
Edit:
It should be possible to automate/schedule the process.
My answers are below.
1) No, XSLT only transforms XML files.
2) Yes; as the answer to question 1 is "no", it is reasonable to assert that there are better approaches. As CSV is not a standardised format, there is a plethora of varied approaches to choose from.
Use Rscript to automate the transformation of the CSV:
# Rscript --vanilla myscript.R
Where myscript.R is something like:
data <- read.csv(file = "input.csv", header = TRUE, sep = ",")
# Modify your data ...
write.csv(data, file = "output.csv", row.names = FALSE)
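An alternative sketch in Python (the file names are placeholders): the standard csv module streams rows one at a time, so it copes with files far too big for Excel and is just as easy to schedule.
import csv

# Stream rows from input.csv to output.csv, transforming as they pass through.
with open("input.csv", newline="") as src, \
     open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Modify the row here ...
        writer.writerow(row)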
