How to add commas between JSON elements using Spark Scala - apache-spark

I'm loading table data into a dataframe and creating multiple JSON part files. The structure of the data is good, but the elements in the JSON are not separated by commas.
This is the output:
{"time_stamp":"2016-12-08 01:45:00","Temperature":0.8,"Energy":111111.5,"Net_Energy":1111.3}
{"time_stamp":"2016-12-08 02:00:00","Temperature":21.9,"Energy":222222.5,"Net_Energy":222.0}
I'm supposed to get something like this:
{"time_stamp":"2016-12-08 01:45:00","Temperature":0.8,"Energy":111111.5,"Net_Energy":1111.3},
{"time_stamp":"2016-12-08 02:00:00","Temperature":21.9,"Energy":222222.5,"Net_Energy":222.0}
How do I do this?

Your output is correct JSON Lines output: one JSON record per line, separated by newlines. You do not need commas between the lines; in fact, adding them would leave you with something that is neither valid JSON nor valid JSON Lines.
If you absolutely need to turn the entire output of a Spark job into a single JSON array of objects, there are two ways to do it (see the sketch after this list):
For data that fits in driver RAM, use df.as[String].collect.mkString("[", ",", "]").
For data that does not fit in driver RAM, you really shouldn't do it, but if you absolutely have to, use shell operations: start the output with [, append a comma to every line but the last, and end with ].
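For the first case, a minimal sketch, assuming the part files fit in driver memory; spark is an active SparkSession and both paths here are made up:

import java.nio.file.{Files, Paths}

// Read the JSON Lines part files back as plain text, one record per element.
val records = spark.read.textFile("/path/to/json/part-files").collect()

// Wrap the records in a single JSON array, one record per line.
val jsonArray = records.mkString("[", ",\n", "]")

Files.write(Paths.get("/tmp/combined.json"), jsonArray.getBytes("UTF-8"))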

Related

I want to join two strings, but separate the two strings by a space

I want to process files quickly through a program called STAR, but I have many files and want to pre-format the input from my files to save time. The format required is:
sample1read1.fq, sample2read1.fq \space\ sample1read2.fq, sample2read2.fq
[EDIT]: My files look like this:
trimmed_Sample_RX.fq where X can either be 1 or 2.
STAR wants me to load all of the R1s together, separated by commas, then a space, and then all of my R2s together separated by commas. To tackle this problem I have attempted to use the join method in Python:
import fnmatch
import os

PATH = '.'  # directory containing the .fq files

def identifier(x):
    return x[-10:]

read1 = sorted(fnmatch.filter(os.listdir(PATH), '*_R1.fq'), key=identifier)
read1.append(' ')  # attempt to force a space onto the end of the R1 list
read2 = sorted(fnmatch.filter(os.listdir(PATH), '*_R2.fq'), key=identifier)
first_half = ','.join(read1)
second_half = ','.join(read2)
star_input = first_half + second_half
print(star_input)
# 'trimmed_sample1R1.fq,trimmed_sample2R1.fq, , trimmed_sample1R2.fq,trimmed_sample2R2.fq'
I attempted to add a space to the end of my file list read1. Then I turned everything into strings and tried to join the two halves together, but the space I added to the first half shows up in the concatenation as a comma:
'trimmed_sample1R1.fq,trimmed_sample2R1.fq, , trimmed_sample1R2.fq,trimmed_sample2R2.fq'
If I remove the step where I append a blank space and then concatenate the two strings, I get the following:
'trimmed_sample1R1.fq,trimmed_sample2R1.fqtrimmed_sample1R2.fq,trimmed_sample2R2.fq'
So now the comma is gone, but I also lose the space.
Thanks.
I think I found a workaround, but if there is a better way to do this, please point it out to me. Basically I kept my code the same but changed the last step. Now the flow looks like this:
first_half = ','.join(read1)
second_half = ','.join(read2)
star_input = '{} {}'.format(first_half, second_half)
print(star_input)
# 'trimmed_sample1R1.fq,trimmed_sample2R1.fq trimmed_sample1R2.fq,trimmed_sample2R2.fq'

TextInputFormat vs HiveIgnoreKeyTextOutputFormat

I'm just starting out with Hive, and I have a question about input/output formats. I'm using the OpenCSVSerde serde, but I don't understand why for text files the input format is org.apache.hadoop.mapred.TextInputFormat while the output format is org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.
I've read this question but it's still not clear to me why the input/output formats are different. Isn't that basically saying you're going to store data added to this table differently from the data that's read from the table?
Anyway, any help would be appreciated.
In TextInputFormat, keys are positions in the file (a long data type) and values are the lines of text. When a program reads the file, it might use the keys for random reads, whereas when writing text data through HiveIgnoreKeyTextOutputFormat there is no point in maintaining a position, so it is dropped.
Hence, HiveIgnoreKeyTextOutputFormat passes the key as null to the underlying RecordWriter. When the RecordWriter receives a null key, it ignores the key and just writes the value followed by a line separator. Otherwise, the RecordWriter writes the key, then the delimiter, then the value, and finally a line separator.
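One way to see those keys is to read a text file through TextInputFormat yourself; a small sketch, assuming a Spark shell (so sc is a live SparkContext) and a made-up path:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Each record is (byte offset of the line in the file, the line itself).
val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/path/to/table/file")
records.take(3).foreach { case (offset, line) =>
  println(s"key=${offset.get} value=${line.toString}")
}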

Parse Fortran String Like A File

I want to pass a string from C to Fortran and then process it line-by-line as if I was reading a file. Is this possible?
Example string (contains newlines):
File description: this file contains stuff
3 Values
1 Description
2 Another description
3 More text
Then, I would like to parse the string line-by-line, like a file. Similar to this:
      subroutine READ_STR(str, len)
      character str*(len), desc*70
      read(str,'(a)') desc
      read(str,*) n
      do 10 i = 1, n
         read(str,*) parm(i)
   10 continue
      end
Not without significant "manual" intervention. There are a couple of issues (a sketch of the first workaround follows the list):
The newline character has no special meaning in an internal file. Records in an internal file correspond to elements in an array. You would either need to first manually preprocess your character scalar into an array, or use a single READ that skipped over the newline characters.
If you did process the string into an array, then internal files do not maintain a file position between parent READ statements. You would need to manually track the current record yourself, or process the entire array with a single READ statement that accessed multiple records.
If you have variable width fields, then writing a format specification to process everything in a single READ could be problematic, though this depends on the details of the input.
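To make the first workaround concrete, here is its shape sketched in Scala rather than Fortran (purely illustrative; the same structure maps onto a Fortran character array plus an integer record counter):

// Split the scalar on newlines to get the "records", then keep your own
// record counter instead of relying on a file position.
val str = "File description: this file contains stuff\n" +
          "3 Values\n1 Description\n2 Another description\n3 More text"

val records = str.split('\n')   // one element per "record"
var rec = 0                     // manually tracked record position

val desc = records(rec); rec += 1                          // read(str,'(a)') desc
val n = records(rec).trim.split("\\s+")(0).toInt; rec += 1 // read(str,*) n
val parm = Array.tabulate(n) { _ =>                        // read(str,*) parm(i)
  val v = records(rec).trim.split("\\s+")(0).toInt; rec += 1
  v
}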

Separating fields out of a string in Hive

I have the following problem...
I work with Hive and want to load a file with several (different) kinds of rows of strings. These contain fields with fixed sizes, like this:
A20130420bcd 34 fgh
where the fields have the length 1,8,6,4,3.
Separated it would look like this:
"A,20130420,bcd,fgh"
Is there any way to read the string and split it into fields, other than taking a substring for every field like
substring(col_value,1,1) Field1
etc.?
I would imagine that cutting off the already-read part of the string would improve performance, but I can't think of any way to do this with the functions available here.
Secondly, as stated before, there are different types of strings, ordered and identified by the first character. Right now I just check for those with a WHERE clause, but it's horrible, as it runs through the whole file just to find only the first kind of string. Is there any way to read specific lines by their number? If I know that the first string will be of a certain kind, can I read it directly?
Right now it looks like this:
insert overwrite table TEST
select
  substring(col_value,1,1)  field1,
  ...
  substring(col_value,10,3) field5
from temp_data
where substring(col_value,1,1) = 'A';
Any ideas on this? I would love to hear some =)
You need to write your own generic UDF parser that outputs a struct (or a map, or whatever is appropriate); you can refer to existing UDFs that output multiple values.
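A minimal sketch of such a UDF in Scala (all names are hypothetical, the field widths 1,8,6,4,3 come from the question, and it assumes the hive-exec jar is on the classpath):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class FixedWidthParser extends GenericUDF {
  // Fixed field widths for this record layout.
  private val widths = Array(1, 8, 6, 4, 3)
  private val names  = java.util.Arrays.asList("field1", "field2", "field3", "field4", "field5")

  override def initialize(args: Array[ObjectInspector]): ObjectInspector = {
    if (args.length != 1)
      throw new UDFArgumentException("parse() expects a single string argument")
    val fieldOIs = new java.util.ArrayList[ObjectInspector]()
    widths.foreach(_ => fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector))
    ObjectInspectorFactory.getStandardStructObjectInspector(names, fieldOIs)
  }

  override def evaluate(args: Array[GenericUDF.DeferredObject]): AnyRef = {
    val raw = args(0).get()
    if (raw == null) return null
    val s = raw.toString
    // Slice each fixed-width field off the string in a single left-to-right pass.
    val out = new Array[AnyRef](widths.length)
    var pos = 0
    for (i <- widths.indices) {
      val start = math.min(pos, s.length)
      val end   = math.min(pos + widths(i), s.length)
      out(i) = s.substring(start, end)
      pos += widths(i)
    }
    out
  }

  override def getDisplayString(children: Array[String]): String =
    "parse(" + children.mkString(", ") + ")"
}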
Then, after registering it (e.g. create temporary function parse as 'FixedWidthParser'), you can write:
insert overwrite table output
select parsed.rec.field1, parsed.rec.field2
from (
  select parse(col_value) as rec
  from temp_data
) parsed
where parsed.rec.field1 = 'A';
About the second question: you may want to check Hive's explain command to see whether Hive does the filter push-down for you (just look at how many MapReduce stages the query takes; in theory it should be a single map, depending on 1. the Hive version and 2. the output table format).
In a general sense, this is why databases are popular: they take optimization into consideration for you.

Reading all but last few lines of a data file

I can easily skip the header of a data file using getline, but then when I parse through the data file and get to the footer of the file, I end up stuck in a loop because the program is trying to parse columns of data that no longer exist. Is there an easy way to stop reading when there is no longer data in the line? It looks like there is a blank line followed by some footer information, but I cannot guarantee that all of my data files will look like that (i.e. I need something pretty generic).
Looking at your existing code (edit your question and put it there, not in a comment), I see you have nested loops. But what you really want is one loop with two reasons to exit.
while ((q < 16) && (liness >> temp)) { ... }  // stop after 16 fields, or when extraction fails
Read the line into a string; parse it only if you see a \n at the end.
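For illustration, the same two ideas (stop at the blank line before the footer, validate each line before parsing) sketched in Scala rather than C++; the file name and the column count of 16 are assumptions:

import scala.io.Source

def process(values: Array[Double]): Unit =
  println(values.mkString(" "))   // stand-in for your real handling

val lines = Source.fromFile("data.txt").getLines()
  .drop(1)                        // skip the header line
  .takeWhile(_.trim.nonEmpty)     // stop at the blank line before the footer

for (line <- lines) {
  val cols = line.trim.split("\\s+")
  if (cols.length == 16)          // parse only rows with the expected column count
    process(cols.map(_.toDouble))
}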
