pyspark DataFrame get original row CSV string - apache-spark

I'm loading a CSV file into a Spark DataFrame.
At that point I do some parsing and validation; if the validation fails, I want to write the original CSV line to a different file.
Is it possible to get the original string from the DataFrame object?
I thought about getting the line number from the DataFrame and extracting the line from the original file.
I guess it would be better to use the DF object, but if that's not possible, I'll extract it from the file.
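One way to keep the raw line around is to read the file as plain text and parse the columns yourself, so every row still carries its original string. A minimal sketch, assuming a comma-delimited file with three hypothetical columns (id, name, amount), example paths, and an example validation rule:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read as raw text so each row keeps its untouched CSV line in the "value" column.
raw = spark.read.text("input.csv")

# Parse the fields yourself (three comma-separated columns assumed here).
parsed = raw.select(
    F.col("value").alias("raw_line"),
    F.split("value", ",").alias("fields"),
).select(
    "raw_line",
    F.col("fields")[0].alias("id"),
    F.col("fields")[1].alias("name"),
    F.col("fields")[2].alias("amount"),
)

# Example validation: amount must be castable to double.
is_valid = F.col("amount").cast("double").isNotNull()
valid = parsed.filter(is_valid)
invalid = parsed.filter(~is_valid)

# Write the original, unmodified lines of the failing rows to a separate location.
invalid.select("raw_line").write.mode("overwrite").text("rejected_lines")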

Related

Writing Dataframe to a parquet file but no headers are being written

I have the following code:
print(df.show(3))
print(df.columns)
df.select('port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g').write.format("parquet").save("qwe.parquet")
For some reason this doesn't write the DataFrame to the Parquet file with the headers. The print statements above show me those columns exist, but the Parquet file doesn't have those headers.
I have also tried:
df.write.option("header", "true").mode("overwrite").parquet(write_folder)
You may find df.to_parquet(...) more convenient.
If you wish to project down to selected columns, do that first, and then write to Parquet.
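For what it's worth, Parquet stores the schema alongside the data, so the column names travel with the file; the "header" option only applies to text formats like CSV. A minimal sketch of project-first-then-write, reusing the column names from the question (the paths are examples):

cols = ["port", "key", "return_b", "return_a", "return_c", "return_d", "return_g"]
df.select(*cols).write.mode("overwrite").parquet("qwe.parquet")

# Reading the file back shows the column names preserved in the Parquet schema.
spark.read.parquet("qwe.parquet").printSchema()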

Column Type mismatch in Spark Dataframe and source

I am trying to read data from Elasticsearch. I can see the column is present as an array of strings in Elasticsearch, but when I read it with Spark as a DataFrame I see it as a String. How can I handle this data in Spark?
Note: I am reading with sqlContext.read.format("org.elasticsearch.spark.sql") because I need to write it out as a CSV file later.
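For reference, a rough sketch of that read/write path. The index name, field name, and node address are placeholders, and es.read.field.as.array.include is, to my knowledge, the elasticsearch-hadoop setting that tells the connector to map a field as an array:

from pyspark.sql import functions as F

df = (sqlContext.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")                # placeholder node address
      .option("es.read.field.as.array.include", "tags")    # treat the "tags" field as an array
      .load("my-index"))

# Arrays cannot be written to CSV directly, so flatten them first, e.g. by joining values.
df.withColumn("tags", F.concat_ws(";", "tags")).write.option("header", "true").csv("out_csv")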

Sorting records in a CSV file

Tell me how I can sort records in CSV files using TypeScript + Node.js. Sort by Id.
The number of records in a file can be up to 1 million.
Here's a conceptual solution:
1. Create a new SQLite db with a table having the appropriate schema for your columns.
2. Stream the data from the source CSV file, reading one line at a time: parse and insert the data from the line into the db table from the previous step.
3. Create the output CSV file and append the header line.
4. Iterate over the db table entries in the desired sort order, one at a time: convert each entry back into a CSV line in the correct column order, and then append the line to the CSV file from the previous step.
5. Cleanup: (optionally validate your new CSV file, and then) delete the SQLite db.
If you can fit the entire parsed CSV data in memory at once, you can push each line into an array instead of using a db, then sort the array in place and iterate over its elements, as sketched below.
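A minimal sketch of that in-memory variant (the question asks for TypeScript/Node.js; the shape is the same there, and this sketch assumes a header row with a numeric Id column and example file names):

import csv

# Load everything into memory (only viable when the parsed data fits in RAM).
with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)
    header = reader.fieldnames
    rows = list(reader)

# Sort in place by the numeric Id column.
rows.sort(key=lambda r: int(r["Id"]))

# Write the header line, then the sorted rows.
with open("sorted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)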

create different dataframe based on field value in Spark/Scala

I have a dataframe in the below format with 2 fields. One of the fields contains a code and the other field contains XML.
EventCd|XML_VALUE
1.3.6.10|<nt:SNMP>
<nt:var id="1.3.0" type="STRING"> MESSAGE </nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:SNMP>
1.3.6.11|<nt:SNMP>
<nt:var id="1.3.1" type="STRING"> CALL </nt:var>
<nt:var id="1.3.2" type="STRING">XX-AC-EF</nt:var>
</nt:SNMP>
Based on the value in the code field I want to conditionally create different dataframes and place the data in the corresponding HDFS folder.
If the code is 1.3.6.10, it should create a message dataframe and place files under the ../message/ HDFS folder, and if the code is 1.3.6.11, it should create a call dataframe and write the data into a call HDFS folder like ../call/.
I am able to create the dataframes using multiple filter operations, but is there any option to use just one dataframe and the corresponding HDFS write command?
Can someone suggest how I can do this in Spark/Scala, please?
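As a rough sketch of the filter-and-route approach (written here with the PySpark DataFrame API; the same filter/write calls exist in Scala, and the code-to-folder mapping, paths, and output format are assumptions):

routes = {"1.3.6.10": "hdfs:///data/message/", "1.3.6.11": "hdfs:///data/call/"}

# One filter + write per code, driven by the mapping above.
for code, folder in routes.items():
    df.filter(df["EventCd"] == code).write.mode("overwrite").format("parquet").save(folder)

# Alternative with a single write: partition by the code column. This creates one
# subfolder per value (named EventCd=1.3.6.10, etc., rather than message/ or call/).
# df.write.partitionBy("EventCd").format("parquet").save("hdfs:///data/events/")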

Load single column from csv file

I have a CSV file that contains a large number of columns. I want to load just one column from that file using Spark.
I know that we can use a select statement to filter a column. But what I want is for the read operation itself to load just one column.
That way, I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load the complete file and parse it for columns. As you mentioned, you can use select to restrict the columns in the dataframe, so the dataframe will have only one column.
Spark will load the complete file into memory and filter down to the column you want with the help of the select statement you mentioned.
This is because every read operation in Spark reads and scans the whole file; a distributed stream reader gets created (the reader is instantiated on every node where the data is stored).
If your problem is to read the data column-wise, then you can store the file in Parquet format and read that file instead. Parquet is columnar storage and is exactly meant for this type of use case (you can verify it using explain).
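A small sketch of the contrast (the file paths and the column name are examples):

# CSV: the whole file is still scanned; select only trims the resulting dataframe.
one_col = spark.read.option("header", "true").csv("big.csv").select("key")

# Parquet: columnar storage lets Spark read just the requested column.
spark.read.parquet("big.parquet").select("key").explain()
# The physical plan's ReadSchema should list only the "key" column.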
