Export from Pig to CSV - Excel

I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R or SPSS, etc.) without a lot of manipulation...
I've tried using the following function:
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');
It creates a folder with that name containing lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv, but there's no header, which means I have to add it manually.
I've read that PigStorageSchema is supposed to create a separate file with a header, but it doesn't seem to work at all; e.g., I get the same result as a plain store, with no header file:
STORE pig_object INTO '/Users/Name/Folder/pig_object'
USING org.apache.pig.piggybank.storage.PigStorageSchema();
(I've tried this in both local and mapreduce mode).
Is there any way of getting the data out of Pig into a simple CSV file without these multiple steps?
Any help would be much appreciated!

I'm afraid there isn't a one-liner which does the job, but you can come up with the following (Pig v0.10.0):
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');
When PigStorage takes '-schema' it will create a '.pig_schema' and a '.pig_header' in the output directory. Then you have to merge '.pig_header' with 'part-x-xxxxx':
1. If the result needs to be copied to the local disk:
hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
(Since -getmerge takes an input directory you need to get rid of .pig_schema first)
2. Storing the result on HDFS:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header /user/hadoop/csvoutput/part-x-xxxxx | \
hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?

If you store your data using PigStorage on HDFS, you can then merge it with -getmerge -nl:
STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object /Users/Name/Folder/pig_object.csv;
Docs:
Optionally -nl can be set to enable adding a newline character (LF) at
the end of each file.
You will have a single TSV/CSV file with the following structure:
1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA
...
So we can simply remove lines 2-4 using AWK:
awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv

Related

Why can't I merge multiple parquet files using "cat file1.parquet file2.parquet > result.parquet"?

I have created multiple parquet files using pyspark, and now I'm trying to merge all the parquet files into one. I'm able to merge the files, but while reading the resulting file I'm getting an error. Has anyone faced this issue before?
You cannot simply concatenate Parquet files using cat as they are binary files with a "table of contents" in the footer. To merge two files, you would have to read them both in and write out a completely new file. This could be done easily using the merge command in the parquet-tools.
The technical reason that merging two Parquet files using cat doesn't work comes down to the fact that a Parquet file is useless without its footer. Every Parquet file is made up roughly of the following structure:
RowGroup(nrows=..)
    Column with nrows
    Column with nrows
    ..
RowGroup(nrows=..)
    ..
..
Footer
    Schema (tells you the type of the columns)
    total_nrows
    Location of RowGroups in the file
If you cat two files together, you would only see the last footer of the two files. Additionally, if the library used to read the files does some integrity checks, it will realise that your file is broken in some fashion and completely error out.
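For comparison, here is a minimal Python sketch of the "read them both in and write out a completely new file" approach. Using pyarrow here is my own assumption; the answer above points to the merge command in parquet-tools for the same job.

import pyarrow as pa
import pyarrow.parquet as pq

# read both files fully, then write a fresh file with a single valid footer
tables = [pq.read_table(p) for p in ["file1.parquet", "file2.parquet"]]
merged = pa.concat_tables(tables)          # requires matching schemas
pq.write_table(merged, "result.parquet")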

Can underlying parquet files be deleted without negatively impacting DeltaLake _delta_log

Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs).
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Obviously time-traveling, i.e. loading a previous version of the table that relied on the parquet files I removed, would not work. What I want to know is, would there be any issues reading, writing, or appending to the current version of the DeltaLake table?
What I am thinking of doing in pySpark:
### Assuming a working SparkSession as `spark`
from subprocess import check_output
import json
from pyspark.sql import functions as F
awscmd = "aws s3 cp s3://my_s3_bucket/delta/_delta_log/_last_checkpoint -"
last_checkpoint = str(json.loads(check_output(awscmd, shell=True).decode("utf-8")).get('version')).zfill(20)
s3_bucket_path = "s3a://my_s3_bucket/delta/"
df_chkpt_del = (
    spark.read.format("parquet")
    .load(f"{s3_bucket_path}/_delta_log/{last_checkpoint}.checkpoint.parquet")
    .where(F.col("remove").isNotNull())
    .select("remove.*")
    .withColumn("deletionTimestamp", F.from_unixtime(F.col("deletionTimestamp")/1000))
    .withColumn("delDateDiffDays", F.datediff(F.col("deletionTimestamp"), F.current_timestamp()))
    .where(F.col("delDateDiffDays") < -7)
)
There are a lot of options from here. One could be:
df_chkpt_del.select("path").toPandas().to_csv("files_to_delete.csv", index=False)
Where I could read files_to_delete.csv into a bash array and then use a simple bash for loop, passing each parquet file's S3 path to an aws s3 rm command to remove the files one by one.
This may be slower than vacuum(), but at least it will not be consuming cluster resources while it is working.
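A minimal Python sketch of that loop (assuming, for illustration, that files_to_delete.csv has a single "path" column containing full s3:// URIs and that the AWS CLI is on the PATH):

import csv
import subprocess

with open("files_to_delete.csv") as fh:
    for row in csv.DictReader(fh):    # one column named "path"
        # remove each data file one by one, without using cluster resources
        subprocess.run(["aws", "s3", "rm", row["path"]], check=True)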
If I do this, will I also have to either:
write a new _delta_log/000000000000000#####.json file that correctly documents these changes?
write a new 000000000000000#####.checkpoint.parquet file that correctly documents these changes and change the _delta_log/_last_checkpoint file to point to that checkpoint.parquet file?
The second option would be easier.
However, if there will be no negative effects if I just remove the files and don't change anything in the _delta_log, then that would be the easiest.
TL;DR: answering this question.
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Yes, this could potentially corrupt your delta table.
Let me briefly explain how Delta Lake reads a version using the _delta_log.
If you want to read version x, it will go through the delta log of all versions from 1 to x-1 and make a running sum of which parquet files to read. A summary of this process is saved as a .checkpoint after every 10th version, to make this running-sum process efficient.
What do I mean by this running sum?
Assume,
version 1 log says: add file_1, file_2, file_3
version 2 log says: delete file_1, file_2, and add file_4
So when reading version 2, the combined instruction will be
add file_1, file_2, file_3 -> delete file_1, file_2, and add file_4
So the resulting files read will be file_3 and file_4.
What if you delete a parquet file from the file system?
Say in version 3 you delete file_4 from the file system. If you don't use .vacuum, the delta log will not know that file_4 is missing; it will try to read it and will fail.
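As an illustration of that running sum (a sketch only, not Delta Lake's actual implementation), you can replay the add/remove actions in the JSON commit files to get the set of live files for a version:

import json

def active_files(delta_log_dir, version):
    """Replay add/remove actions from commits 0..version (local paths assumed)."""
    live = set()
    for v in range(version + 1):
        with open(f"{delta_log_dir}/{v:020d}.json") as fh:
            for line in fh:                      # each line is one JSON action
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live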

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in HDFS as files called part-r-0000X (X = 0, 1, etc.). And, because I want to join the whole content into one file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
I've thought of another option, which is to write the file from the Spark program this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas or suggestions? Thanks!
My guess is that you wish to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Let the files be parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which actually does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)[source]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
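A small pyspark sketch of this (assuming an existing SparkContext sc and the hdfs://a-hdfs-path directory from the docs above):

# (path, whole file content) pairs, one per part file
pairs = sc.wholeTextFiles("hdfs://a-hdfs-path")

# flatten back to one record per line, as if the parts had been merged
lines = pairs.flatMap(lambda kv: kv[1].splitlines())
print(lines.take(5))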
Try Spark bucketBy.
This is a nice feature via df.write.saveAsTable(), but this format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
Which saves the outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, it reads those files using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.
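The same pattern in pyspark, as a hedged sketch (outputData, outPath, inDir and numPartition stand for the same things as in the Scala snippets above):

# write compressed part-0000X.gz files under outPath
outputData.saveAsTextFile(
    outPath,
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

# ...and read them back from the other application
inputData = sc.textFile(inDir + "part-00*", numPartition)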

pyspark read csv file multiLine option not working for records which have newlines (Spark 2.3 and Spark 2.2)

I am trying to read a .dat file using the pyspark CSV reader, and it contains newline characters ("\n") as part of the data. Spark is unable to read this file as a single column; rather, it treats the newline as the start of a new row.
I tried using the "multiLine" option while reading, but it's still not working.
spark.read.csv(file_path, schema=schema, sep=delimiter,multiLine=True)
The data is something like this (here $ is the CRLF newline as shown in vim):
name,test,12345,$
$
,desc$
name2,test2,12345,$
$
,desc2$
So pyspark is treating desc as the next record.
How can I read such data in pyspark?
I tried this in both Spark 2.2 and Spark 2.3.
I created my own Hadoop custom record reader and was able to read the file by invoking the API:
spark.sparkContext.newAPIHadoopFile(
    file_path,
    'com.test.multi.reader.CustomFileFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf=conf
)
In the custom record reader, I implemented the logic to handle the newline characters encountered.

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) in some of the rows. The following sample shows how it looks when opened with Notepad++:
I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the CSV that allows me to read these two lines as only one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
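Putting it together, a minimal sketch for Spark 2.2+ (a SparkSession named spark is assumed, "file.csv" stands in for the RFC 4180 file, and the header/quote/escape options are my assumptions about that file):

df = (
    spark.read
    .option("header", "true")      # assumption: the file has a header row
    .option("multiLine", "true")   # keep quoted CRLFs inside a single record
    .option("quote", '"')
    .option("escape", '"')         # RFC 4180 escapes quotes by doubling them
    .csv("file.csv")
)
df.show(truncate=False)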
