apache pig, store result in a txt file

Hello, I'm a new Pig user.
I'm trying to store some data in a .txt file, but when I use the STORE command it creates a folder containing the following files: _SUCCESS and part-r-00000.
How can I get the result in a single .txt file?
Thanks.

This is what STORE output usually looks like.
You can run Hadoop fs commands from inside Pig, so you can write something like the following in your Pig script (see the documentation here):
fs -getmerge /my/hdfs/output/dir/* /my/local/dir/result.txt
fs -copyFromLocal /my/local/dir/result.txt /my/hdfs/other/output/dir/

Read the files using the cat command and pipe the output into a .txt file using the put command:
hadoop fs -cat /in_dir/part-* | hadoop fs -put - /out_dir/output.txt
or
merge the files in the folder into a single .txt file using the getmerge command:
hadoop fs -getmerge /in_dir/ /out_dir/output.txt

That's the way a MapReduce job writes its output.
Since Pig runs MapReduce jobs internally, the job writes its output in the form of part files:
part-m-00000 (map output) or part-r-00000 (reduce output).
Let's say you give the following output dir ("/user/output1.txt") in your script; it will then contain:
/user/output1.txt/part-r-00000
/user/output1.txt/_SUCCESS
There may be multiple part files created inside output1.txt, in which case you can merge them into one:
hadoop fs -getmerge /user/output1.txt/* /localdir/output/result.txt
hadoop fs -copyFromLocal /localdir/output/result.txt /user/output/result.txt
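As a local illustration of what getmerge does, here is a minimal Python sketch that concatenates part-* files into one result file. The directory layout and file contents are made up for the demo; on a real cluster you would use hadoop fs -getmerge itself:

```python
import pathlib
import tempfile

def getmerge_local(out_dir: pathlib.Path) -> str:
    """Concatenate part-* files in sorted order, like hadoop fs -getmerge."""
    return "".join(p.read_text() for p in sorted(out_dir.glob("part-*")))

# Fake job output directory (layout and contents invented for the demo).
tmp = pathlib.Path(tempfile.mkdtemp())
out_dir = tmp / "output1.txt"
out_dir.mkdir()
(out_dir / "part-r-00000").write_text("row1\nrow2\n")
(out_dir / "part-r-00001").write_text("row3\n")
(out_dir / "_SUCCESS").write_text("")  # marker file, carries no data

merged = getmerge_local(out_dir)
(tmp / "result.txt").write_text(merged)  # local stand-in for result.txt
print(merged)
```

Sorting the part names keeps the rows in the same order the job emitted them, and the _SUCCESS marker is skipped because it never matches the part-* glob.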

Related

extract gz files using hdfs/spark

I have large gzip files stored at an HDFS location
- /dataset1/sample.tar.gz --> contains 1.csv & 2.csv .... and so on
I would like to extract them to
/dataset1/extracted/1.csv
/dataset1/extracted/2.csv
/dataset1/extracted/3.csv
.........................
.........................
/dataset1/extracted/1000.csv
Is there any hdfs command that can be used to extract a tar.gz file (without copying it to the local machine), or a way to do it with Python/Scala Spark?
I tried using Spark, but since Spark cannot parallelize reading a gzip file, and the gzip file is very large (about 50GB), that didn't work.
I want to split the gzip and use those splits for the Spark aggregations.
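Since a single gzip stream cannot be read in parallel, one common workaround is to re-split it into many smaller gzip files, which Spark can then read concurrently. A minimal local sketch of that splitting step, with made-up file names and sizes:

```python
import gzip
import pathlib
import tempfile

def split_gzip(src: pathlib.Path, out_dir: pathlib.Path, lines_per_split: int) -> list:
    """Stream a big .gz line by line and write it back out as smaller .gz splits."""
    out_dir.mkdir(parents=True, exist_ok=True)
    splits, buf, idx = [], [], 0

    def flush():
        nonlocal buf, idx
        part = out_dir / f"split-{idx:05d}.gz"
        with gzip.open(part, "wt") as g:
            g.writelines(buf)
        splits.append(part)
        buf, idx = [], idx + 1

    with gzip.open(src, "rt") as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_split:
                flush()
    if buf:
        flush()  # write the final, possibly short, split
    return splits

# Demo on a tiny invented file: 10 rows split into chunks of 4.
tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "sample.gz"
with gzip.open(src, "wt") as g:
    g.writelines(f"row{i}\n" for i in range(10))
parts = split_gzip(src, tmp / "splits", lines_per_split=4)
print(len(parts))  # 3
```

Because the input is streamed line by line, memory use stays bounded by lines_per_split even for a 50GB archive; the resulting splits can be uploaded back to HDFS and read by Spark in parallel.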

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, as it is static metadata, while all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
With a pyspark load, I'm able to load the file into a dataframe. I used the command:
from pyspark.sql import SparkSession

spark = SparkSession.\
    builder.\
    appName("Loading Gzip Files").\
    getOrCreate()
input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',\
    format='com.databricks.spark.csv',\
    sep='\t')
With the intention of filtering, I added the filename:
from pyspark.sql.functions import input_file_name
input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the file field is populating with the tar.gz file, making that approach useless.
A more irritating problem is that _c0 is getting populated with filename + garbage + first-row values.
At this point, I'm wondering if the file read itself is going wrong because it is a tar.gz file. When we did v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back into S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support direct *.tar.gz iteration. In order to process the files, they have to be unzipped into a temporary location. Databricks supports bash, which can do the job.
%sh find $source -name *.tar.gz -exec tar -xvzf {} -C $destination \;
The code above will unzip all files with the extension *.tar.gz from the source to the destination location.
If the path is passed via dbutils.widgets, or is static in %scala or %pyspark, it must be declared as an environment variable.
This can be achieved in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
Use the following to load a file, assuming the content is in *.csv files:
DF = spark.read.format('csv').options(header='true', inferSchema='true').option("mode","DROPMALFORMED").load('/mnt/dl/raw/source/sample.csv')
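The %sh one-liner above can also be expressed in plain Python with the standard tarfile module, which avoids passing paths through environment variables; a minimal sketch with hypothetical paths:

```python
import pathlib
import tarfile
import tempfile

def extract_all(source: pathlib.Path, destination: pathlib.Path) -> int:
    """Extract every *.tar.gz found under source into destination, mirroring
    the bash line: find $source -name *.tar.gz -exec tar -xvzf {} -C $destination ;
    Returns the number of archives processed."""
    destination.mkdir(parents=True, exist_ok=True)
    count = 0
    for archive in sorted(source.rglob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(destination)
        count += 1
    return count

# Demo with a tiny made-up archive (real paths would be under /dbfs/mnt/...).
tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "source"
src.mkdir()
member = tmp / "sample.csv"
member.write_text("a,b\n1,2\n")
with tarfile.open(src / "sample.tar.gz", "w:gz") as tar:
    tar.add(member, arcname="sample.csv")

n = extract_all(src, tmp / "dest")
print(n)  # 1
```

On Databricks this would run against the /dbfs fuse mount, after which spark.read can pick up the extracted CSVs as in the snippet above.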

Read csv file from Hadoop using Spark

I'm using spark-shell to read csv files from hdfs.
I can read those csv files using the following in bash:
bin/hadoop fs -cat /input/housing.csv | tail -5
so this suggests housing.csv is indeed in HDFS right now.
How can I read it using spark-shell?
Thanks in advance.
sc.textFile("hdfs://input/housing.csv").first()
I tried this way, but it failed.
Include the csv package in the shell and
var df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")
8020 is the default port.
Thanks,
Ash
You can read this easily with Spark using the csv method or by specifying format("csv"). In your case, either you should not specify hdfs:// at all, or you should specify the complete path hdfs://localhost:8020/input/housing.csv.
Here is a snippet of code that can read csv.
val df = spark.
  read.
  schema(dataSchema).
  csv("/input/housing.csv")
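As a cluster-free stand-in for the Spark snippets above, the same "read a CSV with a header, look at the first record" step can be sketched with Python's csv module (column names and values here are invented, not taken from the real housing.csv):

```python
import csv
import io

# Hypothetical CSV content standing in for housing.csv.
sample = "longitude,latitude,median_house_value\n-122.23,37.88,452600.0\n"

# DictReader consumes the header row, like option("header", "true") in Spark.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["median_house_value"])  # 452600.0
```

Note that, unlike Spark with inferSchema, DictReader leaves every value as a string.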

Renaming only a particular string in a filename recursively

I have a directory with 90 files. The file names look like:
/user/jk/2016d/IDPSRU20160219_2345.txt
I want to change the file name to /user/jkris03/2016d/IDPSRU20160223_2345.txt
Please note that only the 19 is replaced with 23 in the file name, and the trailing _2345 will be different for each file.
I would appreciate it very much if you could provide an answer.
Please note that the directory/files are in HDFS.
Thanks,
If you just want to replace 19_ with 23_, you can do something like this:
hdfs dfs -ls -C /user/jk/2016d/ | awk '{OLD=$0; sub("19_", "23_", $0); system("hdfs dfs -mv "OLD" "$0);}'
where,
hdfs dfs -ls -C /user/jk/2016d/ : is for listing the HDFS files
OLD=$0 : is for storing the old file name
sub("19_", "23_", $0) : is for creating new file name
system("hdfs dfs -mv "OLD" "$0) : is for renaming the file
Hope it helps!
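The substitution inside the awk one-liner can also be sketched in Python; the re.sub call below mirrors awk's sub("19_", "23_", $0), replacing only the first match (the sample path is taken from the question):

```python
import re

def new_name(old: str) -> str:
    """Replace the first '19_' with '23_', like awk's sub("19_", "23_", $0)."""
    return re.sub("19_", "23_", old, count=1)

old = "/user/jk/2016d/IDPSRU20160219_2345.txt"
print(new_name(old))  # /user/jk/2016d/IDPSRU20160223_2345.txt
```

Anchoring on "19_" rather than a bare "19" matters here: the date portion contains other digits, and only the 19 immediately before the underscore should change. Each (old, new) pair would then be passed to hdfs dfs -mv as in the answer above.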
You can have a look at the rename command. It allows you to rename by regex. I think it differs between distributions, so use man rename to see how it works for you.

Spark 2.0: How to list or remove dirs and files in s3

Is there any way to list files and dirs, remove files and dirs, check whether a dir exists, etc., directly from the Spark 2.0 shell?
I am able to use the Python os library, but it only 'sees' local dirs, not S3.
I have also found this, but I cannot make it work:
http://bigdatatech.taleia.software/2015/12/21/check-if-exists-a-amazon-s3-path-from-apache-spark/
Thanks
You can use s3cmd (http://s3tools.org/s3cmd-howto); in order to use it from Python, you'll need os.system or subprocess.
List your buckets with s3cmd ls:
~$ s3cmd ls
2007-01-19 01:41 s3://logix.cz-test
Upload a file into the bucket
~$ s3cmd put addressbook.xml s3://logix.cz-test/addrbook.xml
File 'addressbook.xml' stored as s3://logix.cz-test/addrbook.xml (123456 bytes)
Another option is the tinys3 lib: https://www.smore.com/labs/tinys3/
Another option is simples3: http://sendapatch.se/projects/simples3/
s = S3Bucket(bucket, access_key=access_key, secret_key=secret_key)
print s
<S3Bucket ... at 'https://s3.amazonaws.com/...'>
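Since the first answer suggests driving s3cmd through os.system or subprocess, here is a minimal sketch that builds the argv lists for subprocess.run (the bucket and file names are hypothetical, and the commands are not executed here):

```python
def s3cmd_argv(*args: str) -> list:
    """Build an s3cmd invocation as an argv list suitable for subprocess.run."""
    return ["s3cmd", *args]

# Hypothetical bucket/object names; run with e.g.
# subprocess.run(ls_cmd, capture_output=True, text=True)
ls_cmd = s3cmd_argv("ls", "s3://my-bucket/")
put_cmd = s3cmd_argv("put", "result.txt", "s3://my-bucket/result.txt")
print(ls_cmd)
```

Passing an argv list to subprocess.run (rather than a shell string to os.system) avoids shell-quoting problems when keys contain spaces or special characters.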
