Read csv file from Hadoop using Spark - apache-spark

I'm using spark-shell to read CSV files from HDFS.
I can read those CSV files using the following command in bash:
bin/hadoop fs -cat /input/housing.csv | tail -5
so this suggests housing.csv is indeed in HDFS right now.
How can I read it using spark-shell?
Thanks in advance.
sc.textFile("hdfs://input/housing.csv").first()
I tried this way, but it failed.

Include the csv package in the shell and use:
val df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")
8020 is the default NameNode port.
Thanks,
Ash

You can read this easily with Spark using the csv method or by specifying format("csv"). In your case you should either drop the hdfs:// scheme entirely or specify the complete path, hdfs://localhost:8020/input/housing.csv.
Here is a snippet of code that can read the csv (dataSchema here is a StructType you define to match the file):
val df = spark.
  read.
  schema(dataSchema).
  csv(s"/input/housing.csv")
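To see why the original attempt failed, note that in a URI like hdfs://input/housing.csv the first component after the scheme is parsed as the NameNode host, not as a directory. A quick illustration with Python's standard urllib (the URIs are the ones from this question):

```python
from urllib.parse import urlparse

# In "hdfs://input/housing.csv", "input" lands in the host position,
# so HDFS looks for a NameNode called "input" instead of a directory.
bad = urlparse("hdfs://input/housing.csv")
print(bad.netloc, bad.path)    # input /housing.csv

# With a full authority, the path keeps the /input directory.
good = urlparse("hdfs://localhost:8020/input/housing.csv")
print(good.netloc, good.path)  # localhost:8020 /input/housing.csv
```

This is why either dropping the scheme (so the path resolves against the default filesystem) or writing the full hdfs://host:port/path works.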

Related

PySpark read gzip of multiple json file Failed

I have no issue reading a standalone JSON file consisting of a single line in the format {xxx...}. However, when I compress it using tar -zcvf into one-file.json.gz and attempt to read it, I receive a single column named
root
|-- _corrupt_record: string (nullable = true)
This is the code used to read that gzip file:
df = (
spark.read.option("recursiveFileLookup", "true")
.json("../../one-file.json.gz")
)
When I try to use AWS Glue, it raises exceptions like:
"Failure Reason": "Unable to parse file: one-file.json.gz\n"
I want to know what's wrong here.
The reason is simple: tar -zcvf produces a tar archive, not plain gzipped JSON, so Spark cannot parse it (Spark can read a gzip-compressed .json.gz directly, but not a tar archive). You have to extract the archive before it is read by Spark. You can use the tarfile module to do it, for example:
import tarfile

# Extract the tar archive first, then point Spark at the extracted files.
with tarfile.open("one-file.json.gz", "r:gz") as tar_file:
    tar_file.extractall("extracted/")

df = spark.read.option("recursiveFileLookup", "true").json("extracted/")
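As a self-contained sanity check (file names and contents here are made up), the whole round trip can be reproduced locally with only the standard library: write a one-line JSON file, archive it the way tar -zcvf would, extract it, and confirm the payload parses again:

```python
import json
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Write a one-line JSON file, mimicking the original standalone file.
src = os.path.join(workdir, "one-file.json")
with open(src, "w") as f:
    json.dump({"id": 1, "name": "example"}, f)

# tar -zcvf produces a tar archive wrapped in gzip, despite the .json.gz name.
archive = os.path.join(workdir, "one-file.json.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="one-file.json")

# After extraction, the file is plain JSON again and parses cleanly.
extract_dir = os.path.join(workdir, "extracted")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(extract_dir)

with open(os.path.join(extract_dir, "one-file.json")) as f:
    print(json.load(f))  # {'id': 1, 'name': 'example'}
```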

Can not read the data from HDFS in pySpark

I am a beginner in coding. I am currently trying to read a file (which was imported to HDFS using Sqoop) with the help of pyspark. The Spark job is not progressing and my Jupyter pyspark kernel seems stuck. I am not sure whether I used the correct way to import the file to HDFS, or whether the code used to read the file with Spark is correct.
The sqoop import code I used is as follows
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --target-dir /user/root/Spar_Nord -m 1
The pyspark code I used is
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000", header = False, inferSchema = True)
Also, please advise how we can know the type of the file that we imported with Sqoop? I just assumed .csv and wrote the pyspark code.
Appreciate a quick help.
When pulling data into HDFS via Sqoop, the delimiter depends on the options passed: Sqoop creates a generic delimited text file based on the parameters given to the sqoop command, and the delimiter it used may not match what Spark's csv reader expects. To make the file output explicitly comma-delimited, matching a generic csv format, you should add:
--fields-terminated-by <char>
So your sqoop command would look like:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --fields-terminated-by ',' --target-dir /user/root/Spar_Nord -m 1
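Rather than guessing the delimiter of an existing part file, one way to check is to look at its first line and count candidate separators. A minimal sketch (the sample line below is a hypothetical stand-in for the first line of part-m-00000):

```python
# Hypothetical first line of the Sqoop-exported part-m-00000 file.
first_line = "1\t2017-01-01\tNYKOBING\t4017"

# Count each candidate delimiter; the most frequent one is the best guess.
counts = {d: first_line.count(d) for d in (",", "\t", "|", ";")}
delimiter = max(counts, key=counts.get)
print(repr(delimiter))  # '\t'
```

If the detected delimiter turns out to be a tab, the already-imported file can also be read as-is with spark.read.csv(path, sep="\t") instead of re-running the Sqoop import.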

How to move files written with Pandas on Spark cluster to HDFS?

I'm running a Spark job in cluster mode and writing a few files using Pandas. I think it's writing to a temp directory, and now I want to move those files, or write them directly, to HDFS.
You have multiple options:
Convert the Pandas DataFrame into a PySpark DataFrame and simply save it into HDFS:
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.parquet("hdfs:///path/on/hdfs/file.parquet")
Save the file locally using Pandas and use subprocess to copy the file to HDFS:
import subprocess

command = "hdfs dfs -copyFromLocal -f local/file.parquet /path/on/hdfs".split()
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode())
print(result.stderr.decode())
Save the file locally and use a third-party library, hdfs3, to copy the file to HDFS:
from hdfs3 import HDFileSystem

hdfs = HDFileSystem()
# hdfs3 uses put() for local-to-HDFS copies.
hdfs.put("local/file.parquet", "/path/on/hdfs/file.parquet")
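The subprocess pattern above can be tried anywhere by swapping in a stand-in command (echo instead of hdfs dfs, which needs a Hadoop client on the machine); the return code tells you whether the command succeeded:

```python
import subprocess

# Stand-in for the hdfs command so the sketch runs without a Hadoop client.
command = "echo copied file.parquet".split()
result = subprocess.run(command, capture_output=True, text=True)

print(result.returncode)       # 0 on success
print(result.stdout.strip())   # copied file.parquet
```

With the real hdfs dfs -copyFromLocal command, a non-zero returncode plus result.stderr is where to look when the copy fails.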

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, as it is static metadata, while all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
By pyspark load, I'm able to load the file into a dataframe. I used the command:
spark = SparkSession.\
    builder.\
    appName("Loading Gzip Files").\
    getOrCreate()
input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',
                        format='com.databricks.spark.csv',
                        sep='\t')
With the intention to filter, I added the filename column:
from pyspark.sql.functions import input_file_name
input = input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the filename field is populated with the tar.gz file itself, making that approach useless.
A more irritating problem is that _c0 is getting populated with filename + garbage + first-row values.
At this point I'm wondering if the file read itself is getting weird because it is a tar.gz file. When we did v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back into S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support direct *.tar.gz iteration. In order to process the files, they have to be unzipped into a temporary location. Databricks supports bash (%sh), which can do the job:
%sh find $source -name '*.tar.gz' -exec tar -xvzf {} -C $destination \;
The command above extracts all files with the extension *.tar.gz from source into the destination location.
If the path is passed via dbutils.widgets, or is static in %scala or %pyspark, it must be exported as an environment variable so that %sh can see it.
This can be achieved in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
Then use the following to load a file, assuming the content is in *.csv files:
DF = spark.read.format('csv').options(header='true', inferSchema='true').option("mode","DROPMALFORMED").load('/mnt/dl/raw/source/sample.csv')
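For the original goal of skipping b.tsv, the archive members can also be filtered by name with Python's tarfile before anything is handed to Spark. A self-contained sketch (the archive here is built in memory with made-up contents, mirroring gzfile.tar.gz's layout):

```python
import io
import tarfile

# Build a small in-memory tar.gz mimicking gzfile.tar.gz's layout.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name in ("a.tsv", "b.tsv", "c.tsv"):
        data = b"col1\tcol2\n"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Re-open the archive and keep every member except the static b.tsv.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    records = [m.name for m in tar.getmembers() if m.name != "b.tsv"]

print(records)  # ['a.tsv', 'c.tsv']
```

The kept members could then be extracted to a staging directory and loaded with spark.read.csv(..., sep='\t').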

apache pig, store result in a txt file

Hello, I'm a new Pig user.
I'm trying to store some data in a txt file, but when I use the STORE command, it creates a folder that contains the files _SUCCESS and part-r-00000.
How can I get this result in a single txt file?
Thanks.
This is how STORE output usually looks.
You can run hadoop fs commands from inside Pig, so you can write something like the below inside your Pig script (see the Pig documentation on shell commands):
fs -getmerge /my/hdfs/output/dir/* /my/local/dir/result.txt
fs -copyFromLocal /my/local/dir/result.txt /my/hdfs/other/output/dir/
Read the files using the cat command and pipe the output to a .txt file using the put command:
hadoop fs -cat /in_dir/part-* | hadoop fs -put - /out_dir/output.txt
or merge the files in the folder into an output .txt file using the getmerge command (note that getmerge writes to a local destination):
hadoop fs -getmerge /in_dir/ /out_dir/output.txt
That's the way a MapReduce job writes its output.
As Pig runs MapReduce jobs internally, the job writes output in the form of part files:
part-m-00000 (map output) or part-r-00000 (reduce output).
Let's say you give the following output dir ("/user/output1.txt") in your script; it will then contain:
/user/output1.txt/part-r-00000
/user/output1.txt/_SUCCESS
There may be multiple part files created inside output1.txt, in which case you can merge them into one:
hadoop fs -getmerge /user/output1.txt/* /localdir/output/result.txt
hadoop fs -copyFromLocal /localdir/output/result.txt /user/output/result.txt
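What getmerge does is essentially concatenate the part files in order. A local, stand-alone sketch of the same idea (paths and contents here are made up):

```python
import os
import tempfile

# Fake a Pig/MapReduce output directory with two part files.
outdir = tempfile.mkdtemp()
for i, text in enumerate(["first\n", "second\n"]):
    with open(os.path.join(outdir, f"part-r-{i:05d}"), "w") as f:
        f.write(text)

# Concatenate the part files in sorted order into one result file,
# mirroring what `hadoop fs -getmerge` does with an HDFS directory.
merged_path = os.path.join(outdir, "result.txt")
with open(merged_path, "w") as merged:
    for name in sorted(os.listdir(outdir)):
        if name.startswith("part-"):
            with open(os.path.join(outdir, name)) as part:
                merged.write(part.read())

with open(merged_path) as f:
    print(f.read())  # "first" then "second", one per line
```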
