Cannot read data from HDFS in PySpark - apache-spark

I am a beginner in coding. I am currently trying to read a file (which was imported to HDFS using Sqoop) with the help of PySpark. The Spark job is not progressing and my Jupyter PySpark kernel appears to be stuck. I am not sure whether I used the correct way to import the file to HDFS, or whether the code used to read the file with Spark is correct.
The sqoop import command I used is as follows:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --target-dir /user/root/Spar_Nord -m 1
The PySpark code I used is:
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000", header = False, inferSchema = True)
Also, please advise how we can know the file type that was imported with Sqoop. I just assumed .csv and wrote the PySpark code accordingly.
I would appreciate a quick reply.

When pulling data into HDFS via Sqoop, the output is a generic delimited text file whose delimiters depend on the parameters passed to the sqoop command. To guarantee the output uses a comma delimiter and matches a generic csv format, you should add:
--fields-terminated-by <char>
So your sqoop command would look like:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --fields-terminated-by ',' --target-dir /user/root/Spar_Nord -m 1
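To answer the follow-up about the file type: a plain sqoop import like the one above writes delimited text, so you can inspect a few raw lines and then re-read the file with the delimiter stated explicitly. Here is a minimal sketch in PySpark, assuming the same SparkSession (spark) and the /user/root/Spar_Nord path from the question:
# Peek at the first few raw lines to confirm the delimiter Sqoop actually used
for line in spark.sparkContext.textFile("/user/root/Spar_Nord/part-m-00000").take(5):
    print(line)

# Re-read as CSV with the delimiter made explicit
# (change sep if the lines above show a different delimiter)
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000", sep=",", header=False, inferSchema=True)
df.show(5)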

Related

How to move files written with Pandas on Spark cluster to HDFS?

I'm running a Spark job in cluster mode and writing a few files using Pandas. I think they are being written to a temp directory; now I want to move these files, or write them directly, to HDFS.
You have multiple options:
Convert the Pandas DataFrame into a PySpark DataFrame and simply save it to HDFS:
spark_df = spark.createDataFrame(pandas_df)
spark_df.write.parquet("hdfs:///path/on/hdfs/file.parquet")
Save the file locally using Pandas and use subprocess to copy it to HDFS:
import subprocess

# -f overwrites the target if it already exists
command = "hdfs dfs -copyFromLocal -f local/file.parquet /path/on/hdfs".split()
result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(result.stdout.decode())
print(result.stderr.decode())
Save the file locally and use a third-party library, hdfs3, to copy it to HDFS:
from hdfs3 import HDFileSystem

hdfs = HDFileSystem()
# put() copies a local file into a path on HDFS
hdfs.put("local/file.parquet", "/path/on/hdfs/file.parquet")

Read csv file from Hadoop using Spark

I'm using spark-shell to read csv files from hdfs.
I can read that csv file using the following command in bash:
bin/hadoop fs -cat /input/housing.csv |tail -5
so this suggests that housing.csv is indeed in HDFS right now.
How can I read it using spark-shell?
Thanks in advance.
sc.textFile("hdfs://input/housing.csv").first()
I tried it this way, but it failed.
Include the csv package when launching the shell, and then:
val df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")
8020 is the default port.
You can read this easily with Spark using the csv method or by specifying format("csv"). In your case you should either not specify hdfs:// at all, or specify the complete path including the namenode, e.g. hdfs://localhost:8020/input/housing.csv.
Here is a snippet of code that can read csv.
// dataSchema is assumed to be a StructType you have defined for the housing data
val df = spark.
  read.
  schema(dataSchema).
  csv("/input/housing.csv")

How to pass string via STDIN into terminal command being executed within python script?

I need to generate a postgres schema from a dataframe. I found the csvkit library to come closest to matching the datatypes. I can run csvkit and generate a postgres schema from a csv on my desktop via the terminal, through this command found in the docs:
csvsql -i postgresql myFile.csv
csvkit docs - https://csvkit.readthedocs.io/en/stable/scripts/csvsql.html
And I can run the terminal command in my script via this code:
import os
a=os.popen("csvsql -i postgresql Desktop/myFile.csv").read()
However, I have a dataframe that I have converted to a csv string, and I need to generate the schema from that string, like so:
csvstr = df.to_csv()
In the docs it says that under positional arguments:
The CSV file(s) to operate on. If omitted, will accept
input on STDIN
How do I pass my variable csvstr into the line of code a=os.popen("csvsql -i postgresql csvstr").read() as a variable?
I tried the line of code below, but got the error OSError: [Errno 7] Argument list too long: '/bin/sh':
a=os.popen("csvsql -i postgresql {}".format(csvstr)).read()
Thank you in advance
You can't pass such a large string as a command-line argument! You have to save the data to a file and pass its path to csvsql.
csvstr = df.to_csv()
# csvstr is already CSV-formatted text, so write it to the file as-is
with open('my_cool_df.csv', 'w', newline='') as csvfile:
    csvfile.write(csvstr)
And later:
a = os.popen("csvsql -i postgresql my_cool_df.csv").read()
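Alternatively, since the csvsql docs quoted above say it accepts input on STDIN when the file argument is omitted, you can pipe the CSV string into csvsql with subprocess instead of os.popen. A minimal sketch, assuming csvsql is on the PATH and the dataframe fits comfortably in memory:
import subprocess

# df.to_csv() includes the index by default; pass index=False if you don't want it in the schema
csvstr = df.to_csv()

# Feed the CSV text to csvsql's STDIN rather than passing it as a command-line argument
result = subprocess.run(
    ["csvsql", "-i", "postgresql"],
    input=csvstr,
    capture_output=True,
    text=True,
)
print(result.stdout)   # the generated CREATE TABLE statement
print(result.stderr)   # any warnings or errors from csvsql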

executing hive load command using hive -e '<hive command>'

I am trying to execute a hive command using Java code. My hive is installed on a Linux virtual machine and the Java code is on a remote Windows machine which acts as the client. I am able to successfully call hive commands like:
hive -e 'Select * from mytable;'
But when I tried the load command with this syntax:
hive -e 'LOAD DATA LOCAL INPATH '/home/mapr/file.csv' INTO TABLE mytable;'
It throws me an error saying "FAILED: ParseException line 1:23 mismatched input '/' expecting StringLiteral near 'INPATH' in load statement"
This seems to be a syntax error near the file path, probably a quoting or escaping issue, because I am able to execute "Select * from mytable" without error.
Can anyone help me with the syntax for hive load command using hive -e ?
Looking at your error message, it is clear that the single quotes around the file path are terminating the outer single-quoted string and breaking up your hive command.
Use single quotes for the outer statement and double quotes for the file path so they can be told apart, and it will work.
The corrected hive statement is given below:
hive -e 'LOAD DATA LOCAL INPATH "/home/mapr/file.csv" INTO TABLE mytable;'
Hope this helps!

D2RQ parameters for generate-mapping

We are currently working on a project involving an "ordinary" relational database, but we wish to enable SPARQL requests against this database.
d2rq.org is a tool that enables SPARQL to be run against the database with the help of a .ttl file which defines the database-to-RDF mapping.
This .ttl file can be built automatically with a D2RQ tool named "generate-mapping".
http://d2rq.org/generate-mapping takes quite a few arguments, some preceded with a single dash "-" and some with a double dash "--". My challenge is that any argument preceded with a double dash generates this error:
Command:
./generate-mapping -u root -p password -o testmappingLocal.ttl --verbose jdbc:mysql:///iswc
Result:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown argument: --verbose
at jena.cmdline.CommandLine.handleUnrecognizedArg(CommandLine.java:215)
at jena.cmdline.CommandLine.process(CommandLine.java:177)
at d2rq.generate_mapping.main(generate_mapping.java:41)
Any help with the double-dash arguments will be greatly appreciated.
OS: Ubuntu Linux, D2RQ version: 0.8
With D2RQ and a MySQL database you first generate the mapping file, then dump the RDF.
1) Generate the mapping file:
./generate-mapping -u root -p root -o /home/bigtapp/Documents/d2rqgenerate_mapping/mapfile.ttl jdbc:mysql://localhost:3306/d2rq
Note:
1. -u root -p root -> the MySQL username and password.
2. /home/bigtapp/Documents/d2rqgenerate_mapping/mapfile.ttl -> the output path for the mapping file.
3. jdbc:mysql://localhost:3306 -> the MySQL JDBC URL.
4. /d2rq -> the database name.
2) Create the RDF dump from the mapping file using dump-rdf.
The -f flag sets the RDF syntax to use for the output. Supported syntaxes are “TURTLE”, “RDF/XML”, “RDF/XML-ABBREV”, “N3”, and “N-TRIPLE” (the default); “N-TRIPLE” works best for large databases.
Command:
./dump-rdf -f RDF/XML -b localhost:3306 -o /home/bigtapp/Documents/d2rqgenerate_mapping/dumpfile.rdf /home/bigtapp/Documents/d2rqgenerate_mapping/mapfile.ttl
3) In apache-jena-fuseki, create a dataset, upload the RDF file to the server, and then run your SPARQL queries against it to get the results.
