How to read data from a file with double delimiter on spark - apache-spark

Could someone please help with how to handle this case?
PySpark code:
from pyspark.sql import SparkSession, types
spark = SparkSession.builder.master("local").appName('read csv').getOrCreate()
sc = spark.sparkContext
df = spark.read.option('delimiter', '~~').csv('filename')
# Error:
# java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ~~

I have come across a similar issue. Please try the code below and see if that works. Feel free to adjust it based on your data format.
# PySpark code:
from pyspark.sql import SparkSession, types
spark = SparkSession.builder.master("local").appName('read csv').getOrCreate()
sc = spark.sparkContext
# df = spark.read.option('delimiter', '~~').csv('filename')  # fails on Spark 2.x: multi-character delimiter
df = spark.read.text('filename')
header = df.first()[0]
schema = header.split('~~')
df_input = df.filter(df['value'] != header).rdd.map(lambda x: x[0].split('~~')).toDF(schema)
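Side note: if I remember correctly, Spark 3.0 added support for multi-character delimiters in the CSV reader, so on newer versions the file can be read directly without the workaround above. A minimal sketch, assuming the same '~~' delimiter and a placeholder file name:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName('read csv').getOrCreate()
# Spark 3.0+ accepts a multi-character separator directly in the CSV reader.
df = spark.read.option('sep', '~~').option('header', True).csv('filename')
df.show()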

Related

context.py:79: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead

I use PySpark on my system.
I got the warning: context.py:79: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
My script:
scSpark = SparkSession.builder.config("spark.driver.extraClassPath", "./mysql-connector-java-8.0.29.jar").getOrCreate()
sqlContext = SQLContext(scSpark)
jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(hostname, jdbcPort, dbname)
connectionProperties = {
    "user": username,
    "password": password
}
#df=scSpark.read.jdbc(url=jdbc_url, table='bms_title', properties= connectionProperties)
#df.show()
df = scSpark.read.csv(data_file, header=True, sep=",", encoding='UTF-8').cache()
df2 = df.first()
df = df.exceptAll(scSpark.createDataFrame([df2]))
df.createTempView("books")
output = scSpark.sql('SELECT `Postgraduate Course` AS Postgraduate_Course FROM books')
Why do I get this warning when I have already used SparkSession.builder.getOrCreate()?
How can I correct it?
Try to change
sqlContext = SQLContext(scSpark)
to
sqlContext = scSpark.sparkContext
or even
sc = scSpark.sparkContext
SQLContext is deprecated. You can find more details here: Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
Try to change this one:
scSpark = SparkSession.builder.config("spark.driver.extraClassPath",
"./mysql-connector-java-8.0.29.jar").getOrCreate()
to this:
scSpark = SparkSession.builder.config("spark.driver.extraClassPath",
"./mysql-connector-java-8.0.29.jar").enableHiveSupport().getOrCreate()
.enableHiveSupport() should fix it.
This also happened to me when I had toPandas in my code to convert the PySpark DataFrame to a pandas DataFrame. In that case I loaded the data as pandas from the beginning instead of converting later.
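For completeness, a minimal sketch of dropping SQLContext entirely and running SQL through the SparkSession itself, which avoids the deprecated code path (the query and temp view come from the question; the CSV path is a placeholder):
from pyspark.sql import SparkSession
# SQLContext is deprecated since Spark 3.0; the SparkSession covers the same functionality.
scSpark = SparkSession.builder.config("spark.driver.extraClassPath", "./mysql-connector-java-8.0.29.jar").getOrCreate()
df = scSpark.read.csv('data_file.csv', header=True, sep=",", encoding='UTF-8').cache()
df.createOrReplaceTempView("books")
# Run SQL directly on the session instead of on a SQLContext.
output = scSpark.sql('SELECT `Postgraduate Course` AS Postgraduate_Course FROM books')
output.show()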

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and PySpark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
The environment details are as follows:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
.option('dataset', query_job.destination.dataset_id) \
.option('table', query_job.destination.table_id)\
.option("type", "direct")\
.load()
df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code.
Install the pandas_gbq package in your Python environment before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)
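If you would rather keep the direct connector read from the original question, one thing worth trying is letting Spark pull in the BigQuery connector via spark.jars.packages. The Maven coordinate and version below are assumptions; pick the artifact matching your Spark and Scala versions from the spark-bigquery-connector releases, and note the table id is a placeholder:
from pyspark.sql import SparkSession
# The connector coordinate/version here is an assumption; adjust it to your Spark/Scala versions.
spark = SparkSession.builder \
    .appName('Query Results') \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3") \
    .getOrCreate()
# Read a BigQuery table directly through the connector.
df = spark.read.format('bigquery').option('table', 'your-project.your_dataset.your_table').load()
df.show()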

I am able to connect to the Hive database using PySpark, but when I run my program no data is showing

I have written the code below to read data from a Hive table; when I run it there are no compilation errors, but no data is displayed.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars hive-jdbc-2.1.0.jar pyspark-shell'
sparkConf = SparkConf().setAppName("App")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
source_df = hiveContext.read.format('jdbc').options(
    url='jdbc:hive2://localhost:10000/sample',
    driver='org.apache.hive.jdbc.HiveDriver',
    dbtable='abc',
    user='root',
    password='root').load()
source_df.show()
When I run this, I get the output below and am not able to fetch the data from the table.
+--------+------+
|abc.name|abc.id|
+--------+------+
+--------+------+
Just try
df = hiveContext.read.table("your_hive_table")           # reads from the default db
df = hiveContext.read.table("your_db.your_hive_table")   # reads from your db
You could also do
df = hiveContext.sql("select * from your_table")
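On newer Spark versions the same idea can be expressed with a Hive-enabled SparkSession instead of HiveContext; a minimal sketch, assuming the database and table names from the question:
from pyspark.sql import SparkSession
# A Hive-enabled session reads the table through the metastore, so no JDBC driver is needed.
spark = SparkSession.builder.appName("App").enableHiveSupport().getOrCreate()
df = spark.table("sample.abc")   # or spark.sql("select * from sample.abc")
df.show()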

AttributeError: 'StructField' object has no attribute '_get_object_id': with loading parquet file with custom schema

I am trying to read a group of Parquet files using PySpark with a custom schema, but it fails with AttributeError: 'StructField' object has no attribute '_get_object_id'.
Here is my sample code:
import pyspark
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as func
from pyspark.sql.types import *
sc = pyspark.SparkContext()
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
l = [('1',31200,'Execute',140,'ABC'),('2',31201,'Execute',140,'ABC'),('3',31202,'Execute',142,'ABC'),
('4',31103,'Execute',149,'DEF'),('5',31204,'Execute',145,'DEF'),('6',31205,'Execute',149,'DEF')]
rdd = sc.parallelize(l)
trades = rdd.map(lambda x: Row(global_order_id=int(x[0]), nanos=int(x[1]),message_type=x[2], price=int(x[3]),symbol=x[4]))
trades_df = sqlContext.createDataFrame(trades)
trades_df.printSchema()
trades_df.write.parquet('trades_parquet')
trades_df_Parquet = sqlContext.read.parquet('trades_parquet')
trades_df_Parquet.printSchema()
# The schema is encoded in a string.
schemaString = "global_order_id message_type nanos price symbol"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
trades_df_Parquet_n = spark.read.format('parquet').load('trades_parquet',schema,inferSchema =False)
#trades_df_Parquet_n = spark.read.parquet('trades_parquet',schema)
trades_df_Parquet_n.printSchema()
Can anyone please help me with a suggestion?
Specify the schema argument by name so it is not taken as the positional format argument:
Signature: trades_df_Parquet_n.load(path=None, format=None, schema=None, **options)
You get:
trades_df_Parquet_n = spark.read.format('parquet').load('trades_parquet', schema=schema, inferSchema=False)
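Equivalently, the schema can be attached with DataFrameReader.schema before the load; a minimal sketch reusing the schema built in the question:
from pyspark.sql.types import StructType, StructField, StringType
schemaString = "global_order_id message_type nanos price symbol"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Passing the schema this way also skips schema inference for the read.
trades_df_Parquet_n = spark.read.schema(schema).parquet('trades_parquet')
trades_df_Parquet_n.printSchema()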

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official documentation website:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
However, the example above wouldn't run and gave me the following errors:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-aaffcd1239c9> in <module>()
1 from pyspark import *
2 data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
----> 3 df = spark.createDataFrame(data, ["features"])
4 kmeans = KMeans(k=2, seed=1)
5 model = kmeans.fit(df)
NameError: name 'spark' is not defined
What additional configuration/variable needs to be set to get the example running?
You can add
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
to the beginning of your code to define a SparkSession; then spark.createDataFrame() should work.
The answer by ηŽ‡ζ€€δΈ€ is good and will work the first time.
But the second time you try it, it will throw the following exception:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at <ipython-input-3-786525f7559f>:10
There are two ways to avoid it.
1) Using SparkContext.getOrCreate() instead of SparkContext():
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
2) Calling sc.stop() at the end, or before you start another SparkContext.
Since you are calling createDataFrame(), you need to do this:
df = sqlContext.createDataFrame(data, ["features"])
instead of this:
df = spark.createDataFrame(data, ["features"])
Here spark stands in for the sqlContext.
In general, some people have it as sc, so if that didn't work, you could try:
df = sc.createDataFrame(data, ["features"])
If you are using Python, you can import spark as follows; it will create a Spark session. Keep in mind this is an older approach, though it still works.
from pyspark.shell import spark
If it raises an error about another open session, do this:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
scraped_data = spark.read.json("/Users/reihaneh/Desktop/nov3_final_tst1/")
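For reference, since Spark 2.0 the standard way to obtain a session in a standalone script is SparkSession.builder.getOrCreate(); a minimal sketch of the original ml example under that pattern:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
# getOrCreate() returns the existing session if one is already running, so it is safe to call repeatedly.
spark = SparkSession.builder.master("local").appName("kmeans-example").getOrCreate()
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),), (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)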
