module 'pyspark_csv' has no attribute 'csvToDataframe'

I am new to Spark and am facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion, but it gives an error saying "module 'pyspark_csv' has no attribute 'csvToDataframe'".
Here is my code:
import findspark
findspark.init()
findspark.find()
import pyspark
sc=pyspark.SparkContext(appName="myAppName")
sqlCtx = pyspark.SQLContext
#csv to dataframe
sc.addPyFile('/usr/spark-1.5.0/python/pyspark_csv.py')
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv
#skipping the header
def skip_header(idx, iterator):
    if idx == 0:
        next(iterator)
    return iterator
#loading the dataset
data=sc.textFile('gdeltdata/20160427.CSV')
data_header = data.first()
data_body = data.mapPartitionsWithIndex(skip_header)
data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError Traceback (most recent call last)
<ipython-input-10-8e47cd9759e6> in <module>()
----> 1 data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError: module 'pyspark_csv' has no attribute 'csvToDataframe'

As mentioned in https://github.com/seahboonsiew/pyspark-csv, please try using the following name:
csvToDataFrame
i.e. with a capital 'F' in 'Frame' rather than a lowercase 'f'.
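For reference, a minimal corrected sketch, with two hedges: it assumes csvToDataFrame's keyword arguments as documented in that repository, and it instantiates SQLContext, since the code above binds the class itself to sqlCtx (and later refers to a lowercase sqlctx, which would be a NameError in its own right):

import findspark
findspark.init()
import pyspark

sc = pyspark.SparkContext(appName="myAppName")
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv

sqlCtx = pyspark.SQLContext(sc)  # instantiate; pyspark.SQLContext alone is just the class

def skip_header(idx, iterator):
    # skip the header row in the first partition only
    if idx == 0:
        next(iterator)
    return iterator

data = sc.textFile('gdeltdata/20160427.CSV')
data_header = data.first()
data_body = data.mapPartitionsWithIndex(skip_header)

# note the capital 'F' in csvToDataFrame
data_df = pycsv.csvToDataFrame(sqlCtx, data_body, sep=",", columns=data_header.split('\t'))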

Related

Writing a dictionary of Spark data frames to S3 bucket

Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket? The purpose of this is to read these PySpark data frames and then convert them into pandas data frames. Below is some code and the errors I get:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df1 = rdd.toDF()
df1.printSchema()
columns = ["language","users_count"]
data = [("C", "2000"), ("Java", "10000"), ("Lisp", "300")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df2 = rdd.toDF()
df2.printSchema()
spark_dict = {df1: '1', df2: '2'}
import boto3
import pickle
s3_resource = boto3.resource('s3')
bucket='test'
key='pickle_list.pkl'
pickle_byte_obj = pickle.dumps(spark_dict)
try:
    s3_resource.Object(bucket, key).put(Body=pickle_byte_obj)
except:
    print("Error in writing to S3 bucket")
with this error:
An error was encountered:
can't pickle _thread.RLock objects
Traceback (most recent call last):
TypeError: can't pickle _thread.RLock objects
I also tried dumping the dictionary of PySpark data frames to a JSON file:
import json
flatten_dfs_json = json.dumps(spark_dict)
and got this error:
An error was encountered:
Object of type DataFrame is not JSON serializable
Traceback (most recent call last):
File "/usr/lib64/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib64/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib64/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib64/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type DataFrame is not JSON serializable
Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket?
Yes (you might need to configure an access key and secret key):
df.write.format('json').save('s3a://bucket-name/path')
The purpose of this is to read these PySpark data frames and then convert them into pandas data frames.
My 2 cents: this sounds wrong to me. You don't have to convert the data to pandas; that defeats the purpose of using Spark in the first place.
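A minimal sketch of that approach, reusing df1, df2, and spark from the question, under two assumptions: the dictionary uses string keys (a DataFrame object makes a poor dict key), and bucket-name/frames is a placeholder path. Pickling fails above because a PySpark DataFrame holds references to the live SparkContext (hence the _thread.RLock error), so the data has to be written out through Spark itself:

spark_dict = {'df1': df1, 'df2': df2}  # string keys instead of DataFrame keys

for name, df in spark_dict.items():
    # each frame lands under its own prefix; s3a credentials must be configured
    df.write.format('json').save(f's3a://bucket-name/frames/{name}')

# Later, read one back and convert to pandas (only sensible for small data):
pdf = spark.read.json('s3a://bucket-name/frames/df1').toPandas()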

'SparkSession' object has no attribute 'textFile'

I am currently using SparkSession and was told that the SparkContext is available within the SparkSession. However, when running the code, it shows an error that textFile does not exist on SparkSession.
Below is the code that I have written:
import findspark
findspark.init()
from pyspark.sql import SparkSession, Row
import collections
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file://C:/temp").appName("SparkSQL").getOrCreate()
lines = spark.textFile('C:/Users/file.xslx')
The error is as follows:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_59944/722806425.py in <module>
----> 1 lines = spark.textFile('C:/Users/samue/bt4221_spark/exercise/week5/customer-orders.xslx')
AttributeError: 'SparkSession' object has no attribute 'textFile'
My current versions:
findspark: 1.4.2
pyspark: 3.0.3
I don't think it's related to any version issue. Any help is greatly appreciated! :)
textFile is present in the SparkContext class, not in SparkSession. Access it through the session's underlying context:
spark.sparkContext.textFile('filepath')
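A minimal self-contained sketch of the fix (the path is a placeholder; note also that textFile reads line-oriented plain text, so a binary Excel .xslx/.xlsx workbook will not parse meaningfully this way):

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# the RDD API lives on the SparkContext that the session wraps
lines = spark.sparkContext.textFile('C:/Users/file.txt')
print(lines.count())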

module 'pandas' has no attribute 'series'

I got this error while running this code:
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array(my_list)
d = {'a':10,'b':20,'c':30}
pd.series(data = my_list)
Full error message:
AttributeError Traceback (most recent call last)
<ipython-input-10-494578c29940> in <module>
----> 1 pd.series(data = my_list)
F:\New folder (8)\lib\site-packages\pandas\__init__.py in __getattr__(name)
260 return _SparseArray
261
--> 262 raise AttributeError(f"module 'pandas' has no attribute '{name}'")
263
264
AttributeError: module 'pandas' has no attribute 'series'
Series is a pandas class, so it starts with a capital letter. The following should work:
pd.Series(data = my_list)
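For completeness, a quick self-contained check using the question's data (expected output shown in comments):

import pandas as pd

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]

s = pd.Series(data=my_list, index=labels)
print(s)
# a    10
# b    20
# c    30
# dtype: int64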

IndexError: list index out of range in Python 3.7

I have written the code below in Python 3.7 (64-bit) and Anaconda 1.9.7.
import pandas as pd
import re
import glob
import os
data = data.join(data.pop('Serial Number')
                     .str.strip(',')
                     .str.split(',', expand=True)
                     .stack()
                     .reset_index(level=1, drop=True)
                     .rename('Serial Number')).reset_index(drop=True)
data['Serial Number']
After running this file I got the error message below:
Traceback (most recent call last):
File "servicereport.py", line 70, in <module>
.str.split(',', expand=True)
IndexError: list index out of range
I am not sure what exactly is missing here, as this code runs fine on the system on which it was created.
How can I fix this error?
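For reference, a minimal reproduction of what this chain is meant to do, with a hypothetical DataFrame standing in for the script's actual data (the loading code for the real file is not shown above, and the IndexError cannot be diagnosed from this snippet alone, since data is never defined in it):

import pandas as pd

# hypothetical stand-in for the script's real data
data = pd.DataFrame({'ID': [1, 2],
                     'Serial Number': ['a,b,c', 'd']})

# split the comma-separated serials into one row per serial
data = data.join(data.pop('Serial Number')
                     .str.strip(',')
                     .str.split(',', expand=True)
                     .stack()
                     .reset_index(level=1, drop=True)
                     .rename('Serial Number')).reset_index(drop=True)
print(data)
#    ID Serial Number
# 0   1             a
# 1   1             b
# 2   1             c
# 3   2             d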

Cannot from pandas import Dataframe

from pandas import Dataframe
ImportError Traceback (most recent call last)
in ()
----> 1 from pandas import Dataframe
ImportError: cannot import name 'Dataframe'
I understand there are workarounds, but I need to do this for an assignment. I am using Jupyter with Python 3.6.
Thanks in advance.
from pandas import DataFrame
Notice the capitalization: DataFrame, not Dataframe.
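A quick self-contained check (the sample data is made up for illustration):

from pandas import DataFrame

# DataFrame, with a capital D and a capital F
df = DataFrame({'language': ['Java', 'Python'], 'users_count': [20000, 100000]})
print(df)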
