NameError: name 'split' is not defined with Spark - apache-spark

I have been working on a big dataset with Spark. Last week the following lines of code ran perfectly; now they throw an error: NameError: name 'split' is not defined. Can somebody explain why this is not working and what I should do? Should I define the method myself? Is it a dependency that I should import? The documentation doesn't say I have to import anything in order to use the split method. The code is below.
test_df = spark_df.withColumn(
    "Keywords",
    split(col("Keywords"), "\\|")
)

You can use pyspark.sql.functions.split(), but you first need to import this function (along with col, which your snippet also uses):
from pyspark.sql.functions import split, col
It's better to explicitly import just the functions you need; do not do from pyspark.sql.functions import *.
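For reference, a minimal self-contained sketch of the fixed snippet (the DataFrame contents here are made up to make it runnable; the column name and split pattern are from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()

# Stand-in for the question's spark_df: one pipe-delimited string column
spark_df = spark.createDataFrame([("spark|python|etl",)], ["Keywords"])

# Split the pipe-delimited string into an array of keywords
test_df = spark_df.withColumn("Keywords", split(col("Keywords"), "\\|"))
test_df.show(truncate=False)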

Related

Pandas Is Not Reading_csv Raw Data When Names Are Defined in a Second Line

I just started my first IRIS FLOWER project based on your example. After completing two projects, I will move on to the next step, statistical and deep learning. Of course, before that I will get your book and study it.
However, I ran into an error in my first project. The problem is I couldn't load/read the data, either online or from my local computer. My computer is equipped with all the necessary modules (see the attachment).
I applied the same procedure you illustrated in your example. My system reads the data only when I remove the name definitions from the second line, which is names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'].
When I delete the name definitions from the code, pandas read_csv reads the file directly, both online and from my local computer, but the retrieved data has no headings (fields) at the top.
When I try to read the data with the name definitions in the second line, it gives the following error message:
NameError: name 'pandas' is not defined
How can I deal with this problem?
#Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
print(dataset)
I'm guessing that you put import pandas as pd in your imports. Use pd.read_csv() instead. If you didn't import pandas at all, then you need to import it at the top of your Python file with import pandas or import pandas as pd (which is what pretty much everyone uses).
Otherwise, your code looks fine.
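With the conventional alias, the question's snippet with the import fixed would look like:
import pandas as pd

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)
print(dataset)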

Using relative import without from

Instead of:
from .model import Foo, Bar
I would like to:
import .model
This raises a syntax error. Is there a way to do it?
Anything following a plain import keyword must be a module name made of valid Python identifiers, because it will be added to your scope under that same name; a leading dot is only valid in the from ... import ... form.
Instead, do the following:
from . import model
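A minimal sketch showing both working forms inside a package (the module and class names are from the question):
# inside another module of the same package
from . import model             # binds the sibling module as `model`
from .model import Foo, Bar     # or import specific names directly

f = model.Foo()                 # both styles now work
b = Bar()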

Spark deep learning Import error

I am trying to replicate a deep learning project from https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-f654a8b876b8. I am working on Spark version 1.6.3 and have installed Keras and TensorFlow, but every time I try to import from sparkdl it throws an error. I am working in PySpark. When I run this:
from sparkdl import readImages
I get this error:
File "C:\Users\HP\AppData\Local\Temp\spark-802a2258-3089-4ad7-b8cb-
6815cbbb019a\userFiles-c9514201-07fa-45f9-9fd8-
c8a3a0b4bf70\databricks_spark-deep-learning-0.1.0-spark2.1-
s_2.11.jar\sparkdl\transformers\keras_image.py", line 20, in <module>
ImportError: cannot import name 'TypeConverters'
Can someone please help?
It's not a full fix, as I haven't been able to import things from sparkdl in Jupyter notebooks either, but:
readImages is a function in the pyspark.ml.image package, so to import it you need:
from pyspark.ml.image import ImageSchema
and to use it:
imagesDF = ImageSchema.readImages("/path/to/imageFolder")
This will give you a DataFrame of the images, with an "image" column.
You can add a label column like this:
labeledImageDF = imagesDF.withColumn("label", lit(0))
but remember to import lit from pyspark.sql.functions to use it:
from pyspark.sql.functions import lit
Hope this at least partially helps.
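Putting those pieces together, a minimal end-to-end sketch (assuming a Spark 2.3/2.4-era PySpark, where ImageSchema.readImages is available; the image folder path is a placeholder):
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Read a folder of images into a DataFrame with an "image" struct column
imagesDF = ImageSchema.readImages("/path/to/imageFolder")

# Attach a constant label for this class of images
labeledImageDF = imagesDF.withColumn("label", lit(0))
labeledImageDF.printSchema()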

Attribute Error: Function object has no attribute

I see there are many questions about this error, but I still don't understand why it occurs.
I imported pandas and NumPy, then read my file using pd.read_excel, and viewed the head of the file using .head().
After I sliced my data, the .head() method still worked fine. But now it suddenly throws an AttributeError, which goes away once I re-import my file; then after some time it gives me the same error again. What am I doing wrong? I don't clearly understand this error.
import pandas as pd
import numpy as np
sales = pd.read_excel('SALESC.xlsx', header=0)
sales.isnull().sum()
sales["Date"] = pd.to_datetime(sales['Date of document'])
sales = sales[pd.notnull(sales['Quantity sold']) & pd.notnull(sales['Unit selling price including tax'])]
sales = sales.iloc[:,[3,6,8,9,10,11,19,35,39]]
sales.head(5)
Can someone explain the problem and how to resolve it? Thanks in advance.
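For illustration, one common way to hit this kind of error (an assumption here, since the question doesn't show the failing line) is rebinding the DataFrame name to the method itself instead of calling it:
import pandas as pd

sales = pd.DataFrame({'Quantity sold': [1, 2, 3]})
sales.head()       # fine: calls the method

sales = sales.head     # rebinds the name to the bound method, not its result
sales.head(5)          # AttributeError: 'method' object has no attribute 'head'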

How to import .dta via pandas and describe data?

I am new to Python and have a simple problem. In a first step, I want to load some sample data I created in Stata. In a second step, I would like to describe the data in Python; that is, I'd like a list of the imported variable names. So far I've done this:
from pandas.io.stata import StataReader
reader = StataReader('sample_data.dta')
data = reader.data()
dir()
I get the following warning:
anaconda/lib/python3.5/site-packages/pandas/io/stata.py:1375: UserWarning: 'data' is deprecated, use 'read' instead
warnings.warn("'data' is deprecated, use 'read' instead")
What does it mean and how can I resolve the issue? And, is dir() the right way to get an understanding of what variables I have in the data?
Using pandas.io.stata.StataReader.data to read from a Stata file was deprecated in pandas version 0.18.1, hence the warning you are getting.
Instead, use pandas.read_stata to read the file, as shown:
import pandas as pd

df = pd.read_stata('sample_data.dta')
df.dtypes  # return the dtypes in this object
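As for dir(): it lists the names in your current scope, not the variables inside the dataset, so inspecting the DataFrame itself is more direct. For example:
print(df.columns.tolist())  # the imported variable names
df.info()                   # names, dtypes, and non-null counts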
Sometimes this did not work for me, especially when the dataset is large, so here is a two-step alternative (Stata, then Python).
In Stata, write the following commands:
export excel Cevdet.xlsx, firstrow(variables)
and to copy the variable labels, write the following:
describe, replace
list
export excel using myfile.xlsx, replace first(var)
restore
This will generate two files for you, Cevdet.xlsx and myfile.xlsx.
Now go to your Jupyter notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('Cevdet.xlsx')
This reads the exported file into Jupyter (Python 3); myfile.xlsx can be read the same way.
My advice is to save this data file, especially if it is big:
df.to_pickle('Cevdet')
The next time you open Jupyter you can simply run:
df = pd.read_pickle("Cevdet")
