PySpark textFile replace text - apache-spark

The following are a few rows from an example file which is ~30 GB:
### s3://mybucket/tmp/file_in.txt
"one"|"mike"|"456"|"2010-01-04"
"two"|"lisa"|"789"|"2011-03-08"
"three"|"ann"|"845"|"2012-06-11"
I'd like to use PySpark to...
read the text file using Spark's parallelism
replace the "n" character with "X"
output the updated text to a new text file with the same format
so the resulting file would look like this:
### s3://mybucket/tmp/file_out.txt
"oXe"|"mike"|"456"|"2010-01-04"
"two"|"lisa"|"789"|"2011-03-08"
"three"|"aXX"|"845"|"2012-06-11"
I have tried a variety of ways to achieve this seemingly simple task...
data = sc.textFile('s3://mybucket/tmp/file_in.txt')
def make_replacement(row):
    result = row.replace("n", "X")
    return result
out_data = data.map(make_replacement).collect()
#out_data = data.map(lambda line: make_replacement(line)).collect()
out_data.coalesce(1).write.format("text").option("header", "false").save("s3://mybucket/tmp/file_out.txt")
but I continue to see the following errors:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 21, <<my_server>>, executor 9): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application....
at org.apache.spark.api.python.VirtualEnvFactory.execCommand(VirtualEnvFactory.scala:120)
Note: solutions using read.csv or DataFrames will not be applicable to this problem
Any recommendations on how to solve this?

You can create an expression for each column and call the expressions in select:
from pyspark.sql import functions as F
df = spark.read.csv('s3://mybucket/tmp/file_in.txt', sep='|')
expr = [F.regexp_replace(F.col(column), pattern="n", replacement="X").alias(column) for column in df.columns]
df = df.select(expr)
df.write.format("csv").option("header", "false").option("sep", "|").save("s3://mybucket/tmp/file_out.txt")

If you do not need to do anything else with the data set, why are you even looking at Spark?
Use plain Python file read and write code to replace the character.
Sample code:
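A minimal sketch, assuming the file has first been copied to the local filesystem (plain Python's open() cannot read s3:// paths directly, and the file names here are placeholders):
# Stream the file line by line so the ~30 GB input never has to fit in memory.
with open("file_in.txt", "r") as src, open("file_out.txt", "w") as dst:
    for line in src:
        dst.write(line.replace("n", "X"))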

Related

Getting error while doing Standardization after Window Partitioning of Pyspark Dataframe

Dataframe:
Above is my dataframe. I want to add a new column with value 1 if the first transaction_date for an item is after 01.01.2022, else 0.
To do this I use the below Window.partitionBy code:
from pyspark.sql import functions as f
from pyspark.sql.functions import row_number, when
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("article_id").orderBy("transaction_date")
feature_grid = feature_grid.withColumn("row_number", row_number().over(windowSpec)) \
    .withColumn('new_item',
                when((f.col('row_number') == 1) & (f.col('transaction_date') >= '01.01.2022'), 1).otherwise(0)) \
    .drop('row_number')
I want to perform clustering on the dataframe, for which I am using VectorAssembler with the below code:
from pyspark.ml.feature import VectorAssembler
input_cols = feature_grid.columns
assemble=VectorAssembler(inputCols= input_cols, outputCol='features')
assembled_data=assemble.transform(feature_grid)
For standardisation:
from pyspark.ml.feature import StandardScaler
scale=StandardScaler(inputCol='features',outputCol='standardized')
data_scale=scale.fit(assembled_data)
data_scale_output=data_scale.transform(assembled_data)
display(data_scale_output)
The standardisation code chunk gives me the below error, but only when I use the above partitioning method; without that partitioning method, the code works fine.
Error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 182.0 failed 4 times, most recent failure: Lost task
0.3 in stage 182.0 (TID 3635) (10.205.234.124 executor 1): org.apache.spark.SparkException: Failed to execute user defined
function (VectorAssembler$$Lambda$3621/907379691
Can someone tell me what I am doing wrong here, or what the cause of the error is?
This error is triggered by null values in the columns that are assembled by the Spark VectorAssembler. Please fill the null columns before transforming your dataframe.
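For example, you can fill the nulls before assembling, or tell VectorAssembler how to handle invalid rows; a minimal sketch, assuming the columns in input_cols are numeric:
# Replace nulls in the feature columns with 0 before assembling.
feature_grid = feature_grid.fillna(0, subset=input_cols)
# Alternatively, let VectorAssembler skip (or keep) rows with invalid values
# (the handleInvalid parameter requires a newer Spark version, 2.4+).
assemble = VectorAssembler(inputCols=input_cols, outputCol='features', handleInvalid='skip')
assembled_data = assemble.transform(feature_grid)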

Why can't I read these dataframes

I'm having trouble reading several dataframes. I have this function:
def readDF(hdfsPath:String, more arguments): DataFrame = {//function goes here}
It takes an HDFS path for a partition and returns a dataframe (it basically uses spark.read.parquet, but I have to use it). I'm trying to read several of them using show partitions in the following fashion:
val dfs = spark.sql("show partitions table")
.where(col("partition").contains(someFilterCriteria))
.map(partition => {
val hdfsPath = s"hdfs/path/to/table/$partition"
readDF(hdfsPath)
}).reduce(_.union(_))
but it gives me this error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 3.0 failed 4 times, most recent failure: Lost task 12.3 in stage 3.0 (TID 44, csmlcsworki0021.unix.aacc.corp, executor 1): java.lang.NullPointerException
I think it's because I'm doing spark.read.parquet inside a map operation on a dataframe, because if I change my code to this one:
val dfs = spark.sql("show partitions table")
.where(col("partition").contains(someFilterCriteria))
.map(row=> row.getString(0))
.collect
.toSeq
.map(partition => {
val hdfsPath = s"hdfs/path/to/table/$partition"
readDF(hdfsPath)
}).reduce(_.union(_))
it loads the data correctly. However, I don't want to use collect if possible. How can I achieve my purpose?
readDF creates a dataframe from parquet files in HDFS, so it must be executed on the driver side. The first version, in which you execute it using a map function over the rows of the original dataframe, suggests you're trying to create a DataFrame in the executors, and this is not feasible.
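The same driver-side pattern as a rough PySpark sketch, where readDF, someFilterCriteria, and the paths stand in for the corresponding pieces of your code:
# Collect the small list of partition names on the driver, then build and union
# the dataframes there; only the work on the resulting DataFrame runs on executors.
from functools import reduce
from pyspark.sql.functions import col

partitions = [row[0] for row in spark.sql("show partitions table")
              .where(col("partition").contains(someFilterCriteria))
              .collect()]
dfs = reduce(lambda a, b: a.union(b),
             [readDF("hdfs/path/to/table/" + p) for p in partitions])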

Not able to use JohnSnowLabs pretrained model in Zeppelin

I want to use the JohnSnowLabs pretrained spell check module in my Zeppelin notebook. As mentioned here I have added com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3 to the Zeppelin dependency section as shown below:
However, when I try to run the following simple code
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.Finisher
val df = Seq("tiolt cde", "eefg efa efb").toDF("names")
val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))
df.transform(df => nlpPipeline.fit(df).transform(df)).show(false)
it gives an error as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, xxx.xxx.xxx.xxx, executor 0): java.io.FileNotFoundException: File file:/root/cache_pretrained/spell_fast_en_1.6.2_2_1534781328404/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
...
How can I add this JohnSnowLabs spell check pretrained model in Zeppelin? The above code works when run directly in the Spark shell.
Whenever you have a problem with the automatic download of pre-trained models/pipelines due to your environment, you can always load them manually.
Here is an example for loading a French model (same concept for any other annotator):
val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
.setInputCols("document", "token")
.setOutputCol("pos")
Source:
https://nlp.johnsnowlabs.com/docs/en/models

ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext when using Zeppelin input value inside spark DataFrame's filter method

I've been having trouble for two days already and can't find any solution.
I'm getting
ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext
when using an input value inside a Spark DataFrame's filter method.
val city = z.select("City",cities).toString
oDF.select("city").filter(r => city.equals(r.getAs[String]("city"))).count()
I even tried copying the input value to another val with
new String(bytes[])
but still get the same error.
The same code works seamlessly if, instead of getting the value from z.select,
I declare it as a String literal:
val city: String = "NY"
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 49.0 failed 4 times, most recent failure: Lost task 0.3 in stage
49.0 (TID 277, 10.6.60.217): java.lang.NoClassDefFoundError:
Lorg/apache/zeppelin/spark/ZeppelinContext;
You are taking this in the wrong direction:
val city = "NY"
gives you a Scala String with NY as the value, but when you say
z.select("City", cities)
this returns a dataFrame object, and you are converting that object to a String with the toString method and then trying to compare. This won't work!
What you can do is either collect one DF and then pass the resulting Scala String into the other DF accordingly, or you can do a join if you want to do it for multiple values.
But your current approach will not work for sure!

Apache SPARK with SQLContext:: IndexError

I am trying to execute a basic example provided in the 'Inferring the Schema Using Reflection' section of the Apache Spark documentation.
I'm doing this on the Cloudera Quickstart VM (CDH5).
The example I'm trying to execute is as below:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
print(teenName)
I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).
The input file book6_sample is available at
book6_sample.csv.
Please suggest pointers on where I'm going wrong.
Thanks in advance.
Regards,
Sri
Your file has one empty line at the end, which is causing this error. Open your file in a text editor and remove that line; hopefully it will work.
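Alternatively, you can filter out empty (or otherwise malformed) lines in the code itself instead of editing the file; a minimal sketch against the same file, reusing the Row import from the question:
# Drop blank lines and lines without both fields before building the Rows.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(",")).filter(lambda p: len(p) >= 2 and p[0].strip() != "")
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))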
