Using a very simple-minded approach to read data, select a subset of it, and write it out, I'm getting a 'DataFrameWriter' object is not callable error.
I'm surely missing something basic.
Using an AWS EMR:
$ pyspark
> dx = spark.read.parquet("s3://my_folder/my_date*/*.gz.parquet")
> dx_sold = dx.filter("keywords like '%sold%'")
# select customer ids
> dc = dx_sold.select("agent_id")
Question
The goal is to now save the values of dc ... e.g. to s3 as a line-separated text file.
What's a best-practice to do so?
Attempts
I tried
dc.write("s3://my_folder/results/")
but received
TypeError: 'DataFrameWriter' object is not callable
Also tried
X = dc.collect()
but eventually received a TimeOut error message.
Also tried
dc.write.format("csv").options(delimiter=",").save("s3://my_folder/results/")
But eventually received messages of the form
TaskSetManager: Lost task 4323.0 in stage 9.0 (TID 88327, ip-<hidden>.internal, executor 96): TaskKilled (killed intentionally)
The first comment was correct: it was a filesystem (FS) problem.
The ad-hoc solution was to convert the desired results to a list and then serialize the list. E.g.
import pickle

# collect the distinct ids to the driver and pickle them locally
dc = dx_sold.select("agent_id").distinct()
result_list = [str(c) for c in dc.collect()]
pickle.dump(result_list, open(result_path, "wb"))
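For completeness, the column can also be written straight from Spark without collecting to the driver. A minimal sketch, using the same output path as in the question and assuming agent_id can be cast to string:
from pyspark.sql.functions import col

# .text() writes one value per line but requires a single string column;
# coalesce(1) yields a single output file at the cost of parallelism
dc = dx_sold.select(col("agent_id").cast("string")).distinct()
dc.coalesce(1).write.mode("overwrite").text("s3://my_folder/results/")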
Related
Continuation of Managing huge zip files in dataBricks
Databricks hangs after 30 files. What to do?
I have split a huge 32 GB zip into 100 stand-alone pieces. I've split the header from the file and can thus process the pieces like any CSV file. I need to filter the data based on columns. The files are in Azure Data Lake Storage Gen1 and must be stored there.
Trying to read a single file (or all 100 files) at once fails after running for ~30 minutes (see the linked question above).
What I've done:
import time
import pandas as pd

def lookup_csv(CR_nro, hlo_lista=[], output=my_output_dir):
    base_lib = 'adl://azuredatalakestore.net/<address>'
    # list the input files and skip the ones already done plus the header file
    all_files = pd.DataFrame(dbutils.fs.ls(base_lib + f'CR{CR_nro}'), columns=['full', 'name', 'size'])
    done = pd.DataFrame(dbutils.fs.ls(output), columns=['full', 'name', 'size'])
    all_files = all_files[~all_files['name'].isin(done['name'].str.replace('/', ''))]
    all_files = all_files[~all_files['name'].str.contains('header')]
    # read the schema from the separate header file
    my_schema = spark.read.csv(base_lib + f'CR{CR_nro}/header.csv', sep='\t', header=True, maxColumns=1000000).schema
    tmp_lst = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + \
              [i for i in hlo_lista if i in my_schema.fieldNames()]
    for my_file in all_files.iterrows():
        print(my_file[1]['name'], time.ctime(time.time()))
        data = spark.read.option('comment', '#').option('maxColumns', 1000000) \
                    .schema(my_schema).csv(my_file[1]['full'], sep='\t').select(tmp_lst)
        data.write.csv(output + my_file[1]['name'], header=True, sep='\t')
This works... kinda. It gets through ~30 files and then fails with
Py4JJavaError: An error occurred while calling o70690.csv.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 154.0 failed 4 times, most recent failure: Lost task 0.3 in stage 154.0 (TID 1435, 10.11.64.46, executor 7): com.microsoft.azure.datalake.store.ADLException: Error creating file <my_output_dir>CR03_pt29.vcf.gz/_started_1438828951154916601
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]
I tried to add some deletion and sleeps:
data.unpersist()
data = []
time.sleep(5)
I also tried wrapping the call in try/except with retries:
for j in range(1, 24):
    for i in range(4):
        try:
            lookup_csv(j, hlo_lista=FN_list, output=blake + f'<my_output>/CR{j}/')
        except Exception as e:
            print(i, j, e)
            time.sleep(60)
No luck with these. Once it fails, it keeps failing.
Any idea how to handle this issue? I'm thinking that the connection to the ADL store fails after a while, but if I queue the commands in separate cells:
lookup_csv(<inputs>)
<next cell>
lookup_csv(<inputs>)
it works, fails, and then the next cell works just fine. I can live with this, but it is highly annoying that a basic loop fails to work in this environment.
The best solution is to permanently mount the ADLS storage and use an Azure app registration for that.
In Azure, go to App registrations and register an app, named for example "databricks_mount". Then add the IAM role "Storage Blob Data Contributor" for that app on your data lake storage account.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<your-client-id>",
"fs.azure.account.oauth2.client.secret": "<your-secret>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your-endpoint>/oauth2/token"}
dbutils.fs.mount(
source = "abfss://delta#yourdatalake.dfs.core.windows.net/",
mount_point = "/mnt/delta",
extra_configs = configs)
You can also access the storage without mounting it, but you still need to register an app and apply its config via Spark settings in your notebook to get access to ADLS. The access should persist for the whole session thanks to the Azure app:
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"),
spark.conf.set("fs.azure.account.oauth2.client.id", "<your-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<your-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-endpoint>/oauth2/token")
This explanation is the best: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly, although I remember that I also had problems with it the first few times. That page also explains how to register the app. Maybe it will be OK with your company policies.
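Once the storage is mounted, the loop from the question can read and write through the mount point instead of the adl:// URLs. A minimal sketch, assuming the data is reachable under the mount above and reusing my_schema from the question's code (the paths are placeholders):
# sketch only: the '/mnt/delta/...' paths are placeholders
base_lib = '/mnt/delta/<address>/'
output = '/mnt/delta/<my_output_dir>/'

data = spark.read.option('comment', '#').schema(my_schema) \
            .csv(base_lib + 'CR3/<piece>.csv', sep='\t')
data.write.csv(output + '<piece>.csv', header=True, sep='\t')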
I'm working on data processing using Spark and Cassandra.
What I want to do is first read and load the data from Cassandra, process it, and write it back to Cassandra.
When Spark runs the map function, an error occurs: Row is read-only <class 'Exception'>
Here is my method:
def detect_image(image_attribute):
    image_id = image_attribute['image_id']
    image_url = image_attribute['image_url']
    if image_attribute['status'] is None:
        image_attribute['status'] = Status()
    image_attribute['status']['detect_count'] += 1
    ...  # the other item assignment
cassandra_data = sql_context.read.format("org.apache.spark.sql.cassandra") \
    .options(table="photo", keyspace="data").load()
cassandra_data_processed = cassandra_data.rdd.map(process_batch_image)
cassandra_data_processed.toDF().write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('overwrite') \
    .options(table="photo", keyspace="data") \
    .save()
The Row is read-only <class 'Exception'> error is raised at the lines
image_attribute['status'] = Status() and
image_attribute['status']['detect_count'] += 1
Is it necessary to copy image_attribute into a new object? image_attribute is a nested object, so it would be very hard to copy it layer by layer.
Your suggestion is absolutely right. The map function converts an incoming type to another type; that is at least the intention. The incoming object is immutable to make this operation idempotent. I guess there is no way around copying the image objects (manually or using something like deepcopy).
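A minimal PySpark sketch of what that copy could look like, assuming the rows come from the Cassandra read above (the status default shown is purely illustrative): turn the incoming Row into a dict, modify the dict, and build a new Row from it.
from pyspark.sql import Row

def detect_image(image_attribute):
    # Rows are immutable, so copy the fields into a plain dict first
    fields = image_attribute.asDict(recursive=True)
    if fields.get('status') is None:
        fields['status'] = {'detect_count': 0}  # illustrative default
    fields['status']['detect_count'] += 1
    # ... the other item assignments, applied to fields ...
    return Row(**fields)
Note that with recursive=True any nested Rows become plain dicts, so depending on the table schema the nested status may need to be rebuilt before writing back to Cassandra.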
Hope that helps
Can anyone explain what exactly Input, Output, Shuffle Read, and Shuffle Write mean in the Spark UI?
Also, can someone explain how the input in this job is only 25~30% of the shuffle write?
As per my understanding, shuffle write is the sum of temporary data that cannot be held in memory and data that needs to be sent to other executors during aggregation or reduction.
Code below:
hiveContext.sql("SELECT * FROM TABLE_NAME WHERE PARTITION_KEY = 'PARTITION_VALUE'")
.rdd
.map{case (row:Row)
=>((row.getString(0), row.getString(12)),
(row.getTimestamp(11), row.getTimestamp(11),
row))}
.filter{case((client, hash),(d1,d2,obj)) => (d1 !=null && d2 !=null)}
.reduceByKey{
case(x, y)=>
if(x._1.before(y._1)){
if(x._2.after(y._2))
(x)
else
(x._1, y._2, y._3)
}else{
if(x._2.after(y._2))
(y._1, x._2, x._3)
else
(y)
}
}.count()
Here ReadDailyFileDataObject is a case class which holds the row fields as a container.
The container is required because there are 30 columns, which exceeds the tuple limit of 22.
Updated code: I removed the case class, as I see the same issue when I use Row itself instead of the case class.
Currently I see
Task : 10/7772
Input : 2.1 GB
Shuffle Write : 14.6 GB
If it helps, I am trying to process a table stored as Parquet files, containing 21 billion rows.
Below are the parameters I am using:
"spark.yarn.am.memory" -> "10G"
"spark.yarn.am.cores" -> "5"
"spark.driver.cores" -> "5"
"spark.executor.cores" -> "10"
"spark.dynamicAllocation.enabled" -> "true"
"spark.yarn.containerLauncherMaxThreads" -> "120"
"spark.executor.memory" -> "30g"
"spark.driver.memory" -> "10g"
"spark.driver.maxResultSize" -> "9g"
"spark.serializer" -> "org.apache.spark.serializer.KryoSerializer"
"spark.kryoserializer.buffer" -> "10m"
"spark.kryoserializer.buffer.max" -> "2001m"
"spark.akka.frameSize" -> "2020"
SparkContext is registered as
new SparkContext("yarn-client", SPARK_SCALA_APP_NAME, sparkConf)
On YARN, I see
Allocated CPU VCores : 95
Allocated Memory : 309 GB
Running Containers : 10
The tooltips that appear when you hover your mouse over Input, Output, Shuffle Read, and Shuffle Write explain them quite well:
INPUT: Bytes and records read from Hadoop or from Spark storage.
OUTPUT: Bytes and records written to Hadoop.
SHUFFLE_WRITE: Bytes and records written to disk in order to be read by a shuffle in a future stage.
SHUFFLE_READ: Total shuffle bytes and records read (includes both data read locally and data read from remote executors).
In your situation, the 150.1 GB accounts for the input size of all 1409 finished tasks (i.e., the total size read from HDFS so far), and the 874 GB accounts for what all 1409 finished tasks have written to their nodes' local disks.
You can refer to What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? to understand the overall shuffle functionality well.
It's actually hard to provide an answer without the code, but it is possible that you are going through your data multiple times, so the total volume you are processing is actually "X" times your original data.
Can you post the code you are running?
EDIT
Looking at the code, I have had this kind of issue before, and it was due to the serialization of the Row, so this might be your case as well.
What is "ReadDailyFileDataObject"? Is it a class, a case class?
I would first try running your code like this:
hiveContext.sql("SELECT * FROM TABLE_NAME WHERE PARTITION_KEY = 'PARTITION_VALUE'")
.rdd
.map{case (row:Row)
=>((row.get(0).asInstanceOf[String], row.get(12).asInstanceOf[String]),
(row.get(11).asInstanceOf[Timestamp], row.get(11).asInstanceOf[Timestamp]))}
.filter{case((client, hash),(d1,d2)) => (d1 !=null && d2 !=null)}
.reduceByKey{
case(x, y)=>
if(x._1.before(y._1)){
if(x._2.after(y._2))
(x)
else
(x._1, y._2)
}else{
if(x._2.after(y._2))
(y._1, x._2)
else
(y)
}
}.count()
If that gets rid of your shuffling problem, then you can refactor it a little:
- Make it a case class, if it isn't already.
- Create it like "ReadDailyFileDataObject(row.getInt(0), row.getString(1), etc..)"
Hope this counts as an answer, and helps you find your bottleneck.
I built a Spark cluster.
workers:2
Cores:12
Memory: 32.0 GB Total, 20.0 GB Used
Each worker gets 1 cpu, 6 cores and 10.0 GB memory
My program gets its data from a MongoDB cluster. Spark and the MongoDB cluster are on the same LAN (1000 Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function
/*
 * rdd = sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
 * Apache-Spark-1.3.1 Scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
 */
def findValueByNameAndRange(rdd: RDD[(Object, BSONObject)], name: String, time: Date): RDD[BasicBSONObject] = {
  val nameRdd = rdd.map(arg => arg._2).filter(_.get("name").equals(name))
  val timeRangeRdd1 = nameRdd.map(tuple => (tuple, tuple.get("time").asInstanceOf[Date]))
  val timeRangeRdd2 = timeRangeRdd1.map(tuple => (tuple._1, duringTime(tuple._2, time, getHourAgo(time, 1))))
  val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
  val timeRangeRdd4 = timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
  if (timeRangeRdd4.isEmpty()) {
    return basicBSONRDD(name, time)
  } else {
    return timeRangeRdd4.map(tuple => {
      val bson = new BasicBSONObject()
      bson.put("name", tuple._1)
      bson.put("value", tuple._2 / 60)
      bson.put("time", time)
      bson
    })
  }
}
Here is part of the job information.
My program runs very slowly. Is this because of isEmpty and reduceByKey? If yes, how can I improve it? If not, why?
=======update ===
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may cost a lot of time; however, what it costs here is beyond my budget. How can I improve it, or is this a defect of Spark? With the same calculation and hardware, it takes just a few seconds if I use multiple Java threads.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the method that triggers a stage change/shuffle... in this case isEmpty. Why it's running slowly is not easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1) and verifies whether data was returned or not. So the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue... Click into isEmpty and see what more you can discern from there.
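For illustration (the question's code is Scala, but the RDD API is the same in PySpark), isEmpty amounts to roughly the following; this is a sketch, not the actual Spark source:
def is_empty(rdd):
    # an RDD with no partitions is trivially empty; otherwise try to
    # pull back a single record, which evaluates the lineage feeding it
    return rdd.getNumPartitions() == 0 or len(rdd.take(1)) == 0
That take(1) still has to run everything feeding timeRangeRdd4, including the reduceByKey shuffle, which is why the time shows up under isEmpty in the UI.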
... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well; I basically want this (expressed in regular SQL): DF_Column IN seq?
First I did it using a broadcast var (holding the seq), a UDF (that did the checking), and registerTempTable.
The problem is that I didn't get to test it since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.
I ended up creating a new DataFrame out of the seq and doing an inner join with it (an intersection), but I doubt that's the most performant way of accomplishing the task.
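Roughly like this, as a PySpark sketch just for illustration (my real code is Scala and the names differ; the columns match the example below):
# illustrative sketch of the join-based approach
seq_df = sqlContext.createDataFrame([("login2",), ("login3",), ("login4",)], ["username"])
result = ordered.join(seq_df, ordered["login"] == seq_df["username"], "inner") \
                .select("login", "count")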
Thanks
EDIT (in response to @YijieShen):
How do I filter based on whether elements of one DataFrame's column are in another DF's column (like SQL: select * from A where login in (select username from B))?
E.g:
First DF:
login count
login1 192
login2 146
login3 72
Second DF:
username
login2
login3
login4
The result:
login count
login2 146
login3 72
Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2
ordered.select("login").filter($"login".contains(empLogins("username")))
and
ordered.select("login").filter($"login" in empLogins("username"))
which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:
resolved attribute(s) username#10 missing from login#8 in operator
!Filter Contains(login#8, username#10);
and
resolved attribute(s) username#10 missing from login#8 in operator
!Filter login#8 IN (username#10);
My code (following the description of your first method) runs normally in Spark 1.4.0-SNAPSHOT on these two configurations:
Intellij IDEA's test
Spark Standalone cluster with 8 nodes (1 master, 7 worker)
Please check whether any differences exist.
val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")
val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)
val sqlfunc = udf(func)
val filtered = xdf.filter(sqlfunc(col("name")))
xdf.show()
filtered.show()
Output
name cnt
login1 192
login2 146
login3 72
name cnt
login3 72
You should broadcast a Set instead of an Array: membership lookups in a Set are much faster than a linear search.
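For illustration, a rough PySpark equivalent of that approach with a broadcast set (the answer's code is Scala; the data here is the same toy example):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# broadcast a set so each executor does O(1) membership checks
bc = sc.broadcast({"login3", "login4"})
in_seq = udf(lambda name: name in bc.value, BooleanType())

xdf = sqlContext.createDataFrame(
    [("login1", 192), ("login2", 146), ("login3", 72)], ["name", "cnt"])
xdf.filter(in_seq(col("name"))).show()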
You can make Eclipse run your Spark application. Here's how:
As pointed out on the mailing list, spark-sql assumes its classes are loaded by the primordial classloader. That's not the case in Eclipse, where the Java and Scala libraries are loaded as part of the boot classpath, while the user code and its dependencies are in another one. You can easily fix that in the launch configuration dialog:
remove Scala Library and Scala Compiler from the "Bootstrap" entries
add (as external jars) scala-reflect, scala-library and scala-compiler to the user entry.
The dialog should look like this:
Edit: The Spark bug was fixed and this workaround is no longer necessary (since v. 1.4.0)