How to extract a Dataset content n rows by n rows? - apache-spark

I have to output the results of a Dataset into a PostGIS (spatial) database. Spark doesn't handle it, and I had to write specific code that cannot be serialized. This means I can't use the dataset.foreach(...) method, and I have to execute my database insertions outside Spark tasks.
But collecting the whole dataset with
List<Row> rows = ds.collectAsList();
produces an out-of-memory error.
And
List<Row> rows = ds.takeAsList(n);
only returns the first n rows of the dataset.
Is there a way to read the dataset sequentially, so that I can read its whole content from beginning to end, extracting only a fixed number of rows each time?

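You can try the randomSplit method to split your dataframe into multiple dataframes.
For example, to split into 3:
ds.randomSplit(Array(1.0, 1.0, 1.0))
The weights are normalized, and the resulting splits are random and only approximately equal in size.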
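Another option, not part of the answer above, is to stream the rows to the driver and group them into fixed-size batches there. Dataset.toLocalIterator() (and DataFrame.toLocalIterator() in PySpark) returns the rows one partition at a time, so the driver only needs enough memory for the largest partition. The sketch below shows just the batching step in plain Python; the batches helper and the insert_into_postgis name in the comment are hypothetical, and range(7) stands in for the row iterator.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Stand-in for the Spark row iterator. With PySpark this would look like:
#   for chunk in batches(df.toLocalIterator(), 1000):
#       insert_into_postgis(chunk)   # hypothetical function of your own
demo = list(batches(range(7), 3))
print(demo)
```

Because the insertions run on the driver, the non-serializable database code never has to be shipped to executors.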

Related

How to efficiently perform a lookup from a column in a large spark dataframe into a small (broadcastable) array

I have a smallish (a couple of thousand entries) list/array of pairs of doubles and a very large (> 100 million rows) Spark dataframe. The large dataframe has a column containing an integer which I want to use as an index into the smaller list. I want to return a dataframe with all the original columns plus the two related values from the list.
I could obviously create a dataframe from the list and do an inner join, but that seems inefficient, as the optimiser doesn't know that it only needs a single pair from the small list and that it can index directly into the list using the integer column from the large dataframe.
What's the most efficient way of doing this? Happy for answers using any API: Scala, PySpark, SQL, DataFrame or RDD.

Why is row count different when using spark.table().count() and df.count()?

I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns), which is my training set. However, when I use spark.table() I noticed that the row count differs from the row count returned by the DataFrame's count(). I am currently using PyCharm.
I want to preprocess the data in the table before I use it as a training set for a model I need to train.
When loading the table I found that the DataFrame I load it into is much smaller (10% of the data in this case).
What I have tried:
Raising the spark.kryoserializer.buffer.max capacity.
Loading a smaller table (70k rows) into the DataFrame, in which case I found no difference in the count() outputs.
This sample is very similar to the code I ran to investigate the problem.
df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output 18,261,961
I expect both outputs to be the same (the original 181m), yet they are not, and I don't understand why.

Apache Spark page results or view results on large datasets

I am using Hive with Spark 1.6.3.
I have a large dataset (40,000 rows, about 20 columns, each column holding roughly 500 bytes to 3 KB of data).
The query is a join of 3 datasets.
I want to be able to page through the final joined dataset, and I have found that I can use row_number() OVER (ORDER BY 1) to generate a unique row number for each row in the dataset.
After this I can do
SELECT * FROM dataset WHERE row between 1 AND 100
However, some resources advise against using ORDER BY this way because it puts all the data into one partition (I can see in the logs that the shuffle moves the data to a single partition), and when this happens I get out-of-memory exceptions.
How would I go about paging through the dataset in a more efficient way?
I have enabled persist with MEMORY_AND_DISK so that a partition that is too large spills to disk (and for some of the transformations I can see that at least some of the data spills to disk when I am not using row_number()).
One strategy is to select only the unique key of the dataset first and apply the row_number function to that smaller dataset. Since you are selecting a single column from a large dataset, chances are higher that it will fit in a single partition.
val dfKey = df.select("uniqueKey")
// createOrReplaceTempView is the Spark 2.x API; on Spark 1.6 use registerTempTable
dfKey.createOrReplaceTempView("dfKey")
val dfWithRowNum = spark.sql("SELECT dfKey.*, row_number() OVER (ORDER BY 1) AS row_number FROM dfKey")
// save dfWithRowNum
Once the row_number operation on uniqueKey has completed, save that dataframe. In the next stage, join it with the bigger dataframe to append the row_number column.
dfOriginal.createOrReplaceTempView("dfOriginal")
dfWithRowNum.createOrReplaceTempView("dfWithRowNum")
val joined = spark.sql("SELECT dfOriginal.*, dfWithRowNum.row_number FROM dfOriginal JOIN dfWithRowNum ON dfOriginal.uniqueKey = dfWithRowNum.uniqueKey")
// save joined
Now you can query
SELECT * FROM joineddataset WHERE row_number BETWEEN 1 AND 100
As for persisting with MEMORY_AND_DISK, I have found that it occasionally fails with insufficient memory. I would rather use DISK_ONLY, where performance is penalized but the execution is guaranteed.
Well, you can apply this method to your final joined dataframe.
You should also persist the dataframe as a file to guarantee the ordering, as reevaluation could create a different order.
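Once the row numbers are saved, the BETWEEN bounds for any page follow from the page number and the page size. A minimal sketch in plain Python, assuming 1-based pages, a saved table named joineddataset and a column named row_number as in the row_number() step above; page_query is a hypothetical helper, not a Spark API.

```python
def page_query(page, page_size, table="joineddataset", column="row_number"):
    """Build the SQL for one page; pages are numbered from 1."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    first = (page - 1) * page_size + 1   # first row number on this page
    last = page * page_size              # last row number on this page
    return "SELECT * FROM {0} WHERE {1} BETWEEN {2} AND {3}".format(
        table, column, first, last
    )


first_page = page_query(1, 100)
third_page = page_query(3, 100)
print(first_page)
print(third_page)
```

The resulting string can be passed to whatever SQL entry point is in use (sqlContext.sql on Spark 1.6, spark.sql on Spark 2.x).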

Divide operation in spark using RDD or dataframe

Suppose there is a dataset with some number of rows.
I need to find the heterogeneity, i.e. the number of distinct rows divided by the total number of rows.
Please help me with a Spark query to compute this.
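Both Dataset and DataFrame support the distinct function, which returns the distinct rows in the dataset.
So essentially you need to do
val heterogeneity = dataset.distinct.count.toDouble / dataset.count
The conversion to Double matters because count returns a Long, and dividing two Longs would truncate the ratio to 0 or 1.
The only caveat is that if the dataset is big, distinct can be expensive, and you might need to set the number of shuffle partitions (spark.sql.shuffle.partitions) appropriately.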
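As a small worked example of the ratio itself, here is the same computation on an in-memory list in plain Python. The sample rows and the heterogeneity function are made up for illustration; with PySpark the two counts would come from df.distinct().count() and df.count().

```python
def heterogeneity(rows):
    """Number of distinct rows divided by the total number of rows."""
    rows = list(rows)
    if not rows:
        raise ValueError("heterogeneity is undefined for an empty dataset")
    # In Python 3, / is true division, so no explicit float conversion is needed.
    return len(set(rows)) / len(rows)


# Four rows, one of which is a duplicate: 3 distinct rows out of 4.
sample = [("a", 1), ("b", 2), ("a", 1), ("c", 3)]
ratio = heterogeneity(sample)
print(ratio)  # 0.75
```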

Spark Python: Converting multiple lines from inside a loop into a dataframe

I have a loop that creates multiple rows of data which I want to convert into a dataframe.
Currently I am building a CSV-format string and, inside the loop, I keep appending rows to it, separated by newlines. I am creating CSV content so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it on newlines and convert it into a dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe, but one with only 1 column. I know that is because of the map function, where I am turning each line into a tuple with only 1 item. What I need is a list with multiple items. So perhaps I should be calling split() on every line to get a list. But I have a feeling that there should be a much more straightforward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset down to the rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract those values from column 3 whose corresponding flag in column 4 is set to 1 (here it is 0xcd).
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns, after which I am left with regular strings that I want to put back into a dataframe. I am not sure whether I can use Spark SQL to parse the output in the manner I want.
Yes, indeed, your current approach seems a little too complicated... Creating a large string in the Spark driver and then parallelizing it with Spark is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example, you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
unstructured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In the next step you can preprocess it using the Spark DataFrame or RDD API, depending on your use case.
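A bit late, but currently you're applying a map that creates a tuple for each row containing the whole string as its only element. Instead, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements, you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split(",")).toDF()
Note that split() with no argument splits on whitespace, so the comma has to be passed explicitly. The final "\n" in output_str also leaves an empty last element after output_str.split("\n"), which should be filtered out before calling toDF().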
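For the pipe-separated rows described in the question's edit, the per-row parsing can be written as an ordinary function. This is a minimal sketch in plain Python, assuming every row has exactly four '|'-separated fields and that a flag counts as set when it equals 1 in hexadecimal; flagged_values is a hypothetical name.

```python
def flagged_values(line):
    """Return the column-3 values whose matching column-4 flag equals 1."""
    fields = line.split("|")
    if len(fields) != 4:
        raise ValueError("expected 4 '|'-separated fields, got %d" % len(fields))
    values = fields[2].split(",")
    flags = fields[3].split(",")
    # Pair each value with its flag and keep the ones whose flag is 0x01.
    return [v for v, f in zip(values, flags) if int(f, 16) == 1]


sample_line = "Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00"
result = flagged_values(sample_line)
print(result)  # ['0xcd']
```

A function like this could be passed to rdd.map (or rdd.flatMap, to get one output row per flagged value), so that the parsing runs on the executors instead of on collected data at the driver.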
