Can I read a file into different partitions in spark? - apache-spark

I am trying to use spark for data processing for my big data but spark opens excessive amount of connection to my database resulting in overload. Is it possible to create empty partitions and read the desired data into them? With some function like forEachPartition? (This should be in initial reading stage which means I can not read the data fully and then make changes)

If I understand your question correctly you are facing issue with too many connection to the database.
you can refer this official documentation : https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
you need to set the numPartitions property which controls the number of parallel connections to the database and also you can configure the partitionColumn, lowerBound and upperBound to get more control over the partition size in spark dataframe.

Related

Spark 2.4.6 + JDBC Reader: When predicate pushdown set to false, is data read in parallel by spark from the engine?

I am trying to extract data from a big table in SAP HANA, which is around 1.5tb in size, and the best way is to run in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires partition column, lower/upper bound and number of partitions option to be set. To make the operation of the extraction easier, I considered adding an added partition column which would be the row_number() function and use MIN(), MAX() as lower/upper bounds respectively. And then the operations team just would be required to provide the number of partitions to have.
The problem is that HANA runs out of memory and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the where filters and retrieve the corresponding chunk.
So my question is, if I disable the predicate pushdown option, how does spark behave? is it only read by one executor and then the filters are applied on spark side? Or does it do some magic to split the fetching part from the DB?
What could you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run pre-ingestion query to fetch the size of the Dataset being loaded, i.e. as you have mentioned Min(), Max() etc.
Expecting that the data is uniformly distributed between Min and Max keys, you can partition across executors in Spark by providing Min/Max/Number of Executors.
You don't need(want) to change your primary datasource by adding additional columns to support data ingestion in this case.

Is there way to get a rowcount on a query using Snowflake and its Spark Connector?

I am running a query in my Spark application that returns a substantially large amount of data. I would like to know how many rows of data are being queried for logging purposes. I can't seem to find a way to get the number of rows without either manually counting them, or calling a method to count for me, as the data is fairly large this gets expensive for logging. Is there a place that the rowcount is saved and available to grab?
I have read here that the Python connector saves the rowcount into the object model, but i can't seem to find any equivalent for the Spark Connector or its underlying JDBC.
The most optimal way I can find is rdd.collect().size on the RDD that Spark provides. It is about 15% faster than calling rdd.count()
Any help is appreciated 😃
The limitation is within Spark's APIs that do not directly offer metrics of a completed distributed operation such as a row count metric after a save to table or file. Snowflake's Spark Connector is limited to the calls Apache Spark offers for its integration, and the cursor attributes otherwise available in the Snowflake Python and JDBC Connectors are not accessible through Py/Spark.
The simpler form of the question of counting an executed result, removing away Snowflake specifics, has been discussed previously with solutions: Spark: how to get the number of written rows?

Mysql or Spark Processing of 400gb data

If I use spark in my case, based on block and cores will it be useful ?
I have 400 GB of data in single table i.e. User_events with multiple columns in MySQL. This table stores all user events from application. Indexes are there on required columns. I have an user interface where user can try different permutation and combination of fields under user_events
Currently I am facing the performance issues where query either takes 15/20 seconds or even longer or times out.
I have gone through couple of Spark tutorial but I am not sure if it can help here. Per mine understanding from spark,
First Spark has to bring all the data in memory. Bring 100 M record on netwok will be costly operation and I will be needing big memory for the
same. Isn't it ?
Once data in memory, Spark can distribute the data among partition based on cores and input data size. Then it can filter the data on each partition
in parallel. Here Spark can be beneficial as it can do the parallel operation while MySQL will be sequential. Is that correct ?
Is my understanding correct ?

How to combine small parquet files with Spark?

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")
I have not tried but read it in getting started guide that setting this property should work "hive.merge.sparkfiles=true"
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
In case using Spark on Hive, than Spark's abstraction doesn't provide explicit split of data. However we can control the parallelism in several ways.
You can leverage DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
In case you are using Hive Context than ensure hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at following documentation about Spark data parallelism - http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.

spark datasax cassandra connector slow to read from heavy cassandra table

I am new to Spark/ Spark Cassandra Connector. We are trying spark for the first time in our team and we are using spark cassandra connector to connect to cassandra Database.
I wrote a query which is using a heavy table of the database and I saw that Spark Task didn't start until the query to the table fetched all the records.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use.
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
.cassandraTable(keyspaceName, tableName);
Is there a way to tell spark to start working even if all the data didn't finish to download ?
Is there an option to tell spark-cassandra-connector to use more threads for the fetch ?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and I found that Spark was creating too many partitions for the scan and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20 minute job down to about four minutes. There are also a couple more Cassandra read specific Spark variables that you can set found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing with regards to fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mbhigher as a form of workaround.

Resources