Spark Java union/concat Multiple Dataframe/sql in loop

I have a requirement to union/concat multiple dataframes. Overall we have around 14000 such dataframes/SQL queries, which we generate at run time and then union before writing to Hive. I tried two ways, but both are very slow. Is there any way to optimize the code below or run the queries in parallel?
Note: I need the solution only in Spark Java.
Pseudocode
1st way:
Dataset<Row> dfUnion = null;
for (int i = 0; i < 14000; i++) {
    // the WHERE conditions differ on each iteration
    String someSql = "select columns from table where conditions";
    Dataset<Row> df = spark.sql(someSql);
    // the first result seeds the union; later results are appended to it
    dfUnion = (dfUnion == null) ? df : dfUnion.union(df);
}
// finally, write dfUnion to Hive
2nd way:
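// path is a placeholder for the Parquet staging directory
for (int i = 0; i < 14000; i++) {
    // the WHERE conditions differ on each iteration
    String someSql = "select columns from table where conditions";
    // overwrite on the first iteration, append afterwards
    SaveMode mode = (i == 0) ? SaveMode.Overwrite : SaveMode.Append;
    spark.sql(someSql).write().mode(mode).parquet(path);
}
// read everything back, then write dfRead to Hive
Dataset<Row> dfRead = spark.read().parquet(path);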
Any help would be appreciated.

Related

Looking for an Excel COUNTA equivalent for a DataTable VB.Net

I'm trying to find an equivalent to the Excel CountA function for a DataTable.
'This code works for searching through a range of columns in Excel
If xlApp.WorksheetFunction.CountA(WS.Range("A" & i & ":G" & i)) > 0 Then
DataExists = True
End If
'This is the code I need help with for searching through a DataTable
If DataTbl.Rows(i).Item(0:6).ToString <> "" Then
DataExists = True
End If
Hoping someone can help with this.
I think you simply need a for-each loop.
internal static int CountForEach(this DataTable? dt)
{
if (dt == null)
return 0;
int count = 0;
foreach (DataRow row in dt.Rows)
foreach (object? o in row.ItemArray)
if (o != DBNull.Value)
count++;
return count;
}
Usage:
DataTable dt = GetYourDataTable();
int countValues = dt.CountForEach();
This is also doable with LINQ but I think it would be slower -- I'll run some benchmarks later and update my answer.
EDIT
I added these two LINQ methods:
internal static int CountLinqList(this DataTable? dt)
{
int count = 0;
dt?.Rows.Cast<DataRow>().ToList().ForEach(row => count += row.ItemArray.Where(g => g != DBNull.Value).Count());
return count;
}
internal static int CountLinqParallel(this DataTable? dt)
{
ConcurrentBag<int> ints = new();
dt?.AsEnumerable().AsParallel().ForAll(row => ints.Add(row.ItemArray.Where(g => g != DBNull.Value).Count()));
int count = ints.Sum();
return count;
}
These are the statistics obtained with BenchmarkDotNet:
I used a pseudo-randomly generated datatable of around 5.5 million rows and three columns as test.
I think these results may change with larger datatables, but for smaller (around 500k rows and less) the fastest method will probably be the simple for-each loop.
Fastest methods:
For-each loop
LINQ parallel
LINQ list (ToList().ForEach)
I'm surely not a LINQ-guru but I'd like to be, so if someone has a better LINQ implementation please let me know.
By the way, I don't think this is a typical LINQ use case.

Creating a Bubble Chart of text columns

I have two text columns which look like this:
and now I need to make a bubble chart that looks like this:
Any idea how I can achieve this using Excel 2016?
There is a way to do this using JavaScript. The language has many powerful libraries for data visualization and data processing, and you can use it in Excel through an Excel add-in called Funfun.
I have written a working code for you:
https://www.funfun.io/1/#/edit/5a7c4d5db8b2864030f9de15
I used an online editor with an embedded spreadsheet to build this chart. I use a JSON file (short/full, underneath Settings) to get the data from the spreadsheet into my JavaScript code:
{
"data": "=A1:B18"
}
I then store it in local variables in the script.js so I can use them correctly in the chart I will create:
var Approaches = []; // list of Approaches
var Contribution = []; // list of contribution
var tmpC = [];
/*
* Parse your spreadsheet to collect the distinct approaches and contributions
*/
for (var x = 1; x < $internal.data.length; x++) {
if (Approaches.indexOf($internal.data[x][0]) <= -1)
Approaches.push($internal.data[x][0]);
if (tmpC.indexOf($internal.data[x][1]) <= -1)
tmpC.push($internal.data[x][1]);
}
/*
* sort the array so that other is at the end
* (remove this if you don't care about the order, and replace 'tmpC' with 'Contribution' above)
*/
for (var t = tmpC.length - 1; t >= 0; t--)
Contribution.push(tmpC[t]);
var techniquesIndex = new Array(Contribution.length); // how much of one contribution is made per approach
var total = 0; // total of contribution
var totalPerApproaches = new Array(Approaches.length); //total of contributions for one Approach
for (var z = 0; z < totalPerApproaches.length; z++) {
totalPerApproaches[z] = 0;
}
var data = []; // data for the chart
/*
* Loop over every approach
*/
for (var x = 0; x < Approaches.length; x++) {
for (var z = 0; z < techniquesIndex.length; z++) {
techniquesIndex[z] = 0;
}
/*
* Parse your spreadsheet to count the number of contribution in this approach
*/
for (var y = 0; y < $internal.data.length; y++) {
if (Approaches.indexOf($internal.data[y][0]) == x) {
total += 1;
techniquesIndex[Contribution.indexOf($internal.data[y][1])] += 1;
}
}
for (var c = 0; c < Contribution.length; c++) {
/*
* calculate the total of contribution on this approach
*/
totalPerApproaches[x] += techniquesIndex[c];
/*
* removes the values equals to zero off the chart
* (remove this condition if you want to show the zeros)
*/
if (techniquesIndex[c] == 0)
continue;
/*
* adds a bubble to the charts with the number of Contribution per Approach
*/
data.push(
{
x: x, // -> index of array Approach[x]
y: c, // -> index of array Contribution[c]
z: techniquesIndex[c], // number of contribution[c] in Approach[x]
name: techniquesIndex[c] // ..
});
}
}
$internal.data is the data from the spreadsheet, made accessible by the JSON file. The array data (at the end) will be used to create all the bubbles of the chart.
Once I have my data stored in the right format, I create the chart in index.html using a data visualization library called Highcharts, which has lots of examples and good documentation for beginners. You can choose to add many options for your chart, and at the end you pass your data to the chart like this:
series: [{
data: data // use the data from script.js
}]
Once you've built your chart you can open it in Excel by pasting the URL into the Funfun Excel add-in. Here is how it looks with my example:
You can add as many rows as you want; you just need to make sure that the range of data in the JSON file covers what you need.
You can then save the chart in many formats:
Hope this helps!
Disclosure : I’m a developer of funfun

Is the number of Parameters in the IN-Operator in Cassandra limited?

I have a pretty simple question which I can't find an answer to on the Internet or on stackoverflow:
Is the number of Parameters in the IN-Operator in Cassandra limited?
I have run some tests with a simple table with integer keys from 1 to 100000. If I put the keys from 0 to 1000 in my IN operator (like SELECT * FROM test.numbers WHERE id IN (0,..,1000)), I get the correct number of rows back. But for 0 to 100000, for example, I always get only 34464 rows back, and for 0 to 75000 it's 9464.
I am using the Datastax Java Driver 2.0, and the relevant parts of the code look like this:
String query = "SELECT * FROM test.numbers WHERE id IN ?;";
PreparedStatement ps = iot.getSession().prepare(query);
BoundStatement bs = new BoundStatement(ps);
List<Integer> ints = new ArrayList<Integer>();
for (int i = 0; i < 100000; i++) {
ints.add(i);
}
bs.bind(ints);
ResultSet rs = iot.getSession().execute(bs);
int rowCount = 0;
for (Row row : rs) {
rowCount++;
}
System.out.println("counted rows: " + rowCount);
It's also possible that I'm binding the list of Integers in a wrong way. If that's the case I would appreciate any hints too.
I am using Cassandra 2.0.7 with CQL 3.1.1.
This is not a real limitation of the IN operator, but one of PreparedStatement.
Using a BuiltStatement and QueryBuilder I didn't have any of these problems.
Try it yourself:
List<Integer> l = new ArrayList<>();
for (int i = 0; i < 100000; i++) {
l.add(i);
}
// in(...) is QueryBuilder.in, used here through a static import
BuiltStatement bs = QueryBuilder.select().column("id").from("test.numbers").where(in("id", l.toArray()));
ResultSet rs = Cassandra.DB.getSession().execute(bs);
System.out.println("counted rows: " + rs.all().size());
HTH,
Carlo
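The row counts reported in the question are consistent with a length field that wraps at 16 bits: 100000 - 65536 = 34464 and 75000 - 65536 = 9464. As far as I know, the native protocol version used by driver 2.0 with Cassandra 2.0 encodes the size of a bound collection as an unsigned 16-bit value, which would explain why the bound list is silently truncated, while a statement built with QueryBuilder presumably avoids the problem by not sending the values as one bound collection. The Java sketch below only reproduces that arithmetic. The class and method names are made up for illustration, and nothing here calls the driver.

```java
class InListWrap {
    // Length that survives if the element count is stored in an
    // unsigned 16-bit field (an assumption about the wire format).
    static int wrappedLength(int requested) {
        // Keeping the low 16 bits equals requested % 65536 for non-negative input.
        return requested & 0xFFFF;
    }

    public static void main(String[] args) {
        System.out.println(wrappedLength(100000)); // 34464, as in the question
        System.out.println(wrappedLength(75000));  // 9464, as in the question
        System.out.println(wrappedLength(1001));   // 1001, small lists are unaffected
    }
}
```

If that explanation is right, splitting the keys into chunks of fewer than 65536 elements per bound list should also return the full result.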

Search an integer in a row-sorted two dim array, is there any better approach?

I recently came across this problem:
you have to find an integer in a sorted two-dimensional array, but the array is sorted within each row, not across columns. I have solved the problem, but I suspect there may be a better approach, so I would like to discuss it here. Your suggestions and improvements will help me grow as a programmer. Here is the code:
int searchInteger = Int32.Parse(Console.ReadLine());
int cnt = 0;
for (int i = 0; i < x; i++)
{
if (intarry[i, 0] <= searchInteger && intarry[i,y-1] >= searchInteger)
{
if (intarry[i, 0] == searchInteger || intarry[i, y - 1] == searchInteger)
Console.WriteLine("string present {0} times" , ++cnt);
else
{
int[] array = new int[y];
int y1 = 0;
for (int k = 0; k < y; k++)
array[k] = intarry[i, y1++];
bool result;
if (result = binarySearch(array, searchInteger) == true)
{
Console.WriteLine("string present inside {0} times", ++ cnt);
Console.ReadLine();
}
}
}
}
Here searchInteger is the integer to find in the array, and binarySearch is a method that returns a boolean indicating whether the value is present in a one-dimensional array (here, a single row).
Is this optimal, or are there better solutions?
Thanks
Provided you have declared the array intarry, x and y as follows:
int[,] intarry =
{
{0,7,2},
{3,4,5},
{6,7,8}
};
var y = intarry.GetUpperBound(0)+1;
var x = intarry.GetUpperBound(1)+1;
// intarry.Dump();
You can keep it as simple as:
int searchInteger = Int32.Parse(Console.ReadLine());
var cnt=0;
for(var r=0; r<y; r++)
{
for(var c=0; c<x; c++)
{
if (intarry[r, c].Equals(searchInteger))
{
cnt++;
Console.WriteLine(
"string present at position [{0},{1}]" , r, c);
} // if
} // for
} // for
Console.WriteLine("string present {0} times" , cnt);
This example assumes that you have no information about whether the array is sorted, which means you have to go through every element and can't use binary search. If you know more about how the data in the array is structured, you can refine the performance:
If the rows are sorted ascending, you can replace the inner for loop with a binary search.
If the entire array is sorted ascending and the data does not repeat, e.g.
int[,] intarry = {{0,1,2}, {3,4,5}, {6,7,8}};
then you can exit the loop as soon as the item is found. The easiest way to do this is to create
a function and add a return statement to the inner for loop.
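The first refinement (a binary search per sorted row) can be sketched in Java as follows. This is only an illustration under the assumption that every row is sorted ascending. The class and method names are hypothetical, and it counts rows that contain the value rather than individual occurrences.

```java
import java.util.Arrays;

class RowSortedSearch {
    // Counts how many rows of a row-sorted matrix contain target.
    static int countRowsContaining(int[][] matrix, int target) {
        int count = 0;
        for (int[] row : matrix) {
            if (row.length == 0) {
                continue; // nothing to search in an empty row
            }
            // Cheap range check: skip rows whose first..last span excludes target.
            if (target < row[0] || target > row[row.length - 1]) {
                continue;
            }
            // Arrays.binarySearch returns a non-negative index when the key is found.
            if (Arrays.binarySearch(row, target) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] matrix = { {0, 2, 7}, {3, 4, 5}, {6, 7, 8} };
        System.out.println(countRowsContaining(matrix, 7)); // 2
        System.out.println(countRowsContaining(matrix, 9)); // 0
    }
}
```

With r rows of c columns this does O(r log c) comparisons instead of O(r * c), and the range check from the question's code still skips rows that cannot match.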

Is it possible to do a Levenshtein distance in Excel without having to resort to Macros?

Let me explain.
I have to do some fuzzy matching for a company, so at the moment I use a Levenshtein distance calculator and then calculate the percentage of similarity between the two terms. If the terms are more than 80% similar, Fuzzymatch returns "TRUE".
My problem is that I'm on an internship and leaving soon. The people who will continue doing this do not know how to use Excel with macros, and want me to implement what I did as best I can.
So my question is: however inefficient the function may be, is there ANY way to make a standard function in Excel that will calculate what I did before, without resorting to macros?
Thanks.
If you came across this by googling something like
levenshtein distance google sheets
I threw this together, based on the code comment from milot-midia on this gist (https://gist.github.com/andrei-m/982927 - code under MIT license)
In Sheets, open Tools -> Script Editor from the header menu
Name the project
The name of the function (not the project) is the name you will use to call it from a cell
Paste the following code
function Levenshtein(a, b) {
if(a.length == 0) return b.length;
if(b.length == 0) return a.length;
// swap to save some memory O(min(a,b)) instead of O(a)
if(a.length > b.length) {
var tmp = a;
a = b;
b = tmp;
}
var row = [];
// init the row
for(var i = 0; i <= a.length; i++){
row[i] = i;
}
// fill in the rest
for(var i = 1; i <= b.length; i++){
var prev = i;
for(var j = 1; j <= a.length; j++){
var val;
if(b.charAt(i-1) == a.charAt(j-1)){
val = row[j-1]; // match
} else {
val = Math.min(row[j-1] + 1, // substitution
prev + 1, // insertion
row[j] + 1); // deletion
}
row[j - 1] = prev;
prev = val;
}
row[a.length] = prev;
}
return row[a.length];
}
You should be able to run it from a spreadsheet with
=Levenshtein(cell_1,cell_2)
While it can't be done in a single formula for any reasonably-sized strings, you can use formulas alone to compute the Levenshtein Distance between strings using a worksheet.
Here is an example that can handle strings up to 15 characters, it could be easily expanded for more:
https://docs.google.com/spreadsheet/ccc?key=0AkZy12yffb5YdFNybkNJaE5hTG9VYkNpdW5ZOWowSFE&usp=sharing
This isn't practical for anything other than ad-hoc comparisons, but it does do a decent job of showing how the algorithm works.
Looking at the previous answers on calculating Levenshtein distance, I think it would be impossible to create it as a formula.
Take a look at the code here
Actually, I think I just found a workaround. I was adding it in the wrong part of the code...
Adding this line
} else if(b.charAt(i-1)==a.charAt(j) && b.charAt(i)==a.charAt(j-1)){
val = row[j-1]-0.33; //transposition
so it now reads
if(b.charAt(i-1) == a.charAt(j-1)){
val = row[j-1]; // match
} else if(b.charAt(i-1)==a.charAt(j) && b.charAt(i)==a.charAt(j-1)){
val = row[j-1]-0.33; //transposition
} else {
val = Math.min(row[j-1] + 1, // substitution
prev + 1, // insertion
row[j] + 1); // deletion
}
Seems to fix the problem. Now 'biulding' is 92% accurate and 'bilding' is 88%. (whereas with the original formula 'biulding' was only 75%... despite being closer to the correct spelling of building)
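For comparison, here is a minimal Java sketch of the unmodified Levenshtein distance together with a similarity score. The similarity formula, 1 - distance / length of the longer string, is an assumption: the thread never states it, but it reproduces the 75% quoted above for 'biulding' against 'building'. The class and method names are hypothetical.

```java
class Levenshtein {
    // Classic two-row dynamic programming; O(a.length * b.length) time.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // match or substitution
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("biulding", "building")); // 0.75
        System.out.println(similarity("bilding", "building"));  // 0.875
    }
}
```

A swap of two adjacent letters costs 2 here, which is why 'biulding' scores lower than 'bilding'. Counting a transposition as a single edit (the Damerau-Levenshtein variant) is the usual way to address that, as an alternative to subtracting a fixed 0.33.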
