Are there efficient ways to process data column-wise (vs row-wise) in spark?
I'd like to do some whole-database analysis: iterate through each column in a database and compare it to another column with a significance test.
colA = "select id, colA from table1"
foreach table, t:
foreach id,colB in t: # "select id, colB from table2"
# align colA,colB by ID
ab = join(colA,colB)
yield comparefunc(ab)
I have ~1M rows but ~10k columns.
Issuing ~10k selects is very slow. Shouldn't I be able to do a single select * and broadcast each column to a different node for processing?
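A minimal sketch of that idea in Scala: read each table once, join on id once, then iterate over the remaining columns of table2. The comparefunc from the pseudocode is left as a placeholder here, and the table/column names are taken from the question:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// Placeholder for the significance test from the pseudocode above.
def compareFunc(pairs: DataFrame): Double = ???

// Read each table once instead of issuing ~10k selects.
val colA = spark.table("table1").select("id", "colA").cache()
val t2   = spark.table("table2").cache()

// Align once by joining on id, then compare column by column.
val aligned = colA.join(t2, Seq("id")).cache()

val results = t2.columns.filter(_ != "id").map { c =>
  c -> compareFunc(aligned.select(col("colA"), col(c)))
}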
I'm trying to join two tables in Spark SQL. Each table has 50+ columns, and both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type out all the other column names in the query above (other than id, there are no other duplicated column names).
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
Assuming you want to preserve the "duplicates", you can use an internal row id or an equivalent. This has helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL this also outputs the internal row id, so rows that were exactly identical before become distinguishable. I don't know spark.sql, but I assume you can access a similar attribute there.
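Spark has no ctid, but monotonically_increasing_id() can play a similar role. A small sketch, assuming an active SparkSession named spark:
import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag every row with a unique (non-consecutive) id, so rows that were
// previously identical become distinguishable.
val tagged = spark.sql("select * from tbl1")
  .withColumn("row_id", monotonically_increasing_id())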
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional
  )
joined should have only one id column. Tested with Spark 2.4.8
I'm trying to drop duplicates based on column1 and select the row with the max value in column2. Column2 has "year" (2019, 2020, etc.) as values and it is of type String. The solution I have is converting column2 into an integer and selecting the max value.
Dataset<Row> ds; // The dataset with column1, column2 (year), column3, etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all the other columns in the original dataset 'ds' when I do the group by, so I have to join 'ds' and 'newDs' to get back all the original columns. Also, casting the String column to Integer looks like an ineffective workaround.
Is it possible to drop the duplicates and get the row with the bigger string value from the original dataset itself?
This is a classic de-duplication problem and you'll need to use Window + Rank + filter combo for this.
I'm not very familiar with the Java syntax, but the sample code should look something like the below:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

Dataset<Row> df = ???; // the input dataset with column1, column2, etc.

WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));

Dataset<Row> result =
    df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
      .withColumn("rank", functions.rank().over(windowSpec)) // rank rows within each column1 partition
      .where("rank == 1")                                    // keep the top-ranked rows (max column2Int)
      .drop("rank");

result.show(false);
Overview of what happens:
Add the cast integer column to the df for later sorting.
Subsections/windows (partitions) are formed in your dataset based on the value of column1.
Within each of these subsections/windows/partitions, the rows are sorted on the column cast to int, in descending order since you want the max.
Ranks (like row numbers) are assigned to the rows in each partition/window.
All rows where the rank is 1 are kept (the max value, since the ordering was descending).
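For reference, a rough Scala rendering of the same logic (a sketch, assuming a DataFrame ds with the columns from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, rank}

val windowSpec = Window.partitionBy("column1").orderBy(desc("column2Int"))

val result = ds
  .withColumn("column2Int", col("column2").cast("int"))  // cast the year string for ordering
  .withColumn("rank", rank().over(windowSpec))           // rank rows within each column1 partition
  .where(col("rank") === 1)                              // keep the top-ranked (max year) rows
  .drop("rank", "column2Int")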
I want to read the data in Spark SQL by joining two very large tables, but I just need a fixed number of rows (let's say 500) from the resultant dataframe.
For example -
SELECT id, name, employee.deptno, deptname
FROM employee INNER JOIN department ON employee.deptno = department.deptno
Here I can use the head(500) or limit(500) function on the resultant dataframe to limit its rows, but it is still going to read the full data from both tables first and only then apply the limit to the result.
Is there a way I can avoid reading the full data before applying the limit?
Something like this:
employee = spark.sql('select id, name, deptno from employee limit 500')
department = spark.sql('select deptno, deptname from department limit 500')
employee = employee.join(department, on = 'deptno', how = 'inner')
Table A has 20 records and table B has 19 records. How do I find the one record that is missing from table B? How do I compare/subtract the records of these two tables to find that one record? I'm running the query in Apache Superset.
The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.col1 = a.col1 AND
b.col2 = a.col2 AND
b.col3 = a.col3)
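If the two tables are also available as Spark DataFrames, a left anti join expresses the same idea. A sketch, assuming hypothetical DataFrames dfA and dfB that share a key column pk:
// Rows of dfA whose pk has no matching pk in dfB
val missing = dfA.join(dfB, Seq("pk"), "left_anti")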
I have a query to join the tables. How do I optimize it to run faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
The distinct_viewed_2610 table has 1.1 million records, and I am trying to get similar ids for it from table tets_2610, which has 200,000 rows, by splitting the second column.
For 100,000 records it takes 8.5 hours to complete the job with two machines:
one with 16 GB RAM and 16 cores,
the second with 8 GB RAM and 8 cores.
Is there a way to optimize the query?
Now you are doing a Cartesian join. A Cartesian join gives you 1.1M * 200K = 220 billion rows. Only after the Cartesian join is the result filtered by where FIND_IN_SET(a.value, b.other) != 0.
Analyze your data.
If the 'other' string contains 10 elements on average, then exploding it will give you about 2M rows in table b. And if, say, only 1/10 of those rows join, the INNER JOIN will leave you with about 200K rows.
If these assumptions are correct, then exploding the array and joining will perform better than a Cartesian join plus filter.
select distinct a.value as viewedid, b.otherids
from bm.distinct_viewed_2610 a
inner join (
    select e.otherid, b.other as otherids
    from bm.tets_2610 b
    lateral view explode(split(b.other, ',')) e as otherid
) b on a.value = b.otherid
And you do not need this:
and a.value in (select value from bm.distinct_viewed_2610)
Sorry, I cannot test the query; please try it yourself.
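The same explode-and-join idea in the DataFrame API, as a sketch (using the hiveCtx and table names from the question; untested, like the query above):
import org.apache.spark.sql.functions.{col, explode, split}

val viewed = hiveCtx.table("bm.distinct_viewed_2610")   // column: value
val others = hiveCtx.table("bm.tets_2610")              // columns: id, other

// One row per id contained in the comma-separated 'other' column,
// so the join becomes an equality join instead of a Cartesian scan.
val exploded = others.withColumn("otherid", explode(split(col("other"), ",")))

val joined = viewed
  .join(exploded, viewed("value") === exploded("otherid"))
  .select(col("value").as("viewedid"), col("other").as("otherids"))
  .distinct()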
If you are using the ORC format, change to Parquet; as per your data, I would say choose range partitioning.
Choose the parallelism properly to execute fast.
I have answered on the following link; it may help you:
Spark doing exchange of partitions already correctly distributed
Please also read this:
http://dev.sortable.com/spark-repartition/
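On the parallelism point, one concrete knob is to repartition both sides on the join key before joining, so the work is spread across a chosen number of partitions. A sketch, assuming hypothetical DataFrames dfViewed and dfExploded and an illustrative partition count:
import org.apache.spark.sql.functions.col

// Repartition both sides on the join key; 200 is only an example value.
val a = dfViewed.repartition(200, col("value"))
val b = dfExploded.repartition(200, col("otherid"))

val joined = a.join(b, a("value") === b("otherid"))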