Dataframe column update - apache-spark

I have a CSV file loaded into a dataframe with country, state, and city columns.
My task is to validate each country, state, and city value against its master table entry, and to update an error_code column with a code such as
'S001', 'L001', or 'C001' if the value does not exist in the master table.
The master tables for country, state, and city are separate.
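A minimal PySpark sketch of one way to do this, assuming the master tables are registered as master_country, master_state, and master_city with matching column names, and that 'S001'/'L001'/'C001' map to country/state/city respectively (the question does not spell out the mapping):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("input.csv", header=True)  # columns: country, state, city

# One distinct-value lookup per master table, flagged so a failed join
# shows up as a null after the left join.
countries = spark.table("master_country").select("country").distinct().withColumn("c_ok", F.lit(1))
states = spark.table("master_state").select("state").distinct().withColumn("s_ok", F.lit(1))
cities = spark.table("master_city").select("city").distinct().withColumn("ci_ok", F.lit(1))

checked = (df.join(countries, "country", "left")
             .join(states, "state", "left")
             .join(cities, "city", "left"))

# First failing check wins; rows that pass all three keep a null error_code.
result = checked.withColumn(
    "error_code",
    F.when(F.col("c_ok").isNull(), F.lit("S001"))
     .when(F.col("s_ok").isNull(), F.lit("L001"))
     .when(F.col("ci_ok").isNull(), F.lit("C001"))
).drop("c_ok", "s_ok", "ci_ok")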

Related

Getting data in database format from excel file [Python]

I have a table in Excel that has this structure:

Country  County  20/01/2020  21/01/2020
Country  County  Value 1     Value 2

I would like to be able to convert the table into the following format so that I could add it to a table in my database:

Country  County  Date        Values
Country  County  20/01/2020  Value 1
Country  County  21/01/2020  Value 2

Is there a quick way to do this, or should I iterate over each row and build a dataframe from there? The Excel file has millions of entries.
pd.melt is what you're looking for:
pd.melt(df, id_vars = ['Country','County'], var_name = 'Date')
Supply the dataframe as the first argument; id_vars tells the function which columns you want to keep for each row. The rest of the columns are turned into values in a new column of the melted dataframe, and var_name sets what that new column is called.
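For illustration, here is the same call on a small frame built from the question's sample data (loading the real sheet would use pd.read_excel instead):

import pandas as pd

df = pd.DataFrame({
    "Country": ["Country"],
    "County": ["County"],
    "20/01/2020": ["Value 1"],
    "21/01/2020": ["Value 2"],
})

long_df = pd.melt(df, id_vars=["Country", "County"],
                  var_name="Date", value_name="Values")
print(long_df)
#    Country  County        Date   Values
# 0  Country  County  20/01/2020  Value 1
# 1  Country  County  21/01/2020  Value 2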

Is there a way to limit the data being read when joining tables in spark sql?

I want to read data in Spark SQL by joining two very large tables, but I only need a fixed number of rows (say 500) from the resulting dataframe.
For example:
SELECT id, name, employee.deptno, deptname
FROM employee INNER JOIN department ON employee.deptno = department.deptno
Here I can use head(500) or limit(500) on the resulting dataframe to cap the rows, but Spark will still read the full data from both tables first and only then apply the limit.
Is there a way to avoid reading the full data before applying the limit?
Something like this:
employee = spark.sql('select id, name, deptno from employee limit 500')
department = spark.sql('select deptno, deptname from department limit 500')
employee = employee.join(department, on = 'deptno', how = 'inner')
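The same idea as a single spark.sql call, pushing the limits into subqueries. The caveat applies either way: LIMIT without ORDER BY takes 500 arbitrary rows from each table, so the join result can differ from limiting after the join.

limited = spark.sql("""
    SELECT e.id, e.name, e.deptno, d.deptname
    FROM (SELECT id, name, deptno FROM employee LIMIT 500) e
    INNER JOIN (SELECT deptno, deptname FROM department LIMIT 500) d
        ON e.deptno = d.deptno
""")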

Spark Dataset appending unique ID

I'm wondering whether there is an "already implemented" alternative for appending a unique ID to a Spark dataset.
My scenario:
I have an incremental job that runs each day processing a batch of information. In this job, I create a dimension table and assign a unique ID to each row using monotonically_increasing_id(). On the next day, I want to append some rows to that table and generate unique IDs for those rows as well.
Example:
day 1:
something_table
uniqueId name
100001 A
100002 B
day 2:
something_table
uniqueId name
100001 A
100002 B
100003 C -- new data that must be created on day 2
Code snippet for day 1:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

case class BasicSomething(name: String)
// the field name must match the column name for .as[SomethingTable] to work
case class SomethingTable(uniqueId: Long, name: String)

val ds: Dataset[BasicSomething] = spark.createDataset(Seq(BasicSomething("A"), BasicSomething("B")))
ds.withColumn("uniqueId", monotonically_increasing_id())
  .as[SomethingTable]
  .write.csv("something")
I have no idea how to keep state for monotonically_increasing_id() so that on the next day it knows the ids that already exist in something_table.
You can always get the last uniqueId of the dataset you created previously. You can then add it as an offset to monotonically_increasing_id() to create the new uniqueIds:
ds.withColumn("uniqueId", monotonically_increasing_id() + lastUniqueId + 1)
where lastUniqueId is the highest id in the previous day's dataframe (the + 1 is needed because monotonically_increasing_id() starts at 0, which would otherwise reuse the last id).
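As a sketch of the same offset idea in PySpark (the output path and column order of the day-1 CSV are assumptions based on the snippet above):

from pyspark.sql import functions as F

# Read back the day-1 output and find the highest id handed out so far.
previous = spark.read.csv("something", schema="name STRING, uniqueId LONG")
last_id = previous.agg(F.max("uniqueId")).first()[0]

# Offset the new ids so they start above the previous maximum; note that
# monotonically_increasing_id() is only guaranteed increasing, not consecutive.
new_rows = spark.createDataFrame([("C",)], ["name"])
with_ids = new_rows.withColumn(
    "uniqueId", F.monotonically_increasing_id() + F.lit(last_id + 1))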

Row Level Security in Oracle

I have a table ATM_Plan
CREATE TABLE ATM_PLAN
(
BRANCH VARCHAR2(4) Primary Key,
SAMITY_CODE VARCHAR2(4),
SAMITY_NAME VARCHAR2(30),
INT_CLS_MONTH DATE,
TTL_MEMBER NUMBER,
TTL_LONEE NUMBER,
CM_TRG_DT DATE,
ACT_CM_DT DATE,
USER_CODE VARCHAR2(5));
Sample records:
insert into ATM_PLAN (BRANCH, SAMITY_CODE) VALUES ('001', '20');
insert into ATM_PLAN (BRANCH, SAMITY_CODE) VALUES ('002', '20');
I have also developed a form for entering data into this table. Multiple users will insert records, but I want to restrict entry so that specific users can only work on their own branch's records.
For example, a user for branch 001 must not be able to insert or update branch 002 records.
I have 100 branches.
Is it possible?
Thanks in advance.
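One common approach is Oracle's Virtual Private Database (VPD, the DBMS_RLS package), which appends a branch predicate to every statement against the table. A hedged sketch, driven here from Python with the oracledb driver; the connection details, schema name, and the use of CLIENT_IDENTIFIER to carry the user's branch are assumptions, and the same PL/SQL can be run directly in SQL*Plus instead:

import oracledb

# Connection details are placeholders.
conn = oracledb.connect(user="atm_admin", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Policy function: returns the predicate Oracle appends to statements
# against ATM_PLAN. The user's branch is assumed to be set as the session's
# CLIENT_IDENTIFIER; a lookup table mapping USER to branch would also work.
cur.execute("""
    CREATE OR REPLACE FUNCTION branch_policy (
        p_schema IN VARCHAR2, p_object IN VARCHAR2
    ) RETURN VARCHAR2 AS
    BEGIN
        RETURN 'BRANCH = SYS_CONTEXT(''USERENV'', ''CLIENT_IDENTIFIER'')';
    END;
""")

# Attach the policy; update_check => TRUE also validates INSERTed and
# UPDATEd rows against the predicate, so branch 001 cannot write 002 rows.
cur.execute("""
    BEGIN
        DBMS_RLS.ADD_POLICY(
            object_schema   => 'ATM_ADMIN',
            object_name     => 'ATM_PLAN',
            policy_name     => 'ATM_PLAN_BRANCH',
            function_schema => 'ATM_ADMIN',
            policy_function => 'BRANCH_POLICY',
            statement_types => 'SELECT, INSERT, UPDATE, DELETE',
            update_check    => TRUE);
    END;
""")
conn.commit()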

Import Excel Spreadsheet into Existing MS Access Tables

I have an Access database. Here is the setup for a few tables.
id - companyID (autonumber) PK, company (text), firstName (text), lastName (text)
category - companyID (number) combined PK with category, category (text)
salesmen - companyID (number) combined PK with code, code (text)
There is a 1-many relationship between id and category and between id and salesmen.
If I have a spreadsheet with columns of company, firstName, lastName, category1, category2, category3, salesman1, salesman2, how could I import the columns into the appropriate tables?
My first idea was to import the spreadsheet and then append company, firstName and lastName to the id table. Then I would join the imported spreadsheet with the id table to create a new table with all of the spreadsheet columns plus the auto generated companyID. Then I could append companyID and category1 to the category table. Then do the same for category2 and 3 and so on.
This seems really complicated if I have a lot of spreadsheets to import. Also, the person who will be importing the spreadsheets isn't a programmer, so she wants it to be as user-friendly as possible.
Is there a better way to import these spreadsheets?
Thanks!
What I would do is create another table to import the raw data into, then INSERT the data from there into the relevant tables.
' Clear out the previous import
DoCmd.RunSQL ("DELETE * FROM ImportDataTable;")
' Pull the spreadsheet into the staging table
DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12, "ImportDataTable", "C:\exceldata.xls"
The second line in Access VBA imports the data into the table called ImportDataTable (the ImportDataTable column names should be F1, F2, F3, etc.).
Then use an append query (INSERT INTO) for each table that some of the ImportDataTable data needs to go into. All this code can be put behind a button on a form so that the user(s) only need to press a button when new data is available.
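For illustration, the append step for the id table might look like this, shown via Python/pyodbc so it is runnable outside Access; inside Access the same SQL goes straight into an append query or a DoCmd.RunSQL call. The database path and the F1/F2/F3 mapping are assumptions about the sheet's column order:

import pyodbc

# Path and driver name are placeholders for a local Access database.
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\company.accdb")
cur = conn.cursor()

# Append the flat company columns; Access assigns the autonumber companyID.
cur.execute("""
    INSERT INTO id (company, firstName, lastName)
    SELECT F1, F2, F3 FROM ImportDataTable
""")
conn.commit()
# The category and salesmen tables are filled the same way, after joining
# ImportDataTable back to id on the company columns to pick up companyID.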
