Dataframe loses its contents after writing to database - apache-spark

We had working code as below.
print(f"{file_name} Before insert count", datetime.datetime.now(), scan_df_new.count())
scan_df_new.show()
scan_20220908120005_10 Before insert count 2022-09-14 11:37:15.853588 3
+-------------------+----------+-------------------+--------------------+----------+
| tran_id|t_store_id| scan_datetime| customer_id|updated_by|
+-------------------+----------+-------------------+--------------------+----------+
|1230000000000000004| 4395|2022-09-08 03:00:01|20220816a51cee4264f1|Databricks|
|1230000000000000005| 4394|2022-09-08 02:58:00|20220816a51cee4264f1|Databricks|
|1230000000000000006| 4393|2022-09-08 03:00:04|20220816a51cee4264f1|Databricks|
+-------------------+----------+-------------------+--------------------+----------+
The data frame is used for further business logic processing after the write operation. This was working earlier, but recently we are observing a strange behavior where the data in the data frame gets lost. When we check the contents, or even the dataframe count, it shows empty.
scan_df_new.write.format("jdbc").option("url", jdbcUrl).option("dbtable", scan_table).mode("append").save()
print(f"{file_name} After insert count", datetime.datetime.now(), scan_df_new.count())
scan_df_new.show()
None
scan_20220908120005_10 After insert count 2022-09-14 11:37:18.372147 0
+-------+----------+-------------+-----------+----------+
|tran_id|t_store_id|scan_datetime|customer_id|updated_by|
+-------+----------+-------------+-----------+----------+
Has anything recently changed in Databricks that could be impacting this?
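For what it's worth, Spark DataFrames are lazily evaluated, so the count()/show() calls after the write recompute the DataFrame from its source, which may have changed by then. Whether that is the actual cause here is an assumption, not something the post confirms, but a minimal sketch of pinning the contents down before the write (reusing the variables from the snippet above) would look like this:
import datetime

# Sketch only: scan_df_new, file_name, jdbcUrl and scan_table are the variables from the post.
scan_df_new = scan_df_new.cache()   # keep the computed rows around
print(f"{file_name} Before insert count", datetime.datetime.now(), scan_df_new.count())  # count() materializes the cache

scan_df_new.write.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", scan_table) \
    .mode("append") \
    .save()

# These now read the cached rows instead of recomputing from the source.
print(f"{file_name} After insert count", datetime.datetime.now(), scan_df_new.count())
scan_df_new.show()
scan_df_new.unpersist()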

Related

How to append to error column when using the Macro Design (before change) inside of MS Access?

I have an MS Access database that receives hundreds of incoming records. I need a way to validate that the incoming data is consistent with the business logic. For example, when we get a record, the State column should be "California"; otherwise an error should be appended to an Error column that specifies why it failed. And for that same record, if the Income is less than $1,000,000, an error should be appended for that too.
I found that in MS Access, with your table highlighted, you can click Table > Before Change in the top bar to create If/Then logic for incoming rows. If there is a better way to accomplish this task, please let me know.
Once I am inside the "Before change" window I then write this logic out:
If [State] <> "California" Then
    SetField
        Name  Error
        Value = "Failed on incorrect state"
Else If [Income] < 1000000 Then
    SetField
        Name  Error
        Value = "Failed on incorrect income"
End If
When multiple errors occur, such as both an incorrect state and an incorrect income, it only shows the first error. Is there a way to have both errors appended to the same Error column?
Thanks
You can try using an AND condition in the If section.
If [State] <> "California" And [Income] < 1000000 Then
    ' Write your logic here, such as setting the Error value to "Incorrect Value".
Hope it works for you.

Excel removes my query connection on its own and gives me several error messages

I know that this is a really long post, but I'm not sure which part of my process is making my file crash, so I tried to detail everything I did to get to the error messages.
First of all, I created a query in Kusto. It looks something like this, but in reality it is 160 lines of code; this is just a summarized version of what my code does, to show my working process.
First, in Session_Id_List I create a list of all distinct session IDs from the past day.
Then, in treatment_alarms1 I count the number of alarms for each type of alarm that was active during each session.
Then, in treatment_alarms2 I create a list which might look something like this:
1x Alarm_Type_Number1
30x Alarm_Type_Number2
7x Alarm_Type_Number3
and like that for each treatment, so I have a list of all alarms that were active for that treatment.
Lastly, I create a left outer join of Session_Id_List and treatment_alarms2. This means that I will be shown all of the treatment IDs, even the ones that did not have any active alarms.
let _StartTime = ago(1d);
let _EndTime = ago(0d);
let Session_Id_List = Database1
| where StartTime >= _StartTime and StartTime <= _EndTime
| summarize by SessionId, SerialNumber, StartTime
| distinct SessionId, StartTime, SerialNumber;
let treatment_alarms1 = Database1
| where StartTime >= _StartTime and StartTime <= _EndTime and TranslatedData_Status == "ALARM_ACTIVE"
| summarize number_alarms = count() by TranslatedData_Value, SessionId
| project final_Value = strcat(number_alarms, "x ", TranslatedData_Value), SessionId;
let treatment_alarms2 = Database1
| where StartTime >= _StartTime and StartTime <= _EndTime and TranslatedData_Status == "ALARM_ACTIVE"
| join kind=inner treatment_alarms1 on SessionId
| summarize list_of_alarms_all = make_set(final_Value) by SessionId
| project SessionId, list_of_alarms_all;
let final_join = Session_Id_List
| join kind=leftouter treatment_alarms2 on SessionId;
final_join
| project SessionId, list_of_alarms_all
Then I put this query into Excel using the following method:
I go to Tools -> Query to Power BI on Kusto Explorer
I go to Data -> Get Data -> From Other Sources -> Blank Query
I go to advanced editor
I copy and paste my query and press "Done" at the bottom
As you can see, the preview of my data shows "List" in the list_of_alarms_all column rather than the actual values of the list.
To fix this issue I first press the arrows on the header of the column
I press on "Extract Values"
I select Custom -> Concatenate using special characters -> Line Feed -> Press OK
That works fine for all of the IDs that do have alarms on them: it shows them as a list and tells me how many there are. The issue is with the IDs that did not have any active alarms, where I get "Error" in the Excel preview. Once I press "Close & Load" the data is put on the worksheet and it looks fine; the "Error" values are all gone and instead I get empty cells where they would have been.
The problem now starts when I close the file and try to open it again.
First I get this message. So I press yes to try and enter the file.
Then I get this other message. The problem with this message is that it says that I have the file open when that is not true. I even tried to restart my laptop and open the file again and I would still get the message when in reality I don't have that file open.
Then I get this message, telling me that the connection to the query was removed.
So my problems here are that 1) I can't edit the file anymore unless I make a copy, because I keep getting the message saying that I already have the file open and it is locked for editing, and 2) I would like to refresh this query with VBA maybe once a week from now on, but I can't because when I save the file the connection to the query is deleted by Excel itself.
I'm not sure why this is happening; I'm guessing it's because of the "Error" values I get in the empty cells when I try to extract the values from the lists. If anybody has any information on how I can fix this so I don't get these error messages, please let me know.
I was not able to reproduce your issue, however there are some things you might want to try.
Within ADX, you could wrap your query in a function, so you won't have to copy a large piece of code into Excel.
You could deal with null values (this is what gives you the Error values) already in your query. Note the use of coalesce.
// I used datatable to mimic your query results
.create-or-alter function SessionAlarms()
{
    datatable (SessionId:int, list_of_alarms_all:dynamic)
    [
        1, dynamic([10,20,30]),
        2, dynamic([]),
        3, dynamic(null)
    ]
    | extend list_of_alarms_all = coalesce(list_of_alarms_all, dynamic([]))
}
You can use the Power Query ADX connector and copy your query/function as is.
If you haven't dealt with null values in your KQL, you can take care of the errors in Excel by using Replace Errors.

Trying to check if value is in sqlite3 with Python

I am trying to check if a value is in SQLite with Python, to then either update the table if the value exists or insert a new value if it does not. I have tried creating a cursor to check rows, appending the rows to a list with a loop, checking if the value exists, checking the count of the rows... It seems to get hung up on the if statement when trying to access the value returned by the query. Here is the code:
checkT = db.execute("SELECT COUNT(*) FROM trans WHERE stock=:stock AND id=:user_id",
                    stock=request.form.get("symbol"), user_id=session["user_id"])
if checkT > 0:
    print("there")
else:
    print("not there")
How can I fix this? Thank you!
From the CS50 Library for Python doc for execute
Returns
for SELECTs, a list of dict objects, each of which represents a row in the result set; for INSERTs, the primary key of a newly inserted row (or None if none); for UPDATEs, the number of rows updated; for DELETEs, the number of rows deleted; for CREATEs, True on success; on error, a RuntimeError is raised
checkT is a list with one element, which is a dict with one key/value pair.
The expression checkT[0]['COUNT(*)'] will give the number returned from the SQL. Counting the rows would not be appropriate in this case, because this query always returns exactly one row.
One hint: column names in a SELECT can be aliased (given a different name), like so:
SELECT COUNT(*) AS count FROM ....... It is just a typing convenience: the key in the returned dict will then be count instead of COUNT(*).
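Putting those pieces together, a minimal sketch of the check (reusing the db, request.form and session objects from the question's snippet, plus the alias described above):
rows = db.execute("SELECT COUNT(*) AS count FROM trans WHERE stock=:stock AND id=:user_id",
                  stock=request.form.get("symbol"), user_id=session["user_id"])
# rows is a list holding a single dict, e.g. [{"count": 3}]
if rows[0]["count"] > 0:
    print("there")
else:
    print("not there")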
Remember: in the flask run terminal there is a traceback which gives more detailed information on the error received (assuming "hung up" means a 500 Internal Server Error).

How do I achieve this in Apache Spark Java or Scala?

A device on a car will NOT send a TRIP ID when the trip starts, but will send one when the trip ends. How do I apply the corresponding TRIP IDs to the corresponding records?
09:30,25,DEVICE_1
10:30,55,DEVICE_1
10:25,0,DEVICE_1,TRIP_ID_0
11:30,45,DEVICE_1
10:30,55,DEVICE_2
10:30,55,DEVICE_3
11:30,45,DEVICE_3
12:30,0,DEVICE_3,TRIP_ID_3
10:30,55,DEVICE_4
11:30,45,DEVICE_4
11:30,45,DEVICE_2
12:30,0,DEVICE_2,TRIP_ID_2
12:30,0,DEVICE_4,TRIP_ID_4
10:30,55,DEVICE_5
11:30,45,DEVICE_5
12:30,0,DEVICE_5,TRIP_ID_5
12:30,0,DEVICE_1,TRIP_ID_1
So the above becomes like this,
09:30,25,DEVICE_1,TRIP_ID_0
10:25,0,DEVICE_1,TRIP_ID_0
10:30,55,DEVICE_1,TRIP_ID_1
11:30,45,DEVICE_1,TRIP_ID_1
12:30,0,DEVICE_1,TRIP_ID_1
10:30,55,DEVICE_2,TRIP_ID_2
11:30,45,DEVICE_2,TRIP_ID_2
12:30,0,DEVICE_2,TRIP_ID_2
10:30,55,DEVICE_3,TRIP_ID_3
11:30,45,DEVICE_3,TRIP_ID_3
12:30,0,DEVICE_3,TRIP_ID_3
10:30,55,DEVICE_4,TRIP_ID_4
11:30,45,DEVICE_4,TRIP_ID_4
12:30,0,DEVICE_4,TRIP_ID_4
10:30,55,DEVICE_5,TRIP_ID_5
11:30,45,DEVICE_5,TRIP_ID_5
12:30,0,DEVICE_5,TRIP_ID_5
An interesting problem. Had to fix one bug!
You will need to convert this to spark.sql, as I tried it in Oracle, but the WITH clause is supported in spark.sql. Also, instead of using date strings, I just used numbers to represent time (it was quite late), so you will need to adjust for that.
But here is the SQL that you can adapt.
with X as (select device, time_asc, trip_id from trips where trip_id is not null)
select Y.TRIP_ID, Y.DEVICE, Y.TIME_ASC FROM (
select T1.TIME_ASC, T1.DEVICE, X.TRIP_ID, X.TIME_ASC AS TIME_ASC_COMPARE
,RANK() OVER (PARTITION BY T1.TIME_ASC, T1.DEVICE ORDER BY X.TIME_ASC) AS RANK_VAL from trips T1, X
where T1.DEVICE = X.DEVICE
and T1.TIME_ASC <= X.TIME_ASC) Y
where RANK_VAL = 1
order by TRIP_ID, TIME_ASC
Get rid of the ORDER BY; it is just there to show the result.
This data as input:
('1','A',null);
('2','A','TRIP_01');
('5','A',null);
('6','A',null);
('7','A',null);
('23','A','TRIP_02');
('56','A',null);
('60','A','TRIP_04');
('8','B',null);
('10','B','TRIP_03');
('1','E',null);
('2','E','TRIP_05');
Remove the quotes (I exported and got this format). It returns the following, which I think will meet your needs - again, excuse the formatting:
('TRIP_01','A','1');
('TRIP_01','A','2');
('TRIP_02','A','5');
('TRIP_02','A','6');
('TRIP_02','A','7');
('TRIP_02','A','23');
('TRIP_03','B','8');
('TRIP_03','B','10');
('TRIP_04','A','56');
('TRIP_04','A','60');
('TRIP_05','E','1');
('TRIP_05','E','2');
I am wondering how well Spark handles this under the hood, performance-wise. This took some effort late at night, so some appreciation is sought. It was enjoyable as well.
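The question asks for Java or Scala, but since the answer boils down to a single SQL statement, here is a rough PySpark sketch of running it through spark.sql against a temp view; the same spark.sql call exists on a Scala or Java SparkSession. The DataFrame below only mimics a slice of the test data above, and the comma join has been rewritten as an explicit JOIN ... ON.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mimic part of the answer's test data: (time_asc, device, trip_id).
trips = spark.createDataFrame(
    [(1, "A", None), (2, "A", "TRIP_01"), (5, "A", None), (6, "A", None),
     (7, "A", None), (23, "A", "TRIP_02"), (8, "B", None), (10, "B", "TRIP_03")],
    ["time_asc", "device", "trip_id"])
trips.createOrReplaceTempView("trips")

# The answer's SQL, adapted for spark.sql.
assigned = spark.sql("""
    WITH X AS (SELECT device, time_asc, trip_id FROM trips WHERE trip_id IS NOT NULL)
    SELECT Y.trip_id, Y.device, Y.time_asc FROM (
        SELECT T1.time_asc, T1.device, X.trip_id,
               RANK() OVER (PARTITION BY T1.time_asc, T1.device ORDER BY X.time_asc) AS rank_val
        FROM trips T1 JOIN X ON T1.device = X.device AND T1.time_asc <= X.time_asc) Y
    WHERE rank_val = 1
""")
assigned.show()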

Inserting data into database with python/sqlite3 by recognising the column name

I've got a problem that I don't know how to solve. I've tried many solutions but I always get that OperationalError: near...
def insert_medicine_to_table():
    con = sqlite3.connect('med_db3.db')
    cur = con.cursor()
    table_name = 'medicines'
    column_name = "présentation"
    value = 'Boîte de 2 seringues pré-remplies'
    cur.execute("INSERT INTO medicines {} VALUES (?)".format(column_name), value)
    con.commit()
    con.close()
sqlite3.OperationalError: near "présentation": syntax error
The goal here is that either the script or python has to recognize the field (column name) and insert the value into "that" field, like the following:
fields = ['présentation', 'princeps', 'distributeur_ou_fabriquant', 'composition', 'famille', 'code_atc', 'ppv', 'prix_hospitalier', 'remboursement', 'base_de_remboursement__ppv', 'nature_du_produit']
values = ['Boîte de 2 seringues pré-remplies', 'Oui', 'SANOFI', 'Héparine', 'Anticoagulant héparinique', 'B01AB01', '43.80', '27.40', 'Oui', '43.80', 'Médicament']
That is one entry in the database. The problem here is that other entries may or may not have values for some of the fields, and the fields are not presented in the same order in other entries.
It has to recognize each field in the database table and insert each value into the right column.
The problem causing your error is that your SQL isn't valid. The statement you are trying to execute is:
INSERT INTO medicines présentation VALUES (?)
The statement you want to execute is:
INSERT INTO medicines ("présentation") VALUES (?)
As far as your larger question is concerned, if you create both the list of columns ("présentation") and list of parameter markers (?) and build the query using them, you're most of the way there.
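For illustration, a minimal sketch of that idea, reusing a trimmed version of the fields and values lists from the question (it assumes the medicines table already has these columns):
import sqlite3

fields = ['présentation', 'princeps', 'distributeur_ou_fabriquant']
values = ['Boîte de 2 seringues pré-remplies', 'Oui', 'SANOFI']

# One quoted column name and one parameter marker per supplied value.
columns = ", ".join('"{}"'.format(f) for f in fields)
placeholders = ", ".join("?" for _ in values)
sql = "INSERT INTO medicines ({}) VALUES ({})".format(columns, placeholders)

con = sqlite3.connect('med_db3.db')
cur = con.cursor()
cur.execute(sql, values)
con.commit()
con.close()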
If a field can have multiple values supplied for each "entry" in your database, you may need to change your database design to handle that. You'll at least need to figure out how you want to handle the situation, but that would be a matter for a different question.
