Same transaction returns different results when run multiple times - TiDB

When I was using TiDB, I found something strange when I ran two transactions at the same time. I was expecting to get the same value, 2, every time, like MySQL does, but what I got was 0, 2, 0, 2, 0, 2...
For both databases, tx_isolation is set to 'read-committed'. So it seemed reasonable to me that the select statement should return 2, since that value had already been committed.
Here's the test code:
for i in range(10):
    conn1 = mysql.connector.connect(host='',
                                    port=4000,
                                    user='',
                                    password='',
                                    database='',
                                    charset='utf8')
    conn2 = mysql.connector.connect(host='',
                                    port=4000,
                                    user='',
                                    password='',
                                    database='',
                                    charset='utf8')
    cur1 = conn1.cursor()
    cur2 = conn2.cursor()
    conn1.start_transaction()
    conn2.start_transaction()
    cur2.execute("update t set b=%d where a=1" % 2)
    conn2.commit()
    cur1.execute("select b from t where a=1")
    a = cur1.fetchone()
    print(a)
    cur1.execute("update t set b=%d where a=1" % 0)
    conn1.commit()
    cur1.close()
    cur2.close()
    conn1.close()
    conn2.close()
The table t is created like this:
CREATE TABLE `t` (
`a` int(11) NOT NULL AUTO_INCREMENT,
`b` int(11) DEFAULT NULL,
PRIMARY KEY (`a`)
)
and (1,0) is inserted initially.

First of all:
TiDB (as of the latest version) only supports the SNAPSHOT transaction isolation level, so a transaction can only see data that was committed before the transaction started. TiDB also will not apply an update that writes the same value inside a transaction, much like MySQL, SQL Server, etc.
MySQL, when using the READ COMMITTED isolation level, reads the latest committed data, so it will see data committed by other transactions.
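As a quick check of what each connection is actually using (a sketch; it assumes the server exposes the MySQL-compatible @@tx_isolation system variable), you can print the reported isolation level on both connections:
# Sketch: print the isolation level each connection reports. Assumes conn1 and
# conn2 from the test code and a server that exposes the MySQL-compatible
# @@tx_isolation system variable.
for name, conn in (("conn1", conn1), ("conn2", conn2)):
    cur = conn.cursor()
    cur.execute("SELECT @@tx_isolation")
    print(name, cur.fetchone()[0])
    cur.close()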
So, walking through your code snippet:
TiDB round 1 workflow:
T1                 T2
+--------------------+
| transaction start  |
| (b = 0)            |
+---------+----------+
          |
          |       +------------------------------+
          | <-----+ update `b` to 2, and commit  |
          |       +------------------------------+
          |
          v
+---------+---------------+
| select b should be 0,   |
| since TiDB only sees    |
| data committed before   |
| the transaction started |
+---------+---------------+
          |
          v
+------------------------------+
| update b to 0                |
| (0 equals the value at the   |
| start of the transaction,    |
| so TiDB skips this update)   |
+------------------------------+
          |
          v
+--------------------------+
| so finally `b` will be 2 |
+--------------------------+
TiDB round 2 workflow:
T1                 T2
+--------------------+
| transaction start  |
| (b = 2)            |
+---------+----------+
          |
          |       +------------------------------+
          | <-----+ update `b` to 2, and commit  |
          |       +------------------------------+
          |
          v
+---------+---------------+
| select b should be 2,   |
| since TiDB only sees    |
| data committed before   |
| the transaction started |
+---------+---------------+
          |
          v
+------------------------------+
| update b to 0                |
| (0 is not equal to 2, so the |
| update is applied)           |
+------------------------------+
          |
          v
+--------------------------+
| so finally `b` will be 0 |
+--------------------------+
So TiDB will output something like:
0, 2, 0, 2, 0, 2...
MySQL workflow:
T1                 T2
+--------------------+
| transaction start  |
| (b = 0)            |
+---------+----------+
          |
          |       +------------------------------+
          | <-----+ update `b` to 2, and commit  |
          |       +------------------------------+
          |
          v
+---------+------------------------------+
| select b should be 2, since under the  |
| READ COMMITTED isolation level it      |
| reads the latest committed data        |
+---------+------------------------------+
          |
          v
+--------------------+
| update b to 0      |
+--------------------+
          |
          v
+--------------------------+
| so finally `b` will be 0 |
+--------------------------+
So MySQL will continuously output:
2, 2, 2, 2...
A final word:
I think it is strange for TiDB to skip an update to the same value inside a transaction while an update to a different value succeeds; for example, if we update b to a different value in the loop, we always see the latest changed b.
So it would probably be better to keep the behavior consistent between same-value and different-value updates.
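For example, a variant of the test loop that writes a distinct value on every iteration (a sketch only; connection setup as in the question, and the values chosen here are arbitrary) never hits the skipped-update case:
# Sketch: same loop as the question, except conn1 writes a distinct value each
# iteration, so the update is never treated as a no-op. conn1 and conn2 are
# assumed to be set up as in the test code above.
for i in range(10):
    cur1 = conn1.cursor()
    cur2 = conn2.cursor()
    conn1.start_transaction()
    conn2.start_transaction()
    cur2.execute("update t set b=%d where a=1" % 2)
    conn2.commit()
    cur1.execute("select b from t where a=1")
    print(cur1.fetchone())
    cur1.execute("update t set b=%d where a=1" % (i + 10))  # distinct value, not always 0
    conn1.commit()
    cur1.close()
    cur2.close()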
I have created an issue for this:
https://github.com/pingcap/tidb/issues/7644
References:
https://github.com/pingcap/docs/blob/master/sql/transaction-isolation.md

Related

Find cell address of value found in range

tl;dr In Google Sheets/Excel, how do I find the address of a cell with a specified value within a specified range, where the value may be in any row or column?
My best guess is
=CELL("address",LOOKUP("My search value", $search:$range))
but it doesn't work. When it finds a value at all, it returns the rightmost column every time, rather than the column of the cell it found.
I have a sheet of pretty, formatted tables that represent various concepts. Each table consists of
| Title |
+------+------+-------+------+------+-------+------+------+-------+
| Sub | Prop | Name | Sub | Prop | Name | Sub | Prop | Name |
+------+------+-------+------+------+-------+------+------+-------+
| Sub prop | value | Sub prop | value | Sub prop | value |
+------+------+-------+------+------+-------+------+------+-------+
| data | data | data | data | data | data | data | data | data |
| data | data | data | data | data | data | data | data | data |
⋮
I have 8 such tables of variable height arranged in a grid within the sheet 3 tables wide and 3 tables tall except the last column which has only 2 tables--see image. These fill the range C2:AI78.
Now I have a table off to the right consisting in AK2:AO11 of
| Table title | Table title address | ... |
+---------------+-----------------------+-----+
| Table 1 Title | | ... |
| Table 2 Title | | ... |
⋮
| Table 8 Title | | ... |
I want to fill out the Table title address column. (Would it be easier to do this manually for all of 8 values? Absolutely. Did I need to in order to write this question? Yes. But using static values is not the StackOverflow way, now, is it?)
Based on very limited Excel/Google Sheets experience, I believe I need to use CELL() and LOOKUP() for this.
=CELL("address",LOOKUP($AK4, $C$2:$AI$78))
This retrieves the wrong value. For AL4 (looking for value Death Wave), LOOKUP($AK4, $C$2:$AI$78) should retrieve cell C2 but it finds AI2 instead.
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $AI$3 | 3 | | 15 |
| Poison Swamp | $AI$30 | | | |
| Smart Missiles | $AI$56 | | | |
| Black Hole | #N/A | 1 | | |
| Inner Land Mines | $AI$3 | | | |
| Chain Lightning | #N/A | | | |
| Golden Tower | $AI$3 | | | |
| Chrono Field | #N/A | 25 | | |
The error message for the #N/A cells is
Did not find value '<Table Title>' in LOOKUP evaluation.
My expected table is
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $C$2 | 3 | | 15 |
| Poison Swamp | $C$28 | | | |
| Smart Missiles | $C$54 | | | |
| Black Hole | $O$2 | 1 | | |
| Inner Land Mines | $O$28 | | | |
| Chain Lightning | $O$54 | | | |
| Golden Tower | $AA$2 | | | |
| Chrono Field | $AA$39 | 25 | | |
try:
=INDEX(ADDRESS(
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&ROW(D2:F4)), "​"), 2, ),
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&COLUMN(D2:F4)), "​"), 2, ), 4))
or if you want to create jump links:
=INDEX(LAMBDA(x, HYPERLINK("#gid=1273961649&range="&x, x))(ADDRESS(
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&ROW(D2:F4)), "​"), 2, ),
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&COLUMN(D2:F4)), "​"), 2, ), 4)))
Try this:
=QUERY(
FLATTEN(
ARRAYFORMULA(
IF(
C:AI=$AK4,
ADDRESS(ROW(C:AI), COLUMN(C:AI)),
""
)
)
), "
SELECT
Col1
WHERE
Col1<>''
"
, 0)
Basically, cast every cell in the search range to its address if it equals the search term. Then flatten that 2D range and keep only the non-empty results.

Pandas merge gives two different results with the same code and input data

I have two dataframes to merge. When I run the program with the same input data and code, there are two possible outcomes: first, a successful merge; second, the columns that come from 'annotate' in the merged data are NaN.
raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
Then I ran a test:
count = 10001
while (count > 10000):
    raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"]=="unkown"])
    print(count)
If the merge fails, it keeps failing for the entire run. I have to resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df']
The failed result:
| type | gene | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unknow | 0610040J01Rik | chr5:63812494-63899619 | Ctrl | SPION10 | OK | 2.02125 | 0.652688 |
| unknow | 1110008F13Rik | chr2:156863121-156887078 | Ctrl | SPION10 | OK | 87.7115 | 49.8795 |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
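One way to narrow down where the NaNs come from (a sketch, assuming annotate and raw_df as in the question) is to let merge tag each row's origin and to compare the 'gene' key columns directly:
# Sketch: diagnose the failing merge by tagging row provenance and comparing
# the 'gene' keys of the two frames.
import pandas as pd

check = pd.merge(annotate, raw_df, on='gene', how='right', indicator=True)
print(check['_merge'].value_counts())                    # 'right_only' rows got NaN from annotate
print(annotate['gene'].dtype, raw_df['gene'].dtype)      # a dtype mismatch would break the join
print(len(set(raw_df['gene']) - set(annotate['gene'])))  # genes missing from annotate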
I have a solution, but I still don't know what caused the problem:
Set the column I want to merge on as the index in both dataframes, then merge the two dataframes on the index.
I have run the script more than 10 times and the result is no longer wrong.
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by index (note: merge the indexed frames, not the originals)
DataMerge = pd.merge(DataQiime_index, DataTranseq_index, left_index=True, right_index=True, how="inner")
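Applied to the frames from the question, the same index-based approach would look something like this (sketch only):
# Sketch: merge on the 'gene' index instead of the 'gene' column.
raw_df2 = pd.merge(
    annotate.set_index('gene'),
    raw_df.set_index('gene'),
    left_index=True, right_index=True, how='right'
).fillna("unkown")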

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, and a count how often these occur and the name of the column.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read this data once per column (there are actually about 400 such columns).
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars is not None:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1  # advance to the next evar column
display(df_evars)  # display() is the Databricks notebook helper
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple way (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values. Which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col
melt(
df.select([col(c).cast("string") for c in df.columns]),
id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
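For reference, a sketch of the kind of melt helper that answer defines (the signature and defaults here are assumptions, not copied from the original post):
from typing import Iterable
from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df: DataFrame, id_vars: Iterable[str], value_vars: Iterable[str],
         var_name: str = "variable", value_name: str = "value") -> DataFrame:
    # Turn each value column into a (column-name, value) struct, explode the
    # array of structs, and keep the id columns alongside the long-format pair.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)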

How can I do multiple concurrent insert transactions against postgres without causing a deadlock?

I have a large dump file that I am processing in parallel and inserting into a postgres 9.4.5 database. There are ~10 processes that all are starting a transaction, inserting ~X000 objects, and then committing, repeating until their chunk of the file is done. Except they never finish because the database locks up.
The dump contains 5 million or so objects, each object representing an album. An object has a title, a release date, a list of artists, a list of track names, etc. I have a release table for these objects (whose primary key comes from the object in the dump) and then join tables with their own primary keys for things like release_artist and release_track.
The tables look like this:
Table: mdc_releases
Column | Type | Modifiers | Storage | Stats target | Description
-----------+--------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
title | text | | extended | |
released | timestamp with time zone | | plain | |
Indexes:
"mdc_releases_pkey" PRIMARY KEY, btree (id)
Table: mdc_release_artists
Column | Type | Modifiers | Storage | Stats target | Description
------------+---------+------------------------------------------------------------------+---------+--------------+-------------
id | integer | not null default nextval('mdc_release_artists_id_seq'::regclass) | plain | |
release_id | integer | | plain | |
artist_id | integer | | plain | |
Indexes:
"mdc_release_artists_pkey" PRIMARY KEY, btree (id)
and inserting an object looks like this:
insert into release(...) values(...) returning id; -- refer to id below as $ID
insert into release_meta(release_id, ...) values ($ID, ...);
insert into release_artists(release_id, ...) values ($ID, ...), ($ID, ...), ...;
insert into release_tracks(release_id, ...) values ($ID, ...), ($ID, ...), ...;
So the transactions look like BEGIN, the above snippet 5000 times, COMMIT. I've done some googling on this and I'm not sure why what look to me like independent inserts are causing deadlocks.
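For reference, a minimal sketch of the per-process pattern described above, using psycopg2 (chunks_of_5000 and the exact column lists are placeholders, not taken from the original code):
# Sketch of one worker's batching loop. psycopg2 opens a transaction implicitly
# on the first statement and ends it on commit(), so each chunk below is one
# BEGIN ... COMMIT. chunks_of_5000() is a hypothetical iterator over this
# worker's slice of the dump.
import psycopg2

conn = psycopg2.connect(dbname="mdc", user="loader")
cur = conn.cursor()

for chunk in chunks_of_5000(objects):
    for obj in chunk:
        cur.execute(
            "insert into mdc_releases (id, title, released) values (%s, %s, %s) returning id",
            (obj["id"], obj["title"], obj["released"]))
        release_id = cur.fetchone()[0]  # the $ID referred to above
        cur.executemany(
            "insert into mdc_release_artists (release_id, artist_id) values (%s, %s)",
            [(release_id, artist_id) for artist_id in obj["artists"]])
    conn.commit()  # end of the ~5000-object transaction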
This is what select * from pg_stat_activity shows:
| state_change | waiting | state | backend_xid | backend_xmin | query
+-------------------------------+---------+---------------------+-------------+--------------+---------------------------------
| 2016-01-04 18:42:35.542629-08 | f | active | | 2597876 | select * from pg_stat_activity;
| 2016-01-04 07:36:06.730736-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:36.066837-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:36.314909-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:49.491939-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:36:04.865133-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:38:39.344163-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:36:48.400621-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:34:37.802813-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:24.615981-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:10.887804-08 | f | idle in transaction | | | BEGIN
| 2016-01-04 07:37:44.200148-08 | f | idle in transaction | | | BEGIN

Removing redundant rows in a Spark data frame with time series data

I have a Spark data frame that looks like this (simplifying timestamp and id column values for clarity):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 4 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 8 | 1 | pending |
| 9 | 1 | in-progress |
It's a time series of status events. What I'd like to end up with is only the rows representing a status change. In that sense, the problem can be seen as one of removing redundant rows - e.g. entries at times 4 and 8 - both for id = 1 - should be dropped as they do not represent a change of status for a given id.
For the above set of rows, this would give (order being unimportant):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 9 | 1 | in-progress |
Original plan was to partition by id and status, order by timestamp, and pick the first row for each partition - however this would give
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 7 | 4 | closed |
i.e. it loses repeated status changes.
Any pointers appreciated, I'm new to data frames and may be missing a trick or two.
Using the lag window function should do the trick
case class Event(timestamp: Int, id: Int, status: String)
val events = sqlContext.createDataFrame(sc.parallelize(
Event(1, 1, "pending") :: Event(2, 2, "pending") ::
Event(3, 1, "in-progress") :: Event(4, 1, "in-progress") ::
Event(5, 3, "in-progress") :: Event(6, 1, "pending") ::
Event(7, 4, "closed") :: Event(8, 1, "pending") ::
Event(9, 1, "in-progress") :: Nil
))
events.registerTempTable("events")
val query = """SELECT timestamp, id, status FROM (
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events) tmp
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id"""
sqlContext.sql(query).show
Inner query
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events
creates a table as below, where prev_status is the previous value of status for a given id, ordered by timestamp.
+---------+--+-----------+-----------+
|timestamp|id| status|prev_status|
+---------+--+-----------+-----------+
| 1| 1| pending| null|
| 3| 1|in-progress| pending|
| 4| 1|in-progress|in-progress|
| 6| 1| pending|in-progress|
| 8| 1| pending| pending|
| 9| 1|in-progress| pending|
| 2| 2| pending| null|
| 5| 3|in-progress| null|
| 7| 4| closed| null|
+---------+--+-----------+-----------+
Outer query
SELECT timestamp, id, status FROM (...)
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id
simply filters rows where prev_status is NULL (the first row for a given id) or where prev_status differs from status (there was a status change between consecutive timestamps). The ORDER BY is added just to make visual inspection easier.
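For anyone working in PySpark rather than Scala, roughly the same query expressed with the DataFrame API would look like this (a sketch, assuming an events DataFrame with the same three columns):
# Sketch: keep rows where the status differs from the previous status for the
# same id, or where there is no previous row for that id.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("id").orderBy("timestamp")

changes = (events
    .withColumn("prev_status", lag("status").over(w))
    .where(col("prev_status").isNull() | (col("prev_status") != col("status")))
    .select("timestamp", "id", "status")
    .orderBy("timestamp", "id"))

changes.show()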

Resources