`
byebyegov5qq
  • 浏览: 1220235 次
文章分类
社区版块
存档分类
最新评论

转alter index rebuild online引发的血案

 
阅读更多
‘早上起来没有一个人……‘,伸手抓起手机,‘喂,……应用hang住了……rac的一节点在手工shutdown……需要到现场……’。赶紧穿上衣服,拿起电脑往客户办公室赶。路上还接到客户电话具体描述了下故障现象。
原来是客户在rac的其中一个节点进行index rebuild online操作,结果命令发出去不久应用就hang住了,没有办法的客户就直接用了大绝招:把操作的节点数据库重启了,现在的问题是shutdown immediate也hang住了。听到这里,马上喊他看日志,回复日志没报错,郁闷的shutdown,如果有大事务在运行,岂不是要rollback 死,由于此应用级别很高,等不起,只能死马当活马医了。不管了,先让它关了在说,先直接操作系统杀客户端进程,使资源能更快的释放。于是,指导输入命令 ps –ef|grep oracle|grep LOCAL=NO|grep –v grep|awk ‘{awk print $2}’|xargs kill -9。系统终于关闭了,直接startup……
等我到现场时,系统居然还没启动起来,难道真的有大事务需要恢复,难道就是重建索引?
登录服务器查看,日志显示数据库正在recover,想起经常的recover几个小时的情况,我再次对贸然的shutdown很是郁闷。不过好在是 rac,可客户反应即使连接到好的节点还是不能使用,登录查看,检查锁,检查等待。发现存在大量的锁,而且有一个锁已经2个月了,恰恰就是重建索引的表,明显这个锁有问题,直接kill。等一会,另一个节点启动好了,问应用的情况,也恢复了,就这么简单?就是那个锁的问题?我晕,守一会业务后我就撤退了。
在路上我就在想,rebuild index online怎么会把整个应用hang死,难道它锁了整个表,怎么会呢,在印象中online的同时是可以进行dml操作的啊,奇了怪了,先睡一觉,空了再来重现故障分析……
N天过后……
先上一篇文章,关于rebuild和rebuild online的区别metalink Note:272762.1
大家耐心看完哟,原理是很重要的……
========
- Online Index rebuild takes a long time.
- ONLINE INDEX REBUILD SCANS THE BASE TABLE AND NOT THE INDEX
Symptoms:
=========
Performance issues while rebuilding very large indexes.
- The offline rebuilds of their index is relatively quick -finishes in 15 minutes.
- Issuing index rebuild ONLINE statement => finishes in about an hour.
- This behavior of ONLINE index rebuilds makes it a non-option for large tables
as it just takes too long to scan the table to rebuild the index. The offline may not be feasible due to due to the 24/7 nature of the database.
- This may be a loss of functionality for such situations.
- If we attempt to simultaneously ONLINE rebuild the same indexes we may encounter hanging behavior indefinitely (or more than 6 hours).
DIAGNOSTIC ANALYSIS:
--------------------
We can trace the sessions rebuilding the indexes with 10046 level 12.
Comparing the IO reads for the index-rebuild and the index-rebuild-online reveals the following:
-ONLINE index rebuilds
It scans the base table and it doesn't scan the blocks of the index.
-OFFLINE index rebuilds
It scans the index for the build operation.
- This behaviour is across all versions.
Cause
Cause/Explanation
=============
When you rebuild index online,
- it will do a full tablescan on the base table.
- At the same time it will maintain a journal table for DML data, which has
changed during this index rebuilding operation.
So it should take longer time, specially if you do lots of DML on the same table,while rebuilding index online.
On the other hand, while rebuilding the index without online option, Oracle will grab the index in X-mode and rebuild a new index segment by selecting the data from the old index. So here we are
- not allowing any DML on the table hence there is no journal table involved
- and it is doing an index scan
Hence it will be pretty fast.
Fix
Solution/Conclusion:
===========
- The ONLINE index rebuild reads the base table, and this is by design.
- Rebuilding index ONLINE is pretty slow.
- Rebuilding index offline is very fast, but it prevents any DML on the base table.
好,我们现在明白了,rebuild online是可以同时进行dml操作的,但是online会维护一个操作日志,会使rebuild时间大幅加长。巴拉巴拉……这一切的一切,跟我遇到的应用挂起没有任何关系啊,难道是应用骗了我,其实应用可以用?还是模拟一下当时的场景再说。
Action:
Conn oracle/oracle
--建表
create table test(a int,b varchar2(64));
--插入数据
begin
for i in 1..1000000 loop
insert into test values(i,'ok');
commit;
end loop;
end;
/
--创建索引
create index idx_a1 on test(a);

准备工作做好了……
--更新其中一条记录,制造一个排他锁,不要提交
SQL> update test set a=1000000 where a=103;
--打开另一个窗口,在线重建索引
SQL> alter index idx_a1 rebuild online;
……
发现此命令长时间不能执行完,没关系,大表的索引重建是需要大量时间的,查下等待和锁吧。

sys@ORCL_SQL> select sid,event,P1TEXT,state from v$session_wait where event not in ('SQL*Net message from client');

SID EVENT P1TEXT STATE
------- ------------------------------ ------------------------------ --------------------------------------
139 enq: TM - contention name|mode WAITING
149 Streams AQ: waiting for time m WAITING
anagement or cleanup tasks

151 Streams AQ: qmn coordinator id WAITING
le wait

155 rdbms ipc message timeout WAITING


sys@ORCL_SQL> select * from v$lock where block>0;

ADDR KADDR SID TYPE ID1 ID2Lock HeldLock Req. CTIME BLOCK
-------- -------- ------- ---- -------- -------- ---------- ---------- ---------- ----------
315C4134 315C414C 154 TM 52543 0 ########## ########## 24 1
看到木有,重建根本就没有进行,而是在等待sid154,这个sid就是我们开始的update的命令,看来rebuild online是不会阻碍dml操作,但是在它之前的dml操作它会去等待,知道资源释放,如果这时有个大事务一直不释放资源,那就恭喜了,你的重建就会 hang在这里。慢着,这里hang住了对应用不会有影响吧,想当然是,马上测试
又打开个新连接
SQL> update test set a=1000000 where a=102;
……
1分钟过去了,此命令没有成功,真的hang住了,整个表被锁住了?
再次检查锁

WAITING_SESSION LOCK_TYPE MODE_REQUESTED MODE_HELD LOCK_ID1 LOCK_ID2
----------------- ----------------- ---------------------------- ---------------------------- ----------------- -----------------
154 None
139 DML Share Row-X (SX) 52543 0
142 DML Row-X (SX) Row-X (SX) 52543 0
这里解释一下,154是我们最开始update的sid,139是rebuild online的sid,而142是我们刚才update的sid,看来它们都在等sid154,也就是说,这个时候我不能对此表做dml操作了,也就是对于此操作非常频繁的表来说,应用已经hang住了!!
看来rebuild online之前一定要检查下系统,其实是做所有操作之前都要这样做,确认系统没问题后再做操作。
在贴下锁的具体情况,大家看rebuild操作的139有几个行排他锁,我怀疑是它需要先把所有的数据都用排他锁锁住,然后其它dml的操作它会记入辅助日志,而我这里的情况是有个行锁一直存在,就造成了它把其他行都锁住了,但一直等待最开始的那个行锁释放,这个时候如果我开始那个锁不释放,系统也就挂在这里了!!
SID SERIAL# USERNAME Term Resource Type TAB OWNER ID1 ID2 Lock Held Lock Req. CTIME
------- ------- ---------- ------ -------------------- ---------- -------- -------- -------- -------------- -------------- ----------
139 11 ORACLE pts/1DLSystem TEST ORACLE 52543 0 Row Exclusive 2958
139 11 ORACLE pts/1DLSystem TEST ORACLE 52543 0 Row Exclusive 2958
154 9 ORACLE pts/2TM - DML Enqueue TEST ORACLE 52543 0 Row Exclusive 2967
139 11 ORACLE pts/1TM - DML Enqueue TEST ORACLE 52543 0 Row Share Share 2958
139 11 ORACLE pts/1TM - DML Enqueue SYS_JOURNA ORACLE 52558 0 Share 2957
L_52544

142 33 ORACLE pts/4TM - DML Enqueue TEST ORACLE 52543 0 Row Excl 2844


sys@ORCL_SQL> select sql_text from v$sqltext where hash_value=&hash_value order by piece;
Enter value for hash_value: 1016442092

SQL_TEXT
alter index idx_a1 rebuild online

忘记说版本了,是10.2.0.1
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics