12.20.2010

The Oracle 11g RACaThon Online Conference @BrainSurface

Sign-Up OR Login to attend this FREE 4-Week Webinar Series! Learn ALL about Oracle Real Application Clusters (RAC) 11g from the experts! This series covers Oracle RAC 11g from "A to Z" including Demos & Hands-On Exercises at BrainSurface The After-Oracle ConnectSpace for the Oracle & Java Community.

Oracle 11g RACaThon Session 1 - Wednesday - Jan 05, 2011 11am - 12:15pm EST

Oracle RAC Comprehensive Concepts Overview, Insight, Recommendations, Best Practices and a whole lot more
Oracle 11g RACaThon Session 2 - Wednesday - Jan 12, 2011 11am - 12:15pm EST
Oracle RAC Internals, Cache Fusion and RAC Performance Tuning

Oracle 11g RACaThon Session 3 - Wednesday - Jan 19, 2011 11am - 12:15pm EST
Build your own Oracle RAC 11gR2 Cluster over Oracle Unbreakable Linux on Oracle VM VirtualBox (Exercise, Best Practices, Tips/Tricks & Gotchas)

Oracle 11g RACaThon Session 4 - Wednesday - Jan 26, 2011 11am - 12:15pm EST
Implement an Oracle RAC 11gR2 Cluster in a Virtualized Environment over Oracle VM (Exercise, Best Practices, Tips/Tricks & Gotchas)

To get more details about this event and speakers list, click the following URL, http://www.brainsurface.com/eventsurface/icalrepeat.detail/2011/01/05/64/the-oracle-11g-racathon-online-conference-brainsurface/

Looking forward to see your particiaption.
Regards,

Jaffar

10.06.2010

A mini RMAN recovery quiz

A few months ago, I have come across of an interesting circumstances while doing an RMAN point-in-time-recovery (PITR)  on one of our business critical RAC databases. I thought it would be worthy to blog about the interesting scenario that I come across. Ok, I need to take my memories back in time to explain what had happen.
At that time, I was advised to rollback the database using the last night ONLINE RMAN backup and recovery (forward) it to a particular point in time. I was able to restore the database successfully using the last night backup.  However, while doing the recovery, RMAN was requesting archived logs that were generated almost a week back. I was surprised a little bit about this behavior as I was in a impression that RMAN looks for the archived logs that were generated after the online backup. Though I have restore and recover the database successfully to a PIT, I thought of conducting a mini quiz on this behavior to get cross your perspective on this, before I reveal the secrets  and blog about it. So, here goes your quiz:

    Under which circumstances does an RMAN recovery needs archive logs generated before the backup sets that are used in the restore operation?

Best of luck and awaits for your response....

Jaffar

9.28.2010

Beware of unethical LinkedIn connect or view pics request

Beware of an unethical LinkedIn connect or view photos request. It could be a trap to install some stupid spam and malware softwares on your PC and which screams you later  with pesky messges. In fact, myself had fallen to this trap a few days ago when I received such request from a friend to connect on the LinkedIn social networking site After I click the connect link, a malware software had installed and started giving some very bothersome messages on my PC. Indeed it really irritated me a lot and took almost 5 hours of my time to get rid off this problem.
Ensure its a genuine connect request from the LinkedIn site before you go ahead and click the link.


Jaffar

9.26.2010

kghfrunp: latch: nowait paralyzes RAC database instances

When I arrived office this morning, I received a complain from a business user that he is unable to log in on one of our business critical RAC database (v10gR2, configured with 2 instances) running on HP Superdome. Nevertheless, after the complain, I was able to establish a connection on the first instance, but, on the second instance, I couldn't, the connection just got hanged. Suspecting the blocking scenario, I then tried to run some queries on the first instance to understand the ongoing situation of the database. Alas, the query too hanged and the cursor didn't return back to the SQL prompt.
The alert.log of the second instance just shows an 'WARNING: inbound connection timed out (ORA-3136)' message. However, the instance alter.log file shows the following information:

ORA-12012: error on auto execute of job 42781
ORA-27468: "EXFSYS.RLM$SCHDNEGACTION" is locked by another process
Errors in file /u00/app/oracle/admin/XXXX/bdump/xxxx1_j000_23982.trc:
ORA-12012: error on auto execute of job 42781
ORA-27468: "EXFSYS.RLM$SCHDNEGACTION" is locked by another process


I then had a closer look at the details given in the DIAG trace file on the second node and found some valuable information, as listed below:

         waiter count=24
          gotten 8128165 times wait, failed first 518198 sleeps 267045
          gotten 83569 times nowait, failed: 232295
        possible holder pid = 24 ospid=14113
      on wait list for c0000001b8cdd6f0
    Process Group: DEFAULT, pseudo proc: c0000001bc2f4d90
    O/S info: user: oracle, term: UNKNOWN, ospid: 14391
    OSD pid info: Unix process pid: 14391, image: oracle@usogp08 (LCK0)
  waiting for c0000001b8cdd6f0 Child row cache objects level=4 child#=16
        Location from where latch is held: kghfrunp: clatch: nowait:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:

    ----------------------------------------
    SO: c0000001b5387868, type: 16, owner: c0000001bf2a61c0, flag: INIT/-/-/0x00
    (osp req holder)
Enqueue blocker waiting on 'latch: row cache objects'

After analyzing the text closely, it was loud and clear that the database indeed is suffering from the blocking problem. Instead of aborting the second instance (could be managed through sqlplus -prelim / as sysdba command though) in order to release the lock held on the database, I thought it would be better to find the lock holder session and kill it's process. I manage to get the details about the holder session from the diag file, ' possible holder pid = 24 ospid=14113'. After getting the ospid details of the possible holder session, I found that it was a job queue oracle background process. Without any second thought, I just killed the process using from the OS using  the 'kill -9 processid' command. Once process has been killed, the database was back to the normal functionality. I then took a AWR report of the second instance and it reveals the following facts about the sitaution:

top 5 timed events
latch: row cache objects     420,942 64,707 154   40.3 Concurrency
rdbms ipc reply               32,832 62,721 1,910 39.1 Other
cursor: pin S wait on X    1,898,149 25,440 13    15.8 Concurrency
row cache lock               321,355 10,961 34     6.8 Concurrency
CPU time                              3,953        2.5  


shared KQR L PO 214.04 56.62 -73.55
library cache kqlmbfre: child: in loop 0 66 175
library cache kqlmbfre: child: no obj to free 0 18 258

shared pool kghfrunp: clatch: wait   0 308,002 417,684
shared pool kghfrunp: alloc: wait    0 129,691 128,527
shared pool kghfrunp: clatch: nowait 0 125,485 0
shared pool kghfre                   0   3,043 3,386
shared pool kghalo                   0 258 777
shared pool kgh_next_free            0  79 322
shared pool kghalp                   0  18 48


The ML LCK temporarily stuck waiting for latch 'Child row cache objects' [ID 843638.1] points towards shared pool stress and below is the abstract about the cause of the problem:

The shared pool is stressed and memory need to be freed for the new cursors. As a consequence, the dictionary cache is reduced in size by the LCK process causing a temporal hang of the instance since the LCK can't do other activity during that time. Since the dictionary cache is a memory area protected clusterwide in RAC, the LCK is responsible to free it in collaboration with the dictionary cache users (the sessions using cursors referenced in the dictionary cache). This process can be time consuming when the dictionary cache is big.

After understanding the facts & figures shown in the AWR report, indeed, there was a pressure on the shared pool's dictionary cache memory component.  As suggested in the ML, I need to pay attention to the larger text SQLs and tune or need to apply a patch to avoid the embarrassing situation in the future.
Stay tune, I will post more details in the coming days.

References
BUG:8266531 - LCK STUCK WAITING FOR LATCH 'CHILD ROW CACHE OBJECTS'
BUG:8666117 - LCK0 PROCESS STUCK AT WAITING FOR "LATCH: ROW CACHE OBJECTS"


9.09.2010

Wishing you a happy and prosperous EID

Yet another a pleasing holy month of RAMADA ended today by the grace of almighty and a great day of EID celebration ahead! Hence, I would like take this opportunity to wish everyone out there a very happy,hale and prosperous EID. Enjoy your EID with your parents,family,friends,colleagues, relatives and etc.

Jaffar

8.23.2010

Ad: Interested in writing? Here is your chance

Packt Publishing is planning to expand its catalogue of Oracle books and is currently inviting Oracle Fusion Middleware experts interested in becoming published authors to get in touch.
 
You don't need any previous writing experience to write for Packt. All that Packt requires from you as an author is a good knowledge of your subject, a passion to share it with others and an ability to communicate clearly in English.
 
Here are some topics that Packt finds immediately interesting:
 
•Oracle WebLogic Suite 11g
•Oracle WebCenter Suite 11g
•Oracle Identity Management 11g
•Other components of the Fusion Middleware Suite...
•and others... 

The list of topics here is definitely not exhaustive. If you are aware of areas which need Oracle Fusion Middleware 11g books, Packt is definitely keen to hear.
So, if you love Oracle Fusion Middleware and fancy writing a book, send an overview of your book idea to author@packtpub.com. If you do not have a book idea and are simply interested in writing a book, Packt will still love to hear from you!
 
 
Best of luck..
Jaffar

7.31.2010

libjox10.so (Permission denied (errno:13)) while applying CPU patch on HPUX

Notwithstanding the successful CPU April 2010 patch deployment on development servers, we ran into some sort of troubles while applying the same on our 6 node production RAC servers on HPUX platform. At some stage whilst  installation going on, the following error has been thrown:


The following actions have failed:
WARNING:OUI-67124:Copy failed from '/u00/CPU_APRIL2010_PATCH/9352191/8568398/files/lib/libjox10.so' to '/u00/app/oracle/product/10.2.0/db_1/lib/libjox10.so'...

Previously when we use to apply the CPU or any interim patches, we use to anticipate troubles like,  'Text file busy genclntsh: Failed to remove ..;.  And this time the error seems to be a different one.  Therefore, we stopped the patch installation to make sure no Oracle binaries are running and to investigate why the copy had failed. 

In the the second attempt, the patch installation had failed again with the following error:

File not backed up from /u00/app/oracle/product/10.2.0/db_1/lib/libjox10.so to /u00/app/oracle/product/10.2.0/db_1/.patch_storage/8568398_Jun_11_2009_12_49_36/files/lib/libjox10.so... '/u00/app/oracle/product/10.2.0/db_1/.patch_storage/8568398_Jun_11_2009_12_49_36/files/lib/libjox10.so (Permission denied (errno:13))'

Apparently the problem is more clear this time, indicating permission issue. When we paid a close look at the file in the context we found that writable permission is missing from the file, as show below:


-rxr-xr-x   1 oracle     oinstall   20041776 Jun 11  2009 libjox10.so

As a precautionary step, we first copied the source file and then added write permission to the file, you may also change permission with 755, using the following chmod OS command:

chmod +w libjox10.so

Upon adding the write permission to the file, the subsequent patch installation attempt went smoothly and the patch deployed successfully.

What's next?.. Well, an another 8 node production RAC environment is awaiting for the same patch. Wish me good luck mates.
Happy reading,

Jaffar




6.30.2010

PMP exam preparation course

It has been a while since I have updated my blog. Well, here is a quick update about what I was been doing past 3 weeks. I have attend an in-house PMP exam preparation course 2 hrs a day for 3 weeks duration and it is delivered by our Manager Mr. Ahmed Darwish who is PMP certified and well known PMP instructor here in Middle East.

There were nearly 30 attendees in the course and we thoroughly enjoyed the lecture. I must say, the way he presents, explains real world scenarios and answering difficult questions was quite impeccable. Personally, I have gain thorough knowledge about project management, such as, what needs to be followed for a successful project. If I come across of an opportunity to manage a project, certainly I gonna apply all those golden rules mentioned in the PMP course.

In my next blog entry, I will try to share our group picture which we took at the end of the course.

Happy reading,

6.05.2010

Oracle 11g OCP upgrade

Friends,

A short update about my Oracle certification. On 2nd June 2010, I have successfully completed my Oracle 11g OCP upgrade exam.



Jaffar

5.09.2010

FAQ: Oracle mutexes

I have come across of an interesting FAQ about Oracle mutexes while surfing the net. The following link would take you to the FAQ: Oracle mutexes page where you can find some very good information about the Oracle mutex behavior and it's functionality:

http://www.sap2048.com/faq-oracle-mutexes-933.htm

happy reading,

Jaffar

5.01.2010

crs_stat has been deprecated in 11gR2

While reading whats new and whats deprecated in Oracle 11gR2, I came across of the following:

  • Command crs_stat has been deprecated  in 11gR2, do not use it anymore, to find out all user resource state use, $GRID_HOME/bin/crsctl stat res -t
  •  By default ora.gsd is OFFLINE if there's no 9i database in the cluster, ora.oc4j is OFFLINE in 11.2.0.1 as Database Workload Management(DBWLM) is unavailable.


By the way, in 11gR2 or above, you can use the below command to find out the clusterware process state:

  • $GRID_HOME/bin/crsctl stat res -t -init
 

4.25.2010

KJC: Wait for msg sends to complete

To investigate a query performance issues on one of our RAC database with two instances, I have initiated the 'oradebug dump errorstack 3' command for the session which was running the query. However, after few minutes I had canceled the operation as it was taking a long time and surprisingly, I found that the session is waiting for 'KJC: Wait for msg sends to complete' event. Although the event disappeared after few minutes, but it gave me an enough nightmare by that time. I didn't find any useful info about this event, but, I have to take sometime to investigate the behavior.

happy reading,

Jaffar

4.24.2010

SELECT FOR UPDATE OF" Hangs in 11.2.0.1

While going through a bug list on metalink I have come across of a ML note: "SELECT FOR UPDATE OF" Hangs in 11.2.0.1 [ID 1081727.1] and thought to blog about it which could be useful for the people who are on 11g or planning to go to 11g. Below is the extract from the note:

Applies to:  Version: 11.2.0.1 - Release: 11.2 applies to any platform. 

Symptoms "SELECT FOR UPDATE OF" Hangs in 11.2.0.1.


SQL:
SELECT table1.rowid
FROM table1 JOIN table2 t2 ON table1.id = t2.id
FOR UPDATE OF table1.text1,table1.text2;


Cause
Bug 9278351 "SELECT FOR UPDATE OF" HANGS IN 11.2.0.1
Base Bug 9294110 LEFT OUTER JOIN WITH FOR UPDATE CLAUSE HANGS FOREVER

The 10046 trace shows high 'cursor: pin S wait on X':
Solution
  Apply Patch 9294110.


happy reading

Jaffar

3.08.2010

Congratulations to Mr. Sabdar and Mr. Ben Prusinski

Many congratulations to my fellow Oracle ACEs,  Mr. Syed Sabdar Hussain (my brother) and Mr. Ben Prusinski (friend and co-author of my upcoming Oracle 11g RAC book) whose short interview published in Oracle Magazine's (March/April 2010 edition) peer-to-peer section.

3.05.2010

Book Review: Oracle 11g Streams Implementer's Guide

I know its been more than a month I had promised to write a review on this book at my blog. The fact is that I was occupied by too many production critical  issues that doesn't allow me to concentrate on any thing. Nevertheless, I took sometime time to fulfill my promise and here is my chapter-by-chapter review:

Chapter 1: All the Pieces: The Parts of an Oracle 11g Streams Environment
I strongly believe in that the basic understanding of any architecture is the most important factor for  the successful implementation.This is one of the most vital chapter of the book where it discussed thoroughly setting about various types of streams environments, such as, single database, bi-directional, master-to-master plus it also talked about the their architecture and some internal mechanics.  Some very important key factors that are involved in the implementation phase, like, the importance of network design involved between the sites, important initialization parameters that are needed to tweak for the good performance also have been discussed in this chapter.

Conclusion: The chapter contains a thorough explanation about the streams setup and architecture.

Chapter 2: Plot Your Course: Design consideration
This chapter provides the guidelines on the details to be considered during the streams design phase. It talks about the business requirement of your data, talks about the strategies on 'Who, What, when, Where, Why and How' terms. The other important key factor involved in the implementation is about the resource requirements on the server and the additional cost involved with the implementation.
Conclusion: The chapter provides the insights needed to design and the strategies, server resources and other important factors required during the stream implementation phase.

Chapter 3 : Prepare the Rafts and Secure Your Gear: The pre-work before configuring Oracle 11g Streams
This chapter talks about the the pre configuration work that is needed  for streams implementation. Where it discussed the importance of having reliable and stable network setup, important database parameters, setting up supplemental logging on the database, required Streams Administrator user and its permissions. It also suggest to have a separate tablespace for the LogMiner to avoid using SYSTEM or SYSAUX tablespaces are being used.

Conclusion: Over all, its a nice chapter discussing the pre work that is required for the streams implementation.

Chapter 4: Single-Source Configuration
This chapter demonstrated how to setup a single-source (uni-directional) streamed environment utilizing one of the convenient methods, Enterprise Manager (DB Console) wizards. Subsequently discussed the code which run behind the screen step-by-step. It also demonstrated how to schedule the jobs.

Conclusion: The first hands-on chapter for the readers.

Chapter 5: N-Way Replication
This chapter neatly explains about building N-Way replication(between multiple sources) topic, such as, pre-planning, techniques,avoiding conflicts, setup and etc.

Conclusion: It contains some very good examples which is can be very handy.

Chapter 6: Get Fancy with Streams Advanced Configurations
As the heading states, this chapter explained some of the advanced configuration features of Streams, such as, synchronizing capture, tag, rules, heterogeneous replication (basics) and etc. I particularly liked all the examples shows on its way of explanation about the advanced features.

Conclusion: A bit lengthy chapter, but, contains very useful set of examples.

Chapter 7 : Document What You Have and How It Is Working
I must totally agree with the authors on this point (preparing documents). This chapter started with a strong question, 'concern of loosing a key member of any team?'. Correct me if I am wrong saying if there is any boring part for the technical guys is the documentation part, don't you agree with my guys? The goal of this chapter is to use some of the Oracle packages to learn how to document, set up and monitor Streams environment.

Chapter 8: Dealing with the Ever Constant Tides of Change
This chapter has discussed the techniques for identifying and troubleshooting issues with Streams when there is a change or even no changes made. There is an in-depth explanation about the approach while doing any changes to the Stream environment, whether the change directly reflects the Streams environment or may impact from the outside of the environment.

Conclusion: A good proactive chapter.

Chapter 9: Appendix and Glossary
This chapter discussed various other options, such as Streams on Oracle RAC, Oracle GolddenGate and alternative method of monitoring streams.

Overall Impression:
I strongly recommend this book to fellow Oracle DBAs who are planning to setup Streams Environment or also who already implemented the streams and wanted to monitor their environment. I thoroughly enjoyed and learned many practical implementation details and insights in this book.  Finally, I thank Shernell at Packt Publications for providing me an opportunity to read and write a review for this book.

2.10.2010

Why alter system kill session IMMEDIATE is good

I am pretty sure that many of us come across of situations when a killed session by 'alter system kill session' command did put the session in 'KILLED' status and never released the session  for a long time on the database. It could be due to the fact that the session would be rolling back the ongoing transaction.
Whenever we are in such situation, we generally try to find out the OS pid (on UNIX OS) associated with the killed session (which is a bit difficult task, as the killed session paddr in v$session changes while the addr corresponding value in v$process does not), and kill the associated OS process with 'kill -9' command on the OS level.
I have found the IMMEDIATE option with the 'alter system kill session' is more useful as it writes the following information in the alert.log file after killing the session and also try to finish the things at the earliest possible to close the session from the database:

Wed Feb 10 11:02:39 2010 
Immediate Kill Session#: 515, Serial#: 36366
Immediate Kill Session: sess: c0000001be20d9f0  OS pid: 14686

As you see, it writes the time stamp when the session was killed, and also gives the associated OS pid of the killed session in the alert.log. As per Oracle documentation, 'Specify IMMEDIATE to instruct Oracle Database to roll back ongoing transactions, release all session locks, recover the entire session state, and return control to you immediately.'

Syntax:

alter system kill session 'sid,serial#' IMMEDIATE;

Regards,

Jaffar

2.02.2010

What happens when the dump file and log file names are the same during expdp?

Though it was a silly mistake, I felt discussing the funny part here.

Few days ago, I was doing an export (datapump) and come across the following error:

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORA-39001: invalid argument value
ORA-39000: bad dump file specification
ORA-31641: unable to create dump file "/tmp_backup/dpump/text.dmp"
ORA-27038: created file already exists
Additional information: 1

For the first few minutes, I was wondering what went wrong and started checking all the arguments that were passed with expdp command as the error message suggested  'invalid argument value'. After confirming everything is fine with the command and then started looking at the subsequent error messages.  My second thought was, the file could be already exits. I deleted the file and run the expdp command which result in same error.
Any guesses what could be wrong here? Well, no prize for guessing. It was a silly mistake that I had committed. I had given the similar file name to the dumpfile and logfile arguments in the expdp command.

Sometime we may overlook the basic problems.

Happy reading,

Jaffar

1.27.2010

Book Review: Oracle 11g Streams Implementers Guide

Very soon I will be receiving the 'Oracle 11g Streams Implementers Guide' ebook by Packt Publishing  and will spend some time to read the book to write the review on my blog. In the meantime, if you want to learn more about this book, you can visit here.

Happy reading,

Jaffar

1.26.2010

Advt: Oracle 11g R1/R1 Real Application Clusters Handbook


Well, its my turn to do a shameless marketing for our upcoming 'Oracle 11g R1/R2 Real Application Clusters' handbook. This book will be hitting the market some time in April 2010 and you can get more details on this book here.
Almost a year back an idea about writing a book on RAC came up in my mind and started looking for fellow co-authors plus a publishing house for our book and finally ended up writing this book for Packt Publication along with other co-authors, Ben Prusinski and Guenadi Jilevski.
One thing I must agree that writing a book requires a lot of research, patience, time and knowledge and if you are working whilst writing the book, is like a uphill task to balance the work and coping up with the chapters schedule time.Working in one of the largest RAC setup organization here in Saudi Arabia helped me tremendously to achieve this feet.
We tried to give our level best to cover most of the useful topics for the DBA who wants to administrate and maintain a complex RAC environment. Since Oracle 11gR2 is released, we had to modify a lot of contents to cover most of the useful 11gR2 features in the book. 

I will keep you update on my blog about my book and will let you know the Contents of the book as well.

Happy reading,

Jaffar

1.24.2010

Oracle bugs - Part 3(GETTING "CONNECTION NOT ACTIVE" IN EVMD.LOG )

I am back again with Part 3 of Oracle bugs series. Although this bug won't bug you much, nevertheless, the frequent appearance of the following message in the evmd.log (under the cluster home) may raise some alarms (doubts) in your mind about your network stability:



2009-11-29 15:34:04.821: [ COMMCRS][1]clscsendx: (6000000000397610) Physical connection (6000000000b09c40) not active
2009-11-29 18:17:41.748: [ COMMCRS][1]clscreceive: (6000000000397910) Connection not active
2009-11-29 18:17:41.748: [ COMMCRS][1]clscreceive: (6000000000397910) Connection not active
2010-01-23 06:51:22.669: [ COMMCRS][1]clscsendx: (6000000000398150) Physical connection (60000000009ff860) not active
2010-01-23 09:43:06.210: [ COMMCRS][1]clscsendx: (6000000000398150) Physical connection (60000000009ff860) not active
2010-01-23 12:52:23.196: [ COMMCRS][1]clscsendx: (60000000003982d0) Physical connection (60000000009ff860) not active



Those warnings are flagged CLSC_TRACE_ALWAYS as indicated in Bug 3251584.
They concerns connections between the evmd daemons and other processes that are disconnected from the point of view of the evmd, i.e. the evmd detect the remote connections are not active anymore, but the remote side has not disconnect the link in a proper way. Hence the warning message. They may not be removed as indicated in Bug 3251584.

Those evmd warning messages can be safely ignored as they are informative only.

Happy reading,

Jaffar

1.11.2010

Oracle Bugs - Part 2 (ASM instance on RAC crashed due to ORA-600 [kfclUpdateBuffers20])

Here comes the Part 2 story of my encounter with various Oracle bugs in the recent past.

An ASM instance on one of the node of 8 node RAC environment crashed due to 'ORA-00600: internal error code, arguments: [kfclUpdateBuffers20]' error and all the active instances on this node crashed too with 'communication failure with ASM instance'.

 The cause is due to an intermittent buffer conversion problem in ASM internal structures,  LMS b/g process has terminated the ASM instance with an ORA-600 kfclUpdateBuffers20 error.


Following bugs have been registered against this behavior:
Bug 6489250 INTERMITTENT ORA-600 [KFCLUPDATEBUFFERS20] WITH ASM
Bug:6605106 AN ASM NULL CONVERTED BUFFER MAY BE REFERENCED
 

Details:
ORA-600 [kfclUpdateBuffers20] can occur in ASM as a null converted
buffer may be accessed. Other errors could also occur.


To fix this bug, either apply one-off patch which suits your environment version or apply 10.2.0.5 or upgrade to 11g.

Happy oracle bugs reading.

Jaffar

1.04.2010

Oracle bugs - Part 1 (an Instance hung on 'WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK')

Since couple of months, I have been dealing with various Oracle bugs on our 8 node RAC running on HPUX environment with Oracle V10.2.0.4. That really keeps my life busy and given enough nightmares. I just thought of sharing my experience dealing these bugs in a three part series. In the first part of the series, I am going to discuss a situation where an RAC database with two instances hung scenario and how we handled the situation.

An RAC database with two instances was hung with the 'WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!'  error message in the alert.log file. Initially, neither I was able to connect to the database via sqlplus nor able to shutdown the database using SRVCTL STOP DATABASE command. I had to connect one of the instance using the 'sqlplus -prelim' option and before I perform instance abort, I have collected the hanganalyze (level 3) and systemstate (level 266) dumps using the oradebug utility. Upon investigating the hanganlyze tracefile, it shows the following contention:

63CA629C:BE596F7A 300 215 10005 2 KSL WAIT END [enq: SQ - contention] 1397817350/0x53510006 144/0x90 0/0x0 time=2929720
63CA62C9:BE596F7B 300 215 10005 1 KSL WAIT BEG [enq: SQ - contention] 1397817350/0x53510006 144/0x90 0/0x0

 
Further analysis shows a dead lock between two process which were either directly or indirectly blocking all other sessions in the database. Digging further more with the trace file reveals the real problem as you can see in the subsequent lines:

Deadlock between process 33 (J000) running DBMS_STATS_SCH and process 45.
Process 33 is waiting for 'cursor: pin S wait on X' mutex idn=57bc44f9 in shared mode while holding library cache lock handle=c00000010e9123c0 in exclusive mode

Mutex c0000001074df688(490, 0) idn 57bc44f9 oper GET_SHRD --> waiting
LIBRARY OBJECT LOCK: lock=c000000101141390 handle=c00000010e9123c0 mode=X

Process 45 is waiting for waiting for 'library cache lock' on SYS.C_COBJ#  handle address=c00000010e9123c0 in shared mode while holding mutex idn 57bc44f9 in mode EXCL.

Mutex c0000001074df688(490, 0) idn 57bc44f9 oper EXCL
LIBRARY OBJECT LOCK: lock=c000000108b0b208 handle=c00000010e9123c0 request=S




As usual, an Oracle support informed me that its a bug 6011045 and the behavior is due to the following reasons:





Abstract:            DBMS_STATS causes deadlock between 'cursor: pin S wait on X' and 'library cache lock'
Fixed-Releases:     B1
Tags:               DBMSPKG DEADLOCK OPS
Details:          
 Gathering statistics (With DBMS_STATS) on bootstrap objects (like ind$) can cause  a deadlock on sessions waiting on between 'cursor: pin S wait on X'  and 'library cache lock' in RAC.  In some instances sessions waiting on 'DFS lock handle' and 'row cache lock'   are also part of the deadlock path.



Yes, we had a job scheduled daily at might night to collect the database statistics (dbms_stats.gather_database_stats). Though it was running smoothly more than a year now, on any unlucky day, you gonna face this sort of troubles. My initial impression was that this bug is fixed in v.10.2.0.4. However, the Oracle support informed me that the bug is not fixed in 10.2.0.4, either I  need to upgrade  to 11g or request for a one-off patch for my platform.

In part 2, I am going to discuss a situation where one of the RAC database instance crashed due to 'ORA-07445 error'.

Happy reading

Jaffar