Its just a simple change right?
Putting a database in archivelog mode? Set the destination, the archive log format, shutdown immediate, startup mount, alter database archivelog, alter database open, alter system archive log current. Done.
Except, alter system archive log current failed:
Huh, thats ok. This is an Oracle EBS environment and I created the archive destination directory with the applmgr user, not oracle. No problem, su - root, chown oracle:dba . , done.
alter system archive log current again, success!
Can't remember what query I issued next (its been a hazzy night) but I was greeted with the following error:
WTF? Just some background about this environment. I only cloned it the day before and I didn't have any backups. I was putting the database in archivelog mode so I could take backups.
This error typically means the datafile is offline, tho i'm not sure why. So I bring it online:
But:
Hrmm.. ok, so I try to open the database:
WTF? Check the alert log:
Lovely. At this point I have 3 options:
1. Work with Oracle Support.
2. Rebuild the environment since I don't have any backups. Developers lose 1 day.
3. Back it up, research the error and see if I can find a solution.
I worked with Oracle support up until 6:30am, at which point I believe my analyst was finished for the day and went home. This is a UAT environment, so the SR is a P2. I wasn't expecting 24x7 support, so this was just bad timing.
I searched metalink and didn't find any solid hits. So I tried google. Not many hits, but I received one exact match, in chinese. Google graciously offered to translate the page for me and about the only useful text on the page was:
"Who came across the internal bug? 我用了alter database clear logfile group 后再重启动数据库就发生这个错误了。 I used the alter database clear logfile group restart the database after this error has occurred. "
At this point I had an hour left before testers started and only 2 options remaining. I figure I was going to have to rebuild the database anyways, so i'd try clearing the log files. I cleared the inactive groups without issue. The only group remaining was the current and if the clear logfile group command failed i'd have to proceed with the rebuild.
From the alert log:
Success!!!
Alter database open; Success!!!
Thats my 6 hr simple change (most of that time was working with Oracle Support, taking a backup [only 5MB/s IO??!] of the environment before I played around with it, etc.) Now I need some sleep. You know your tired when you almost use dish washing liquid for cream in your coffee. Luckily an attentive coworker prevented disaster! Normally an all nighter doesn't leave me this drained but the previous night I only had 3 hrs sleep.
Except, alter system archive log current failed:
ORA-19504: failed to create file "/u02/archive/ORCL/ORCL_1_1_686533687.dbf"
ORA-27040: file create error, unable to create file
Linux Error: 13: Permission denied
Huh, thats ok. This is an Oracle EBS environment and I created the archive destination directory with the applmgr user, not oracle. No problem, su - root, chown oracle:dba . , done.
alter system archive log current again, success!
Can't remember what query I issued next (its been a hazzy night) but I was greeted with the following error:
Errors in file /u01/oracle/ORCL/db/tech_st/10.2.0/admin/ORCL_myserver/bdump/orcl_smon_10547.trc:
ORA-00376: file 5 cannot be read at this time
ORA-01110: data file 5: '/u01/oracle/ORCL/db/apps_st/data/system05.dbf'
WTF? Just some background about this environment. I only cloned it the day before and I didn't have any backups. I was putting the database in archivelog mode so I could take backups.
This error typically means the datafile is offline, tho i'm not sure why. So I bring it online:
alter database datafile 5 online;
recover datafile 5;
But:
ORA-00283: recovery session canceled due to errors
ORA-00264: no recovery required
Hrmm.. ok, so I try to open the database:
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
WTF? Check the alert log:
Errors in file /u01/oracle/ORCL/db/tech_st/10.2.0/admin/ORCL_rc1657/bdump/uat2_lgwr_11557.trc:
ORA-00600: internal error code, arguments: [3712], [1], [1], [1388],
[3796296244], [1388], [3796296241], []
Lovely. At this point I have 3 options:
1. Work with Oracle Support.
2. Rebuild the environment since I don't have any backups. Developers lose 1 day.
3. Back it up, research the error and see if I can find a solution.
I worked with Oracle support up until 6:30am, at which point I believe my analyst was finished for the day and went home. This is a UAT environment, so the SR is a P2. I wasn't expecting 24x7 support, so this was just bad timing.
I searched metalink and didn't find any solid hits. So I tried google. Not many hits, but I received one exact match, in chinese. Google graciously offered to translate the page for me and about the only useful text on the page was:
"Who came across the internal bug? 我用了alter database clear logfile group 后再重启动数据库就发生这个错误了。 I used the alter database clear logfile group restart the database after this error has occurred. "
At this point I had an hour left before testers started and only 2 options remaining. I figure I was going to have to rebuild the database anyways, so i'd try clearing the log files. I cleared the inactive groups without issue. The only group remaining was the current and if the clear logfile group command failed i'd have to proceed with the rebuild.
From the alert log:
alter database clear unarchived logfile group 1
Tue May 12 08:14:34 2009
Thread 1 advanced to log sequence 6 (preparation to clear logfile)
WARNING! CLEARING REDO LOG WHICH HAS NOT BEEN ARCHIVED. BACKUPS TAKEN
BEFORE 05/12/2009 08:14:34 (CHANGE 1) CANNOT BE USED FOR RECOVERY.
Clearing online log 1 of thread 1 sequence number 5
Tue May 12 08:17:34 2009
Completed: alter database clear unarchived logfile group 1
Success!!!
Alter database open; Success!!!
Thats my 6 hr simple change (most of that time was working with Oracle Support, taking a backup [only 5MB/s IO??!] of the environment before I played around with it, etc.) Now I need some sleep. You know your tired when you almost use dish washing liquid for cream in your coffee. Luckily an attentive coworker prevented disaster! Normally an all nighter doesn't leave me this drained but the previous night I only had 3 hrs sleep.
Friday, March 20, 2009
Monitoring your RMAN backup scripts
There is an easy way to monitor your RMAN backup scripts in a unix environment. I've seen quite a few DBA's log the rman output to a file and then grep it for errors. Ex.
So another way of writing the above script would be
NOTE: Ignore the backlashes before the exclamation points above. I'm not sure how to format it properly within blogger. I'll see if I can fix it later.
As you can see above, after rman executes I set a status variable. $? is automatically set to the return code from the rman client. After that I check the value and there was an error I send an email saying the backup failed and put the contents of the log file into the body of the message. If the backup was successful, I still send a copy of the logfile.
The reason I do this is because each day I will expect to see either a successful or failed email. I've had issues in the past with cron not executing scheduled jobs (for numerous reasons). So if I don't receive an email I'll verify the backup manually.
rman target / CMDFILE /home/oracle/scripts/rman.sql LOG $logfile
errors=`cat $logfile|grep -i error|wc -l`
if test $errors -gt 0
then
mailx -s "BACKUP FAILED" email1@myserver.com, email2@myserver.com <<\!
`cat $logfile`
!
fi
Another method would be to examine the return code from the rman client. Rman returns a value of 0 to the shell if the backup was successful. Otherwise it returns a nonzero value and from what I have seen this is always 1. (Have you seen any other values?)So another way of writing the above script would be
rman target / CMDFILE /home/oracle/scripts/rman.sql LOG $logfile
status=$?
if [ $status -gt 0 ] ; then
mailx -s "BACKUP FAILED: TESTDB" email1@myserver.com <<\!
`cat $logfile`
\!
else
mailx -s "SuccessfulBackup: TESTDB" email1@myserver.com <<\!
`cat $logfile`
\!
fi
NOTE: Ignore the backlashes before the exclamation points above. I'm not sure how to format it properly within blogger. I'll see if I can fix it later.
As you can see above, after rman executes I set a status variable. $? is automatically set to the return code from the rman client. After that I check the value and there was an error I send an email saying the backup failed and put the contents of the log file into the body of the message. If the backup was successful, I still send a copy of the logfile.
The reason I do this is because each day I will expect to see either a successful or failed email. I've had issues in the past with cron not executing scheduled jobs (for numerous reasons). So if I don't receive an email I'll verify the backup manually.
Friday, January 16, 2009
Recovery Scenario – Losing a Redo Log file.
I’m in the process of studying for the OCP exam and I personally find it much easier to remember something if I write it down. What better way to do this than creating a blog post? Unfortunately its some pretty basic stuff so it might not garner much interest.
My test environment has the following redo log configuration:
As you can see, there are 3 redo log groups with 2 members each. In order to simulate a failure I am going to remove one of the members.
Oracle will not detect a problem until it tries to use the missing redo log file so I had to issue two logfile switches before redo01a.log was accessed. At that point the following errors show up in the alert.log.
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu Jan 15 15:40:44 2009
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00321: log 1 of thread 1, cannot update log file header
ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'
Thu Jan 15 15:40:45 2009
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00313: open failed for members of log group 1 of thread 1
As well, if we take another look at v$logfile, the missing redo log will have a status of INVALID:
In this particular case, the loss of redo01a.log file is not critical. Since the group is multiplexed (more than 1 member) Oracle can operate without impacting users. Each time Oracle tries to use the missing redo log file the errors will be displayed again in alert log.
There are two methods in order to recover form this scenario:
Which one should you choose? That depends on the circumstances surrounding why the redo log is missing. The CLEAR LOGFILE command reinitializes the logfile group, similiar to dropping and adding each member. If the failure is due to a corrupted disk, then this command will fail as it recreates the group in place. If this is the case, you will need to use the second method of dropping and adding a new member so you can specify a new location.
Lets try issuing the CLEAR LOGFILE command:
Thats it, everything is back to normal.
Lets try the other method of dropping and adding a new member:
So, what happens if you lose all members of a redo log group? The steps you have to take are dependent on the groups status and whether or not the database is in archivelog mode.
INACTIVE:
ACTIVE:
CURRENT:
The current redo log group is the one which Oracle is currently writing to. If you lose this group or files become corrupted, LGWR may terminate. For example:
Corrupted files:
Fri Jan 30 14:14:55 2009
Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_25599.trc:
ORA-00316: log 1 of thread 1, type 8944 in header is not log file
ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'
Fri Jan 30 14:14:55 2009
LGWR: terminating instance due to error 316
Instance terminated by LGWR, pid = 25599
Missing files:
Fri Jan 30 15:04:34 2009
Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_1434.trc:
ORA-00320: cannot read file header from log 1 of thread 1
ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
SVR4 Error: 11: Resource temporarily unavailable
Additional information: 4
Additional information: 1
LGWR: terminating instance due to error 320
Instance terminated by LGWR, pid = 1434
My test environment has the following redo log configuration:
1 select group#, sequence#, bytes/1024/1024 as "Size (MB)", members, status
2* from v$log
L> /
GROUP# SEQUENCE# Size (MB) MEMBERS STATUS
-------- ---------- ---------- ---------- ----------------
1 5 50 2 INACTIVE
2 6 50 2 CURRENT
3 4 50 2 INACTIVE
SQL> l
1 select group#, member
2 from v$logfile
3* order by 1
SQL> /
GROUP# MEMBER
---------- ----------------------------------------
1 /home/oracle/oradata/orcl/redo01.log
1 /home/oracle/oradata/orcl/redo01a.log
2 /home/oracle/oradata/orcl/redo02a.log
2 /home/oracle/oradata/orcl/redo02.log
3 /home/oracle/oradata/orcl/redo03a.log
3 /home/oracle/oradata/orcl/redo03.log
As you can see, there are 3 redo log groups with 2 members each. In order to simulate a failure I am going to remove one of the members.
SQL> !rm /home/oracle/oradata/orcl/redo01a.log
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
Oracle will not detect a problem until it tries to use the missing redo log file so I had to issue two logfile switches before redo01a.log was accessed. At that point the following errors show up in the alert.log.
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu Jan 15 15:40:44 2009
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00321: log 1 of thread 1, cannot update log file header
ORA-00312: online log 1 thread 1: '/home/oracle/oradata/orcl/redo01a.log'
Thu Jan 15 15:40:45 2009
Errors in file /home/oracle/product/10.2.0/db_1/admin/orcl/bdump/orcl_lgwr_2648.trc:
ORA-00313: open failed for members of log group 1 of thread 1
As well, if we take another look at v$logfile, the missing redo log will have a status of INVALID:
SQL>
GROUP# MEMBER STATUS
---------- ---------------------------------------- -------
3 /home/oracle/oradata/orcl/redo03.log
2 /home/oracle/oradata/orcl/redo02.log
1 /home/oracle/oradata/orcl/redo01.log
1 /home/oracle/oradata/orcl/redo01a.log INVALID
2 /home/oracle/oradata/orcl/redo02a.log
3 /home/oracle/oradata/orcl/redo03a.log
In this particular case, the loss of redo01a.log file is not critical. Since the group is multiplexed (more than 1 member) Oracle can operate without impacting users. Each time Oracle tries to use the missing redo log file the errors will be displayed again in alert log.
There are two methods in order to recover form this scenario:
- Issue the CLEAR LOGFILE command.
- Drop the missing member and add a new one.
Which one should you choose? That depends on the circumstances surrounding why the redo log is missing. The CLEAR LOGFILE command reinitializes the logfile group, similiar to dropping and adding each member. If the failure is due to a corrupted disk, then this command will fail as it recreates the group in place. If this is the case, you will need to use the second method of dropping and adding a new member so you can specify a new location.
Lets try issuing the CLEAR LOGFILE command:
SQL> alter database clear logfile group 1;
Database altered.
Thats it, everything is back to normal.
Lets try the other method of dropping and adding a new member:
SQL> !rm /home/oracle/oradata/orcl/redo01a.log
SQL> alter database drop logfile member '/home/oracle/oradata/orcl/redo01a.log';
Database altered.
SQL> alter database add logfile member '/home/oracle/oradata/orcl/redo01a.log' to group 1;
Database altered.
So, what happens if you lose all members of a redo log group? The steps you have to take are dependent on the groups status and whether or not the database is in archivelog mode.
INACTIVE:
- If the affected redo log group has a status of INACTIVE it is no longer required for crash recovery. If you are in NOARCHIVELOG mode issue either CLEAR LOGFILE or recreate the group manually.
- If you are in ARCHIVELOG mode and the group has been archived, issue either of the steps above.
- If you are in ARCHIVELOG mode and the group hasn’t been archived then issue CLEAR UNARCHIVED LOGFILE. If you don’t specify the UNARCHIVED keyword you will receive an error. After you execute this command, you will have a gap in your archivelogs, so perform a complete backup (including the control file) ASAP.
ACTIVE:
- If the redo log group has a status of ACTIVE, it is still required for crash recovery. Issue the command
ALTER
SYSTEM
CHECKPOINT
, If successful then follow the steps above for INACTIVE. - If the checkpoint fails, then you need to perform recovery. If you are in NOARCHIVELOG mode then you need to perform a complete recovery of the last cold backup.
- If the checkpoint fails and you are in ARCHIVELOG mode then you can perform an INCOMPLETE recovery up to the previous log file.
CURRENT:
The current redo log group is the one which Oracle is currently writing to. If you lose this group or files become corrupted, LGWR may terminate. For example:
Corrupted files:
Fri Jan 30 14:14:55 2009
Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_25599.trc:
ORA-00316: log 1 of thread 1, type 8944 in header is not log file
ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'
Fri Jan 30 14:14:55 2009
LGWR: terminating instance due to error 316
Instance terminated by LGWR, pid = 25599
Missing files:
Fri Jan 30 15:04:34 2009
Errors in file /apps/oravis/db/tech_st/10.2.0/admin/test/bdump/test_lgwr_1434.trc:
ORA-00320: cannot read file header from log 1 of thread 1
ORA-00312: online log 1 thread 1: '/oradata/dbf2/testdata/test/redo01.log'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
SVR4 Error: 11: Resource temporarily unavailable
Additional information: 4
Additional information: 1
LGWR: terminating instance due to error 320
Instance terminated by LGWR, pid = 1434
- If LGWR terminates you have to recover the database:
- If you are in ARCHIVELOG mode, perform an incomplete recovery up to the previous log file.
- If you are in NOARCHIVELOG, perform a complete recovery of the last cold backup.
- If the database is still online, you can try and clear the logfile group. If the command is not successful then you will need to recover.
Thursday, May 15, 2008
Refreshing an 11i Database using Rman
In order to facilitate troubleshooting we maintain a test environment which is a nightly copy of our 11i production environment. Since this environment is usually used to test data fixes it has to be as up to date as possible. To perform the database refresh we use rman's duplicate feature.
The goal of this article isn't just to provide the entire set of scripts and send you on your way. I think its safe to say that most EBS environments aren't identical, so its not like you could take them and execute with no issues. Instead i'll highlight the steps we follow and some of the key scripts.
NOTE: This doesn't include any pre-setup steps such as, if this is the first time duplicating the database make sure you have the parameters db_file_name_convert and log_file_name_convert specified in your test environments init file.
Hopefully this article was of some use even tho it was pretty vague at times. If you have any questions feel free to ask. Any corrections or better methods don't hesitate to leave a comment either.
The goal of this article isn't just to provide the entire set of scripts and send you on your way. I think its safe to say that most EBS environments aren't identical, so its not like you could take them and execute with no issues. Instead i'll highlight the steps we follow and some of the key scripts.
NOTE: This doesn't include any pre-setup steps such as, if this is the first time duplicating the database make sure you have the parameters db_file_name_convert and log_file_name_convert specified in your test environments init file.
- Step 1: Shutdown the test environment. If you are using 10g then remove any tempfiles. In 10g, rman now includes tempfile information and if they exist you will encounter errors. Check this previous post. Startup the database in nomount mode.
- Step 2: Build a Rman Script. There are a couple of ways to recover to a point in time and we have decided to use SCN numbers. Since this process needs to be automated, we query productions rman catalog and determine the proper SCN to use and build an rman script. Here it is:
set feedback off set echo off set serverout on spool $HOME/scripts/prod_to_vis.sql declare vmax_fuzzy number; vmax_ckp number; scn number; db_name varchar2(3) := 'VIS'; log_file_dest1 varchar2(30) := '/dbf/visdata/'; begin select max(absolute_fuzzy_change#)+1, max(checkpoint_change#)+1 into vmax_fuzzy, vmax_ckp from rc_backup_datafile; if vmax_fuzzy > vmax_ckp then scn := vmax_fuzzy; else scn := vmax_ckp; end if; dbms_output.put_line('run {'); dbms_output.put_line('set until scn '||to_char(scn)||';'); dbms_output.put_line('allocate auxiliary channel ch1 type disk;'); dbms_output.put_line('allocate auxiliary channel ch2 type disk;'); dbms_output.put_line('duplicate target database to '||db_name); dbms_output.put_line('logfile group 1 ('||chr(39)||log_file_dest1||'log01a.dbf'||chr(39)||','); dbms_output.put_line(chr(39)||log_file_dest1||'log01b.dbf'||chr(39)||') size 10m,'); dbms_output.put_line('group 2 ('||chr(39)||log_file_dest1||'log02a.dbf'||chr(39)||','); dbms_output.put_line(chr(39)||log_file_dest1||'log02b.dbf'||chr(39)||') size 10m;}'); dbms_output.put_line('exit;'); end; / spool off; This script produces a spool file called, prod_to_vis.sql: run { set until scn 14085390202; allocate auxiliary channel ch1 type disk; allocate auxiliary channel ch2 type disk; duplicate target database to VIS logfile group 1 ('/dbf/visdata/log01a.dbf', '/dbf/visdata/log01b.dbf') size 10m, group 2 ('/oradata/dbf/visdata/log02a.dbf', '/dbf/visdata/log02b.dbf') size 10m;} exit;
Note: Our production nightly backups are on disk which are NFS mounted to our test server. - Step 3: Execute the rman script. Launch rman, connect to the target, catalog, auxiliary and execute the script above:
ie.
rman> connect target sys/syspasswd@PROD catalog rmancat/catpasswd@REPO auxiliary /
You may want to put some error checking around rman to alert you if it fails. We have a wrapper script which supplies the connection information and calls the rman script above. Our refresh is critical so if it fails we need to be paged.rman @$SCRIPTS/prod_to_vis.sql if [ $? != 0 ] then echo Failed echo "RMAN Dupcliate Failed!"|mailx -s "Test refresh failed" pageremail@mycompany.com exit 1 fi
- Step 4: If production is in archivelog mode but test isn't, then mount the database and alter database noarchivelog;
- Step 5: If you are using a hotbackup for cloning then you need to execute adupdlib.sql. This updates libraries with correct OS paths. (Appendix B of Note:230672.1)
- Step 6: Change passwords. For database accounts such as sys, system and other non-applications accounts change the passwords using alter user. For applications accounts such as apps/applsys, modules, sysadmin, etc use FNDCPASS to change their passwords.
ie. To change the apps password:
FNDCPASS apps/<production appspassword=> 0 Y system/<system_passwd> SYSTEM applsys <new apps passwd> - Step 7: Run autoconfig.
- Step 8: Drop any database links that aren't required in the test environment, or repoint them to the proper test environments.
- Step 9: Follow Section 3: Finishing Tasks of Note:230672.1
- Update any profile options which have still reference the production instance.
Example:
UPDATE FND_PROFILE_OPTION_VALUES SET
profile_option_value = REPLACE(profile_option_value,'PROD','TEST')
WHERE profile_option_value like '%PROD%
Specifically check the FND_PROFILE_OPTION_VALUES,ICX_PARAMETERS, WF_NOTIFICATION_ATTRIBUTES andWF_RESOURCES tables and look for production hostnames and ports. We also update the forms title bar with the date the environment was refreshed:
UPDATE apps.FND_PROFILE_OPTION_VALUES SET
profile_option_value = 'TEST:'||' Refreshed from '||'Production: '||SYSDATE
WHERE profile_option_id = 125
; - Cancel Concurrent requests. We don't need concurrent requests which are scheduled in production to keep running in test. We use the following update to cancel them. Also, we change the number of processes for the standard manager.
update fnd_concurrent_requests
set phase_code='C',
status_code='D'
where phase_code = 'P'
and concurrent_program_id not in (
select concurrent_program_id
from fnd_concurrent_programs_tl
where user_concurrent_program_name like '%Synchronize%tables%'
or user_concurrent_program_name like '%Workflow%Back%'
or user_concurrent_program_name like '%Sync%responsibility%role%'
or user_concurrent_program_name like '%Workflow%Directory%')
and (status_code = 'I' OR status_code = 'Q');
update FND_CONCURRENT_QUEUE_SIZE
set min_processes = 4
where concurrent_queue_id = 0;
- Update any profile options which have still reference the production instance.
- Step 10: Perform any custom/environment specific steps. We have some custom modules which required some modifications as part of cloning.
- Step 11: Startup all of the application processes. ($S_TOP/adstrtal.sh)
NOTE: If you have an application tier you may have to run autoconfig before starting up the services.
Hopefully this article was of some use even tho it was pretty vague at times. If you have any questions feel free to ask. Any corrections or better methods don't hesitate to leave a comment either.
Thursday, January 17, 2008
Control File Autobackups
One of the many features included in rman is the ability to automatically backup your control file. What is a control file? The control file contains information about your database including its physical structure. If you are using rman without a recovery catalog then your backup information is also stored in the control file. For an overview of control files take a look atOracles Documentation.
It is very important for (some) recovery purposes to ensure you have a recent copy of your control file, especially if it contains your recovery catalog. There is one feature of control file autobackups which alot of dba's seem to be unfamiliar with, which is why I thought i'd write up a brief article on it. If you have this feature enabled and you make a structural change to the database, the control file is automatically backed up. ie, If you add a datafile, rename a file, etc. Information which could affect your ability to recover. This backup will be placed on disk even if your regular backup goes to tape.
A quick test. First, ensure you have control file autobackups enabled and select a location on disk. By default the control file backup will be placed in your $ORACLE_HOME/dbs directory unless you specify a location.
A quick test to make sure the backup works and is on disk:
Next we need to make a structural change to the database to see if the control file is automatically backed up. For this example i'll add a small datafile to the system tablespace.
A directory listing of the backup location should show a second control file backup:
I promise my next post will be a bit more advanced and related to EBS.
It is very important for (some) recovery purposes to ensure you have a recent copy of your control file, especially if it contains your recovery catalog. There is one feature of control file autobackups which alot of dba's seem to be unfamiliar with, which is why I thought i'd write up a brief article on it. If you have this feature enabled and you make a structural change to the database, the control file is automatically backed up. ie, If you add a datafile, rename a file, etc. Information which could affect your ability to recover. This backup will be placed on disk even if your regular backup goes to tape.
A quick test. First, ensure you have control file autobackups enabled and select a location on disk. By default the control file backup will be placed in your $ORACLE_HOME/dbs directory unless you specify a location.
oravis@myserver=> rman target / nocatalog
Recovery Manager: Release 10.2.0.2.0 - Production on Wed Jan 16 14:48:49 2008
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: VIS (DBID=767668735)
using target database control file instead of recovery catalog
RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;
new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
new RMAN configuration parameters are successfully stored
RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/oradata/bck/vis_backup/%F';
new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/oradata/bck/vis_backup/%F';y
A quick test to make sure the backup works and is on disk:
oravis@myserver=> rman target / nocatalog
Recovery Manager: Release 10.2.0.2.0 - Production on Wed Jan 16 15:25:24 2008
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: VIS (DBID=767668735)
using target database control file instead of recovery catalog
RMAN> backup current controlfile;
Starting backup at 16-JAN-08
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=414 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
including current control file in backupset
channel ORA_DISK_1: starting piece 1 at 16-JAN-08
channel ORA_DISK_1: finished piece 1 at 16-JAN-08
piece handle=/mnt1/app/visdb/1020_64bit/dbs/01j6adrd_1_1 tag=TAG20080116T152532 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:02
Finished backup at 16-JAN-08
Starting Control File Autobackup at 16-JAN-08
piece handle=/oradata/bck/vis_backup/c-767668735-20080116-00 comment=NONE
Finished Control File Autobackup at 16-JAN-08
RMAN> exit
Recovery Manager complete.
oravis@myserver=> ls -alrt /oradata/bck/vis_backup
total 42036
drwxrwxr-x 28 oravis dba 1024 Jan 16 14:47 ..
-rw-rw---- 1 oravis dba 19398656 Jan 16 15:25 c-767668735-20080116-00
drwxrwxr-x 2 oravis dba 512 Jan 16 15:25 .
Next we need to make a structural change to the database to see if the control file is automatically backed up. For this example i'll add a small datafile to the system tablespace.
SQL> alter tablespace SYSTEM add datafile '/oradata/dbf/visdata/system09.dbf' size 25M;
Tablespace altered.
A directory listing of the backup location should show a second control file backup:
oravis@myserver=> ls -alrt /oradata/bck/vis_backup
total 79972
drwxrwxr-x 28 oravis dba 1024 Jan 16 14:47 ..
-rw-rw---- 1 oravis dba 19398656 Jan 16 15:25 c-767668735-20080116-00
-rw-rw---- 1 oravis dba 19398656 Jan 16 15:26 c-767668735-20080116-01
drwxrwxr-x 2 oravis dba 512 Jan 16 15:26 .
I promise my next post will be a bit more advanced and related to EBS.
Friday, August 31, 2007
Rman and Windows
Thought I would point out a nice little utility i've used on windows 2003 environments. The program is called linkd and it is part of the Windows Server 2003 Resource Kit. Linkd is a utility which allows you to create Junction points on a windows server. Junction point is just the term used for directory symbolic links.
Why is this handy? Recently I was asked to restore a windows production database on a test server. As you may know with rman, it is very picky on paths. For example, if you place your backup in a directory called d:\backup\nightly and would like to restore this on another server, then you have to place the backup in the same path.
Unfortunately there wasn't enough free space on D: for me to place the backup. However, there was more than enough on the H: drive. I used the linkd utility to create a symbolic link from D:\backup\nightly to H:\backup\nightly (can be any path) and ran the restore.
On unix symbolic links are common knowledge, not the same for windows... In this case I took over from a couple of dba's who were trying to get this working for over a day (24hrs straight). They opened a case with Oracle Support and before I took it over they were trying to extract the datafiles manually from the backup. Yikes, painful. So hopefully this tool will become more well known.
Related links:
How to create and manipulate NTFS junction points - MS Support Site
Junction V1.05 - MS Technet
Windows 2003 Resource Kit download page
Why is this handy? Recently I was asked to restore a windows production database on a test server. As you may know with rman, it is very picky on paths. For example, if you place your backup in a directory called d:\backup\nightly and would like to restore this on another server, then you have to place the backup in the same path.
Unfortunately there wasn't enough free space on D: for me to place the backup. However, there was more than enough on the H: drive. I used the linkd utility to create a symbolic link from D:\backup\nightly to H:\backup\nightly (can be any path) and ran the restore.
On unix symbolic links are common knowledge, not the same for windows... In this case I took over from a couple of dba's who were trying to get this working for over a day (24hrs straight). They opened a case with Oracle Support and before I took it over they were trying to extract the datafiles manually from the backup. Yikes, painful. So hopefully this tool will become more well known.
Related links:
How to create and manipulate NTFS junction points - MS Support Site
Junction V1.05 - MS Technet
Windows 2003 Resource Kit download page
Monday, July 16, 2007
ORA-00376: file 6 cannot be read at this time
It seems each night I try to hit the sack a little early (1am) I get paged shortly after. Last night was no exception with a page coming in from one of our application support teams. Users in Australia were having issues logging into one of our customer support applications.
All of the environments I support are unix based but our team does support some Windows servers. So while i'm on call i'm responsible for those as well.
The first thing I do is check the alert log and I notice the following errors repeating over and over:
Errors in file d:\orant9i\admin\orcl\udump\orcl_j000_2096.trc:
ORA-12012: error on auto execute of job 22
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: 'D:\ORCL\MYDATA01.DBF'
ORA-06512: at "SCHEMA.STOREDPROCEDURE", line 3
ORA-06512: at line 1
I seemed to recall some of our other windows based database servers having issues in the past with backup software locking the files. So the next thing I do is called the Intel Support group in Australia and ask them if any backups were executed today. Indeed this past weekends full backups failed and they initiated some manual backups today.
The support person confirmed that the OS backup had completed, so I restarted the database. The database opened successfully and I checked the alert logs to verify there were no errors. As it had in the past occurrences, these steps resolved the issue.
Since I couldn't get back to sleep I kept an eye on the alert log and surfed the web on my home PC. About 20 minutes later the error appeared again followed by an email from a manager in Australia.
I was kind of surprised by this so I checked v$datafile. The status of 3 files was "RECOVER". My first thought was, why do I always get the fun issues when i'm on pager?!
This was a relatively small database and before I attempt any recovery I always take a cold backup. (If time allows.) Once I restarted the database I executed the following commands for each datafile that was marked as needing recovery:
SQL> recover datafile 'c:\my\datafile.dbf';
SQL> alter database datafile 'c:\my\datafile.dbf' online;
Media recovery was successful and the problem did not reoccur. So just a note for those windows DBA's out there. If you have automated backup software ensure that your datafile directories are excluded. While on unix the worst case would be an inconsistent backup, on windows, file locking behaves differently and could cause an outage like the one I experienced.
Well, i'm off to bed shortly and hopefully will get a good nights sleep!
All of the environments I support are unix based but our team does support some Windows servers. So while i'm on call i'm responsible for those as well.
The first thing I do is check the alert log and I notice the following errors repeating over and over:
Errors in file d:\orant9i\admin\orcl\udump\orcl_j000_2096.trc:
ORA-12012: error on auto execute of job 22
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: 'D:\ORCL\MYDATA01.DBF'
ORA-06512: at "SCHEMA.STOREDPROCEDURE", line 3
ORA-06512: at line 1
I seemed to recall some of our other windows based database servers having issues in the past with backup software locking the files. So the next thing I do is called the Intel Support group in Australia and ask them if any backups were executed today. Indeed this past weekends full backups failed and they initiated some manual backups today.
The support person confirmed that the OS backup had completed, so I restarted the database. The database opened successfully and I checked the alert logs to verify there were no errors. As it had in the past occurrences, these steps resolved the issue.
Since I couldn't get back to sleep I kept an eye on the alert log and surfed the web on my home PC. About 20 minutes later the error appeared again followed by an email from a manager in Australia.
I was kind of surprised by this so I checked v$datafile. The status of 3 files was "RECOVER". My first thought was, why do I always get the fun issues when i'm on pager?!
This was a relatively small database and before I attempt any recovery I always take a cold backup. (If time allows.) Once I restarted the database I executed the following commands for each datafile that was marked as needing recovery:
SQL> recover datafile 'c:\my\datafile.dbf
SQL> alter database datafile 'c:\my\datafile.dbf
Media recovery was successful and the problem did not reoccur. So just a note for those windows DBA's out there. If you have automated backup software ensure that your datafile directories are excluded. While on unix the worst case would be an inconsistent backup, on windows, file locking behaves differently and could cause an outage like the one I experienced.
Well, i'm off to bed shortly and hopefully will get a good nights sleep!
Wednesday, July 11, 2007
Resetlogs and your Rman catalog
This is one of those, I knew it at one point but its been so long ago items. Also, the manner in which it was discovered had us scratching our heads for a minute. I try not to blog about work specific items, you never know who is watching out there and i'm sure we have all heard of bloggers who have been fired. Playing it safe, lets just say, someone made a boo boo.....
So just a reminder that when a database is opened with resetlogs your creating a new incarnation of the database. If your using an rman catalog, you will have to login to your catalog and execute register database. Otherwise you will get an error:
Recovery Manager: Release 9.2.0.8.0 - Production
Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.
connected to target database: MYDB (DBID=2508956331)
connected to recovery catalog database
RMAN> 2> 3> 4> 5> 6>
Starting backup at 11-JUL-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 07/11/2007 03:30:00
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20003: target database incarnation not found in recovery catalog
So just a reminder that when a database is opened with resetlogs your creating a new incarnation of the database. If your using an rman catalog, you will have to login to your catalog and execute register database. Otherwise you will get an error:
Recovery Manager: Release 9.2.0.8.0 - Production
Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.
connected to target database: MYDB (DBID=2508956331)
connected to recovery catalog database
RMAN> 2> 3> 4> 5> 6>
Starting backup at 11-JUL-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 07/11/2007 03:30:00
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20003: target database incarnation not found in recovery catalog
Thursday, June 28, 2007
10.2.0.2 Rman duplicate issue
Each night we use rman's duplicate database feature to refresh a test environment to aid in troubleshooting production issues. This past weekend we upgraded our production database from 9i to 10gR2. We also applied alot of patches and the simplest way to update our test environment was to reclone it.
Last night we used our refresh scripts (rman duplicate) to rebuild the test database. Early this morning users reported the following error:
Since this problem wasn't reproducible in production we knew it had to be something with the refresh. The first thing we did was to look for differences between production and test databases. Some objects were invalid in test but recompiling them didn't resolve the issue. Checking through the log files for the refresh I noticed that autoconfig didn't complete successfully.
Uploading Context file /apps/appsora/oraappl/admin/ora_myserver.xml
Context File upload for /apps/appsora/oraappl/admin/ora_myserver.xml failed.
Check log file created by FNDCPUCF program.
. . .
[AutoConfig Error Report]
The following report lists errors AutoConfig encountered during each
phase of its execution. Errors are grouped by directory and phase.
The report format is:
[PROFILE PHASE]
AutoConfig could not successfully execute the following scripts:
Directory: /apps/appsora/oracomn/admin/install/ora_myserver
afcpctx.sh INSTE8_PRF 1
AutoConfig is exiting with status 1
afcpctx.sh uploads you context file to the database by calling FNDCPUCF. I assuming FNDCPUCF stands for FND Concurrent Process Upload Context File but I can't seem to find much information on it. Since it fires off a concurrent request the next step was to check the log files generated for errors. The following error was in one of them:
Uploading context file...
Oracle error -1187: ORA-01187: cannot read from file 702 because it failed verification tests
ORA-01110: data file 702: '/oradata/dbf/oradata/temp02.dbf' has been detected in FND_GSM_UTIL.APPEND_CTX_FRAGMENT.
I haven't encountered that error before so but a quick search on Metalink turned up the following note:
Subject: RMAN After 10.2.0.2 Duplicate Database Gets Ora-1187 And Ora-1110 on TempFiles
Doc ID: Note:374934.1
In 9i tempfile information is removed from the database during the duplication process and you need to re-add them manually. In 10g, tempfiles are now included. However, if the files already exist on the OS then the tempfiles are not re-created and you will see errors in the alert log. So, in order to resolve this issue we dropped the tempfiles and re-added them (as per the metalink note).
To drop tempfiles:
alter database tempfile '/u01/oracle/mydb/temp01.dbf' drop;
To add a new one (assuming your temporary tablespace is named temp):
alter tempspace temp add tempfile '/u01/oracle/mydb/temp01.dbf' size 500M;
Once this was done, autoconfig completed without error and we restarted the application.
Last night we used our refresh scripts (rman duplicate) to rebuild the test database. Early this morning users reported the following error:
Since this problem wasn't reproducible in production we knew it had to be something with the refresh. The first thing we did was to look for differences between production and test databases. Some objects were invalid in test but recompiling them didn't resolve the issue. Checking through the log files for the refresh I noticed that autoconfig didn't complete successfully.
Uploading Context file /apps/appsora/oraappl/admin/ora_myserver.xml
Context File upload for /apps/appsora/oraappl/admin/ora_myserver.xml failed.
Check log file created by FNDCPUCF program.
. . .
[AutoConfig Error Report]
The following report lists errors AutoConfig encountered during each
phase of its execution. Errors are grouped by directory and phase.
The report format is:
[PROFILE PHASE]
AutoConfig could not successfully execute the following scripts:
Directory: /apps/appsora/oracomn/admin/install/ora_myserver
afcpctx.sh INSTE8_PRF 1
AutoConfig is exiting with status 1
afcpctx.sh uploads you context file to the database by calling FNDCPUCF. I assuming FNDCPUCF stands for FND Concurrent Process Upload Context File but I can't seem to find much information on it. Since it fires off a concurrent request the next step was to check the log files generated for errors. The following error was in one of them:
Uploading context file...
Oracle error -1187: ORA-01187: cannot read from file 702 because it failed verification tests
ORA-01110: data file 702: '/oradata/dbf/oradata/temp02.dbf' has been detected in FND_GSM_UTIL.APPEND_CTX_FRAGMENT.
I haven't encountered that error before so but a quick search on Metalink turned up the following note:
Subject: RMAN After 10.2.0.2 Duplicate Database Gets Ora-1187 And Ora-1110 on TempFiles
Doc ID: Note:374934.1
In 9i tempfile information is removed from the database during the duplication process and you need to re-add them manually. In 10g, tempfiles are now included. However, if the files already exist on the OS then the tempfiles are not re-created and you will see errors in the alert log. So, in order to resolve this issue we dropped the tempfiles and re-added them (as per the metalink note).
To drop tempfiles:
alter database tempfile '/u01/oracle/mydb/temp01.dbf' drop;
To add a new one (assuming your temporary tablespace is named temp):
alter tempspace temp add tempfile '/u01/oracle/mydb/temp01.dbf' size 500M;
Once this was done, autoconfig completed without error and we restarted the application.
No comments:
Post a Comment