. /var/opt/oracle/oraset $1
# count locked accounts (the sqlplus invocation line is reconstructed)
crit_var=$(sqlplus -s <<EOF
/ as sysdba
SET HEAD OFF FEED OFF
SELECT count(*)
FROM dba_users
WHERE lock_date IS NOT NULL
AND username in ('CIAP','REPV','CIAL','STARPROD');
EOF)
#
if [ $crit_var -ne 0 ]; then
echo $crit_var
echo "locked acct. issue with $1" | mailx -s "locked acct. issue" [email protected]
else
echo $crit_var
echo "no locked accounts"
fi
exit 0
This shell script is called from a scheduling tool such as cron. For example, this cron entry instructs the job to run every ten minutes:
#---------------------------------------------------
# Job to detect locked database accounts
0,10,20,30,40,50 * * * * /home/oracle/bin/lock.bsh DWREP
1>/home/oracle/bin/log/lock.log 2>&1
#---------------------------------------------------
In this way, I am notified when one of the production database accounts becomes locked. In this cron entry, the code should be on one line. It's placed on two lines in this book so that it fits on the page.
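Many cron implementations (Vixie cron and its derivatives) also accept step syntax, which expresses the same ten-minute schedule more compactly. Whether your cron supports step values is an assumption, so check man 5 crontab first:

```
# equivalent entry using step syntax (*/10 means every ten minutes)
*/10 * * * * /home/oracle/bin/lock.bsh DWREP 1>/home/oracle/bin/log/lock.log 2>&1
```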
Checking for Files over a Certain Age
For some jobs that I run from shell scripts (like backups) I'll first check to see if another backup job is already running. The check involves looking for a lock file. If the lock file exists then the shell script exits.
If the lock file doesn't exist, then one is created. At the end of the job the lock file is removed.
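The create/check/remove flow just described can be sketched in a few lines of bash. The lock file path and the placeholder backup step here are examples, not the book's actual backup script:

```shell
#!/bin/bash
# sketch of the lock file pattern; /tmp/rman_backup.lock is an example path
LOCKFILE=/tmp/rman_backup.lock
if [ -f "$LOCKFILE" ]; then
  echo "prior job still running (or aborted without cleanup); exiting"
  exit 1
fi
touch "$LOCKFILE"
# ... the actual backup work would run here ...
rm -f "$LOCKFILE"
```

A more robust variant uses mkdir to acquire the lock, because directory creation is atomic; two jobs starting at the same instant can't both succeed.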
What happens sometimes is that there's a problem with the job and it aborts abnormally before the lock file can be removed. In these situations, I want to know if a lock file on a server is more than one day old. The existence of an old lock file indicates there has been an issue, so I need to investigate. Listing 21–7 checks for any lock files in the /tmp directory older than one day.
Listing 21–7. A Script That Checks for Any Lock Files in the /tmp Directory Older than One Day
#!/bin/bash
BOX=$(uname -a | awk '{print $2}')
# Find all lock files in /tmp; if any are found, count those older than one day
ls /tmp/*.lock 2>/dev/null && \
filevar=$(find /tmp/*.lock -type f -mtime +1 | wc -l) || filevar=0
if [ $filevar -gt 0 ]; then
echo "$BOX, lockfile issue: $filevar" | \
mailx -s "$BOX lockfile problem" [email protected]
else
echo "Lock file ok: $filevar"
fi
exit 0
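The -mtime +1 test matches files modified more than one day ago. A quick demonstration, using GNU touch's -d option to back-date a file (the /tmp/mtime_demo directory and file names are invented for the demo):

```shell
# create one back-dated and one fresh lock file, then count the old ones
mkdir -p /tmp/mtime_demo
touch -d "2 days ago" /tmp/mtime_demo/old.lock   # GNU touch syntax
touch /tmp/mtime_demo/new.lock
find /tmp/mtime_demo -name '*.lock' -type f -mtime +1 | wc -l   # counts only old.lock
```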
I usually check for the existence of a lock file on a daily basis. Here is a typical cron entry to run the prior script named lock_chk.bsh:
#---------------------------------------------------
# lock file count check
33 5 * * * /orahome/oracle/bin/lock_chk.bsh
1>/orahome/oracle/bin/log/lock_chk.log 2>&1
#---------------------------------------------------
(Again, the code for this cron entry should be on one line. It's placed on two lines in this book so that it fits on the page.)
Checking for Too Many Processes
On some database servers you may have many background SQL*Plus jobs. These batch jobs might perform tasks such as copying data from remote databases, large daily update jobs, and so on. In these environments it's useful to know whether, at any given time, there is an abnormal number of shell scripts running or an unusually large number of SQL*Plus processes running on the database server. An abnormal number of jobs could be an indication that something is broken or hung.
The shell script in Listing 21–8 has two checks in it: one check for the number of shell scripts named with a .bsh extension, and another check for the number of processes that contain the string sqlplus:
Listing 21–8. A Script That Runs Two Checks
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Usage: $0 SID"
exit 1
fi
#
# count .bsh shell script processes (this counting command is a reconstruction)
crit_var=$(ps -ef | grep bsh | grep -v grep | wc -l)
if [ $crit_var -lt 20 ]; then
echo $crit_var
echo "processes running normal"
else
echo "too many processes"
echo $crit_var | mailx -s "too many bsh procs: $1" [email protected]
fi
#
# count sqlplus processes (this counting command is a reconstruction)
crit_var=$(ps -ef | grep sqlplus | grep -v grep | wc -l)
if [ $crit_var -lt 30 ]; then
echo $crit_var
echo "processes running normal"
else
echo "too many processes"
echo $crit_var | mailx -s "too many sqlplus procs: $1" [email protected]
fi
#
exit 0
The prior shell script is named proc_count.bsh and is run once an hour from a cron job:
#---------------------------------------------------
# Process count check, sqlplus process count check.
33 * * * * /home/oracle/bin/proc_count.bsh
1>/home/oracle/bin/log/proc_count.log 2>&1
#---------------------------------------------------
(Again, the code for this cron entry should be on one line. It's placed on two lines in this book so that it fits on the page.)
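Although the counting commands were not shown in full above, a common way to count processes matching a name is a ps/grep pipeline. The bracket trick ('[s]leep') keeps grep from matching its own process; the sleep processes here are stand-ins for the batch jobs being counted:

```shell
# start two dummy background processes to count
sleep 30 & pid1=$!
sleep 30 & pid2=$!
# count them; the pattern '[s]leep 30' never matches the grep command itself
cnt=$(ps -ef | grep '[s]leep 30' | wc -l)
echo "process count: $cnt"
kill $pid1 $pid2   # clean up the dummy processes
```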
Verifying Integrity of RMAN Backups
As part of your backup and recovery strategy, you should periodically validate the integrity of the backup files. RMAN provides a RESTORE...VALIDATE command that checks for physical corruption within the backup files. Listing 21–9 shows a script that starts RMAN and spools a log file. The log file is subsequently searched for the keyword "error". If there are any errors in the log file, an email is sent.
Listing 21–9. A Script That Starts RMAN and Spools a Log File
#!/bin/bash
#
if [ $# -ne 1 ]; then
echo "Usage: $0 SID"
exit 1
fi
# source oracle OS variables
. /var/opt/oracle/oraset $1
#
BOX=$(uname -a | awk '{print $2}')
rman nocatalog <<EOF
connect target /
spool log to $HOME/bin/log/rman_val.log
set echo on;
restore database validate;
EOF
grep -i error $HOME/bin/log/rman_val.log
if [ $? -eq 0 ]; then
echo "RMAN verify issue $BOX, $1" | \
mailx -s "RMAN verify issue $BOX, $1" [email protected]
else
echo "no problem..."
fi
#
exit 0
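The if [ $? -eq 0 ] test works because grep exits with status 0 when it finds a match and 1 when it doesn't. A small illustration with a throwaway log file (/tmp/rman_demo.log and its contents are made up):

```shell
# grep exit status: 0 = pattern found, 1 = not found
printf 'restore validated\nERROR: corrupt block found\n' > /tmp/rman_demo.log
if grep -i error /tmp/rman_demo.log >/dev/null; then
  echo "errors found in log"     # this branch runs: the sample log contains ERROR
else
  echo "log is clean"
fi
```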
The RESTORE...VALIDATE doesn't actually restore any files; it only validates that the required files to restore the database are available and checks for physical corruption.
If you also need to check for logical corruption, specify the CHECK LOGICAL clause. For example, to check for logical corruption, the script in Listing 21–9 would contain this line:
restore database validate check logical;
For large databases, the validation process can take a great deal of time (because it checks each block in the backup file for corruption). If you only want to check that the backup files exist, specify the VALIDATE HEADER clause, like so:
restore database validate header;
This command only checks for valid information in the header of each file that would be required for a restore and recovery.
Summary
Automating routine database jobs is a key attribute of the successful DBA. Automated jobs ensure that tasks are repeatable, verifiable, and that you are immediately notified when there are any problems.
Your job as a DBA depends on successfully running backups and ensuring the database is highly available. This chapter includes several scripts and examples that detail how to run routine jobs at defined frequencies.
If you are a DBA who works in a Linux/Unix shop, you should familiarize yourself with the cron utility. This scheduler is simple to use and is almost universally available. Even if you don't use cron in your current work assignment, you're sure to encounter its use in future work environments.
Oracle provides the Oracle Scheduler (implemented via the DBMS_SCHEDULER PL/SQL package) for scheduling jobs. This tool can be used to automate any type of database task. You can also initiate jobs based on system events or on the success or failure of other scheduled jobs. I prefer to use cron for scheduling database jobs. However, you might have sophisticated scheduling requirements that dictate the use of a tool such as the Oracle Scheduler.
At this point in the book, you've learned how to implement and perform many tasks required of a DBA. Even if you manage just one database, no doubt you've also been embroiled in a vast number of troubleshooting activities. The next chapter of the book focuses on diagnosing and resolving many of the issues that a DBA encounters.
CHAPTER 22 ■ DATABASE TROUBLESHOOTING
Checking Database Availability
The first few checks that I perform don't require the DBA to log on to the database server. Rather, they can be performed remotely via SQL*Plus and operating system commands. In fact, I perform all of the initial checks remotely over the network; this establishes whether all of the system components are working.
One quick check to establish whether the remote server is available, the database is up, the network is working, and the listener is accepting incoming connections is to connect via a SQL*Plus client to the remote database over the network. I usually have a standard database account and password that I create in all databases for use in such scenarios. Here's an example of connecting over the network to a remote database as the barts user with a password of l1sa; the network connect information is embedded directly into the connect string (where dwdb1 is the server, 1521 is the port, and dwrep1 is the database service name):
$ sqlplus barts/l1sa@'dwdb1:1521/dwrep1'
If a connection can be established, then the remote server is available and the database and listener are up and working. At this point, I contact whomever reported an issue and see if the connectivity issue has something to do with the application or something other than the database.
If the prior SQL*Plus command doesn't work, try to establish whether the remote server is available.
This example uses the ping command to the remote server named dwdb1:
$ ping dwdb1
If ping works, you should see output similar to this:
64 bytes from dwdb1 (192.168.254.215): icmp_seq=1 ttl=64 time=0.044 ms
If ping doesn't work, there is probably an issue with either the network or the remote server. If the remote server isn't available, I usually try to contact a system administrator or network administrator.
If ping does work, I check to see if the remote server is reachable via the port that the listener is listening on. I use the telnet command to accomplish this:
$ telnet IP
In this example, a network connection is attempted to the server's IP address on the 1521 port:
$ telnet 192.168.254.215 1521
If the IP address is reachable on the specified port, you should see "Connected to ..." in the output, like so:
Trying 192.168.254.216...
Connected to ora04.
Escape character is '^]'.
If the telnet command doesn't work, I contact the system administrator or the network administrator.
If the telnet command works, there is network connectivity to the server on the specified port.
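If telnet isn't installed, bash can attempt the same TCP connection through its built-in /dev/tcp pseudo-device. This helper is a sketch: the function name and three-second timeout are my choices, and 1521 is simply the usual listener port:

```shell
# return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# example probe of a listener port (host and port are illustrative)
if port_open 192.168.254.215 1521; then
  echo "port reachable"
else
  echo "port not reachable"
fi
```

Note that /dev/tcp is a bash feature, not a real device file, so this only works when the script runs under bash.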
Next, I use the tnsping command to test Oracle Net connectivity to the remote server and database. This example attempts to reach the DWREP1 remote service:
$ tnsping DWREP1
If successful, the output should contain the string "OK", like so:
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = DWDB1) (PORT = 1521)) (CONNECT_DATA = (SERVICE_NAME = DWREP1)))
OK (20 msec)
If tnsping works, it means the remote listener is up and working. It doesn't necessarily mean that the database is up, so you may need to log on to the database server to investigate further. If tnsping doesn't work, then the listener or the database is down or hung. At this point, I log on directly to the server to perform additional checks, such as determining whether a mount point has filled up.
Investigating Disk Fullness
To further diagnose issues, you need to log on directly to the remote server. Typically, you will need to log on as the owner of the Oracle software (usually the oracle operating system account). One issue that will cause a database to hang or have problems is a full mount point.
The df command with the human-readable -h switch assists with verifying disk fullness:
$ df -h
Any mount point that is full needs to be investigated. If the mount point that contains ORACLE_HOME becomes full, then you'll receive errors like this when connecting to the database:
Linux Error: 28: No space left on device
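In the same style as the earlier monitoring scripts, this check can be automated by scanning df output for mount points above a usage threshold. The 90 percent cutoff is an arbitrary example, and df's portable -P format is used so columns don't wrap:

```shell
#!/bin/bash
# flag any filesystem above THRESHOLD percent full (90 is an example value)
THRESHOLD=90
df -P | awk 'NR > 1 {gsub(/%/, "", $5); print $5, $6}' |
while read -r pct mnt; do
  # the 2>/dev/null guards against non-numeric capacity values
  if [ "$pct" -gt "$THRESHOLD" ] 2>/dev/null; then
    echo "mount point $mnt is ${pct}% full"
  fi
done
```

Such a script could mail its findings via mailx and be scheduled from cron just like the other checks in this chapter.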
To fix issues with a full mount point, first identify files that can either be moved or removed.
Usually, I first look for old trace files; often, there’s a gigabyte of old files that can safely be removed.
Locating the Alert Log and Trace Files