Monday, 17 October 2016

SQL and PL/SQL by Example

SQL and PL/SQL by Example
A number of our established “powerful” programming language features, even beloved ones, could very well turn out to belong rather to “the problem set” than to “the solution set.”

-Dutch computer scientist Edsger Dijkstra, advocate of structured programming and winner of the 1972 Turing Award, in “Correctness Concerns and, Among Other Things, Why They Are Resented”

All database activity is conducted in SQL, and therefore database administrators need to be intimately
familiar with it. Figure 2-1 illustrates that the greatest potential for performance improvement usually lies within the software application, not within the database where the application stores its data or within the physical infrastructure where the database is housed.

A common example of performance problems caused by applications is that the applications aren’t
designed to take advantage of indexing schemes. As an example of a performance problem that you might encounter in SQL, consider the query in Listing 2-1, which retrieves the names of employees with first_name “Steven” and last_name “King”. An index exists on the first_name and last_name columns of the employees table, but the query can’t use the index because it uses the upper function in the restriction. Because Oracle Database can’t use the index, it has to retrieve and examine every single record in the employees table.

Listing 2-1. An Inefficient SQL Query
SELECT first_name, last_name
FROM employees
WHERE upper(first_name) = 'STEVEN'
AND upper(last_name) = 'KING';

Another source of performance problems is the temptation to use the Oracle Database engine as a
calculator to perform arithmetic calculations and string manipulations. This is a highly inefficient use of SQL and is better done using math and string libraries instead. However, it is usually fine to issue a simple query against dual such as in Listing 2-2 that executes just one time.

Listing 2-2. Finding the Day of the Week
select to_char(sysdate, 'DAY') from dual;

An equally important reason why database administrators need to understand SQL is that all database
administration activities such as database maintenance and user management are also conducted in SQL. It should come as no surprise, therefore, that the Oracle Database 12c SQL Language Reference is almost 2,000 pages—compare this with the 20 or so pages in this chapter. Fortunately, you can go online and search the SQL manual or download the electronic version free of charge.3 You can purchase a printed copy if you like having a mountain of paper on your desk.

A third reason why database administrators need to understand SQL is that it has deficiencies that must be guarded against. These deficiencies include redundancy, problems introduced by nullable data items, and the absence of prohibitions on duplicate data records. I return to this subject later in this chapter.

TRUE STORY
One day, an angry software developer demanded that we find out why Oracle was not responding to “simple queries.” We found that he had submitted a query that indiscriminately joined seven tables. His query was of the form SELECT COUNT(*) ... FROM Table#1, Table#2, Table#3, Table#4, Table#5, Table#6, Table#7.4 The number of rows produced by such a query equals the product of the number of rows in each table; if each specified table contained 1,000 rows, the query would produce 100 trillion rows.

When we asked the software developer why he hadn’t specified any joining criteria, he said that he first wanted to determine whether Oracle could handle a “simple” query before submitting a “complex” query! 
It happened to me—it could happen to you!

Railroad Diagrams
SQL statements have many optional clauses, so the Oracle reference manuals use railroad diagrams as a visual aid. Figure 2-2 is an example of a railroad diagram for a hypothetical ROW command.

Start at the left of the diagram, and take any path the diagram allows you to take. Along the way,
you encounter mandatory words and clauses as well as optional words and clauses and even subdiagrams. The diagram indicates that the ROW command can take the forms indicated in Listing 2-3.

Listing 2-3. Examples of ROW Commands
ROW YOUR BOAT
ROW, ROW YOUR BOAT
ROW, ROW, ROW YOUR CORACLE
ROW, ROW, ROW, ROW YOUR PIROGUE
ROW, ROW, ROW, ROW, ROW YOUR LARGE CORACLE

Let’s look at some railroad diagrams for a simplified version of the SELECT statement. The SELECT
statement starts with the keyword SELECT and is used to retrieve data from a relational database. What you see here is a very small subset of the complete syntax, but it’s powerful enough for many purposes.

The diagrams in Figures 2-3 through 2-8 show that the keyword SELECT is followed a number of clauses:  the mandatory SELECT_list and FROM_clause and the optional WHERE_clause, GROUP_BY_clause, and
ORDER_BY_clause. Separate railroad diagrams are also shown for each of the clauses.

As an example, let’s generate a report of the total salary in each department, sorted by descending order of average salary. Let’s assemble the necessary clauses.

The SELECT_list clause is a comma-separated list of column names or expressions in which you’re
interested. Here’s what you need. The upper function produces the uppercase version of a character string, and the avg function produces the average of values in a group:
SELECT upper(department_name),
AVG(salary)

The WHERE_clause portion of the statement is a list of tables that must be joined. You need the departments table and the employees table. You give each table a short alias that will come in handy in other clauses:
FROM departments d,
employees e

The optional GROUP_BY_clause portion of the statement tells Oracle Database that you apply aggregate functions such as avg to groups of records. In this case, you need to group by upper(department_name):
GROUP BY upper(department_name)

Finally, the ORDER_BY_clause portion of the statement specifies how you want the report sorted. In this case, you need to sort by descending values of AVG(salary):
ORDER BY AVG(salary) DESC;

Let’s put this all together:
SELECT upper(department_name),
AVG(salary)
FROM departments d,
employees e
WHERE e.department_id = d.department_id
GROUP BY upper(department_name)
ORDER BY AVG(salary) DESC;

And here is the result:

UPPER(DEPARTMENT_NAME)                      AVG(SALARY)
------------------------------                                     -----------
EXECUTIVE                                                      19333.3333
ACCOUNTING                                                           10150
PUBLIC RELATIONS                                                 10000
MARKETING                                                                9500
SALES                                                                 8955.88235
FINANCE                                                                       8600
HUMAN RESOURCES                                                 6500
IT                                                                                    5760
ADMINISTRATION                                                     4400
PURCHASING                                                              4150
SHIPPING                                                           3475.55556

Notice the formatting of the results. Formatting instructions aren’t part of the SQL query; the formatting is left to the application program—SQL*Plus in this case—that sent the SQL query to the Oracle Database engine for processing.

Now take a minute and check out the complete syntax of the SELECT statement in the SQL reference
manual. Notice the subquery diagram in particular and how it references itself; just before the ORDER BY clause, you can optionally specify a UNION, INTERSECT, or MINUS clause followed by another subquery. Also observe that the subquery diagram is referenced by other diagrams. For example:

• If you check the Subquery Factoring Clause diagram, you see that it too refers to the subquery diagram.

• If you check the Table Reference subdiagram, you see that a table reference can include a subquery. This kind of subquery is called an inline view.

• Subqueries that produce a single value can be used any place where a single value is indicated. This kind of subquery is called a scalar subquery. For example, scalar subqueries can be used in a SELECT list and in a WHERE clause.

Friday, 7 October 2016

What Makes a Relational Database Management System Relational?

What Makes a Relational Database Management System Relational?

Having already discussed the meaning of both relational database and database management system, it may appear that the subject is settled. But the natural implications of the relational model are so numerous and profound that critics contend that, even today, a “truly relational” database management system doesn’t exist. For example, Dr. Edgar Codd wanted the database management system to treat views in the same manner as base tables whenever possible, but the problem of view updateability is unsolved to the present day. Codd listed more than 300 separate requirements that a database management system must meet in order to fulfill his vision properly, and I have time for just one of them: physical data independence. Here is the relevant quote from Codd’s book:

RP-1 Physical Data Independence: The DBMS permits a suitably authorized user to make changes in storage representation, in access method, or in both—for example, for performance reasons. Application programs and terminal activities remain logically unimpaired whenever any such changes are made.

What Codd meant was that you and I shouldn’t have to worry about implementation details such as the storage structures used to store data.

Summary
I hope that you now have an appreciation for the theoretical foundations of Oracle Database 12c. You can find more information about the subjects I touched on in the books mentioned in the bibliography at the end of the chapter. Here is a short summary of the concepts discussed in this chapter:

• A database is an information repository that must be competently administered using the principles laid out in the IT Infrastructure Library (ITIL), including security management, availability management, continuity management, change management, incident management, problem management, configuration management, release management, and capacity management.

• Relation is a precise mathematical term for what is loosely called a data table. Relational database technology swept aside earlier technologies because of the power and expressiveness of relational algebra and because it made performance the responsibility of the database management system instead of the application developer.

• A database management system provides efficient algorithms for the processing of table operations as well as indexing schemes for data storage. The query optimizer rearranges relational algebra expressions in the interests of efficiency but without changing the meaning or the results that are produced.

• A database management system is defined as a software layer that provides services such as transaction management, data security, and data integrity.

• A transaction is a logical unit of work characterized by atomicity, consistency, isolation, and durability.

• Relational database theory has many consequences, including logical data independence, which implies that changes to the way in which data is stored or indexed shouldn’t affect the logical behavior of application programs.

Further Reading

• Bachman, Charles. “The Programmer as Navigator.” ACM Turing Award Lecture, 1973. http://amturing.acm.org/award_winners/bachman_9385610.cfm. To properly understand Codd’s theory, you must understand what preceded it.

• Codd, E. F. “Relational Database: A Practical Foundation for Productivity.” ACM Turing Award Lecture, 1981. http://amturing.acm.org/award_winners/ codd_1000892.cfm. It’s a shame that almost no database professionals have read the papers and articles written by the inventor of relational database theory. In my opinion, the 1981 ACM Turing Award Lecture should be required reading for every database professional.

• Silberschatz, Abraham, Henry Korth, and S. Sudarshan. Database System Concepts. 6th ed. McGraw-Hill, 2010. If you’re going buy just one book on database theory, this is the one I recommend. This college textbook, now in its sixth edition, offers not only copious amounts of theory but also some coverage of commercial products such as Oracle Database.

Tuesday, 4 October 2016

Efficiency of Relational Operators

Efficiency of Relational Operators

You may have noticed that the previous section made no mention of efficiency. The definitions of the
table operations don’t explain how the results can be efficiently obtained. This is intentional and is one of the greatest strengths of relational database technology—it’s left to the database management system to provide efficient implementations of the table operations. In particular, the selection operation depends heavily on indexing schemes, and Oracle Database provides a host of such schemes, including B-tree indexes, index-organized tables, partitioned tables, partitioned indexes, function indexes, reverse-key indexes, bitmap indexes, table clusters, and hash clusters. I discuss indexing possibilities as part of physical database design in Chapter 7.

Query Optimization

Perhaps the most important aspect of relational algebra expressions is that, except in very simple cases, they can be rearranged in different ways to gain a performance advantage without changing their meaning or causing the results to change. The following two expressions are equivalent, except perhaps in the order in which data columns occur in the result—a minor presentation detail, not one that changes the meaning of the result:

Table_1 JOIN Table_2
Table_2 JOIN Table_1

The number of ways in which a relational algebra expression can be rearranged increases dramatically as the expression grows longer. Even the relatively simple expression (Table_1 JOIN Table_2) JOIN Table_3 can be arranged in the following 12 equivalent ways that produce results differing only in the order in which columns are presented—a cosmetic detail that can be easily remedied before the results are shown to the user:

(Table_1 JOIN Table_2) JOIN Table_3
(Table_1 JOIN Table_3) JOIN Table_2
(Table_2 JOIN Table_1) JOIN Table_3
(Table_2 JOIN Table_3) JOIN Table_1
(Table_3 JOIN Table_1) JOIN Table_2
(Table_3 JOIN Table_2) JOIN Table_1

Table_1 JOIN (Table_2 JOIN Table_3)
Table_1 JOIN (Table_3 JOIN Table_2)
Table_2 JOIN (Table_1 JOIN Table_3)
Table_2 JOIN (Table_3 JOIN Table_1)
Table_3 JOIN (Table_1 JOIN Table_2)
Table_3 JOIN (Table_2 JOIN Table_1)

It isn’t obvious at this stage what performance advantage, if any, is gained by rearranging relational
algebra expressions. Nor is it obvious what criteria should be used while rearranging expressions. Suffice it to say that a relational algebra expression is intended to be a nonprocedural specification of an intended result, and the query optimizer may take any actions intended to improve the efficiency of query processing as long as the result isn’t changed. Relational query optimization is the subject of much theoretical research, and the Oracle query optimizer continues to be improved in every release of Oracle Database. I return to the subject of SQL query tuning in Chapter 17.

What Is a Database Management System?
Database management systems such as Oracle are the interface between users and databases. Database management systems differ in the range of features they provide, but all of them offer certain core features such as transaction management, data integrity, and security. And, of course, they offer the ability to create databases and to define their structure, as well as to store, retrieve, update, and delete the data in the databases.

Transaction Management
A transaction is a unit of work that may involve several small steps, all of which are necessary in order not to compromise the integrity of the database. For example, a logical operation such as inserting a row into a table may involve several physical operations such as index updates, trigger operations, and recursive operations. A transaction may also involve multiple logical operations. For example, transferring money from one bank account to another may require that two separate rows be updated. A DBMS needs to ensure that transactions are atomic, consistent, isolated, and durable.

The Atomicity Property of Transactions
It’s always possible for a transaction to fail at any intermediate step. For example, users may lose their connection to the database, or the database may run out of space and may not be able to accommodate new data that a user is trying to store. If a failure occurs, the database management system performs automatic rollback of the work that has been performed so far. Transactions are therefore atomic or indivisible from a logical perspective. The end of a transaction is indicated by an explicit instruction such as COMMIT.

The Consistency Property of Transactions
Transactions also have the consistency property. That is, they don’t compromise the integrity of the
database. However, it’s easy to see that a database may be temporarily inconsistent during the operation of the transaction. In the previous example, the database is in an inconsistent state when money has been subtracted from the balance in the first account but has not yet been added to the balance in the second account.

The Isolation Property of Transactions
Transactions also have the isolation property; that is, concurrently occurring transactions must not interact in ways that produce incorrect results. A database management system must be capable of ensuring that the results produced by concurrently executing transactions are serializable: the outcome must be the same as if the transactions were executed in serial fashion instead of concurrently.

For example, suppose that one transaction is withdrawing money from a bank customer’s checking
account, and another transaction is simultaneously withdrawing money from the same customer’s
savings account. Let’s assume that negative balances are permitted as long as the sum of the balances in the two accounts isn’t negative. Suppose that the operation of the two transactions proceeds in such a way that each transaction determines the balances in both accounts before either of them has had an opportunity to update either balance. Unless the database management system does something to prevent it, this can potentially result in a negative sum. This kind of problem is called write skew.

A detailed discussion of isolation and serializability properly belongs in an advanced course on application development, not in a beginner text on database administration. If you’re interested, you can find more information in the Oracle 12c Advanced Application Developer’s Guide, available at
https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/toc.htm.

The Durability Property of Transactions
Transactions also have the durability property. This means once all the steps in a transaction have been successfully completed and the user notified, the results must be considered permanent even if there is a subsequent computer failure, such as a damaged disk. I return to this topic in the chapters on database backups and recovery; for now, note that the end of a transaction is indicated by an explicit command such as COMMIT.

Data Integrity
Data loses its value if it can’t be trusted to be correct. A database management system provides the ability to define and enforce what are called integrity constraints. These are rules you define, with which data in the database must comply. For example, you can require that employees not be hired without being given a salary or an hourly pay rate.

The database management system rejects any attempt to violate the integrity constraints when inserting, updating, or deleting data records and typically displays an appropriate error code and message. In fact, the very first Oracle error code, ORA-00001, relates to attempts to violate an integrity constraint. It’s possible to enforce arbitrary constraints using trigger operations; these can include checks that are as complex as necessary, but the more common types of constraints are check constraints, uniqueness constraints, and referential constraints. These work as follows:

• Check constraints: Check constraints are usually simple checks on the value of a data item. For example, a price quote must not be less than $0.00.

• Uniqueness constraints: A uniqueness constraint requires that some part of a record be unique. For example, two employees may not have the same employee number. A unique part of a record is called a candidate key, and one of the candidate keys is designated as the primary key. Intuitively, you expect every record to have at least one candidate key; otherwise, you would have no way of specifying which records you needed. Note that the candidate key can consist of a single item from the data record, a combination of items, or even all the items.

• Referential constraints: Consider an employee database in which all payments to employees are recorded in a table called SALARY. The employee number in a salary record must obviously correspond to the employee number in some employee record; this is an example of a referential constraint.

Data Security
A database management system gives the owners of the data a lot of control over their data—they can
delegate limited rights to others if they choose to. It also gives the database administrator the ability to restrict and monitor the actions of users. For example, the database administrator can disable the password of an employee who leaves the company, to prevent them from gaining access to the database. Relational database management systems use techniques such as views (virtual tables defined in terms of other tables) and query modification to give individual users access to just those portions of data they’re authorized to use.

Oracle also offers extensive query-modification capabilities under the heading Virtual Private Database (VPD). Additional query restrictions can be silently and transparently appended to the query by policy functions associated with the tables in the query. For example, the policy functions may generate additional query restrictions that allow employees to retrieve information only about themselves and their direct reports between the hours of 9 a.m. and 5 p.m. on weekdays only.
I return to the subject of data security in Chapter 8.