One of the most painful areas of working on large Data Warehouse databases is managing statistics, especially on ETL databases.
The nature of ETL databases and the way statistics are meant to work are poles apart, so let us first understand statistics.
At a very high level, statistics are used by the Oracle Optimizer to gradually evolve a good path to reach your data, and they change incrementally with the data that comes in (internally they are measured with the help of histograms; let us not get into histograms now). ETL data, on the other hand, does not change incrementally; it changes wholesale. Most ETL jobs are staged into temporary tables where the bulk of the processing happens before the data is actually put into use. These "staging" tables are the big pain point: the statistics on these tables are toast once the tables are recreated by the staging operation.
Per the exact verbiage from the Oracle documentation, statistics are "marked" as STALE if more than 10% of the data is changed. Hypothetically, whether 11% or 90% of the data has changed, the statistics are still only "marked" as STALE, which makes the optimizer behave in a funky way. The actual percentage of change is not directly visible (there are a lot of Oracle MOS internal tools to measure it, but that is beyond the scope of this blog).
-- Per Oracle Documentation
Objects are considered stale when 10% of the total rows have been changed. When you issue GATHER_TABLE_STATS with GATHER STALE,
the procedure checks the USER_TAB_MODIFICATIONS view. If a monitored table has been modified more than 10%, then statistics are gathered again. The information about changes of tables, as shown in the USER_TAB_MODIFICATIONS view, can be flushed from the SGA into the data dictionary with the DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO procedure.
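To make the bookkeeping described above concrete, the modification counters and the staleness verdict can be inspected directly from the dictionary. A minimal sketch, assuming the standard views DBA_TAB_MODIFICATIONS and DBA_TAB_STATISTICS (adjust the owner filter for your environment):

```sql
-- Flush the in-memory (SGA) modification counters to the data dictionary
-- so the views below reflect the latest DML activity.
EXEC DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

-- Raw change counters per monitored table
SELECT table_owner, table_name, inserts, updates, deletes, truncated
  FROM dba_tab_modifications
 WHERE table_owner = 'OE';          -- sample schema, adjust as needed

-- Oracle's verdict: which tables are currently marked STALE
SELECT owner, table_name, stale_stats, last_analyzed
  FROM dba_tab_statistics
 WHERE owner = 'OE'
   AND stale_stats = 'YES';
```

Note that STALE_STATS only says YES or NO; it does not tell you how far past the threshold the table has drifted.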
The following Oracle Database Reference and blogs help provide deeper insight into this.
The general workaround is to improve the data model these ETL jobs work against, but a lot of ETL tools have built-in mechanisms that cannot be changed by the poor developer who is supposed to work with them. So the only practical solution is to ensure the stats are updated "at that point of time", just before the ETL actually starts.
This is exactly why a dynamic stale stats script is needed to ensure the ETL jobs run as expected. Consider this script a multivitamin supplement for your database rather than a cure for the ETL solution; for the cure, the ETL jobs' data model holds the key.
Below is my GitHub utility with which stale stats scripts can be dynamically generated.
The utility produces a script in the format below, which updates statistics for the stale tables and their corresponding indexes.
EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME =>'OE' , TABNAME =>'ORDERS' ,OPTIONS =>'GATHER AUTO' , CASCADE =>TRUE , DEGREE =>8 , METHOD_OPT =>'FOR ALL COLUMNS SIZE AUTO');
EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME =>'HR' , TABNAME =>'EMPLOYEES' ,OPTIONS =>'GATHER AUTO' , CASCADE =>TRUE , DEGREE =>8 , METHOD_OPT =>'FOR ALL COLUMNS SIZE AUTO');
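For illustration only, statements of this shape could be generated dynamically with a simple dictionary query. This is a sketch rather than the actual GitHub utility (which may differ in detail), assuming DBA_TAB_STATISTICS.STALE_STATS as the staleness source and a fixed DEGREE of 8:

```sql
-- Generate a GATHER_TABLE_STATS call for every table marked STALE.
-- Spool the output to a .sql file and run it before the ETL starts.
SET PAGESIZE 0 LINESIZE 400 FEEDBACK OFF
SELECT 'EXECUTE DBMS_STATS.GATHER_TABLE_STATS(OWNNAME =>''' || owner ||
       ''' , TABNAME =>''' || table_name ||
       ''' ,OPTIONS =>''GATHER AUTO'' , CASCADE =>TRUE , DEGREE =>8 , ' ||
       'METHOD_OPT =>''FOR ALL COLUMNS SIZE AUTO'');'
  FROM dba_tab_statistics
 WHERE stale_stats = 'YES'
   AND owner NOT IN ('SYS','SYSTEM');   -- skip dictionary schemas
```

CASCADE =>TRUE is what pulls the corresponding indexes into the gather along with the table.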
I request that this script be tested and used with the degree of parallelism appropriate to your environment; the number of stale stats per schema can vary considerably from one environment to another.