Cong Li

Posted on Jul 26

GBase 8a Migration Plan Based on Netezza (1) - Migration Methods and Recommendations

#database

1. Introduction

This article introduces the methods and recommendations for migrating from Netezza to GBase 8a MPP Cluster, including hardware configuration replacement, cluster scale assessment, table and data migration plans, and post-migration data auditing.

Future articles will cover successful project cases of replacing Netezza with GBase 8a MPP Cluster and the differences between the two databases in data types, operators, predicates, and functions during migration.

This plan is suitable for project managers, developers, project plan writers, and implementers involved in the migration from Netezza to GBase 8a MPP Cluster.

The contents of each chapter are as follows:

Chapter 1: Overview of the plan content and applicable scope.
Chapter 2: Introduction of hardware configuration replacement for GBase 8a MPP Cluster and cluster scale assessment methods.
Chapter 3: Introduction of the migration process, object migration, data migration, and data auditing methods.

2. Hardware Resources and Node Scale Assessment

This chapter provides a general plan for the evaluation of hardware resources and cluster scale when replacing Netezza with GBase 8a MPP Cluster. Specific evaluations should consider the actual business, hardware, and network conditions of the customer.

2.1 Introduction to Netezza Hardware Configuration

According to IBM official documentation, Netezza hardware platforms are divided into several series (see Table 2-1), with detailed configurations for each series listed in Table 2-2.

Table 2-1

Table 2-2

2.2 Hardware Configuration and Cluster Scale Assessment

2.2.1 Hardware Evaluation Method

Netezza appliances have certain hardware optimizations, so the performance of X86_PC Server hardware with the same configuration may differ from Netezza appliances. Therefore, when evaluating hardware for GBase 8a MPP Cluster deployment, it is recommended that the configuration be superior to that of the Netezza appliance.

CPU Configuration Evaluation: GBase 8a MPP Cluster is mainly used in data IO-intensive scenarios. The system performance is closely related to IO capability, and CPU usage is mainly for data compression, decompression, and SQL calculations. CPU configuration is usually based on project experience and current market technology and prices, generally matching or slightly exceeding the Netezza CPU core count.
Memory Configuration Evaluation: This is usually related to the amount of data managed per node and CPU configuration. Industry practice often uses 4GB to 16GB of memory per CPU core. For high-stress comprehensive analysis scenarios, 16GB of memory per CPU core is recommended, with at least 256GB of memory.
Disk Configuration: Based on the raw data stored in Netezza, future data growth, and project budget, disk capacity should be evaluated.

2.2.2 Cluster Scale Evaluation Method

Based on the optimal data amount per node and considering that GBase 8a MPP Cluster typically configures 1 replica (primary + replica), the cluster scale evaluation formula is:

Cluster Scale = (Migrated Raw Data + Future Data Growth) * 2 ÷ Data Amount per Node (uncompressed raw data)

When evaluating cluster size, consider the following two points:

It is recommended to use SSD or NVMe disks. The disk utilization should not exceed 80%. For instance, if you are storing 5TB of compressed data files, you should prepare approximately 7TB of available disk capacity.
For SAS or SATA disks, the disk performance gradually degrades as capacity increases. When the total capacity exceeds 70%, IO performance drops sharply. Therefore, for clusters using SAS or SATA disks, each node should reserve sufficient disk space. For example, if storing 5TB of compressed data files, you should prepare no less than 8TB of available disk capacity.

2.2.3 Hardware Evaluation Example

In a real case of replacing Netezza, the project used Netezza to handle external data transmission, store and manage internal data, and provide timely data query analysis for the upper application layer. The hardware configuration of Netezza and the deployment hardware configuration of GBase 8a MPP Cluster are shown in Table 2-3, with a deployment scale of 6 nodes (3 management nodes, 6 data nodes), and 12T of uncompressed raw data.

1) Deployment Hardware Configuration Evaluation: Based on Netezza appliance hardware configuration and the evaluation method in Section 2.2, the hardware configuration for deploying GBase 8a MPP Cluster was assessed. The single-node CPU core count of GBase 8a MPP Cluster is slightly higher than that of each Netezza blade. Disk capacity evaluation considers current raw data in Netezza and future data growth, with disk capacity about twice that of Netezza. Memory is calculated based on 16GB per CPU core (16 * 12 = 192GB, recommended 256GB memory).

Table 2-3

2) Cluster Scale Evaluation: For 12T of raw data, following the cluster scale evaluation method in Section 2.2, the steps are as follows:

Raw data storage requirement is 12TB, configured with 1 replica, without considering future data growth.
Based on the recommended value of 5T per node (15T uncompressed data) and the requirement of at least 3 management nodes, the minimum cluster scale is 3 (12 * 2 ÷ 15 ≈ 3).
Considering future data growth, new system performance enhancement, and project budget, it is recommended to deploy 6 nodes (3 management nodes + 6 data nodes).

3. Migration Plan

Before performing the Netezza to GBase 8a MPP Cluster migration, it is necessary to develop a migration plan, migration schedule, and migration evaluation to create an implementable migration method and plan.

3.1 Migration Content

Netezza migration to GBase 8a MPP Cluster generally involves three main tasks: database object migration, data migration, and business migration.

Database Object Migration: Export DDL scripts from Netezza and modify them in bulk according to GBase 8a MPP Cluster syntax to ensure successful object creation in GBase 8a MPP Cluster.
Data Migration: Export historical data from Netezza using the specified hexadecimal delimiter and character encoding through the creation of external tables or using nzsql -c "select * from tb" to generate text files. Load data into GBase 8a MPP Cluster using the load tool.
Business Migration: Convert Netezza stored procedures to GBase 8a MPP Cluster syntax, replacing functions according to a function comparison table. Manually rewrite unsupported functions.

3.2 Migration Process

Refer to the following process for Netezza migration:

Collect Original Data: Assess original data and disk space usage.
Import Table Structure: Create GBase 8a MPP Cluster database and table structures based on Netezza table structures.
Data Classification: Count data volume in Netezza tables and classify them by data size.
Export Original Data: Export data in batches from Netezza to text files based on the classification, storing them on a designated server.
Load Data: Load exported data text files into GBase 8a MPP Cluster using load data infile.
Data Validation: Create stored procedures and establish statistical tables to compare and validate data between the original and new databases.
Incremental Data Import: After full data import, identify and import incremental data, performing data validation.

3.3 Table Structure Migration

Database object migration includes the migration of table structures and fields. For Netezza database objects such as table column names, data types, lengths, precisions, constraints, and indexes, compare and convert them to the GBase 8a MPP Cluster database. As a data warehouse database, GBase 8a MPP Cluster supports SQL standard table creation syntax, but syntax details differ slightly from Netezza. The following considerations are important:

3.3.1 Table Structure Migration

Table Migration: Although GBase 8a MPP Cluster supports SQL standard syntax for table creation, syntax details may differ from Netezza. For example, Netezza table column comments use comment on, whereas GBase 8a MPP Cluster uses comment as a keyword.
Random Distribution Table: In Netezza, distribute on random explicitly specifies random data distribution. GBase 8a MPP Cluster uses random distribution by default if no distribution key is specified.
Replication Table: Netezza does not support replication tables (tables with the same data on all nodes), but GBase 8a MPP Cluster does. Small, frequently joined tables (e.g., dimension tables) can be set as replication tables to avoid unnecessary data pulling across nodes.
Selection of Data Distribution Key: Netezza automatically selects a column as the distribution key if none is specified. GBase 8a MPP Cluster V953 supports multiple distribution keys; typically, the Netezza distribution key is reused during migration.
Compression Strategy Selection: GBase 8a MPP Cluster can specify compression algorithms for each table, offering lightweight (5,5) and heavyweight (3,1) compression. If no specific compression requirement exists, use the (5,5) algorithm.
Temporary Table: GBase 8a MPP Cluster's temporary tables are session-level and automatically deleted upon session termination. To retain intermediate table data, do not create them as temporary tables. Add the TEMPORARY keyword after CREATE for temporary tables.
Fields: Field definitions in GBase 8a MPP Cluster resemble those in traditional relational databases. Migration should reference documentation provided by NanJing General for guidance on field syntax comparisons, including character, numeric, date, and large field types.
Indexes: As a distributed column-store database focused on massive data queries and statistical analysis, GBase 8a MPP Cluster has different index principles than traditional databases. It supports full-column, automated maintenance, and coarse-grained indexes for filtering. For precise query optimization, use HASH indexes, which improve performance for equality queries, especially for large data volumes. Netezza's ORGANIZE ON keyword for clustered indexes can be considered for sorting cluster index columns.

3.4 Data Migration

Data migration is a critical step in the overall database migration process. It includes data initialization for full synchronization and regular periodic synchronization. Initial full synchronization involves transferring historical data from the existing data warehouse system to the new database, while regular periodic synchronization ensures that, after the initial full synchronization, the new database is kept in sync with the existing data warehouse. In scenarios involving massive amounts of data, batch export and loading methods, which are more efficient, are primarily used.

3.4.1 Preparations Before Data Migration

Before formally migrating Netezza, the following preparations can be made:

Considering data security, it is recommended to request a regular Netezza database user (without deletion permissions) from the customer, who only has select and export permissions.
Classify and inventory the databases and tables in Netezza, categorizing them based on data DDL, DML operations, and the importance of the tables. Prioritize the tables based on these categories. Record the time columns (primary column names) of each table to prepare for exporting full data using WHERE conditions later.
For data containing "null" values and leading or trailing spaces, communicate with the customer in advance to decide whether to retain the spaces, ensuring data consistency.
Before migration, distinguish between full and incremental data. Use Netezza's data DDL operations table and data table importance to differentiate. Confirm the division of full and incremental data with the application developer.
- For example, for a Netezza data DDL operations table, if the data before October 2021 is historical (full data) and not updated, and data from November 2021 is updated daily (incremental data), export full data based on the WHERE condition (first entry before October 2021).
- For incremental data, consider parallel loading into both Netezza and GBase 8a MPP Cluster, followed by data comparison. Temporary data tables, which do not differentiate between full and incremental data, can use the SELECT * FROM tb export method.
Evaluate the time window for full data migration to decide if it should be done in one go or in batches. The evaluation factors include:
- Source database migration data volume: Assess the compression ratio.
- Allowed downtime window (source database needs to be read-only with a light load): This is a hard constraint.
- Source database export performance: Test to determine export performance.
- Number of file servers, IO performance, and network bandwidth to the 8a cluster: These are constraints for the migration environment.
- GBase 8a MPP Cluster node loading performance: Consider both IO performance and network bandwidth.
- Type of incremental business (append only, IDU, full loading zipper table): Decide the way to track incremental data and support batch migration.
- Whether the data warehouse design supports layering and idempotent data processing: Determine if migration can be done vertically or horizontally by business or layer.
Assess temporary space required for data migration.
- Given that Netezza stores large amounts of data, there may be insufficient available disk space to support local export.
- Handle this by exporting in batches, transferring exported data to a file server, or using external storage like high-capacity mobile disks or NAS file systems.

3.4.2 Evaluation of Data Migration Time Window

Before the migration, implementation personnel need to evaluate the overall project migration time to determine if it meets the overall project schedule requirements of the application developer. The evaluation formula is as follows:

Total Data Migration Time = Nettezza Export Time + GBase 8a Load Time + File Transfer Time

File Transfer Time = File Size / Disk or Network IO Performance

(Note: The file transfer time depends on the file system to which the data file is exported. For example, if the data is exported table by table to the Nettezza appliance's file system, the table files need to be transferred to a file server first and then loaded uniformly from the file server.)

Nettezza Export Time = Nettezza Stored Data Volume (in GB) / Nettezza Parallel Export Performance (GB/hour)

(Note: The Nettezza export time should include the periods when data export cannot be performed due to busy business hours. Typically, there can be a 12-hour window each day to perform data export tasks.)

GBase 8a Load Time = Nettezza Exported Data Volume (in GB) / GBase 8a Parallel Load Performance (GB/hour)

GBase 8a Parallel Load Performance = Number of Load Machines * 1000MB/s * 1/2 * 3600

When evaluating the Nettezza migration time, the following points should be noted:

The Nettezza stored data volume must be converted into the data volume of the exported flat files, so the compression ratio must be evaluated.
The performance of Nettezza's parallel export needs to be tested, considering Nettezza's load during the export.
The parallel load performance of GBase 8a MPP Cluster needs to be tested. The method for estimating time can refer to the formula for calculating GBase 8a MPP Cluster parallel load performance.

3.4.3 Data Unloading

Before performing the formal data migration, the data unloading plan is an essential aspect for implementation engineers to consider. This includes factors such as the time required for data unloading and the storage location of the unloaded data. In a practical Nettezza migration case, the Nettezza network environment was gigabit, and unloading 20TB of raw data from Nettezza, after testing by on-site support engineers, had an unloading rate of only 200MB/s, with a maximum work time of 12 hours per day. After evaluation, it was determined that under these conditions, unloading 20TB of data would take a minimum of 1 month. To meet the project's final completion deadline, the client upgraded Nettezza's gigabit network environment to a ten-gigabit network, increasing the unloading rate to 700MB/s and reducing the unloading time by two-thirds, significantly shortening the overall project timeline. Therefore, before migration, implementation engineers need to evaluate the data unloading time (for specific evaluation methods, refer to section 3.4.2 Evaluation of Data Migration Time Window).

Currently, in the client's production environment, most of the Nettezza appliance disks have limited capacity. When unloading data locally, storage issues need to be considered. Data can be stored via NAS or external disks, and the specific method should be chosen based on the project's actual situation.

Data can be unloaded from Nettezza using either the CREATE EXTERNAL TABLE method or the nzsql -c "select * from tb" export method, and both methods should be used in combination according to actual conditions.

Using nzsql -c "select * from tb" for export:

Advantages: Allows specifying a line break character, making data loading easier. Suitable for unloading small tables or tables with carriage returns.
Disadvantages: Slower unloading rate.

Using CREATE EXTERNAL TABLE for data unloading:

Advantages: Faster unloading speed. Under gigabit network conditions, the speed can reach 200MB/s; under ten-gigabit network conditions, the unloading rate can reach 700MB/s. Recommended for unloading large tables, with delimiters using hexadecimal symbols.
Disadvantages: Cannot specify enclosing characters during unloading.

3.4.4 Data Loading

Data Initialization Full Synchronization
- Synchronize data from the existing data warehouse to the new one, ensuring consistency.
- Create table structures.
- Use LOAD DATA INFILE SQL to load data.
- Compare the number of records before and after loading.
- Verify and correct any garbled data using Loader_log.
- For batch synchronization, continue loading incremental data until synchronization is achieved.
Regular Periodic Data Synchronization
- After initial full synchronization, maintain data consistency with regular synchronization based on the required timeliness.
- Only migrate raw data without secondary processing.
Data Acquisition (Push Data)
- Store data files in the file system server via the top-level system.
- Use SFTP to load data files into GBase 8a MPP Cluster.

3.5 Business Migration

3.5.1 Stored Procedure Migration

Business systems built in Netezza often use stored procedures for data processing. Convert Netezza's SQL to GBase 8a SQL, handle stored procedures, views, and UDFs. Use script conversion tools for 90% automation, with manual rewriting for the remaining 10%.

Ensure the original business logic and algorithms function correctly.
Minimize changes to flow control scripts.
Adapt Netezza SQL details to GBase 8a MPP Cluster SQL.

3.5.2 Script Program Migration

Business scripts using shell, python, etc., encapsulate business SQL. Migrate scripts by rewriting SQL syntax and replacing database interfaces with tools like gccli or python APIs for GBase 8a MPP Cluster.

3.5.3 Development Interface Migration

This work is for application developers, who replace Netezza's interfaces with GBase 8a MPP Cluster's JDBC, ODBC, ADO.Net, and rewrite SQL syntax.

3.6 Introduction to Migration Tools

Use migration tools to reduce labor and time costs and ensure uniform migration standards. Available tools include:

GBase Migration Toolkit: Migrates table structures and data from the source database to GBase 8a MPP Cluster, with adjustments for optimization.
ntzddl2gb: A python script for DDL migration, supporting comments, distribution columns, and field types, covering 99% of table object migration.
ntzproc2gb: A python script for automatic rewriting of 90% of stored procedures, supporting:
- %rowtype migration
- CHARACTER VARYING(ANY) to (10)
- Conversion of chr to char
- Alias FOR $x to input parameters
- Conversion of raise notice to select, timestamp to datetime
- Dynamic SQL variable declaration and execution
- Variable assignment conversion
- Exception handling conversion
- FOR loop conversion
- Associative update conversion
- Comment conversion (--)
- max, min multi-parameter input conversion
- truncate to trunc conversion
- char type conversion
- :: type conversion.

3.7 Data Auditing

3.7.1 Auditing Approach

Two approaches for data auditing:

Top-down auditing: Verify upper-level metrics first.
Bottom-up error tracing: Investigate discrepancies in upper-level metrics from the bottom.

3.7.2 Auditing Methods

Data auditing includes raw data, processing stage, and application access data auditing. Use the lifecycle to decide focus areas:

Total Audit
- Verify total record count, total value, mean, max, and min values to ensure no duplicate or missing data.
Partial Audit
- Audit data distribution for each dimension and measure. Correct distribution ensures correct multi-dimensional analysis.
Comprehensive Audit
- For large datasets, compare results processed on both the existing and new systems directly.

Raw Data Auditing:

Initialization: Record count and metric checks.
Incremental: Compare data post-loading to resolve discrepancies, requiring manual intervention for analysis and solution proposals.

This concludes the migration plan overview. Subsequent articles will showcase actual migration case studies. Stay tuned!

DEV Community