DEV Community

GauravKankaria

Creating Feature Store for streamlined Model Building using Amazon SageMaker

In my earlier blog, “Performing Advance Analytics in NBFCs using Data Lake and Customer 360 using Feature Store on AWS cloud environment”, we saw how the Feature Store capability of Amazon SageMaker helped businesses save 1,000+ person-hours and sped up their ML model building.

This blog is a hands-on guide for technical folks on creating a Feature Store on Amazon SageMaker for their organization. If you are an aspiring data scientist or machine learning engineer looking to work with the AWS ML tech stack, this blog is the perfect starting point for learning and exploring Amazon SageMaker’s Feature Store.

You can follow the steps below to learn how to create and use a feature store. You can also download the data and code used in the tutorial from the links below:

  1. Code - Link
  2. Data - Link

Step 1:

Create or use an S3 bucket to hold the input data.

Example: I created a bucket named “quickstart-feature-store-demo” and, on successful creation, uploaded the transaction data CSV file to it.
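If you prefer to script this step instead of using the console, the bucket creation and upload can be done with boto3. A minimal sketch — the `upload_transactions` helper and the `us-east-1` default region are illustrative assumptions, not part of the tutorial code:

```python
def upload_transactions(local_path, bucket, key, region='us-east-1'):
    """Create the bucket (if needed) and upload the CSV.

    Requires AWS credentials with s3:CreateBucket and s3:PutObject.
    """
    import boto3
    s3 = boto3.client('s3', region_name=region)
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, pass CreateBucketConfiguration too
    s3.upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)

def s3_uri(bucket, key):
    """Build the s3:// location that pandas.read_csv can read directly."""
    return f's3://{bucket}/{key}'

print(s3_uri('quickstart-feature-store-demo', 'Transaction Dataset Sample Features.csv'))
# → s3://quickstart-feature-store-demo/Transaction Dataset Sample Features.csv
```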

Step 2:

Open Amazon SageMaker Studio and create a new notebook named “Feature Store”.

Step 3:

Get started by importing the required modules and doing the basic setup.

#Imports
from sagemaker.feature_store.feature_group import FeatureGroup
from time import gmtime, strftime, sleep
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys
import boto3
from datetime import datetime, timezone, date

#Compare versions numerically; a plain string comparison misorders e.g. '2.100' vs '2.48.1'
from packaging.version import Version
if Version(sagemaker.__version__) < Version('2.48.1'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.48.1'])
    importlib.reload(sagemaker)

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

#Default Settings
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  #IAM role, needed later when creating the feature group
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'
region = sagemaker_session.boto_region_name


Step 4:

Read data from input S3 bucket

#Update the variables below to point to your input data
bucket='quickstart-feature-store-demo'
data_key = 'Transaction Dataset Sample Features.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

orders_df = pd.read_csv(data_location)


Note: If you get a Forbidden (403) error while reading the data, follow the steps below:

  • Print the role being used with the command below
  • Ensure the printed role has read access to the S3 bucket containing the input data
#Get IAM role used by sagemaker
role = sagemaker.get_execution_role()


Step 5:

Data Processing

#Generate an event timestamp (note: the row value x is ignored; the current time is used)
def generate_event_timestamp(x):
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time

orders_df['event_time'] = orders_df['Date'].apply(lambda x : generate_event_timestamp(x))

#Feature Store string features require string-typed columns
for col in orders_df.columns:
    orders_df[col] = orders_df[col].astype('string')

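For reference, the helper above always stamps rows with the current UTC time rather than the row's Date value. The target format is ISO-8601 with millisecond precision and a trailing Z, which is what the string event-time feature expects. A self-contained sketch of just the formatting, assuming for simplicity that naive datetimes should be treated as UTC:

```python
from datetime import datetime, timezone

def to_event_time(dt):
    """Format a datetime as the ISO-8601 UTC string used for event_time,
    e.g. '2023-01-15T10:30:00.000Z'. Naive datetimes are treated as UTC."""
    utc_dt = dt.astimezone(timezone.utc) if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
    return utc_dt.isoformat(timespec='milliseconds').replace('+00:00', 'Z')

print(to_event_time(datetime(2023, 1, 15, 10, 30, tzinfo=timezone.utc)))
# → 2023-01-15T10:30:00.000Z
```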

Step 6:

Create a feature group and its definition.

#Create feature group
current_timestamp = strftime('%m-%d-%H-%M', gmtime())
orders_feature_group_name = f'orders-{current_timestamp}'
%store orders_feature_group_name

#Create feature group definition
logger.info(f'Orders feature group name = {orders_feature_group_name}')
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)
orders_feature_group.load_feature_definitions(data_frame=orders_df)

def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')

orders_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                            record_identifier_name='OrderId', 
                            event_time_feature_name='event_time', 
                            role_arn=role, 
                            enable_online_store=True)
wait_for_feature_group_creation_complete(orders_feature_group)


Step 7:

Ingest data into feature store

%%time

logger.info(f'Ingesting data into feature group: {orders_feature_group.name} ...')
orders_feature_group.ingest(data_frame=orders_df, max_processes=16, wait=True)
logger.info(f'{len(orders_df)} order records ingested into feature group: {orders_feature_group.name}')


Note: It can take a couple of minutes for the ingested data to appear in the offline store’s S3 bucket.
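Once ingestion finishes, individual records are readable from the online store almost immediately, keyed by the record identifier. A sketch using the `sagemaker-featurestore-runtime` boto3 client — `get_order_record`, its region default, and the sample values are illustrative placeholders:

```python
def get_order_record(feature_group_name, order_id, region='us-east-1'):
    """Fetch one record from the online store by its record identifier."""
    import boto3
    runtime = boto3.client('sagemaker-featurestore-runtime', region_name=region)
    response = runtime.get_record(FeatureGroupName=feature_group_name,
                                  RecordIdentifierValueAsString=order_id)
    return record_to_dict(response.get('Record', []))

def record_to_dict(record):
    """Flatten the API's [{'FeatureName': ..., 'ValueAsString': ...}] list."""
    return {f['FeatureName']: f['ValueAsString'] for f in record}

# Shape of the response being flattened:
sample = [{'FeatureName': 'OrderId', 'ValueAsString': '1001'},
          {'FeatureName': 'event_time', 'ValueAsString': '2023-01-15T10:30:00.000Z'}]
print(record_to_dict(sample))
# → {'OrderId': '1001', 'event_time': '2023-01-15T10:30:00.000Z'}
```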

Step 8:

You can use a boto3 session to list the feature groups and query the data as well.

#Get a list of all feature stores
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
sagemaker_client.list_feature_groups()


You can use the following AWS documentation link to try out other operations, such as fetching batch data in bulk.
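One common pattern for bulk reads is querying the offline store through Athena, which the SDK wraps with `FeatureGroup.athena_query()`. A sketch, assuming the feature group from Step 6 and that its offline store has finished populating — `query_offline_store` and `build_preview_sql` are illustrative helpers:

```python
def build_preview_sql(table_name, limit=10):
    """SQL used to preview rows from the offline store table."""
    return f'SELECT * FROM "{table_name}" LIMIT {limit}'

def query_offline_store(feature_group, s3_output):
    """Run an Athena query over the offline store and return a DataFrame."""
    query = feature_group.athena_query()  # resolves the Glue table for the group
    query.run(query_string=build_preview_sql(query.table_name),
              output_location=s3_output)
    query.wait()
    return query.as_dataframe()

# Example (requires an active SageMaker session and Athena permissions):
# df = query_offline_store(orders_feature_group, f's3://{default_bucket}/query_results/')

print(build_preview_sql('orders_feature_table'))
# → SELECT * FROM "orders_feature_table" LIMIT 10
```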

Step 9:

Amazon SageMaker Feature Store keeps the offline store’s table metadata in the AWS Glue Data Catalog. We can use the Glue console to view the table.
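You can verify this from code as well. By default the offline-store tables are registered under a Glue database named `sagemaker_featurestore` (confirm the exact name in your console); a sketch, where `list_feature_store_tables` is an illustrative helper:

```python
def list_feature_store_tables(database='sagemaker_featurestore', region='us-east-1'):
    """List offline-store table names registered in the Glue Data Catalog."""
    import boto3
    glue = boto3.client('glue', region_name=region)
    names = []
    for page in glue.get_paginator('get_tables').paginate(DatabaseName=database):
        names.extend(extract_names(page['TableList']))
    return names

def extract_names(table_list):
    """Pull just the Name field out of Glue's table descriptions."""
    return [t['Name'] for t in table_list]

# Shape of the data being extracted:
print(extract_names([{'Name': 'orders_01_15_10_30'}, {'Name': 'customers_01_15'}]))
# → ['orders_01_15_10_30', 'customers_01_15']
```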


Conclusion:

In this article, we learned the following:

  1. The importance of a feature store
  2. How to create and implement a feature store on Amazon SageMaker
  3. Ways to access feature stores on Amazon SageMaker
