Lena Hall for Microsoft Azure

Posted on May 20, 2020 • Edited on Aug 14, 2020

Large-scale Data Analytics with Azure Synapse - Workspaces with CLI

#cloud #azure #computerscience #data

One of the challenges of large scale data analysis is being able to get the value from data with least effort. Doing that often involves multiple stages: provisioning infrastructure, accessing or moving data, transforming or filtering data, analyzing and learning from data, automating the data pipelines, connecting with other services that provide input or consume the output data, and more. There are quite a few tools available to solve these questions, but it's usually difficult to have them all in one place and easily connected.

If this article was helpful or interesting to you, follow @lenadroid on Twitter.

Introduction

This is the first article in this series, which will cover what Azure Synapse is and how to start using it with Azure CLI. Make sure your Azure CLI is installed and up-to-date, and add a synapse extension if necessary:

$ az extension add --name synapse

What is Azure Synapse?
In Azure, we have Synapse Analytics service, which aims to provide managed support for distributed data analysis workloads with less friction. If you're coming from GCP or AWS background, Azure Synapse alternatives in other clouds are products like BigQuery or Redshift. Azure Synapse is currently in public preview.

Serverless and provisioned capacity
In the world of large-scale data processing and analytics, things like autoscale clusters and pay-for-what-you-use has become a must-have. In Azure Synapse, you can choose between serverless and provisioned capacity, depending on whether you need to be flexible and adjust to bursts, or have a predictable resource load.

Native Apache Spark support
Apache Spark has demonstrated its power in data processing for both batch and real-time streaming models. It offers a great Python and Scala/Java support for data operations at large scale. Azure Synapse provides built-in support for data analytics using Apache Spark. It's possible to create an Apache Spark pool, upload Spark jobs, or create Spark notebooks for experimenting with the data.

SQL support
In addition to Apache Spark support, Azure Synapse has excellent support for data analytics with SQL.

Other features
Azure Synapse provides smooth integration with Azure Machine Learning and Spark ML. It enables convenient data ingestion and export using Azure Data Factory, which connects with many Azure and independent data input and output sources. Data can be effectively visualized with PowerBI.

At Microsoft Build 2020, Satya Nadella announced Synapse Link functionality that will help get insights from real-time transactional data stored in operational databases (e.g. Cosmos DB) with a single click, without the need to manage data movement.

Get started with Azure Synapse Workspaces using Azure CLI

Prepare the necessary environment variables:

$ StorageAccountName='<come up with a name for your storage account>'
$ ResourceGroup='<come up with a name for your resource group>'
$ Region='<come up with a name of the region, e.g. eastus>'
$ FileShareName='<come up with a name of the storage file share>'
$ SynapseWorkspaceName='<come up with a name for Synapse Workspace>'
$ SqlUser='<come up with a username>'
$ SqlPassword='<come up with a secure password>'

Create a resource group as a container for your resources:

$ az group create --name $ResourceGroup --location $Region

Create a Data Lake storage account:

$ az storage account create \
  --name $StorageAccountName \
  --resource-group $ResourceGroup \
  --location $Region \
  --sku Standard_GRS \
  --kind StorageV2

The output of this command will be similar to:

{- Finished ..
  "accessTier": "Hot",
  "creationTime": "2020-05-19T01:32:42.434045+00:00",
  "customDomain": null,
  "enableAzureFilesAadIntegration": null,
  "enableHttpsTrafficOnly": false,
  "encryption": {
    "keySource": "Microsoft.Storage",
    "keyVaultProperties": null,
    "services": {
      "blob": {
        "enabled": true,
        "lastEnabledTime": "2020-05-19T01:32:42.496550+00:00"
      },
      "file": {
        "enabled": true,
        "lastEnabledTime": "2020-05-19T01:32:42.496550+00:00"
      },
      "queue": null,
      "table": null
    }
  },
  "failoverInProgress": null,
  "geoReplicationStats": null,
  "id": "/subscriptions/<subscription-id>/resourceGroups/Synapse-test/providers/Microsoft.Storage/storageAccounts/<storage-account-name>",
  "identity": null,
  "isHnsEnabled": null,
  "kind": "StorageV2",
  "lastGeoFailoverTime": null,
  "location": "eastus",
  "name": "<storage-account-name>",
  "networkRuleSet": {
    "bypass": "AzureServices",
    "defaultAction": "Allow",
    "ipRules": [],
    "virtualNetworkRules": []
  },
  "primaryEndpoints": {
    "blob": "https://<storage-account-name>.blob.core.windows.net/",
    "dfs": "https://<storage-account-name>.dfs.core.windows.net/",
    "file": "https://<storage-account-name>.file.core.windows.net/",
    "queue": "https://<storage-account-name>.queue.core.windows.net/",
    "table": "https://<storage-account-name>.table.core.windows.net/",
    "web": "https://<storage-account-name>.z13.web.core.windows.net/"
  },
  "primaryLocation": "eastus",
  "provisioningState": "Succeeded",
  "resourceGroup": "<resource-group-name>",
  "secondaryEndpoints": null,
  "secondaryLocation": "westus",
  "sku": {
    "capabilities": null,
    "kind": null,
    "locations": null,
    "name": "Standard_GRS",
    "resourceType": null,
    "restrictions": null,
    "tier": "Standard"
  },
  "statusOfPrimary": "available",
  "statusOfSecondary": "available",
  "tags": {},
  "type": "Microsoft.Storage/storageAccounts"
}

Retrieve the storage account key:

$ StorageAccountKey=$(az storage account keys list \
  --account-name $StorageAccountName \
  | jq -r '.[0] | .value')

Retrieve Storage Endpoint URL:

$ StorageEndpointUrl=$(az storage account show \
  --name $StorageAccountName \
  --resource-group $ResourceGroup \
  | jq -r '.primaryEndpoints | .dfs')

You can always check what your storage account key and endpoint are by looking at them, if you'd like:

$ echo "Storage Account Key: $StorageAccountKey"
$ echo "Storage Endpoint URL: $StorageEndpointUrl"

Create a fileshare:

$ az storage share create \
  --account-name $StorageAccountName \
  --account-key $StorageAccountKey \
  --name $FileShareName

Create a Synapse Workspace:

$ az synapse workspace create \
  --name $SynapseWorkspaceName \
  --resource-group $ResourceGroup \
  --storage-account $StorageAccountName \
  --file-system $FileShareName \
  --sql-admin-login-user $SqlUser \
  --sql-admin-login-password $SqlPassword \
  --location $Region

The output of the command should show the successful creation:

{- Finished ..
  "connectivityEndpoints": {
    "dev": "https://<synapse-workspace-name>.dev.azuresynapse.net",
    "sql": "<synapse-workspace-name>.sql.azuresynapse.net",
    "sqlOnDemand": "<synapse-workspace-name>-ondemand.sql.azuresynapse.net",
    "web": "https://web.azuresynapse.net?workspace=%2fsubscriptions%<subscription-id>%2fresourceGroups%2fS<resource-group-name>%2fproviders%2fMicrosoft.Synapse%2fworkspaces%<synapse-workspace-name>"
  },
  "defaultDataLakeStorage": {
    "accountUrl": "https://<storage-account-name>.dfs.core.windows.net",
    "filesystem": "<file-share-name>"
  },
  "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Synapse/workspaces/<synapse-workspace-name>",
  "identity": {
    "principalId": "<principal-id>",
    "tenantId": "<tenant-id>",
    "type": "SystemAssigned"
  },
  "location": "eastus",
  "managedResourceGroupName": "<managed-tesource-group-id>",
  "name": "<synapse-workspace-name>",
  "provisioningState": "Succeeded",
  "resourceGroup": "<resource-group-name>",
  "sqlAdministratorLogin": "<admin-login>",
  "sqlAdministratorLoginPassword": <admin-password>,
  "tags": null,
  "type": "Microsoft.Synapse/workspaces",
  "virtualNetworkProfile": null
}

After you successfully created these resources, you should be able to go to Azure Portal, and navigate to the resource called $SynapseWorkspaceName within $ResourceGroup resource group. You should see a similar page:

What's next?

You can now load data and experiment with it in Synapse Data Studio, create Spark or SQL pools and run analytics queries, connect to PowerBI and visualize your data, and many more.

Stay tuned for next articles in this series to learn more! Thanks for reading!

If this article was interesting to you, follow @lenadroid on Twitter.

DEV Community

Large-scale Data Analytics with Azure Synapse - Workspaces with CLI

Introduction

Get started with Azure Synapse Workspaces using Azure CLI

Top comments (0)

Read next

Updating MOERA Address for AD Connect Synchronized User in 2024

Top Python Libraries Every Developer Should Know

InvalidContentLink error in ARM deployments

Data Architecture Best Practices