May 22, 2020

Masterclass Episode 2 – Azure Advisor

David Summers
Microsoft Azure Lead at Data#3 Limited
Welcome to the second instalment in Data#3’s five-part series expanding on our Azure blog, which covers the five top tips for long-term success with your Microsoft Azure-hosted infrastructure. Data#3 has delivered over 100 Azure Health Checks for customers, and we consistently encounter the same problems.

This instalment is all about the Azure Advisor service and how you can use it to optimise your Azure consumption. We’ll cover both the technical and organisational benefits of using Advisor and cost management.

What is Advisor?

Despite what you may think, Microsoft has capabilities within Azure to help you save money running your Azure workloads.

Enter the Azure Advisor service, a free built-in service that proactively analyses your workloads and provides recommendations for high availability, security, performance and cost optimisation. We will dive deeper into each of these Advisor categories to showcase what the service can do for you. But first, let’s dig into the shared responsibility model.

The Shared Responsibility Model

This is extremely important to keep in mind. See below for a summary graphic of the areas of responsibility for both Microsoft and your organisation. Essentially, if you are in an on-premises world, the responsibility is yours to ensure your infrastructure is protected, secured, available and accessible. As we shift to the right and into Azure, the responsibility starts to move to Microsoft, and you reap the rewards of not having to worry about the mundane. I think we can all appreciate this quote from a Microsoft employee, which sums it up:

“If you shift your workload into Azure you will never receive a Pager/SMS/Email notification late at night for a Disk/San/Tape/Hardware failure” – Ben Armstrong, Microsoft

With this in mind, Microsoft has created the Advisor service to help you with potential misconfiguration within your responsibility.

Why is this important? If you deploy a single VM into Azure, it’s up to you to ensure that the configuration of the VM is optimal. Patching, monitoring, virus protection, security hardening and baselining, availability, public connectivity, traffic filtering, installed applications and access all fall within your responsibility. I have often encountered situations where a customer has deployed something into Azure and, when asked how they were protecting it, answered: “Azure!”

Advisor – High Availability

Our first Advisor capability is High Availability, which helps you understand where your infrastructure may be vulnerable to service outages due to its current configuration. Public cloud infrastructure is complex and challenging from a provider perspective. When you deploy something into Azure, you need to ensure that it remains available. Your thing is going to run on hardware, abstracted by a hypervisor, and hardware can fail! While Azure has been built to tolerate failing hardware and is extremely resilient to outages from a physical hosting perspective, depending on your configuration you may still be susceptible to a service outage while your thing is evacuated onto healthy hardware.

For most use cases this may be acceptable, especially for dev and test environments. However, what about that production system that is mission critical? Rather than relying on a single VM, we would architect a highly available, fault-tolerant (HA/FT) solution with multiple VMs residing in an Availability Set or spread across Availability Zones, to ensure that at least one VM is available should the underlying physical host fail.
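
To make that idea concrete, here is a minimal Python sketch of the kind of check involved. It is not how Advisor itself works; the VM class, the workload label and the sample inventory are all hypothetical, made up for illustration. A workload counts as covered only if at least two of its VMs share an Availability Set, or its VMs span at least two zones.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class VM:
    # Hypothetical inventory record, not an Azure SDK type.
    name: str
    workload: str                         # grouping label, e.g. "payroll"
    availability_set: Optional[str] = None
    zone: Optional[str] = None


def find_single_points_of_failure(vms):
    """Return workloads where one host failure could take every VM offline."""
    by_workload = defaultdict(list)
    for vm in vms:
        by_workload[vm.workload].append(vm)

    at_risk = []
    for workload, members in by_workload.items():
        # Count VMs per Availability Set and collect the distinct zones in use.
        sets = Counter(vm.availability_set for vm in members if vm.availability_set)
        zones = {vm.zone for vm in members if vm.zone}
        redundant = any(count >= 2 for count in sets.values()) or len(zones) >= 2
        if not redundant:
            at_risk.append(workload)
    return sorted(at_risk)


inventory = [
    VM("web-01", "web", availability_set="web-as"),
    VM("web-02", "web", availability_set="web-as"),
    VM("pay-01", "payroll"),  # a lone VM with no set or zone
]
print(find_single_points_of_failure(inventory))  # ['payroll']
```

The lone payroll VM is the one flagged, which is exactly the mission-critical single VM scenario described above.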

For web services, we might leverage the Front Door or Traffic Manager service to ensure site availability across two or more App Services.

Advisor helps with these situations by analysing your deployed workload and determining where you may be exposed from an availability perspective. The following criteria make up the current rule set for these checks:

  • Ensure virtual machine fault tolerance
  • Ensure availability set fault tolerance
  • Use Managed Disks to improve data reliability
  • Known issue with Check Point Network Virtual Appliance image version
  • Ensure application gateway fault tolerance
  • Protect your virtual machine data from accidental deletion
  • Create Azure Service Health alerts to be notified when Azure issues affect you
  • Configure Traffic Manager endpoints for resiliency
  • Use soft delete on your Azure Storage Account to save and recover data after accidental overwrite or deletion
  • Configure your VPN gateway to active-active for connection resiliency
  • Use production VPN gateways to run your production workloads
  • Repair invalid log alert rules
  • Configure consistent indexing mode on your Cosmos DB collection
  • Configure your Azure Cosmos DB containers with a partition key
  • Upgrade your Azure Cosmos DB .NET SDK to the latest version from Nuget
  • Upgrade your Azure Cosmos DB Java SDK to the latest version from Maven
  • Upgrade your Azure Cosmos DB Spark Connector to the latest version from Maven
  • Enable virtual machine replication

Each of these rules is evaluated against what you have deployed. The Advisor service notifies you when your workload may be at risk and presents the findings in this report.

Drilling into one of these alerts provides more detail, with the ability to postpone or dismiss a particular item, and more importantly, offers a shortcut to remediate the issue.

In this example, selecting the “Enable virtual machine backup” option takes me to a quick configuration page to enable backup for this workload. How handy is that?!

It’s also worth noting that these recommendations are available in the context of the resource itself. If I navigate to the VM in question, I can view all recommendations scoped to just that VM.

Now, we move to the next Advisor capability: Performance.

Advisor – Performance

Performance comes down to more than just CPU, memory, IOPS or network. We need to account for all metrics and configuration items that could degrade the end-user experience. The Performance checks comprise the following rules:

  • Reduce DNS time to live on your Traffic Manager profile to fail over to healthy endpoints faster
  • Improve database performance with SQL DB Advisor
  • Upgrade your Storage Client Library to the latest version for better reliability and performance
  • Improve App Service performance and reliability
  • Use Managed Disks to prevent disk I/O throttling
  • Improve the performance and reliability of virtual machine disks by using Premium Storage
  • Remove data skew on your SQL data warehouse table to increase query performance
  • Create or update outdated table statistics on your SQL data warehouse table to increase query performance
  • Scale up to Optimise cache utilization on your SQL Data Warehouse tables to increase query performance
  • Convert SQL Data Warehouse tables to replicated tables to increase query performance
  • Migrate your Storage Account to Azure Resource Manager to get all of the latest Azure features
  • Design your storage accounts to prevent hitting the maximum subscription limit
  • Consider increasing the size of your VNet Gateway SKU to address high P2S use
  • Consider increasing the size of your VNet Gateway SKU to address high CPU
  • Increase batch size when loading to maximize load throughput, data compression, and query performance
  • Co-locate the storage account within the same region to minimize latency when loading
  • Unsupported Kubernetes version is detected
  • Optimise the performance of your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers
    • Fix the CPU pressure of your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers with CPU bottlenecks
    • Reduce memory constraints on your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers or move to a memory Optimised SKU
    • Use an Azure MySQL or Azure PostgreSQL Read Replica to scale out reads for read intensive workloads
    • Scale your Azure MySQL, Azure PostgreSQL, or Azure MariaDB server to a higher SKU to prevent connection constraints
  • Scale your Cache to a different size or SKU to improve Cache and application performance
  • Add regions with traffic to your Azure Cosmos DB account
  • Configure your Azure Cosmos DB indexing policy with custom included or excluded paths
  • Configure your Azure Cosmos DB query page size (MaxItemCount) to -1

Notice something missing from the above?

Not a single Performance rule addresses the compute sizing of your Virtual Machine workload! More on that in the Cost section.

Now for the next section and arguably the most fun.

Advisor – Security

Now we get into the good stuff, and a critical component of the shared responsibility model. Here’s an example of what I mean.

Say you deploy a VM and need to remote into it. You assign a public IP and open the default RDP port, 3389. You’re in a hurry and don’t know your outbound IP address offhand, so you allow any IP to connect via RDP. You’re not worried about unwanted third-party access, as the local admin password on the machine is super secure. Plus, who is going to guess the public IP address of your VM? Only you know that, right?

The countdown begins the second that the VM boots up to the logon screen. This countdown is the time to breach!

So, you have a task list to work through:

  1. Join VM to domain and configure GPOs (Security baseline)
  2. Install services and applications
  3. Add VM to backup, AV protection, monitoring and SIEM solutions
  4. Configure services and applications
  5. Update OS and applications to latest patch revisions
  6. Close RDP access

This is a basic task list and some components take time, so you may switch to another daily task while you wait for these to complete. Let’s assume it takes about 5 hours to complete this list and you are happy that your VM is ready for production release.

You could be very wrong with that assumption. So, let’s back up a little and go back to that Public IP that you thought was super-secret and secure.

I can absolutely guarantee that your IP address will fall into a range in this list.

I make that statement with total confidence, as this publicly downloadable list of Microsoft-owned IP ranges contains every possible IP that could be allocated to your VM. This list is available by design so that your Next Gen firewalls can dynamically update outbound firewall rules for destination Azure and other Microsoft online services.
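
To show how little effort that lookup takes, here is a minimal Python sketch using the standard library’s ipaddress module. The ranges below are reserved documentation ranges standing in for the real published list, and the function name is my own, so treat it as an illustration only.

```python
import ipaddress

# Stand-in CIDR blocks (reserved documentation ranges), NOT real Azure ranges.
# In practice these would be loaded from the published list.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]


def find_containing_range(ip, ranges):
    """Return the first CIDR block that contains the address, or None."""
    address = ipaddress.ip_address(ip)
    for cidr in ranges:
        if address in ipaddress.ip_network(cidr):
            return cidr
    return None


print(find_containing_range("198.51.100.27", PUBLISHED_RANGES))  # 198.51.100.0/24
print(find_containing_range("10.0.0.4", PUBLISHED_RANGES))       # None
```

If a defender can do this in a dozen lines, so can anyone else, which is the point of the next few paragraphs.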

If I were a villain, what do you think I would do with this list? Step one, I would build or rent a botnet to scrape through the list and iterate through every possible IP, looking for extremely high-value default port connections.

I would absolutely start with 3389.

So, your VM has just been built and my botnet just happens to be scanning the IP range that your VM resides in. I get a hit: I now have a public IP that responds to a protocol query. That IP then goes into my second-stage botnet, which fingerprints the destination. This uncovers a vulnerability that has not been patched yet. Automation then queues up an exploit for that vulnerability, and I am notified that a host is available to log into.

And with that, I now own your VM!

This could have happened at any stage after the VM booted up and, yes, it would have been detected or mitigated at some stage. But what if you were called away straight after provisioning the VM? How much time would I have had to remote in, perform lateral reconnaissance and start attacking other resources? And what about the thousands of genuinely bad actors out there who run this exact scenario thousands of times per day?

So, going back to the shared responsibility model: whose fault was this? Microsoft’s or yours?

Hopefully you will understand the answer by now. So, where does Advisor fit into this?

The built-in rules for Advisor are as follows:

Network recommendations

  • Adaptive Network Hardening recommendations should be applied on internet facing virtual machines
  • All network ports should be restricted on NSG associated to your VM
  • DDoS Protection Standard should be enabled
  • Function App should only be accessible over HTTPS
  • Internet-facing virtual machines should be protected with Network Security Groups
  • IP forwarding on your virtual machine should be disabled
  • Just-in-time network access control should be applied on virtual machines
  • Management ports should be closed on your virtual machines
  • Secure transfer to storage accounts should be enabled
  • Subnets should be associated with a Network Security Group
  • Web Application should only be accessible over HTTPS

Container recommendations

  • Authorized IP ranges should be defined on Kubernetes Services
  • Pod Security Policies should be defined to reduce the attack vector by removing unnecessary application privileges (Preview)
  • Role-Based Access Control should be used to restrict access to a Kubernetes Service Cluster
  • The Kubernetes Service should be upgraded to the latest Kubernetes version
  • Vulnerabilities in Azure Container Registry images should be remediated (powered by Qualys)

App Service recommendations

  • API App should only be accessible over HTTPS
  • CORS should not allow every resource to access your API App
  • CORS should not allow every resource to access your Function App
  • CORS should not allow every resource to access your Web Applications
  • Diagnostic logs in App Services should be enabled
  • Function App should only be accessible over HTTPS
  • Remote debugging should be turned off for API App
  • Remote debugging should be turned off for Function App
  • Remote debugging should be turned off for Web Applications
  • Web Application should only be accessible over HTTPS

Compute and app recommendations

  • Adaptive Application Controls should be enabled on virtual machines
  • All authorization rules except RootManageSharedAccessKey should be removed from Event Hub namespace
  • All authorization rules except RootManageSharedAccessKey should be removed from Service Bus namespace
  • Authorization rules on the Event Hub entity should be defined
  • Automation account variables should be encrypted
  • Diagnostic logs in Azure Stream Analytics should be enabled
  • Diagnostic logs in Batch accounts should be enabled
  • Diagnostic logs in Event Hub should be enabled
  • Diagnostic logs in Logic Apps should be enabled
  • Diagnostic logs in Search services should be enabled
  • Diagnostic logs in Service Bus should be enabled
  • Disk encryption should be applied on virtual machines
  • Enable the built-in vulnerability assessment solution on virtual machines
  • Endpoint protection health issues should be resolved on your machines
  • Install endpoint protection solution on virtual machines
  • Install endpoint protection solution on your machines
  • Install monitoring agent on your virtual machines
  • Monitoring agent health issues should be resolved on your machines
  • Network traffic data collection agent should be installed on Linux virtual machines (Preview)
  • Network traffic data collection agent should be installed on Windows virtual machines (Preview)
  • OS version should be updated for your cloud service roles
  • Remediate vulnerabilities found on your virtual machines (powered by Qualys)
  • Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign
  • Service Fabric clusters should only use Azure Active Directory for client authentication
  • System updates should be installed on your machines
  • Virtual machines should be migrated to new Azure Resource Manager resources
  • Vulnerabilities in container security configurations should be remediated
  • Vulnerabilities in security configuration on your machines should be remediated
  • Vulnerabilities should be remediated by a Vulnerability Assessment solution
  • Vulnerability assessment solution should be installed on your virtual machines
  • Your machines should be restarted to apply system updates

Virtual machine scale set recommendations

  • Diagnostic logs in Virtual Machine Scale Sets should be enabled
  • Endpoint protection health failures should be remediated on virtual machine scale sets
  • Endpoint protection solution should be installed on virtual machine scale sets
  • Monitoring agent should be installed on virtual machine scale sets
  • System updates on virtual machine scale sets should be installed
  • Vulnerabilities in security configuration on your virtual machine scale sets should be remediated

Data and storage recommendations

  • Access to storage accounts with firewall and virtual network configurations should be restricted
  • Advanced data security should be enabled on your managed instances
  • Advanced data security should be enabled on your SQL servers
  • An Azure Active Directory administrator should be provisioned for SQL servers
  • Auditing on SQL server should be enabled
  • Diagnostic logs in Azure Data Lake Store should be enabled
  • Diagnostic logs in Data Lake Analytics should be enabled
  • Only secure connections to your Redis Cache should be enabled
  • Secure transfer to storage accounts should be enabled
  • Sensitive data in your SQL databases should be classified
  • Storage accounts should be migrated to new Azure Resource Manager resources
  • Transparent Data Encryption on SQL databases should be enabled
  • Vulnerabilities on your SQL databases in VMs should be remediated
  • Vulnerabilities on your SQL databases should be remediated
  • Vulnerability assessment should be enabled on your SQL managed instances
  • Vulnerability assessment should be enabled on your SQL servers

Identity and access recommendations

  • A maximum of 3 owners should be designated for your subscription
  • Deprecated accounts should be removed from your subscription
  • Deprecated accounts with owner permissions should be removed from your subscription
  • Diagnostic logs in Key Vault should be enabled
  • External accounts with owner permissions should be removed from your subscription
  • External accounts with read permissions should be removed from your subscription
  • External accounts with write permissions should be removed from your subscription
  • MFA should be enabled on accounts with owner permissions on your subscription
  • MFA should be enabled on accounts with read permissions on your subscription
  • MFA should be enabled on accounts with write permissions on your subscription
  • There should be more than one owner assigned to your subscription

Technically these rules are native to Azure Security Center (stay tuned for a future Masterclass on that), but Advisor integrates with the alerts from Security Center and presents them here alongside its other reports.

In my example, multiple rules would have fired to let me know that my VM was at high risk, and I could have paused and put secure access controls in place at the outset.
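
As a rough illustration of what a rule like “Management ports should be closed on your virtual machines” looks for, here is a minimal Python sketch. The InboundRule class and the sample rules are hypothetical and heavily simplified: one port per rule and no rule priorities, which a real NSG evaluation would need to consider.

```python
from dataclasses import dataclass

MANAGEMENT_PORTS = {22: "SSH", 3389: "RDP"}
# Source values treated as "anyone on the internet" in this sketch.
ANY_SOURCE = {"*", "0.0.0.0/0", "Internet", "Any"}


@dataclass
class InboundRule:
    # Hypothetical, simplified stand-in for an NSG inbound rule.
    name: str
    source: str
    port: int
    action: str  # "Allow" or "Deny"


def exposed_management_rules(rules):
    """Describe every rule that allows a management port from any source."""
    findings = []
    for rule in rules:
        if (
            rule.action == "Allow"
            and rule.source in ANY_SOURCE
            and rule.port in MANAGEMENT_PORTS
        ):
            label = MANAGEMENT_PORTS[rule.port]
            findings.append(f"{rule.name}: {label} ({rule.port}) open to any source")
    return findings


rules = [
    InboundRule("allow-rdp-any", "*", 3389, "Allow"),        # the hurried rule from my example
    InboundRule("allow-https", "*", 443, "Allow"),           # not a management port
    InboundRule("allow-ssh-office", "203.0.113.10", 22, "Allow"),  # restricted source
]
print(exposed_management_rules(rules))
# ['allow-rdp-any: RDP (3389) open to any source']
```

Only the rushed “allow any IP on 3389” rule is flagged; the SSH rule restricted to a known source address passes.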

So, by now we have followed all of the Advisor recommendations and all our things are highly available, secure and performant. Now we go back to an initial statement in this blog: Microsoft wants to help you save money.

Advisor – Cost

Now we move into the interesting advisories. The current rule set is below:

  • Optimise virtual machine spend by resizing or shutting down underutilised instances
  • Reduce costs by eliminating un-provisioned ExpressRoute circuits
  • Reduce costs by deleting or reconfiguring idle virtual network gateways
  • Buy reserved virtual machine instances to save money over pay-as-you-go costs
  • Delete un-associated public IP addresses to save money
  • Delete Azure Data Factory pipelines that are failing
  • Use Standard Snapshots for Managed Disks
  • Utilise Lifecycle Management
  • Create an Ephemeral OS Disk recommendation

The two most common rules are Reserved Instances (RI) and right-sizing VM workloads. RI allows you to save up to 80% of the consumption cost for VMs and other services (if combined with Azure Hybrid Benefit).
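
To put a number on that, here is a small Python sketch of the arithmetic. The hourly rate is a made-up figure, and the 80% discount is the “up to” ceiling quoted above, so read the result as a best-case illustration rather than a quote.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the VM runs all year


def annual_cost(hourly_rate, discount=0.0):
    """Yearly cost of one always-on VM at the given rate and discount."""
    return round(hourly_rate * HOURS_PER_YEAR * (1 - discount), 2)


# Hypothetical pay-as-you-go rate of $0.50 per hour.
pay_as_you_go = annual_cost(0.50)
reserved = annual_cost(0.50, discount=0.80)  # best-case 80% saving

print(pay_as_you_go)             # 4380.0
print(reserved)                  # 876.0
print(pay_as_you_go - reserved)  # 3504.0 saved per VM per year
```

Multiply that by a fleet of VMs and it becomes clear why this rule shows up so often.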

Right-sizing workloads is just as important. Let’s assume you migrated a VM from on-premises to Azure. It had 32 cores at source and you matched the compute allocation within Azure. On-premises you did not really pay per core, so you allocated what you felt was appropriate. Now, post-migration, you’ve discovered that your VM only uses 5% of its allocated CPU. Advisor will alert you that over a 30-day period that particular VM has not exceeded 5% CPU utilisation, and you have the option of re-evaluating the compute allocation and selecting a more appropriate VM size to match the workload.
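
The logic behind that alert can be sketched in a few lines of Python. This is my own simplification, not Advisor’s actual algorithm: the function name, the daily-peak samples and the single CPU threshold are all assumptions for illustration.

```python
def is_underutilised(cpu_samples, threshold_percent=5.0):
    """True if CPU never rose above the threshold across the whole window."""
    if not cpu_samples:
        return False  # no data means no verdict
    return max(cpu_samples) <= threshold_percent


# Hypothetical daily peak CPU (%) for two VMs over a 30-day window.
idle_vm = [3.1, 4.0, 2.2] * 10
busy_vm = [3.1, 4.0, 2.2] * 9 + [3.0, 4.5, 62.5]

print(is_underutilised(idle_vm))  # True
print(is_underutilised(busy_vm))  # False
```

Note that a single busy day is enough to clear a VM here, which is why the length of the observation window matters so much for workloads with month-end peaks.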

Just how much money you could save depends on what you have deployed. Out of more than a hundred Azure Health Checks we have performed, our largest potential yearly saving to date is AUD $956,889, for a single customer!

Automating Advisor

One of the key capabilities of Advisor is that you can schedule reports and create alert rules to notify key personnel who are responsible for certain aspects of Azure.

To start automating and alerting with Advisor, navigate to the Advisor service within the Azure portal.

Select Recommendation Digest.

From here, we can select which subscription to report on, the frequency of reports, the category of Azure Advisor recommendations (High Availability, Performance, Security and Cost) and the action group(s) for notification.

Action groups are where things get interesting.

Notice there is a lot more there than just straight email notification. Direct ITSM support, along with Functions and Logic Apps support, not only allows you to integrate with existing service desk systems but also lets you invoke self-healing actions.
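
As a taste of what such an integration might do, here is a minimal Python sketch of a handler that routes a recommendation to the right team. The payload shape, the team names and the routing table are entirely hypothetical; this is not the schema that Azure sends to an action group, so check the action group documentation before building on it.

```python
# Hypothetical routing table: recommendation category -> owning team.
ROUTES = {
    "Cost": "finance-team",
    "Security": "security-team",
    "HighAvailability": "platform-team",
    "Performance": "platform-team",
}


def route_recommendation(payload):
    """Turn a (hypothetical) recommendation payload into a ticket assignment."""
    category = payload.get("category", "")
    title = payload.get("title", "Untitled recommendation")
    return {
        # Anything we don't recognise falls back to a default queue.
        "assign_to": ROUTES.get(category, "cloud-ops"),
        "summary": f"[{category or 'Unknown'}] {title}",
    }


print(route_recommendation({"category": "Cost", "title": "Resize idle VM"}))
# {'assign_to': 'finance-team', 'summary': '[Cost] Resize idle VM'}
```

The same function could sit inside a Function or feed a Logic App step that raises the ticket in your service desk system.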

Even just using a basic email reporting function will help with the optimal usage of Azure, and this is especially true with large team deployments. Visibility is key, so setting up alerts and digest reports allows you to maintain control over Azure. In a future masterclass we will cover an extension to this that helps prevent misconfigured resources from being provisioned.

More on action groups.

Final Thoughts

And that’s a wrap for this masterclass! Start exploring Azure Advisor today, and always keep the Shared Responsibility Model in mind.

Stay tuned for the next episode, which will take you through the compelling advantages of Azure Security. Find more resources in the Data#3 Knowledge Centre.

Need assistance with Azure Remediation tasks? Contact a Data#3 Azure specialist today.