May 22, 2020

Masterclass Episode 2 – Azure Advisor

David Summers

Microsoft Azure Lead at Data^#3 Limited


Welcome to the second instalment in Data^#3’s five-part series focusing on the detail outlined in our Azure blog, covering the five top tips for long-term success for your Microsoft Azure-hosted infrastructure. Data^#3 has delivered over 100 Azure Health checks for customers, and consistently encounters the same problems.

This instalment is all about the Azure Advisor service, and how you can optimise your Azure consumption. We’ll cover both the technical and organisational benefits of leveraging Advisor and cost management.

What is Advisor?
The Shared Responsibility Model
Advisor – High Availability
Advisor – Performance
Advisor – Security
Advisor – Cost
Automating Advisor
Final Thoughts

What is Advisor?

Despite what you may think, Microsoft has capabilities within Azure to help you save money running your Azure workloads.

Enter the Azure Advisor service, a free built-in service that proactively analyses your workload and provides recommendations for high availability, security, performance and cost optimisation. We will dive deeper into each of these Advisor capabilities to showcase what this service can do for you. But first, let’s dig into the shared responsibility model.

The Shared Responsibility Model

This is extremely important to keep in mind. See below for a summary graphic of the areas of responsibility for both Microsoft and your organisation. Essentially, if you are in an on-premises world, the responsibility is yours to ensure your infrastructure is protected, secured, available and accessible. As we shift to the right and into Azure, the responsibility starts to move to Microsoft, and you reap the rewards of not having to worry about the mundane. I think we can all appreciate this quote from a Microsoft employee, summarising this:

“If you shift your workload into Azure you will never receive a Pager/SMS/Email notification late at night for a Disk/San/Tape/Hardware failure” – Ben Armstrong, Microsoft

With this in mind, Microsoft has created the Advisor service to help you with potential misconfiguration within your responsibility.

Why is this important? If you deploy a single VM into Azure it’s up to you to ensure that the configuration of the VM is optimal. Patching, monitoring, virus protection, security hardening and baselining, availability, public connectivity, traffic filtering, installed applications and access all fall within your responsibility. I have often encountered situations where a customer has deployed something into Azure and when asked how they were protecting it, answered: “Azure!”

Advisor – High Availability

Our first capability is High Availability, which helps you understand where your infrastructure may be vulnerable to service outages due to your current configuration. Public cloud infrastructure is complex and challenging from a provider perspective. When you deploy something into Azure, you need to ensure that cloud infrastructure is available. Your thing is going to run on hardware, abstracted by a hypervisor. Hardware can often fail! While Azure has been built to tolerate failing hardware and is extremely resilient to outages from a physical hosting perspective, depending on your configuration your thing may be susceptible to a service outage as it is evacuated onto healthy hardware.

For most use cases, this may be acceptable, especially for dev and test environments. However, what about that production system that is mission critical? If we were dealing with a single VM, then we would architect a highly available, fault-tolerant (HA/FT) solution with multiple VMs residing in an Availability Set or Availability Zone to ensure that at least one VM is available should the underlying physical host fail.
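To make the idea concrete, here is a minimal sketch of the kind of check involved: flag any workload whose VMs all sit in a single fault domain. The data structure, field names (`workload`, `fault_domain`) and VM names are hypothetical and purely illustrative. This is not how Advisor is implemented, nor an Azure API.

```python
from collections import defaultdict


def single_points_of_failure(vms):
    """Return workloads whose VMs occupy fewer than two fault domains.

    `vms` is a list of dicts with hypothetical keys:
    name, workload, fault_domain.
    """
    domains = defaultdict(set)
    for vm in vms:
        domains[vm["workload"]].add(vm["fault_domain"])
    # A workload confined to one fault domain goes down with that host.
    return sorted(w for w, fds in domains.items() if len(fds) < 2)


# Hypothetical inventory: "web" is spread out, "sql" is a single VM.
vms = [
    {"name": "web-01", "workload": "web", "fault_domain": 0},
    {"name": "web-02", "workload": "web", "fault_domain": 1},
    {"name": "sql-01", "workload": "sql", "fault_domain": 0},
]

print(single_points_of_failure(vms))  # ['sql']
```

In this made-up inventory the web tier survives a host failure, while the single SQL VM is exactly the sort of item a high availability check would call out.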

For Web services, we might leverage the Front Door or Traffic Manager service to ensure site availability between 2 or more App services.

Advisor helps with these situations by analysing your deployed workload and determining where you may be exposed from an availability perspective. The below criteria make up the current rule set for these checks:

Ensure virtual machine fault tolerance
Ensure availability set fault tolerance
Use Managed Disks to improve data reliability
Known issue with Check Point Network Virtual Appliance image version
Ensure application gateway fault tolerance
Protect your virtual machine data from accidental deletion
Create Azure Service Health alerts to be notified when Azure issues affect you
Configure Traffic Manager endpoints for resiliency
Use soft delete on your Azure Storage Account to save and recover data after accidental overwrite or deletion
Configure your VPN gateway to active-active for connection resiliency
Use production VPN gateways to run your production workloads
Repair invalid log alert rules
Configure consistent indexing mode on your Cosmos DB collection
Configure your Azure Cosmos DB containers with a partition key
Upgrade your Azure Cosmos DB .NET SDK to the latest version from Nuget
Upgrade your Azure Cosmos DB Java SDK to the latest version from Maven
Upgrade your Azure Cosmos DB Spark Connector to the latest version from Maven
Enable virtual machine replication

Each of these rules is evaluated against what you have deployed. The Advisor service will notify you when your workload may be at risk and present the findings in its High Availability report.

Drilling into one of these alerts provides more detail, with the ability to postpone or dismiss a particular item, and more importantly, offers a shortcut to remediate the issue.

In this example, selecting the “Enable virtual machine backup” option takes me to a quick configuration page to enable backup for this workload. How handy is that?!

It’s also worth noting that these recommendations are available within the context of the thing itself. If I navigate to the VM in question, I can view all recommendations scoped to just that VM.

Now, we move to the next Advisor capability: Performance.

Advisor – Performance

Performance comes down to more than just CPU, memory, IOPS or network. We need to account for all metrics and configuration items that could degrade the end user experience. The Performance checks consist of the following rules:

Reduce DNS time to live on your Traffic Manager profile to fail over to healthy endpoints faster
Improve database performance with SQL DB Advisor
Upgrade your Storage Client Library to the latest version for better reliability and performance
Improve App Service performance and reliability
Use Managed Disks to prevent disk I/O throttling
Improve the performance and reliability of virtual machine disks by using Premium Storage
Remove data skew on your SQL data warehouse table to increase query performance
Create or update outdated table statistics on your SQL data warehouse table to increase query performance
Scale up to Optimise cache utilization on your SQL Data Warehouse tables to increase query performance
Convert SQL Data Warehouse tables to replicated tables to increase query performance
Migrate your Storage Account to Azure Resource Manager to get all of the latest Azure features
Design your storage accounts to prevent hitting the maximum subscription limit
Consider increasing the size of your VNet Gateway SKU to address high P2S use
Consider increasing the size of your VNet Gateway SKU to address high CPU
Increase batch size when loading to maximize load throughput, data compression, and query performance
Co-locate the storage account within the same region to minimize latency when loading
Unsupported Kubernetes version is detected
Optimise the performance of your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers
- Fix the CPU pressure of your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers with CPU bottlenecks
- Reduce memory constraints on your Azure MySQL, Azure PostgreSQL, and Azure MariaDB servers or move to a memory Optimised SKU
- Use an Azure MySQL or Azure PostgreSQL Read Replica to scale out reads for read intensive workloads
- Scale your Azure MySQL, Azure PostgreSQL, or Azure MariaDB server to a higher SKU to prevent connection constraints
Scale your Cache to a different size or SKU to improve Cache and application performance
Add regions with traffic to your Azure Cosmos DB account
Configure your Azure Cosmos DB indexing policy with customer included or excluded paths
Configure your Azure Cosmos DB query page size (MaxItemCount) to -1

Notice something missing from the above?

Not a single Performance rule is related to your Virtual Machine workload! More on that in the Cost section.

Now for the next section and arguably the most fun.

Advisor – Security

Now we get into the good stuff, and a critical component of the shared responsibility model. Here’s an example of what I mean.

Say you deploy a VM and need to remote into it. You assign a public IP and open the default port for RDP access, 3389. You’re in a hurry and don’t know your outbound IP address offhand, so you allow Any IP to connect via RDP. You’re not worried about bad third-party access, as your local admin password on the machine is super secure. Plus, who is going to guess the public IP address of your VM? Only you know that, right?

The countdown begins the second that the VM boots up to the logon screen. This countdown is the time to breach!

So, you have a task list to work through:

Join VM to domain and configure GPOs (Security baseline)
Install services and applications
Add VM to backup, AV protection, monitoring and SIEM solutions
Configure services and applications
Update OS and applications to latest patch revisions
Close RDP access

This is a basic task list and some components take time, so you may switch to another daily task while you wait for these to complete. Let’s assume it takes about 5 hours to complete this list and you are happy that your VM is ready for production release.

You could be very wrong with that assumption. So, let’s back up a little and go back to that Public IP that you thought was super-secret and secure.

I can absolutely guarantee that your IP address will fall into a range in this list.

I make that statement with total confidence, as this publicly downloadable list of Microsoft-owned IP ranges contains every possible IP that could be allocated to your VM. This list is available by design so that your Next Gen firewalls can dynamically update outbound firewall rules for destination Azure and other Microsoft online services.
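To see why a public IP offers no secrecy, here is a minimal sketch that checks whether an address falls inside a list of published ranges, using Python's standard `ipaddress` module. The ranges below are reserved documentation placeholders, not real Microsoft ranges, and `PUBLISHED_RANGES` and `find_containing_range` are names invented for this illustration.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), standing in
# for the published list. These are NOT real Azure ranges.
PUBLISHED_RANGES = ["203.0.113.0/24", "198.51.100.0/24"]


def find_containing_range(ip, ranges):
    """Return the first CIDR block that contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for cidr in ranges:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None


print(find_containing_range("203.0.113.57", PUBLISHED_RANGES))  # 203.0.113.0/24
print(find_containing_range("192.0.2.10", PUBLISHED_RANGES))    # None

# Every block is a finite, enumerable set of addresses.
print(ipaddress.ip_network("203.0.113.0/24").num_addresses)     # 256
```

The point is that a published range is a finite set. Anyone holding the list knows every address your VM could possibly have been given.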

If I were a villain, what do you think I would do with this list? Step one, I would build or rent a botnet to scrape through the list and iterate through every possible IP, looking for extremely high-value default port connections.

I would absolutely start with 3389.

So, your VM has just been built and my botnet just happens to be scanning the IP range that your VM resides in. I get a hit; I now have a public IP that responds to a protocol query. That IP then goes into my second-stage botnet, which fingerprints the destination. This uncovers a vulnerability that has not been patched yet. Automation then queues up an exploit for that vulnerability, and I am notified that a host is available to log into.

And with that, I now own your VM!

This could have happened at any stage after the VM booted up and, yes, would have been detected or mitigated at some stage, but what if you were called away straight after provisioning the VM? How much time could I have had to remote in, perform sideways reconnaissance and start attacking other resources? What about the thousands of genuinely bad actors out there who are running this exact scenario thousands of times per day?

So, going back to the shared responsibility model: whose fault was this? Microsoft’s or yours?

Hopefully you will understand the answer by now. So, where does Advisor fit into this?

The built-in rules for Advisor are as follows:

Network recommendations

Adaptive Network Hardening recommendations should be applied on internet facing virtual machines
All network ports should be restricted on NSG associated to your VM
DDoS Protection Standard should be enabled
Function App should only be accessible over HTTPS
Internet-facing virtual machines should be protected with Network Security Groups
IP forwarding on your virtual machine should be disabled
Just-in-time network access control should be applied on virtual machines
Management ports should be closed on your virtual machines
Secure transfer to storage accounts should be enabled
Subnets should be associated with a Network Security Group
Web Application should only be accessible over HTTPS

Container recommendations

Authorized IP ranges should be defined on Kubernetes Services
Pod Security Policies should be defined to reduce the attack vector by removing unnecessary application privileges (Preview)
Role-Based Access Control should be used to restrict access to a Kubernetes Service Cluster
The Kubernetes Service should be upgraded to the latest Kubernetes version
Vulnerabilities in Azure Container Registry images should be remediated (powered by Qualys)

App Service recommendations

API App should only be accessible over HTTPS
CORS should not allow every resource to access your API App
CORS should not allow every resource to access your Function App
CORS should not allow every resource to access your Web Applications
Diagnostic logs in App Services should be enabled
Function App should only be accessible over HTTPS
Remote debugging should be turned off for API App
Remote debugging should be turned off for Function App
Remote debugging should be turned off for Web Applications
Web Application should only be accessible over HTTPS

Compute and app recommendations

Adaptive Application Controls should be enabled on virtual machines
All authorization rules except RootManageSharedAccessKey should be removed from Event Hub namespace
All authorization rules except RootManageSharedAccessKey should be removed from Service Bus namespace
Authorization rules on the Event Hub entity should be defined
Automation account variables should be encrypted
Diagnostic logs in Azure Stream Analytics should be enabled
Diagnostic logs in Batch accounts should be enabled
Diagnostic logs in Event Hub should be enabled
Diagnostic logs in Logic Apps should be enabled
Diagnostic logs in Search services should be enabled
Diagnostic logs in Service Bus should be enabled
Disk encryption should be applied on virtual machines
Enable the built-in vulnerability assessment solution on virtual machines
Endpoint protection health issues should be resolved on your machines
Install endpoint protection solution on virtual machines
Install endpoint protection solution on your machines
Install monitoring agent on your virtual machines
Monitoring agent health issues should be resolved on your machines
Network traffic data collection agent should be installed on Linux virtual machines (Preview)
Network traffic data collection agent should be installed on Windows virtual machines (Preview)
OS version should be updated for your cloud service roles
Remediate vulnerabilities found on your virtual machines (powered by Qualys)
Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign
Service Fabric clusters should only use Azure Active Directory for client authentication
System updates should be installed on your machines
Virtual machines should be migrated to new Azure Resource Manager resources
Vulnerabilities in container security configurations should be remediated
Vulnerabilities in security configuration on your machines should be remediated
Vulnerabilities should be remediated by a Vulnerability Assessment solution
Vulnerability assessment solution should be installed on your virtual machines
Your machines should be restarted to apply system updates

Virtual machine scale set recommendations

Diagnostic logs in Virtual Machine Scale Sets should be enabled
Endpoint protection health failures should be remediated on virtual machine scale sets
Endpoint protection solution should be installed on virtual machine scale sets
Monitoring agent should be installed on virtual machine scale sets
System updates on virtual machine scale sets should be installed
Vulnerabilities in security configuration on your virtual machine scale sets should be remediated

Data and storage recommendations

Access to storage accounts with firewall and virtual network configurations should be restricted
Advanced data security should be enabled on your managed instances
Advanced data security should be enabled on your SQL servers
An Azure Active Directory administrator should be provisioned for SQL servers
Auditing on SQL server should be enabled
Diagnostic logs in Azure Data Lake Store should be enabled
Diagnostic logs in Data Lake Analytics should be enabled
Only secure connections to your Redis Cache should be enabled
Secure transfer to storage accounts should be enabled
Sensitive data in your SQL databases should be classified
Storage accounts should be migrated to new Azure Resource Manager resources
Transparent Data Encryption on SQL databases should be enabled
Vulnerabilities on your SQL databases in VMs should be remediated
Vulnerabilities on your SQL databases should be remediated
Vulnerability assessment should be enabled on your SQL managed instances
Vulnerability assessment should be enabled on your SQL servers

Identity and access recommendations

A maximum of 3 owners should be designated for your subscription
Deprecated accounts should be removed from your subscription
Deprecated accounts with owner permissions should be removed from your subscription
Diagnostic logs in Key Vault should be enabled
External accounts with owner permissions should be removed from your subscription
External accounts with read permissions should be removed from your subscription
External accounts with write permissions should be removed from your subscription
MFA should be enabled on accounts with owner permissions on your subscription
MFA should be enabled on accounts with read permissions on your subscription
MFA should be enabled on accounts with write permissions on your subscription
There should be more than one owner assigned to your subscription

Technically these rules are native to Azure Security Center (stay tuned for a future Masterclass on that), but Advisor integrates with the alerts from Security Center and presents them here with the other reports.

In my example, multiple rules would have been fired to let me know that my VM was at a high risk, and I could have paused and started with secure access controls at the outset.
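As a rough illustration of what a rule like “Management ports should be closed on your virtual machines” looks for, here is a minimal sketch that scans a list of firewall rules for inbound management ports open to any source. The rule format is a simplified, hypothetical stand-in (real NSG rules carry port ranges, priorities and more), and the names `MANAGEMENT_PORTS`, `ANY_SOURCES` and `exposed_management_rules` are assumptions made for this example.

```python
# Ports treated as "management" in this sketch: SSH and RDP.
MANAGEMENT_PORTS = {22, 3389}
# Source values treated as "anyone on the internet" in this sketch.
ANY_SOURCES = {"*", "Any", "Internet", "0.0.0.0/0"}


def exposed_management_rules(rules):
    """Return names of inbound allow rules exposing a management port to any source."""
    findings = []
    for rule in rules:
        if (
            rule["direction"] == "Inbound"
            and rule["access"] == "Allow"
            and rule["source"] in ANY_SOURCES
            and rule["port"] in MANAGEMENT_PORTS
        ):
            findings.append(rule["name"])
    return findings


# Hypothetical rule set matching the scenario described earlier.
rules = [
    {"name": "allow-rdp-any", "direction": "Inbound", "access": "Allow",
     "source": "*", "port": 3389},
    {"name": "allow-https", "direction": "Inbound", "access": "Allow",
     "source": "*", "port": 443},
    {"name": "allow-ssh-office", "direction": "Inbound", "access": "Allow",
     "source": "203.0.113.10/32", "port": 22},
]

print(exposed_management_rules(rules))  # ['allow-rdp-any']
```

Only the wide-open RDP rule is flagged. The SSH rule passes because it is restricted to a single (placeholder) source address.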

So, by now we have followed all of the Advisor recommendations and all our things are highly available, secure and performant. Now we go back to an initial statement in this blog: Microsoft wants to help you save money.

Advisor – Cost

Now we move into the interesting advisories. The current rule set is below:

Optimise virtual machine spend by resizing or shutting down underutilised instances
Reduce costs by eliminating un-provisioned ExpressRoute circuits
Reduce costs by deleting or reconfiguring idle virtual network gateways
Buy reserved virtual machine instances to save money over pay-as-you-go costs
Delete un-associated public IP addresses to save money
Delete Azure Data Factory pipelines that are failing
Use Standard Snapshots for Managed Disks
Utilise Lifecycle Management
Create an Ephemeral OS Disk recommendation

The two most common rules are Reserved Instances (RI) and right-sizing VM workloads. RI allows you to save up to 80% of the consumption cost for VMs and other services (if applied with Azure Hybrid Benefit).
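A quick worked example shows how that percentage translates into dollars. The hourly rate below is a made-up figure, not real Azure pricing, and the 80% is the best-case saving quoted above rather than a typical discount. `annual_saving` is a name invented for this sketch.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours for an always-on VM


def annual_saving(hourly_rate, discount_pct):
    """Yearly saving for one always-on VM at a given percentage discount."""
    pay_as_you_go = hourly_rate * HOURS_PER_YEAR
    return round(pay_as_you_go * discount_pct / 100, 2)


# Hypothetical VM at $1.00 per hour on pay-as-you-go.
print(annual_saving(1.00, 80))  # best case quoted above
print(annual_saving(1.00, 40))  # a more modest discount, for comparison
```

Multiply the result by the number of VMs that run around the clock and the case for reviewing reservations becomes clear very quickly.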

Right-sizing workloads is just as important. Let’s assume you migrated a VM from on-premises to Azure. It had 32 cores at source and you matched the compute allocation within Azure. While on-premises you did not really need to pay per core, so you allocated what you felt was appropriate. Now, post-migration, you’ve discovered that your VM only uses 5% of its allocated CPU. Advisor will alert you that over a 30-day period that particular VM has not exceeded 5% CPU utilisation, and you have the option of re-evaluating the compute allocation for the VM to select a more appropriate VM size to match the workload.
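The logic behind that alert can be sketched in a few lines: a VM is a right-sizing candidate if its peak CPU never exceeded the threshold across the observation window. The sample data, VM names and the `rightsizing_candidates` function are hypothetical, and Advisor's actual evaluation may weigh other metrics as well.

```python
def rightsizing_candidates(cpu_samples, threshold_pct=5.0):
    """Return VMs whose peak CPU stayed at or below the threshold.

    `cpu_samples` maps a VM name to its CPU utilisation readings (percent)
    collected over the observation window.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and max(samples) <= threshold_pct
    )


# Hypothetical readings; a real window would hold many more samples.
cpu_samples = {
    "app-01": [2.1, 3.4, 4.9],      # never above 5%: candidate
    "db-01": [35.0, 80.2, 12.5],    # busy: leave alone
    "batch-01": [1.0, 4.0, 62.0],   # mostly idle but spikes: leave alone
}

print(rightsizing_candidates(cpu_samples))  # ['app-01']
```

Using the peak rather than the average is a deliberate choice here: a VM that idles all month but spikes during a batch run still needs the headroom.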

Just how much money you could save depends on what you have deployed. Out of over a hundred Azure Health Checks we have performed, our largest potential yearly savings value to date is $956,889 AUD, and that was for a single customer!

Automating Advisor

One of the key capabilities of Advisor is that you can schedule reports and create alert rules to notify key personnel who are responsible for certain aspects of Azure.

To start automating and alerting with Azure, navigate to the Advisor service within the Azure portal.

Select Recommendation Digest.

From here, we can select which subscription to report on, frequency of reports, category of Azure Advisor reports (High Availability, Performance, Security and Cost) and the action group(s) for notification.

The action groups are where things get interesting.

Notice there is a lot more there than just straight email notification. Direct ITSM support along with Functions and Logic App support not only allow you to integrate with existing service desk systems, but we could also invoke self-healing actions.
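As a rough sketch of what a Function or Logic App behind an action group might do, the snippet below routes an incoming recommendation to an owning team and decides whether to raise a ticket or simply include it in a digest. The payload shape (`category`, `impact`), the email addresses and the `route` function are all hypothetical. The real alert payload has its own schema, so treat this as an illustration of the routing idea only.

```python
# Hypothetical mapping of Advisor categories to owning teams.
ROUTES = {
    "HighAvailability": "platform@example.com",
    "Performance": "platform@example.com",
    "Security": "secops@example.com",
    "Cost": "finops@example.com",
}
DEFAULT_OWNER = "cloud-team@example.com"


def route(recommendation):
    """Pick an owner and an action for a (simplified) recommendation dict."""
    owner = ROUTES.get(recommendation["category"], DEFAULT_OWNER)
    # High-impact items get a service desk ticket; the rest go in a digest.
    action = "ticket" if recommendation["impact"] == "High" else "digest"
    return {"owner": owner, "action": action}


print(route({"category": "Security", "impact": "High"}))
# {'owner': 'secops@example.com', 'action': 'ticket'}
print(route({"category": "Cost", "impact": "Medium"}))
# {'owner': 'finops@example.com', 'action': 'digest'}
```

The same pattern extends naturally to self-healing: instead of returning "ticket", the handler could trigger a remediation runbook for recommendation types you trust to fix automatically.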

Even just using a basic email reporting function will help with the optimal usage of Azure, and this is especially true with large team deployments. Visibility is key, so setting up alerts and digest reports allows you to maintain control over Azure. In a future masterclass, we will cover an extension to this that helps prevent misconfigured resources from being provisioned.

Final Thoughts

And that’s a wrap for this masterclass! Start exploring Azure Advisor today, and always keep the Shared Responsibility Model in mind.

Stay tuned for the next episode, which will take you through the compelling advantages of Azure Security. Find more resources in the Data^#3 Knowledge Centre.

Need assistance with Azure Remediation tasks? Contact a Data^#3 Azure specialist today.