Ansible for the M365 Admin: The Stuff Intune Doesn't Manage

Intune is not your universal infrastructure orchestrator. It is not supposed to be.

The problem starts when the machines outside Intune run the automation that touches your Microsoft tenant.

The Linux host running your monitoring stack. The container running your CI runner. The box with SSH keys to every other box. The utility VM where someone keeps the scripts that call Microsoft Graph.

That is not endpoint management. That is an operations plane.

And if the operations plane is built from memory, shell history, and one very confident admin, you do not have automation. You have folklore with root access.

This is where I use Ansible in my homelab.

Not as a universal answer. Not as a replacement for Microsoft-native management. As one boring way to put a control model around the layer Intune does not own.

The Intune boundary

Microsoft Intune is the right control plane for managed endpoints and apps. That includes Windows, macOS, mobile devices, and supported Linux endpoint and compliance scenarios. On Linux, that means specific desktop-class distributions (Ubuntu Desktop and RHEL 9/10 running a GNOME desktop), not the Ubuntu Server or RHEL Server boxes running your infrastructure.

That does not make Intune a general Linux server orchestration platform. It does not make Intune the right tool for every infrastructure problem.

It is not a Proxmox provisioning engine. It is not a Docker Compose deployment system. It is not the thing that registers your self-hosted CI runners. It is not where you should model every Linux server, container host, DNS node, log collector, and internal service dependency.

If Azure Arc, Azure Update Manager, Microsoft Defender for Cloud, Azure DevOps, GitHub, or another platform already owns this layer in your environment, good — use that.

The point is not that Ansible is the only answer. The point is that something has to own the infrastructure that runs privileged Microsoft 365 automation.

So teams do the natural thing.

They SSH in. They install packages. They edit config files. They copy the working command from last time. They promise themselves they will document it later.

Later is where drift lives.

Why this matters to an M365 admin

Modern Microsoft 365 administration is not only portal work.

The serious work usually involves automation:

PowerShell jobs
Microsoft Graph scripts
app registrations and service principals
Git repositories
CI/CD runners
monitoring nodes
report generators
remediation tooling
webhook receivers
inventory collectors

Some of those systems live outside Intune. Some live in Azure. Some live in a homelab. Some live on a box nobody officially owns anymore.

The location is less important than the control model.

If a host can run automation against Microsoft Entra ID, Microsoft Intune, Microsoft Defender, Exchange Online, SharePoint Online, or Azure, that host is security-relevant. Treat it like it matters.

Because it does.

The Microsoft identity boundary

The dangerous part is rarely Ansible itself.

It is the identity the automation uses after the host is built.

A script that calls Microsoft Graph with application permissions is not “just a script”. It is a workload identity with whatever tenant access admin consent gave it.

So design that boundary separately.

Know whether the job uses delegated permissions or application permissions. Use managed identity when the workload runs on Azure and the platform supports it. Use workload identity federation where supported for external automation. Avoid long-lived client secrets on runners. Prefer certificates or federation for non-interactive automation where that fits the platform. Separate read/report automation from write/remediation automation. Grant the least privileged Microsoft Graph permissions the job actually needs. Review service principals, app registrations, credentials, owners, and consent.
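As one hedged illustration of the federation point: the sketch below assumes the job runs in GitHub Actions and that the app registration already has a federated credential configured for this repository. It signs in without any stored client secret. The workflow name, the variables REPORT_APP_CLIENT_ID and TENANT_ID, and scripts/graph-report.sh are all hypothetical.

```yaml
# .github/workflows/graph-report.yml (hypothetical file name)
name: graph-report

on:
  schedule:
    - cron: "0 5 * * 1"   # weekly; pick your own window

permissions:
  id-token: write         # lets the job request an OIDC token for federation
  contents: read

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Federated sign-in: no client secret is stored on the runner.
      - name: Sign in with workload identity federation
        uses: azure/login@v2
        with:
          client-id: ${{ vars.REPORT_APP_CLIENT_ID }}   # hypothetical variable
          tenant-id: ${{ vars.TENANT_ID }}              # hypothetical variable
          allow-no-subscriptions: true

      # Read/report only. Write/remediation jobs belong to a separate app.
      - name: Run the read-only report
        run: ./scripts/graph-report.sh                  # hypothetical script
```

The app behind this job should hold read permissions only. A remediation job would get its own app registration and its own workflow, so the two identities never share a credential.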

A runner with Group.ReadWrite.All is not a build helper. It is an actor that can change directory state, and Group.ReadWrite.All is tenant-wide — there is no native per-object scoping on it.

Hold that thought. It is what makes the controller a tenant-security problem, not just an ops one, and the security model section below picks it up.

The pattern

I use Ansible in my homelab for the infrastructure layer that Intune does not own.

Not because Ansible is magic. Not because it is the Microsoft answer. It is not.

Because it gives me four boring things that matter:

Inventory — which hosts exist and what role they have.
Desired state — what baseline every host should receive.
Repeatability — a new host is built the same way as the last one.
Reviewability — changes live in Git instead of terminal memory.

That is the whole trick.

No hero shell sessions. No “I think that container is different because I built it before Christmas”. No undocumented monitoring agent install that only exists on the one host nobody wants to reboot.

The M365 admin translation table

If you live in Intune, the Ansible model is not alien. The nouns are different. The operating model is familiar.

Intune concept	Ansible equivalent	What actually matters
Device group	Inventory group	Which hosts receive the change
Configuration profile	Role or playbook	Desired state expressed once
Assignment	Host pattern	Where the desired state applies
Remediation script	Task	Action that enforces or fixes state
Report-only thinking	`--check` / `--diff`	See impact before changing state
Scope tags / RBAC thinking	Inventory boundaries and repo permissions	Who can change which part of the fleet

This is not the same product category. Do not force that comparison too far.

But the discipline is the same: define state, scope it deliberately, validate impact, then apply it.

If that sounds familiar, it should. I wrote up the CI/CD-deployment side of the same idea in GitOps for IT pros. This post is the configuration-management half: same “Git is the source of truth” discipline, different control plane.

A sanitized lab architecture

Here is the shape. Not the private topology. Not the real hostnames. Not the real addresses.

Git repository
  ├─ inventory/
  │   └─ hosts.yml
  ├─ playbooks/
  │   ├─ bootstrap-linux-host.yml
  │   ├─ deploy-docker-stack.yml
  │   └─ patch-and-reboot-window.yml
  └─ roles/
      ├─ common
      ├─ docker
      ├─ monitoring_agent
      ├─ ci_runner
      └─ tailscale

Ansible controller
  ├─ SSH key with scoped access to managed hosts
  ├─ no public inbound access
  ├─ logs retained
  └─ treated as privileged infrastructure

Managed hosts
  ├─ linux-app-01
  ├─ linux-monitor-01
  ├─ linux-runner-01
  └─ linux-lab-01

The names are boring on purpose. Boring survives screenshots.

The controller is the important part.

If the controller can SSH to every host, read deployment inventory, and run tasks as root, it is not “just a management box”. It is a privileged operations asset.

Build it like one.

The baseline role

Every host gets the same boring baseline.

Time. DNS. SSH. Package updates. Directory layout. Logging. Monitoring agent.

The baseline is where drift goes to die.

The example below is Debian/Ubuntu-shaped. If your hosts are something else, adjust package management, service names, SELinux, firewall handling, and validation instead of cargo-culting the block.

# roles/common/tasks/main.yml
- name: Set timezone
  community.general.timezone:
    name: Europe/Stockholm

# Before this runs: confirm key-based login already works on an open
# session. Disabling password auth without a working key locks you out.
- name: Disable SSH password authentication
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?PasswordAuthentication'
    line: 'PasswordAuthentication no'
    validate: '/usr/sbin/sshd -t -f %s'
  notify: restart ssh  # systemd unit is 'ssh' on Debian/Ubuntu, not 'sshd'

- name: Disable SSH root password login
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?PermitRootLogin'
    line: 'PermitRootLogin prohibit-password'
    validate: '/usr/sbin/sshd -t -f %s'
  notify: restart ssh

- name: Create standard service directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: '0755'
  loop:
    - /srv/apps
    - /srv/logs
    - /srv/backups

- name: Install baseline packages
  ansible.builtin.apt:
    name:
      - curl
      - git
      - jq
      - unattended-upgrades
    state: present
    update_cache: true

# Installing the package does nothing on its own. This enables it.
- name: Enable unattended security upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
    owner: root
    group: root
    mode: '0644'

This is an example, not a universal SSH hardening baseline.

PermitRootLogin prohibit-password blocks root password login. It can still allow root login with SSH keys. If your policy is no root login at all, use PermitRootLogin no and test it.

The validate step runs sshd -t against the candidate config before the line is written, so a broken edit fails the task instead of bricking the daemon. The handler restarts the SSH service — on Debian/Ubuntu the systemd unit is ssh, even though the daemon binary is sshd. The handler itself is deliberately not shown here.
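For readers who want the missing piece anyway, a handler matching that description could be as small as the sketch below. The file path follows the usual role layout and is an assumption, not the author's actual file. The only hard requirement is that the handler name matches the notify string used in the tasks.

```yaml
# roles/common/handlers/main.yml (assumed path, standard role layout)
- name: restart ssh          # must match 'notify: restart ssh' in the tasks
  ansible.builtin.service:
    name: ssh                # Debian/Ubuntu unit name; other distros often use 'sshd'
    state: restarted
```

Handlers run once at the end of the play, so both SSH tasks can notify it and the service still restarts only once.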

Nothing here is impressive.

That is the point.

The value is not that I can disable SSH password authentication on one host. The value is that I can prove every host in scope should have it disabled, review the change, run it in check mode, and apply it consistently.
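One way to back that claim with evidence is to read the effective configuration back from the daemon instead of trusting the file edit. The tasks below are a minimal sketch, assuming they run with become on the same Debian/Ubuntu-shaped hosts. The registered variable name sshd_effective is made up for the example.

```yaml
# Read-only verification: safe to run in check mode as well.
- name: Read the effective sshd configuration
  ansible.builtin.command: /usr/sbin/sshd -T
  register: sshd_effective
  changed_when: false        # this task only reads state
  check_mode: false          # run the command even under --check

- name: Assert that password authentication is effectively disabled
  ansible.builtin.assert:
    that:
      # sshd -T prints option names in lowercase, one per line
      - "'passwordauthentication no' in sshd_effective.stdout_lines"
    fail_msg: sshd still accepts password authentication on this host
```

This catches the case where a drop-in file or an earlier directive overrides the line the baseline wrote.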

Bootstrap a host

A host can have multiple roles.

A Docker host gets Docker. A monitoring host gets the monitoring agent. A runner host gets the CI runner prerequisites. A lab host gets the lab baseline and nothing else.

# playbooks/bootstrap-linux-host.yml
- name: Bootstrap Linux host baseline
  hosts: linux_managed
  become: true

  roles:
    - role: common

    - role: docker
      when: docker_enabled | default(false) | bool

    - role: monitoring_agent
      when: monitoring_enabled | default(true) | bool

    - role: ci_runner
      when: ci_runner_enabled | default(false) | bool

The inventory decides what applies.

# inventory/hosts.yml
all:
  children:
    linux_managed:
      hosts:
        linux-app-01:
          ansible_host: 192.0.2.10
          docker_enabled: true
          monitoring_enabled: true

        linux-monitor-01:
          ansible_host: 192.0.2.20
          docker_enabled: true
          monitoring_enabled: true

        linux-runner-01:
          ansible_host: 192.0.2.30
          docker_enabled: true
          monitoring_enabled: true
          ci_runner_enabled: true

  vars:
    ansible_user: automation-admin
    ansible_python_interpreter: /usr/bin/python3

192.0.2.0/24 is documentation space. Use your own addressing. Do not publish your real topology for internet points.

Dry run first

This is the part M365 admins should appreciate immediately.

Do not go straight to enforcement.

Run the change in check mode. Show the diff. Review the blast radius. Then apply.

Where the modules support it.

Check mode is a simulation, not a contract. Some modules can show what they would change. Some cannot. Diff output can also print sensitive configuration content.

So treat it as a warning shot, not proof that the universe will behave.

ansible-playbook -i inventory/hosts.yml playbooks/bootstrap-linux-host.yml \
  --limit linux-runner-01 \
  --check \
  --diff

If the output is boring, good.

Boring is operational maturity.

If the output says it wants to restart half the stack, replace SSH configuration, and reinstall Docker on a host you thought was already compliant, do not ignore that. That is not Ansible being noisy. That is drift saying hello.
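On the earlier point that diff output can print sensitive configuration: a task can opt out of diff and log output individually. The sketch below assumes a hypothetical template runner.env.j2 that renders a file containing credentials. The template name and destination path are illustrations only.

```yaml
- name: Deploy an environment file that contains credentials
  ansible.builtin.template:
    src: runner.env.j2                 # hypothetical template
    dest: /srv/apps/runner.env         # hypothetical destination
    owner: root
    group: root
    mode: '0600'
  diff: false      # do not print file content when the run uses --diff
  no_log: true     # keep task arguments and results out of the output
```

The task still reports changed or ok under --check, so you keep the warning shot without leaking the content.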

Git is the change record

The repo is the source of desired state and the change record.

The control plane is the combination of Git, review, the Ansible controller, runner permissions, identities, and secret handling.

A change to the baseline role is a pull request. A new host is an inventory change. A new service role is reviewable. A rollback may start with a revert. If the playbook changed external state, data, credentials, or registrations, the rollback also needs an operational recovery step.

That matters because infrastructure changes are not harmless just because the hosts are small.

A self-hosted runner can deploy code. A monitoring node can see logs. A DNS server can shape traffic. A jump host can reach things users cannot.

If the change path is “SSH in and fix it”, nobody can review the security model.

The security model

This is where the post gets less cozy.

Ansible is powerful because it centralizes operations. That also centralizes risk.

And it points that risk at one place. If the controller can push config to a runner that holds a Microsoft Graph credential, then compromising the controller transitively compromises the tenant. The controller is not adjacent to your tenant security boundary. It is inside it.

So shrink the credential it can reach. Group.ReadWrite.All is tenant-wide, and there is no native Graph mechanism that scopes it to specific groups — Exchange application access policies only constrain mailbox permissions, not directory ones. So the control is mostly organizational. Do not hand the permission to a general-purpose runner. Register a separate, narrowly-scoped app for the job, grant only the permissions it needs, and isolate and monitor its credential. If the job can run as a directory role instead of a raw Graph permission, scope that role to an administrative unit holding only the groups it touches — that is the one place you get real object-level scoping.

Treat the controller as privileged

The controller has reach. It can connect to managed hosts. It can run tasks. It often has access to inventory, deployment paths, and operational secrets.

Put it on a management network with no public inbound path. Restrict who can log on. Patch it. Monitor it. Back it up. Know how to rebuild it.

Do not run random experiments on the same host that can reconfigure your infrastructure.

Use SSH keys deliberately

Use key-based auth. Disable password authentication. Use separate keys for automation where possible. Rotate them when operators change.

If one key can reach every host, that key is not a convenience. It is an incident boundary.
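One hedged way to narrow that boundary is to restrict where the automation key may be used from. The sketch below assumes the ansible.posix collection is installed, that the controller sits at 192.0.2.5 (a documentation address, not a real one), and that a hypothetical variable automation_public_key holds the public key string.

```yaml
- name: Authorize the automation key only from the controller address
  ansible.posix.authorized_key:
    user: automation-admin
    key: "{{ automation_public_key }}"   # hypothetical variable with the public key
    # OpenSSH authorized_keys options: limit the source and disable forwarding
    key_options: 'from="192.0.2.5",no-agent-forwarding,no-port-forwarding,no-X11-forwarding'
    state: present
```

A stolen copy of the private key is then only useful from the controller's address, which is a smaller incident than a key that works from anywhere.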

Keep secrets out of Git
#

Use Ansible Vault, an external secret store, or runtime injection. Pick one deliberately.

If you pick Vault, the vault password is the real secret. It decrypts everything Vault protects, so keep it out of the repo, out of shell history, and off the controller's disk; supply it at run time instead.

Do not commit tokens because “the repo is private”. Private repositories leak. Laptops get copied. Runners cache things. Humans paste the wrong file.

The secret you do not commit is the secret you do not have to explain later.
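As a sketch of the runtime injection option, the play below reads a secret from the environment of the process that runs ansible-playbook, so nothing lands in Git. The variable name MONITORING_API_KEY and the destination file are hypothetical, and how the value reaches the environment (secret store, CI secret, operator prompt) is left to your platform.

```yaml
- name: Configure the monitoring agent with a runtime-injected key
  hosts: linux-monitor-01
  become: true
  vars:
    # The env lookup returns an empty string when the variable is not set.
    monitoring_api_key: "{{ lookup('ansible.builtin.env', 'MONITORING_API_KEY') }}"

  tasks:
    - name: Stop early if the secret was not injected
      ansible.builtin.assert:
        that:
          - monitoring_api_key | length > 0
        fail_msg: MONITORING_API_KEY is not set in the controller environment
        quiet: true

    - name: Write the agent environment file
      ansible.builtin.copy:
        dest: /srv/apps/monitoring-agent.env   # hypothetical path under the baseline layout
        content: "MONITORING_API_KEY={{ monitoring_api_key }}\n"
        owner: root
        group: root
        mode: '0600'
      no_log: true                             # keep the value out of task output
```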

Be careful with self-hosted runners

Self-hosted runners are useful. They are also dangerous.

A runner is code execution attached to your environment. If untrusted workflows can run on it, treat compromise as plausible.

Do not let public pull requests execute on persistent internal runners. Do not give one runner access to every environment. Use labels and repository scoping. Separate build runners from deployment runners. Prefer ephemeral runners that are destroyed after each job, so a compromised runner does not persist.
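A minimal sketch of that separation, assuming GitHub Actions: pull requests are checked on a hosted runner, and only merged changes reach the labelled self-hosted deployment runner. The workflow name, the deploy label, and the lab environment are assumptions made for the example.

```yaml
# .github/workflows/baseline.yml (hypothetical file name)
name: baseline

on:
  pull_request:
  push:
    branches: [main]

jobs:
  check:
    # Pull request code never touches the internal runner.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ansible ansible-lint
      - run: ansible-playbook -i inventory/hosts.yml playbooks/bootstrap-linux-host.yml --syntax-check
      - run: ansible-lint

  deploy:
    # Only merged changes run on the deployment runner.
    if: github.event_name == 'push'
    runs-on: [self-hosted, linux, deploy]   # 'deploy' is a hypothetical runner label
    environment: lab                        # hypothetical environment; can require approval
    steps:
      - uses: actions/checkout@v4
      - run: ansible-playbook -i inventory/hosts.yml playbooks/bootstrap-linux-host.yml --limit linux-app-01
```

The labels keep build work off the deployment runner, and the environment gives you one place to require a human approval before anything is applied.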

Convenience is not a security boundary.

Scope the blast radius

Use inventory groups. Use --limit. Use staged rollouts. Use maintenance windows for disruptive roles.

And be honest about the account itself. The examples above use one automation-admin with become: true across the whole fleet, because it reads cleanly. A single account with blanket root on every host is its own blast radius. In a real environment, scope it: separate accounts or keys per tier, and become only where a task actually needs it.

The ability to update everything is not a reason to update everything at once.

That rule applies to Intune. It applies here too.
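A staged rollout can be expressed in the play itself. The sketch below is one possible shape for the patch-and-reboot playbook named in the repository layout, assuming Debian/Ubuntu hosts. It is not the author's actual playbook.

```yaml
# playbooks/patch-and-reboot-window.yml (sketch under stated assumptions)
- name: Patch and reboot in small batches
  hosts: linux_managed
  become: true
  serial: 1                  # one host per batch; a failed batch stops the play

  tasks:
    - name: Apply pending package upgrades
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true

    - name: Check whether the host asks for a reboot
      ansible.builtin.stat:
        path: /var/run/reboot-required   # flag file written on Debian/Ubuntu
      register: reboot_flag

    - name: Reboot and wait for the host to return
      ansible.builtin.reboot:
        reboot_timeout: 600
      when: reboot_flag.stat.exists
```

Combine it with --limit for the first run, so the first batch is a host you can afford to lose for an hour.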

Where Intune still wins

None of this is an argument against Intune.

Windows security baselines, compliance policies, app deployment, device configuration, BitLocker, Microsoft Defender for Endpoint settings, attack surface reduction, platform controls — keep that in the right place. Do not replace a proper endpoint control plane with a pile of SSH tasks.

Use Ansible only where the target is infrastructure Intune is not meant to orchestrate:

Linux server infrastructure that is not enrolled or suitable as an Intune endpoint
lab containers
Docker hosts
internal service nodes
self-hosted CI/CD runners
monitoring collectors
network-adjacent utility hosts

The useful mental model

The useful question is not:

Can Intune manage this?

The useful question is:

Which control plane owns this risk?

If it is a user endpoint, Microsoft Intune is probably the answer. If it is Microsoft 365 access, Microsoft Entra ID and Conditional Access are probably in the answer. If it is detection and response, Microsoft Defender XDR and Microsoft Sentinel are probably in the answer. If it is Linux infrastructure, Docker hosts, or internal automation runners, Ansible may be one boring tool that keeps the floor from moving.

The mistake is not using multiple tools.

The mistake is having no ownership boundary between them.

What I would not automate

Not everything belongs in a playbook.

I do not like fully automating one-time runner registration tokens. I do not like hiding privileged bootstrap credentials in convenience scripts. I do not like long-lived Microsoft Graph client secrets on persistent runners. I do not like pulling Galaxy roles or container images and running them as root without pinning a version or digest I have actually looked at. I do not like playbooks that silently create broad access because “the lab needs it”. I do not like tasks that mutate production infrastructure without check mode, logging, or a rollback path.
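For the pinning point, a requirements file keeps collection versions explicit and reviewable. The sketch below covers the two collections the examples in this post rely on. The version numbers are placeholders, so replace them with releases you have actually reviewed.

```yaml
# requirements.yml (version numbers are placeholders, not recommendations)
collections:
  - name: community.general     # provides the timezone module used in the baseline
    version: "10.5.0"
  - name: ansible.posix         # provides authorized_key
    version: "1.6.2"
```

Install with ansible-galaxy collection install -r requirements.yml, and treat a version bump as a pull request like any other change.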

Automation should reduce surprise.

If it increases surprise, it is just faster chaos.

The actual takeaway

Ansible is not the point. It is the example I use in my homelab.

The point is that the infrastructure around your Microsoft 365 administration becomes part of your security posture the moment it can run privileged automation.

If it touches your tenant, deploys your scripts, stores your logs, or runs your jobs, it deserves a control model.

Intune handles the endpoint control plane. Ansible can be one way to handle the infrastructure gap. Git gives you review. Check mode gives you a warning shot. Inventory gives you scope.

None of that is glamorous.

Good.

Glamour is usually where drift hides.
