
Orka for VDI Environment

How to Use This Guide

This is a quick reference for troubleshooting common issues. Each section is organized by symptom (what users report or what you observe), followed by its likely causes and diagnostic steps. During an incident:
  1. Find the symptom that matches your situation
  2. Follow the diagnostic commands in order
  3. Apply the recommended solution
  4. Document what worked for your post-incident review
Tool usage guidelines:
  • Primary method: Always use Ansible playbooks for VM operations (deploy, delete, start, stop, image management)
  • Advanced diagnostics: You can SSH to hosts and use orka-engine CLI commands for lower-level troubleshooting
  • Examples: orka-engine vm list, orka-engine vm run --image <image> --net-interface en0
  • Note: All production operations should go through Ansible playbooks to maintain consistency
If your issue isn’t listed here: Escalate using the procedures in Guide B: Day-2 Operations Guide.

VM Provisioning Issues

Symptom: VM Deployment Fails Completely

What you see: deploy.yml playbook fails with error messages
Quick diagnostic:
  1. Run the deployment with verbose output:
    ansible-playbook -i dev/inventory deploy.yml \
    -e "vm_group=test" \
    -e "desired_vms=1" \
    -e "vm_image=<image-name>" \
    -vvv
  2. Verify the image can be pulled:
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image-name>" \
    -v
  3. Check host disk space:
    ansible hosts -i dev/inventory -m shell -a "df -h /var/orka"
  4. Confirm Orka Engine is responsive:
    ansible hosts -i dev/inventory -m shell -a "orka-engine --version"
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Image doesn’t exist | Pulling image fails with error 404/not found | Check image name/tag; verify in container registry |
| Registry authentication failed | Image pull fails with authentication error | Verify registry_username and registry_password |
| Host out of disk space | df shows >90% of available space used on /var/orka | Clean up old images; add storage |
| Host out of CPU/memory | Error mentions resource limits | Reduce VMs per host or add hosts |
| Image incompatible with host | Error mentions architecture mismatch | Use ARM images for Apple Silicon hosts |
| Network timeout pulling image | Pull times out | Check network connectivity to registry |
| Orka Engine is unresponsive | Commands hang or time out | Restart Orka Engine; contact MacStadium |
Most common fix: Image name/tag typo or a registry authentication failure.
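
The disk-space row above can be checked across the fleet by parsing the df output. A minimal sketch, assuming the standard six-column df layout; the function name and 90% threshold are illustrative, not part of the Orka tooling:

```shell
# Hypothetical helper: flag mounts whose usage exceeds a threshold.
# Feed it the output of: ansible hosts -i dev/inventory -m shell -a "df -h /var/orka"
check_disk_usage() {
  df_output="$1"; threshold="${2:-90}"
  printf '%s\n' "$df_output" | awk -v t="$threshold" '
    $5 ~ /^[0-9]+%$/ {              # data rows only: 5th column is "NN%"
      pct = $5; sub(/%/, "", pct)
      if (pct + 0 > t) print $6, pct "% used"
    }'
}
```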

Symptom: VMs Deploy But Won’t Start

What you see: Deployment succeeds but VMs show “Stopped” or error status
Quick diagnostic:
  1. Check the VM status:
    ansible-playbook -i dev/inventory list.yml | grep <vm-name>
  2. Try starting the VM explicitly:
    ansible-playbook -i dev/inventory vm.yml \
    -e "vm_name=<vm-name>" \
    -e "desired_state=running" \
    -v
  3. Check Orka Engine logs on the host:
    ssh admin@<host-ip>
    sudo log show --predicate 'process == "orka-engine"' --last 30m | grep -i error
  4. Inspect the VM configuration:
    ssh admin@<host-ip>
    orka-engine vm list <vm-name> --format json

Likely causes and solutions:

| Cause | How to verify | Solution |
| --- | --- | --- |
| Corrupted VM image | Logs show image errors | Re-pull image using pull_image.yml; redeploy VM |
| Insufficient host resources | Logs show resource allocation failure | Free resources on host; delete unused VMs or deploy to a different host |
| VM configuration invalid | JSON shows invalid CPU/memory values | Verify all deployment parameters are correct; check image compatibility |
| Storage backend issue | Logs show I/O errors | Check host storage health with df -h; contact MacStadium |
| Boot disk missing | Logs show disk not found | Delete VM and redeploy from scratch using a fresh image pull |
Most common fix: Corrupted image during pull. Re-pull the image to the host and redeploy.

Symptom: Wrong Number of VMs Deployed

What you see: Requested 10 VMs but only 7 deployed, or deployment stopped partway through
Quick diagnostic:
  1. Count the VMs deployed in the group:
    ansible-playbook -i dev/inventory list.yml -e "vm_group=<group>" | grep <group> | wc -l
  2. Check host resource utilization:
    ansible hosts -i dev/inventory -m shell -a "top -l 1 | head -20"
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Hit max_vms_per_host limit | VMs distributed but stopped early | Increase limit or add more hosts |
| One or more hosts failed | Some hosts show errors | Fix failed hosts; redeploy remaining VMs |
| Ran out of IP addresses | Bridged mode: DHCP exhausted | Expand DHCP pool or use different subnet |
| Partial playbook failure | Playbook shows some failed tasks | Review errors in verbose output (-vvv); fix issues; rerun playbook |
Most common fix: You may have hit the max_vms_per_host limit. Add more hosts to distribute VM load.
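
The count from step 1 can be compared against the requested size automatically. A sketch, assuming one line of list.yml output per VM; the function name and message format are invented for illustration:

```shell
# Hypothetical check: compare deployed VM count for a group against the request.
# $1 = output of: ansible-playbook -i dev/inventory list.yml -e "vm_group=<group>"
vm_count_check() {
  list_output="$1"; group="$2"; desired="$3"
  actual=$(printf '%s\n' "$list_output" | grep -c "$group")
  if [ "$actual" -lt "$desired" ]; then
    echo "SHORT: $actual/$desired VMs in group $group"
  else
    echo "OK: $actual/$desired VMs in group $group"
  fi
}
```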

Symptom: Can’t Delete VMs

What you see: delete.yml or vm.yml with desired_state=absent fails
Quick diagnostic:
  1. Retry the deletion with verbose output:
    ansible-playbook -i dev/inventory vm.yml \
    -e "vm_name=<vm-name>" \
    -e "desired_state=absent" \
    -vvv
  2. Check whether the VM still exists:
    ansible-playbook -i dev/inventory list.yml | grep <vm-name>
  3. Try a force stop and delete on the host:
    ssh admin@<host-ip>
    orka-engine vm stop <vm-name> --force
    orka-engine vm delete <vm-name>
  4. Check for orphaned VM processes:
    ssh admin@<host-ip>
    ps aux | grep <vm-name>
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| VM already deleted | list.yml doesn’t show VM | Ignore error; VM is already deleted |
| VM stuck in a hung state | Force stop succeeds; delete succeeds | SSH to host; use orka-engine vm stop <vm> --force then delete |
| Orka Engine issue | All delete operations failing | SSH to host; check Orka Engine service status; contact MacStadium |
| VM disk locked | Logs show disk busy error | Stop all VMs using the disk; retry |
| Permission issue | Logs show permission denied | Verify ansible_user has sudo access; check ansible_become=yes in inventory |
Most common fix: VM is stuck in a hung state. Force stop and delete the VM via SSH to the host.
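
The force-stop-then-delete sequence from step 3 can be wrapped so it's easy to dry-run before touching a host. A sketch; the wrapper itself is hypothetical, only the orka-engine commands come from the diagnostic above:

```shell
# Hypothetical wrapper for the force-stop-then-delete sequence.
# Pass a runner: "ssh admin@<host-ip>" in production, or the default "echo" for a dry run.
force_delete_vm() {
  vm="$1"; run="${2:-echo}"
  $run orka-engine vm stop "$vm" --force && \
    $run orka-engine vm delete "$vm"
}
```

A dry run (`force_delete_vm <vm-name>`) prints the two commands without executing them.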

Citrix VDA Registration Failures

Symptom: New VMs Won’t Register with Citrix

What you see: VMs show as “Unregistered” in Citrix Cloud Console → Monitor → Machines
Quick diagnostic:
  1. Verify the VM is running:
    ansible-playbook -i dev/inventory list.yml | grep <vm-name>
  2. SSH to the host, then VNC into the VM:
    ssh admin@<host-ip>
    open vnc://<vm-ip>:5900
  3. Check VDA status: System Preferences → Citrix VDA. Should show “Registered” with Cloud Connector details.
  4. Test network connectivity from the VM Terminal:
    ping <cloud-connector-ip>
    curl https://api.cloud.com
  5. Check VDA service logs on the VM: Console.app → Search for “Citrix”
Most common fix: Firewall blocking outbound HTTPS from VMs to the Citrix Cloud Connector. Verify ports 443, 1494, and 2598 are open.

Symptom: VMs Were Registered, Now Show as Unregistered

What you see: VMs that were working now show “Unregistered” status Quick diagnostic:
  1. Verify the VM is running: ansible-playbook -i dev/inventory list.yml | grep <vm-name>
  2. VNC into the VM and check VDA status:
    ssh admin@<host-ip>
    open vnc://<vm-ip>:5900
  3. Navigate to: System Preferences → Citrix VDA. Should show: “Registered” with Cloud Connector details.
  4. Test network connectivity from the VM:
    ping <cloud-connector-ip>
    curl https://api.cloud.com
  5. Check VDA service logs. On the VM: Console.app → Search for “Citrix”
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| VDA not installed in image | System Preferences has no Citrix VDA pane | Rebuild golden image with VDA installed |
| VDA not configured | VDA pane shows “Not configured” | Configure VDA with Cloud Connector details |
| VDA service crashed on VMs | VDA status shows “Stopped” on multiple VMs | Restart affected VMs using vm.yml with desired_state=stopped then running |
| Host reboot without VM auto-start | All VMs on one host unregistered simultaneously | Start VMs using Ansible; configure auto-start in Orka if available |
| Network can’t reach Cloud Connector | Ping/curl fails | Check firewall rules; verify outbound HTTPS |
| Wrong Cloud Connector configured | VDA shows wrong Cloud Connector IP | Reconfigure VDA in golden image |
| VDA service not running | VDA status shows “Stopped” | Restart VDA service or reboot VM |
| Firewall blocking required ports | Ports 443, 1494, 2598 are blocked | Open required ports in firewall |
| Citrix licensing issue | VDA shows licensing error | Check Citrix Cloud licenses; contact support |
Most common fix: Cloud Connector lost network connectivity or service crashed. Restart the Cloud Connector.
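
The firewall row above (ports 443, 1494, 2598) can be checked from a VM with a short loop. A sketch assuming nc(1) is available in the VM; the function name is illustrative:

```shell
# Check the Citrix ports listed above from a VM to the Cloud Connector.
check_citrix_ports() {
  host="$1"; failed=""
  for port in 443 1494 2598; do
    nc -z -w 3 "$host" "$port" 2>/dev/null || failed="$failed $port"
  done
  if [ -z "$failed" ]; then echo "all ports open"; else echo "blocked:$failed"; fi
}
```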

Symptom: VDA Shows “Registration in Progress” Indefinitely

What you see: VDA status remains stuck on “Registering…” and never completes Quick diagnostic:
  1. Check if Cloud Connector is up by navigating to Citrix Cloud Console → Monitor → Cloud Connectors, and look for: Status “Up” or “Down”.
  2. Check if multiple VMs are affected: Monitor → Machines → Filter by Delivery Group. Check if all VMs show as unregistered, or just some.
  3. Test connectivity from VM to Cloud Connector by running ssh admin@<host-ip>, VNC to the VM, then run: ping <cloud-connector-ip>
  4. Check the VDA service on the VM: System Preferences → Citrix VDA.
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Cloud Connector down | Console shows “Down” status | Restart Cloud Connector VM/service |
| DNS resolution failing | nslookup <cloud-connector-fqdn> fails from VM | Fix DNS configuration in golden image or via DHCP; use IP address temporarily |
| Incorrect broker address | VDA configured with wrong Cloud Connector address | Fix broker address in VDA configuration in golden image; redeploy VMs |
| Network path failure | VMs can’t ping Cloud Connector | Check network/firewall; contact network team |
| VDA service crashed | VDA status shows “Stopped” on VMs | Restart affected VMs |
| Citrix Cloud service issue | Cloud Connector up but VMs unregistered | Check Citrix status page; contact support |
| Host reboot without VM auto-start | All VMs on one host unregistered | Start VMs manually; configure auto-start |
| Certificate expiration | VDA logs show cert errors | Renew certificates; update VDA configuration |
Most common fix: Verify that the VM can resolve the Cloud Connector hostname or configure it with a correct/updated IP address.

Symptom: VMs Register But Users Can’t Connect

What you see: Citrix Console shows VMs are “Registered” but users get connection errors
Quick diagnostic:
  1. Verify the user is in the correct Delivery Group by navigating to Citrix Cloud Console → Manage → Delivery Groups → Search for the user
  2. Check that the VM is actually in an “Available” state by navigating to Monitor → Machines → and checking the ‘Status’ column. This should show “Available” not “In Use” or “Maintenance”.
  3. Test the connection yourself with a test account by launching a desktop from Citrix Workspace.
  4. Check for Citrix policy issues. Policies → Review policies applied to the impacted Delivery Group.
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| User not in Delivery Group | Search shows no assignment | Add user to the appropriate Delivery Group |
| VM in maintenance mode | Status shows “Maintenance Mode” | Take the VM out of maintenance mode |
| Delivery Group misconfigured | No desktops published | Check Delivery Group configuration |
| Session limit reached | Policies show max sessions = 1, already in use | Increase session limit or deploy more VMs |
| HDX protocol failure | Users get protocol error | Check HDX policies; test with different user |
| VM networking issue | VM registered but can’t be reached | Verify VM network connectivity |
Most common fix: User was not added to the Delivery Group. Add the user via Citrix Cloud Console.

Network and Connectivity Problems

Symptom: VMs Can’t Reach Internet

What you see: Users report “No internet connection” / Can’t browse web or download updates Quick diagnostic:
  1. Test basic connectivity from the VM by VNCing into the VM:
    ping 8.8.8.8
    ping google.com
    curl https://google.com
  2. Check the VM network configuration
    ifconfig
    route -n get default
  3. Check DNS configuration
    cat /etc/resolv.conf
  4. Test connecting from the host to rule out any host issues
    ssh admin@<host-ip>
    ping 8.8.8.8
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| DNS not configured | resolv.conf is empty or incorrect | Add DNS servers to golden image or DHCP |
| No default gateway | route -n get default shows no route | Configure gateway in image or via DHCP |
| Proxy required | Network requires proxy for internet access | Configure proxy settings in golden image; set HTTP_PROXY variables |
| DHCP not providing DNS/gateway | Bridged mode: VM has IP but no DNS/gateway | Fix DHCP server configuration to provide DNS and gateway options |
| Firewall is blocking VM traffic | Ping fails from VM but succeeds from host | Add firewall rule for VM subnet |
| NAT not working | Host reaches internet but VM doesn’t | Check Orka NAT configuration on host |
| Upstream network outage | Host also can’t reach internet | Contact network team/ISP |
| VM subnet not routed | Traceroute shows no path | Add routing for VM subnet |
Most common fix: DNS is not configured in the golden image. Add DNS servers (e.g., 8.8.8.8, 8.8.4.4) to network config in the image template.
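
Steps 1–3 of the diagnostic can be ordered into a single check: IP reachability first, then name resolution. A sketch; the short-timeout ping flags are the common Linux/macOS form and the messages are illustrative:

```shell
# Distinguish a DNS failure from a routing failure.
diagnose_vm_net() {
  if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    if ping -c 1 -W 2 google.com >/dev/null 2>&1; then
      echo "connectivity ok"
    else
      echo "dns failure: routing works but names do not resolve"
    fi
  else
    echo "no route: check gateway, NAT, or firewall"
  fi
}
```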

Symptom: VMs Can’t Reach Internal Corporate Services

What you see: “Can’t access file shares” / “Internal apps unreachable” / “Need VPN”
Quick diagnostic:
  1. Test connecting from the VM to internal services:
    ping <internal-server-ip>
    telnet <internal-server-ip> <port>
  2. Check routing: traceroute <internal-server-ip>
  3. Test from the host (confirm the host has access):
    ssh admin@<host-ip>
    ping <internal-server-ip>
  4. Check if the VM subnet is allowed through company firewalls. Contact your network IT team with the source (VM or subnet), the destination (internal service IP), and the ports needed.
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Firewall is blocking the VM subnet | Host can reach but VM cannot | Add firewall rule to allow VM subnet |
| VMs are on an isolated VLAN | Traceroute shows no route | Move VMs to correct VLAN or add routing |
| Missing static route | No path to internal network | Add static route on VMs or router |
| Server-side firewall | Server blocks VM IPs | Update server firewall to allow VM subnet |
| ACL blocking traffic | Traffic dropped at switch/router | Update ACLs to permit VM traffic |
| Split-tunnel VPN required | Services only accessible via VPN | Configure VPN on VMs or route through VPN gateway |
Most common fix: Your firewall rules don’t include the VM subnet. Work with your network team to add the appropriate ‘allow’ rules.

Symptom: Bridged Mode VMs Getting Wrong IPs (192.168.64.x)

What you see: VMs should get corporate IPs but are getting 192.168.64.x private range instead
Important prerequisite: Bridged networking requires a DHCP server on your network that can assign IP addresses to VMs. If you don’t have DHCP configured, VMs will fall back to NAT mode with 192.168.64.x addresses.
Quick diagnostic:
  1. Check cluster configuration on the Orka for VDI management node: cat /path/to/cluster.yml | grep vm_network_mode. This should show: vm_network_mode: bridge.
  2. Check host interface configuration: cat /path/to/nodes.yml | grep osx_node_vm_network_interface or check the hosts file: cat /path/to/hosts | grep osx_node_vm_network_interface
  3. Verify that the interface exists on the host: ssh admin@<host-ip> ifconfig | grep <interface-name>
  4. Check DHCP traffic on the interface: ssh admin@<host-ip> sudo tcpdump -i <interface-name> port 67 and port 68
  5. Deploy a test VM and watch for DHCP requests/replies
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| osx_node_vm_network_interface is not set | Config files missing interface setting | Add to nodes.yml or hosts file |
| Wrong interface name specified | Interface doesn’t exist in ifconfig output | SSH to host; verify the correct interface name |
| Deployment missing bridge mode flag | Check deploy command history | Redeploy with --extra-vars "vm_network_mode=bridge" |
| Configuration not applied to hosts | Config updated but playbook not rerun | Rerun the host configuration Ansible playbook |
| DHCP server is unreachable | tcpdump shows no DHCP replies | Verify DHCP server; check interface connection |
| Some VMs are still using NAT | Mixed NAT and bridge VMs | Delete all VMs; redeploy after config change |
Solution steps:
  1. Delete all VMs (this is required before switching networking modes):
    ansible-playbook -i dev/inventory delete.yml \
    -e "vm_group=<all-groups>" \
    -e "delete_count=<all>"
  2. Verify the configuration files:
    cluster.yml should have: vm_network_mode: bridge
    nodes.yml should have: osx_node_vm_network_interface: <interface>
  3. Reapply the host configuration: ansible-playbook -i dev/inventory configure-hosts.yml
  4. Deploy a test VM:
    ansible-playbook -i dev/inventory deploy.yml \
    -e "vm_group=test" \
    -e "desired_vms=1"
  5. Verify that the VM received a corporate IP address: ansible-playbook -i dev/inventory list.yml | grep test
Most common fix: osx_node_vm_network_interface is not set or was set incorrectly. Verify the interface name, then reapply the configuration.
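
The NAT fallback described above can be spotted in bulk by scanning the VM list for the 192.168.64.x range. A sketch that assumes the VM name in column 1 and its IP in column 2 (adjust the column numbers to your actual list.yml output):

```shell
# Flag VMs that fell back to the NAT range instead of a corporate IP.
find_nat_fallback_vms() {
  printf '%s\n' "$1" | awk '$2 ~ /^192\.168\.64\./ { print $1 }'
}
```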

Symptom: Intermittent Network Connectivity

What you see: Network works sometimes, drops randomly, packet loss
Quick diagnostic:
  1. Run a continuous ping test from the VM by VNCing in and running: ping -c 100 8.8.8.8. Look for the packet loss percentage.
  2. Check for interface errors on the host; look for errors/drops in the output:
    ssh admin@<host-ip>
    netstat -i
  3. Check host network utilization:
    ssh admin@<host-ip>
    nload (or: iftop)
  4. Check whether the issue is specific to one host or affects all hosts. Test VMs on different hosts to confirm.
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Network congestion | nload shows saturated bandwidth | Apply QoS configuration; add bandwidth |
| Faulty network hardware | Errors show on specific interface | Replace cable/switch; contact MacStadium |
| Host overloaded | High CPU/memory on host | Reduce VMs on host or upgrade host |
| Spanning tree reconvergence | Brief outages periodically | Tune STP or use rapid STP |
| IP address conflicts | Multiple devices with same IP | Check DHCP pool size; fix duplicates |
| Wireless interference (if wireless) | Packet loss at specific times | Use wired connection; change channel |
Most common fix: Network congestion or host overloaded. Reduce VMs per host or work with network team on quality of service improvements.
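
The packet loss figure from step 1 can be pulled out of the ping summary line programmatically, which is handy when collecting results from many VMs. A small sketch; the function name is illustrative:

```shell
# Extract the packet-loss percentage from a ping summary line.
packet_loss_pct() {
  printf '%s\n' "$1" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```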

Image Cache and Distribution Issues

Symptom: Image Pull Extremely Slow

What you see: pull_image.yml takes 30+ minutes for reasonably sized images
Quick diagnostic:
  1. Test registry connection and speed:
    ssh admin@<host-ip>
    time curl -o /dev/null https://<registry>/test-file
  2. Check the image size in your container registry UI
  3. Monitor network utilization during the image pull:
    ssh admin@<host-ip>
    nload
  4. Check whether the registry is rate limiting; look for throttling messages in the pull output:
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image>" \
    -vvv | grep -i "limit\|throttle"
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Image is extremely large | Image size >50GB | Optimize image; remove unnecessary files |
| Registry is in a different datacenter | High latency/slow speeds to registry | Deploy registry in the same datacenter or use mirror |
| Network congestion | Bandwidth saturated during pull | Schedule pulls during off-hours |
| Registry is rate limiting | Pull logs show throttling | Contact registry admin; increase limits |
| Slow registry storage | Registry on slow disks | Upgrade registry storage backend |
| Shared bandwidth limits | Multiple hosts pulling simultaneously | Stagger pulls across hosts |
Most common fix: Registry is located in a geographically distant datacenter. Deploy the registry closer to Orka hosts or use registry replication.
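
To decide between "image too large" and "link too slow", it helps to turn the step-1 timing into an effective throughput figure. A trivial sketch (e.g. a 40GB image at 20 MB/s is roughly 34 minutes, which matches the "30+ minutes" symptom):

```shell
# Rough effective throughput: image size in MB over elapsed seconds.
pull_throughput_mbs() {
  awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f MB/s\n", mb / s }'
}
```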

Symptom: Image Pull Fails with Authentication Error

What you see: “unauthorized” / “authentication required” / “403 Forbidden”
Quick diagnostic:
  1. Test registry authentication manually: curl -u <username>:<password> https://<registry>/v2/_catalog
  2. Verify the credentials in the Ansible playbook command; check that registry_username and registry_password are correct
  3. Test a pull with explicit credentials:
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image>" \
    -e "registry_username=<user>" \
    -e "registry_password=<pass>" \
    -vvv
  4. If available, check the registry access logs for authentication failures.
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Wrong credentials | Manual curl fails with the same credentials | Verify username/password; reset if needed |
| Credentials expired | Were working before, now failing | Update credentials; refresh tokens |
| User lacks pull permissions | Auth succeeds but pull denied | Grant pull permissions in registry |
| Registry requires token auth | Password auth doesn’t work | Use token-based auth; update playbook params |
| Network blocking auth endpoint | Can’t reach registry auth server | Check firewall rules for auth endpoint |
| Insecure registry without flag | TLS/cert verification fails | Add -e "insecure_pull=true" if appropriate |
Most common fix: Credentials are outdated or incorrect. Verify and update your registry_username and registry_password values.
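
The manual curl from step 1 can be turned into a classifier on the HTTP status code, which maps directly onto the first three table rows. A sketch; the /v2/_catalog endpoint is standard for the Docker Registry HTTP API v2, but the function name and messages are illustrative:

```shell
# Classify the registry response by HTTP status code.
check_registry_auth() {
  code=$(curl -s -o /dev/null -w '%{http_code}' -u "$2:$3" "https://$1/v2/_catalog")
  case "$code" in
    200) echo "auth ok" ;;
    401) echo "bad or missing credentials" ;;
    403) echo "authenticated but lacks pull permission" ;;
    *)   echo "unexpected status: $code" ;;
  esac
}
```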

Symptom: Image Pull Succeeds, But Deploy Fails

What you see: pull_image.yml succeeds but deploy.yml can’t find the image
Quick diagnostic:
  1. Verify the image was pulled successfully:
    ansible-playbook -i dev/inventory list.yml
  2. Try pulling the image again with verbose output:
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image-name>" \
    -vvv
  3. Check that the image name and tags match exactly
  4. Try deploying to a specific host:
    ansible-playbook -i dev/inventory deploy.yml \
    -e "vm_group=test" \
    -e "desired_vms=1" \
    -e "vm_image=<exact-image-name>" \
    --limit <specific-host>
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Image name mismatch | Image pulled with a different name/tag | Use exact same image name in deploy command |
| Image tag omitted or incorrect | Image pulled with :latest but deploy uses :v1.0 | Always specify explicit tags; avoid :latest in production |
| Image pulled to wrong host | Deploy targeting host without image | Pull image to all hosts using pull_image.yml without --limit |
| Image tag changed | Image exists, but with different tag(s) | Use correct image tag (including :latest if needed) |
| Deployment targeting wrong image | Deploy command references different image path | Verify vm_image parameter matches pulled image exactly |
| Image corrupted during pull | Image exists, but is damaged | Delete image; re-pull from container registry |
| Case sensitivity issue | Image names differ only in case | Use exact, case-sensitive image name |
Most common fix: Image name/tag mismatch between pull and deploy. Ensure an exact match, including image tags.
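
The most common cause above can be guarded against with a small comparison that treats a missing tag as :latest. A simplified sketch (the heuristic splits on ":" and would need smarter parsing for registry hostnames that include a port):

```shell
# Normalize an image reference so pull and deploy compare apples to apples.
normalize_ref() { case "$1" in *:*) echo "$1" ;; *) echo "$1:latest" ;; esac; }

refs_match() {
  [ "$(normalize_ref "$1")" = "$(normalize_ref "$2")" ] \
    && echo "match" || echo "mismatch: pulled '$1' vs deploy '$2'"
}
```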

Symptom: Can’t Push New Image to Registry

What you see: create_image.yml fails during the push phase
Quick diagnostic:
  1. Run the create_image.yml Ansible playbook with verbose output:
    ansible-playbook -i dev/inventory create_image.yml \
    -e "vm_image=<source>" \
    -e "remote_image_name=<destination>" \
    -e "registry_username=<user>" \
    -e "registry_password=<pass>" \
    -vvv
  2. Check registry authentication:
    curl -u <user>:<pass> https://<registry>/v2/_catalog
  3. Check registry storage space
  4. Verify sufficient host disk space: ansible hosts -i dev/inventory -m shell -a "df -h /var/orka"
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Registry authentication failed | curl returns 401 error | Verify push credentials; check permissions |
| Registry out of storage | Push fails with storage error | Expand registry storage; clean old images |
| Registry quota exceeded | Error mentions quota | Increase quota or clean up images |
| Insufficient host disk space | Can’t create/prepare image locally for push | Free space on host; delete unused VMs using delete.yml or vm.yml |
| Source VM not stopped | Image creation requires a stopped VM | Stop source VM before running create_image.yml playbook |
| Insecure registry without flag | TLS/cert error | Add -e "insecure_push=true" if appropriate |
| Network timeout during push | Push times out | Check network; try again during off-hours |
| Image name violates registry policy | Push rejected by policy | Follow registry naming conventions |
Most common fix: Registry authentication or insufficient storage space. Verify credentials and check registry capacity.

Symptom: Inconsistent Images Across Hosts

What you see: The same image name on different hosts, but with different behavior/versions
Quick diagnostic:
  1. List VMs on all hosts to check deployment times:
    ansible-playbook -i dev/inventory list.yml
  2. Check when VMs were last deployed using list.yml to show the VM status
  3. Test pulling an image to verify registry behavior:
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image-name>" \
    --limit <single-host>
  4. Check the registry for image versions
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Images pulled at different times | Timestamps differ; registry updated between pulls | Re-pull image to all hosts |
| Cached old version | Digest doesn’t match registry | Delete the cached image on the host and re-pull with pull_image.yml |
| Different image tags used | Tags differ across hosts | Standardize on a specific tag (not :latest) |
| Registry changed without notification | Registry version changed | Coordinate with registry team on updates |
| Partial pull failure | Some hosts have corrupted image | Delete and re-pull on affected hosts |

Solution steps:

  1. Pull a fresh image to all hosts
    ansible-playbook -i dev/inventory pull_image.yml \
    -e "remote_image_name=<image>"
  2. Verify all hosts completed the pull successfully; check playbook output for any errors
  3. Redeploy VMs from the freshly pulled image
    ansible-playbook -i dev/inventory deploy.yml \
    -e "vm_group=<group>" \
    -e "desired_vms=<count>" \
    -e "vm_image=<image>"
Most common fix: Images were pulled at different times, with registry updates between. Re-pull images to all hosts for consistency.

Performance and Latency Problems

Symptom: Desktop Feels Sluggish for Users

What you see: Users report slow response, lag, choppy mouse movement, etc.
Quick diagnostic:
  1. Check host resource utilization
    ansible hosts -i dev/inventory -m shell -a "top -l 1 | head -20"
  2. Count VMs per host
    ansible-playbook -i dev/inventory list.yml | grep <host-name> | wc -l
  3. Check specific VM resources by VNCing into the VM:
    open vnc://<vm-ip>:5900
    Open Activity Monitor → Check CPU, Memory, Disk, Network
  4. Test the user’s network latency (if remote). Ask the user to ping the VM IP, or use the Citrix HDX Tester tool
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Host overloaded | CPU consistently >85% in top output | Redistribute VMs to other hosts using delete/redeploy; add more hosts |
| Too many VMs per host | VM count exceeds 2 per host | Move VMs to other hosts; respect max_vms_per_host limits |
| VM resource exhaustion | Activity Monitor shows VM maxed CPU/memory | Restart VM using vm.yml playbook; consider increasing VM resources in golden image |
| Network latency (remote users) | Ping >100ms or visible packet loss | Tune HDX policies for high latency; user needs better network connection |
| Disk I/O bottleneck | Activity Monitor shows red disk pressure indicator | Check host storage performance with MacStadium; reduce VM count on host |
| Background processes | Spotlight indexing (mds) or updates consuming CPU | Wait for processes to complete; configure indexing schedules in golden image |
| Insufficient VM CPU/memory | VM configured with too few resources | Create new golden image with more CPU/memory allocation; redeploy VMs |
| Host storage saturation | Multiple VMs competing for disk I/O | Move VMs to hosts with faster storage; contact MacStadium about storage upgrades |
Most common fix: Host is overloaded. Redistribute VMs across hosts or add capacity.
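
The ">85% CPU" check above can be computed from the macOS `top -l 1` header line collected in step 1. A sketch assuming the standard "CPU usage: X% user, Y% sys, Z% idle" format; the function name is illustrative:

```shell
# Sum user + sys CPU from the macOS "top -l 1" CPU usage line.
host_cpu_busy_pct() {
  printf '%s\n' "$1" | awk '/CPU usage/ { gsub(/%/, ""); print $3 + $5 }'
}
```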

Symptom: Poor Video Quality or Choppy Playback

What you see: Pixelated screen, blurry text, stuttering video Quick diagnostic:
  1. Check the user’s network bandwidth
  2. Ask the user to run: Speedtest by Ookla - The Global Broadband Speed Test
 1. &lt;5 Mbps indicates the user is experiencing low bandwidth issues
  1. Check the user’s HDX Visual Quality policy by navigating to: Citrix Cloud Console → Policies.
 1. Find the user's policy in → HDX Settings → Visual Quality
  1. Test bandwidth/performance with Citrix HDX Monitor (if available), as this shows real-time HDX metrics
  2. Check the user’s connection type (Are they connecting remotely/via VPN? On wifi? Wired?)
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Low bandwidth connection | User speed test shows <5 Mbps download | Adjust HDX Visual Quality policy to “Low” or “Medium” for user’s Delivery Group |
| Visual Quality policy too low | Policy shows “Medium” or “Low” setting | Increase to “High” or “Build to Lossless” for users with good connections |
| High latency connection | Ping shows >150ms round-trip time | Enable HDX Adaptive Transport in Citrix policies |
| VPN throttling bandwidth | User on VPN with constrained bandwidth | Contact network team about VPN QoS settings; increase VPN bandwidth allocation |
| WiFi interference/weak signal | User on WiFi with poor signal strength | Switch user to wired Ethernet connection; improve WiFi signal; use 5GHz band |
| Application not HDX-optimized | Specific app shows poor graphics | Check for HDX optimization packs for that application; use app remoting instead |
| Host CPU overloaded | Multiple users experiencing poor quality | Reduce VMs per host; add more hosts to distribute load |
| User’s client device underpowered | Old/slow computer struggling with HDX | Update Citrix Workspace app; consider thin client hardware upgrade |
Most common fix: HDX Visual Quality policy too conservative. Increase quality for users with good connections.

Symptom: Slow Login Times (>2 Minutes)

What you see: Long wait from launching the desktop to a usable desktop
Quick diagnostic:
  1. Measure the login time components:
    • Time for the desktop to appear in Workspace: Citrix delivery time
    • Time from click to login prompt: VM boot time (if stopped)
    • Time from login to desktop: user profile load time
  2. Check if the VM had to boot: ansible-playbook -i dev/inventory list.yml | grep <vm-name>. If the VM was stopped, boot time is included.
  3. Check the user profile size (if using roaming profiles). VNC to the VM after user login: du -sh /Users/<username>
  4. Monitor VM resources during login: watch Activity Monitor during the login process
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| VM boot time included | VM was stopped and had to start | Keep VMs running 24/7; deploy adequate pool size to avoid stopping VMs |
| Large roaming profile | User’s home directory >10GB in size | Implement folder redirection; enable profile cleanup policies; limit profile size |
| Login scripts timing out | Console shows script errors or delays during login | Fix or remove problematic login scripts; optimize script performance |
| Slow network share access | Profile stored on congested network storage | Optimize network storage performance; use faster storage for profiles |
| Spotlight indexing on first login | First login after VM creation triggers indexing | Allow indexing to complete once; optimize indexing settings in golden image |
| Too many login items | Many applications launching at login | Remove unnecessary startup items; configure minimal login items in golden image |
| GPO processing delay | Long wait at “Applying policies” screen | Optimize Group Policy Objects; reduce number of policies; use loopback processing |
| Profile corruption | Login hangs or fails repeatedly | Delete local profile; force fresh profile download from server |
Most common fix: Large roaming profiles. Implement folder redirection and profile cleanup policies.
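
The 10GB profile threshold above can be checked against the `du -sh /Users/<username>` output from step 3. A sketch that only handles G-suffixed sizes (smaller M-suffixed profiles are under the threshold by definition); the function name is illustrative:

```shell
# Flag roaming profiles over a size threshold, given "du -sh <path>" output.
check_profile_size() {
  printf '%s\n' "$1" | awk -v limit=10 '$1 ~ /G$/ {
    sz = $1; sub(/G/, "", sz)
    if (sz + 0 > limit) print $2 " is " $1 " (over " limit "G)"
  }'
}
```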

Symptom: Application Launches Are Slow

What you see: Apps take 30+ seconds to launch after clicking
Quick diagnostic:
  1. Test from within the VM directly: VNC to the VM, launch the app, and time the launch
  2. Check if the app is on a network share vs. local storage (applications on network drives are slower)
  3. Check VM disk I/O: Activity Monitor → Disk tab during application launch
  4. Check available memory: Activity Monitor → Memory tab; look for increased memory pressure
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Apps installed on network share | App path shows network/UNC location | Install applications locally in golden image; update image; redeploy VMs |
| Insufficient memory | Memory pressure high; heavy swap usage shown | Create golden image with more RAM allocation; redeploy VMs for power users |
| Slow disk I/O | Disk wait times high in Activity Monitor | Check host storage performance with MacStadium; redistribute VMs |
| App requires more resources | Large app (Xcode, video editing) on small VM | Create high-spec golden image variant; deploy separate VM group for power users |
| Antivirus scanning on launch | AV process active during app startup | Exclude app folders from real-time scanning; configure AV exceptions |
| App not optimized for virtualization | Native app expects physical hardware resources | Use published applications instead of full desktop; optimize app settings |
| First launch initialization | App creating caches/configs on first use | Subsequent launches will be faster; pre-configure apps in golden image |
| Network dependency | App verifying license or downloading data on launch | Ensure good network connectivity; pre-cache data if possible |
Most common fix: Applications installed on network shares. Pre-install in golden image for local execution.

Symptom: High CPU Usage Even When Idle

What you see: VM consuming 50%+ CPU with no user activity Quick diagnostic:
  1. Identify the process consuming CPU: VNC into the VM, then open Activity Monitor → sort by %CPU
  2. Check for runaway processes
     - Look for: Spotlight indexing (mds), kernel_task, unexpected processes
  3. Check for malware (unlikely but possible)
     - Run a security scan if anything looks suspicious
  4. Monitor over time: is the CPU spike temporary or sustained?
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Spotlight indexing | mds or mdworker processes using high CPU | Wait 30-60 minutes for completion; configure indexing exclusions in golden image |
| Background macOS updates | softwareupdated or related processes active | Allow updates to complete; schedule updates during maintenance windows |
| Runaway application process | Specific app/process stuck consuming CPU continuously | Kill process via Activity Monitor; investigate app issue; report bug |
| Malware or cryptominer | Unknown suspicious process using CPU | Run malware scan; rebuild VM from clean golden image if infected |
| System maintenance tasks | Normal macOS background maintenance (periodic) | Wait for completion (typically 30-60 min); occurs daily at specific times |
| GPU acceleration disabled | Software rendering using CPU instead of GPU | Enable GPU passthrough if available (M4 hosts); verify GPU settings in VM |
| Memory pressure causing swapping | High swap activity consuming CPU | Increase VM memory allocation in golden image; reduce memory-intensive apps |
| Browser with many tabs/extensions | Browser process consuming CPU | Close unnecessary tabs; disable resource-heavy extensions; restart browser |
Most common fix: Spotlight indexing or macOS maintenance tasks. Usually resolves itself within an hour.
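For a quick look without opening VNC, the same information Activity Monitor shows can be pulled from the command line. A sketch — wrap the pipeline in `ssh admin@<host-ip> "..."` to run it against a host, per the SSH guidelines above:

```shell
# Show the top 5 CPU consumers, sorted by %CPU (the third column of ps aux).
ps aux | sort -rnk 3 | head -5
```

If `mds` or `mdworker` tops the list, it is almost certainly Spotlight indexing and will subside on its own.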

Authentication and Access Control

Symptom: User Can’t Log Into Desktop (Credentials Rejected)

What you see: The user enters their credentials, but gets an “Invalid username or password” error Quick diagnostic:
  1. Verify the user exists in your identity provider by checking Active Directory or Azure AD
  2. Test login with a known-good account
     - Use an admin account or test account to attempt logging in
  3. Check whether the issue is specific to one VM or impacts all VMs
     - Try launching a different desktop in the pool
  4. Check VDA domain binding (if using AD)
     - VNC into the VM: System Preferences → Citrix VDA → check domain binding status
| Cause | How to verify | Solution |
| --- | --- | --- |
| User account disabled in AD/IdP | Check Active Directory or identity provider status | Re-enable user account; verify account is active |
| Password expired | User confirms password expired or needs change | Have user reset password through normal corporate password reset process |
| VM not bound to domain | VDA shows “Not bound” or incorrect domain | Rebuild golden image with proper domain binding; verify domain credentials |
| Time sync issue | VM time differs by >5 minutes from domain controller | Configure NTP in golden image; manually sync time; verify host time correct |
| Domain controller unreachable | VM can’t ping or connect to DC | Check network connectivity; verify DNS resolution for domain; check firewall |
| Cached credentials expired | Works for some users but not others | Clear Keychain cached credentials; force fresh authentication |
| Wrong identity provider | VDA bound to wrong domain or tenant | Reconfigure VDA with correct domain/tenant in golden image; redeploy VMs |
Most common fix: The user’s password is expired. Have the user reset their password through your normal corporate process.
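The time-sync row above can be checked numerically: Kerberos typically tolerates up to 5 minutes of clock skew. A sketch — `REF_EPOCH` is a placeholder for the domain controller's clock, set here to the local clock purely so the example runs anywhere:

```shell
# Compare the VM clock against a reference epoch time (in seconds).
VM_EPOCH=$(date -u +%s)
REF_EPOCH=$VM_EPOCH   # placeholder: substitute the DC's epoch time here
SKEW=$(( VM_EPOCH - REF_EPOCH ))
# ${SKEW#-} strips a leading minus sign, giving the absolute skew.
if [ "${SKEW#-}" -lt 300 ]; then
  echo "within Kerberos tolerance"
else
  echo "resync NTP"
fi
```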

Symptom: User Can Log In But Has the Wrong Permissions

What you see: User is authenticated, but they can’t access files/apps they should have access to Quick diagnostic:
  1. Check the user’s group memberships
     - VNC into the VM
     - In the Terminal, enter: `groups <username>` or `dscl . -read /Users/<username> GroupMembership`
  2. Verify user permissions on restricted resources
     - Check file/folder permissions
  3. Test with a known-good user from the same group
     - Do other users have the correct group access?
  4. Check GPO application (if using AD)
     - In the Terminal, enter: `sudo gpupdate --user`
| Cause | How to verify | Solution |
| --- | --- | --- |
| User not in required AD groups | groups command doesn’t show expected group | Add user to appropriate Active Directory security groups |
| GPO not applied correctly | gpupdate shows no policies or errors | Force GPO refresh with sudo gpupdate --force --user; verify DC connectivity |
| Local file permissions incorrect | File ACLs don’t include user or group | Fix file/folder permissions; verify inheritance settings |
| Profile not loaded correctly | User profile appears incomplete or corrupted | Delete local profile cache; force fresh profile download on next login |
| Network share mapping failed | Expected drives not appearing | Verify network connectivity; manually map shares to test; check GPO mappings |
| Cached credentials out of sync | Using old cached authentication | Clear macOS Keychain; force re-authentication with current credentials |
| Group Policy precedence issue | Conflicting GPOs applied in wrong order | Review GPO precedence; adjust GPO link order; use block inheritance carefully |
| Domain trust relationship issue | Cross-domain permissions not working | Verify domain trusts are functional; contact domain administrators |
Most common fix: The user is not in the required AD group(s). Add them to the appropriate group(s), and then force a GPO refresh.
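The group check in step 1 can be scripted so it gives a clear yes/no. A sketch — here `REQUIRED_GROUP` is derived from the current user purely so the example runs anywhere; in practice set it to the AD security group you expect:

```shell
# Report whether a user holds a required group before chasing file permissions.
USER_NAME=$(whoami)
REQUIRED_GROUP=$(id -Gn "$USER_NAME" | awk '{print $1}')  # placeholder group
if id -Gn "$USER_NAME" | tr ' ' '\n' | grep -qx "$REQUIRED_GROUP"; then
  echo "in group"
else
  echo "missing group"
fi
```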

Symptom: Single Sign-On (SSO) Not Working

What you see: Users are prompted for their SSO credentials despite being logged into iCloud/corporate network Quick diagnostic:
  1. Check the Citrix Workspace SSO configuration
     - Ask the user to open Citrix Workspace → Preferences and verify their SSO settings
  2. Verify the Citrix Gateway/SSO configuration is correct
     - If using Citrix Gateway: check pass-through authentication settings
  3. Test SSO login with manual credentials
     - Does SSO login work if the user enters their credentials manually?
  4. Check the user’s SSO/company domain login
     - Are they logged in with the correct SSO/company domain account?
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| SSO not enabled in Citrix Workspace | Workspace preferences show SSO disabled | Enable SSO in Citrix Workspace settings under ‘Account preferences’ |
| Gateway pass-through not configured | Citrix Gateway shows no SSO/pass-through config | Configure pass-through authentication on Citrix Gateway; enable domain pass-through |
| User on non-domain computer | Computer not joined to corporate domain | Join computer to domain or use manual credential entry |
| Certificate authentication issue | SSO uses cert auth; certificate is invalid/expired | Renew user certificate; reinstall certificate; verify cert trust chain |
| Wrong authentication method | SAML/OAuth/SSO login configured incorrectly | Verify auth method matches identity provider; check Citrix Cloud auth settings |
| Browser security settings | Browser blocking credential passing | Adjust browser security settings; add Citrix URLs to trusted sites |
| VPN interfering with SSO | VPN tunnel disrupting authentication flow | Configure split-tunnel VPN; ensure SSO endpoints reachable |
Most common fix: SSO is not enabled in Citrix Workspace. Enable SSO in the user’s Workspace preferences.

Symptom: Can’t Access Citrix Cloud Admin Console

What you see: Admins can’t log into Citrix Cloud Console to manage environment(s) Quick diagnostic:
  1. Try a different browser
     - Some browsers cache authentication differently
  2. Clear browser cookies and cache, then try logging in again
  3. Verify your admin account is not locked; check with Citrix support or another admin
  4. Check Citrix Cloud status: https://status.cloud.com
| Cause | How to verify | Solution |
| --- | --- | --- |
| Browser cache/cookies issue | Login works in incognito/private mode | Clear browser cookies and cache; restart browser; try again |
| MFA/2FA device failure | Error occurs during two-factor authentication step | Re-register MFA device in account settings; use backup codes if available |
| Account locked after failed attempts | Multiple failed login attempts triggered lock | Contact Citrix support to unlock; wait for auto-unlock period (usually 30 min) |
| Citrix Cloud service outage | Status page shows service issues | Check status.cloud.com; wait for Citrix to resolve; monitor status updates |
| Network blocking Citrix Cloud | Can’t reach cloud.com domains | Check firewall/proxy; verify outbound HTTPS allowed; try different network |
| Browser version incompatible | Using old/unsupported browser version | Update to current Chrome, Firefox, Edge, or Safari version |
| Admin permissions revoked | Account no longer has admin role | Contact Citrix Cloud organization admin; verify role assignments |
| Session timeout | Logged out due to inactivity | Log back in; adjust session timeout settings if available |
Most common fix: Browser cache issue. Clear your cookies and cache, or try using incognito/private mode.

Ansible Playbook Errors

Symptom: Playbook Fails with “Host unreachable”

What you see: Playbook errors: “Failed to connect to the host via ssh” / “Host is unreachable” Quick diagnostic:
  1. Test basic connectivity: ping <host-ip>
  2. Test SSH manually: ssh admin@<host-ip>
  3. Check the inventory file: cat dev/inventory
     - Verify host IP addresses are correct
  4. Test the Ansible ping module: ansible hosts -i dev/inventory -m ping
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Host powered off or unreachable | Ping test fails completely | Power on host via MacStadium portal; contact MacStadium support |
| Wrong IP address in inventory | IP doesn’t match actual host address | Update dev/inventory file with correct host IP addresses |
| SSH service not running on host | Ping works but SSH connection refused/timeout | Restart SSH service on host; contact MacStadium support |
| Firewall blocking SSH from control node | SSH works from some locations but not control node | Check firewall rules; allow SSH (port 22) from Ansible control node IP |
| SSH key not in authorized_keys | SSH prompts for password instead of using key | Add Ansible control node’s public SSH key to host’s ~/.ssh/authorized_keys |
| Wrong username configured | Using incorrect ansible_user value | Verify ansible_user=admin (or correct user) in inventory [all:vars] |
| Network routing issue | Can’t reach host network from control node | Verify routing; check if VPN required; test from different network location |
| Host SSH configuration changed | SSH settings preventing key-based auth | Verify host SSH config allows public key authentication |
Most common fix: SSH key is not in authorized_keys on host. Add the Ansible control node’s public key to the host.
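If the control node's key turns out to be missing, a one-time setup might look like this. The key path and filename are assumptions; adjust to your environment:

```shell
# Generate a dedicated key for the Ansible control node if one doesn't exist.
KEY=/tmp/orka_ansible_key
[ -f "$KEY" ] || ssh-keygen -t ed25519 -N "" -f "$KEY" -q
cat "$KEY.pub"
# Copy it to each host while password auth still works, then re-test:
#   ssh-copy-id -i "$KEY.pub" admin@<host-ip>
#   ansible hosts -i dev/inventory -m ping
```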

Symptom: Playbook Fails with “Permission denied”

What you see: Playbook errors with permission/sudo errors during execution Quick diagnostic:
  1. Test sudo access manually:
     ssh admin@<host-ip>
     sudo ls /var/orka
  2. Check the Ansible inventory sudo settings: cat dev/inventory | grep ansible_become
  3. Run the playbook with verbose output: ansible-playbook -i dev/inventory <playbook> -vvv
  4. Check which specific task is failing by looking at where the playbook stops
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| ansible_become is not set | Inventory missing ansible_become=yes | Add ansible_become=yes to [all:vars] section in inventory |
| User lacks sudo permissions | Manual sudo command prompts for password | Add Ansible user to sudoers; configure passwordless sudo for admin user |
| Sudo password is required | Playbook needs become_password but not provided | Add -K flag when running playbook to prompt for sudo password |
| File permissions are too restrictive | Specific files/dirs not readable/writable | Fix file permissions on host; verify ownership is correct |
| SELinux/security policy blocking | macOS security policies preventing operation | Adjust security settings; may need to disable SIP temporarily for some operations |
| Wrong sudo path or configuration | Sudo command not found or misconfigured | Verify sudo is installed and in PATH; check /etc/sudoers configuration |
| Ansible connection user mismatch | Connecting as one user, trying to become another | Verify ansible_user matches expected user account on hosts |
Most common fix: ansible_become=yes not set in inventory. Add to [all:vars] section.
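For reference, a minimal inventory fragment with the fix applied — the host IPs and username here are placeholders, not values from your environment:

```ini
[hosts]
10.0.0.10
10.0.0.11

[all:vars]
ansible_user=admin
ansible_become=yes
```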

Symptom: Playbook Times Out

What you see: Your Ansible playbook runs, but it times out on specific tasks without completing Quick diagnostic:
  1. Run with verbose output to see where the playbook hangs: ansible-playbook -i dev/inventory <playbook> -vvv
  2. Test the specific command manually by SSHing to the host and running the command that’s timing out
  3. Check whether the task simply needs a long time to complete
     - Image pulls and VM deployments can take more time than initially anticipated
  4. Monitor host resources during the task:
     ssh admin@<host-ip>
     top
Likely causes and solutions:
| Cause | How to verify | Solution |
| --- | --- | --- |
| Task legitimately takes a long time | Image pull or VM deployment in progress | Be patient; increase the task timeout in your playbook if needed; monitor progress with -vv |
| Host is overloaded and responding slowly | High CPU/memory usage on host during task | Reduce load on host; stop some VMs; retry during low-usage period |
| Network timeout during download | Downloading a large image from a slow source | Improve network path to registry; use closer registry; retry during off-hours |
| Task hanging indefinitely | No progress visible for an extended period | Cancel with Ctrl+C; SSH to host to debug; check for stuck processes |
| Insufficient async timeout | Default timeout is too short for the operation | Increase async timeout parameter in playbook task definition |
| Host became unresponsive | Host not responding to any commands | SSH to host to check status; may need host reboot; contact MacStadium |
| Deadlock or resource contention | Task waiting for resource held by another process | Identify and kill blocking processes; restart Orka Engine service |
| Network connection is unstable | Intermittent connectivity during long operations | Improve network stability; use a more reliable connection; retry the operation |
Most common fix: Legitimate long-running task (such as an image pull). Increase the async timeout or be patient.
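One way to raise the timeout for a known long-running task is Ansible's `async`/`poll` mechanism. A sketch — the wrapper script name is hypothetical; substitute your actual pull or deploy task:

```yaml
# Give a long-running task room to finish instead of hitting the default timeout.
- name: Pull a large image (can take a long time)
  ansible.builtin.command: /usr/local/bin/pull-image.sh   # hypothetical wrapper
  async: 3600   # allow up to one hour before Ansible gives up
  poll: 30      # check on the task every 30 seconds
```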

Symptom: Playbook Variables Not Being Applied

What you see: Playbook runs but doesn’t use the variables you specified with -e Quick diagnostic:
  1. Check the command syntax
     - Verify `-e` flags are formatted correctly
  2. Run with verbose output:
     ansible-playbook -i dev/inventory <playbook> \
       -e "var1=value1" \
       -e "var2=value2" \
       -vv
  3. Check the playbook’s variable names and ensure they match exactly (variable names are case-sensitive)
  4. Check for hard-coded values in the playbook; these can override the variables you pass in
Likely causes and solutions:
| Cause | How to identify | Solution |
| --- | --- | --- |
| Variable name typo or case mismatch | Names don’t match exactly (case-sensitive) | Use the exact variable name from your playbook’s documentation; check case |
| Variable already set with precedence | Playbook has default; your var has lower precedence | Extra vars (-e) should override; verify syntax is correct |
| Wrong variable data type | Passing string where an integer expected or vice versa | Check playbook documentation for expected data type; convert if needed |
| Variable not used in playbook | Playbook doesn’t reference that variable | Verify playbook supports variable; check playbook source code or docs |
| Syntax error in -e flag | Command line parsing failed silently | Use proper quotes: -e "vm_group=test" not -e vm_group=test |
| Multiple -e flags parsed incorrectly | Only first -e being applied | Ensure each -e flag is separate and properly formatted |
| Variable scope issue | Variable defined in wrong group_vars location | Check variable is in correct inventory group or all group |
| Special characters not escaped | Variable value contains spaces or special chars | Quote values properly: -e "vm_name=test vm" needs quotes |
Most common fix: Variable name typo. Check playbook documentation for exact variable names (these are case-sensitive).
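The quoting pitfall in the table is pure shell behavior and can be seen without running Ansible at all — the shell splits unquoted values on whitespace before `ansible-playbook` ever sees them:

```shell
# Unquoted value containing a space: the shell passes TWO arguments.
set -- vm_name=test vm
echo "$# args"
# Quoted value: ONE argument, as intended for -e.
set -- "vm_name=test vm"
echo "$# args"
```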

Symptom: Playbook Fails Partway Through

What you see: The playbook starts successfully, but fails on a specific task Quick diagnostic:
  1. Run the playbook with verbose output to see the exact error: ansible-playbook -i dev/inventory <playbook> -vvv
  2. Check the specific task that failed, reviewing the error logs carefully
  3. Test the failing task’s command manually by SSHing into the host and running the command
  4. Check if the task is stuck in a partially successful state and needs cleanup, or needs to be re-run
| Cause | How to verify | Solution |
| --- | --- | --- |
| Resource exhaustion mid-task | Host ran out of disk space or memory during operation | Free resources on the host; delete unused VMs; retry playbook from the beginning |
| Network interruption | Connection to host lost during task execution | Verify network stability; check for network issues; rerun playbook |
| Task dependency not met | Previous task didn’t fully complete before next started | Review task dependencies; add explicit wait/pause between tasks if needed |
| Invalid parameter value | Task received bad input causing failure | Verify all parameter values are valid; check for typos in variables |
| Race condition | Task timing-sensitive; failed due to timing issue | Add explicit pause or wait_for tasks between dependent operations |
| External service unavailable | Registry, DNS, or API temporarily unavailable | Check external service status; retry when service available; implement retries |
| Disk write failure | File system full or read-only during write | Check disk space with df -h; verify filesystem not read-only |
| Concurrent playbook execution | Another playbook is modifying the same resources | Ensure only one playbook runs at a time; implement locking if needed |
Most common fix: Network or resource interruption. Verify connectivity and available resources, then re-run the playbook.
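For transient registry or network failures, Ansible's `retries`/`until` keywords can make a flaky step retry itself instead of failing the whole run. A sketch — the wrapper script is hypothetical; substitute the task that fails intermittently:

```yaml
# Retry a flaky step up to 3 times before declaring the run failed.
- name: Pull image with automatic retries
  ansible.builtin.command: /usr/local/bin/pull-image.sh   # hypothetical wrapper
  register: pull
  retries: 3           # re-run up to 3 times
  delay: 30            # wait 30 seconds between attempts
  until: pull.rc == 0  # stop retrying once the command succeeds
```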

Symptom: Playbook Says “Changed” But Nothing Actually Changed

What you see: Playbook reports changes, but its state appears identical Quick diagnostic:
  1. Check what the playbook claims to change by watching the task output while it runs
  2. Verify the actual state on the Orka for VDI host: ssh admin@<host-ip>, then check whether the claimed changes actually exist
  3. Run the playbook in check mode: ansible-playbook -i dev/inventory <playbook> --check
  4. Check for idempotency issues by running the playbook twice; the second run should report “ok”, not “changed”
| Cause | How to verify | Solution |
| --- | --- | --- |
| Playbook not idempotent | Playbook status always reports “changed” | Fix playbook to properly check state |
| Task reporting incorrectly | Code bug in the playbook | Review/fix task logic |
| Cached state outdated | Playbook is using old state info | Force refresh of facts |
| External state changed | Something else modified playbook state | Determine what else is changing state |
| Task has side effects | Change occurs but not where expected | Review full task behavior |
Most common fix: The playbook is not properly checking the existing state before making changes (idempotency issue).
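A sketch of two idempotency patterns that address this — the paths and script name are hypothetical:

```yaml
# 1) Prefer state-declaring modules: they check current state first and
#    report "changed" only when work was actually done.
- name: Ensure marker directory exists
  ansible.builtin.file:
    path: /var/orka/markers   # hypothetical path
    state: directory

# 2) For raw commands, tell Ansible what "changed" actually means instead
#    of letting every run report a change.
- name: Run setup script
  ansible.builtin.command: /usr/local/bin/setup.sh   # hypothetical script
  register: result
  changed_when: "'updated' in result.stdout"
```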

Escalation Quick Reference

When to escalate:
| Issue pattern | Escalate to | Contact | SLA |
| --- | --- | --- | --- |
| Single user problem | Handle yourself | N/A | Immediate |
| 5-10 users affected | Infrastructure team lead | Internal | 30+ minutes |
| 10+ users affected | Infrastructure manager | Internal | Immediate |
| Host hardware failure | MacStadium support | support@macstadium.com | 1 business day |
| Orka Engine issues | MacStadium support | support@macstadium.com | 1 business day |
| Network infrastructure | Network team | Internal | Varies |
| Citrix Cloud outage | Citrix support | support@citrix.com | Varies by support tier |
| VDA failures (widespread) | Citrix support | support@citrix.com | Varies by support tier |
| Storage/registry down | Storage team | Internal | Varies |