Recently, I was supporting a customer with their Windows 10 rollout. Understandably, they wanted to use PXE to build the devices, which worked perfectly well until they upgraded to CB 1910 along with ADK 1903. After the upgrade, PXE suddenly stopped working. I have done countless upgrades to date, so it made no sense that such a routine upgrade would break PXE. This is where my quest starts. The write-up from here on may appear long, but bear with me as I try to capture the entire troubleshooting process.
PXE troubleshooting can become tricky if the setup is not configured correctly. As the first step, I checked the usual areas of the PXE configuration, starting with the SMSPXE.log on the Distribution Point hosting the PXE role. I could clearly see that the client was sending the PXE request and that the PXE server was even responding with the policy and the NBP file. However, the client kept sending requests from the default IP 0.0.0.0, indicating that the PXE client was not receiving a response from the DHCP server and so never accepted an IP address. That clearly appeared to be an issue at the network layer, and on checking with the customer they revealed that they were using DHCP scope options instead of IP helpers. Naturally, at this point one would advise using IP helpers, but the customer had been using DHCP scope options for years and there was no reason why things would suddenly stop working. However, some modifications had already been made to the scopes, so it was decided to configure the IP helpers after all.
After this, as expected, things moved along a bit and at least the PXE client was able to download the NBP file. The machine started sending requests from a proper IP address after it got a response from the DHCP server. A step in the right direction.
However, the process stopped again at the stage where the PXE client should have downloaded Bootmgfw.efi. The PXE server responded in a timely manner each time the PXE client made the request, yet the PXE client kept showing the same message for hours without timing out.
This seemed unusual, as I could clearly see in the SMSPXE.log that the PXE server was responding with the boot image after receiving a response from the ConfigMgr server on the client lookup action. There was a policy available for the device.
I captured network traffic and could see that the capture matched the entries in SMSPXE.log. Again, no errors at this point. Not ruling out an issue with the PXE configuration itself, I went ahead and re-added the role in ConfigMgr. Unfortunately, this made no difference and the PXE behavior remained the same. Suspecting an issue with the boot image, I tried different versions of the ADK, but even that didn’t help. By now I had also tested with the PXE provider service to see if the issue was with WDS in some way, but that didn’t help either. This is where PXE troubleshooting can become really tricky. If there are no errors, it becomes difficult to pinpoint the issue. The PXE client will simply not download Bootmgfw.efi, and nothing reports an error.
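When reading a capture like the one described above, it can help to label each UDP packet with the PXE stage it most likely belongs to. The following is a minimal Python sketch under stated assumptions: the function name `pxe_stage` and the stage labels are hypothetical, and the only facts relied on are the standard port numbers (67 and 68 for DHCP, 4011 for the PXE boot service, 69 for the initial TFTP request).

```python
# Well-known UDP ports involved in a PXE boot (standard assignments).
DHCP_SERVER_PORT = 67
DHCP_CLIENT_PORT = 68
TFTP_PORT = 69
PXE_BOOT_SERVICE_PORT = 4011


def pxe_stage(src_port: int, dst_port: int) -> str:
    """Return a rough, hypothetical label for a UDP packet in a PXE capture."""
    ports = {src_port, dst_port}
    # Check the boot service first: clients often send from port 68 to 4011.
    if PXE_BOOT_SERVICE_PORT in ports:
        return "pxe-boot-service"
    if DHCP_SERVER_PORT in ports or DHCP_CLIENT_PORT in ports:
        return "dhcp"
    # Only the first TFTP request goes to port 69.
    if dst_port == TFTP_PORT:
        return "tftp-request"
    return "other"


if __name__ == "__main__":
    for src, dst in [(68, 67), (67, 68), (68, 4011), (2070, 69), (50000, 50001)]:
        print(src, dst, pxe_stage(src, dst))
```

Note that TFTP moves to ephemeral ports after the initial request, so the file transfer itself falls under "other" in this sketch and would need to be followed by conversation rather than by port.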
The cause turned out to lie outside ConfigMgr altogether, and it only seems to affect virtual servers running the PXE point. In this case, the PXE point was running on a Server 2012 R2 VM that was governed by the ‘Server-block’ setting found under NSX-T\Networking\Segments\Segment Profiles.
This is part of segment security in VMware NSX-T, which provides stateless Layer 2 and Layer 3 security by checking the traffic to the segment and dropping unauthorized packets sent from VMs, matching the IP address, MAC address, and protocols against a set of allowed addresses and protocols. One can use segment security to protect segment integrity by filtering out malicious attacks from the VMs in the network. One can configure the Bridge Protocol Data Unit (BPDU) filter, DHCP snooping, DHCP server block, and rate limiting options to customize the security on a segment profile.
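To make the effect of a DHCP server block concrete, here is a minimal Python sketch of a stateless filter decision. It illustrates the general idea only and is not NSX-T's actual implementation: the `Packet` type and the `dhcp_server_block_allows` function are hypothetical, and the rule assumed here is simply that UDP traffic leaving a VM from the DHCP server port (67) towards the DHCP client port (68) is dropped.

```python
from dataclasses import dataclass

DHCP_SERVER_PORT = 67  # standard UDP port a DHCP server sends from
DHCP_CLIENT_PORT = 68  # standard UDP port a DHCP client listens on


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of a packet sent by a VM."""
    protocol: str
    src_port: int
    dst_port: int


def dhcp_server_block_allows(packet: Packet) -> bool:
    """Return False for packets that look like DHCP server replies to a client.

    Stateless: each packet is judged on its own fields, with no memory of
    earlier traffic. Assumed rule for illustration only.
    """
    is_server_reply = (
        packet.protocol == "udp"
        and packet.src_port == DHCP_SERVER_PORT
        and packet.dst_port == DHCP_CLIENT_PORT
    )
    return not is_server_reply


if __name__ == "__main__":
    print(dhcp_server_block_allows(Packet("udp", 67, 68)))  # server-style reply
    print(dhcp_server_block_allows(Packet("udp", 68, 67)))  # client-style request
```

A PXE-enabled server answers clients in the same way a DHCP server does, so a rule of this kind can drop its replies on the virtual network while the server's own logs still show a response being sent. That would be consistent with the absence of errors described here.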
This is similar to the DHCP Guard and Router Advertisement Guard settings applied to virtual machine network adapters in Hyper-V.
Once the NSX-T profile was disabled, the PXE process started to work immediately. When I checked further with the Wintel team, they confirmed that they had configured this segment security profile for all the VMs around the same time that ConfigMgr was upgraded to CB 1910. That would explain why things stopped working suddenly after the upgrade.
Since there is no official documentation released by Microsoft on this, I decided to blog my experience. Hope this helps someone and saves them time and effort.
Until next time...
Reference:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/index.html