Recently I was migrating a SharePoint 2016 farm to SharePoint 2019. Not that exciting, because the customer did not have any farm solutions or customizations other than a custom claims provider for ADFS. The SharePoint 2019 farm needed to be configured to use Azure Active Directory, and we used AzureCP for this. All went as planned, but then Search threw an exception…
Search Crawl issue
The migration plan was basically as follows: install the SharePoint 2019 farm, back up and restore all content databases and the databases for User Profile, Search and Managed Metadata, recreate the service applications and web applications, mount the databases, run a script to migrate the user accounts from ADFS to Azure AD claims, and finish with a new Full Crawl for Search. I have done it dozens of times, not really exciting.
Right after mounting the content databases (and just before executing the user migration script) I tested some search crawls. All successful, all green lights to proceed. But when I started the search crawls again after running the user migration script, Search threw an exception:
Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found.
Wait? What?!
In the next couple of weeks I did a lot of troubleshooting, reading all the (ULS) logs, and even contacted the author of AzureCP, but no luck. And the weirdest part is, sometimes it did work! The customer could test the new environment, and after a week of successful testing we decided to go for real. It became a very long night and I left the customer at 4 AM. Murphy was there again: no search crawls. We had to roll back to SharePoint 2016. Again.
The needle in the haystack
In the days after, we opened a support ticket, and together with the escalation engineer we looked in different areas than the ones the exception message was pointing to. We used Fiddler on the Search servers to see what was happening. And there it was. Something caught our attention… a request for the file robots.txt. Strange. We did not have a robots.txt file. Apparently, Search thought there was one. And Search thought it was not allowed to access the URLs. Gotcha!
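You can do a similar check yourself without Fiddler, by requesting robots.txt from the web application the way any HTTP client would. Below is a small Python sketch; the URL is a made-up example, not the customer's environment:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Hypothetical web application URL, replace with your own
url = "https://intranet.contoso.com/robots.txt"

try:
    with urlopen(Request(url)) as response:
        # Print the status code and the robots.txt body the crawler would see
        print(response.status)
        print(response.read().decode("utf-8", errors="replace"))
except HTTPError as err:
    # A 404 here would simply mean there is no robots.txt to honor
    print(err.code, err.reason)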
The contents of the spooky file that Search was honoring looked like this:
User-agent: *
Disallow: /
This caused the (very misleading) crawl error.
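To show why, here is a quick check with Python's standard robots.txt parser. The user agent and URL are only illustrative; the point is that a wildcard Disallow: / blocks every crawler, including SharePoint Search:

from urllib.robotparser import RobotFileParser

# The contents Search was honoring
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /"])

# Nothing may be fetched, regardless of the user agent
print(rules.can_fetch("MS Search 6.0 Robot", "https://intranet.contoso.com/Pages/Home.aspx"))  # False
print(rules.can_fetch("*", "https://intranet.contoso.com/"))  # False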
We then explicitly created a robots.txt file in the IIS virtual directory of each web application, with the following content:
User-agent: MS Search 6.0 Robot
Disallow:
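The same quick check confirms that the new contents let the crawler through. Again, the URL is only an example, and I am assuming the crawler identifies itself with the user agent the file targets:

from urllib.robotparser import RobotFileParser

# The contents of the new robots.txt
rules = RobotFileParser()
rules.parse(["User-agent: MS Search 6.0 Robot", "Disallow:"])

# An empty Disallow means the crawler may fetch everything
print(rules.can_fetch("MS Search 6.0 Robot", "https://intranet.contoso.com/Pages/Home.aspx"))  # True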
More information:
https://www.techmikael.com/2014/12/the-right-robotstxt-settings-for.html
Wow. Just wow. Case closed. Happy customer.
The only question I am still struggling with: why does Search think there is a robots.txt file, while there isn't one? Where does it come from?