Hi all,
I've spent the last week trying to get to the bottom of this and am not making much progress. Here is what we have:
2 sites (call them A and B) and 4 DC/DFS servers, two in each site, one virtual and one physical. DC01 (physical) and DC02 (virtual) are in site A; DC03 (virtual) and DC04 (physical) are in site B. The sites are connected by a 100 Mbps WAN link. There are no sites defined or organized in AD; the two locations are just geographically separate.
The original setup was on 2003 servers; a few months ago we upgraded the domain and DFS to 2008 R2. All 4 servers were fresh installs, and the old servers were retired. All servers are patched by WSUS on a regular basis and rebooted in the middle of the night, and we haven't had any issues with that. I've read about DFSR patches and hotfixes; if these are not part of the WSUS updates, then they have not been applied.
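In case the hotfix question matters, I can check which DFSR binary version each DC is actually running with this (run locally on each server; the path assumes a default Windows install):

    # Report the version of the DFSR service binary on this server
    (Get-Item "C:\Windows\System32\dfsrs.exe").VersionInfo.FileVersion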
We have 7 namespaces with 19 folders, and each folder is in its own replication group. All replication is set up as full mesh (except for one folder, as described below), and each folder has only one active referral target.
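If the exact layout is needed, I believe it can be listed from one of the members with something like this (BigFolderRG below is a placeholder group name, not one of our real ones):

    # List all replication groups known from AD
    dfsradmin rg list
    # List the replicated folders in one group (placeholder name)
    dfsradmin rf list /rgname:BigFolderRG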
About a week ago, we discovered that permissions on a couple of critical folders were not as they should be and decided to remedy that. On 3 of the 4 servers they were so badly broken that I couldn't even gain access as a full domain admin, so fixing them meant first replacing ownership (with the domain admins group), which also resets all permissions to that group, before we could set them to what they really need to be. Since this particular folder contains just under 2 TB of data (mostly PDF files, 1 MB to 4 MB each), we decided to replace the permissions after hours. At the time, DC03 was the active referral target for this folder, but (for some reason that escapes me now) I decided to apply the permissions on DC01 and let them replicate to the other servers. So this was done; it took about 90 minutes to apply the permissions, and since we didn't know how long it would take to replicate to DC03, I switched the referral to DC01, which became the only referral target for that folder. We did a quick test and everything seemed OK. The plan was to wait for the changes to replicate and then switch the referral target back to DC03.
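For clarity, the permission replacement on DC01 amounted to something like this, whether done from the GUI or the command line (the path and the second group name below are placeholders, not our real ones):

    # Take ownership of the tree recursively, answering Yes to any prompts
    takeown /F "D:\Shares\BigFolder" /R /D Y
    # Re-apply permissions down the tree (this was the roughly 90 minute step),
    # continuing past any files that error out
    icacls "D:\Shares\BigFolder" /grant "DOMAIN\Domain Admins:(OI)(CI)F" /T /C
    icacls "D:\Shares\BigFolder" /grant "DOMAIN\FileUsers:(OI)(CI)M" /T /C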
In the morning we got calls about users not being able to access some files. After investigating, we found that files saved to DC03 the day before had not replicated to DC01, and they were now inaccessible because they were still only on DC03 while DC01 was the only referral target. We used XCOPY to manually copy the previous day's files; however, during the investigation we also found a handful of files in some subfolders that had not replicated for a couple of months. That was the first time we realized replication might not be working at 100%, and we started digging deeper.
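The manual copy was essentially an XCOPY of this sort (the UNC paths and the date are placeholders; /D limits it to files changed on or after that date):

    # Copy only files modified on/after the given date, including subfolders,
    # keeping attributes and ACLs and continuing on errors
    xcopy "\\DC03\BigFolder" "\\DC01\BigFolder" /D:mm-dd-yyyy /E /C /H /K /O /Y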
At some point over the weekend I rebooted all 4 DCs one by one, without any positive impact. I have also changed the full mesh replication to a chain: DC01 > DC02 > (WAN link) > DC03 > DC04; the topology tested OK. I haven't noticed any improvement. The staging area for this folder is now set to 128 GB, after staging-area-too-small events appeared in the event log. Prior to this there was plenty of disk activity, which has since gone down to only a few MB/s and is easily handled by the server (4 CPUs, 8 GB memory, 4x3TB disks in RAID 5). Since I changed the staging area on Friday we've had only one high watermark error, on the same day. At the moment the logs show occasional sharing violations for different files (a normal usage pattern from what I can tell) and plenty of informational events about files being changed on multiple servers. DFSRS.exe is using around 650 MB of memory with low CPU usage and about 2-3 MB/s of disk traffic.
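If the exact staging figures would help, I can pull what the DFSR service sees on each member with something like this (run in an elevated PowerShell prompt; 131072 MB = 128 GB):

    # Show the staging settings per replicated folder as DFSR sees them
    Get-WmiObject -Namespace "root\MicrosoftDFS" -Class DfsrReplicatedFolderConfig |
        Select-Object ReplicatedFolderName, StagingPath, StagingSizeInMb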
Right now we have some folders (not all) with backlogs to or from DC01, while the other servers are current for the most part, except for the 2 TB folder we replaced permissions on. That folder currently has a backlog of 1.440 million files (presumably the permission changes) from DC02 > DC01, and 1.442 million from DC01 > DC02. Interestingly, dfsrdiag backlog still shows a backlog between DC01 and DC03/DC04 even though they shouldn't be replicating directly under the current topology, and those numbers are a bit higher than the ones above; it's almost as if the backlog isn't draining but standing still. I expected any backlog from DC03 > DC01 to become DC03 > DC02 and DC02 > DC01 under the current topology.
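The backlog figures above come from running dfsrdiag per direction, along these lines (the group and folder names here are placeholders):

    # Backlog from sending member DC02 to receiving member DC01 for one folder
    dfsrdiag backlog /rgname:BigFolderRG /rfname:BigFolder /smem:DC02 /rmem:DC01
    # Same folder, opposite direction
    dfsrdiag backlog /rgname:BigFolderRG /rfname:BigFolder /smem:DC01 /rmem:DC02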
While running these dfsrdiag backlog commands I found some cases where the command would execute but with a warning:
[WARNING] Found 2 <DfsrReplicatedFolderConfig> objects with same ReplicationGroupGuid=878ED61A-A737-4C88-8D16-D65CABE68175 and ReplicatedFolderName=uploads; using first object.
I am not sure whether this is related or whether the problem existed before we did the work a week ago.
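If it helps, the duplicate objects the warning complains about should be visible with this query on the member that reports it (the GUID is the one from the warning above):

    # List the DfsrReplicatedFolderConfig objects for that replication group GUID
    Get-WmiObject -Namespace "root\MicrosoftDFS" -Class DfsrReplicatedFolderConfig |
        Where-Object { $_.ReplicationGroupGuid -eq "878ED61A-A737-4C88-8D16-D65CABE68175" } |
        Select-Object ReplicatedFolderName, ReplicatedFolderGuid, ReplicationGroupGuid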
I have followed instructions to rename the .XML files to .OLD and observed that new XML files were created after the DFSR service was restarted. It doesn't seem to have made any difference.
Please let me know what information I can provide to hopefully resolve this.
Thanks very much