So, there are still some unknowns at this point. I haven't gotten past the "initial backup" stage yet, so I don't know what I'll encounter when it hits the real-time monitoring stage. I also don't know whether my periodic rescan implementation is correct. But these things are mostly tested and observed working.
But there is one big unknown that is a problem right now. The Backup Agent takes ZFS snapshots as part of inspecting files. It's supposed to release them when it's done.
It doesn't seem to be releasing them.
I'm not sure why. This is a bug and this is a problem. I'm going to have to figure out what's going on with the ZFS mounts and the reference counting, and it's a hard thing to debug. I can't even run the code in a debugger right now, and even if I could, there are a lot of moving pieces in different places.
My system currently has 292 ZFS snapshots. They're quick and easy to clean up, except that the backup agent is running right now, and a few of those snapshots are actually in use. Don't want to pull the rug out from under it. :-)
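The core idea here is reference counting: each file being inspected under a snapshot holds a reference, and the snapshot can only be destroyed once the last reference is released. Here's a minimal sketch of how that might look; the class and method names are hypothetical (this is not the Backup Agent's actual code), and the `zfs destroy` call is injectable so it can be stubbed out:

```python
import subprocess

class SnapshotTracker:
    """Illustrative sketch of reference-counted ZFS snapshot tracking."""

    def __init__(self, snapshot, run=subprocess.run):
        self.snapshot = snapshot  # e.g. "rpool/ROOT/ubuntu_znaqup@RTB-..."
        self.refcount = 0
        self._run = run           # injectable so tests don't touch real ZFS

    def acquire(self, path):
        # One reference per file still being processed under this snapshot.
        self.refcount += 1

    def release(self, path):
        self.refcount -= 1
        if self.refcount == 0:
            # Last file processed: the snapshot is no longer needed.
            self._run(["zfs", "destroy", self.snapshot], check=True)
```

If any code path forgets to call `release()`, the count never reaches zero and the snapshot leaks, which is exactly the symptom above.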
UPDATE 5:46 PM:
I've added ZFS-specific debugging, but in the course of adding it, a thought occurred to me: What if it's not buggy at all, but it simply hasn't gotten to the point of releasing any of them because the initial backup queue is so large? I think this may be the case. Time to just let it run and see what happens, I guess.
After adding the debugging, I fired it back up, and it promptly created a snapshot that won't be released until 162,809 files are processed! Well, down to 162,712 now.
Meanwhile a second snapshot is tracking a measly 100 files, and a third snapshot is already tracking 145,168 files.
UPDATE 5:49 PM:
It's chewing through the files. Down to 152,704 now on [1].
UPDATE 5:50 PM:
Hmm, there may be a problem yet. It was processing files quickly because the remote file state cache said they were already uploaded. It has now hit files that were not already uploaded. Uploads are completing, but the reference count isn't dropping...
UPDATE 5:53 PM:
Ohh, these are all small files. Which means that they have their snapshot reference released before they enter the upload queue (they get staged to /tmp). Everything in the upload queue is already released. Until the queue hits the low water mark, it won't be pumping any more entries from the backup queue into it, which means it won't be releasing any more snapshot references. So, working as designed? 🙂 Fingers crossed.
UPDATE 5:56 PM:
The upload queue has to drain from 10,000 entries down to 5,000 before the agent will put more into it. It's at about 9,200 right now. Just gotta wait and see.
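The refill behavior described above is a classic high/low water mark scheme. A sketch of the idea, using the 10,000/5,000 figures from the post (the function and parameter names are made up for illustration):

```python
from collections import deque

HIGH_WATER = 10_000  # fill the upload queue up to this many entries
LOW_WATER = 5_000    # don't refill until it drains down to this

def pump(backup_queue, upload_queue, release_ref):
    """Hypothetical sketch: entries move from the backup queue into the
    upload queue only once the upload queue drains to the low water mark.
    Small files are staged (e.g. to /tmp) and their snapshot reference is
    released *before* they enter the upload queue, so nothing already in
    the upload queue holds a reference."""
    if len(upload_queue) > LOW_WATER:
        return  # still above the low water mark; nothing to do yet
    while backup_queue and len(upload_queue) < HIGH_WATER:
        entry = backup_queue.popleft()
        release_ref(entry)  # stage the file and drop its snapshot ref
        upload_queue.append(entry)
```

This explains the stall: with the upload queue sitting at ~9,200, no new entries get pumped, so no new references get released.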
UPDATE 5:59 PM:
Hmm, perhaps trouble after all. The reference count is 152,704, but the backup queue only has 71,419 entries in it. Even after they're all processed, the reference count isn't going to reach zero...
UPDATE 6:45 PM:
Welp, it woke up, dropped another 5,000 files into the upload queue, released about as many snapshot references, but it still has more references than there are files in the queue. Hmm...
UPDATE 7:06 PM:
I think I found the hole. 🙂 There's a bit of code that checks, "Hey, is this exact file already uploaded (according to the cache)? If so, we don't have to do anything." It was interpreting "don't have to do anything" a bit too liberally: it still has to release the snapshot reference! Let's see if it behaves a bit more as expected with that fixed. 🙂
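In sketch form, the fix amounts to making the early-exit branch release the reference before skipping the upload. All the names here are illustrative, not the actual code:

```python
def process_entry(entry, cache, enqueue_upload, release_ref):
    """Hypothetical sketch of the fixed logic."""
    # If the remote file state cache says this exact file is already
    # uploaded, skip the upload -- but still release the snapshot
    # reference. (The missing release here was the leak.)
    if cache.get(entry["path"]) == entry["hash"]:
        release_ref(entry["path"])
        return "skipped"
    enqueue_upload(entry)
    return "queued"
```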
UPDATE 7:12 PM:
Yey! From the log:
[1] Releasing reference for path: /code/D2.diff
[1] Reference count is now 1
[1] Releasing reference for path: /snap/README
[1] Reference count is now 0
[1] => Disposing of the snapshot
Running: zfs destroy rpool/ROOT/ubuntu_znaqup@RTB-638525379026955749
No longer tracking snapshot, now tracking 8 snapshots
UPDATE 8:08 PM:
Huh, hit long polling for the first time, which revealed a problem there. Also, the main entrypoint code wasn't setting and checking everything it should have to ensure a proper shutdown on SIGINT or SIGTERM.