I recently joined a team of 20 people split across traditional software development, support, storage engineering, security engineering, and more. We're essentially supposed to sysadmin a large number of systems while working ~70-hour weeks. The old team we're replacing didn't hold up in this environment, and it seems like my team needs to specialize more, so the storage engineers solve storage problems, the coders do the coding, and so on, so we can bail each other out of this hole faster. Any thoughts on how we can go about this? I'm working ~50-hour weeks max and still perplexed at how these 70-hour weeks happen.
That sounds very, very difficult, and I'm sorry you and your team are in that situation. The old team didn't hold up because what is being asked is impossible. Step one is to protect the team's ability to function long term, and that means working reasonable shifts and resting. If that means, for a short while, having one person from each discipline on call to put out fires while the others rest, with "recovery days" following each on-call stint, do it.
Automate, and simplify the hell out of things. Do things run out of disk space? If more disks can be thrown at it, do that. If not, aggressively and automatically rotate out anything that uses space, and have alarms that escalate as free disk drops lower and lower. Have fix-it scripts to aid in cleanup if a human has to do it, but make that the exception. If the systems you support are unstable over time, look into rotating servers out of the active pool and restarting services however often it takes, until you can figure out what is leaking memory, file handles, or whatever else causes instability as uptime grows.
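To make the "alarms that escalate" idea concrete, here's a minimal sketch in Python. The tier percentages and action labels are made up for illustration; real thresholds and the actual paging/alerting hooks would depend on your monitoring stack.

```python
import shutil

# Hypothetical escalation tiers: (percent used, action to take).
# Checked highest-first so the worst tier crossed wins.
TIERS = [(95, "page on-call"), (90, "post to alert channel"), (80, "warn")]

def escalation(pct_used):
    """Return the action for the highest tier crossed, or None if healthy."""
    for threshold, action in TIERS:
        if pct_used >= threshold:
            return action
    return None

def check_disk(path="/"):
    """Check one mount point and return its escalation action (if any)."""
    usage = shutil.disk_usage(path)
    return escalation(usage.used / usage.total * 100)
```

Run from cron or your monitoring agent, a check like this turns "the disk filled up overnight" into a warning days earlier, which is exactly the kind of fire that should never wake anyone up.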
Ideally I’d try to have 3 designated work types:
It may not mean that people swap every week, but you need to guard people's time so they can focus on fixing the sources of the fires, not just putting them out.
On-call should be keeping a tally every time a particular issue occurs, so that tally can guide prioritization: which repairs to automate or simplify, and which root causes to fix first.
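The tally itself can be as simple as one tag per incident and a frequency count. A sketch, with a hypothetical incident log (the tags and counts here are invented for the example):

```python
from collections import Counter

# Hypothetical on-call log: one tag per incident, however your team
# records them (ticket labels, a shared spreadsheet, a log file...).
incidents = [
    "disk-full", "app-oom", "disk-full", "cert-expiry",
    "disk-full", "app-oom", "disk-full",
]

# most_common() is the tally sorted for prioritization: the top entries
# are the first candidates for automation or a root-cause fix.
for issue, count in Counter(incidents).most_common():
    print(f"{issue}: {count}")
```

Even a list this crude makes the conversation concrete: if "disk-full" is half your pages, that's where the first automation effort goes.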
Thank you so much for the detailed answer. That's what I was thinking too, as I have a wide range of tech experience at the server level and above, but there's no way I could ever do business continuity, storage engineering, or anything involved in actually building out the server definition. I've worked in IBM z/OS and IBM i, so unfortunately it's not as easy as defining a Linux server in a cloud shell, but I can definitely still relate. I'll take this feedback with me and really make sure we start to home in on our specialties as we ease off the gas pedal on pure survival mode.
I worked at IBM across a couple of cloud/SaaS offerings for a little over 6 years, from 2015 to 2021. While things may have changed with the Kyndryl split and Arvind replacing Ginni, from my experience (and from talking to various managers and VPs/Directors) this culture isn't completely unexpected. From what I saw, the specific management chain and location mattered far more at IBM than at comparable companies.
For example, before Kyndryl was split off, my team was interviewing internally for a role, and a common theme from applicants in GTS/GBS was extreme overmanagement, resistance to change/progress, and long hours with poor work-life balance.
I don't want to say it's a lost cause. I definitely saw modernisation/adjustment success stories at IBM, and I was actually part of one back when I was a grad! But I do think there's a lot of value in seriously evaluating what impact is actually possible before you go asking for support from much higher up in the company. Don't kill yourself trying to solve what you cannot fix.