Know Your Tools, and Fear No Bug
One of my favorite series of blog posts of all time is “Unix as an IDE”. The series walks you through how your Unix/Linux environment *is* your IDE. This philosophy challenges the use of a dedicated IDE for development, since all the tools you need are already on your Operating System. Debugger integration? Why not just use gdb rather than the wrapper your IDE provides? Remote file editing? How about wrapping a call to scp/rsync within vim/emacs? Auto-complete? With the Language Server Protocol (LSP), one can bring rich auto-complete features to numerous text editors (if they don’t already exist). One might argue that modern IDEs provide productivity benefits for Software Developers, and it’s not that I disagree or dislike any particular IDE, but I believe knowing the underlying Operating System’s core utilities enables mastery of a given environment.
For me, leveraging native utilities (strace/gdb/etc…) for troubleshooting and resolving issues is usually a more inviting solution than installing a 3rd-party utility, especially if that 3rd-party utility just glues the underlying tools together for you anyway. The better understanding you have of your underlying environment, the better your ability to bend it to your will or resolve issues as they arise.
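As a small illustration of that philosophy, the /proc filesystem on Linux can answer “what is this process actually doing right now?” without installing anything at all. A sketch (it uses the shell’s own PID, `$$`, purely for illustration; point it at any PID you own):

```shell
#!/bin/sh
# Inspect a running process with nothing but /proc (Linux-only).
pid=$$

# The exact command line the process was started with (args are NUL-separated):
tr '\0' ' ' < /proc/$pid/cmdline; echo

# Its current working directory and every file descriptor it holds open:
readlink /proc/$pid/cwd
ls -l /proc/$pid/fd
```

Tools like ps, lsof, and even strace are, to a large degree, polished front-ends over exactly this kind of kernel-provided data.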
That being said, modern software stacks are not trivial. They’re often numerous micro-services glued together through a series of API calls that make their way into an intermediary data representation (JSON/Protobuf/Parquet/XML/YAML) to be ingested by yet another micro-service until the result eventually reaches the end user. The ever-growing number of abstraction layers continues to separate you, the Software Developer/Systems Administrator/DevOps Engineer/Hacker/Reader of this blog, from the core process(es)/program/Operating System where your app and its associated bugs reside. While it’s great that the next generation of engineers is exploring platforms such as Kubernetes prior to entering the workforce, it’s the ability to troubleshoot and dissolve these abstractions that I fear is being overlooked. Allow me to explore this idea with you and our fictitious company of Arch Cloud Labs.
Perhaps you have a lovely pipeline of monitoring tools built around Prometheus and Grafana that sends alerts to PagerDuty to inform you the “Hello-World” app is down. But wait, it’s almost 2023 and we’re Cloud Native! Maybe you have CloudTrail (or a similar service in your Cloud provider) alerts that trigger a Lambda function to have a Slack bot message the on-call team while also creating a Jira ticket. Maybe the person who used to maintain this stack left the company, and you’re trying to figure out if they left any documentation… Perhaps it’s something in between? If we peer back a few months to my blog post on configuring AWS billing alerts with Discord notifications, you’ll see four services are needed prior to interacting with Discord’s API just to alert me that I’ve spent $7.00 in the past month. These pipelines can also be fragile and may not fire the way you think they do, resulting in missing valuable data or missing the alert altogether. Regardless, the levels of abstraction involved in even reporting that an error has occurred continue to grow as Cloud Native technologies are adopted. If our alerting and monitoring stacks are becoming more complicated and require a small team to maintain, are the debugging tools keeping pace? And if they are, can we, the team at this fictitious company, employ them correctly to gain insights as fast as the business requires?
Ideally, alerts are generated by the aforementioned pipelines before problems become incidents, but what happens when the issue isn’t captured by the monitoring that’s established? Could it be that at this very moment you’re not monitoring for the “right” things? When it’s down to you and the platform (cloud, k8s, etc…), what are your options? When faced with the cold blank stare of the unknown, “un-googleable” error, I encourage you, dear reader, to embrace the unknown and trace your way into understanding. It is the pursuit of truth in these debugging investigations that reveals the SaaS stack you once thought you knew well is in fact foreign to you. Rather than simply redeploying a failed IaC deployment and hoping for the best, seek out the underlying logs and understand why things went awry. When a Pod crashes, seek to understand the root cause rather than just scaling Pods to account for failure. It is at this level of analysis that the abstractions of today’s modern applications break down, and you, dear reader, can peer through the veil of abstraction and see your application for what it truly is: a mix of 3rd-party (potentially deprecated) API calls glued together with syscalls to produce the 1s and 0s that achieve the goal of your hello-world application.
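One concrete habit along these lines: when a process or container dies, decode *how* it died before restarting it. On Unix, a process killed by signal N exits with status 128+N, which is exactly why a Kubernetes container OOM-killed with SIGKILL reports exit code 137 (128 + 9). A minimal sketch, using a deliberately crashed subshell as a stand-in for a failing app:

```shell
#!/bin/sh
# Deliberately kill a subshell with SIGSEGV, standing in for a crashing app.
sh -c 'kill -s SEGV $$'
status=$?
echo "exit status: $status"

# Statuses above 128 mean "terminated by signal (status - 128)".
if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"   # 11 = SIGSEGV
fi
```

That single number tells you whether the app exited on its own, segfaulted, or was killed by the kernel, and each of those points the investigation in a very different direction.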
As the march of abstraction continues and YAML consumes your day-to-day, I implore you to question how you would peer into the void of endless streams of API calls to identify issues should they occur. While globally distributed services will require centralized logging to be effective, knowing how to debug a recreated issue in isolation is key to long-term resolutions that aren’t hot-patched production fixes. There’s no stopping the continued growth of technologies that create abstractions for the Software Developer/Systems Administrator/Hacker to master. I’m not suggesting that one should not adopt said technologies, but I am suggesting that you prioritize the ability to debug your environments before they become too unwieldy and take on a life of their own. In short, dear reader: know your tools and how to effectively use them to seek answers to the questions you have today, and may have tomorrow, and you will fear no bug.
Thank you for reading.