xargs.io

Startup Principles

startup principles gratitude engineering excellence

TLDR

When I build my own company, I’ll think back to working at a bootstrapped startup and the lessons I learned from our founder.

My business will embody these principles:

Believe (obsessively and doggedly believe)
Keep It Super Simple ¹ ² (radical simplicity)
Ship It (ship the bare minimum and get feedback, repeat)

Honorable mention:

do things that don’t scale ³ (By the time you have scaling limits, you can pay to figure out scale.)

When I’m building features, finding customers, or planning technical architecture, I’ll think of her and say: WWMD.

My Startup Story

Small startups are a cesspool of ideas, experiences, stress, and joy, and I learned a ton through my phases in smaller startups. Today, I want to highlight lessons learned from working at a small bootstrapped-ish startup. I’ve also been reading Ray Dalio’s book Principles ⁴.

The startup was ~50 people when I left and I was there for 3 years. I joined two years into their journey (~25 ppl total with 6 in engineering). Money was always tight and yet we were moving tens of millions in revenue when I left.

Beyond learning from my time there, I learned from the 2 years before me… from the example set by our solo founder.

She had a year or two of professional experience in web development, focused on design and frontend, and had no business experience… until she made her own experience! She became obsessed with a business problem she encountered casually and from then on lived, breathed, and fought for her business.

She built a truly minimum viable product… and was a shining example of what you can do with radical simplicity. Her business started with a Google Form. She leveraged amazing productivity and value from off-the-shelf free tools. Turns out free tiers can be pushed REALLY FAR with the right urgency and necessity! If Airtable existed at the time, I think that would have been our database.

She learned what she needed to know about databases, servers, and analytics, but it was a means to an end of making the business work. She didn’t get lost in the la-la land of falling in love with the tech. The tech was a tool she became proficient with, but it didn’t rule her. My time with her honed my business approach to delivering the most value we could with available time and energy. We radically optimized for making the biggest bang for the buck, because there weren’t many bucks to go around.

What drew me to the company was her obsession, passion and success. Her fervor and obsession were infectious and I saw my own star rising with hers. I learned to see the world a bit more like she did, with fewer rules and boundaries. I bring that with me to my own engineering leadership and personal values in software engineering.

Starting in a new environment

micro very-professional

A Story of Ludicrous Speedups (3000 sec -> 5 sec)

During a work hackathon, our project involved using Docker for deployment and dependency management.

The Dockerfile was inherited from an underlying open source project and was ok when used for deployments but very slow for local development work. Why, you might ask?

It used multi-stage builds, one for node, one for golang and then a final stage that collected the built artifacts from the prior stages. But the problem was…

Conflating package installation with project build

The Dockerfile failed to use a best practice of first copying over the package manifest. For Node these are package.json & yarn.lock. For Golang it’s go.mod and go.sum.

Instead of copying over these specific files up front, the Dockerfile copied the full project into the container and then performed a build.

The problem

Since the local copied source code changed frequently during development, all later steps in the Dockerfile were invalidated and performed without caching :(. Downloading all golang dependencies and compiling from scratch was onerous.

The solution

Break apart the dependency installation phase from the local code phase. Package manifests should be copied in first, then yarn install will install Node dependencies. I had to get hacky to accomplish the same thing with Golang, but I’ll post my solution when I have a good moment.

Conceptually, the outcome was:

Build phase a1: Copy in package manifests for Node & yarn install
Build phase b1: Copy in package manifests for golang & fetch golang deps
Build phase a2 (built on a1): Build local code for js/ts
Build phase b2 (built on b1): Build local code for golang
Build phase c (independent of a or b): Selectively copy build artifacts from a2 and b2.
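
The phases above could look roughly like the following. This is a minimal sketch rather than the Dockerfile from the hackathon: the image tags, the `yarn build` script, the `dist` output folder, the Go main package living at the project root, and the use of `go mod download` for the golang step are all assumptions for illustration (my own golang workaround was hackier). It is wrapped in a small, hypothetical shell function so it can be pasted into a terminal to write the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes an example multi-stage Dockerfile to the given path.
write_dev_dockerfile() {
  local out="${1:-Dockerfile.dev}"
  cat > "$out" <<'EOF'
# Phase a1: Node dependencies only. Cached until package.json/yarn.lock change.
FROM node:lts AS node-deps
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile

# Phase a2: build local js/ts on top of the cached dependencies.
FROM node-deps AS node-build
COPY . .
RUN yarn build

# Phase b1: Go dependencies only. Cached until go.mod/go.sum change.
FROM golang:1 AS go-deps
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download

# Phase b2: build local Go code on top of the cached modules.
FROM go-deps AS go-build
COPY . .
RUN go build -o /out/app .

# Phase c: collect only the built artifacts.
FROM debian:stable-slim
COPY --from=node-build /app/dist /srv/public
COPY --from=go-build /out/app /usr/local/bin/app
CMD ["app"]
EOF
}
```

Running `write_dev_dockerfile Dockerfile.dev` produces the file. The point of the ordering is that editing source code only invalidates the `COPY . .` layers, so the dependency layers before them stay cached.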

Outcome

Dev build time for docker image is now near instant (5 seconds) rather than 2900 seconds on a low power laptop.

Bonus

We also created a dedicated Dockerfile.dev that excluded js production build logic which was accounting for 300+ seconds of build time. Instead the js was built with a development script enabling hot module reloading.

Using yubikey for SSH

ssh security yubikey

Using yubikeys everywhere is my jam…here’s how.

Setup

I did it by installing yubikey-agent with a:

brew install yubikey-agent
brew services start yubikey-agent

Then shell configuration in ~/.zshrc:

export SSH_AUTH_SOCK="/usr/local/var/run/yubikey-agent.sock"

For each yubikey

Create an 8 char password in password manager
Run yubikey-agent -setup
- Enter PIN/PUK
Get the public key and verify it works with ssh-add -L
Record public key in password manager and use the Yubico id to disambiguate which yubikey
Add the public key anywhere relevant, e.g. https://github.com/settings/keys

Credit/Links

My workflow is a mixture of docs from https://github.com/FiloSottile/yubikey-agent and my own password manager setup.
Another time when I want to tinker more, I’ll try out this set of instructions ssh and gpg from yubikey
Superseded by 1st link: https://github.com/jamesog/yubikey-ssh

Note Taking System

I’m trying out a new note taking system with the following goals:

Easy to draft
Searchable
Similar syntax/shortcuts to other workflows (markdown)

It’s based on the concept of Zettelkasten and I’m stitching together my own system using:

Neuron
vscode-memo
vscode-highlight: For coloring #tags and @-mentions
ripgrep for searching (used from bash scripts)
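
As an illustration of the ripgrep piece, a search helper might look something like this. It is a sketch, not my actual script: the `notes_search` name, the `NOTES_DIR` variable, the `~/notes` default, and the fallback to plain `grep` when `rg` is missing are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: search markdown notes for a pattern (e.g. a #tag).
# NOTES_DIR is an assumed environment variable pointing at the notes folder.
notes_search() {
  local pattern="$1"
  local notes_dir="${NOTES_DIR:-$HOME/notes}"
  if command -v rg >/dev/null 2>&1; then
    # path:line:text output, restricted to markdown files
    rg --no-heading --line-number --glob '*.md' -- "$pattern" "$notes_dir"
  else
    # Fallback for machines without ripgrep
    grep -rn --include='*.md' -- "$pattern" "$notes_dir"
  fi
}
```

For example, `notes_search '#rust'` lists every note line that carries that tag.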

Articles To Write

micro

Transitioning a Startup from MongoDB to Postgres
Engineering Leadership with Remote Teams
Lessons as an Engineering Leader
Emotional Labor of Management
Productive Engineering and Management Culture
The Value of Long Form Writing and Thinking Time
Elixir Reflections: 2013 to Now
My Development Environment
Notetaking 2021
Startup Horror Stories

Long Absence and New Excitement

update

My blog has been dormant, where have I been?

Busy! With my professional life and personal life. Big and good things on both fronts.

Since I last wrote:

I led a 50 person startup in manufacturing as the CTO using Elixir and Typescript
I led the development of a SaaS product as their Sr. Director of Engineering (Python, Elixir, Typescript, Graphql, and REST)
I built and ran an open source app in production that streams mutations in MongoDB to Postgresql called Moresql
We used Moresql to transition a company from MongoDB to Postgresql.
I spent a lot of time with my cats
I refined my bash coding style, as evidenced in my dotfiles
I’ve lived in 3 distinctly different places and I’ve gotten to do some traveling
I’ve been fully remote in remote and colocated companies since 2013

I haven’t written about my engineering leadership experiences because it’s hard to sufficiently abstract them in the moment in a way that I can publicly write about. I’ll see if that changes in the future because I want to be able to share learnings, since my professional growth is so very important to me.

I recently put up two pull requests for a commandline tool written in Rust called Tome. Tome is a rust binary tool that helps make a folder of scripts re-usable for one or more people so that it has good user ergonomics. My pull requests are https://github.com/toumorokoshi/tome/pull/4 and https://github.com/toumorokoshi/tome/pull/5.

With three days of Rust, my impressions are as follows:

The compiler teaches me the language, through useful error messages
Some types are obvious (Option/Result types) while others (String vs str and borrowed version) need me to investigate to understand
Cross compiling isn’t as user friendly as golang but is working now that I took time to set it up
All my favorite commandline tools are moving into Rust, perhaps I should too!

If you see me and want good conversation starters:

Ask me about using Trello as a columnar data store
About moving from MongoDB to Postgres (thanks JSONB)
About what my best purchases were during covid

In 2021, I’m looking forward to:

Learning and growing technically (Rust?)
Getting good at a few things outside software engineering.
Spending more time outdoors.
Traveling to see friends.

It’s 2021, I’m happy and have challenges to work on :).

Mp4s To Gif From Twitter

Visit the link, Cmd-Option-J in Chrome to open DevTools.

Execute and capture result of $('#playerContainer').data().config.video_url

Take that link and enter it in https://cloudconvert.com/ and select “Select Files” dropdown, enter url.

Or, use a commandline tool:

Install pre-requisites

pip install cloudconvert requests

Download script: https://gist.github.com/c83c3e91ee3f9df21686bb50b4fbf904

Make it executable: chmod +x twitter-gif

Run it: twitter-gif TWEET-LINK outputfilename-optional

Solving Infinite Loop In NPM With Dtruss

npm dtruss debugging

Last week one of the engineering juniors that I mentor ran into a strange environmental issue.

When he ran npm run karma it would run for ~8 minutes and then suddenly spit out an out of memory error. He tried debugging it for a while himself and then reached out to me to assist.

We ran through the normal set of troubleshooting steps:

Verify NPM and Node are on versions appropriately matched to production. (They were newer so we re-installed the ones used in prod)
rm -rf node_modules/ followed by npm install. (This semi-frequently resolves issues when old dependencies are not cleared out)

And when we tried running the offending command again, we suffered the error once more.

Which was when I reached into my bag of tricks and thought back on articles by @b0rk and @brendangregg. I remembered tutorials about using Dtrace to track down system calls from particular process identifiers. And I remembered a similar tool called DTruss that allows for attaching to PID and observing the system calls. For more info on DTruss, go check it out here: http://www.brendangregg.com/DTrace/dtruss or by vim $(which dtruss).

So I explained the barebones that I knew about how DTruss operates and we fired up dtruss npm run karma.

We had time to talk a bit about system calls and the meaning of the readout. After 2 minutes we noticed that the log continued to fly by but the same folder was being accessed. Over and over and over. We had a recursive dependency due to an out-dated library that was stored inside the project tree.
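
We spotted the repetition by eye, but the same signal can be pulled out of a saved trace. The sketch below assumes the DTruss output was captured to a file (called trace.log here) and that paths show up as double-quoted syscall arguments; the `top_traced_paths` name is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the most frequently touched paths in a saved trace.
# Usage: top_traced_paths trace.log [how_many]
top_traced_paths() {
  local trace_file="$1"
  local limit="${2:-5}"
  grep -oE '"/[^"]*"' "$trace_file" |  # pull out quoted absolute paths
    sed -e 's/\\0"$/"/' |              # drop the trailing \0 some traces print
    sort | uniq -c |                   # count identical paths
    sort -rn | head -n "$limit"        # most frequent first
}
```

A path whose count dwarfs everything else, like our recursive dependency folder, is a good place to start looking.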

Thanks to DTruss, we realized the issue, wiped out the offending folder and tried again with success!

PS - While writing this article I learned that Brendan Gregg wrote DTruss. Many thanks both for DTruss and for writing articles about how to use these tools! I also owe a thanks to Julia Evans who exposed me to these tools through her blogging and Zines :).

Thoughts On Gitlab Data Incident

Background

On Feb 1st, Gitlab suffered an irrecoverable data loss covering a period of 6 hours.

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

(In case that link goes stale, here’s a copy: https://gist.github.com/8b9449ec4260583d0e644c7cdc94f3be)

My first thought is that it’s a horrible experience both for the users who lost data and for the engineers involved in the process at Gitlab. The feelings of anger, self doubt and frustration are hard to bear. I wish them all the best in recovering and getting back to work. My heart goes out to them for this experience.

After being floored by the possibility of permanent data loss, my thoughts went next to consider how their experience could inform my team’s decisions with regard to our own processes.

None of this is intended as backseat driving the situation Gitlab suffered. It is intended as constructive discussion of systems failing to discourage human error, to which we are all susceptible.

Summary of Events

The tl;dr was that the PG replica got behind. Engineer1 went into debugging after their shift was over. The engineer believed they were SSH’d into the replica, but they were really SSH’d into the primary. At this point Engineer1 tried to run a command to start replication. They had trouble with the command and assumed they needed to fully wipe out the data directory where postgres stores databases. They ran a variant of “rm -rf” and removed the 300GB of data. Engineer1 realized the issue and stopped the deletion when only a few gigabytes remained. The data was unrecoverable from the data directory. At this point Engineer1 handed off the baton, having realized the mistake and already being heavily fatigued.

Their 5 backup systems all failed them. Their latest mostly complete backup was 6 hrs out of sync. Their webhooks data is lost or 24 hrs out of sync.

Repeating that… all 5 backups failed! That is a very very worst case.

That said, their data from 24 hrs ago seemed like valid backups and their backup from 6 hours before was valid. That means backup system 6 and 7 were working decently.

Ways to Limit Risk in Future

My takeaways from their incident:

Check that your backup system works the way you think it does. Ideally this means periodic automated and manual checks in which backups are loaded into a system and verified.
Use buddy system when doing potentially dangerous things on production.
This would lessen the likelihood of executing commands while SSH’d into the wrong box
Talk through actions before doing them when on production. Have team mate confirm each step.
Take an airline pilot checklist approach to these situations to fend off some of the avoidable mistakes.
Do not make big decisions under a time crunch. The engineer was trying to leave at the end of their shift with a hard stop. They were rushed and stressed. Letting replication lag far longer and handing off to another person could have averted the much worse disaster that they induced. Twelve hours of partially degraded service might be a worthwhile trade instead of a complete loss of 6 hr of data.
Tiredness leads to mistakes. Tap out and hand off the baton.
Take a backup manually before operating like this on production systems. A 5 minute operation of streaming an export via pg_dump to AWS S3 would help narrow the window from 6 hr of loss to minutes or zero time (assuming the app was in full maintenance mode during database replication). I take advantage of this technique before doing potentially destructive database actions. Create a full db snapshot if it’s a db level change or a table level snapshot if limited to a single table. Commit your action, validate findings, and then wipe out the snapshots if space is precious.
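
For the manual backup takeaway, a streaming export could be sketched like this. The `snapshot_to_s3` name, the `manual-snapshots/` key prefix, and the gzip step are assumptions for illustration, and it presumes `pg_dump` and the AWS CLI are installed and already have credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stream a pg_dump straight to S3 without a local temp file.
# Usage: snapshot_to_s3 <postgres-connection-uri> <bucket>
# The body runs in a subshell so the strict-mode options don't leak to the caller.
snapshot_to_s3() (
  set -euo pipefail
  database_url="$1"
  bucket="$2"
  # Timestamped key so successive snapshots never overwrite each other
  key="manual-snapshots/$(date -u +%Y%m%dT%H%M%SZ).sql.gz"
  # "aws s3 cp -" reads the object body from stdin
  pg_dump "$database_url" | gzip | aws s3 cp - "s3://${bucket}/${key}"
  echo "s3://${bucket}/${key}"
)
```

It prints the S3 location on success, which is handy to paste into the incident channel before touching anything. With pipefail set, a failing pg_dump makes the whole call fail rather than silently uploading a truncated file. For very large databases, the AWS CLI's `--expected-size` option for streamed uploads is probably worth a look.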

Conclusions

Humans make mistakes when working with complicated systems. Well designed systems and policies help put safeguards in place to reduce the likelihood of irrecoverable & disastrous events.

I anticipate that the engineering team is working on a clear blameless post-mortem to bring closure to this event. If you’re unfamiliar with blameless post-mortems, check out this article by John Allspaw: https://codeascraft.com/2012/05/22/blameless-postmortems/. During the post-mortem they’ll identify the actions taken and circumstances of the incident along with systems and protocols that can be improved to make these circumstances less likely to recur.

PS - I went and checked our various backups for production systems after this event. The hourly, daily, weekly, monthly backups are in good order for Mongo, Postgres and Redis. The automated backups of Redshift look good, as do the manual checkpoints from before major changes. The S3 copies that are permanently stored for varying durations for Mongo are in good shape as well. The realtime replication of Mongo to Postgres is in good shape and has preserved us from data loss when an incident occurred. I’ll be ever nervous about data loss, but I think we’re in generally good shape.

All the IO and Multiplexing

10 Apr 2021

09 Apr 2021

09 Apr 2021

04 Apr 2021

29 Mar 2021

25 Mar 2021

24 Mar 2021

07 Mar 2017

12 Feb 2017

01 Feb 2017
