- 17 May 2014
- 2541 words
- 14 minutes read time
There are many subtle joys associated with working almost exclusively in the command line all day: tab completion, a simple interface, and unix pipes.
Like general command-line mastery, learning how to wield unix pipes to your best advantage can translate into a wide variety of benefits. Don't get me wrong: when you need a scripting language, use a fully fledged scripting language – bash isn't going to juggle diverse data types and complex logic as gracefully as python or ruby. But when you need to briefly bridge the gap between repetitiveness and automation, shells and pipes can make life a lot better.
Unix (POSIX?) executables tend to follow a beautifully simple mantra:
- Do one thing, and do it well
- Accept simple, ASCII input
- Produce readable, ASCII output
This simplicity means that those same executables make exceptionally good building blocks. I hope that I can offer some helpful advice in learning how to use pipes for a more supercharged *nix experience.
A Pipe Primer
To review, let’s watch a command pipe output to another command:
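A minimal pipeline along these lines illustrates the idea (the replacement string is arbitrary):

```shell
echo "Hello world" | sed 's/world/reader/'   # prints: Hello reader
```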
- The echo command begins the pipeline. We pass it the argument `Hello world`, which it simply prints to stdout.
- stdout is redirected to the stdin of sed. In the absence of a filename argument, sed operates upon the incoming stdin and returns the results to standard output (finding and replacing a string).
Pretty simple. But as long as that’s clear, we can build very useful pipes for a variety of purposes.
A few things to keep in mind before moving forward:
- Standard input is not the same as an argument. This is explored a little bit more later on.
- Many executables will automatically detect that you're passing input over standard input, but others will require an explicit flag or argument to indicate that input should be accepted over standard in (for example, `-` as an argument often indicates that standard in should be looked at.)
- If you're ever experimenting with piping and find that some input just 'disappears', bear in mind that standard error and standard output are different things. (Technically, standard output is file descriptor `1` and standard error is file descriptor `2`, so you can always merge them both into stdout with something like `2>&1`.)
I’m going to illustrate some more ideas with a couple of examples in order to solidify some concepts before getting into more nuanced pipe usage.
The Exploding MySQL Server
Your MySQL server is dying. You're getting "connection refused" errors, but `service mysqld status` says everything is good, and server load is high, but not crippling. Where to go from here?
Given that it’s a connection problem, why not look at the connections themselves?
`netstat` is a very useful command to gather low-level TCP/IP information about a *nix server. Typically I use it to ensure that daemons that are supposed to be listening on a port are indeed listening on a port with the `-l` switch (print listening ports). But netstat can also list connections and the state of those connections. For example, here's a snapshot of nginx servicing requests to this blog:
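A representative snapshot, with placeholder addresses standing in for the real ones:

```
$ sudo netstat -pnt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address       State        PID/Program name
tcp        0      0 198.51.100.10:80     203.0.113.7:52115     ESTABLISHED  1234/nginx
tcp        0      0 198.51.100.10:80     203.0.113.22:60410    ESTABLISHED  1234/nginx
tcp        0      0 198.51.100.10:80     203.0.113.51:49876    TIME_WAIT    -
```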
(remote IPs edited to protect reader privacy)
Quick netstat primer:
- `-p` print the process associated with the port (in the example above, it's the [process id]/[process name] portion)
- `-n` don't resolve names (both IPs and port numbers - could be useful or not, I used the switch above to avoid reverse DNS lookups on random readers of this blog)
- `-t` look at TCP connections (the `-u` switch is for UDP)
- `sudo` is necessary to retrieve all associated process names
- In addition, if you read the header preceding netstat output, the fields should be self-explanatory, although two of these fields aren't immediately obvious - the `0 0` values represent the receiving queue and sending queue (essentially how many packets are waiting to be handled by the kernel)
So, let’s look at MySQL connections on this struggling server:
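A representative (truncated) listing, with placeholder addresses:

```
$ sudo netstat -pnt | grep mysqld
tcp        0      0 10.1.1.2:3306      10.1.1.5:44371     ESTABLISHED  5678/mysqld
tcp        0      0 10.1.1.2:3306      10.1.1.5:44382     ESTABLISHED  5678/mysqld
tcp        0      0 10.1.1.2:3306      10.1.1.5:44390     ESTABLISHED  5678/mysqld
tcp        0      0 10.1.1.2:3306      10.1.1.6:51220     ESTABLISHED  5678/mysqld
...
```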
Looks like our server is handling lots of mysql connections to nearby machines (the 10.1.1.5 machine, for example, is the remote host, and the truncated output above indicates there are at least 3 connections established between it and your local MySQL daemon.)
If your server is a busy one, there could be many `ESTABLISHED` connections – far too many to look at without some sort of aggregation. Therefore, it's time to pipe. First, determine the question you're trying to answer. In this case, the question likely is, "who is making so many mysql connections?"
There are a few different ways to answer this question, but here’s one way:
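One possible pipeline matching the steps described below:

```shell
# Count established connections per remote host, busiest hosts last
sudo netstat -pnt \
  | grep mysqld \
  | awk '{print $5}' \
  | sed -E 's/:[0-9]+//' \
  | sort \
  | uniq -c \
  | sort -n
```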
The pipeline fits together like so:
- Netstat prints out one record per line, with whitespace field separators that awk can understand.
- Output is piped to grep in order to find only connections associated with the mysql daemon.
- Output is piped to awk, which prints the fifth field ([remote]:[port])
- Output is piped to sed. sed finds and replaces the regex `/:[0-9]+/`, effectively deleting the port from remote hosts. This is required to abstract away individual connections, allowing us to aggregate upon individual hosts.
- Note: I pass the `-E` flag here to enable extended regular expressions; your platform may vary. Check the man pages.
- Output from sed is passed to sort in order to gather like IP addresses together.
- Output from sort is passed to uniq, which gathers identical adjacent lines and, with the `-c` flag, counts them.
- Output from uniq is passed to sort again, in order to sort those unique lines by their frequency.
At this point in your debugging, it should be self-evident that 10.1.1.6 is establishing way too many connections, and you can proceed from there.
Although there are very likely utilities that can expose this sort of information for MySQL, because we’re looking at the low-level sockets, this technique can work for any daemon accepting TCP connections, which includes anything from Apache, to nginx, to Mongrel, to Mongo.
Try building a similar pipeline yourself, beginning with netstat and gradually adding executables to observe how the pipeline output changes.
One important note: when laying these sorts of pipes, I'd never just blindly type out the entire pipeline at once. Start with netstat, then add grep, and so on - incrementally building the command lets you see what's in the pipeline at each step, making debugging along the way far easier.
For example, we have the unique connections per remote host – but what if we want to sum them? With the working first part of the pipe, there are multiple ways to achieve the answer just by adding a few more commands to the pipeline:
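Sketched against canned counts rather than the live netstat output (so the arithmetic is visible), the two additions look like this:

```shell
# Suppose the earlier pipeline produced these (count, host) pairs:
counts='  3 10.1.1.5
 12 10.1.1.6'

# paste + bc: extract the counts, join them with +, and let bc evaluate
printf '%s\n' "$counts" | awk '{print $1}' | paste -sd+ - | bc    # prints 15

# awk alone: accumulate the first field and print the total
printf '%s\n' "$counts" | awk '{sum += $1} END {print sum}'       # prints 15
```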
The two approaches are:
- paste - first we print the connection counts, use paste to join them with +'s, then the command-line calculator utility `bc` performs the summation.
- awk - simply summing the first field per line then printing the resultant total achieves the same result.
The Rogue Process
This is a straightforward exercise, but illustrates a couple of important commands.
You're running apache, but unfortunately, the processes are stuck and won't go away – no response to a good old `kill $PID`. Thus you need to find all instances of `httpd` listening on port 80 and force kill the processes.
This is a case which necessitates piping output into the argument of a command rather than standard input. To illustrate why this is the case, look at these commands:
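As a sketch (using the current shell's PID via `$$` and signal `0`, which merely tests whether a signal could be delivered, so nothing is actually killed):

```shell
# Passing the PID as an argument works (signal 0 merely tests deliverability):
kill -0 $$ && echo "argument accepted"
# Piping the PID over stdin does not -- kill never reads standard input:
echo "$$" | kill 2>/dev/null || echo "stdin ignored"
```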
The second command fails because the `kill` command expects its PIDs as arguments, not as a bunch of standard input.
The `xargs` command exists for this purpose. Building upon the previous example, here is a pipeline to force kill all processes communicating over port 80:
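A sketch of such a pipeline (the exact field-matching logic may need adjusting for your netstat's output format; the regex is anchored so that, say, `:8080` doesn't match):

```shell
# Find PIDs bound to local port 80 and force-kill them
sudo netstat -tnp \
  | awk '$4 ~ /:80$/ && $NF ~ /^[0-9]/ {print $NF}' \
  | awk -F/ '{print $1}' \
  | xargs kill -9
```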
Note: This is an example using netstat criteria to select specific processes – to kill processes based solely upon process name, check out `pkill`.
To summarize the pipeline:
- `netstat` prints TCP connections (`-t`) with their associated processes (`-p`) using numeric ports (`-n`)
- `awk` prints out the last field ([PID/process name]) if the local connection string contains ":80" and the last field has numeric data (some TCP ports may show up without a process; this filters them out)
- a second `awk` splits on the `/` delimiter to print field 1 (the PID)
- `xargs` passes the PIDs to `kill -9` as arguments
Thus with this pipeline, we’ve force-killed all processes communicating from server-side port 80. This can be handy if your webserver (be it apache, nginx, lighttpd, or something similar) has locked up for some reason and won’t die.
The `xargs` command is a good example of taking piped output and performing an action upon the results, not just printing output for easier reading.
The Parallel Tarball
The `xargs` command can be extremely useful when pipeline executables require explicit arguments. Even better, these types of tasks tend to take the form of discrete jobs, and can often be parallelized.
To illustrate this, imagine a list of directories that you need to compress. You could use `find` with the `-exec` flag, or try something like this using GNU `parallel`:
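For example (a sketch; `{}` is parallel's placeholder for each input line):

```shell
# Compress each top-level directory into its own tarball, in parallel
find . -maxdepth 1 -mindepth 1 -type d | parallel tar czf {}.tar.gz {}
```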
This pipeline uses `find` to collect all directories in the working directory, and uses tar in parallel to archive and compress them. A contrived example, but one that could easily be extended to other use cases that parallelize well, such as those involving network communication, file modification, or other tasks that lend themselves to concurrency.
Aside from its normal behavior, `parallel` accepts the `-j` argument to set job concurrency. Usually this will be set to the optimal value for your host, so it doesn't typically need to be tweaked.
WARNING: In playing around with this extensively, I tried setting my concurrency to the maximum allowable (using `-j0`). This actually made my Macbook kernel panic – so, you know, beware of insane settings for `-j`.
The Pipe Viewer
Sometimes when passing large amounts of data through pipes, knowing the rate at which data is flowing through the pipe can be extremely useful. Enter the `pv` (pipe viewer) command.
Though `pv` can be arcane to use at first, the easiest way to remember how to use it is to imagine measuring the flow through any other sort of physical pipe. Place it right in between a large flow of data within a pipeline, and it'll at least give you the rate at which data is flowing.
One example would be measuring the rate at which data is fed into a checksum command, for example, md5:
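Something like the following, where the filename is hypothetical (Linux's `md5sum` shown; on macOS the command is `md5`):

```shell
# pv reports throughput and progress as the file streams into md5sum
pv large-file.iso | md5sum
```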
And we get a progress bar indicating when we’ll get output from the md5sum command.
This can be equally useful for trivial tasks like copying large files – think `cp`, but with a progress bar.
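A sketch with hypothetical paths:

```shell
# Copies the file while showing throughput and a progress bar
pv /path/to/big-file > /mnt/backup/big-file
```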
The SSH Pipe
I was reminded that this is possible by an informative reddit post recently. Because ssh handles standard input and output pretty well, we can leverage it to do some useful things.
For example, if you want to compress and pull a tarball from a remote host to your local machine, the following command will do so:
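Something like the following, where the user, host, and remote path are placeholders:

```shell
# Stream a tarball over ssh and compress it locally; no temp file on the remote
ssh user@remotehost 'tar cf - /var/www' | bzip2 > www-backup.tar.bz2
```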
The dash argument to `-f` in the `tar` command indicates that output should be sent to stdout and not an actual tar file, and piping that through bzip compresses the archive as well.
My favorite use case for this has been backing up a running raspberry pi, where there’s little disk space to spare for a temporary tarball on-disk.
ssh also handles stdin equally well. For example, the following pipeline uses `pssh` (parallel ssh) to retrieve kernel versions from multiple hosts and summarize them in another file on a remote host:
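A rough sketch; the host file, log host, output filename, and the status-line filtering are all assumptions (pssh's `-i` prints each host's output inline, preceded by a bracketed status line):

```shell
# hosts.txt lists one target host per line (hypothetical).
# grep -v '^[' drops pssh's "[1] ... [SUCCESS] host" status lines,
# leaving just the uname output to count and ship to a log host.
pssh -h hosts.txt -i 'uname -r' \
  | grep -v '^\[' \
  | sort | uniq -c \
  | ssh admin@loghost 'cat > kernel-versions.txt'
```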
There’s a myriad of possibilities if you couple pipelines across hosts with ssh.
This barely scratches the surface of effective pipe usage, and we haven’t even looked at loops, tricks like set notation, and other command-line sorceries. However, with some of these basic concepts, chaining together commands for fun and profit should be well within reach.
If you’ve got any nice pipeline tricks of your own, please share!