
Practical Linux Pipelining

  • 17 May, 2014
  • 2,609 words
  • 14 minutes read time

There are many subtle joys associated with working almost exclusively in the command line all day: tab completion, a simple interface, and unix pipes.

Like general command-line mastery, learning how to wrest unix pipes to your best advantage can translate into a wide variety of benefits. Don’t get me wrong, when you need a scripting language, use a fully fledged scripting language – bash isn’t going to juggle diverse data types and complex logic as gracefully as python or ruby. But when you need to briefly bridge the gap between repetitiveness and automation, shells and pipes can make life a lot better.

Unix (POSIX?) executables tend to follow a beautifully simple mantra: do one thing well, reading from standard input and writing to standard output.

This simplicity means that those same executables make exceptionally good building blocks. I hope that I can offer some helpful advice in learning how to use pipes for a more supercharged *nix experience.

A Pipe Primer

To review, let’s watch a command pipe output to another command:

$ echo "Hello, world" | sed 's/world/dolly/'
Hello, dolly

What happened?

  • echo wrote "Hello, world" to its standard output
  • the pipe (|) connected echo's standard output to sed's standard input
  • sed performed the substitution and wrote "Hello, dolly" to its standard output, which the shell displayed in the terminal

Pretty simple. But as long as that's clear, we can build very useful pipes for a variety of purposes.

A few things to keep in mind before moving forward:

  • A pipe connects the standard output of the command on its left to the standard input of the command on its right.
  • Standard error is not piped by default; error messages skip the pipeline entirely and land straight in your terminal (see the demonstration below).
  • Each command in a pipeline runs as its own process; the shell merely wires them together.
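
To see the standard error caveat in action, here's a minimal sketch (assuming /nonexistent does not exist on your machine; the exact wording of the error depends on your ls implementation):

$ ls /nonexistent | sed 's/^/piped: /'
ls: cannot access '/nonexistent': No such file or directory

The error message arrives unmodified: it traveled over stderr, so sed never saw it.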

I’m going to illustrate some more ideas with a couple of examples in order to solidify some concepts before getting into more nuanced pipe usage.

Exercises

The Exploding MySQL Server

Your MySQL server is dying. You're getting "connection refused" errors, but service mysqld status says everything is fine, and server load is high but not crippling. Where to go from here?

Given that it’s a connection problem, why not look at the connections themselves?

netstat is a very useful command for gathering low-level TCP/IP information about a *nix server. Typically I use it with the -l switch (print listening ports) to ensure that daemons that are supposed to be listening on a port are indeed doing so. But netstat can also list connections and their states. For example, here's a snapshot of nginx servicing requests to this blog:

$ sudo netstat -pnt | grep nginx
tcp  0  0 50.116.6.214:443 xxx.xx.xxx.xx:62659   ESTABLISHED 10862/nginx
tcp  0  0 50.116.6.214:443 xx.xxx.xxx.xxx:52493  ESTABLISHED 10863/nginx
tcp  0  0 50.116.6.214:80  xxx.xx.xxx.xx:62669   ESTABLISHED 10860/nginx
tcp  0  0 50.116.6.214:80  xxx.xx.xxx.xx:30899   ESTABLISHED 10860/nginx
tcp  0  0 50.116.6.214:80  xxx.xx.xxx.xx:63692   ESTABLISHED 10860/nginx

(remote IPs edited to protect reader privacy)

Quick netstat primer:

  • -p shows the PID and name of the program that owns each socket
  • -n prints numeric addresses and ports rather than resolving names
  • -t restricts the output to TCP sockets

So, let’s look at MySQL connections on this struggling server:

$ sudo netstat -ptn | grep mysql
tcp  0  0 10.1.1.1:3306  10.1.1.5:53940    ESTABLISHED 2657/mysqld
tcp  0  0 10.1.1.1:3306  10.1.1.4:36353    ESTABLISHED 2657/mysqld
tcp  0  0 10.1.1.1:3306  10.1.1.2:33773    ESTABLISHED 2657/mysqld
tcp  0  0 10.1.1.1:3306  10.1.1.5:48255    ESTABLISHED 2657/mysqld
tcp  0  0 10.1.1.1:3306  10.1.1.5:36364    ESTABLISHED 2657/mysqld
... plus lots more ...

Looks like our server is handling lots of mysql connections from nearby machines (10.1.1.5, for example, is a remote host, and the truncated output above shows at least three connections established between it and the local MySQL daemon).
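
If one remote host looks suspicious, grep's -c flag (count matching lines) gives a quick tally of its connections; a small sketch reusing the sample addresses above:

$ sudo netstat -pnt | grep mysqld | grep -c '10.1.1.5'
3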

If your server is a busy one, there could be many ESTABLISHED connections – far too many to look at without some sort of aggregation. Therefore, it’s time to pipe. First, determine the question you’re trying to answer. In this case, the question likely is, “who is making so many mysql connections?”

There are a few different ways to answer this question, but here’s one way:

$ sudo netstat -pnt | grep mysqld | awk '{ print $5; }' | sed -E 's/:[0-9]+//' | sort | uniq -c | sort -n
  36 10.1.1.2
  36 10.1.1.3
  36 10.1.1.9
  72 10.1.1.7
 104 10.1.1.4
 116 10.1.1.5
1124 10.1.1.6

The pipeline fits together like so:

  • netstat -pnt lists all TCP connections along with the process that owns each one
  • grep mysqld keeps only the lines belonging to the MySQL daemon
  • awk '{ print $5; }' extracts the fifth field, the remote address:port pair
  • sed -E 's/:[0-9]+//' strips the port, leaving just the remote IP
  • sort groups identical IPs into adjacent lines
  • uniq -c collapses each group into a single line prefixed with a count
  • sort -n orders the result numerically by that count

At this point in your debugging, it should be self-evident that 10.1.1.6 is establishing far too many connections, and you can proceed from there.

MySQL very likely has utilities that can expose this sort of information, but because we're looking at the low-level sockets, this technique works for any daemon accepting TCP connections, which includes anything from Apache, to nginx, to Mongrel, to Mongo.
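
For instance, swapping the grep pattern is all it takes to run the same aggregation against nginx instead (a sketch; the process name to match depends on the daemon you're interested in):

$ sudo netstat -pnt | grep nginx | awk '{ print $5; }' | sed -E 's/:[0-9]+//' | sort | uniq -c | sort -n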

Try building a similar pipeline yourself, beginning with netstat and gradually adding executables to observe how the pipeline output changes.

One important note: when laying these sorts of pipes, I'd never blindly type out the entire pipeline at once. Start with netstat, then add grep, and so on; building the command incrementally lets you see what's in the pipeline at each stage, which makes debugging far easier.
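
For instance, pausing after the awk stage shows the raw remote address:port pairs before any aggregation happens (output abbreviated, reusing the sample connections above):

$ sudo netstat -pnt | grep mysqld | awk '{ print $5; }'
10.1.1.5:53940
10.1.1.4:36353
10.1.1.2:33773
... and so on ...

Only once that looks right would I append the sed, sort, and uniq stages.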

For example, we now have the number of connections per remote host. What if we want to sum them? Since the first part of the pipe already works, adding a few more commands to the pipeline gets the answer in multiple ways:

$ sudo netstat -pnt | grep mysqld | awk '{ print $5; }' | sed -E 's/:[0-9]+//' | sort | uniq -c | sort -n | awk '{ print $1; }' | paste -sd+ - | bc
1524
$ sudo netstat -pnt | grep mysqld | awk '{ print $5; }' | sed -E 's/:[0-9]+//' | sort | uniq -c | sort -n | awk '{ total += $1; } END { print total; }'
1524

The two approaches are:

  • The first pulls out just the count column with awk, uses paste -sd+ - to join the counts into a single arithmetic expression (like 36+36+36+...), and hands that expression to bc to evaluate.
  • The second drops the extra commands and lets awk do the arithmetic itself, accumulating a running total and printing it in an END block once all input is consumed.

The Rogue Process

This is a straightforward exercise, but it illustrates a couple of important commands.

You’re running apache, but unfortunately, the processes are stuck and won’t go away – no response to a good old kill $PID. Thus you need to find all instances of httpd listening on port 80 and force kill the processes.

This is a case that necessitates piping output into the arguments of a command rather than its standard input. To illustrate why, look at these commands:

$ echo 101 | kill
kill: not enough arguments
$ kill 101
$

The second command returns successfully because kill expects its PIDs as arguments, not on standard input.

Luckily, the xargs command exists for this purpose. Building upon the previous example, a pipeline to force kill all processes communicating over port 80:

$ sudo netstat -pnt | awk '{ if ($4 ~ /:80$/ && $NF ~ /[0-9]/) print $NF; }' | cut -d/ -f1 | sort -u | xargs kill -9

Note: This is an example using netstat criteria to select specific processes – to kill processes based solely upon process name, check out pkill.

To summarize the pipeline:

  • netstat -pnt lists TCP connections along with the PID/program pair that owns each one
  • awk keeps lines whose local address (field 4) ends in :80 and whose final field actually contains a PID (sockets without a visible owner show - there)
  • cut -d/ -f1 splits a PID/program pair like 10860/nginx on the slash, keeping just the PID
  • sort -u deduplicates the PIDs, since one process usually holds many connections
  • xargs passes the surviving PIDs as arguments to kill -9

Thus with this pipeline, we’ve force-killed all processes communicating from server-side port 80. This can be handy if your webserver (be it apache, nginx, lighttpd, or something similar) has locked up for some reason and won’t die.

The xargs command is a good example of taking piped output and performing an action upon the results and not just printing output for easier reading.
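
As a toy illustration of that pattern (nothing destructive this time), xargs can invoke a command once per item with the -n flag:

$ printf '%s\n' one two three | xargs -n1 echo processed
processed one
processed two
processed three

Swap echo for kill, rm, or anything else that wants arguments, and you have the same shape as the pipeline above.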

The Parallel Tarball

The xargs command can be extremely useful when pipeline executables require explicit arguments. Even better, these types of tasks tend to take the form of discrete jobs, and can often be parallelized.

To illustrate this, imagine a list of directories that you need to compress. You could use find with the -exec flag, or try something like this using parallel:

$ find . -mindepth 1 -maxdepth 1 -type d | parallel -j10 tar cvjf {}.tbz {}

This pipeline uses find to collect all directories in the working directory, then runs tar in parallel to archive and compress them. A contrived example, but it extends easily to other workloads that parallelize well, such as network transfers or bulk file modification.
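
As a sketch of the network case (urls.txt is a hypothetical file containing one URL per line), the same pattern fans out downloads:

$ cat urls.txt | parallel -j8 curl -sO {}

Each line becomes one job, and parallel keeps eight downloads in flight at a time.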

Aside from normal find usage, parallel accepts the -j flag to set job concurrency. By default it runs one job per CPU core, which is usually a sensible value, so it rarely needs tweaking.

WARNING: In playing around with this extensively, I tried setting my concurrency to the maximum allowable (using -j0). This actually made my Macbook kernel panic - so, you know, beware of insane settings with parallel.

The Pipe Viewer

Sometimes when passing large amounts of data through pipes, knowing the rate at which data is flowing through the pipe can be extremely useful. Enter the pv command.

Although pv can be arcane to use at first, the easiest way to remember how to use it is to imagine measuring the flow through any other sort of physical pipe. Place it right in between a large flow of data within a pipeline, and it’ll at least give you the rate at which data is flowing.

One example is measuring the rate at which data is fed into a checksum command such as md5sum:

$ pv archlinux-2013.10.01-dual.iso | md5sum
 161MiB 0:00:05 [48.1MiB/s] [======>                 ] 30% ETA 0:00:11

And we get a progress bar indicating when we’ll get output from the md5sum command.

This can be equally useful for trivial tasks like copying large files – think cp, but with a progress bar.

$ pv verylargefile > copyofverylargefile

Network Piping

A recent and informative reddit post reminded me that this is possible. Because ssh handles standard input and output pretty well, we can leverage it to do some useful things.

For example, if you want to compress and pull a tarball from a remote host to your local machine, the following command will do so:

$ ssh remotehost "tar -cvjf - /opt" > opt.tar.bz2
... or, to compress client-side ...
$ ssh remotehost "tar -cvf - /opt" | bzip2 > opt.tar.bz2

The dash argument to -f in the tar command indicates that the archive should be written to stdout rather than to a file on disk. In the first form, the j flag compresses with bzip2 on the remote host; in the second, the uncompressed tar stream travels over ssh and is piped through bzip2 locally, trading network bandwidth for remote CPU.
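
The same trick works in reverse. A minimal sketch for pushing a local directory to a remote host without ever writing a temporary archive locally (remotehost and the paths are placeholders):

$ tar -cvjf - /opt | ssh remotehost 'cat > opt.tar.bz2'

Here tar writes the compressed archive to its stdout, ssh forwards that stream to the remote command, and cat writes it out to disk on the other end.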

My favorite use case for this has been backing up a running raspberry pi, where there’s little disk space to spare for a temporary tarball on-disk.

ssh handles stdin equally well. For example, the following pipeline uses pssh (parallel ssh) to retrieve kernel versions from multiple hosts and summarize them in a file on a fourth host:

$ pssh -i -H 'host1 host2 host3' 'uname -a' | grep -i linux | ssh host4 'xargs -L1 echo > kernels'

There’s a myriad of possibilities if you couple pipelines across hosts with ssh.

Conclusion

This barely scratches the surface of effective pipe usage, and we haven’t even looked at loops, tricks like set notation, and other command-line sorceries. However, with some of these basic concepts, chaining together commands for fun and profit should be well within reach.

If you’ve got any nice pipeline tricks of your own, please share!