About useless use of cat (UUOC)

The context

It is well known that cat can be uselessly used and that other forms are preferable. For example:

cat file.txt | grep word

which can be better written like this:

grep word file.txt

This is not without its myths, though. Iván Zenteno (thanks, Iván!) argued in a recent Linuxeros Zapopan conversation that the reason to avoid it is memory consumption. Quoting:

(Note: all quotations are rough English translations from their original messages in Spanish)

cat | grep NOOOO. Simply use grep regex filename. Don’t overload memory with cat. If your file is 100 MB long you are doing cat to RAM and by passing it via standard input to grep there’s another 100 MB. Then you are using 200 MB when you could be using 100 MB only.

I challenged that assertion:

[It] should be verified. It seems that a cat implementation that reads all the content before throwing it out would be a waste of resources; therefore I don’t think it works like that, therefore the argument would be false. I honestly think it is not the case. I will try to do some tests.

By challenging it, it may have seemed like I was challenging the whole Useless Use of Cat (UUOC) case. Just to make it clear, I was not: I was just challenging the RAM aspect as an argument against UUOC.

I should have gone even further: not even grep will use that amount of memory (for a simple grep), because a simple search never needs the whole file at once.

Alex Callejas added a personal experience (thanks, Alex!):

If the search is for a string, wouldn’t it be better to use grep? In [non-current government entity] we had a script that first did cat on the whole file to look for CURP* strings. The problem came when the file was more than 1 TB in size. It took ages to run to find newly added CURPs. We implemented the search using grep, split the history into monthly files, and lowered resource consumption.

* For context, CURP is a unique identifier string for each Mexican. It follows a specific format.

It made sense. It seems like everyone agrees that cat can frequently be uselessly used. Me included.

Later, in a clarification I said:

I think pure grep is faster. I think it is because of its buffering and avoiding additional pipe processing, but the RAM consumption is a myth.

I offered to do some testing so we all could learn together. After all, this is one of those simple beliefs that can quickly be demystified and provide a great challenge to our personal understanding of underlying technology. Little did I know that it would throw back some interesting results!

The tests

The tests were run on a computer with 8 GiB of RAM (more than 4 GiB free for use by applications) and a first-generation 2-core i3 CPU with Hyper-Threading enabled (a total of 4 virtual cores). The disk drive is an SSD connected through a SATA II port. It runs Debian Sid with Linux kernel version 5.10.

I created three text files with sizes of 166.2 MiB, 3.57 GiB and 33.92 GiB. Why those sizes? The first one is an original file with long rows that include path names and sizes. I cannot publish it because it contains personal data. The other two were created by concatenating the first one multiple times, up to sizes that seemed reasonable to make the cases sufficiently distinct.
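Building the larger files from the first one takes nothing more than repeated concatenation. The following is a minimal sketch: the grow_file helper, the stand-in seed file and the copy count used in the demo are all hypothetical, since the original commands were not published. Judging only by the sizes in the results table, the medium and large files would correspond to roughly 22 and 209 copies of the small one.

```sh
# grow_file SEED COPIES OUT: write COPIES back-to-back copies of SEED into OUT.
# Hypothetical helper; not the author's actual command.
grow_file() {
    seed=$1
    copies=$2
    out=$3
    : > "$out"                     # start from an empty output file
    i=0
    while [ "$i" -lt "$copies" ]; do
        cat "$seed" >> "$out"      # concatenation is what cat is actually for
        i=$((i + 1))
    done
}

# Demo on a tiny stand-in seed so the sketch runs anywhere.
workdir=$(mktemp -d)
printf 'first row\nsecond row\n' > "$workdir/small.txt"
grow_file "$workdir/small.txt" 22 "$workdir/medium.txt"
wc -l < "$workdir/medium.txt"      # line count of 22 copies of a 2-line seed
rm -rf "$workdir"
```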

I tested for performance using Bash’s time built-in for the following tools:

  1. grep x $FILE | wc -l
    I filtered the results through wc -l to keep terminal output out of the measurement, which would have contaminated the result. wc -l seemed like a good choice: it is way simpler and faster than grep and thus contaminates the results as little as possible. Later, I got the suggestion to use grep -c to avoid the extra pipe, and it became test #3.
  2. grep x $FILE > /dev/zero
    Why /dev/zero? Because when using /dev/null grep skipped all processing. My take is that grep detects that output redirected to /dev/null would just be wasted and decides to avoid doing the work.
  3. grep -c x $FILE
    It is the same as test #1 but avoids the extra pipe. This is the purest possible form of the test. Thanks, Iván Zenteno!
  4. wc -l $FILE
    What if grep had some buffering that other utilities did not? This test tries to replicate the results with a tool other than grep.
  5. gawk '{l++}END{print l}' $FILE
    Same rationale as test #4. This gawk script counts every line, so it does the same job as wc -l (and matches grep -c x only if every line contains an x).

For each tool I tested three input techniques:

  1. Direct file specification
  2. Input redirection
  3. Standard input from cat.
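The three input techniques can be written side by side. This is a sketch for grep -c only, with hypothetical function names and a tiny generated sample file, neither of which comes from the original tests:

```sh
# The three input techniques, wrapped as functions so each can be timed alone.
# PATTERN and the function names are illustrative, not from the original tests.
PATTERN=x

count_direct()     { grep -c "$PATTERN" "$1"; }         # direct file specification
count_redirected() { grep -c "$PATTERN" < "$1"; }       # input redirection
count_piped()      { cat "$1" | grep -c "$PATTERN"; }   # standard input from cat

# Tiny generated sample so the sketch runs anywhere.
sample=$(mktemp)
printf 'xylophone\nbanana\nbox\n' > "$sample"

for technique in count_direct count_redirected count_piped; do
    printf '%s: %s\n' "$technique" "$("$technique" "$sample")"
done
rm -f "$sample"
```

All three must print the same count; only the way grep receives its input changes. In Bash, each function can then be timed on its own, for example with time count_piped "$FILE".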

I tested each tool + input combination for each of the three file sizes. For each tool + input + size combination, the test was run three times and the lowest value taken. This avoids disk I/O variability by making sure that as much of the file as possible is already in the disk cache. Why did I not drop caches instead? Because disk I/O is way slower, so the results would reflect disk read time instead of pure cat time. Important: only the first two files fit fully in RAM, but not the third one! This makes the third file a great way to see if disk I/O performs differently with cat than with other tools.
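The "run three times, keep the lowest" rule can be sketched as a small helper. Note that this is not how the measurements in this post were taken: they used Bash’s time built-in, while the sketch below uses date +%s%N (an assumption that GNU date is available) only to stay self-contained. The helper name best_of_three and the demo file are hypothetical.

```sh
# best_of_three CMD [ARGS...]: run CMD three times and print the lowest
# wall-clock time in milliseconds. Hypothetical helper; assumes GNU date,
# whose %N specifier yields nanoseconds.
best_of_three() {
    best=
    for run in 1 2 3; do
        start=$(date +%s%N)
        "$@" > /dev/zero           # discard output without using /dev/null
        end=$(date +%s%N)
        elapsed=$(( (end - start) / 1000000 ))
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best=$elapsed
        fi
    done
    echo "$best"
}

# Demo: time one pipeline on a small generated file.
sample=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
    echo "row $i with some text"
    i=$((i + 1))
done > "$sample"

piped_count() { cat "$1" | wc -l; }
echo "best of three: $(best_of_three piped_count "$sample") ms"
rm -f "$sample"
```

Because only the lowest value is kept, the first run effectively serves as the cache warm-up for the other two.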

After each test that involved cat, I repeated the test using /usr/bin/time -v (not Bash’s time built-in) just on cat to get its maximum resident set size.
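Measuring only cat inside a pipeline could look like the sketch below. The helper name cat_max_rss is hypothetical, and so is the use of the -o option to send the report to a file; the post only states that /usr/bin/time -v was used. The sketch assumes a time binary that understands -v and -o, as GNU time does.

```sh
# cat_max_rss FILE: print cat's peak resident set size in kB while it feeds
# a pipeline. Hypothetical helper; assumes /usr/bin/time supports -v and -o.
cat_max_rss() {
    if ! /usr/bin/time -v true > /dev/null 2>&1; then
        echo "cat_max_rss: /usr/bin/time -v is not available" >&2
        return 1
    fi
    report=$(mktemp)
    # Only cat runs under time; wc -l just keeps the pipe draining.
    # -o writes the report to a file so it cannot mix with other output.
    /usr/bin/time -v -o "$report" cat "$1" | wc -l > /dev/null
    awk -F': ' '/Maximum resident set size/ { print $2 }' "$report"
    rm -f "$report"
}

sample=$(mktemp)
printf 'some row\nanother row\n' > "$sample"
cat_max_rss "$sample" || true      # prints a number of kB when time -v works
rm -f "$sample"
```

The consumer here is wc -l rather than grep, so that redirecting its output to /dev/null does not cut the run short.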

This is a total of 63 test results taken out of 189 data points. With all these tests we get a real performance comparison, but also what I cared about the most: whether cat’s RAM consumption was actually an issue.

The results

Measurement                           Units    Small     Medium       Large
File size                             MiB      166.2    3,656.9    34,740.9

grep x $FILE | wc -l                  s          0.2        6.1       143.5
grep x < $FILE | wc -l                s          0.3        8.2       153.5
cat $FILE | grep x | wc -l            s          0.4        9.7       137.9
Maximum resident set size for cat     kB       1,624      1,672       1,784

grep x $FILE > /dev/null              s          0.0        0.0         0.0
grep x $FILE > /dev/zero              s          0.3        6.2       144.6
grep x < $FILE > /dev/zero            s          0.3        6.1       143.8
cat $FILE | grep x > /dev/zero        s          0.3        7.3       134.0
Maximum resident set size for cat     kB       1,636      1,628       1,628

grep -c x $FILE                       s          0.1        2.3       140.1
grep -c x < $FILE                     s          0.1        2.3       140.5
cat $FILE | grep -c x                 s          0.1        3.1       135.7
Maximum resident set size for cat     kB       1,788      1,784       1,672

wc -l $FILE                           s          0.1        1.9       133.5
wc -l < $FILE                         s          0.1        1.5       133.6
cat $FILE | wc -l                     s          0.1        3.0       133.4
Maximum resident set size for cat     kB       1,784      1,660       1,624

gawk '{l++}END{print l}' $FILE        s          0.4        8.0       139.1
gawk '{l++}END{print l}' < $FILE      s          0.4        7.7       138.6
cat $FILE | gawk '{l++}END{print l}'  s          0.6       13.1       148.0
Maximum resident set size for cat     kB       1,624      1,624       1,788

I am highlighting the best and worst results. If two results are within 1% of each other I highlighted both, as I considered it the same result. If the three results are within 1% of each other I did not highlight any of them (all three are the best and the worst at the same time).

The conclusions

  1. I was quite surprised about grep > /dev/null. I certainly didn’t know about that.
  2. The idea that cat will double RAM usage is a myth.
  3. Surprisingly, cat | grep performed better than pure grep in all of the tests for the large file! I am bewildered by this result!
  4. cat | wc -l did not make any difference for the large file but it did for the medium file. This is another head-scratcher.
  5. cat | gawk was consistent with the expectation.
  6. In some cases, even input redirection worked better than directly specifying the file!

With all this, my general conclusions are:

  1. The real usage waste for simple cases is just an extra process and a pipe.
  2. For large input or output files, disk I/O will be your bottleneck. Cutting down cat will do next to nothing to improve performance.
  3. For medium input files, redirection sometimes worked up to 2x better than piping through cat (wc -l: 1.5 s versus 3.0 s)! Always test if performance is critical to you. It all depends on how well the receiving tool is optimized for each input scenario.
  4. The extra process and pipe overhead will become apparent when repeating the process hundreds of times for small files. Keep it in mind!
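The many-small-files case from the last point can be reproduced with a sketch like this one. Both helpers and the generated files are hypothetical; they differ only in how grep gets its input.

```sh
# Count matching lines across every file in a directory, one grep per file.
# Both helpers are hypothetical and differ only in how grep gets its input.
total_with_cat() {
    total=0
    for f in "$1"/*; do
        n=$(cat "$f" | grep -c x || true)   # cat + grep + a pipe, per file
        total=$((total + n))
    done
    echo "$total"
}

total_without_cat() {
    total=0
    for f in "$1"/*; do
        n=$(grep -c x "$f" || true)         # grep alone, per file
        total=$((total + n))
    done
    echo "$total"
}

# Demo: a few hundred tiny files.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 300 ]; do
    printf 'x marks the spot\nnothing here\n' > "$dir/file$i.txt"
    i=$((i + 1))
done
total_with_cat "$dir"
total_without_cat "$dir"
rm -rf "$dir"
```

The two totals must be identical. Timing each call (for instance with Bash’s time) is where the per-file cost of the extra process and pipe should show up.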

There is no need to be religious or a bully about it. Really: don’t worry at all unless there is a case for it. By the way, that last sentence is the definition of a micro-optimization.

Fixing UUOC will be, in most cases, just micro-optimization. Always keep it in mind, though, and if you find a case for it, fix it! Some extreme cases may be severely impacted by UUOC. But otherwise, if the case is simple and you feel your code is more legible by uselessly using cat, use it and don’t feel bad about it at all.

This does not excuse you from learning about this resource waste and its implications. It is your responsibility to know what you are doing and to always keep UUOC in mind while coding to avoid performance impacts!

The cliffhanger ending

These are the results for my scenario. Who knows about other operating systems, other kernel versions, other computer configurations, other row lengths and text content… Can you replicate my results?

Also, this is only true when the file can be processed sequentially; that is, when the whole file does not need to be read before processing starts. There are tools that might need to load the whole file into RAM, like sort and sponge. I would expect that if sort works on a file backed by disk (not fed through standard input, either with a pipe or with redirection) it might not need as much RAM, but the algorithm would need to be a bit more complicated. If it’s fed from standard input, though, there is no way to save RAM. Or is there?

So I made a quick test for the small and medium files only:

Measurement                           Units     Small      Medium
File size                             MiB       166.2     3,656.9
Time for sort $FILE | wc -l           s          10.8       578.3
Time for sort < $FILE | wc -l         s          10.6       617.9
Time for cat $FILE | sort | wc -l     s          21.5     1,015.9
Max RSS for sort $FILE | wc -l        kB      235,896   4,939,544
Max RSS for sort < $FILE | wc -l      kB      235,700   5,139,724
Max RSS for cat $FILE | sort | wc -l  kB       14,400      15,172

Can you explain the difference in the results? Can you estimate what would happen with files larger than the available RAM?

This is a discussion for another day.

Q&A: PostgreSQL: LEFT JOIN by record type

This post corresponds to an answer I gave via chat in the PostgreSQL en Español help group on Telegram.

A user has a table with two columns: tipo_objeto and id_objeto. The objects are stored in other tables, but the table in which each object is stored depends on the object type. In some contexts this is called a polymorphic relationship.

For example, there could be a resources table that stores a list of resources per department. The resources can be human or material. We would have the tables humanos, materiales and recursos. The interesting part is that the recursos table would have the following fields: fk_depto_id, tipo_recurso, recurso_id. This last field is an identifier found in either the humanos or the materiales table, depending on what tipo_recurso indicates.

How do you do a LEFT JOIN between the table that stores the relationship and the tables with the objects?

Continue reading

Poda: finds similar directories

I invite you to try Poda. I made it to find similar directories among multiple storage locations (or hampers, as I call them in the program). The typical use case is to find out whether I have duplicate or similar directories on my laptop, the other laptop, the NAS, flash drives, etc. I may have made a backup of a flash drive on my laptop and not be sure what has changed where. Or even within the same storage!

[Update: There was a mistake in the use of dirdupes.py below. A sort filter was missing. It is now corrected.]

The way it works is that you first get the index of each hamper. For example, you may have four hampers:

Continue reading

I updated IPv6 Toolkit in Debian

I have updated IPv6 Toolkit in Debian. IPv6 Toolkit is a diagnostic and assessment tool for the IPv6 protocols, written by Fernando Gont.

Mainly, this update prevents IPv6 Toolkit from being removed from Bullseye, since the build had broken under the new GCC 10.

This is the full changelog:

Continue reading