Anecdote of the window that wouldn't show up

Let me tell you about a silly thing that happened to me yesterday.

I was going to go work at the office (I normally work from home), but first I wanted to pop into Slack real quick to check something about that very office. I open the laptop, it wakes up, I type my password, I click the little icon in the notification area (next to the clock in Windows) so the Slack window shows up... and no window appears.

How strange… I try again; nothing.

I right-click the little Slack icon, Quit, the icon disappears, I launch Slack again, the icon comes back but it does not show the window. I try again; same thing.

I kill the process in Task Manager, launch Slack again, same thing: it shows the little icon but the window does not appear.

I try again, but this time killing the whole process tree, making sure nothing Slack-related is left in the process list, relaunch, and same thing: little icon in the notification area but no window.

I said to myself, "well, something weird is going on with the computer." I shut the computer down, turn it on, load Slack and same thing: the little icon shows up in the notification area but clicking it brings up no Slack window…

This is starting to get a little exasperating.

I remember that the laptop might have that Fast Startup thing where, after logging out, it just hibernates to save shutdown and boot time, so it is not really a restart from scratch, so I'd better hit Restart instead. Same thing: I load Slack, there is the little icon, there is no window.

I notice that the Slack icon's menu has a "diagnostics" menu that lets me disable hardware acceleration; I do it, restart, and nothing. I can request logs, clear the cache (it is an Electron app)… nothing.

And so I get ready to do some research on the Internet and, in my head, I switch to "advanced" mode. I settle in at the keyboard and reach over to turn on the second monitor to work more comfortably… the second monitor was off! I turn it on… there was Slack! XD hahaha

Why do I procrastinate?

This is an answer I gave to a member of Programadores y Estudiantes on Discord. I cannot claim that what I told them is effective, but it has been useful for me.

Their question:

When I want to learn programming I always get distracted or tell myself "I'll get back to it in a bit." Does anyone know how to solve that?

Other members rightly mentioned other important aspects, such as:

  • Dedicating time to something if you consider it important in your life.
  • Impostor syndrome and its impact on motivation.
  • Playing music while you learn. Piano instrumentals work for me.
  • Discipline.

I went down the path of self-analysis: asking yourself why, identifying emotions, putting them into words, and iterating deeper.

Can you identify the reason why you get distracted? The question sounds simple, but it is not. Another angle: can you identify the reason why your brain chooses to get distracted instead of staying focused on studying? Again, a question that sounds simple but is not. Do some introspection, some self-analysis. At the moment it happens, what emotions do you feel? What motivations do you perceive?

If you reach some conclusion, try to ask yourself deeper questions, like: why? For example, if you identify that "it makes you feel lazy," ask yourself why. Put the emotion into words. Maybe it is because you see it as a very complex task. Why do you see it as a complex task? Maybe because you expect to learn in 1 week and, after watching a video, you feel you did not make much progress, or that it will take you 1 year, and that demotivates you. The point is to find your emotions. From there, techniques can emerge, such as splitting the learning task into parts, being aware that this is a marathon and not a sprint [note 1], or maybe picking a time of day when you are most awake and not when you are already tired… etc.

[note 1] Meaning that your goal for this week is no longer "learn to program" but just "learn global variables," and that way you keep reaching smaller goals.

Remember something: your brain is elastic. If you are a procrastinator it is because [in some way] your brain has learned that procrastinating is good and useful. Probably the dopamine and the short-term gratification that entertainment and social media give you (or whatever you distract yourself with). Rowing against that is very hard, but if you do it consistently for 2 weeks it becomes easier.

It is like finally going to the gym. If you go for 1 month, after 1 day of not going you miss the gym. But after 1 month of not going, damn! going back is torture; yet once you finally go back, you miss it again whenever you skip it.

What works for you? What recommendations would you give this fellow learner?

About useless use of cat (UUOC)

The context

It is well known that cat can be uselessly used and that other forms are preferable. For example:

cat file.txt | grep word

which can be better written like this:

grep word file.txt

This is not without its myths, though. Iván Zenteno (thanks, Iván!) argued in a recent Linuxeros Zapopan conversation that the reason to avoid it was memory consumption. Quoting:

(Note: all quotations are rough English translations from their original messages in Spanish)

cat | grep NOOOO. Simply use grep regex filename. Don’t overload memory with cat. If your file is 100 MB long you are doing cat to RAM and by passing it via standard input to grep there’s another 100 MB. Then you are using 200 MB when you could be using 100 MB only.

I challenged that assertion:

[It] should be verified. It seems that a cat implementation that reads all the content before writing it out would be a waste of resources, so I don't think it works like that, and therefore the argument would be false. I honestly think it is not the case. I will try to do some tests.

By challenging it, it may have seemed like I was challenging the whole Useless Use of Cat (UUOC) case. Just to make it clear, I was not: I was just challenging the RAM aspect as an argument against UUOC.

I should have gone even further: not even grep will use that amount of memory (for a simple grep) because it's just not needed.
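If you want to check that yourself, a quick way is to look at grep's own maximum resident set size with GNU time. This is just a sketch of the idea, assuming GNU time is available as /usr/bin/time and big.txt stands in for any large file:

# Hypothetical check: GNU time's -v report includes "Maximum resident set size",
# which for a simple grep over a huge file should stay in the low megabytes.
/usr/bin/time -v grep x big.txt | wc -l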

Alex Callejas added a personal experience (thanks, Alex!):

If the search is for a string, wouldn't it be better to use grep? In [non-current government entity] we had a script that first did cat on the whole file to look for CURP* strings. The problem was when the file was more than 1 TB in size. It took ages to run to find newly added CURPs. We implemented the search using grep, we did monthly historicals, and we lowered resource consumption.

* For context, CURP is a unique identifier string for each Mexican. It follows a specific format.

It made sense. It seems like everyone agrees that cat can frequently be uselessly used. Me included.

Later, in a clarification I said:

I think pure grep is faster. I think it is because of its buffering and avoiding additional pipe processing, but the RAM consumption is a myth.

I offered to do some testing so we all could learn together. After all, this is one of those simple beliefs that can quickly be demystified and that puts our personal understanding of the underlying technology to the test. Little did I know that it would throw back some interesting results!

The tests

The tests were run on a computer with 8 GiB of RAM (more than 4 GiB free for use by applications) and a first-generation i3 2-core CPU with Hyper-Threading enabled (a total of 4 virtual cores). The disk drive is an SSD interfaced through a SATA II port. It runs Debian Sid with Linux kernel version 5.10.

I created three text files with sizes of 166.2 MiB, 3.57 GiB and 33.92 GiB. Why those sizes? The first one is an original file with long rows that include path names and sizes. I cannot publish the file as it contains personal data. The other two were created by concatenating the first one multiple times, to sizes that seemed reasonable to produce sufficiently distinct cases.
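For what it's worth, that kind of concatenation can be done with a couple of shell loops. A minimal sketch (file names are made up, and the repetition counts are just the ones that roughly match the sizes above):

# Build the medium and large files by repeatedly concatenating the small one.
for i in $(seq 22);  do cat small.txt; done > medium.txt   # 22 x 166.2 MiB = ~3.57 GiB
for i in $(seq 209); do cat small.txt; done > large.txt    # 209 x 166.2 MiB = ~33.9 GiB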

I tested for performance using Bash’s time built-in for the following tools:

  1. grep x $FILE | wc -l
    I filtered the results through wc -l to avoid having the terminal output included in the measurement, which would have contaminated the result. wc -l seemed like a good choice for a process that is way simpler and faster than grep and, thus, would contaminate the results as little as possible. Later I got the suggestion to use grep -c to avoid the extra pipe, and it became test #3.
  2. grep x $FILE > /dev/zero
    Why /dev/zero? Because when using /dev/null, grep skipped all processing. My take is that grep detects that output redirected to /dev/null would just be a waste of resources and decides to avoid doing any work.
  3. grep -c x $FILE
    It is the same as test #1 but avoids the extra pipe. This is the purest possible form of the test. Thanks, Iván Zenteno!
  4. wc -l $FILE
    What if grep had some buffering that other utilities did not? This is so we try to replicate the results in tools other than grep.
  5. gawk '{l++}END{print l}' $FILE
    Same rationale as above: what if grep had some buffering that other utilities did not? This gawk script counts lines, just like wc -l does.

For each tool I tested three input techniques:

  1. Direct file specification
  2. Input redirection
  3. Standard input from cat.

I tested each tool + input combination for each of the three file sizes. For each tool + input + size combination, the test was run three times and the lowest value taken. This is to reduce disk I/O variability, by making sure that as much of the file as possible is already in the disk cache. Why not drop the caches instead? Because disk I/O is way slower, so the results would reflect disk read time instead of pure cat time. Important: only the first two files fully fit in RAM, but the third one does not! This makes the third file a great way to see whether disk I/O performs differently with cat than with other tools.

After each test that involved cat, I repeated the test using /usr/bin/time -v (not Bash's time built-in) just on cat to get its maximum resident set size.
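Putting it together, each tool + input + size combination was measured with something along these lines (a sketch of the procedure, not the exact script I ran; medium.txt is a placeholder for one of the three test files):

FILE=medium.txt                       # placeholder for one of the three test files
# Bash's time built-in, run three times; the lowest wall-clock value is kept.
for run in 1 2 3; do
    time cat "$FILE" | grep x | wc -l
done
# GNU time (-v) applied only to cat, to read its "Maximum resident set size".
/usr/bin/time -v cat "$FILE" | grep x | wc -l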

This is a total of 63 reported results taken from 189 data points. With all these tests we get a real performance comparison, but also what I cared about the most: whether cat's RAM consumption is actually an issue.

The results

Measurement                            Units    Small    Medium      Large
File size                              MiB      166.2   3,656.9   34,740.9
grep x $FILE | wc -l                   s          0.2       6.1      143.5
grep x < $FILE | wc -l                 s          0.3       8.2      153.5
cat $FILE | grep x | wc -l             s          0.4       9.7      137.9
Maximum resident set size for cat      kB       1,624     1,672      1,784
grep x $FILE > /dev/null               s          0.0       0.0        0.0
grep x $FILE > /dev/zero               s          0.3       6.2      144.6
grep x < $FILE > /dev/zero             s          0.3       6.1      143.8
cat $FILE | grep x > /dev/zero         s          0.3       7.3      134.0
Maximum resident set size for cat      kB       1,636     1,628      1,628
grep -c x $FILE                        s          0.1       2.3      140.1
grep -c x < $FILE                      s          0.1       2.3      140.5
cat $FILE | grep -c x                  s          0.1       3.1      135.7
Maximum resident set size for cat      kB       1,788     1,784      1,672
wc -l $FILE                            s          0.1       1.9      133.5
wc -l < $FILE                          s          0.1       1.5      133.6
cat $FILE | wc -l                      s          0.1       3.0      133.4
Maximum resident set size for cat      kB       1,784     1,660      1,624
gawk '{l++}END{print l}' $FILE         s          0.4       8.0      139.1
gawk '{l++}END{print l}' < $FILE       s          0.4       7.7      138.6
cat $FILE | gawk '{l++}END{print l}'   s          0.6      13.1      148.0
Maximum resident set size for cat      kB       1,624     1,624      1,788

I highlighted the best and worst results. If two results are within 1% of each other I highlighted both, as I considered them the same result. If all three results are within 1% of each other I did not highlight any of them (all three are the best and the worst at the same time).

The conclusions

  1. I was quite surprised about grep > /dev/null. I certainly didn’t know about that.
  2. The idea that cat will double RAM usage is a myth.
  3. Surprisingly, cat | grep performed better than pure grep in all of the tests for the large file! I am bewildered by this result!
  4. cat | wc -l did not make any difference for the large file but it did for the medium file. This is another head-scratcher.
  5. cat | gawk was consistent with the expectation.
  6. In some cases, even input redirection worked better than directly specifying the file!

With all this, my general conclusions are:

  1. The real usage waste for simple cases is just an extra process and a pipe.
  2. For large input or output files, disk I/O will be your bottleneck. Cutting down cat will do next to nothing to improve performance.
  3. For medium input files, sometimes redirection even worked up to 2x better! Always test if performance is critical to you. It all depends on how well the tool that receives the input is optimized for the different input scenarios.
  4. The extra process and pipe overhead will become apparent when repeating the process hundreds of times over small files. Keep it in mind! (See the sketch after this list.)
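To make point 4 concrete, here is a hypothetical comparison over many small files (logs/*.txt and the search string are made up); only the gap between the two timings matters, since it exposes the extra fork/exec and pipe setup per file:

# Same search over many small files, with and without the useless cat.
time for f in logs/*.txt; do cat "$f" | grep -c error; done > /dev/zero
time for f in logs/*.txt; do grep -c error "$f"; done > /dev/zero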

There is no need to be religious about it or to bully anyone over it. Really: don't worry at all unless there is a case for it. By the way, that last sentence is pretty much the definition of a micro-optimization.

Fixing UUOC will be, in most cases, just a micro-optimization. Always keep it in mind, though, and if you find a case for it, fix it! Some extreme cases may be severely impacted by UUOC. But otherwise, if the case is simple and you feel your code is more legible by uselessly using cat, use it and don't feel bad about it at all.

This does not excuse you from learning about this resource waste and its implications. It is your responsibility to know what you are doing and to always keep UUOC in mind while coding to avoid performance impacts!

The cliffhanger ending

These are the results for my scenario. Who knows about other operating systems, other kernel versions, other computer configurations, other row lengths and text content… Can you replicate my results?

Also, this is only true when the file can be processed sequentially; that is, when the whole file does not need to be read before processing starts. There are tools that might need to load the whole file into RAM, like sort and sponge. I would expect that if sort works on a file backed by disk (not fed through standard input, either with a pipe or a redirection) it might not need as much RAM, but the algorithm would need to be a bit more complicated. If it is fed from standard input, though, there is no way to save RAM. Or is there?

So I made a quick test for the small and medium files only:

Measurement                            Units    Small      Medium
File size                              MiB      166.2     3,656.9
Time for sort $FILE | wc -l            s         10.8       578.3
Time for sort < $FILE | wc -l          s         10.6       617.9
Time for cat $FILE | sort | wc -l      s         21.5     1,015.9
Max RSS for sort $FILE | wc -l         kB     235,896   4,939,544
Max RSS for sort < $FILE | wc -l       kB     235,700   5,139,724
Max RSS for cat $FILE | sort | wc -l   kB      14,400      15,172

Can you explain the difference in the results? Can you estimate what would happen with files larger than the available RAM?

This is a discussion for another day.

Q&A: PostgreSQL: LEFT JOIN by record type

This post corresponds to an answer I gave via chat in the PostgreSQL en Español help group on Telegram.

A user has a table with two columns: tipo_objeto and id_objeto. The objects are stored in other tables, but the table in which each object is stored depends on the object's type. In some contexts this is called polymorphic relations.

For example, you could have a recursos table that stores a list of resources per department. Resources can be human or material. We would have the tables humanos, materiales and recursos. The interesting part is that the recursos table would have the following columns: fk_depto_id, tipo_recurso, recurso_id. That last column is an identifier found in either the humanos or the materiales table, depending on what tipo_recurso indicates.
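To make the structure concrete, this is a hypothetical minimal schema following the example above (illustrative names and types only, not the user's actual DDL):

-- Hypothetical schema for the polymorphic relation described above.
CREATE TABLE humanos    (humano_id   serial PRIMARY KEY, nombre      text);
CREATE TABLE materiales (material_id serial PRIMARY KEY, descripcion text);
CREATE TABLE recursos (
    fk_depto_id  integer NOT NULL,   -- department that owns the resource
    tipo_recurso text    NOT NULL,   -- e.g. 'humano' or 'material'
    recurso_id   integer NOT NULL    -- id in humanos or materiales, per tipo_recurso
);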

How do you do a LEFT JOIN between the table that stores the relation and the tables with the objects?

Keep reading