How to remove hard spaces from a file using awk
Some tasks seem easy when we perform them manually, but new things come to light once we want to automate them. What can go wrong if we remove hard spaces from a text file using a script?
If it’s a one-time operation, we open an editor, use its find-and-replace functionality, and that’s enough. Otherwise, we need a more automation-friendly approach.
The literal approach
I’m a fan of simple tools, so I use awk for this. Let’s say that we want to clean up a CSV file. We likely end up with something like this:
awk -F";" '{ gsub(/ /, ""); }1' "filename.csv"
If it works, we can end our solution here, but there is still one issue. How can we be sure that the provided pattern is an actual hard space?
I ran into trouble because I copied the space character right from the terminal window, and after minutes of failed attempts, it turned out that the terminal had printed soft spaces instead of hard ones. Be aware that some software may also change whitespace on its own.
Even if we successfully copy and paste the hard space into our script, we need to leave a comment so that our future self notices that it is a hard space, not a regular one.
#!/bin/bash
#
# Hey, an important fact, this is a hard space!
#                 \ /
awk -F";" '{ gsub(/ /, ""); }1' "filename.csv"
Hex to the rescue
The gsub function expects two parameters: the pattern to match and the replacement string. The pattern doesn’t have to contain the literal character, though; we can describe it as a byte sequence instead.
First, let’s check which bytes make up the hard space. Copy the hard space from the file, using an app that preserves whitespace characters, paste it into the terminal, and run hexdump.
echo " " | hexdump -C
The output should be similar to this one:
00000000 c2 a0 0a |...|
00000003
We got three bytes. The last one is 0a, which corresponds to line feed and indicates the end of the line. So, the hard space consists of two bytes: c2 and a0.
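If copying the character reliably is the problem, the same check can be made without pasting anything. A minimal sketch, assuming a POSIX shell: printf emits the two bytes from octal escapes (302 and 240 are c2 and a0 written in octal), and od, swapped in here for hexdump because it is part of POSIX, prints them back as hex. The variable names are illustrative only.

```shell
# The UTF-8 encoding of the hard space (U+00A0) is the byte pair c2 a0,
# which is 302 240 in octal. printf understands octal escapes.
nbsp=$(printf '\302\240')

# Print the bytes as hex: -An drops the offset column, -tx1 shows one
# byte per field. tr removes the padding so only the hex digits remain.
bytes=$(printf '%s' "$nbsp" | od -An -tx1 | tr -d '[:space:]')

echo "$bytes"
```

If the output differs from c2a0, the character in hand is not a UTF-8 encoded hard space.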
Let’s use them within our script. To tell the gsub function that we provide hexadecimal values, prepend the bytes with \x.
awk -F";" '{ gsub(/\xc2\xa0/, ""); }1' "filename.csv"
Great! We’ve removed hard spaces from the file. The same command works if you need to replace them with soft ones instead: pass " " as the replacement string.
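To try the replacement end to end, here is a small sketch under a few assumptions: the sample data is made up, the temporary file comes from mktemp (widely available, though not part of POSIX), and the pattern uses the octal escapes \302\240 rather than \xc2\xa0. Both spell the same two bytes; octal escapes are the form POSIX awk defines, so this variant may be the safer choice if an awk implementation does not recognise \x.

```shell
# Create a throwaway CSV file with hard spaces inside two fields.
sample=$(mktemp)
printf 'name;city\nJane\302\240Doe;New\302\240York\n' > "$sample"

# Replace every hard space (bytes c2 a0, octal 302 240) with a regular
# space. The trailing 1 is an always-true pattern that prints each line.
cleaned=$(awk '{ gsub(/\302\240/, " ") }1' "$sample")

printf '%s\n' "$cleaned"

# Remove the temporary file again.
rm -f "$sample"
```

The printed lines should read name;city and Jane Doe;New York, now with ordinary spaces. Using "" as the replacement removes the hard spaces instead.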
Resources
- https://stackoverflow.com/a/27056408 – use of hexdump tool