Tuesday, April 24, 2012

ó vs ó

(Wow, I can actually see the difference, when typed in this window!) ó vs ó ó vs ó ó vs ó
Maybe not... might just be antialiasing...
Anyhow.

Apparently MacOS's bash handles these differently in different cases.

Comparing the following two strings returns FALSE:

'Debaixo Dos Caracóis'
'Debaixo Dos Caracóis'

whereas comparing these two returns TRUE:

'Debaixo Dos Caracóis'
'Debaixo Dos Caracóis'

Duh!

The weird thing is that they come from the same source. The only difference is that one was entered directly with the filename (which contains this string) typed using tab-completion, and the other was entered via for-loop of all files in the directory. Seems like a bug to me, but what do I know?

Some investigating led to how to remove accents from characters: Source
It didn't work for me directly; I had to do some fiddling to get any conversion at all for the second string (otherwise it would return "iconv couldn't convert").

Using the command echo "$string" | iconv -t UTF-7 I get the following two values:

'Debaixo Dos Carac+APM-is'
'Debaixo Dos Caraco+AwE-is'

So, apparently the two accented o's are coming up different, even though they appear the same! Again, oddly, if I run my script within a for loop iterating over all files in the directory, I get "o+AwE-" and if I use tab-completion on the command line I get "+APM-" and the strings match.
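
For what it's worth, the UTF-7 output can be decoded by hand: "+APM-" is the Base64 form of the single code point U+00F3 (o with acute accent), while "o+AwE-" is a plain o followed by U+0301 (combining acute accent). Here is a minimal sketch that builds both forms from their UTF-8 bytes and compares them. The variable names are made up for this example, and octal escapes are used so it also runs in shells whose printf lacks \x.

```shell
# Precomposed form: U+00F3, UTF-8 bytes C3 B3 (octal 303 263)
composed=$(printf '\303\263')
# Decomposed form: "o" + U+0301, UTF-8 bytes 6F CC 81 (octal 157 314 201)
decomposed=$(printf 'o\314\201')

# Count the bytes in each string (tr strips the padding BSD wc adds)
composed_len=$(printf '%s' "$composed" | wc -c | tr -d ' ')
decomposed_len=$(printf '%s' "$decomposed" | wc -c | tr -d ' ')
echo "$composed is $composed_len bytes, $decomposed is $decomposed_len bytes"

# A plain string comparison is byte-wise, so the two never match
if [ "$composed" = "$decomposed" ]; then
    result=same
else
    result=different
fi
echo "$result"
```

Both strings render as the same glyph, but one is two bytes and the other three, which is why the test comes back false.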

Further investigation reveals some useful info: printf can be run from the command line! Source


$ first=`echo "+APM-" | iconv -f UTF-7 -t UTF-8`
$ second=`echo "o+AwE-" | iconv -f UTF-7 -t UTF-8`
$ echo $first
ó
$ echo $second
ó
$ printf '%d\n' "'$first"
-61
$ printf '%d\n' "'$second"
111
$ if [ "$first" == "$second" ] ; then echo YEP ; fi
$

So there you have it... their values aren't even the same. Are they ASCII? I have no idea. This locale and Unicode and other codepage stuff completely eludes me. Here's the page that can tell you all about it. It's WAY over my head: Locale. And it doesn't even seem there's any such thing as a universal Unicode chart (like there is for 7-bit ASCII).
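
The two numbers printf reported appear to be just the first byte of each string: 111 is a plain ASCII o, and -61 is the byte 0xC3 read as a signed value (195 - 256). Dumping every byte with od makes the difference easier to see. A small sketch, where hex_bytes is a helper name invented for this example:

```shell
# Print the bytes of a string as space-separated hex.
hex_bytes() {
    # od pads its columns; unquoted word-splitting collapses the padding.
    # -v stops od from abbreviating repeated lines with "*".
    set -- $(printf '%s' "$1" | od -An -v -tx1)
    echo "$*"
}

first=$(printf '\303\263')     # what "+APM-" decodes to
second=$(printf 'o\314\201')   # what "o+AwE-" decodes to

hex_bytes "$first"     # c3 b3
hex_bytes "$second"    # 6f cc 81
```

Neither string is pure ASCII: both are valid UTF-8, they just spell the same letter in two different ways.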

All I know is that my bash is set with a locale of en_US.UTF-8 and apparently that doesn't match the filenames on my hard drive (?!)

That "o+AwE-" is strangely reminiscent of how accents used to be (still are?) entered on Macs way back in the OS7.5 days... first you typed the letter, then some combination of CTRL and whatnot. I was fascinated by it once and wrote a whole chart. The "+APM-" seems more like PC-style entering a value directly using ALT-###, but I could be imagining things.

This would make my whole project far more difficult, except I'd already been working on fuzzy string-matching for differences like McDonalds vs MacDonalds, so thankfully I can use that instead of trying to figure out a table of special cases like this. Otherwise, frankly, I'd be at a complete loss as to how to proceed without dropping the character entirely. Even the iconv locales don't seem to convert them to anything directly comparable (only visually). Actually, it could still become a problem for filenames with too many accents... UGH
