Convert to Unicode code points

Get help using and writing Nisus Writer Pro macros.
Groucho
Posts: 497
Joined: 2007-03-03 09:55:06
Location: Europe

Convert to Unicode code points

Post by Groucho »

Some time ago I wrote a macro to tag a document so it can be converted into Palm Database (pdb, eReader) format. At a certain point the macro was to convert special characters (accented letters, diacritics, Ascii > 127) into Unicode code points. I made use of this conditional statement:

Code:

if Find '[^\u0001-\u007e]', 'aE-i'
	Edit:Convert:To Unicode Code Points
end
After a while I noticed that two consecutive special characters were always separated by a space. For example,

–” (en-dash and right curly quote)

becomes

U+2013 U+201D (their corresponding Unicode code points, but with a space in between)

So I re-wrote the piece of macro with a while loop, thus:

Code:

While Find '[^\u0001-\u007e]', 'E-i'
	Edit:Convert:To Unicode Code Points
end
…and all went well. The macro is considerably slower, but it works.

I wonder, though, if there’s a way to convert a character into Unicode code points without resorting to the menu, which seems to trigger the space issue. I mean something through the Nisus macro language or Perl. I couldn’t find it anywhere.

Greetings, Henry.
martin
Official Nisus Person
Posts: 5227
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: Convert to Unicode code points

Post by martin »

There's definitely a way for you to do the conversion more directly, without using the convert menu, but it would be more complicated. Instead, why not strip the spaces out after the replacement:

Code:

If Find '[^\u0001-\u007e]', 'aE-i'
	Edit:Convert:To Unicode Code Points
	Replace All '(?<=U\+\h\h\h\h) (?=U\+\h\h\h\h)', '', 'aE'
end
Does that work? Or do you have places in your document where code points appear where you want spaces maintained?
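To make the lookbehind/lookahead idea easier to try outside Nisus, here is a minimal Python sketch of the same replacement. It assumes code points are written as "U+" followed by exactly four hex digits, as in the macro's pattern; the function name `strip_code_point_gaps` is made up for illustration and is not part of any Nisus API.

```python
import re

# Assumption: a code point is "U+" plus exactly four hex digits,
# mirroring the \h\h\h\h in the macro's Replace All pattern.
GAP = re.compile(r"(?<=U\+[0-9A-Fa-f]{4}) (?=U\+[0-9A-Fa-f]{4})")

def strip_code_point_gaps(text: str) -> str:
    """Remove a single space that sits directly between two code points."""
    return GAP.sub("", text)

if __name__ == "__main__":
    print(strip_code_point_gaps("a U+2013 U+201D b"))
```

Because the pattern only looks at the converted text, it cannot tell an inserted space from one that was already there: a space that originally separated two special characters is removed as well.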
Groucho
Posts: 497
Joined: 2007-03-03 09:55:06
Location: Europe

Re: Convert to Unicode code points

Post by Groucho »

Quite so. There are spaces that I want to keep.
I think the while loop is the only choice at this point. A bit slow but OK.
Now, if I may ask, what is the other way?

Thanks, Henry.
Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: Convert to Unicode code points

Post by Kino »

Groucho wrote: At a certain point the macro was to convert special characters (accented letters, diacritics, Ascii > 127) into Unicode code points. I made use of this conditional statement:

Code:

if Find '[^\u0001-\u007e]', 'aE-i'
	Edit:Convert:To Unicode Code Points
end
Your find expression looks odd. Character codes are zero-based, not one-based. And you’d better avoid an expression such as “Ascii > 127”, which is self-contradictory: ASCII only covers codes 0 through 127.

Groucho wrote: what is the other way?

Code:

$prefix = Cast to String '&#x'
$suffix = Cast to String ';'

$doc = Document.active
$sels = $doc.text.findAll '[^\x00-\x7F]', 'E-i'
foreach $sel in reversed $sels
	$d = $sel.text.characterAtIndex $sel.location
	$h = $prefix
	$h &= Convert To Hex $d
	$h &= $suffix
	$sel.text.replaceInRange $sel.range, $h
end
If you prefer Perl . . .

Code:

$doc = Document.active
$sels = $doc.text.findAll '[^\x00-\x7F]+', 'E-i'
$str = Array.new
foreach $sel in $sels
	$str.appendValue $sel.substring
end
$str = $str.join "\x00"
$hex = ''
Set Exported Perl Variables 'str', 'hex'
begin Perl
	foreach (split //, $str) {  # Nobumi’s code with a small modification
		if ( $_ eq "\x00" ) { $hex .= $_ }
		else { $hex .= sprintf("U+%04X", ord) }
	};

end
$str = $hex.split "\x00"
foreach $sel in reversed $sels
	$sel.text.replaceInRange $sel.range, $str.pop
end
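For readers who want to test the conversion logic on its own, the following is a rough Python sketch of the same idea: find each run of non-ASCII characters and replace every character in it with its code point, with nothing inserted between neighbours. The function name `to_code_points` and its `prefix`/`suffix` parameters are assumptions for illustration; it does not touch a Nisus document and is not a translation of the macro commands above.

```python
import re

# Same character class as the macros above: anything outside 0x00-0x7F.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

def to_code_points(text: str, prefix: str = "U+", suffix: str = "") -> str:
    """Replace each non-ASCII character with prefix + 4-digit hex + suffix."""
    def convert(match: re.Match) -> str:
        # Convert character by character; nothing is added between them.
        return "".join(f"{prefix}{ord(ch):04X}{suffix}" for ch in match.group())
    return NON_ASCII_RUN.sub(convert, text)

if __name__ == "__main__":
    print(to_code_points("a\u2013\u201db"))                      # en dash + right curly quote
    print(to_code_points("caf\u00e9", prefix="&#x", suffix=";"))
```

Keeping the prefix and suffix as parameters means the output style can be changed without touching the conversion itself.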
Groucho
Posts: 497
Joined: 2007-03-03 09:55:06
Location: Europe

Re: Convert to Unicode code points

Post by Groucho »

Thanks, Kino. The Nisus macro version is OK. I just stripped the suffix, as I don’t need one, and changed the prefix to \U, as this is what pdb’s markup language requires (\U2019, for example, is the right curly quote ’).

As for the find expression, I followed Nisus Macro Reference, the table in Literals.

Greetings, Henry.
Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: Convert to Unicode code points

Post by Kino »

Groucho wrote: As for the find expression, I followed Nisus Macro Reference, the table in Literals.
That’s not what I meant. Is there any reason not to include 0x0000 and 0x007F, the first and the last ASCII characters, in the negated set?
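A quick way to see the difference between the two character classes is to compare them code by code. This is only a sketch using Python's `re` module, whose syntax may differ in detail from the Nisus find engine; the names `NARROW`, `FULL` and `differing_code_points` are made up for illustration.

```python
import re

NARROW = re.compile(r"[^\u0001-\u007e]")  # class used in the original macro
FULL = re.compile(r"[^\x00-\x7F]")        # class that excludes all of ASCII

def differing_code_points(limit: int = 0x100) -> list[int]:
    """Return the codes below `limit` that only one of the two classes matches."""
    return [
        cp for cp in range(limit)
        if bool(NARROW.match(chr(cp))) != bool(FULL.match(chr(cp)))
    ]

if __name__ == "__main__":
    print([f"0x{cp:04X}" for cp in differing_code_points()])
```

Under these assumptions, the only codes treated differently are 0x0000 and 0x007F, which the narrower class would pick up as "special" characters.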
Groucho
Posts: 497
Joined: 2007-03-03 09:55:06
Location: Europe

Re: Convert to Unicode code points

Post by Groucho »

Ah, I see now. Well, no, there is no reason to keep them out; I just didn’t notice. To be honest, I don’t know why I left them out. Maybe I followed a practical philosophy: 0x0000 and 0x007F never occur in my files, so why put them in? This is a macro for personal use, you know.

Henry.