Convert to Unicode code points 

Joined: 2007-03-03 09:55:06
Posts: 494
Location: Europe
Some time ago I wrote a macro to tag a document so it can be converted into Palm Database (pdb, eReader) format. At a certain point the macro was to convert special characters (accented letters, diacritics, Ascii > 127) into Unicode code points. I made use of this conditional statement:
Code:
if Find '[^\u0001-\u007e]', 'aE-i'
   Edit:Convert:To Unicode Code Points
end

After a while I noticed that two consecutive special characters were always separated by a space. For example,

–” (en-dash and right curly quote)

becomes

U+2013 U+201D (their corresponding Unicode code points, but with a space in between)

So I re-wrote the piece of macro with a while loop, thus:
Code:
While Find '[^\u0001-\u007e]', 'E-i'
   Edit:Convert:To Unicode Code Points
end

…and all went well. The macro is considerably slower, but it works.

I wonder, though, if there’s a way to convert a character into Unicode code points without resorting to the menu command, which seems to trigger the space issue. I mean something through the Nisus macro language or Perl. I couldn’t find it anywhere.

Greetings, Henry.


2010-03-20 07:23:00
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
There's definitely a way for you to do the conversion more directly, without using the convert menu, but it would be more complicated. Instead, why not strip the spaces out after the replacement:
Code:
If Find '[^\u0001-\u007e]', 'aE-i'
   Edit:Convert:To Unicode Code Points
   Replace All '(?<=U\+\h\h\h\h) (?=U\+\h\h\h\h)', '', 'aE'
end

Does that work? Or do you have places in your document where code points appear where you want spaces maintained?
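For readers who want to try the lookaround idea outside Nisus, here is a minimal sketch in Python rather than the Nisus macro language. It assumes four-digit code points written as `U+XXXX` with uppercase hex digits; the function name `strip_inserted_spaces` is made up for this illustration. Python's `re` module has no `\h` shorthand, so the hex digits are spelled out.

```python
import re

# A single space that sits between two U+XXXX code points.
# The lookbehind and lookahead are zero-width, so only the space is removed.
SPACE_BETWEEN_CODE_POINTS = re.compile(
    r"(?<=U\+[0-9A-F]{4}) (?=U\+[0-9A-F]{4})"
)

def strip_inserted_spaces(text: str) -> str:
    """Remove every space that separates two adjacent U+XXXX code points."""
    return SPACE_BETWEEN_CODE_POINTS.sub("", text)

print(strip_inserted_spaces("a U+2013 U+201D b"))  # a U+2013U+201D b
```

Note that the pattern cannot tell a space added by the conversion from one that was already in the text between two special characters, which is the case the question at the end of this post is about.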


2010-03-20 10:57:57

Joined: 2007-03-03 09:55:06
Posts: 494
Location: Europe
Quite so. There are spaces that I want to keep.
I think the while loop is the only choice at this point. A bit slow but OK.
Now, if I may ask, what is the other way?

Thanks, Henry.


2010-03-21 01:21:58

Joined: 2008-05-17 04:02:32
Posts: 400
Groucho wrote:
At a certain point the macro was to convert special characters (accented letters, diacritics, Ascii > 127) into Unicode code points. I made use of this conditional statement:
Code:
if Find '[^\u0001-\u007e]', 'aE-i'
   Edit:Convert:To Unicode Code Points
end

Your find expression looks odd. Character codes start at zero, not one. You should also avoid an expression such as “Ascii > 127”, which is self-contradictory: ASCII defines only codes 0 through 127.

Quote:
what is the other way?

Code:
$prefix = Cast to String '&#x'
$suffix = Cast to String ';'

$doc = Document.active
$sels = $doc.text.findAll '[^\x00-\x7F]', 'E-i'
foreach $sel in reversed $sels
   $d = $sel.text.characterAtIndex $sel.location
   $h = $prefix
   $h &= Convert To Hex $d
   $h &= $suffix
   $sel.text.replaceInRange $sel.range, $h
end

If you prefer Perl . . .
Code:
$doc = Document.active
$sels = $doc.text.findAll '[^\x00-\x7F]+', 'E-i'
$str = Array.new
foreach $sel in $sels
   $str.appendValue $sel.substring
end
$str = $str.join "\x00"
$hex = ''
Set Exported Perl Variables 'str', 'hex'
begin Perl
   foreach (split //, $str) {  # Nobumi’s code with a small modification
      if ( $_ eq "\x00" ) { $hex .= $_ }
      else { $hex .= sprintf("U+%04X", ord) }
   };

end
$str = $hex.split "\x00"
foreach $sel in reversed $sels
   $sel.text.replaceInRange $sel.range, $str.pop
end
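The same per-character idea can be sketched in Python for anyone who wants to test it outside Nisus. This is only an illustration of the technique, not a translation of either macro: the function name `to_code_points` and its `prefix` and `suffix` parameters are assumptions made for the example, and the four-digit zero padding follows the `%04X` format used in the Perl block.

```python
import re

# Any character outside the full ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def to_code_points(text: str, prefix: str = "U+", suffix: str = "") -> str:
    """Replace each non-ASCII character with prefix + hex code point + suffix."""
    def convert(match: re.Match) -> str:
        # ord() gives the code point; pad to at least four hex digits.
        return f"{prefix}{ord(match.group()):04X}{suffix}"
    return NON_ASCII.sub(convert, text)

print(to_code_points("a \u2013\u201d b"))              # a U+2013U+201D b
print(to_code_points("\u2013\u201d", "&#x", ";"))      # &#x2013;&#x201D;
```

Because each character is replaced on its own, nothing is inserted between adjacent code points, and spaces already present in the text are left alone.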


2010-03-21 03:48:43

Joined: 2007-03-03 09:55:06
Posts: 494
Location: Europe
Thanks, Kino. The Nisus macro version is OK. I just stripped the suffix, as I don’t need any, and changed the prefix to \U, as this is required by pdb’s markup language (\U2019, for example, is the right single curly quote ’).

As for the find expression, I followed Nisus Macro Reference, the table in Literals.

Greetings, Henry.


2010-03-21 07:30:43

Joined: 2008-05-17 04:02:32
Posts: 400
Groucho wrote:
As for the find expression, I followed Nisus Macro Reference, the table in Literals.

That is not what I was referring to. Is there any reason to leave 0x0000 and 0x007F, the first and the last ASCII characters, out of the range in the negated set?
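The practical difference between the two ranges can be checked with a short Python sketch. It assumes Python's `re` engine treats these character classes the same way as the Nisus find expressions do for these four sample characters; the pattern names are invented for the example.

```python
import re

# The range from the original macro: 0x0001 through 0x007E.
ONE_TO_7E = re.compile(r"[^\u0001-\u007e]")
# The full ASCII range: 0x0000 through 0x007F.
FULL_ASCII = re.compile(r"[^\x00-\x7F]")

# NUL, DEL, a plain letter, and an accented letter.
for ch in ("\x00", "\x7f", "A", "\u00e9"):
    print(
        f"U+{ord(ch):04X}",
        bool(ONE_TO_7E.fullmatch(ch)),
        bool(FULL_ASCII.fullmatch(ch)),
    )
# U+0000 True False
# U+007F True False
# U+0041 False False
# U+00E9 True True
```

With the narrower range, NUL and DEL are treated as characters to convert even though both are ASCII; with the full range, only characters above 0x7F are matched.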


2010-03-21 08:04:34

Joined: 2007-03-03 09:55:06
Posts: 494
Location: Europe
Ah, I see now. Well, no, there is no reason to keep them out. I just didn’t notice that. To be honest, I don’t know why I kept them out. Maybe I followed a practical philosophy: 0x0000 and 0x007F never occur in my files, so why put them in? This is a macro for personal use, you know.

Henry.


2010-03-21 08:23:27