Getting the number of characters of a found text string 

Joined: 2007-04-12 14:59:36
Posts: 229
If a macro finds a text string and you need to know the number of characters in that string, what is the easiest way to obtain it?


2009-08-12 07:19:13
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
If you just want the number of characters in the active selection, this will do:
Code:
$selection = TextSelection.active
$characterCount = $selection.length

If you want the number of characters in a text/string variable, you can use:
Code:
$text = "whatever"
$characterCount = $text.length

One note: neither of these returns a strict character count. From the macro reference:
Quote:
Note: the return value is actually the number of 16 bit (2 byte) pieces required to represent the text in the UTF-16 encoding. In other words, characters whose code point is greater than U+FFFF (those that require surrogate pairs) will count as more than a single character. For example, the musical double-flat symbol (U+1D12B) counts as two characters. This is uncommon and does not affect most letters used by most languages (eg: English, Hebrew, Arabic, Japanese, etc).
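
To see the effect described in the quote outside of Nisus, here is a small Python sketch (not Nisus macro code). It counts UTF-16 code units, which is what the quoted note says the length value reports. The helper name `utf16_length` is made up for this illustration.

```python
def utf16_length(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    # Encode without a byte-order mark; every code unit takes 2 bytes.
    return len(text.encode("utf-16-le")) // 2


double_flat = "\U0001D12B"  # musical double-flat symbol, above U+FFFF

print(utf16_length("whatever"))   # ASCII: one code unit per character
print(utf16_length("漢字"))        # common Han characters: one code unit each
print(utf16_length(double_flat))  # needs a surrogate pair
print(len(double_flat))           # Python's len() counts code points instead
```

For ordinary English or Chinese text the two counts agree; they only differ for characters that need surrogate pairs.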


2009-08-12 11:19:52

Joined: 2007-04-12 14:59:36
Posts: 229
I see how it works in principle, thank you. But I am still not sure what the most economical way is to do the following:

1) Search for the next string of Chinese characters.
2) Select as many (English) before the string as characters have been found in it (to put them into italics or the like afterwards).

I can see that an ordinary search with \p{Han} does step 1. Should I then get the number of characters by counting the selection made by the find, or is there a better way? The second step can be achieved with one of these new "send selector" commands, I guess.


2009-08-12 23:34:48

Joined: 2007-04-12 14:59:36
Posts: 229
"Select as many (English)" should read as: "Select as many (English) words". Sorry for this typo.


2009-08-12 23:38:52
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
As they say, there are many ways to skin this cat, but this is probably the most efficient:

Code:
# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
   $selection = TextSelection.active
   $count = $selection.length
   
   # process the following English text
   If Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'
      Italic
   Else
      Prompt "Could not match $count following English words."
   End
End

If you want any part of that second Find expression explained, I'd be happy to.


2009-08-13 11:33:04

Joined: 2007-04-12 14:59:36
Posts: 229
As it stands, there are two problems with this macro. First, the English words to select should come before the Chinese string, not after it: you need to look backwards. This is for practical reasons: take an English text with interspersed terms in Chinese. You want the English reader to know first how to pronounce what follows. Second, no commas, parentheses, etc. should be selected, only the letters of the words.


2009-08-13 12:40:25
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
Sorry, I missed the part about selecting the preceding text and not the following text. This should do what you need:
Code:
# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
   $selection = TextSelection.active
   $count = $selection.length
   
   # process the preceding English text
   While $count > 0
      $found = Find '\p{Latin}+', 'Eb-W'
      If ! $found
         Exit 'Missing preceding English text, aborting!'
      End
      $count -= 1
      
      Italic
   End
End
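
For readers who want to try the logic outside of Nisus, the approach of this macro (count the Han characters, then pick up that many letter-only words before them) can be sketched in Python. This is only an illustration under assumptions: Python's built-in `re` has no `\p{Han}` or `\p{Latin}`, so the two character ranges below are rough stand-ins, and the function name `words_before_han` is hypothetical.

```python
import re

# Stand-ins (assumptions), since Python's built-in `re` lacks \p{Han} and \p{Latin}:
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")  # basic CJK Unified Ideographs block only
LATIN_WORD = re.compile(r"[A-Za-z]+")      # unaccented ASCII letters only


def words_before_han(text: str):
    """Return (han_run, preceding_words) for the first Han run, or None."""
    han = HAN_RUN.search(text)
    if han is None:
        return None
    count = len(han.group())  # one preceding word per Han character
    # Match letters only, so commas, parentheses, etc. are never included.
    before = [m.group() for m in LATIN_WORD.finditer(text, 0, han.start())]
    if len(before) < count:
        raise ValueError("Missing preceding English text")
    return han.group(), before[-count:]


print(words_before_han("The term (wu wei) 無為 means non-action"))
```

Unlike the macro, the sketch only reports which words would be selected; it does not apply any formatting.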


2009-08-13 15:38:57

Joined: 2007-04-12 14:59:36
Posts: 229
Thanks a lot, Martin, this is very helpful and I see that it is a more elegant procedure than the one I had in mind.
To make it really perfect, two small problems remain, and I think I could work out how to solve them myself. But in case you want to show me, the problems are these:
1) If the text to be italicized is already in italics, the macro sets it back to normal, which it should not do.
2) The macro is meant to be run over and over again (with a visual check after each run), so on completion the cursor should be at the end of the first string it found.

PS: The revised macro no longer contains the line you had offered to explain. Would you explain it anyway?


2009-08-14 02:29:57
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
js wrote:
1) If the text to be italicized is already in italics, the macro sets it back to normal, which it should not do.
2) The macro is meant to be run over and over again (with a visual check after each run), so on completion the cursor should be at the end of the first string it found.

This revised macro should do the trick:
Code:
$doc = Document.active

# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
   $selection = $doc.textSelection
   $count = $selection.length
   
   # process the preceding English text
   While $count > 0
      $found = Find '\p{Latin}+', 'Eb-W'
      If ! $found
         Exit 'Missing preceding English text, aborting!'
      End
      $count -= 1
      
      # if not already italic, then apply
      $isItalic = Menu State ':Format:Italic'
      If ! $isItalic
         Italic
      End
   End

   # place selection just after Han characters we first found
   $doc.setSelection($selection)
   Select End
End


Quote:
PS: The revised macro no longer contains the line you had offered to explain. Would you explain it anyway?

Sure. We have this PowerFind Pro (regular expression) pattern:
Code:
Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'

Let's start with the "(?:whatever)" construct. Parentheses group multiple expressions so they can be treated as a single unit. Additionally, parentheses "capture" whatever is matched, so that it is available as a back-reference (e.g. "\1"). Adding "?:" turns off the capture, so for our purposes these two are equivalent:
Code:
Find "(\\W+\\p{Latin}+){$count,$count}", 'E-W'
Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'

The latter is, however, more efficient because it doesn't create a capture. Why do we bother with the grouping? Well, we want to find exactly "$count" English words. The way to do that is via the repetition operator "{min,max}". As an example, the expression "a{1,3}" finds between one and three lowercase a's. In our pattern we set the minimum and maximum both to $count, i.e. find exactly $count of whatever comes beforehand. What comes beforehand? The pattern we grouped with the parentheses:
Code:
(?:\\W+\\p{Latin}+)

Inside we have "\W+", which stands for one or more non-word characters, whereas "\p{Latin}+" stands for one or more Latin characters, i.e. we are matching a Latin word and any preceding whitespace/punctuation.

The backslashes are doubled because we use a double-quoted string literal, which is first interpreted by the macro language (e.g. "$count" is replaced by an actual number). So by the time the Find command sees the expression, each doubled backslash has been reduced to a single one.
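
The same two ideas (a non-capturing group with an exact repetition count, and backslashes that must be doubled in an interpreted string literal) can be tried in Python. This is a sketch, not Nisus code, and it rests on assumptions: `[A-Za-z]+` stands in for `\p{Latin}+`, which Python's built-in `re` does not support, and Python's raw strings play the role of a literal that is passed through uninterpreted.

```python
import re

count = 2

# Ordinary string literal: "\\W" is reduced to \W before `re` ever sees it.
escaped = "(?:\\W+[A-Za-z]+){%d,%d}" % (count, count)

# Raw string literal: backslashes are passed through, so no doubling is needed.
raw = r"(?:\W+[A-Za-z]+){%d,%d}" % (count, count)

print(escaped == raw)  # both spellings produce the same pattern text

# Exactly `count` repetitions of "non-word characters, then a word".
match = re.search(raw, "無為, alpha beta (gamma)")
print(match.group())

# "(?:...)" groups without capturing; plain "(...)" creates capture group 1.
print(re.compile(raw).groups)
print(re.compile(r"(\W+[A-Za-z]+){2,2}").groups)
```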

Hopefully that's somewhat clear. Let me know if you have any questions.


2009-08-17 13:06:33

Joined: 2007-04-12 14:59:36
Posts: 229
Thanks for the macro, with the test of a menu state, Martin. And also for the explanations that are easy to follow. I guess {$count,$count} could be abbreviated to {$count}, couldn't it?


2009-08-18 04:10:29
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
Yes, I forgot about that shorthand, that would definitely work!
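
The equivalence of the two spellings is easy to check in any regex engine with the usual repetition syntax. Here is a Python sketch, assuming PowerFind Pro treats "{n}" the same way Python's `re` does, and again using `[A-Za-z]+` as a stand-in for `\p{Latin}+`:

```python
import re

text = ", alpha beta gamma"
count = 2

# "{n,n}" spelled out versus the "{n}" shorthand.
long_form = re.search(r"(?:\W+[A-Za-z]+){%d,%d}" % (count, count), text)
short_form = re.search(r"(?:\W+[A-Za-z]+){%d}" % count, text)

print(long_form.group() == short_form.group())
```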


2009-08-18 10:31:14