Getting the number of characters of a found text string

Posted: 2009-08-12 07:19:13
by js
If a macro finds a text string and you need to know the number of characters in that string, what is the easiest way to obtain it?

Re: Getting the number of characters of a found text string

Posted: 2009-08-12 11:19:52
by martin
If you just want the number of characters in the active selection, this will do:

Code:

$selection = TextSelection.active
$characterCount = $selection.length
If you want the number of characters in a text/string variable, you can use:

Code:

$text = "whatever"
$characterCount = $text.length
One note: neither of these returns a strict character count. From the macro reference:
Note: the return value is actually the number of 16 bit (2 byte) pieces required to represent the text in the UTF-16 encoding. In other words, characters whose code point is greater than U+FFFF (those that require surrogate pairs) will count as more than a single character. For example, the musical double-flat symbol (U+1D12B) counts as two characters. This is uncommon and does not affect most letters used by most languages (eg: English, Hebrew, Arabic, Japanese, etc).
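
To make the note concrete, here is a small sketch in Python rather than in the macro language (the helper name utf16_length is made up for this illustration). It counts UTF-16 code units, which is what the macro reference says the length reports, and compares that with the number of code points:

```python
def utf16_length(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    # Each code unit is 2 bytes; "utf-16-le" avoids the extra 2-byte BOM.
    return len(text.encode("utf-16-le")) // 2

double_flat = "\U0001D12B"   # musical double-flat symbol, above U+FFFF
common_han = "\u6F22\u5B57"  # two common Han characters, below U+FFFF
rare_han = "\U00020000"      # a Han character from CJK Extension B

print(len(double_flat), utf16_length(double_flat))
print(len(common_han), utf16_length(common_han))
print(len(rare_han), utf16_length(rare_han))
```

Common Han characters count once each, but rare ones from CJK Extension B and beyond sit above U+FFFF, so they would presumably count twice in a length-based character count.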

Re: Getting the number of characters of a found text string

Posted: 2009-08-12 23:34:48
by js
I see how it works in principle, thank you. But I am still not sure what the most economical way is to do the following:

1) Search the next string of Chinese characters
2) Select as many (English) before the string as characters have been found in it. (To put them afterwards into italics or the like).

I can see that an ordinary search with \p{Han} does step 1. Should I then get the number of characters by counting the selection made by the find, or is there a better way? The second step can be achieved with one of these new "send selector" commands, I guess.

Re: Getting the number of characters of a found text string

Posted: 2009-08-12 23:38:52
by js
"Select as many (English)" should read as: "Select as many (English) words". Sorry for this typo.

Re: Getting the number of characters of a found text string

Posted: 2009-08-13 11:33:04
by martin
As they say, there are many ways to skin this cat. But this is probably the most efficient:

Code:

# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
	$selection = TextSelection.active
	$count = $selection.length
	
	# process the following English text
	If Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'
		Italic
	Else
		Prompt "Could not match $count following English words."
	End
End
If you want any part of that second Find expression explained, I'd be happy to.

Re: Getting the number of characters of a found text string

Posted: 2009-08-13 12:40:25
by js
As it is, there are two problems with this macro. The first is that the English words to select should come before the Chinese string, not after it: you need to search backwards. This is for practical reasons. Take an English text with interspersed terms in Chinese: you want the English reader to know first how to pronounce what follows. The second problem is that no commas, parentheses, etc. should be selected, only the letters of the words.

Re: Getting the number of characters of a found text string

Posted: 2009-08-13 15:38:57
by martin
Sorry, I missed the part about selecting the preceding text and not the following text. This should do what you need:

Code:

# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
	$selection = TextSelection.active
	$count = $selection.length
	
	# process the preceding English text
	While $count > 0
		$found = Find '\p{Latin}+', 'Eb-W'
		If ! $found
			Exit 'Missing preceding English text, aborting!'
		End
		$count -= 1
		
		Italic
	End
End

Re: Getting the number of characters of a found text string

Posted: 2009-08-14 02:29:57
by js
Thanks a lot, Martin, this is very helpful, and I see that it is a more elegant procedure than the one I had in mind.
To make it really perfect, there are still two little problems to be solved. I think I could find out how myself, but in case you want to show me, the problems are these:
1) Just in case the text to be put into Italics has already been put into Italics, the macro puts it back to normal, which it should not.
2) The macro is meant to be used over and over again (but permitting visual control at each go). So the cursor on completion should be at the end of the first string it found.

PS The revised macro no longer contains the line you had offered to explain if necessary. Would you explain it anyway?

Re: Getting the number of characters of a found text string

Posted: 2009-08-17 13:06:33
by martin
js wrote:
1) Just in case the text to be put into Italics has already been put into Italics, the macro puts it back to normal, which it should not.
2) The macro is meant to be used over and over again (but permitting visual control at each go). So the cursor on completion should be at the end of the first string it found.
This revised macro should do the trick:

Code:

$doc = Document.active

# Find next group of Han characters
If Find '\p{Han}+', 'E-W'
	$selection = $doc.textSelection
	$count = $selection.length
	
	# process the preceding English text
	While $count > 0
		$found = Find '\p{Latin}+', 'Eb-W'
		If ! $found
			Exit 'Missing preceding English text, aborting!'
		End
		$count -= 1
		
		# if not already italic, then apply
		$isItalic = Menu State ':Format:Italic'
		If ! $isItalic
			Italic
		End
	End

	# place selection just after Han characters we first found
	$doc.setSelection($selection)
	Select End
End
PS The revised macro no longer contains the line you had offered to explain if necessary. Would you explain it anyway?
Sure, so we have this PowerFind Pro (regular expression):

Code:

Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'
Let's start with the "(?:whatever)" construct. Parentheses group multiple expressions, so they can be treated as a single unit. Additionally parentheses will "capture" whatever is matched, so it is available as a back-reference (eg: "\1"). Adding in "?:" turns off the capture, so really for our purposes these two are equivalent:

Code:

Find "(\\W+\\p{Latin}+){$count,$count}", 'E-W'
Find "(?:\\W+\\p{Latin}+){$count,$count}", 'E-W'
The latter is, however, more efficient because it doesn't create a capture. Why do we bother with the grouping? Well, we want to find exactly "$count" English words. The way to do that is via the repetition operator "{min,max}". As an example, the expression "a{1,3}" finds between one and three little a's. In our pattern we set the minimum and maximum both to $count, i.e. find exactly $count of whatever comes beforehand. What comes beforehand? This pattern we grouped with the parentheses:

Code:

(?:\\W+\\p{Latin}+)
Inside we have "\W+", which stands for one or more non-word characters, and "\p{Latin}+", which stands for one or more Latin characters, i.e. we are matching a Latin word and the whitespace/punctuation that precedes it.
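
For readers who want to try the grouping and repetition outside of Nisus, here is a rough Python sketch of the same idea. It is only an approximation: Python's built-in re module has no \p{Latin}, so [A-Za-z]+ stands in for it, and the function name match_following_words is invented for this example.

```python
import re

def match_following_words(text: str, count: int):
    """Match exactly `count` words, each preceded by non-word characters."""
    # [A-Za-z]+ is a stand-in for \p{Latin}+, which Python's re module lacks.
    # (?: ... ) groups without capturing; {n,n} repeats the group exactly n times.
    pattern = r"(?:\W+[A-Za-z]+){%d,%d}" % (count, count)
    m = re.match(pattern, text)
    return m.group(0) if m else None

print(match_following_words(" (pinyin, tone marks) and more", 2))
print(match_following_words(" one", 2))
```

The first call matches through the second word, including the punctuation in front of each word. The second call finds no match, because there are fewer than two words to consume.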

The backslashes are doubled because we use a double-quoted string literal, which is first interpreted by the macro language (e.g. "$count" is replaced by an actual number). So by the time the Find command sees the expression, each doubled backslash has been reduced to a single one.
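
The same two-stage processing exists in other languages, so as a loose analogy (not Nisus syntax) here is how it looks in Python. An ordinary string literal halves the doubled backslashes and fills in the variable before any regular expression engine sees the text. The variable names are made up, and the doubled braces are just Python's f-string way of writing a literal brace.

```python
count = 3

# Stage 1: the language processes the literal. "\\" becomes one backslash
# and {count} is replaced by its value ({{ and }} are literal braces).
seen_by_find = f"(?:\\W+\\p{{Latin}}+){{{count},{count}}}"

# Stage 2: this is the text a find command would actually receive.
print(seen_by_find)
```

The printed text has single backslashes and the number in place of the variable, which is the form the pattern takes once the string literal has been interpreted.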

Hopefully that's somewhat clear. Let me know if you have any questions.

Re: Getting the number of characters of a found text string

Posted: 2009-08-18 04:10:29
by js
Thanks for the macro, with the test of a menu state, Martin. And also for the explanations that are easy to follow. I guess {$count,$count} could be abbreviated to {$count}, couldn't it?

Re: Getting the number of characters of a found text string

Posted: 2009-08-18 10:31:14
by martin
Yes, I forgot about that shorthand, that would definitely work!
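
A quick way to convince yourself of the equivalence, sketched in Python's regular expression flavor (which shares this repetition syntax; the sample text and pattern are invented for the illustration):

```python
import re

text = "ab ab ab ab"

# {3,3} and {3} both mean "exactly three repetitions of the group".
long_form = re.match(r"(?:ab ?){3,3}", text)
short_form = re.match(r"(?:ab ?){3}", text)

print(repr(long_form.group(0)))
print(repr(short_form.group(0)))
```

Both forms stop after the third "ab " and return the same match.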