"Complete List" Macro 

Joined: 2008-04-28 22:10:11
Posts: 46
Kino: I have read some of your replies and suggestions, thanks. They encouraged me to write this request: could you write a new macro?
A new macro, let's call it Complete List, where the macro will be able to:
a. list every word (case sensitive)
b. list the number of frequency next to the word (italic style)
c. list the page number next to the frequency number (plain style)
If there are more than one frequency in a given page, let the number of frequency show and the page number to follow, for example:
- California, 6–4,8,9,9,11,12.
The dash between the frequency and the page number is "n-dash", so the processor won't hyphenate.
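For readers who want to see the requested entry format outside Nisus, here is a minimal Python sketch (not a Nisus macro). The function name `format_entry` and its parameters are hypothetical, and it assumes the page number of every occurrence has already been collected.

```python
EN_DASH = "\u2013"  # en dash between the frequency and the page numbers

def format_entry(word, pages, prefix="- ", suffix="."):
    """Build one list line: word, total frequency, en dash, page of each occurrence."""
    page_list = ",".join(str(p) for p in sorted(pages))
    return f"{prefix}{word}, {len(pages)}{EN_DASH}{page_list}{suffix}"

# Six occurrences, two of them on page 9
print(format_entry("California", [4, 8, 9, 9, 11, 12]))
```

The italic/plain styling asked for in (b) and (c) is a word-processor attribute and is not modeled here.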

I am aware of Index feature in Nisus, but, correct me if I am wrong, with Indexing, you have to select the words to be indexed!! But with the macro I am suggesting, you either Select All, or not, just by launching the macro a new file should open up with the list created.


2009-03-20 06:21:17
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
This is a quick attempt at such a macro:
Code:
$frequency = Hash.new
$indexStart = 0

# calculate word frequency and mark for inclusion in Index
Set Selection 1, 0
While Find Next '\w{3,}', 'E-W'
   $word = Read Selection
   $word = Cast to String $word
   $frequency{$word} += 1
   Menu ':Insert:Index:Index'
   Select End
End
Find Next '\Z', 'E'

# insert index
Type Text "\n\n"
$indexStart = Selection Location
Menu ':Insert:Index:Insert Index...'
Press Button 'Insert'

# insert frequency information
Set Selection $indexStart, 0
While Find Next '(\w+)(?=\t)', 'E-W$'
   Select End
   $word = $1
   $count = $frequency{$word}
   Type Text ' '
   Insert Text "($count)"
   Menu ':Format:Italic'
   Select End
End

Hopefully it's of some use.


2009-03-20 14:04:48

Joined: 2008-05-17 04:02:32
Posts: 400
Here is another attempt.

You can change the behaviour of the macro by modifying
Code:
$shortestWordLength = 3
$prefix = Cast to String '- '
$suffix = Cast to String '.'
near the beginning of the macro.

Perhaps the definition of word used in this macro does not meet your needs. If so, please let us know.

This macro does not always work properly:
- It may not work properly for a document containing section notes ("Place: End of Section" in Endnote style).
- The whole note text is treated as if it belongs to the page containing the corresponding note reference, even if the latter half of the note text is on the next page.
Code:
### Complete List Macro ###

# A new macro, let's call it Complete List, where the macro will be able to:
# a. list every word (case sensitive)
# b. list the number of frequency next to the word (italic style)
# c. list the page number next to the frequency number (plain style)
# If there are more than one frequency in a given page, let the number of frequency show and the page number to follow, for example:
# - California, 6–4,8,9,9,11,12.
# The dash between the frequency and the page number is "n-dash", so the processor won't hyphenate.

# Known problems
# - This macro may not work properly for a document containing section notes ("Place: End of Section" in Endnote style).
# - Whole note text is treated as if it belongs to a page having the corresponding note reference even if the latter half of the note text is situated in the next page.

$shortestWordLength = 3
$prefix = Cast to String '- '
$suffix = Cast to String '.'

Require Pro Version 1.2

$doc = Document.active
if $doc == undefined
   exit  # exit silently if there is no open document
end

$start = Date.now

$text = $doc.text
$pages = Array.new
$i = 1

While Select Page $i
   $page = $doc.textSelection
   $pages.appendValue $page
   $i += 1
end

Select Start  # deselect

$findExp = '(?=\b[\p{L}\'\’-]{' & $shortestWordLength
$findExp &= ',}\b)\p{L}(?:[\p{L}\'\’-]*\p{L})?'

$pageNumHash = Hash.new
$freqHash = Hash.new
$sepPageNum = Cast to String ","
$sepEntry = Cast to String ",\x20"
$p = 1

foreach $page in $pages
   $pageText = Cast to String $page.subtext
   $founds = $pageText.findAll $findExp, 'E-i'  # PowerFind Pro and Case-Sensitive
   foreach $found in $founds
      $word = $found.subtext
      $pageNum = $pageNumHash.valueForKey $word
      if $pageNum == undefined
         $pageNum = $p
      else
         $pageNum &= $sepPageNum & $p
      end
      $pageNumHash.setValueForKey $pageNum, $word
      $freqHash{$word} += 1
   end
   $p += 1
end

$words = $pageNumHash.keys
$words.sort 'li'  # l: localized (sort order chosen in Intl Pref pane) and i: case-insensitive
$output = ''
$nDash = Text.newWithCharacter 0x2013
$LF = Text.newWithCharacter 0x000A

foreach $word in $words
   $output &= $prefix
   $output &= $word & $sepEntry
   $c = $freqHash.valueForKey $word
   if $c > 1
      $output &= $c & $nDash
   end
   $output &= $pageNumHash.valueForKey $word
   $output &= $suffix & $LF
end

New
Menu ':Format:Paragraph Alignment:Align Left'
$doc = Document.active
$doc.clearAndDisableUndoHistory
Type Text $output
$numFound = Find All '[0-9]+(?=\x{2013})', 'E'  # one or more Arabic digits followed by en-dash
if $numFound
   Menu ':Format:Italic'
   Select Start  # deselect
end

$finish = Date.now
$elapsed = $finish.secondsSinceUnixEpoch - $start.secondsSinceUnixEpoch
exit "Finished in $elapsed seconds"

### end of macro ###
Formatted macro file:
http://www2.odn.ne.jp/alt-quinon/files/NWPro/misc/CompleteList_nwm.zip

On this occasion, I'd like to repeat my past request for new macro commands. I think they would make it easier to write this kind of macro. Thanks.

$sel.location.pageNumber
$sel.location.lineNumber
$sel.location.paragraphNumber
$sel.insertIndex [topic [,subtopic]]


2009-03-21 02:48:02

Joined: 2008-04-28 22:10:11
Posts: 46
Wow! I have downloaded the macro and tried it on one of my files. It took 46 seconds to list a 30 page document, in an untitled new file, and it was awesome!

thanx a million.

Windsor


2009-03-22 06:57:01

Joined: 2008-05-17 04:02:32
Posts: 400
I discovered a bug in the macro: words followed by note references will be ignored. As I don't know how to remove the bug from the macro itself, I wrote another macro as a workaround.

This macro inserts (or removes) Zero-Width Joiner before all note references in the body so that Complete List macro recognizes words followed by a note reference properly. As ZWJ is a zero-width invisible character, the pagination of the document should remain the same but I noticed small differences in some cases. So please be attentive.
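The insert/remove logic can be illustrated outside Nisus with a small Python sketch. It is only an approximation under stated assumptions: `toggle_joiners` and `ref_positions` are hypothetical names, and note references are modeled as plain character positions in a string. As in the macro below, positions are processed from last to first so that earlier indexes stay valid.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def toggle_joiners(text, ref_positions, insert=True):
    """Insert or remove a ZWJ directly before each given position."""
    chars = list(text)
    # Work backwards so that edits do not shift the positions still to be handled.
    for pos in sorted(ref_positions, reverse=True):
        has_joiner = pos > 0 and chars[pos - 1] == ZWJ
        if insert and not has_joiner:
            chars.insert(pos, ZWJ)   # nothing is inserted where a ZWJ is already present
        elif not insert and has_joiner:
            del chars[pos - 1]
    return "".join(chars)

# "*" stands in for a note reference at indexes 5 and 11
marked = toggle_joiners("alpha* beta*", [5, 11])
print(len(marked))
```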

Sorry for the trouble.
Code:
### ZERO WIDTH JOINER before Note References ###

# This macro inserts or removes U+200D ZERO WIDTH JOINER before all note references in the document body. As this is a zero-width invisible character, the appearance of the document should remain the same.
# When you run this macro, you will be asked to choose which action you want the macro to do.
# Nothing will be inserted where U+200D is already present.

$joiner = Text.newWithCharacter 0x200D  # ZERO WIDTH JOINER
Require Pro Version 1.2

$doc = Document.active
if $doc == undefined
   exit  # exit silently if no document is open
end

$text = $doc.text
$notes = $doc.allNotes
if ! $notes.count
   exit 'No footnote/endnote found, exit...'
end

$message = 'Insert or Remove ZERO WIDTH JOINER before Note References'
$detail = 'Hit Stop Macro to cancel the operation'
$insert = 'Insert'
$remove = 'Remove'

$action = Prompt $message, $detail, $insert, $remove

foreach $note in reversed $notes
   $range = $note.documentTextRange
   $sel = TextSelection.newWithLocationAndLength $text, $range.location - 1, 1
   if $action == $insert
      if $sel.subtext != $joiner
         $text.insertAtIndex $range.location, $joiner
      end
   else
      if $sel.subtext == $joiner
         $text.deleteInRange $sel.range
      end
   end
end

### end of macro ###
Formatted macro file:
http://www2.odn.ne.jp/alt-quinon/files/NWPro/footendnotes/ZWJbeforeNoteReferences_nwm.zip


2009-03-22 08:55:51

Joined: 2008-04-28 22:10:11
Posts: 46
Kino: Thank you for the macro. However, when I used this last version, this is what the message said: The Macro "ZWJbeforeNoteReferences" has been completed. No footnote/endnote found, exit… and then I had to push the OK button.

Now. I sent the previous macro to my friend for a test drive; he said that the punctuation marks (in Armenian) are not ignored and, as a result, the words are chopped. See, for example, the word which? (in English) has the question mark at the end of the word, in Armenian, however, is in the word itself, like this: Ո՞րմէկը. As you can see, the question mark is after the first character. So, the macro should INCLUDE the Armenian punctuation marks, so that the words won't be split. Otherwise the first version is exactly how I need it to work, that is, when executing it from the macro's menu, a new file is created and the list of words is made. But with the new editing, it should be able to keep the punctuation marks. Here are the marks:

ՙ
՟
՝
՛
՜
՞
՚

Feel free to copy and paste these characters if need be. And please use the CompleteList macro. The new one did not create a new file and did not create the word list.

Windsor


2009-03-24 14:08:51

Joined: 2008-05-17 04:02:32
Posts: 400
Windsor wrote:
However, when I used this last version, this is what the message said: The Macro "ZWJbeforeNoteReferences" has been completed. No footnote/endnote found, exit… and then I had to push the OK button.
Sorry, I should have been clearer. "ZWJbeforeNoteReferences" is not a new version of the "Complete List" macro but an auxiliary macro which should be run before the "Complete List" macro when the target document contains footnotes/endnotes. You don't need to run it for documents that have no footnotes/endnotes.

Quote:
See, for example, the word which? (in English) has the question mark at the end of the word, in Armenian, however, is in the word itself, like this: Ո՞րմէկը. As you can see, the question mark is after the first character. So, the macro should INCLUDE the Armenian punctuation marks, so that the words won't be split.
I did not know that. The writing system of Armenian looks very interesting. Then, I replaced
Code:
$findExp = '(?=\b[\p{L}\'\’-]{' & $shortestWordLength
$findExp &= ',}\b)\p{L}(?:[\p{L}\'\’-]*\p{L})?'
with
Code:
$findExp = '(?=\b[\p{L}\'\’\x{559}-\x{55F}-]{' & $shortestWordLength
$findExp &= ',}\b)\p{L}(?:[\p{L}\'\’\x{559}-\x{55F}-]*\p{L})?'
so that the macro handles Armenian words properly.

Formatted macro file:
http://www2.odn.ne.jp/alt-quinon/files/NWPro/misc/CompleteWordList_nwm.zip
(I renamed the macro "Complete Word List", which I think is more descriptive, but you can give it any name because it does not depend on any other macro.)


2009-03-24 23:15:58

Joined: 2008-04-28 22:10:11
Posts: 46
Kino:

Thank you for the attempt. I tried the latest version, and still it chopped the words that had the punctuation. When I checked your code, in this forum, I realized that there were two of the punctuation marks that were in the list (559 and 55f) and the rest were hard to understand for me. I checked the Unicode list of the Armenian font and I saw the following:

0559 for ՙ
055A for ՚
055B for ՛
055C for ՜
055E for ՞
055F for ՟

Could you try to include these codes in the macro, so that the words that will be having one of these punctuation marks won't be split?

By the way, I like the name Complete Word List. Let's keep it.

Thanks a million.
Windsor


2009-03-25 15:38:49

Joined: 2008-05-17 04:02:32
Posts: 400
Sorry for the trouble. Please try this one.
http://www2.odn.ne.jp/alt-quinon/files/NWPro/misc/CompleteWordList_nwm.zip

I changed the find expression to
Code:
$findExp = '(?=\b(?<c>\p{L}|[-\'\’\x{559}-\x{55F}]){' & $shortestWordLength
$findExp &= ',}\b)\p{L}(\g<c>*\p{L})?'
which seems to work.

'\x{559}-\x{55F}' in this and the older expressions stands for a character range from U+0559 to U+055F. And I don't understand why the older expression does not work. I must be missing something obvious ;-(
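To show what the range is meant to achieve, here is a hedged Python sketch, not the Nisus expression itself. Python's `re` module has no `\p{L}`, so `[^\W\d_]` stands in for "any letter"; `word_pattern` and `ARMENIAN_MARKS` are names made up for this illustration, and the minimum word length is left out.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter; substitute for \p{L}

def word_pattern(extra_chars=""):
    """A word starts and ends with a letter; hyphen, apostrophes and extra_chars may occur inside."""
    inner = rf"(?:{LETTER}|[-'\u2019{extra_chars}])"
    return re.compile(rf"{LETTER}(?:{inner}*{LETTER})?")

ARMENIAN_MARKS = r"\u0559-\u055F"  # same range as \x{559}-\x{55F} above

# An Armenian word with the question mark (U+055E) after its first letter
sample = "\u0548\u055e\u0580\u0574\u0567\u056f\u0568"

print(len(word_pattern().findall(sample)))                # split at the mark
print(len(word_pattern(ARMENIAN_MARKS).findall(sample)))  # kept as one word
```

Without the extra range the mark acts as a word boundary and the word is cut in two; with it the word is matched whole.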


2009-03-25 20:22:40

Joined: 2008-04-28 22:10:11
Posts: 46
Kino:

Now we're talkin'!

Thanx a million.

Windsor


2009-03-26 10:51:25
Official Nisus Person

Joined: 2002-07-11 17:14:10
Posts: 4251
Location: San Diego, CA
Kino wrote:
'\x{559}-\x{55F}' in this and the older expressions stands for a character range from U+0559 to U+055F. And I don't understand why the older expression does not work. I must be missing something obvious ;-(

This looks like a problem with our regex engine- I'll file a bug. Thanks and sorry for any confusion it caused.


2009-03-26 11:50:13

Joined: 2008-05-17 04:02:32
Posts: 400
Using macro commands newly available in Nisus Writer Pro 1.3, I rewrote the macro. Thanks to a proper command to get the page number (pageNumberAtIndex), it is not necessary any more to invent tricky code and the macro runs faster.

I changed the output format so that you will get
 - actually, 9–20(4),37,44,45,50(2).
instead of
 - actually, 9–20,20,20,20,37,44,45,50,50.

I don't know if you, Windsor, like it, though.

Also, this version distinguishes words in note text by putting “n” after the page number.

By changing the value of the variables at the beginning of the macro, you can customize the behaviour of the macro as you like.
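The new output format can be sketched in Python as follows. This is an illustration only, not a translation of the macro: `format_entry` is a hypothetical name, pages are plain integers, and the “n” suffix for note text is not modeled.

```python
from collections import Counter

def format_entry(word, pages, prefix="- ", suffix="."):
    """Collapse repeated page numbers into page(count) and prepend the total frequency."""
    counts = Counter(pages)
    parts = []
    for page in sorted(counts):
        n = counts[page]
        parts.append(f"{page}({n})" if n > 1 else str(page))
    total = len(pages)
    # The total and the en dash are shown only when the word occurs more than once.
    head = f"{total}\u2013" if total > 1 else ""
    return f"{prefix}{word}, {head}{','.join(parts)}{suffix}"

print(format_entry("actually", [20, 20, 20, 20, 37, 44, 45, 50, 50]))
```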

A problem, though. Is it impossible to get a table object without actually making a selection in a table cell? This slows the macro down if you run it on a document having many tables, like Nisus Macro Reference.zrtf.
Code:
   $where = $sel.text.documentContentType
   if $where == 'table'
      $doc.setSelection $sel
      $where = $doc.tableSelection.table.enclosingText.documentContentType
   end
If so, is it possible to make table command work with text selection objects in a future version so that $sel.table returns the corresponding table object for a text selection situated in a table cell? Thanks.

Code:
### Complete Word List Macro (rev. 2) ###

# Written for Windsor.

# A new macro, let's call it Complete List, where the macro will be able to:
# a. list every word (case sensitive)
# b. list the number of frequency next to the word (italic style)
# c. list the page number next to the frequency number (plain style)
# If there are more than one frequency in a given page, let the number of frequency show and the page number to follow, for example: - California, 6–4,8,9,9,11,12.
# The dash between the frequency and the page number is "n-dash" [$freqSep], so the processor won't hyphenate.

# Changes:
# - When the same word appears multiple times on the same page, it will be noted as pageNumber(numberOfOccurrences), e.g. 23(7), where “(” and “)” are defined by $openingFreq and $closingFreq.
# - For a page number in note text, “n” ($noteSuffix) will be appended to it, e.g. 23n.
# - Words in Comments, Headers and Footers are ignored.

$shortestWordLength = 3
$caseSensitive = true  # set it to false if case-insensitive is preferable
$prefix = Cast to String '- '
$entrySep = Cast to String ', '
$freqSep = Cast to String '–'  # en-dash
$pageNumSep = Cast to String ','
$noteSuffix = Cast to String 'n'
$openingFreq = Cast to String '('
$closingFreq = Cast to String ')'
$suffix = Cast to String '.'
$debugEnabled = false

Require Pro Version 1.3
Debug.setCodeProfilingEnabled $debugEnabled

$doc = Document.active
if $doc == undefined
   exit  # exit silently if there is no open document
end

if $shortestWordLength < 2
   $findExp = '\b\p{L}(?:[-\'\x{2019}\x{559}-\x{55F}\p{L}]*\p{L})?\b'
else
   $SWL = $shortestWordLength - 2
   $findExp = '(?=\p{L}(?<w>[-\'\x{2019}\x{55A}-\x{55F}\p{L}]){' & $SWL
   $findExp &= ',}\p{L})\b\p{L}(?:\g<w>*\p{L})?'
end

$sels = $doc.text.findAll $findExp, 'Ew', '-amn'  # find in the main body and notes only

$data = Hash.new

foreach $sel in $sels
   $word = $sel.substring
   if $caseSensitive == false
      $word = $word.textByLowercasing
   end
   if $data{$word} == undefined
      $data{$word} = Hash.new
   end
   $pageNum = $sel.text.pageNumberAtIndex $sel.location
   $where = $sel.text.documentContentType
   if $where == 'table'
      $doc.setSelection $sel
      $where = $doc.tableSelection.table.enclosingText.documentContentType
   end
   if $where != 'body'
      $pageNum &= $noteSuffix
   end
   $data{$word}{$pageNum} += 1
end

$words = $data.keys
$words.sort 'li'  # l: localized (sort order chosen in Intl Pref pane) and i: case-insensitive
$output = Array.new

foreach $word in $words
   $outputTemp = $prefix & $word
   $outputTemp &= $entrySep
   $c = 0
   $pages = $data{$word}.keys
   $pages.sort
   $pageTemp = Array.new
   foreach $page in $pages
      $count = $data{$word}{$page}
      $c += $count
      if $count > 1
         $page &= $openingFreq & $count
         $page &= $closingFreq
      end
      $pageTemp.appendValue $page
   end
   if $c > 1
      $outputTemp &= $c & $freqSep
   end
   $outputTemp &= $pageTemp.join($pageNumSep) & $suffix
   $output.appendValue $outputTemp
end

$LF = Text.newWithCharacter 0x000A
$output = $output.join $LF

if $debugEnabled == true
   $doc.setSelections $sels
end

$WordList = Document.newWithText $output
$WordList.clearAndDisableUndoHistory
Menu ':Format:Paragraph Alignment:Align Left'

$findFreq = '[0-9]+(?=\Q'
$findFreq &= $freqSep & '\E)'
$numFound = Find All $findFreq, 'E'
if $numFound
   Menu ':Format:Italic'
end

Select Document Start

### end of macro ###
Formatted macro file:
http://www2.odn.ne.jp/alt-quinon/files/NWPro/textinfo/CompleteWordList_nwm.zip


2009-07-17 20:44:02

Joined: 2007-04-12 14:59:36
Posts: 229
That's a great and helpful macro. Now I wonder about this: the page as a “package size” is of course very interesting for physical books. But for text analysis it is often more interesting to know whether a certain word appears within a package size you define yourself, e.g. within a paragraph, or within a chunk of text defined by a delimiting character.
Would it be difficult to adapt the macro in this way?
I realize that the net result could be achieved by preparing the text itself: replacing all returns by soft returns, and all page breaks by returns. But I would not even know how to achieve the latter.


2009-07-18 02:09:25

Joined: 2008-05-17 04:02:32
Posts: 400
js wrote:
But for text analysis it is often more interesting to know whether a certain word appears within a package size you define yourself, e.g. within a paragraph, or within a chunk of text defined by a delimiting character.
The easiest way is to modify the target document so that each package (defined by you) is a paragraph or a section. Then, replace pageNumberAtIndex at the 57th paragraph of the macro with paragraphNumberAtIndex or sectionNumberAtIndex. Which is more appropriate depends on the structure of the document. If it contains a table, you have to use sectionNumberAtIndex because, unfortunately, paragraphNumberAtIndex does not work properly for text in table cells. In NW Pro 1.3, if you copy a section break into the Replace field, you can replace a given string (e.g. two successive returns, a page break) with Section Break (Same Page) with “Replace Attributes” turned on. Or you can use the Replace All Breaks macro.

Quote:
I realize that the net result could be achieved by preparing the text itself: replacing all returns by soft returns, and all page breaks by returns. But I would not even know how to achieve the latter.
Net result? As in 釣果 (a fishing catch)? I'm not quite sure I understand what you mean, but you can achieve it by…
Code:
Replace All '\n', '\x{2028}', 'E', '-am'  # -am: in main body text only
Replace All '\f', '\n', 'E', '-am'

Another problem with paragraphNumberAtIndex in such a macro is that the value returned by it is not meaningful for words in note text. In that case, you have to transform footnotes/endnotes into inline notes using Macro:Changing Text:Convert Notes to Text before running the macro.

However, if your document is already logically split by page breaks, I think it would be better to replace the page breaks with section breaks and use sectionNumberAtIndex.
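The idea of indexing by a self-defined "package" instead of a page can be sketched in Python. This is only an illustration of the principle, assuming plain text in which a delimiter string separates the packages; `index_by_chunk` and its parameters are hypothetical and have no counterpart in the macro.

```python
import re
from collections import defaultdict

def index_by_chunk(text, delimiter="\n", min_len=3):
    """Map each word to the 1-based numbers of the chunks in which it occurs."""
    index = defaultdict(list)
    for number, chunk in enumerate(text.split(delimiter), start=1):
        for word in re.findall(r"[^\W\d_]+", chunk):  # runs of letters
            if len(word) >= min_len:
                index[word].append(number)
    return dict(index)

text = "red apple\ngreen apple pie\nred pie"
print(index_by_chunk(text))
```

Passing a different `delimiter` (for example a form feed, "\f") switches the package from paragraphs to any other chunk boundary.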


2009-07-18 03:49:23

Joined: 2007-04-12 14:59:36
Posts: 229
The good news as far as I am concerned: The modification works perfectly well with paragraphs, thank you.

As for sections, I don’t know: If I add a section break from the insert menu, and paste it into a find window (in Regex mode) what it actually inserts is \f, the same as a page break. (And by the way: I did not find \f in the Macro Manual … ) So I know now how to insert page breaks with a macro, but not how to insert section breaks. Never mind. I don’t need to right now. I am perfectly happy with paragraph breaks. But to have your macro really do what I want, I lack a final step.

With that little modification I get a list of words with paragraph references. But I use other meaningful references. They consist of one word at the beginning of the paragraph. Let’s say I use

Anna at the beginning of paragraph 1
Judy at the beginning of paragraph 2
Emma at the beginning of paragraph 3

Now your macro tells me that such and such word can be found in 1 or in 3. But I would like it to tell me that it can be found under Anna or under Emma. How can this be done? I guess it would mean producing a hash like
1 -> Anna
2 -> Judy
3 -> Emma
And then those numbers should be replaced in the result of your macro by the corresponding names. Or is that a weird detour for something that can be done more easily through another method?
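One possible way to picture this, sketched in Python rather than as a Nisus macro: take the first word of each paragraph as its label and record the label directly, which avoids the number-to-name detour. `index_by_label` is a hypothetical name, and the sketch assumes every paragraph begins with its label word.

```python
import re

def index_by_label(text, min_len=3):
    """Index the words of each paragraph under that paragraph's first word."""
    index = {}
    for paragraph in text.split("\n"):
        words = re.findall(r"[^\W\d_]+", paragraph)
        if not words:
            continue  # skip empty paragraphs
        label, rest = words[0], words[1:]
        for word in rest:
            if len(word) >= min_len:
                index.setdefault(word, []).append(label)
    return index

text = "Anna likes tea\nJudy likes coffee\nEmma drinks tea"
print(index_by_label(text)["tea"])
```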


2009-07-18 06:39:43