Sort lines and kill duplicates 

Joined: 2007-04-12 14:59:36
Posts: 229
To sort lines, kill duplicates, and have them counted, I use this Perl script:
Code:
#!/usr/bin/perl -w

use strict;

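my %seen;
while (my $line = <>) {
    chomp $line;        # drop the newline so it is not printed twice below
    $seen{$line}++;
}

# Print each distinct line once, prefixed by its count, in sorted order
foreach my $key (sort keys %seen) {
    my $count = $seen{$key};
    print "$count $key\n";
}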

I wonder if there is a simple way to integrate this into Nisus? Or can the Nisus Macro language do this?
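For readers who don't use Perl, here is the same counting logic as a minimal Python sketch. It is only a stand-in for comparison, not a Nisus macro, and the function name `count_lines` is made up for illustration.

```python
from collections import Counter


def count_lines(text):
    """Return one 'count line' string per distinct line, sorted by line."""
    # splitlines() drops the line terminators, so the keys carry no newline
    counts = Counter(text.splitlines())
    return [f"{counts[line]} {line}" for line in sorted(counts)]


if __name__ == "__main__":
    sample = "pear\napple\npear\nfig\napple\npear\n"
    print("\n".join(count_lines(sample)))
```

Note that `sorted()` here orders by code point, which may differ from the order a locale-aware sort would give.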


2009-03-28 07:40:50

Joined: 2008-05-17 04:02:32
Posts: 400
Probably the simplest and the fastest is...
Code:
Menu ':Edit:Sort Paragraphs:Ascending (A-Z)'
Replace All '(^.+\n)\1+', '\1', 'E-isS'
Find options
E: PowerFind Pro
-i: Case Sensitive
s: In Selection
S: Preserve Selection

Edit:
- The sort order is the one set in the International Pref pane.
- If the selection contains empty paragraphs, you can collapse each run of them into a single empty paragraph by changing the find expression from '(^.+\n)\1+' to '(^.*\n)\1+'. However, you may want to delete them all. In that case, add this command:
Code:
 Replace All '^\n+', '', 'EsS'
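The idea behind the two Replace All commands (sort first, then let a backreference collapse each run of identical paragraphs) can be tried outside Nisus as well. Below is a rough Python sketch of it. It assumes Python's `re` module behaves closely enough to PowerFind Pro for these two expressions; plain `sorted()` stands in for the menu sort, so the International settings are not honoured; and `sort_and_dedupe` is a hypothetical name.

```python
import re


def sort_and_dedupe(text):
    """Sort paragraphs, collapse adjacent duplicates, drop empty paragraphs."""
    # Stand-in for the Sort Paragraphs menu command (code-point order only)
    lines = sorted(text.splitlines())
    joined = "".join(line + "\n" for line in lines)
    # Same find expression as above: a non-empty line followed by copies of itself
    deduped = re.sub(r"(^.+\n)\1+", r"\1", joined, flags=re.MULTILINE)
    # Same as the optional extra command: delete runs of empty paragraphs
    return re.sub(r"^\n+", "", deduped, flags=re.MULTILINE)


if __name__ == "__main__":
    print(sort_and_dedupe("b\na\n\nb\na\n\n"), end="")
```

The sort has to come first, because the backreference only sees duplicates that sit directly next to each other.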


2009-03-28 08:09:36

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Kino's approach is the one that I would have used too, but there is one important difference between js's macro and Kino's: js's macro also counts how many times each line occurs. Here is one way to extend Kino's approach to do that counting.
Code:
Select All
:Edit:Sort Paragraphs:Ascending (A-Z)
Find and Replace '^(.+)(\n\1)*$', '•\t\0', 'Ea'
Find and Replace '\n[^•].*', '•', 'Ea'
Find and Replace '^•\t([^•]+)(•+)$', '•\2\t\1', 'Ea'

This method creates a "count" in the form of bullets, one bullet per occurrence. To turn the bullets into numbers, the fastest approach I have found so far is the following:
Code:
$doc = Document.active
$text = $doc.text
$bullets = $text.find '^•+', 'Ea'
foreach $bullet in reversed $bullets
   $text.replaceInRange $bullet.range, $bullet.length
end

I had at first wondered whether one could re-write js's macro directly in the Nisus macro language. But one limitation is that the Nisus macro language doesn't seem to have an easy way to sort hashes.
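To make the "bullet count" passes easier to follow, here is a rough Python re-enactment of the three Find and Replace steps plus the bullet-to-number conversion. It is a sketch under assumptions: Python's `re` stands in for PowerFind Pro, no input line contains a bullet, and empty paragraphs are dropped up front (the macro above does not do that). The name `bullet_count` is made up.

```python
import re


def bullet_count(text):
    """Count duplicate lines by tallying bullets, then turn bullets into numbers."""
    # Sort, and skip empty paragraphs so [^•] can never swallow a bare newline
    text = "\n".join(sorted(line for line in text.splitlines() if line))
    flags = re.MULTILINE
    # Pass 1: mark the first line of every run of identical lines with "•<tab>"
    text = re.sub(r"^(.+)(\n\1)*$", r"•\t\g<0>", text, flags=flags)
    # Pass 2: every unmarked line is a repeat; turn it into one more bullet
    text = re.sub(r"\n[^•].*", "•", text, flags=flags)
    # Pass 3: move the trailing bullets in front of the tab
    text = re.sub(r"^•\t([^•]+)(•+)$", r"•\2\t\1", text, flags=flags)
    # Final step: replace each leading bullet run with its length
    return re.sub(r"^•+", lambda m: str(len(m.group())), text, flags=flags)


if __name__ == "__main__":
    print(bullet_count("pear\napple\npear\nfig\napple\npear\n"))
```

A line that occurs only once keeps its single bullet from pass 1, is skipped by pass 3, and so ends up with a count of 1.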

_________________
philip


2009-03-29 06:47:57

Joined: 2008-05-17 04:02:32
Posts: 400
Ah! I did not read the original script attentively. So here is something close to it.
Code:
$sp = Cast to String "\x20"
$LF = Cast to String "\n"
$doc = Document.active
$subtext = $doc.selectedSubtext
if ! $subtext.length
   exit 'Nothing selected, exit...'
end

$paragraphs = $subtext.split $LF
$freq = Hash.new
foreach $para in $paragraphs
   if $para != ''
      $freq{$para} += 1
   end
end

$keys = $freq.keys
$keys.sort 'li'
$output = ''
foreach $key in $keys
   $f = $freq{$key}
   $output &= "$f" & $sp
   # The style attribute of count number will be that of "$f" in this macro file.
   $output &= $key & $LF
end
Insert Attributed Text $output

Edit: I modified the macro so that the output preserves the style attributes (of the first occurrence). A great advantage of NW Pro macro over Perl when this matters. The algorithm of the macro is basically the same as that of the Perl script. NW Pro macro's weakness in parsing over-nested commands might make this one look different, though.


2009-03-29 08:22:28

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Now I see it. Of course the keys of the hash are an array, so they can be sorted.
Your version does have some improvements over the version that I was experimenting with. But when I tested it on a file with a bit less than 6000 lines, this approach was markedly slower than the Find and Replace version.

_________________
philip


2009-03-29 09:49:50

Joined: 2008-05-17 04:02:32
Posts: 400
Err... if we were competing in a car race...
Code:
$start = Date.now

Select All
$sp = Cast to String "\x20"
$LF = "\n"
$doc = Document.active
$subtext = $doc.selectedSubtext
if ! $subtext.length
   exit 'Nothing selected, exit...'
end

$paragraphs = $subtext.split $LF
$freq = Hash.new
foreach $para in $paragraphs
   $freq{$para} += 1
end

$keys = $freq.keys
$keys.sort 'li'
$output = ''
foreach $key in $keys
   $output &= $freq{$key} & $sp
   $output &= $key & $LF
end
Insert Attributed Text $output

$finish = Date.now
$elapsed = $finish.secondsSinceUnixEpoch - $start.secondsSinceUnixEpoch
Exit "finished in $elapsed seconds"
This version does not remove empty paragraphs. You can make it far faster if you don't take care of style attributes at all.


2009-03-29 11:07:16

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Kino wrote:
Err... if we were competing in a car race...


Actually my point is that if we were competing we (apparently) shouldn't bother with all this hash business. The "bullet count" version I posted earlier is much faster. Your version took 292 seconds on my machine on the 6000-line file. The "bullet count" version took 9 seconds!

This is not a criticism of your macro writing skills. I was very surprised that the internal hash method was so much slower. I had written a version almost identical to yours myself, but since it took so long and caused beach balls (and even an occasional crash) I thought this was due to my poor understanding of the macro language. But apparently not. This method is just slow.

I have been using "bullet count" since the Nisus Classic days, only because I find it easier to write such a macro. But now with NWP it is apparently also faster in execution.

_________________
philip


Last edited by phspaelti on 2009-03-29 19:42:29, edited 1 time in total.



2009-03-29 19:08:31

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Kino wrote:
You can make it far faster if you don't take care of style attributes at all.


Actually I was wondering about this too. But what part of the macro takes care of the style attributes? The slow part of the macro is the loop which assembles the $output string. Timing individual bits of the macro, the "hash loop" takes about 2 seconds, and the "output loop" 288. While I'm sure that the slowness is due to the concatenating of styled text, I don't understand how this can be avoided.
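For anyone who wants to repeat this kind of per-phase timing outside Nisus, here is a small Python sketch that times the counting, sorting, and output-building steps separately. It only illustrates the measuring technique and says nothing about how Nisus handles styled text internally. The name `timed_word_list` and the phase labels are made up for the example.

```python
import time
from collections import Counter


def timed_word_list(text):
    """Build the 'count line' list and report how long each phase took."""
    timings = {}

    start = time.perf_counter()
    counts = Counter(text.splitlines())      # the "hash loop"
    timings["count"] = time.perf_counter() - start

    start = time.perf_counter()
    keys = sorted(counts)
    timings["sort"] = time.perf_counter() - start

    start = time.perf_counter()
    # The "output loop": collect the pieces and join them once at the end
    output = "".join(f"{counts[key]} {key}\n" for key in keys)
    timings["output"] = time.perf_counter() - start

    return output, timings


if __name__ == "__main__":
    result, phases = timed_word_list("b\na\nb\n" * 2000)
    for name, seconds in phases.items():
        print(f"{name}: {seconds:.6f} s")
```

Timing the phases one by one, as done above with the macro, is what shows which loop is worth optimising.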

_________________
philip


2009-03-29 19:40:33

Joined: 2008-05-17 04:02:32
Posts: 400
phspaelti wrote:
While I'm sure that the slowness is due to the concatenating of styled text, I don't understand how this can be avoided.

By making plain text the text object.
Code:
$StyledText = false

Debug.setDestination 'new'
$start = Date.now
Debug.log 'Macro started'

$sp = Cast to String "\x20"
$LF = Cast to String "\n"
$doc = Document.active
$text = $doc.text
$sel = TextSelection.newWithLocationAndLength $text, 0, $text.length
if $StyledText == true
   $textcopy = $sel.subtext
else
   $textcopy = Cast to String $sel.subtext
end
Debug.log 'Created copy of the main body text object'

$lines = $textcopy.split $LF
$textcopy = ''
Debug.log 'Text object split into an array.'

$freq = Hash.new
foreach $line in $lines
   $freq{$line} += 1
end
$lines = Array.new
Debug.log 'Counted number of occurrences'

$keys = $freq.keys
$keys.sort 'li'
Debug.log 'Finished sorting'

$output = ''
foreach $key in $keys
   $output &= $freq{$key} & $sp
   $output &= $key & $LF
end
Debug.log 'Finished creating the word list'

New
$doc = Document.active
$doc.clearAndDisableUndoHistory
if $StyledText == true
   $doc.insertText $output, 'm'
else
   $doc.insertText $output, 'a'
end

$finish = Date.now
$elapsed = $finish.secondsSinceUnixEpoch - $start.secondsSinceUnixEpoch

Debug.log "Finished in $elapsed seconds"
In the non-styled-text mode, it runs 100 times faster (!) on my test file of 22000 lines.


2009-03-30 04:25:16

Joined: 2007-02-07 00:58:12
Posts: 876
Location: Japan
Kino wrote:
By making plain text the text object.
Code:

   $textcopy = Cast to String $sel.subtext

In the non-styled-text mode, it runs 100 times faster (!) on my test file of 22000 lines.

So that does it.
That's a nice trick. It works like a charm. Thanks!

_________________
philip


2009-03-30 07:34:21

Joined: 2007-04-12 14:59:36
Posts: 229
Big thanks to Kino and phspaelti for their contributions. I found them extremely helpful as well as instructive!


2009-04-02 14:05:56