Sort lines and kill duplicates

js
Posts: 229
Joined: 2007-04-12 14:59:36

Post by js » 2009-03-28 07:40:50

To sort lines, kill duplicates and have them counted, I use this Perl macro:

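Code:

#!/usr/bin/perl -w

use strict;

# Count how many times each input line occurs.
my %seen;
while (<>) {
    chomp;    # drop the newline so it is not printed twice below
    $seen{$_}++;
}

# Print each distinct line once, in sorted order, prefixed by its count.
foreach my $key (sort keys %seen) {
    my $count = $seen{$key};
    print "$count $key\n";
}

I wonder if there is a simple way to integrate this into Nisus? Or can the Nisus macro language do this?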

Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Post by Kino » 2009-03-28 08:09:36

Probably the simplest and the fastest is...

Code:

Menu ':Edit:Sort Paragraphs:Ascending (A-Z)'
Replace All '(^.+\n)\1+', '\1', 'E-isS'
Find options
E: PowerFind Pro
-i: Case Sensitive
s: In Selection
S: Preserve Selection

Edit:
- The sort order is the one set in the International Pref pane.
- If the selection contains empty paragraphs, you can collapse each run of them into a single one by changing the find expression from '(^.+\n)\1+' to '(^.*\n)\1+'. However, you may want to delete them all. In that case, add this command:

Code:

 Replace All '^\n+', '', 'EsS'
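
For readers who want to try the sort-then-collapse idea outside Nisus, here is a minimal Python sketch of it. It is only an illustration: the function name sort_and_dedupe and the drop_empty flag are made up for this example, and Python's sort uses plain code-point order rather than the order set in the International preferences.

```python
import re

def sort_and_dedupe(text: str, drop_empty: bool = True) -> str:
    """Sort lines, then collapse runs of identical lines (hypothetical helper)."""
    lines = text.splitlines()
    lines.sort()  # code-point order, NOT the locale-aware order Nisus applies
    # Make sure every line, including the last, ends with a newline,
    # so the backreference below can match whole lines.
    joined = "".join(line + "\n" for line in lines)
    # Same idea as the find expression '(^.*\n)\1+' replaced by '\1'.
    joined = re.sub(r"(^.*\n)\1+", r"\1", joined, flags=re.MULTILINE)
    if drop_empty:
        # Same idea as replacing '^\n+' with nothing.
        joined = re.sub(r"^\n+", "", joined, flags=re.MULTILINE)
    return joined

print(sort_and_dedupe("pear\napple\n\npear\napple\n"), end="")
```

The collapse step only works because sorting has already placed identical lines next to each other.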

phspaelti
Posts: 956
Joined: 2007-02-07 00:58:12
Location: Japan

Post by phspaelti » 2009-03-29 06:47:57

Kino's approach is the one that I would have used too, but there is one important difference between js's macro and Kino's: js's macro also counts how many times each line occurs. Here is one way to extend Kino's approach to count them.

Code:

Select All
:Edit:Sort Paragraphs:Ascending (A-Z)
Find and Replace ‘^(.+)(\n\1)*$’, ‘•\t\0’, ‘Ea’
Find and Replace ‘\n[^•].*’, ‘•’, ‘Ea’
Find and Replace ‘^•\t([^•]+)(•+)$’, ‘•\2\t\1’, ‘Ea’

This method creates a "count" in the form of bullets. To turn the bullets into numbers, the fastest approach I have found so far is the following:

Code:

$doc = Document.active
$text = $doc.text
$bullets = $text.find ‘^•+’, ‘Ea’
foreach $bullet in reversed $bullets
	$text.replaceInRange $bullet.range, $bullet.length
end

I had at first wondered whether one could re-write js's macro directly in the Nisus macro language. But one limitation is that the Nisus macro language doesn't seem to have an easy way to sort hashes.
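
To make the bullet-count passes above easier to follow, here is a rough Python trace of the same four steps using ordinary regular expressions. This is a sketch, not the macro itself: the function name bullet_count is invented for the example, empty lines are dropped before sorting, and it assumes no input line contains the bullet character.

```python
import re

def bullet_count(text: str) -> str:
    """Count duplicate lines with the 'bullet' trick (hypothetical helper)."""
    # Sort the non-empty lines; assumes no line contains the bullet itself.
    s = "\n".join(sorted(line for line in text.splitlines() if line))
    # Pass 1: mark the first line of every run of identical lines.
    s = re.sub(r"^(.+)(\n\1)*$", "•\\t\\g<0>", s, flags=re.MULTILINE)
    # Pass 2: replace every unmarked (repeated) line with a single bullet.
    s = re.sub(r"\n[^•].*", "•", s)
    # Pass 3: move the trailing bullets in front of the text.
    s = re.sub(r"^•\t([^•]+)(•+)$", "•\\2\\t\\1", s, flags=re.MULTILINE)
    # Pass 4: replace each leading run of bullets with its length.
    return re.sub(r"^•+", lambda m: str(len(m.group())), s, flags=re.MULTILINE)

print(bullet_count("c\na\nc\nb\na\nc"))
```

A line that occurs once keeps its single marker bullet, so its count comes out as 1 without any special case.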
philip

Post by Kino » 2009-03-29 08:22:28

Ah! I did not read the original script attentively. So here is something close to it.

Code:

$sp = Cast to String "\x20"
$LF = Cast to String "\n"
$doc = Document.active
$subtext = $doc.selectedSubtext
if ! $subtext.length
	exit 'Nothing selected, exit...'
end

$paragraphs = $subtext.split $LF
$freq = Hash.new
foreach $para in $paragraphs
	if $para != ''
		$freq{$para} += 1
	end
end

$keys = $freq.keys
$keys.sort 'li'
$output = ''
foreach $key in $keys
	$f = $freq{$key}
	$output &= "$f" & $sp
	# The style attribute of count number will be that of "$f" in this macro file.
	$output &= $key & $LF
end
Insert Attributed Text $output

Edit: I modified the macro so that the output preserves the style attributes (of the first occurrence). This is a great advantage of a NW Pro macro over Perl when styling matters. The algorithm is basically the same as that of the Perl script, though NW Pro's weakness in parsing deeply nested commands may make it look different.
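
The shared algorithm (count each line in a hash, sort the keys, print count and line) can be sketched in a few lines of Python for comparison. The function name count_lines is made up here, and the case-insensitive sort key is an assumption chosen for the example; it is not a statement about what the 'li' sort option does in Nisus. Style attributes are not modelled at all.

```python
from collections import Counter

def count_lines(text: str) -> str:
    """Count non-empty lines and list them in sorted order (hypothetical helper)."""
    # Counting step: one hash entry per distinct non-empty line.
    freq = Counter(line for line in text.split("\n") if line != "")
    # The keys of the hash are just a list, so they can be sorted.
    # Case-insensitive ordering is an assumption made for this sketch.
    keys = sorted(freq, key=str.casefold)
    # Output step: "count, space, line" for every distinct line.
    return "".join(f"{freq[key]} {key}\n" for key in keys)

print(count_lines("b\na\n\nb\n"), end="")
```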

Post by phspaelti » 2009-03-29 09:49:50

Now I see it. Of course the keys of the hash are an array. So they can be sorted.
Your version does have some improvements over the version that I was experimenting with. But when tested on a file with a bit less than 6000 lines, this approach is markedly slower than the Find and Replace version.
philip

Post by Kino » 2009-03-29 11:07:16

Err... if we were competing in a car race...

Code:

$start = Date.now

Select All
$sp = Cast to String "\x20"
$LF = "\n"
$doc = Document.active
$subtext = $doc.selectedSubtext
if ! $subtext.length
	exit 'Nothing selected, exit...'
end

$paragraphs = $subtext.split $LF
$freq = Hash.new
foreach $para in $paragraphs
	$freq{$para} += 1
end

$keys = $freq.keys
$keys.sort 'li'
$output = ''
foreach $key in $keys
	$output &= $freq{$key} & $sp
	$output &= $key & $LF
end
Insert Attributed Text $output

$finish = Date.now
$elapsed = $finish.secondsSinceUnixEpoch - $start.secondsSinceUnixEpoch
Exit "finished in $elapsed seconds"

This version does not remove empty paragraphs. You can make it far faster if you don't take care of style attributes at all.

Post by phspaelti » 2009-03-29 19:08:31

Kino wrote: Err... if we were competing in a car race...
Actually my point is that if we were competing we (apparently) shouldn't bother with all this hash business. The "bullet count" version I posted earlier is much faster. Your version timed to 292 seconds on my machine on the 6000 line file. The "bullet count" version took 9 seconds!

This is not a criticism of your macro writing skills. I was very surprised that the internal hash method was so much slower. I had written a version almost identical to yours myself, but since it took so long and caused beach balls (and even an occasional crash) I thought this was due to my poor understanding of the macro language. But apparently not. This method is just slow.

I have been using "bullet count" since the Nisus Classic days, only because I find it easier to write such a macro. But now with NWP it is apparently also faster in execution.
Last edited by phspaelti on 2009-03-29 19:42:29, edited 1 time in total.
philip

Post by phspaelti » 2009-03-29 19:40:33

Kino wrote: You can make it far faster if you don't take care of style attributes at all.
Actually I was wondering about this too. But what part of the macro takes care of the style attributes? The slow part of the macro is the loop which assembles the $output string. Timing individual bits of the macro, the "hash loop" takes about 2 seconds, and the "output loop" 288. While I'm sure that the slowness is due to the concatenating of styled text, I don't understand how this can be avoided.
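
For anyone who wants to time the individual phases in the same way, here is one possible shape for such a measurement, sketched in Python. The function name timed_count and the phase labels are invented for this example. Python strings carry no style attributes, so this shows only how the per-phase timing is done and will not reproduce the slowdown seen in Nisus.

```python
import time
from collections import Counter

def timed_count(text: str):
    """Count lines and report how long each phase took (hypothetical helper)."""
    timings = {}

    start = time.perf_counter()
    freq = Counter(text.split("\n"))          # the "hash loop"
    timings["hash loop"] = time.perf_counter() - start

    start = time.perf_counter()
    keys = sorted(freq)                        # sorting the hash keys
    timings["sort"] = time.perf_counter() - start

    start = time.perf_counter()
    output = "".join(f"{freq[key]} {key}\n" for key in keys)  # the "output loop"
    timings["output loop"] = time.perf_counter() - start

    return output, timings

output, timings = timed_count("b\na\nb")
for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.6f} s")
```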
philip

Post by Kino » 2009-03-30 04:25:16

phspaelti wrote:While I'm sure that the slowness is due to the concatenating of styled text, I don't understand how this can be avoided.
By making plain text the text object.

Code:

$StyledText = false

Debug.setDestination 'new'
$start = Date.now
Debug.log 'Macro started'

$sp = Cast to String "\x20"
$LF = Cast to String "\n"
$doc = Document.active
$text = $doc.text
$sel = TextSelection.newWithLocationAndLength $text, 0, $text.length
if $StyledText == true
	$textcopy = $sel.subtext
else
	$textcopy = Cast to String $sel.subtext
end
Debug.log 'Created copy of the main body text object'

$lines = $textcopy.split $LF
$textcopy = ''
Debug.log 'Text object split into an array.'

$freq = Hash.new
foreach $line in $lines
	$freq{$line} += 1
end
$lines = Array.new
Debug.log 'Counted number of occurrences'

$keys = $freq.keys
$keys.sort 'li'
Debug.log 'Finished to sort'

$output = ''
foreach $key in $keys
	$output &= $freq{$key} & $sp
	$output &= $key & $LF
end
Debug.log 'Finished to create a word list'

New
$doc = Document.active
$doc.clearAndDisableUndoHistory
if $StyledText == true
	$doc.insertText $output, 'm'
else
	$doc.insertText $output, 'a'
end

$finish = Date.now
$elapsed = $finish.secondsSinceUnixEpoch - $start.secondsSinceUnixEpoch

Debug.log "Finished in $elapsed seconds"

In the non-styled-text mode, it runs 100 times faster (!) on my test file of 22000 lines.

Post by phspaelti » 2009-03-30 07:34:21

Kino wrote: By making plain text the text object.

Code:

…
	$textcopy = Cast to String $sel.subtext
…
In the non-styled-text mode, it runs 100 times faster (!) on my test file of 22000 lines.
So that does it.
That's a nice trick. It works like a charm. Thanks!
philip

Post by js » 2009-04-02 14:05:56

Big thanks to Kino and phspaelti for their contributions. I found them extremely helpful as well as instructive!