My editorial work on an internet-only literary magazine involves accepting lots of files in various formats, usually Word 97, converting them to RTF, editing them, converting those RTF files to HTML, and massaging those files into XHTML.
I find NWP excellent to use as a primary editing tool, mainly because of its macro language, which I use to clean up the typing styles that various authors present.
I have run into a problem, though, and I wonder if anyone can help.
Sometimes a whole DOC file, and sometimes just part of a DOC file, carries infected characters, usually left or right single or double quotes, and sometimes en and em dashes. These characters have a hidden component: that is, when I delete one by backspacing over it, the first backspace deletes a character, and the second backspace seems to do nothing but in fact deletes another, invisible character.
On the other hand, if I open the DOC file with TextEdit, save the file as RTF, then open that RTF file in NWP, the infected characters have disappeared and the text is clean.
This suggests to me that NWP is failing to interpret Word 97 (or some other version of Word) punctuation glyphs correctly, glyphs that TextEdit can read correctly.
I attach two files:
tranter-nwp-flaws1.rtf
part of a Microsoft Word file opened with NWP v1.4.1 and saved as an RTF file,
showing the faulty characters.
and
tranter-nwp-flaws2.rtf
part of the same RTF file saved by NWP and then opened with TextEdit v1.6 (264) and saved as an RTF file, showing no faulty characters.
Uhhh... unbelievably, in a forum devoted to NWP, which saves all files in RTF, the RTF extension is not allowed as an upload! So I have converted the files to DOC format and uploaded them.
NWP pro seems to mis-translate some Word punctuation glyphs
- Attachments
-
tranter-nwp-flaws2.doc
- (6 KiB) Downloaded 427 times
-
tranter-nwp-flaws1.doc
- (12.91 KiB) Downloaded 426 times
Re: NWP pro seems to mis-translate some Word punctuation glyphs
Hi -
Could you upload zipped/compressed RTF files for download, please? As far as I can see, there are U+0081 control characters left in the DOC, perhaps left over from some UTF encoding. BTW, you can check this yourself with "Edit/Convert/To Unicode Code Points".
HE
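For readers without NWP at hand, the same kind of check can be sketched in Python. This is only an illustration: `c1_controls` is a made-up helper name, and it assumes the text has already been extracted from the file as a Unicode string.

```python
def c1_controls(text):
    """Return (index, 'U+XXXX') for every C1 control character in text.

    C1 controls are the code points U+0080 through U+009F; they are
    invisible in most editors, which matches the "hidden" characters
    described above.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]


# Example: a stray U+0081 sitting in front of an ordinary letter.
print(c1_controls("a\x81gb"))
```

An empty list means no C1 controls were found in the string.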
MacBook Pro i5
SSD 840/850 Pro
High Sierra 10.13.6
Nisus Writer Pro 3.4.1
Re: NWP pro seems to mis-translate some Word punctuation glyphs
jtranter wrote: Sometimes a whole DOC file, and sometimes just part of a DOC file, carries infected characters, usually left or right single or double quotes, and sometimes en and em dashes.
AFAIK there are three types of conversion bugs in OpenOffice’s import filter. If I understand Martin’s explanation, the filter converts a doc file into OO’s internal format (odt?) and, then, converts the latter into RTF. It is the RTF conversion that is buggy. If you open such a doc file and save it in RTF format in OpenOffice Writer, you will find the same errors. My copy of OO 3.2.0 crashes when I try to save in RTF, though.
The macro below tries to fix two types of the conversion bugs, including the one affecting your sample file.
Code:
# Work on a copy of the active document's text; exit if no document is open.
$doc = Document.active
if $doc == undefined
    exit
end
$text = $doc.text.copy
$c = 0
Set Exported Perl Variables 'chars'
Set Include Perl UTF Preamble false

# Type 1: U+0081 followed by any character, re-decoded as MacJapanese.
$sels = $text.findAll '\x81\p{Any}', 'E'
if $sels.count
    $c += $sels.count
    $chars = Array.new
    foreach $sel in $sels
        $chars.appendValue $sel.substring
    end
    $chars = $chars.join "\n"
    begin Perl
        binmode (STDIN, ':bytes');
        binmode (STDOUT, ':utf8');
        use Encode;
        # Turn the code points back into raw bytes, then decode them properly.
        Encode::from_to ($chars, 'utf8', 'iso-8859-1');
        $chars = decode ('macJapanese', $chars);
    end
    $chars = $chars.split "\n"
    # Replace from the end so earlier ranges stay valid.
    foreach $sel in reversed $sels
        $sel.text.replaceInRange $sel.range, $chars.pop
    end
end

# Type 2: single C1 controls (except U+0081, U+008D, U+008F, U+0090), re-decoded as cp1252.
$sels = $text.findAll '[\x80-\x9F&&[^\x81\x8D\x8F\x90]]', 'E'
if $sels.count
    $c += $sels.count
    $chars = Array.new
    foreach $sel in $sels
        $chars.appendValue $sel.substring
    end
    $chars = $chars.join "\n"
    begin Perl
        binmode (STDIN, ':bytes');
        binmode (STDOUT, ':utf8');
        use Encode;
        Encode::from_to ($chars, 'utf8', 'iso-8859-1');
        $chars = decode ('cp1252', $chars);
    end
    $chars = $chars.split "\n"
    foreach $sel in reversed $sels
        $sel.text.replaceInRange $sel.range, $chars.pop
    end
end

# Put the repaired text in a new document, or report that nothing was found.
if $c
    Document.newWithText $text
else
    exit 'No mojibake found, exiting...'
end
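For anyone who wants to see the idea behind the macro outside NWP, here is a rough Python sketch of the same two repairs. It is not the macro itself, and several things in it are assumptions: `repair`, `fix_pair` and `fix_c1` are made-up names; Python has no MacJapanese codec, so `shift_jis` stands in for it (that holds for the curly quotes but may differ for other characters); and U+009D is skipped in addition to the macro's exclusions, because Python's cp1252 codec leaves that byte undefined.

```python
import re

# U+0081 followed by a character in the Shift_JIS trail-byte range:
# treated as a mis-read two-byte sequence.
PAIR_0X81 = re.compile(r"\x81[\x40-\x7e\x80-\xfc]")

# Single C1 controls that have a cp1252 meaning. 0x81, 0x8D, 0x8F, 0x90
# and 0x9D are undefined in Python's cp1252 codec, so they are left alone.
C1_CP1252 = re.compile(r"[\x80\x82-\x8c\x8e\x91-\x9c\x9e\x9f]")


def fix_pair(match):
    """Re-decode a U+0081 pair as Shift_JIS (stand-in for MacJapanese)."""
    raw = match.group(0).encode("latin-1")
    try:
        return raw.decode("shift_jis")
    except UnicodeDecodeError:
        # Not a valid two-byte sequence: keep the original characters.
        return match.group(0)


def fix_c1(match):
    """Re-decode a single C1 control as the cp1252 byte it came from."""
    return match.group(0).encode("latin-1").decode("cp1252")


def repair(text):
    """Apply both repairs, pairs first, in the same order as the macro."""
    text = PAIR_0X81.sub(fix_pair, text)
    return C1_CP1252.sub(fix_c1, text)


print(repair("\x93Hello\x94"))      # single C1 controls
print(repair("\x81\x67Hi\x81\x68"))  # U+0081 pairs
```

In both steps the trick is the same as in the Perl blocks: the stray code points are turned back into the raw bytes they came from (via Latin-1), and those bytes are then decoded with the encoding they were really written in.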
Re: NWP pro seems to mis-translate some Word punctuation glyphs
Here are the original two RTF files, in ZIP format. Thanks.
- Attachments
-
- tranter-nwp-flaws2.rtf.zip
- (1.39 KiB) Downloaded 446 times
-
- tranter-nwp-flaws1.rtf.zip
- (3.75 KiB) Downloaded 444 times
Re: NWP pro seems to mis-translate some Word punctuation glyphs
Kino said:
<The macro below tries to fix two types of the conversion bugs, including the one affecting your sample file.>
Thanks, Kino: that works fine.
Is there any chance that a later version of NWP might include those steps when NWP opens a DOC file?
best
JT
- martin
- Official Nisus Person
- Posts: 5230
- Joined: 2002-07-11 17:14:10
- Location: San Diego, CA
- Contact:
Re: NWP pro seems to mis-translate some Word punctuation glyphs
jtranter wrote: unbelievably, in a forum devoted to NWP, which saves all files in RTF, the RTF extension is not allowed as an upload! So I have converted the files to DOC format and uploaded them.
Well, we write a word processor, not forum software.

- martin
Re: NWP pro seems to mis-translate some Word punctuation glyphs
Kino wrote: If I understand Martin’s explanation, the filter converts a doc file into OO’s internal format (odt?) and, then, converts the latter into RTF.
That's right. OO loads the ".doc" file into RAM, using the data structures native to its own operation, and then exports that as RTF for NWP to read. But one shouldn't compare OO's internal structure in RAM to any file format, even ODT; they are not concretely linked.
Kino wrote: It is the RTF conversion that is buggy. If you open such a doc file and save it in RTF format in OpenOffice Writer, you will find the same errors. My copy of OO 3.2.0 crashes when I try to save in RTF, though.
Yes, we've found OO's RTF export to have some problems, including occasional crashes, and fix them when possible.
jtranter wrote: Is there any chance that a later version of NWP might include those steps when NWP opens a DOC file?
We'll see what we can do, thanks!