NWP pro seems to mis-translate some Word punctuation glyphs

jtranter
Posts: 38
Joined: 2010-03-12 00:37:07

NWP pro seems to mis-translate some Word punctuation glyphs

Post by jtranter »

My editorial work on an internet-only literary magazine involves accepting lots of files in various formats, usually Word 97, converting them to RTF, editing them, converting those RTF files to HTML, and massaging those files into XHTML.

I find NWP excellent to use as a primary editing tool, mainly because of its macro language, which I use to clean up the typing styles that various authors present.

I have run into a problem, though, and I wonder if anyone can help.

Sometimes a whole DOC file, and sometimes just part of a DOC file, carries infected characters, usually left or right single or double quotes, and sometimes en and em dashes. These characters have a hidden component: when I delete one by backspacing over it, the first backspace deletes a character, and the second backspace seems to do nothing but in fact deletes another, invisible character.

On the other hand, if I open the DOC file with TextEdit, save the file as RTF, then open that RTF file in NWP, the infected characters have disappeared and the text is clean.

This suggests to me that NWP is failing to interpret Word 97 (or some other version of Word) punctuation glyphs correctly, glyphs that TextEdit can read correctly.

I attach two files:

tranter-nwp-flaws1.rtf
part of a Microsoft Word file opened with NWP v1.4.1 and saved as an RTF file,
showing the faulty characters.

and

tranter-nwp-flaws2.rtf
part of the same RTF file saved by NWP and then opened with TextEdit v1.6 (264) and saved as an RTF file, showing no faulty characters.


Uhhh... unbelievably, in a forum devoted to NWP, which saves all files in RTF, the RTF extension is not allowed as an upload! So I have converted the files to DOC format and uploaded them.
Attachments
tranter-nwp-flaws2.doc
(6 KiB) Downloaded 360 times
tranter-nwp-flaws1.doc
(12.91 KiB) Downloaded 354 times
Elbrecht
Posts: 354
Joined: 2007-03-31 14:59:22
Location: Frankfurt, Germany

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by Elbrecht »

Hi -

Could you please upload the RTF files zipped/compressed? As far as I can see, there are U+0081 control characters left in the DOC, maybe from some UTF encoding gone wrong. BTW, you can check this yourself with "Edit/Convert/To Unicode Code Points".
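
The same check can be done outside NWP with a short script. What follows is a minimal Python sketch, not anything NWP itself provides; the function name `c1_controls` is made up for illustration. It reports the position and code point of every character in the C1 control range (U+0080 to U+009F), which is where a stray U+0081 would show up.

```python
def c1_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each C1 control character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F  # C1 control range
    ]


if __name__ == "__main__":
    # Hypothetical sample: a stray U+0081 hiding in front of a visible letter.
    print(c1_controls("He said \x81ghello"))
```

An empty list means the text carries no C1 controls, so any remaining oddities would have a different cause.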

HE
MacBook Pro i5
SSD 840/850 Pro
High Sierra 10.13.6
Nisus Writer Pro 3.4
Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by Kino »

jtranter wrote: Sometimes a whole DOC file, and sometimes just part of a DOC file, carries infected characters, usually left or right single or double quotes, and sometimes en and em dashes.
AFAIK there are three types of conversion bugs in OpenOffice’s import filter, which NWP relies on to open DOC files. If I understand Martin’s explanation, the filter converts a doc file into OO’s internal format (odt?) and, then, converts the latter into RTF. It is the RTF conversion that is buggy. If you open such a doc file and save it in RTF format in OpenOffice Writer, you will find the same errors. My copy of OO 3.2.0 crashes when I try to save in RTF, though.

The macro below tries to fix two types of the conversion bugs, including the one affecting your sample file.

Code:

$doc = Document.active
if $doc == undefined
	exit
end
$text = $doc.text.copy

$c = 0
Set Exported Perl Variables 'chars'
Set Include Perl UTF Preamble false

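# Pass 1: a stray U+0081 plus the character after it, re-read below as one MacJapanese double-byte character
$sels = $text.findAll '\x81\p{Any}', 'E'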
if $sels.count
	$c += $sels.count
	$chars = Array.new
	foreach $sel in $sels
		$chars.appendValue $sel.substring
	end
	$chars = $chars.join "\n"
	begin Perl
		binmode (STDIN, ':bytes');
		binmode (STDOUT, ':utf8');
		use Encode;
		Encode::from_to ($chars, 'utf8', 'iso-8859-1');
		$chars = decode ('macJapanese', $chars);
	end
	$chars = $chars.split "\n"
	foreach $sel in reversed $sels
		$sel.text.replaceInRange $sel.range, $chars.pop
	end
end

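# Pass 2: remaining C1 controls (except 0x81, 0x8D, 0x8F, 0x90), re-read below as Windows-1252 bytes
$sels = $text.findAll '[\x80-\x9F&&[^\x81\x8D\x8F\x90]]', 'E'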
if $sels.count
	$c += $sels.count
	$chars = Array.new
	foreach $sel in $sels
		$chars.appendValue $sel.substring
	end
	$chars = $chars.join "\n"
	begin Perl
		binmode (STDIN, ':bytes');
		binmode (STDOUT, ':utf8');
		use Encode;
		Encode::from_to ($chars, 'utf8', 'iso-8859-1');
		$chars = decode ('cp1252', $chars);
	end
	$chars = $chars.split "\n"
	foreach $sel in reversed $sels
		$sel.text.replaceInRange $sel.range, $chars.pop
	end
end

if $c
	Document.newWithText $text
else
	exit 'No mojibake found, exiting...'
end
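
For readers without NWP at hand, the two passes of the macro can be approximated in Python. This is a rough sketch under stated assumptions, not a port of the macro: Python ships no MacJapanese codec, so `shift_jis` stands in for it; the second pass also skips 0x9D, which cp1252 leaves undefined; and the names `repair_mojibake` and `_redecode` are hypothetical.

```python
import re

# A stray U+0081 followed by a character that could be a Shift_JIS trail byte.
LEAD_0X81 = re.compile(r"\x81[\x40-\x7e\x80-\xfc]")
# C1 controls that cp1252 assigns (0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined).
C1_CP1252 = re.compile(r"[\x80\x82-\x8c\x8e\x91-\x9c\x9e\x9f]")


def _redecode(chunk: str, codec: str) -> str:
    """Map each code point back to a single byte, then decode with codec."""
    try:
        return chunk.encode("latin-1").decode(codec)
    except UnicodeError:
        return chunk  # not valid in that codec: leave the text untouched


def repair_mojibake(text: str) -> str:
    # Pass 1: U+0081 + next character, re-read as one double-byte character.
    # Assumption: shift_jis approximates the macro's MacJapanese decoding.
    text = LEAD_0X81.sub(lambda m: _redecode(m.group(0), "shift_jis"), text)
    # Pass 2: remaining C1 controls, re-read as Windows-1252 bytes.
    return C1_CP1252.sub(lambda m: _redecode(m.group(0), "cp1252"), text)
```

For example, `repair_mojibake("\x93Hi\x94")` returns the word wrapped in curly double quotes (U+201C and U+201D). Pass 1 has to run first, as in the macro, because the character after U+0081 can itself fall in the C1 range.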
jtranter
Posts: 38
Joined: 2010-03-12 00:37:07

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by jtranter »

Here are the original two RTF files, in ZIP format. Thanks.
Attachments
tranter-nwp-flaws2.rtf.zip
(1.39 KiB) Downloaded 374 times
tranter-nwp-flaws1.rtf.zip
(3.75 KiB) Downloaded 370 times
jtranter
Posts: 38
Joined: 2010-03-12 00:37:07

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by jtranter »

Kino wrote: The macro below tries to fix two types of the conversion bugs, including the one affecting your sample file.

Thanks, Kino: that works fine.

Is there any chance that a later version of NWP might include those steps when NWP opens a DOC file?

best
JT
martin
Official Nisus Person
Posts: 5227
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by martin »

jtranter wrote: unbelievably, in a forum devoted to NWP, which saves all files in RTF, the RTF extension is not allowed as an upload! So I have converted the files to DOC format and uploaded them.
Well, we write a word processor, not forum software ;) but I've fixed the annoyance by adding the extension to phpBB's allowed file types; thanks for letting us know.
martin
Official Nisus Person
Posts: 5227
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: NWP pro seems to mis-translate some Word punctuation glyphs

Post by martin »

Kino wrote: If I understand Martin’s explanation, the filter converts a doc file into OO’s internal format (odt?) and, then, converts the latter into RTF.
That's right. OO loads the ".doc" file into RAM, using the data structures native to its own operation, and then exports that as RTF for NWP to read. But one shouldn't compare OO's internal structure in RAM to any file format, even ODT; they are not concretely linked.
Kino wrote: It is the RTF conversion that is buggy. If you open such a doc file and save it in RTF format in OpenOffice Writer, you will find the same errors. My copy of OO 3.2.0 crashes when I try to save in RTF, though.
Yes, we've found OO's RTF export to have some problems, including occasional crashes, and we fix them when possible.
jtranter wrote: Is there any chance that a later version of NWP might include those steps when NWP opens a DOC file?
We'll see what we can do, thanks!