Create Index for Japanese Documents

Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Create Index for Japanese Documents

Post by Kino » 2017-03-31 10:22:42

Here is a macro which [1] indexes Japanese words in the frontmost document as “yomi<space>word”, [2] inserts the Index, and [3] removes the “yomi”, which is needed only for sorting the Index entries.

### Generate Index ###

# This is a sample NWP macro that generates an Index for Japanese documents.
# Run against “kusabana.rtf”, it generates the Index at the end of the file.
# For other documents, define as many $item entries as needed.
# This macro indexes each word as “yomi<space>word”; the yomi serves as the sort key.
# After the Index is generated, the “yomi” and its trailing space are removed.

$numColumn = 1 # Number of columns
$removeYomi = @true

$doc = Document.active
if ! $doc
	exit 'No open document, exiting...'
end

Debug.setDestination 'new'

$item = Hash.new
	# $item{'よみ 索引対象語'} = Cast to String 'Find Expression'
$item{'がんぴ 眼皮'} = Cast to String '眼皮'
$item{'せきちく 石竹'} = Cast to String '石竹'
$item{'だるまだいし 達磨大師'} = Cast to String '達磨大師'
$item{'わかんさんさいずえ 和漢三才図会'} = Cast to String '和漢三才図会'
$item{'かあねいしょん カーネイション'} = Cast to String 'カーネイション'
$item{'ちょうじ 丁子'} = Cast to String '丁子'
$item{'なでしこ ナデシコ'} = Cast to String '(?<!\p{Katakana})ナデシコ'
	# 'ナデシコ' not preceded by any Katakana letter (\p{Katakana})
$item{'なでしこ ナデシコ:からなでしこ カラナデシコ'} = Cast to String 'カラナデシコ'
$item{'なでしこ ナデシコ:こなでしこ コナデシコ'} = Cast to String 'コナデシコ'
$item{'なでしこ ナデシコ:むしとりなでしこ ムシトリナデシコ'} = Cast to String 'ムシトリナデシコ'
$item{'ふじわらのさだいえ 藤原定家'} = Cast to String '定家'
$item{'ふじわらのさだいえ 藤原定家:しんちょくせんしゅう 『新勅撰集』'} = Cast to String '新勅撰集'
$item{'ぱあすれい パースレイ'} = Cast to String 'パースレイ'
$item{'ありすとてれす アリストテレス'} = Cast to String 'アリストテレス'
$item{'だりや ダ(ー)リヤ'} = Cast to String 'ダリヤ|ダーリヤ'
	# 'ダリヤ' or 'ダーリヤ'
$item{'かゔあにゅす カヴアニュス'} = Cast to String 'カヴアニュス'
$item{'かゔあにゅす カヴアニュス:すぺいんしょくぶつずせつ 西班牙植物図説'} = Cast to String '西班牙植物図説'
$item{'はやしじゅっさい 林述斎'} = Cast to String '林述斎'
$item{'わかつきれいじろう 若槻禮次郞'} = Cast to String '若槻首相'
$item{'きんげんぞう 金源三'} = Cast to String '金源三' # ??? I don’t know
$item{'みなもとのとしより 源俊頼'} = Cast to String '俊頼'
$item{'にほん 日本'} = Cast to String '(?<!\p{Han})日本(?!\p{Han})'
	# '日本' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━じん ━人'} = Cast to String '(?<!\p{Han})日本人(?!\p{Han})'
	# '日本人' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━および━ 「━及━」'} = Cast to String '「日本及日本人」'
$item{''} = Cast to String ''	# An empty $item entry seems to have no effect
$item{''} = Cast to String ''


$errors = Array.new

foreach $i in $item.keys
	$sels = $doc.text.findAll $item{$i}, 'E-i', '-am'
	if $sels.count
		Push Target Selection $sels
			Add to Text Index As $i
		Pop Target Selection
	else
		$errors.appendValue $item{$i}
	end
end

# Define cross-references
$crossRef = Hash.new
$crossRef{'ていか 定家'} = Cast to String '藤原定家'
$crossRef{'としより 俊頼'} = Cast to String '源俊頼'

foreach $i in $crossRef.keys
	$sel = $doc.text.find $crossRef{$i}, 'E-ir'
	$crossRefString = $crossRef{$i}
	if $sel
		Push Target Selection $sel
			Add to Text Index As $i, $crossRefString
		Pop Target Selection
	else
		$errors.appendValue $crossRef{$i}
	end
end

# Exclude words in the “Not for Index” character style from the Index
$style = $doc.styleWithName 'Not for Index'
$exclude = $doc.text.findAll $style
if $exclude.count
	Push Target Selection $exclude
		Remove Text Indexing
	Pop Target Selection
end

Select Document End
Document.setActive $doc
Menu.activateAtPath(':Tools:Index:Insert Index')
Send Text $numColumn  
Press Button 'Insert'

if $removeYomi
	Document.setActive $doc
	Replace All '^\S+ ', '', 'EsS-i'
	# Remove visible characters at each paragraph start together with trailing space
end

if $errors.count
	$errmsg = Cast to String 'Failed: '
	foreach $err in $errors
		$err = $errmsg & $err
		Debug.log $err
	end
end

### end of macro
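The “yomi<space>word” convention used by the macro can be illustrated outside NWP. The following is only a minimal Python sketch, not part of the macro: `build_index` is a hypothetical name, and plain code-point sorting stands in for the Japanese collation NWP applies when it builds the Index (an assumption; the two orders can differ, for example for voiced kana).

```python
import re

def build_index(entries):
    """Sort "yomi<space>word" entries by reading, then drop the reading.

    Assumption: code-point order of the Hiragana yomi approximates the
    order NWP would produce; real Japanese collation may differ.
    """
    ordered = sorted(entries)
    # Same idea as the macro's Replace All '^\S+ ': remove the leading
    # run of non-space characters together with the space that follows it.
    return [re.sub(r"^\S+ ", "", entry) for entry in ordered]

entries = ["なでしこ ナデシコ", "がんぴ 眼皮", "せきちく 石竹", "にほん 日本"]
print(build_index(entries))
```

Without the yomi prefix, the Kanji entries would be ordered by their character codes rather than by their readings, which is why the prefix is added first and stripped only after the Index exists.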
With a standard contemporary Japanese document, it would perhaps be safer to enclose every Katakana word between (?<!\p{Katakana}|\x{30FC}) and (?!\p{Katakana}|\x{30FC}) (\x{30FC} stands for “ー”), and every Kanji word between (?<!\p{Han}) and (?!\p{Han}), so that, for example, “シリア” matches neither “シリアノス” nor “アッシリア”. Of course, that depends on how the document is written.
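The lookaround guards described above can be tried out in Python. This is a rough sketch under stated assumptions: Python's built-in `re` module has no `\p{Katakana}` or `\p{Han}`, so explicit code-point ranges are used as approximations, and `guarded` is a hypothetical helper name, not an NWP command.

```python
import re

# Assumed stand-ins for \p{Katakana} (plus ー, U+30FC) and \p{Han};
# these ranges are approximations, not the full Unicode script sets.
KATAKANA = "\u30A1-\u30FA\u30FC-\u30FF"
HAN = "\u3400-\u4DBF\u4E00-\u9FFF"


def guarded(word):
    """Wrap a pure-Katakana or pure-Kanji word in same-script lookarounds."""
    for script in (KATAKANA, HAN):
        if re.fullmatch(f"[{script}]+", word):
            return f"(?<![{script}]){re.escape(word)}(?![{script}])"
    return re.escape(word)  # mixed-script words are left unguarded


pattern = guarded("シリア")
for sample in ("シリアの歴史", "シリアノス", "アッシリア"):
    print(sample, bool(re.search(pattern, sample)))
```

Only the first sample should be reported as a match, because in the other two the word is part of a longer Katakana run.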
Macro&SampleFile.zip
Macro and a sample file “kusabana.rtf”
(21.34 KiB) Downloaded 116 times
Last edited by Kino on 2017-03-31 11:54:38, edited 3 times in total.

Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: Create Index for Japanese Documents

Post by Kino » 2017-03-31 10:59:08

It is tedious to type the yomi (the pronunciation of a Kanji word, written in Hiragana or Katakana letters) for each Index entry. Here is a macro which inserts “yomi<space>” before each selected Japanese word.

### Prepend Yomi ###

# This macro adds yomi before the selected Japanese word(s) in Kanji, Katakana, or Hiragana.
# This macro requires mecab <http://taku910.github.io/mecab/>, installed manually or via Homebrew, MacPorts, etc.
# The dictionary used by mecab (usually “ipadic”) should be encoded in UTF-8.
# If you use Homebrew, the following Terminal commands will do the job.
# 	brew install mecab
# 	brew install mecab-ipadic
# If you have mecab installed in a directory other than /usr/local/bin, modify the value of $mecabPath accordingly.
# The yomi returned by mecab is not always correct. Add as many such words to $defYomi as needed.
# This macro requires a Services module 'Transform Clipboard by Shell Script.workflow' in /Users/<you>/Library/Services, defined as follows:
#	on run
#		set scriptvar to (the clipboard)
#		set the clipboard to (do shell script scriptvar)
#	end run
# It executes the shell script in the clipboard and writes its output back to the clipboard.


$mecabPath = '/usr/local/bin/mecab'
$betweenYomi = '*'
$afterYomi = Cast to String " "
$VowelizeProlongedSoundMark = @true

$defYomi = Hash.new	# Add words mecab does not understand well
$defYomi{'形相'} = Cast to String 'けいそう'	# mecab returns 'ぎょうそう'
$defYomi{'本性'} = Cast to String 'ほんせい'	# mecab returns 'ほんしょう'
$defYomi{'離存'} = Cast to String 'りそん'	# mecab does not know the word
$defYomi{'冒瀆'} = Cast to String 'ぼうとく'	# mecab does not understand non-JIS X 0208 chars
$defYomi{'碧巌録'} = Cast to String 'へきがんろく'	# mecab does not know the word
$defYomi{'西班牙'} = Cast to String 'すぺいん'	# mecab does not know the word

$doc = Document.active
if ! $doc
	exit 'No open document, exiting...'
end

$sels = $doc.textSelections
if ! $sels.firstValue.length
	exit 'No selection, exiting...'
end

$str = $doc.selectedSubstrings.join($betweenYomi)
$script = "echo '$str' | $mecabPath -O yomi | perl -Mutf8 -CS -pe 'tr/[ァ-ヶヽヾ]/[ぁ-ゖゝゞ]/'"
Write Clipboard $script

Menu.activateAtPath('Services:Transform Clipboard by Shell Script')

$converted = Read Clipboard

while $converted == $script
	Sleep 0.2
	$converted = Read Clipboard
end

if $VowelizeProlongedSoundMark	# Necessary for correct sort order
	$converted.replaceAll '(?<=[あぁかゕがさざただなはばぱまやゃらわゎ])ー', 'あ', 'E'
	$converted.replaceAll '(?<=[いぃきぎしじちぢにひびぴみりゐ])ー', 'い', 'E'
	$converted.replaceAll '(?<=[うぅゔくぐすずつづぬふぶぷむゆゅる])ー', 'う', 'E'
	$converted.replaceAll '(?<=[えぇけゖげせぜてでねへべぺめれゑ])ー', 'え', 'E'
	$converted.replaceAll '(?<=[おぉこごそぞとどのほぼぽもよょろを])ー', 'お', 'E'
end

$converted = Cast to Attributed String $converted
$ja = Language.languageWithCode('ja')
$fontName = $sels[0].text.displayAttributesAtIndex($sels[0].location).fontName

Push Target Text $converted
	$ja.apply		# Perhaps necessary for correct sort order
	Set Font Name $fontName
Pop Target Text

$converted = $converted.split($betweenYomi)

foreach $sel in reversed $sels
	$yomi = $converted.pop & $afterYomi
	$sel.text.insertAtIndex $sel.location, $yomi
end

### end of macro
The macro relies on “mecab” (a Unix program) and “Transform Clipboard by Shell Script.workflow” (a Services module). The latter is a slightly modified version of Philip Spaelti’s “Run Shell Script from Clipboard” viewtopic.php?f=17&t=5948#p27249.
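
The two text transformations the macro applies to mecab's output (Katakana to Hiragana, then replacing the prolonged sound mark “ー” with the preceding vowel) can be sketched in Python as follows. This is an illustration only, not a replacement for the macro: the function names are hypothetical, and the row tables are copied from the macro's own replaceAll lines.

```python
import re

# Katakana ァ..ヶ and the iteration marks ヽ ヾ sit exactly 0x60 above
# their Hiragana counterparts, so the mapping is a fixed offset.
_KATA_TO_HIRA = {
    cp: cp - 0x60 for cp in list(range(0x30A1, 0x30F7)) + [0x30FD, 0x30FE]
}

# Kana grouped by vowel, taken from the macro's lookbehind classes.
_VOWEL_ROWS = {
    "あ": "あぁかゕがさざただなはばぱまやゃらわゎ",
    "い": "いぃきぎしじちぢにひびぴみりゐ",
    "う": "うぅゔくぐすずつづぬふぶぷむゆゅる",
    "え": "えぇけゖげせぜてでねへべぺめれゑ",
    "お": "おぉこごそぞとどのほぼぽもよょろを",
}


def to_hiragana(text):
    """Map Katakana letters to Hiragana; ー (U+30FC) is left untouched."""
    return text.translate(_KATA_TO_HIRA)


def vowelize(text):
    """Replace ー with the vowel of the kana immediately before it."""
    for vowel, row in _VOWEL_ROWS.items():
        text = re.sub(f"(?<=[{row}])ー", vowel, text)
    return text


print(vowelize(to_hiragana("カーネイション")))
```

Writing the long vowel out explicitly gives the sort key a plain kana at that position instead of “ー”, which is the reason the macro gives for this step (“Necessary for correct sort order”).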

This macro is terribly slow compared with an earlier macro that called mecab via perl, an approach which was killed by the Sandbox ;-( Running it with a single word selected is therefore very irritating. It is better to run it on multiple non-contiguous selections.

FEATURE REQUEST! A menu command which converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word, written in Hiragana or Katakana letters), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API which makes those features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has long had such a menu command, “:Tools:Kanji to Hiragana”.
Macro&ServicesModule.zip
(36.36 KiB) Downloaded 109 times
Last edited by Kino on 2017-03-31 11:45:31, edited 2 times in total.

Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: Create Index for Japanese Documents

Post by Kino » 2017-03-31 11:10:27

And here is a macro that creates a Japanese word list from the frontmost document, which you might find useful.

### Japanese Word List ###

# Create a list of Katakana or Kanji words from the frontmost document.

$lang = Language.languageWithCode 'ja'
# Use zh_TW for Traditional Chinese or zh_CN for Simplified Chinese

Debug.setCodeProfilingEnabled false

$LF = Cast to String "\n"
$tab = Cast to String "\t"
$doc = Document.active
if $doc == undefined
	exit 'No open document, exiting...'
end

$findExp = Hash.new  # a, b, c,... are used as sort keys

$findExp{'a. Katakana words including ジャン゠ソウル・パルトル, F・カフカ, Ch. ミュンシュ'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*'
$findExp{'b. Katakana/Kanji words such as 老子, 新プラトン主義, フィリップ・K・ディック'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?:(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*|\p{Han}+(?:\g<p>\p{Han}+)*)+'
$findExp{'c. Separate Kanji chars'} = '\p{Han}'
$findExp{'d. One or more consecutive Kanji chars'} = '\p{Han}+'
$findExp{'e. Two or more consecutive Kanji chars'} = '\p{Han}{2,}'
$findExp{'f. Kanji followed by 1 or 2 Hiragana'} = '\p{Han}+\p{Hiragana}{1,2}'

$menuItems = $findExp.keys
$menuItems.sort
$input = Prompt Options 'List up . . .', '', '', $menuItems
$selections = $doc.selectedSubstrings

if $selections.firstValue.length
	$LF = Cast to String "\n"
	$selections = $selections.join $LF
	$sels = $selections.findAll $findExp{$input}, 'E-i'
else
	$sels = $doc.text.findAll $findExp{$input}, 'E-i'
end

if ! $sels.count
	exit 'No word found, exiting...'
end

$str = Hash.new
foreach $sel in $sels
	$str{$sel.substring} += 1
end

$words = $str.keys
$words.sort 'li', $lang
foreach $i, $word in $words
	$ocr = $str{$word} & $tab
	$words[$i] = $ocr & $word
end
$words = $words.join $LF
Push Target Text $words
	$lang.apply
Pop Target Text
Document.newWithText $words

### end of macro ###
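
The counting step of the macro (find every match, tally identical strings, output “count<tab>word”) can be sketched in Python. This is a simplified illustration under assumptions: only option “e” (two or more consecutive Kanji) is reproduced, an explicit code-point range approximates `\p{Han}`, `word_list` is a hypothetical name, and words are sorted by code point rather than with the Japanese collation the macro requests.

```python
import re
from collections import Counter

# Approximation of \p{Han}{2,}: runs of two or more CJK Unified Ideographs.
HAN_RUN = re.compile("[\u4E00-\u9FFF]{2,}")


def word_list(text):
    """Return "count<TAB>word" lines for every Kanji run found in text."""
    counts = Counter(HAN_RUN.findall(text))
    # Assumption: code-point order stands in for language-aware sorting.
    return [f"{n}\t{word}" for word, n in sorted(counts.items())]


for line in word_list("石竹と日本の石竹、眼皮"):
    print(line)
```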
JapaneseWordList.nwm.zip
Macro file
(5.43 KiB) Downloaded 109 times
Last edited by Kino on 2017-03-31 11:45:54, edited 1 time in total.

Kino
Posts: 400
Joined: 2008-05-17 04:02:32

Re: Create Index for Japanese Documents

Post by Kino » 2017-03-31 11:37:46

With those macros (not exactly the same ones, because special formatting had to be applied), I created the 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except the 総索引 (the general index of all the volumes, included in the last volume, 別巻). This would have been impossible had I not been an NWP user. A thousand thanks, Mr. Nisus and Nice Us people! :-)

phspaelti
Posts: 912
Joined: 2007-02-07 00:58:12
Location: Japan

Re: Create Index for Japanese Documents

Post by phspaelti » 2017-03-31 19:40:31

Kino wrote: FEATURE REQUEST! Menu command which converts selected Japanese words into their yomi (pronunciation of Kanji/Katakana word represented by Hiragana or Katakana letters) and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has API which makes possible those features because Jedit X http://www.artman21.com/en/jedit_x/ has such a menu command “:Tools:Kanji to Hiragana” since longtime ago.
I second that feature request
philip

credneb
Posts: 146
Joined: 2007-03-28 07:30:34

Re: Create Index for Japanese Documents

Post by credneb » 2017-04-01 08:34:15

I third the feature request.

Nobumi Iyanaga
Posts: 151
Joined: 2007-01-17 05:46:17
Location: Tokyo, Japan

Re: Create Index for Japanese Documents

Post by Nobumi Iyanaga » 2017-04-01 21:15:38

This is an essential feature. Please add it to NW!
Best regards,

Nobumi Iyanaga
Tokyo,
Japan

Elbrecht
Posts: 335
Joined: 2007-03-31 14:59:22
Location: Frankfurt, Germany

Re: Create Index for Japanese Documents

Post by Elbrecht » 2017-04-12 00:08:31

Hi Kino –

I love to read Toshihiko IZUTSU – but sorry English only! This Teheran Connection continues to work with Sachiko MURATA – great reading too!

HE
Kino wrote:With those macros (not exactly the same because of the necessity for applying special formatting), I created 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except 総索引 (general index of all the volumes included in the last volume 別巻). Impossible if I had not been a NWP user. Thousand thanks, Mr. Nisus and Nice Us people! :-)
MacBook Pro i5
SSD 840/850 Pro
macOS Sierra 10.12.6
Nisus Writer Pro 2.1.8

martin
Official Nisus Person
Posts: 4261
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: Create Index for Japanese Documents

Post by martin » 2017-05-11 16:48:36

phspaelti wrote:
Kino wrote: FEATURE REQUEST! Menu command which converts selected Japanese words into their yomi (pronunciation of Kanji/Katakana word represented by Hiragana or Katakana letters) and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has API which makes possible those features because Jedit X http://www.artman21.com/en/jedit_x/ has such a menu command “:Tools:Kanji to Hiragana” since longtime ago.
I second that feature request
Thank you to all of you for letting us know this was important to you. I've added the potential enhancement to our issue tracker.
