nisus.com
https://www.nisus.com/forum/

Create Index for Japanese Documents
https://www.nisus.com/forum/viewtopic.php?f=17&t=6715

Author:  Kino [ 2017-03-31 10:22:42 ]
Post subject:  Create Index for Japanese Documents

Here is a macro which [1] indexes Japanese words in the frontmost document as “yomi<space>word”, [2] inserts the Index, and [3] removes the “yomi”, which is needed only for sorting the Index entries.
Code:
### Generate Index ###

# This is a sample NWP macro to generate Index for Japanese documents.
# Run against “kusabana.rtf”, it generates an Index at the end of the file.
# For other documents, you have to define as many $item entries as needed.
# This macro indexes a word as “yomi<space>word”, in which the yomi serves as the sort key.
# After the Index is generated, each “yomi” is removed together with its trailing space.

$numColumn = 1 # Number of columns
$removeYomi = @true

$doc = Document.active
if ! $doc
   exit 'No open document, exiting...'
end

Debug.setDestination 'new'

$item = Hash.new
   # $item{'よみ 索引対象語'} = Cast to String 'Find Expression'
$item{'がんぴ 眼皮'} = Cast to String '眼皮'
$item{'せきちく 石竹'} = Cast to String '石竹'
$item{'だるまだいし 達磨大師'} = Cast to String '達磨大師'
$item{'わかんさんさいずえ 和漢三才図会'} = Cast to String '和漢三才図会'
$item{'かあねいしょん カーネイション'} = Cast to String 'カーネイション'
$item{'ちょうじ 丁子'} = Cast to String '丁子'
$item{'なでしこ ナデシコ'} = Cast to String '(?<!\p{Katakana})ナデシコ'
   # 'ナデシコ' not preceded by any Katakana letter (\p{Katakana})
$item{'なでしこ ナデシコ:からなでしこ カラナデシコ'} = Cast to String 'カラナデシコ'
$item{'なでしこ ナデシコ:こなでしこ コナデシコ'} = Cast to String 'コナデシコ'
$item{'なでしこ ナデシコ:むしとりなでしこ ムシトリナデシコ'} = Cast to String 'ムシトリナデシコ'
$item{'ふじわらのさだいえ 藤原定家'} = Cast to String '定家'
$item{'ふじわらのさだいえ 藤原定家:しんちょくせんしゅう 『新勅撰集』'} = Cast to String '新勅撰集'
$item{'ぱあすれい パースレイ'} = Cast to String 'パースレイ'
$item{'ありすとてれす アリストテレス'} = Cast to String 'アリストテレス'
$item{'だりや ダ(ー)リヤ'} = Cast to String 'ダリヤ|ダーリヤ'
   # 'ダリヤ' or 'ダーリヤ'
$item{'かゔあにゅす カヴアニュス'} = Cast to String 'カヴアニュス'
$item{'かゔあにゅす カヴアニュス:すぺいんしょくぶつずせつ 西班牙植物図説'} = Cast to String '西班牙植物図説'
$item{'はやしじゅっさい 林述斎'} = Cast to String '林述斎'
$item{'わかつきれいじろう 若槻禮次郞'} = Cast to String '若槻首相'
$item{'きんげんぞう 金源三'} = Cast to String '金源三' # ??? I don’t know
$item{'みなもとのとしより 源俊頼'} = Cast to String '俊頼'
$item{'にほん 日本'} = Cast to String '(?<!\p{Han})日本(?!\p{Han})'
   # '日本' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━じん ━人'} = Cast to String '(?<!\p{Han})日本人(?!\p{Han})'
   # '日本人' neither preceded nor followed by any Kanji letter (\p{Han})
$item{'にほん 日本:━および━ 「━及━」'} = Cast to String '「日本及日本人」'
$item{''} = Cast to String ''   # An empty $item seems to have no effect
$item{''} = Cast to String ''


$errors = Array.new

foreach $i in $item.keys
   $sels = $doc.text.findAll $item{$i}, 'E-i', '-am'
   if $sels.count
      Push Target Selection $sels
         Add to Text Index As $i
      Pop Target Selection
   else
      $errors.appendValue $item{$i}
   end
end

# Define cross-references
$crossRef = Hash.new
$crossRef{'ていか 定家'} = Cast to String '藤原定家'
$crossRef{'としより 俊頼'} = Cast to String '源俊頼'

foreach $i in $crossRef.keys
   $sel = $doc.text.find $crossRef{$i}, 'E-ir'
   $crossRefString = $crossRef{$i}
   if $sel
      Push Target Selection $sel
         Add to Text Index As $i, $crossRefString
      Pop Target Selection
   else
      $errors.appendValue $crossRef{$i}
   end
end

# Exclude words in the “Not for Index” character style from the Index
$style = $doc.styleWithName 'Not for Index'
$exclude = $doc.text.findAll $style
if $exclude.count
   Push Target Selection $exclude
      Remove Text Indexing
   Pop Target Selection
end

Select Document End
Document.setActive $doc
Menu.activateAtPath(':Tools:Index:Insert Index')
Send Text $numColumn 
Press Button 'Insert'

if $removeYomi
   Document.setActive $doc
   Replace All '^\S+ ', '', 'EsS-i'
   # Remove the run of non-whitespace characters at the start of each paragraph, together with its trailing space
end

if $errors.count
   $errmsg = Cast to String 'Failed: '
   foreach $err in $errors
      $err = $errmsg & $err
      Debug.log $err
   end
end

### end of macro

With a standard contemporary Japanese document, it would probably be safer to enclose every Katakana word between (?<!\p{Katakana}|\x{30FC}) and (?!\p{Katakana}|\x{30FC}) (\x{30FC} stands for “ー”), and every Kanji word between (?<!\p{Han}) and (?!\p{Han}), so that, for example, “シリア” matches neither “シリアノス” nor “アッシリア”. Of course, that depends on how the document is written.
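The guarding idea can be tried outside NWP. The following is only a rough Python sketch, not part of the macro: Python's built-in re module has no \p{Katakana} or \p{Han}, so the character ranges below are approximations, and the helper name guarded is hypothetical.

```python
import re

# Approximate stand-ins for \p{Katakana} (plus the prolonged sound mark U+30FC)
# and \p{Han}; these ranges are assumptions, not the exact Unicode script sets.
KATAKANA = "\u30A1-\u30FA\u30FC-\u30FF"
HAN = "\u3400-\u4DBF\u4E00-\u9FFF"

def guarded(word):
    """Wrap a word in negative lookarounds (hypothetical helper).

    A word written entirely in Katakana is guarded against neighbouring
    Katakana; any other word is guarded against neighbouring Kanji.
    """
    cls = KATAKANA if re.fullmatch(f"[{KATAKANA}]+", word) else HAN
    return re.compile(f"(?<![{cls}]){re.escape(word)}(?![{cls}])")

text = "シリアとアッシリア、シリアノス。日本と日本人、全日本。"

# Only the standalone occurrences are found.
print(guarded("シリア").findall(text))
print([m.start() for m in guarded("日本").finditer(text)])
```

Here “アッシリア” and “シリアノス” are skipped because a Katakana letter touches the match, and “日本人” and “全日本” are skipped because a Kanji letter does.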
Attachment:
File comment: Macro and a sample file “kusabana.rtf”
Macro&SampleFile.zip [21.34 KiB]
Downloaded 54 times

Author:  Kino [ 2017-03-31 10:59:08 ]
Post subject:  Re: Create Index for Japanese Documents

It is tedious to type the yomi (the pronunciation of a Kanji word written in Hiragana or Katakana) for each Index entry. Here is a macro which inserts “yomi<space>” before the selected Japanese word(s).
Code:
### Prepend Yomi ###

# This macro adds the yomi before the selected Japanese word(s) in Kanji, Katakana, or Hiragana.
# This macro requires mecab <http://taku910.github.io/mecab/> installed manually or via Homebrew, MacPorts, etc.
# Dictionary used by mecab (usually “ipadic”) should be encoded in UTF-8.
# If you use Homebrew, the following Terminal commands will do the job.
#    brew install mecab
#    brew install mecab-ipadic
# If you have mecab installed in a directory other than /usr/local/bin, modify the value of $mecabPath accordingly.
# The yomi returned by mecab is not always correct. Add as many such words to $defYomi as needed.
# This macro requires a Services module 'Transform Clipboard by Shell Script.workflow' in /Users/<you>/Library/Services with the following AppleScript:
#   on run
#      set scriptvar to (the clipboard)
#      set the clipboard to (do shell script scriptvar)
#   end run
# which executes the shell script on the clipboard and writes its output back to the clipboard.


$mecabPath = '/usr/local/bin/mecab'
$betweenYomi = '*'
$afterYomi = Cast to String " "
$VowelizeProlongedSoundMark = @true

$defYomi = Hash.new   # Add words mecab does not understand well
$defYomi{'形相'} = Cast to String 'けいそう'   # mecab returns 'ぎょうそう'
$defYomi{'本性'} = Cast to String 'ほんせい'   # mecab returns 'ほんしょう'
$defYomi{'離存'} = Cast to String 'りそん'   # mecab does not know the word
$defYomi{'冒瀆'} = Cast to String 'ぼうとく'   # mecab does not understand non-JIS X 0208 chars
$defYomi{'碧巌録'} = Cast to String 'へきがんろく'   # mecab does not know the word
$defYomi{'西班牙'} = Cast to String 'すぺいん'   # mecab does not know the word

$doc = Document.active
if ! $doc
   exit 'No open document, exiting...'
end

$sels = $doc.textSelections
if ! $sels.firstValue.length
   exit 'No selection, exiting...'
end

$str = $doc.selectedSubstrings.join($betweenYomi)
$script = "echo '$str' | $mecabPath -O yomi | perl -Mutf8 -CS -pe 'tr/[ァ-ヶヽヾ]/[ぁ-ゖゝゞ]/'"
Write Clipboard $script

Menu.activateAtPath('Services:Transform Clipboard by Shell Script')

$converted = Read Clipboard

while $converted == $script
   Sleep 0.2
   $converted = Read Clipboard
end

if $VowelizeProlongedSoundMark   # Necessary for correct sort order
   $converted.replaceAll '(?<=[あぁかゕがさざただなはばぱまやゃらわゎ])ー', 'あ', 'E'
   $converted.replaceAll '(?<=[いぃきぎしじちぢにひびぴみりゐ])ー', 'い', 'E'
   $converted.replaceAll '(?<=[うぅゔくぐすずつづぬふぶぷむゆゅる])ー', 'う', 'E'
   $converted.replaceAll '(?<=[えぇけゖげせぜてでねへべぺめえれゑ])ー', 'え', 'E'
   $converted.replaceAll '(?<=[おぉこごそぞとどのほぼぽもよょろを])ー', 'お', 'E'
end

$converted = Cast to Attributed String $converted
$ja = Language.languageWithCode('ja')
$fontName = $sels[0].text.displayAttributesAtIndex($sels[0].location).fontName

Push Target Text $converted
   $ja.apply      # Perhaps necessary for correct sort order
   Set Font Name $fontName
Pop Target Text

$converted = $converted.split($betweenYomi)

foreach $sel in reversed $sels
   $yomi = $converted.pop & $afterYomi
   $sel.text.insertAtIndex $sel.location, $yomi
end

### end of macro

The macro relies on “mecab” (Unix program) and “Transform Clipboard by Shell Script.workflow” (Services module). The latter is a slightly modified version of Philip Spaelti’s “Run Shell Script from Clipboard” https://nisus.com/forum/viewtopic.php?f=17&t=5948#p27249.

This macro is terribly slow compared with a macro which calls mecab via perl, an approach that was killed by the Sandbox ;-( So running it with a single word selected is very irritating. It is better to run it once on multiple non-contiguous selections.
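For anyone who wants to check the kana handling without NWP or mecab, here is a small Python sketch of the two text steps in the macro: converting the Katakana yomi to Hiragana (what the perl tr does) and replacing each “ー” with the vowel of the preceding kana (the $VowelizeProlongedSoundMark part). The function names are hypothetical; the kana rows are copied from the macro.

```python
import re

# Katakana ァ..ヶ map to Hiragana ぁ..ゖ by subtracting 0x60; ヽヾ map to ゝゞ.
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}
KATA_TO_HIRA.update({0x30FD: 0x309D, 0x30FE: 0x309E})

# Kana grouped by vowel, as in the macro's replaceAll expressions.
ROWS = {
    "あ": "あぁかゕがさざただなはばぱまやゃらわゎ",
    "い": "いぃきぎしじちぢにひびぴみりゐ",
    "う": "うぅゔくぐすずつづぬふぶぷむゆゅる",
    "え": "えぇけゖげせぜてでねへべぺめれゑ",
    "お": "おぉこごそぞとどのほぼぽもよょろを",
}

def to_hiragana(s):
    """Convert Katakana to Hiragana; 'ー' (U+30FC) is left untouched."""
    return s.translate(KATA_TO_HIRA)

def vowelize(s):
    """Replace 'ー' with the vowel of the kana that precedes it."""
    for vowel, row in ROWS.items():
        s = re.sub(f"(?<=[{row}])ー", vowel, s)
    return s

for word in ("カーネイション", "パースレイ"):
    print(vowelize(to_hiragana(word)), word)
```

The two sample words give the same sort keys that are typed by hand in the first macro ('かあねいしょん' and 'ぱあすれい').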

FEATURE REQUEST! A menu command that converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API that makes these features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.
Attachment:
Macro&ServicesModule.zip [36.36 KiB]
Downloaded 52 times

Author:  Kino [ 2017-03-31 11:10:27 ]
Post subject:  Re: Create Index for Japanese Documents

And here is a macro that creates a Japanese word list from the frontmost document, which you might find useful.
Code:
### Japanese Word List ###

# Create a list of Katakana or Kanji words from the frontmost document.

$lang = Language.languageWithCode 'ja'
# Use zh_TW for Traditional Chinese and zh_CN for Simplified Chinese

Debug.setCodeProfilingEnabled @false

$LF = Cast to String "\n"
$tab = Cast to String "\t"
$doc = Document.active
if ! $doc
   exit 'No open document, exiting...'
end

$findExp = Hash.new  # a, b, c,... are used as sort keys

$findExp{'a. Katakana words including ジャン゠ソウル・パルトル, F・カフカ, Ch. ミュンシュ'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*'
$findExp{'b. Katakana/Kanji words such as 老子, 新プラトン主義, フィリップ・K・ディック'} = '(?:[\x{FF21}-\x{FF3A}](?<p>[\x{30A0}\x{30FB}\x{FF1D}])|\p{Upper}\p{Lower}?\.(?:[-\x20]\p{Upper}\p{Lower}?\.)*\x20)*(?:(?<k>\p{Katakana}[\p{Katakana}\x{30FC}]*)(?:(?:[\x{30A0}\x{30FB}\x{FF1D}][\x{FF21}-\x{FF3A}])?\g<p>\g<k>)*|\p{Han}+(?:\g<p>\p{Han}+)*)+'
$findExp{'c. Separate Kanji chars'} = '\p{Han}'
$findExp{'d. One or more consecutive Kanji chars'} = '\p{Han}+'
$findExp{'e. Two or more consecutive Kanji chars'} = '\p{Han}{2,}'
$findExp{'f. Kanji followed by 1 or 2 Hiragana'} = '\p{Han}+\p{Hiragana}{1,2}'

$menuItems = $findExp.keys
$menuItems.sort
$input = Prompt Options 'List up . . .', '', '', $menuItems
$selections = $doc.selectedSubstrings

if $selections.firstValue.length
   $LF = Cast to String "\n"
   $selections = $selections.join $LF
   $sels = $selections.findAll $findExp{$input}, 'E-i'
else
   $sels = $doc.text.findAll $findExp{$input}, 'E-i'
end

if ! $sels.count
   exit 'No word found, exiting...'
end

$str = Hash.new
foreach $sel in $sels
   $str{$sel.substring} += 1
end

$words = $str.keys
$words.sort 'li', $lang
foreach $i, $word in $words
   $ocr = $str{$word} & $tab
   $words[$i] = $ocr & $word
end
$words = $words.join $LF
Push Target Text $words
   $lang.apply
Pop Target Text
Document.newWithText $words

### end of macro ###
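
The counting step of the macro (find every match, count the occurrences, output “count<tab>word” lines) can be sketched in Python as follows. This is only an illustration under assumptions: the Kanji range is a rough stand-in for \p{Han}{2,} (menu item “e”), word_list is a hypothetical name, and the sort is by code point rather than the language-aware sort the macro asks NWP for.

```python
import re
from collections import Counter

# Rough stand-in for \p{Han}{2,}: two or more consecutive Kanji chars.
HAN_RUN = re.compile("[\u3400-\u4DBF\u4E00-\u9FFF]{2,}")

def word_list(text):
    """Return 'count<TAB>word' lines, one per distinct Kanji run."""
    counts = Counter(HAN_RUN.findall(text))
    # sorted() orders the words by code point, unlike the macro's Japanese sort.
    return "\n".join(f"{n}\t{w}" for w, n in sorted(counts.items()))

print(word_list("石竹と達磨大師、石竹の花"))
```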

Attachment:
File comment: Macro file
JapaneseWordList.nwm.zip [5.43 KiB]
Downloaded 52 times

Author:  Kino [ 2017-03-31 11:37:46 ]
Post subject:  Re: Create Index for Japanese Documents

With those macros (not exactly the same ones, because special formatting had to be applied), I created the 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except the 総索引 (the general index of all the volumes, included in the last volume, 別巻). It would have been impossible had I not been an NWP user. A thousand thanks, Mr. Nisus and Nice Us people! :-)

Author:  phspaelti [ 2017-03-31 19:40:31 ]
Post subject:  Re: Create Index for Japanese Documents

Kino wrote:
FEATURE REQUEST! A menu command that converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API that makes these features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.

I second that feature request.

Author:  credneb [ 2017-04-01 08:34:15 ]
Post subject:  Re: Create Index for Japanese Documents

I third the feature request.

Author:  Nobumi Iyanaga [ 2017-04-01 21:15:38 ]
Post subject:  Re: Create Index for Japanese Documents

This is an essential feature. Please add it to NW!

Author:  Elbrecht [ 2017-04-12 00:08:31 ]
Post subject:  Re: Create Index for Japanese Documents

Hi Kino –

I love to read Toshihiko IZUTSU – but, sorry, in English only! This Teheran Connection continues to work with Sachiko MURATA – great reading too!

HE

Kino wrote:
With those macros (not exactly the same ones, because special formatting had to be applied), I created the 索引 (index) for every volume of『井筒俊彦全集』(The Complete Works of Toshihiko IZUTSU, 12 vols and 1 supplement) http://www.keio-up.co.jp/kup/izutsu/cw.html https://www.keio-up.co.jp/np/search_result.do?ser_id=73 except the 総索引 (the general index of all the volumes, included in the last volume, 別巻). It would have been impossible had I not been an NWP user. A thousand thanks, Mr. Nisus and Nice Us people! :-)

Author:  martin [ 2017-05-11 16:48:36 ]
Post subject:  Re: Create Index for Japanese Documents

phspaelti wrote:
Kino wrote:
FEATURE REQUEST! A menu command that converts selected Japanese words into their yomi (the pronunciation of a Kanji/Katakana word written in Hiragana or Katakana), and a macro command such as “Convert to Yomi textObject, [Hiragana/Katakana]”. I think OS X has an API that makes these features possible, because Jedit X http://www.artman21.com/en/jedit_x/ has had such a menu command, “:Tools:Kanji to Hiragana”, for a long time.

I second that feature request.

Thank you to all of you for letting us know this was important to you. I've added the potential enhancement to our issue tracker.

All times are UTC - 8 hours