Playground place02

#! /usr/bin/env ruby
# encoding: UTF-8

# iolate.rb:
#
#   Version 1.01
#
# Extract the translatable text from the dokuwiki file cpage.dkw
# into the text file transfertext.utf8 and reinsert the translation
#  
#  Copyright  27.11.2021
#  
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#  
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#  
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software
#  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
#  MA 02110-1301, USA.
#  
 
# == syntax 
#
# iolate.rb [-i inputfilename] 
#           [-o throughputfilename] 
#           [-c separation character]
#           [-k]
#           [-v] 
#           [-h] 
#
# -i 
#  name of the file in dokuwiki-syntax. If none is given, 
#  cpage.dkw is assumed
#
# -o 
#  name of the text file to which the translatable text is written.
#  If none is given, transfertext.utf8 is assumed
#
# -c 
#  Character to separate languages in transfertext.utf8
#  from one another. The default is ~. You may need this option
#  if a ~ occurs naturally in the text
#
# -k
#  Keep tags in the outputfile transfertext.utf8
#
# -v
#  verbose
#
# -h
#  show help
#

# == file formats
#
# cpage.dkw
# has to be formatted according to https://comicslate.org/en/wiki/12balloons
# Only the text in text boxes between the first tag pair
# {{cotan>...}} ... {{<cotan}} is extracted into transfertext.utf8,
# text boxes before or after this are left unchanged.
# The first ... here represents a filename which must not contain }}
# The second ... represents the string which is searched for translatable text
# and which must not itself contain {{<cotan}}.
# Only text in text boxes is extracted, i.e. the line before the text must
# start with an @ which is followed by at least four
# comma-separated integers, e.g. @37,25,419,112
# After the text there has to be a line containing exactly one ~ character
# at the start of the line, marking the end of the text box.
# The following holds true for the translated file cpage.dkw:
# All <...> tags are removed from the result of the second run
# All [...] tags which are not at the start of the first line
#           are removed from the result of the second run
# All -. are removed
# All linefeeds and carriage returns are removed
# The removal of <...> and [...] tags can be suppressed with -k
# This file is then also used as output to store the data
# between passes in yaml format.
#
# cpage.yaml
# stores all the data of the file cpage.dkw except for the text which
# has to be translated
#
# transfertext.utf8
# In the extracted text all text from one text box is concatenated into
# one line, separated by spaces. The text of all text boxes
# is concatenated, separated by one LF character per box.
# This output file is used as input in the second pass.
# The translated text has to conform to the same specification
# in order to be used for the construction of the output files
#
# cpage-0-.dkw, cpage-1-French.dkw, ...
# The recreated files, one per translation, containing the markup
# from cpage.dkw and the text from transfertext.utf8

# == usage in context
#
# 0. It is assumed that you have the current version of 
#    the ruby interpreter installed as well as the gems 
#    listed under "dependencies". Copy this script into 
#    a directory on your computer. All further file handling
#    is assumed to take place there
# 
# 1. copy the content of the strip you want to translate 
#    or at least {{cotan>...}}, {{<cotan}} and what is between them 
#    into the file called cpage.dkw in the directory where
#    the executable file iolate.rb is located
#
# 2. execute, e.g. call ruby iolate.rb
# 
# 3. open the file transfertext.utf8 and copy its contents 
#    into the input field of a translation service or program
#    like Deepl.com or lingvanex.com
# 
# 4. translate
#
# 5. overwrite the text in transfertext.utf8 with 
#    the translation and save the file
# 
# 6. execute iolate.rb again
#
# 7. cpage-0-.dkw should now contain the translation
#

# == todo
#
# 1. Add the capability to supply various translations
#    in transfertext.utf8 separated by ~ and produce files
#     cpage1.dkw, cpage2.dkw etc. from them
#
# 2. optionally add a loop and REST-api compatibility 
#    to this script, to translate whole comics at once
#

# == constants
#
NEWLINECHAR = "\n" # EOL standard char to indicate end of line
SYSLIMITFILENAME = 255 # maximal length of a file name for ext4
# one of these letters after a ~ gives an outputfilename with the corresponding language name
COUNTRYLETTERS = [ ["d", "Dansk"], ["f", "French"], ["g", "German"], ["h", "Hungarian"], ["r", "Russian"], ["s", "Spanish"], ["u", "Finnish"] ]

# == dependencies

# class for handling filenames, pathnames, etc. 
begin
  require 'pathname' 
rescue LoadError
  STDERR.puts "warning: require of pathname failed." 
end

# class for parsing commandline options
begin
  require 'optparse' 
rescue LoadError
  STDERR.puts "warning: require of optparse failed." 
  # exit 1 # nonfatal if no options are given, so cross your fingers
end

# class for serialization
begin
  require 'yaml' 
rescue LoadError
  STDERR.puts "warning: require of yaml failed." 
  exit 1
end

# == methods
#
# to keep in line with a functional approach which might be easier 
# to port to javascript with opal, we make no use of oo,
# but use the methods of this object as functions

# remove everything between sBeginChar
# and sEndChar, including those characters,
# throughout sStr (which is modified in place) and return sStr
def removebetween(sStr, sBeginChar="<", sEndChar=">")
  iBegin = sStr.index(sBeginChar)
  while not iBegin.nil?
    iEnd = sStr.index(sEndChar, iBegin + 1)
    if iEnd.nil?
      return sStr
    else
      sStr.slice!(iBegin, iEnd - iBegin + 1)
      iBegin = sStr.index(sBeginChar)
    end
  end
  return sStr
end

# returns true if a string represents an integer
def is_i?(sMayBeInteger)
  /\A[-+]?\d+\z/ === sMayBeInteger
end

# returns true, if the string sLine is the first line of a text or mask box
def isBoxStart(sLine)
  unless sLine.start_with? "@" 
    return false
  end
  asLine = sLine[1..-1].rstrip.split "," # everything after the @, without trailing whitespace
  unless asLine.length > 3
    return false
  end
  unless is_i?(asLine[0])
    return false
  end
  unless is_i?(asLine[1])
    return false
  end
  unless is_i?(asLine[2])
    return false
  end
  unless is_i?(asLine[3][0]) || is_i?(asLine[3][0..1])
    return false
  end
  return true
end

# returns true, if the string sLine is the first line of a text box
def isTextBoxStart(sLine, sNextLine)
  unless sNextLine.nil? # something ahead
    if isBoxStart(sLine)
      unless sNextLine.chr == "#" # no mask 
        return true
      end
    end
  end
  return false
end

# splits a filename into extension and the part before
# unlike File.extname the dot is returned as part of the extension
def splitbasename(sFilename)
  iDotPos = sFilename.rindex(".")
  iDotPos = sFilename.length if iDotPos.nil? # no dot: the extension is empty
  aF = Array.new
  aF << sFilename[0...iDotPos]
  aF << sFilename[iDotPos..-1]
  return aF
end

# shortens the basename, until it is of legal length
def shortentolegallength(sFilename)
  if SYSLIMITFILENAME < 3
    exit 1 # not your usual file system
  end
  aFilename = File.split(sFilename) # directory and basename
  il = aFilename[1].length
  aB = splitbasename(aFilename[1]) # array of two elements: the part before the extension and the extension
  if il > SYSLIMITFILENAME
    iDif = il - SYSLIMITFILENAME
    if aB[0].length > iDif # we want to keep at least one letter, to avoid producing dot files
      aB[0] = aB[0][0...-iDif]
    else
      iresttocut = iDif - aB[0].length + 1
      aB[0] = aB[0].chr # keep one letter
      aB[1] = aB[1][0...-iresttocut] # additionally shorten the extension
    end
  end
  return aFilename[0] + File::SEPARATOR + aB[0] + aB[1]
end

# == main
#
# for the use of option parser see http://www.dreamsyssoft.com/ruby-scripting-tutorial/optionparser-tutorial.php
# https://stelfox.net/blog/2012/12/rubys-option-parser-a-more-complete-example/
# http://ruby-doc.org/stdlib-1.9.3/libdoc/optparse/rdoc/OptionParser.html#method-i-make_switch
options = {:verbose => nil, :dokuinfilename => "cpage.dkw", :textoutputfilename => "transfertext.utf8", :sepcharacter => "~", :fkeep => false} # default values go here
opt_parser = OptionParser.new do |opts|
  opts.banner = "Usage: iolate.rb [-i dokuinfilename] [-o textoutputfilename] [-c separationcharacter] [-k] [-v] [-h]" 
  opts.on("-i filename", "--inputfile", "name of the dokuwiki file from which to extract translatable text. The default is cpage.dkw.") do |anopt|
    options[:dokuinfilename] = anopt
  end
  opts.on("-o filename", "--outputtext", "name of the text file to which translatable text is written. The default is transfertext.utf8.") do |anopt|
    options[:textoutputfilename] = anopt
  end
  opts.on("-c separationcharacter", "--separationcharacter", "the character used to separate different languages in transfertext.utf8. The default is ~.") do |anopt|
    options[:sepcharacter] = anopt
  end
  opts.on("-k", "--keeptags", "do not attempt to delete tags. The default is to delete them.") do |anopt|
    options[:fkeep] = anopt
  end
  opts.on("-v", "--[no-]verbose", "show comments") do |anopt| 
    options[:verbose] = anopt
  end  
  opts.on("-h", "--help", "show this help.") do
    puts opts
    exit
  end
end
begin
  opt_parser.parse!
rescue OptionParser::InvalidOption
  puts "\nunknown option"
  puts "in line " + __LINE__.to_s # the current line number in the source file.
  puts $! # error message
  puts $@ # error position
  raise
end
# set verbosity flag: messages are shown unless --no-verbose is given
if options[:verbose] == false
  $ivc = 0
else
  $ivc = 1 
end
sDokuwikifile = "cpage.dkw" # default name
if options[:dokuinfilename].nil? # should be impossible, but well ...
  puts "after the -i option there needs to be a filename" if $ivc > 0
  exit 1
else
  sDokuwikifile = options[:dokuinfilename]
end
sTransferfile = "transfertext.utf8"
if options[:textoutputfilename].nil? # also impossible
  puts "after the -o option there needs to be a filename" if $ivc > 0
  exit 1
else
  sTransferfile = options[:textoutputfilename]
end
sSepChar = "~"
if options[:sepcharacter].nil? # also impossible
  puts "after the -c option there needs to be a separation character for the different languages" if $ivc > 0
  exit 1
else
  sSepChar = options[:sepcharacter]
end
fkeep = options[:fkeep]
bInCotan = false # flag
bInTextbox = false # flag
iNrTextboxes = 0 # the total number of textboxes which were read
asPageAsLines = Array.new # the markup pieces of the file are collected here 
aiTextLineNumbersInPage = Array.new # the line numbers of the text box lines in the page
asTextLines = Array.new # the text lines from the text boxes, one element per textbox contains all the lines from this textbox
asTextboxTags = Array.new # the tags at the start of a text line
unless File.exist? sTransferfile # if true, we are pre translation, otherwise post translation
  file = File.open(sDokuwikifile, "r")
  aText = file.read.split "\n"
  file.close
  sTextBoxText = String.new # string to collect all the text from a textbox in
  aText.each_index { |iline|
    sline = aText[iline]
    if bInCotan # starting to search for textboxes
      if sline.include? "{{<cotan}}" # stop searching for textboxes, if you're out of the cotan block
        bInCotan = false 
        asPageAsLines << sline unless bInCotan # hand cotan block's tail through 
      else 
        if bInTextbox
          if sline.strip == "~" # this is the way a textbox ends, not with a !, but with a ~
            asTextLines << sTextBoxText + NEWLINECHAR # all the text from the text box has been collected and is stored as one line
            sTextBoxText = String.new # reset to empty
            asPageAsLines << "" # store an empty line, because the text goes to asTextLines
            asPageAsLines << sline # lines before and after a textbox, including the textbox's head and tail, are just handed through
            bInTextbox = false
          else
            sline.strip!
            if sTextBoxText.empty? # this is the first non-empty text box line
              while sline.chr == "[" # a tag is still at the start of the line
                iClose = sline.index "]"
                if iClose.nil? # no closing bracket found
                  break # the rest of the line is supposed to contain the text
                else
                  asTextboxTags[iNrTextboxes - 1] += sline[0..iClose] # save the tag by appending it to previous tags in the same line
                  sline = sline[iClose + 1 .. -1] # remove the tag from the line
                end
              end # all starting tags removed
            end # the starting tags in non-first text box lines are not saved
            unless fkeep # remove other tags unless blocked
              removebetween sline
              removebetween sline, "[", "]"
            end
            if ! sTextBoxText.empty? # if the text has several lines, concatenate them all separated by a blank
              sTextBoxText += " "
            end
            sTextBoxText += sline # the line feed is added once per box, when the box ends
          end
        else # not in a textbox
          asPageAsLines << sline # lines before and after a textbox, including the textbox's head and tail, are just handed through
          if isTextBoxStart(sline, aText[iline + 1]) # a textbox starts with this line; a mask starts the same way but is followed by a # line
            aiTextLineNumbersInPage << asPageAsLines.length - 1 # the line number of the box head in the output; the text belongs on the line after it
            asTextboxTags << "" # initialize tag memory for this textbox
            iNrTextboxes += 1 # keep count of the textboxes
            bInTextbox = true
          end
        end
      end
    else 
      asPageAsLines << sline # lines before and after the cotan block, including the cotan block's head and tail are just handed through 
      bInCotan = sline[0..7] == "{{cotan>"
    end   
  } # next line of the dokuwiki file
  # collect all information for a second pass in an array 
  oSerializedData = Array.new
  oSerializedData << fkeep
  oSerializedData << iNrTextboxes
  oSerializedData << asPageAsLines 
  oSerializedData << aiTextLineNumbersInPage
  oSerializedData << asTextboxTags
  # oSerializedData << asTextLines
  # write intermediate file for second pass, using the original dokuwiki file name with a yaml extension
  iDotPos = sDokuwikifile.rindex "."
  iDotPos = sDokuwikifile.length if iDotPos.nil?
  sYamlFile = Pathname.new(sDokuwikifile).sub_ext(".yaml")
  File.open(sYamlFile, "w") do |file|
    file.puts YAML::dump(oSerializedData)
  end  
  # write textfile to be translated
  File.open(sTransferfile, "w") { |file|
    asTextLines.each { |s|
      file.write s
    }
  }  
else # post translation
  sYamlFile = Pathname.new(sDokuwikifile).sub_ext(".yaml") # name of the yaml file reconstructed
  oSerializedData = Array.new
  oSerializedData = YAML.load_file(sYamlFile)
  fkeep = oSerializedData[0]
  iNrTextboxes = oSerializedData[1]
  asPageAsLines  = oSerializedData[2]
  aiTextLineNumbersInPage = oSerializedData[3]
  asTextboxTags = oSerializedData[4]
  # load translated text
  file = File.open(sTransferfile, "r")
  asAllTranslatedText = file.read.split sSepChar # the languages are separated by sSepChar, which by default is a ~, optionally followed by a letter to indicate the language
  file.close
  asAllTranslatedText.each_index { |iOneLanguage| # one language at a time, by number 
    sLanguage = String.new # holds the language name of this translation; reset for every block so a name is not carried over
    asTranslatedText = Array.new # contains one translation as an Array  
    sOneLanguage = asAllTranslatedText[iOneLanguage] # the translated text of this language in one string with line breaks
    asTranslatedText = sOneLanguage.split "\n" # get the lines for this language
    if iOneLanguage > 0 # starting with the second language there may be a letter indicating the language after the ~
      sFirstLine = asTranslatedText[0].to_s[0].to_s # the ~ was consumed by split, so the first character is the language letter, if any
      if sFirstLine.empty? # no language identifying letter given
         # sLanguage = "" # stays empty
      else # an additional character can indicate a language
         allLangs = COUNTRYLETTERS.select { |sl| # search through all letters indicating countries 
           fReturn = false
           sFirstLine.each_char { |sChar| # search through all chars in the string, which should be one
             if sl[0].upcase == sChar.upcase
               fReturn = true
             end
           }
           fReturn
         } 
         allLangs.each_index { |iL|
           if iL == 0 
             sLanguage = allLangs[0][1]
           else
             sLanguage = sLanguage + "-" + allLangs[iL][1]
           end
         }
      end
      asTranslatedText.shift # remove the first line of this language's block, as it does not contain translated text
    end # end of the search for the name of the language 
    if asTranslatedText.length != iNrTextboxes # regardless of language every translation has to have the same number of lines
      if $ivc > 0
        unless sLanguage == ""
          STDERR.puts "warning for language " + sLanguage  
        end
        STDERR.puts "warning: the number of textboxes " + iNrTextboxes.to_s + " does not equal the number of translated strings " + asTranslatedText.length.to_s + "."  
      end
    end
    # make translation
    sOutputtext = String.new
    # iTextBoxCur = 0
    asPageAsLines.each_index { |iline|
      iTextBoxCur = aiTextLineNumbersInPage.find_index(iline - 1) # lookup, if this line's number is in the table of translateable lines
      if iTextBoxCur.nil? # do we have no translation for this line?
        sOutputtext = sOutputtext + asPageAsLines[iline] + "\n" # this line gets handed through unchanged   
      else # there is a translation for this line which we take from the sTransferfile
        sOutputtext = sOutputtext  + asTextboxTags[iTextBoxCur] + asTranslatedText[iTextBoxCur].to_s + "\n" # to_s guards against a missing translation line
        # iTextBoxCur += 1
      end
    }
    # remove trailing empty lines
    while sOutputtext[sOutputtext.length - 1] == "\n"
      sOutputtext.chomp!
    end
    # construct a name for the output file by inserting the number of the translation and the language before the extension dot
    iDotPos = sDokuwikifile.rindex "."
    iDotPos = sDokuwikifile.length if iDotPos.nil?
    sOutputfile = shortentolegallength( sDokuwikifile[0..iDotPos-1] + "-" + iOneLanguage.to_s + "-" + sLanguage + sDokuwikifile[iDotPos..-1])
    # write final output 
    File.open(sOutputfile, "w") do |file|
      file.puts sOutputtext
    end
  } # this translation is finished, go to the next  
  File.delete sTransferfile # cleaning up
  # File.delete sYamlFile # cleaning up
end
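
The first pass of the script collapses every text box into a single line. The following is only a minimal sketch of that step under simplifying assumptions: every box has a head line starting with @ and four integers, masks and tags are ignored, and the helper name collapse_textboxes is made up for this illustration and is not part of iolate.rb.

```ruby
# Hypothetical helper, not part of iolate.rb: collapse every text box
# into one line, the way transfertext.utf8 expects it.
# Simplification: masks, <...> tags and [...] tags are not handled here.
def collapse_textboxes(lines)
  boxes = []
  current = nil # nil while we are outside a text box
  lines.each do |line|
    stripped = line.strip
    if current.nil?
      # a box head looks like @37,25,419,112
      current = [] if stripped.match?(/\A@\d+,\d+,\d+,\d+/)
    elsif stripped == "~"
      boxes << current.join(" ") # a lone ~ ends the box: one output line per box
      current = nil
    else
      current << stripped
    end
  end
  boxes
end

sample = [
  "{{cotan>1911.png}}",
  "@37,25,419,112",
  "Aggressive?",
  "I've seen this program in action!",
  "~",
  "{{<cotan}}"
]
puts collapse_textboxes(sample)
# prints: Aggressive? I've seen this program in action!
```

Everything outside a box is ignored in this sketch, whereas the script hands those lines through unchanged and stores them in the yaml file for the second pass.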

An Example from 1911.

After the first run of iolate.rb, the file transfertext.utf8 should look like this

Those were… **MY** notes. I think I know what's going on. The majority of robots on this planet are approaching neural pruning age.
Rather than scrap robots using Dr. Bowman's mental design, they may be planning to use an aggressive neural pruning program.
Aggressive? I've seen this program in action!
Saying this thing is aggressive is like saying a great white shark likes to nibble on things!
I'm more than a little disturbed by how much you know about this program.

Translate it into various languages and separate the languages by a line with a ~, optionally followed by a letter to indicate the language, to make it look like this example

Das waren... **Meine** Notizen. Ich glaube, ich weiß, was hier los ist. Die meisten Roboter auf diesem Planeten erreichen das Alter der neuronalen Beschneidung.
Anstatt die Roboter mit Dr. Bowmans mentalem Design zu verschrotten, planen sie vielleicht ein aggressives neurales Beschneidungsprogramm zu verwenden.
Aggressiv? Ich habe dieses Programm in Aktion gesehen!
Es als aggressiv zu bezeichnen, ist so, als würde man sagen, dass ein weißer Hai gerne an Dingen knabbert!
Ich bin mehr als nur ein wenig beunruhigt darüber, wie viel Sie über dieses Programm wissen.
~f
C'était... mes notes. Je pense que je sais ce qui se passe. La majorité des robots sur cette planète approchent de l'âge de l'élagage neural.
Plutôt que de détruire les robots en utilisant le design mental du Dr. Bowman, ils peuvent prévoir d'utiliser un programme agressif d'élagage neuronal.
Agressif ? J'ai vu ce programme en action !
Dire que cette chose est agressive c'est comme dire qu'un grand requin blanc aime grignoter des choses !
Je suis plus que troublé par ce que tu sais de ce programme.
~d
Det var... **MINE** noter. Jeg tror, jeg ved, hvad der foregår. De fleste robotter på denne planet nærmer sig den neurale beskæringsalder.
I stedet for at skrotte robotter ved hjælp af Dr. Bowmans mentale design, planlægger de måske at bruge et aggressivt neuralt beskæringsprogram.
Aggressivt? Jeg har set dette program i aktion!
At sige, at denne tingest er aggressiv er som at sige, at en stor hvid haj kan lide at gnaske på ting!
Jeg er mere end en smule foruroliget over, hvor meget du ved om dette program.
~s
Esas eran... mis notas. Creo que sé lo que está pasando. La mayoría de los robots de este planeta se acercan a la edad de poda neural.
En lugar de desechar los robots usando el diseño mental del Dr. Bowman, pueden estar planeando usar un programa de poda neural agresivo.
¿Agresivo? ¡He visto este programa en acción!
Decir que esta cosa es agresiva es como decir que a un gran tiburón blanco le gusta mordisquear cosas.
Estoy más que perturbado por lo mucho que sabes sobre este programa.
~u
Ne olivat... **MINUN** muistiinpanojani. Luulen tietäväni mistä on kyse. Suurin osa tämän planeetan roboteista lähestyy hermojen karsintaikää.
Sen sijaan, että he romuttaisivat robotteja tohtori Bowmanin mentaalisen mallin mukaan, he saattavat suunnitella aggressiivista hermojen karsintaohjelmaa.
Aggressiivinen? Olen nähnyt tämän ohjelman toiminnassa!
Tämän sanominen aggressiiviseksi on kuin sanoisi, että valkohai tykkää nakertaa asioita!
Minua häiritsee, miten paljon tiedät tästä ohjelmasta.
~h
Ezek... **az én jegyzeteim** voltak. Azt hiszem, tudom, mi folyik itt. A robotok többsége ezen a bolygón közelít az idegrendszeri metszési korhoz.
Ahelyett, hogy a robotokat Dr. Bowman mentális tervezésével selejteznék, talán egy agresszív neurális metszési programot terveznek.
Agresszív? Láttam ezt a programot működés közben!
Azt mondani, hogy ez a dolog agresszív, olyan, mintha azt mondanánk, hogy egy nagy fehér cápa szeret rágcsálni dolgokat!
Egy kicsit zavar, hogy mennyit tudsz erről a programról.

Note that DeepL honors the bold formatting here only for German, Danish, Finnish and Hungarian, but not for French or Spanish. Your mileage may vary on this.

The second run of iolate.rb should then give you translated files for all the languages, which here are

cpage-0-.dkw
cpage-1-French.dkw
cpage-2-Dansk.dkw
cpage-3-Spanish.dkw
cpage-4-Finnish.dkw
cpage-5-Hungarian.dkw
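
The names above follow the pattern input name, number of the translation, language name, extension. A minimal sketch of that naming scheme, assuming the hypothetical helper name output_filename and leaving out the shortening to a legal file name length that the script also performs:

```ruby
# Hypothetical helper mirroring the naming scheme of the files listed above;
# it does not shorten over-long names the way the script does.
def output_filename(input_name, index, language)
  ext = File.extname(input_name)       # ".dkw", or "" if there is no extension
  base = input_name.delete_suffix(ext) # "cpage" (String#delete_suffix needs Ruby 2.5 or newer)
  "#{base}-#{index}-#{language}#{ext}"
end

puts output_filename("cpage.dkw", 0, "")       # prints cpage-0-.dkw
puts output_filename("cpage.dkw", 1, "French") # prints cpage-1-French.dkw
```

The first translation has no separator line in front of it and therefore no language name, which is why its file name ends in "-0-.dkw".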

You still have to translate what is outside the cotan tags manually.