Spielplatz place02
#! /usr/bin/env ruby
# encoding: UTF-8
# iolate.rb:
#
# Version 1.01
#
# Extract and reinsert the translatable text from
# the dokuwiki file cpage.dkw to the text file transfertext.utf8
#
# Copyright 27.11.2021
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
#
# == syntax
#
# iolate.rb [-i inputfilename]
#           [-o throughputfilename]
#           [-c separation character]
#           [-k]
#           [-v]
#           [-h]
#
# -i
#   Name of the file in dokuwiki syntax. If none is given,
#   cpage.dkw is assumed
#
# -o
#   Name of the text file which contains the extracted text.
#   If none is given, transfertext.utf8 is assumed
#
# -c
#   Character to separate languages in transfertext.utf8
#   from one another. The default is ~. You may need this option
#   if a ~ occurs naturally in the text
#
# -k
#   Keep tags in the output file transfertext.utf8
#
# -v
#   verbose
#
# -h
#   show help
#
# == file formats
#
# cpage.dkw
#   has to be formatted according to https://comicslate.org/en/wiki/12balloons
#   Only the text in text boxes between the first tag pair
#   {{cotan>...}} ... {{<cotan}} is extracted into transfertext.utf8,
#   text boxes before or after this are left unchanged.
#   The first ... here represents a filename which must not contain }}
#   The second ... represents the string which is searched for translatable text
#   and which must not itself contain {{<cotan}}.
#   Only text in text boxes is extracted, i.e. the line before the text must
#   start with an @ which is followed by at least four
#   comma separated integers, e.g. @37,25,419,112
#   After the text there has to be a line containing exactly one ~ character
#   at the start of the line, marking the end of the text box.
#   The following holds true for the translated file:
#   All <...> tags are removed from the result of the second run
#   All [...] tags which are not at the start of the first line
#   are removed from the result of the second run
#   All -. are removed
#   All linefeeds and carriage returns are removed
#   The removal of <...> and [...] tags can be suppressed with -k
#   The name of this file is then also used, with a .yaml extension,
#   for the file which stores the data between passes in YAML format.
#
# cpage.yaml
#   stores all the data of the file cpage.dkw except for the text which
#   has to be translated
#
# transfertext.utf8
#   In the extracted text all text from one text box is concatenated into
#   one line, separated by a space. The text of all text boxes
#   is concatenated, separated by one LF character per box.
#   This output file is used as input in the second pass.
#   The translated text has to conform to the same specification
#   in order to be used for the construction of the translated files
#
# cpage-N-Language.dkw
#   The recreated files containing the markup from cpage.dkw and the text
#   from transfertext.utf8, one file per translation
#
# == usage in context
#
# 0. It is assumed that you have the current version of
#    the ruby interpreter installed as well as the gems
#    listed under "dependencies". Copy this script into
#    a directory on your computer. All further file handling
#    is assumed to take place there
#
# 1. copy the content of the strip you want to translate
#    or at least {{cotan>...}}, {{<cotan}} and what is between them
#    into the file called cpage.dkw in the directory where
#    the executable file iolate.rb is located
#
# 2. execute, e.g. call ruby iolate.rb
#
# 3. open the file transfertext.utf8 and copy its contents
#    into the input field of a translation service or program
#    like Deepl.com or lingvanex.com
#
# 4. translate
#
# 5. overwrite the text in transfertext.utf8 with
#    the translation and save the file
#
# 6. execute iolate.rb again
#
# 7. the output file(s) should now contain the translation
#
# == todo
#
# 1. Add the capability to supply various translations
#    in transfertext.utf8 separated by ~ and produce files
#    cpage1.dkw, cpage2.dkw etc. from them
#
# 2. optionally add a loop and REST-api compatibility
#    to this script, to translate whole comics at once
#
# == constants
#
NEWLINECHAR = "\n" # EOL standard char to indicate end of line
SYSLIMITFILENAME = 255 # maximal length of a file name for ext4
# one of these letters after a ~ gives an output file name with the corresponding language name
COUNTRYLETTERS = [
  ["d", "Dansk"],
  ["f", "French"],
  ["g", "German"],
  ["h", "Hungarian"],
  ["r", "Russian"],
  ["s", "Spanish"],
  ["u", "Finnish"]
]

# == dependencies

# class for handling filenames, pathnames, etc.
begin
  require 'pathname'
rescue LoadError
  STDERR.puts "warning: require of pathname failed."
end

# class for parsing commandline options
begin
  require 'optparse'
rescue LoadError
  STDERR.puts "warning: require of optparse failed."
  # exit 1 # nonfatal if no options are given, so cross your fingers
end

# class for serialization
begin
  require 'yaml'
rescue LoadError
  STDERR.puts "warning: require of yaml failed."
  exit 1
end

# == methods
#
# to keep in line with a functional approach which might be easier
# to port to javascript with opal, we make no use of oo,
# but use the methods of this object as functions

# remove everything between sBeginChar
# and sEndChar including those characters
# all over sStr and return sStr
def removebetween(sStr, sBeginChar="<", sEndChar=">")
  iBegin = sStr.index(sBeginChar)
  while not iBegin.nil?
    iEnd = sStr.index(sEndChar, iBegin + 1)
    if iEnd.nil?
      return sStr
    else
      sStr.slice!(iBegin, iEnd - iBegin + 1)
      iBegin = sStr.index(sBeginChar)
    end
  end
  return sStr
end

# returns true if a string represents an integer
def is_i?(sMayBeInteger)
  /\A[-+]?\d+\z/ === sMayBeInteger
end

# returns true, if the string sLine is the first line of a text or mask box
def isBoxStart(sLine)
  unless sLine.start_with? "@"
    return false
  end
  # drop the leading @ and trailing whitespace; the lines carry no
  # newline here, so the last character must not be cut off
  asLine = sLine[1..-1].rstrip.split ","
  unless asLine.length > 3
    return false
  end
  unless is_i?(asLine[0])
    return false
  end
  unless is_i?(asLine[1])
    return false
  end
  unless is_i?(asLine[2])
    return false
  end
  unless is_i?(asLine[3][0]) || is_i?(asLine[3][0..1])
    return false
  end
  return true
end

# returns true, if the string sLine is the first line of a text box
def isTextBoxStart(sLine, sNextLine)
  unless sNextLine.nil? # something ahead
    if isBoxStart(sLine)
      unless sNextLine.chr == "#" # no mask
        return true
      end
    end
  end
  return false
end

# splits a filename into extension and the part before
# unlike File.extname the dot is returned as part of the extension
def splitbasename(sFilename)
  iDotPos = sFilename.rindex(".")
  iDotPos = sFilename.length if iDotPos.nil? # no extension at all
  aF = Array.new
  aF << sFilename[0...iDotPos]
  aF << sFilename[iDotPos..-1]
  return aF
end

# shortens the basename, until it is of legal length
def shortentolegallength(sFilename)
  if SYSLIMITFILENAME < 3
    exit 1 # not your usual file system
  end
  aFilename = File.split(sFilename)
  il = aFilename[1].length
  aB = splitbasename(aFilename[1]) # array of two elements for the extension and the stuff before it
  if il > SYSLIMITFILENAME
    iDif = il - SYSLIMITFILENAME
    if aB[0].length > iDif # we want to keep at least one letter, to avoid producing dot files
      aB[0] = aB[0][0...-iDif]
    else
      iresttocut = iDif - aB[0].length + 1
      aB[0] = aB[0].chr # keep one letter
      aB[1] = aB[1][0...-iresttocut] # additionally shorten the extension
    end
  end
  return aFilename[0] + File::SEPARATOR + aB[0] + aB[1]
end

# == main
#
# for the use of option parser see http://www.dreamsyssoft.com/ruby-scripting-tutorial/optionparser-tutorial.php
# https://stelfox.net/blog/2012/12/rubys-option-parser-a-more-complete-example/
# http://ruby-doc.org/stdlib-1.9.3/libdoc/optparse/rdoc/OptionParser.html#method-i-make_switch

options = {:verbose => nil, :dokuinfilename => "cpage.dkw", :textoutputfilename => "transfertext.utf8", :sepcharacter => "~", :fkeep => false} # default values go here
opt_parser = OptionParser.new do |opts|
  opts.banner = "Usage: iolate.rb [-i dokuinfilename] [-o textoutputfilename] [-c separationcharacter] [-k] [-v] [-h]"
  opts.on("-i filename", "--inputfile", "name of the dokuwiki file from which to extract translatable text. The default is cpage.dkw.") do |anopt|
    options[:dokuinfilename] = anopt
  end
  opts.on("-o filename", "--outputtext", "name of the text file to which translatable text is written. The default is transfertext.utf8.") do |anopt|
    options[:textoutputfilename] = anopt
  end
  opts.on("-c separationcharacter", "--separationcharacter", "the character used to separate different languages in transfertext.utf8. The default is ~.") do |anopt|
    options[:sepcharacter] = anopt
  end
  opts.on("-k", "--keeptags", "do not attempt to delete all tags. The default is to delete them.") do |anopt|
    options[:fkeep] = anopt
  end
  opts.on("-v", "--[no-]verbose", "show comments") do |anopt|
    options[:verbose] = anopt
  end
  opts.on("-h", "--help", "show this help.") do
    puts opts
    exit
  end
end

begin
  opt_parser.parse!
rescue OptionParser::InvalidOption
  puts "\nunknown option"
  puts "in line " + __LINE__.to_s # the current line number in the source file.
  puts $! # error message
  puts $@ # error position
  raise
end

# set verbosity flag
if ! options[:verbose].nil?
  $ivc = 0
else
  $ivc = 1
end

sDokuwikifile = "cpage.dkw" # default name
if options[:dokuinfilename].nil? # should be impossible, but well ...
  puts "after the -i option there needs to be a filename" if $ivc > 0
  exit 1
else
  sDokuwikifile = options[:dokuinfilename]
end

sTransferfile = "transfertext.utf8"
if options[:textoutputfilename].nil? # also impossible
  puts "after the -o option there needs to be a filename" if $ivc > 0
  exit 1
else
  sTransferfile = options[:textoutputfilename]
end

sSepChar = "~"
if options[:sepcharacter].nil? # also impossible
  puts "after the -c option there needs to be a separation character for the different languages" if $ivc > 0
  exit 1
else
  sSepChar = options[:sepcharacter]
end

fkeep = options[:fkeep]

bInCotan = false # flag
bInTextbox = false # flag
iNrTextboxes = 0 # the total number of text boxes which were read
asPageAsLines = Array.new # the markup pieces of the file are collected here
aiTextLineNumbersInPage = Array.new # the line numbers of the text box lines in the page
asTextLines = Array.new # the text lines from the text boxes, one element per text box contains all the lines from this text box
asTextboxTags = Array.new # the tags at the start of a text line

unless File.exist? sTransferfile # if true, we are pre translation, otherwise post translation
  file = File.open(sDokuwikifile, "r")
  aText = file.read.split "\n"
  file.close
  sTextBoxText = String.new # string to collect all the text from a text box in
  aText.each_index { |iline|
    sline = aText[iline]
    if bInCotan # starting to search for text boxes
      if sline.include? "{{<cotan}}" # stop searching for text boxes, if you're out of the cotan block
        bInCotan = false
        asPageAsLines << sline unless bInCotan # hand cotan block's tail through
      else
        if bInTextbox
          if sline.strip == "~".chr # this is the way a text box ends, not with a !, but with a ~
            asTextLines << sTextBoxText # all the text from the text box has been collected and can now be stored as one line
            sTextBoxText = String.new # reset to empty
            asPageAsLines << "" # store an empty line, because the text goes to asTextLines
            asPageAsLines << sline # lines before and after a text box, including the text box's head and tail are just handed through
            bInTextbox = false
          else
            sline.strip!
            if sTextBoxText.empty? # this is the first non empty text box line
              if sline.chr == "[" # text line starts with a tag
                while sline.chr == "[" # a tag is still at the start of the line
                  iClose = sline.index "]"
                  if iClose.nil? # no closing bracket found
                    break # the rest of the line is supposed to contain the text
                  else
                    asTextboxTags[iNrTextboxes - 1] += sline[0..iClose] # save the tag by appending it to previous tags in the same line
                    sline = sline[iClose + 1 .. -1] # remove the tag from the line
                  end
                end # next tag
              end # all starting tags removed
            end # the starting tags in non first text box lines in text boxes are not saved
            unless fkeep # remove other tags unless blocked
              removebetween sline
              removebetween sline, "[", "]"
            end
            if ! sTextBoxText.empty? # if the text has various lines, concatenate them all separated by a blank
              sTextBoxText += " "
            end
            sTextBoxText += sline # the line feed is added once per text box when the file is written
          end
        else # not in a text box
          asPageAsLines << sline # lines before and after a text box, including the text box's head and tail are just handed through
          if isTextBoxStart(sline, aText[iline + 1]) # a text box starts with this line, but also a mask
            aiTextLineNumbersInPage << asPageAsLines.length - 1 # the line number in the output file where the text belongs.
            asTextboxTags << "" # initialize tag memory for this text box
            iNrTextboxes += 1 # keep count of the text boxes
            bInTextbox = true
          end
        end
      end
    else
      asPageAsLines << sline # lines before and after the cotan block, including the cotan block's head and tail are just handed through
      bInCotan = sline[0..7] == "{{cotan>"
    end
  } # next line of the dokuwiki file

  # collect all information for a second pass in an array
  oSerializedData = Array.new
  oSerializedData << fkeep
  oSerializedData << iNrTextboxes
  oSerializedData << asPageAsLines
  oSerializedData << aiTextLineNumbersInPage
  oSerializedData << asTextboxTags
  # oSerializedData << asTextLines

  # write intermediate file for second pass, using the original dokuwiki file name with a yaml extension
  sYamlFile = Pathname.new(sDokuwikifile).sub_ext(".yaml")
  File.open(sYamlFile, "w") do |file|
    file.puts YAML::dump(oSerializedData)
  end

  # write text file to be translated, one line per text box
  File.open(sTransferfile, "w") { |file|
    asTextLines.each { |s|
      file.puts s
    }
  }
else # post translation
  sYamlFile = Pathname.new(sDokuwikifile).sub_ext(".yaml") # name of the yaml file reconstructed
  oSerializedData = YAML.load_file(sYamlFile)
  fkeep = oSerializedData[0]
  iNrTextboxes = oSerializedData[1]
  asPageAsLines = oSerializedData[2]
  aiTextLineNumbersInPage = oSerializedData[3]
  asTextboxTags = oSerializedData[4]

  # load translated text
  file = File.open(sTransferfile, "r")
  asAllTranslatedText = file.read.split sSepChar # various languages are separated by a sSepChar which by default is a ~ and optionally an additional letter to indicate the language
  file.close
  sLanguage = String.new # holds the language of the second and following translations
  asAllTranslatedText.each_index { |iOneLanguage| # one language at a time, by number
    sOneLanguage = asAllTranslatedText[iOneLanguage] # the translated text of this language in one string with line breaks
    asTranslatedText = sOneLanguage.split "\n" # get the lines for this language
    if iOneLanguage > 0 # more than one language. Starting with the second language there's the possibility that there is a letter indicating a language after the ~
      sFirstLine = asTranslatedText[0].to_s[0].to_s # the ~ is already split off, take the character following it (empty if there is none)
      if sFirstLine.empty? # no language identifying letter or language given
        # sLanguage = "" # stays empty
      else # an additional character can indicate a language
        allLangs = COUNTRYLETTERS.select { |sl| # search through all letters indicating countries
          fReturn = false
          sFirstLine.each_char { |sChar| # search through all chars in the string, which should be one
            if sl[0].upcase == sChar.upcase
              fReturn = true
            end
          }
          fReturn
        }
        allLangs.each_index { |iL|
          if iL == 0
            sLanguage = allLangs[0][1]
          else
            sLanguage = sLanguage + "-" + allLangs[iL][1]
          end
        }
      end
      asTranslatedText.shift # remove the first line of this language's block, as it does not contain translated text
    end # end of the search for the name of the language

    if asTranslatedText.length != iNrTextboxes # regardless of language every translation has to have the same number of lines
      if $ivc > 0
        unless sLanguage == ""
          STDERR.puts "warning for language " + sLanguage
        end
        STDERR.puts "warning: the number of textboxes " + iNrTextboxes.to_s + " does not equal the number of translated strings " + asTranslatedText.length.to_s + "."
      end
    end

    # make translation
    sOutputtext = String.new
    # iTextBoxCur = 0
    asPageAsLines.each_index { |iline|
      iTextBoxCur = aiTextLineNumbersInPage.find_index(iline - 1) # lookup, if this line's number is in the table of translatable lines
      if iTextBoxCur.nil? # do we have no translation for this line?
        sOutputtext = sOutputtext + asPageAsLines[iline] + "\n" # this line gets handed through unchanged
      else # there is a translation for this line which we take from the sTransferfile
        sOutputtext = sOutputtext + asTextboxTags[iTextBoxCur] + asTranslatedText[iTextBoxCur].to_s + "\n" # a missing translation gives an empty line
        # iTextBoxCur += 1
      end
    }

    # remove trailing empty lines
    while sOutputtext[sOutputtext.length - 1] == "\n"
      sOutputtext.chomp!
    end

    # construct a name for the output file by inserting the number of the translation and the language before the extension dot
    iDotPos = sDokuwikifile.rindex "."
    iDotPos = sDokuwikifile.length if iDotPos.nil?
    sOutputfile = shortentolegallength(
      sDokuwikifile[0..iDotPos-1] + "-" + iOneLanguage.to_s + "-" + sLanguage + sDokuwikifile[iDotPos..-1])

    # write final output
    File.open(sOutputfile, "w") do |file|
      file.puts sOutputtext
    end
  } # this translation is finished, go to the next

  File.delete sTransferfile # cleaning up
  # File.delete sYamlFile # cleaning up
end
An Example from 1911.
After the first run of iolate.rb, the file transfertext.utf8 should look like this:
Those were… **MY** notes. I think I know what's going on. The majority of robots on this planet are approaching neural pruning age. Rather than scrap robots using Dr. Bowman's mental design, they may be planning to use an aggressive neural pruning program. Aggressive? I've seen this program in action! Saying this thing is aggressive is like saying a great white shark likes to nibble on things! I'm more than a little disturbed by how much you know about this program.
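How the first run arrives at such a file can be sketched in a few lines. This is a simplified illustration, not the code of iolate.rb itself: the method name `extract_textbox_lines`, the file name `example.png` and the sample page are made up for this sketch, and tag removal is left out. It only shows the rule from the header comments: inside the cotan block, a line starting with @ and four comma separated integers opens a text box unless the next line starts with # (a mask), and a line consisting of ~ closes it.

```ruby
# Simplified sketch (hypothetical helper, not part of iolate.rb):
# collect the text of each text box inside the cotan block,
# joining the lines of one box with a single space.
def extract_textbox_lines(page)
  in_cotan = false
  in_box = false
  boxes = []
  current = []
  lines = page.split("\n")
  lines.each_with_index do |line, i|
    if !in_cotan
      in_cotan = line.start_with?("{{cotan>")
    elsif line.include?("{{<cotan}}")
      in_cotan = false
    elsif in_box
      if line.strip == "~"            # end of the text box
        boxes << current.join(" ")
        current = []
        in_box = false
      else
        current << line.strip
      end
    elsif line =~ /\A@\d+,\d+,\d+,\d+/ && !lines[i + 1].to_s.start_with?("#")
      in_box = true                   # box head, and the next line is no mask colour
    end
  end
  boxes
end

# Made-up sample page with two text boxes and one mask box.
SAMPLE_PAGE = <<~PAGE
  {{cotan>example.png}}
  @37,25,419,112
  Those were...
  **MY** notes.
  ~
  @10,20,30,40
  #ffffff
  ~
  @150,25,300,80
  I think I know what's going on.
  ~
  {{<cotan}}
PAGE

puts extract_textbox_lines(SAMPLE_PAGE)
```

Each element of the returned array corresponds to one line of transfertext.utf8, so the number of lines in that file equals the number of text boxes found.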
Translate it into various languages and separate the languages by a line with a ~, optionally followed by a letter to indicate the language, to make it look like this example:
Das waren... **Meine** Notizen. Ich glaube, ich weiß, was hier los ist. Die meisten Roboter auf diesem Planeten erreichen das Alter der neuronalen Beschneidung. Anstatt die Roboter mit Dr. Bowmans mentalem Design zu verschrotten, planen sie vielleicht ein aggressives neurales Beschneidungsprogramm zu verwenden. Aggressiv? Ich habe dieses Programm in Aktion gesehen! Es als aggressiv zu bezeichnen, ist so, als würde man sagen, dass ein weißer Hai gerne an Dingen knabbert! Ich bin mehr als nur ein wenig beunruhigt darüber, wie viel Sie über dieses Programm wissen.
~f
C'était... mes notes. Je pense que je sais ce qui se passe. La majorité des robots sur cette planète approchent de l'âge de l'élagage neural. Plutôt que de détruire les robots en utilisant le design mental du Dr. Bowman, ils peuvent prévoir d'utiliser un programme agressif d'élagage neuronal. Agressif ? J'ai vu ce programme en action ! Dire que cette chose est agressive c'est comme dire qu'un grand requin blanc aime grignoter des choses ! Je suis plus que troublé par ce que tu sais de ce programme.
~d
Det var... **MINE** noter. Jeg tror, jeg ved, hvad der foregår. De fleste robotter på denne planet nærmer sig den neurale beskæringsalder. I stedet for at skrotte robotter ved hjælp af Dr. Bowmans mentale design, planlægger de måske at bruge et aggressivt neuralt beskæringsprogram. Aggressivt? Jeg har set dette program i aktion! At sige, at denne tingest er aggressiv er som at sige, at en stor hvid haj kan lide at gnaske på ting! Jeg er mere end en smule foruroliget over, hvor meget du ved om dette program.
~s
Esas eran... mis notas. Creo que sé lo que está pasando. La mayoría de los robots de este planeta se acercan a la edad de poda neural. En lugar de desechar los robots usando el diseño mental del Dr. Bowman, pueden estar planeando usar un programa de poda neural agresivo. ¿Agresivo? ¡He visto este programa en acción! Decir que esta cosa es agresiva es como decir que a un gran tiburón blanco le gusta mordisquear cosas. Estoy más que perturbado por lo mucho que sabes sobre este programa.
~u
Ne olivat... **MINUN** muistiinpanojani. Luulen tietäväni mistä on kyse. Suurin osa tämän planeetan roboteista lähestyy hermojen karsintaikää. Sen sijaan, että he romuttaisivat robotteja tohtori Bowmanin mentaalisen mallin mukaan, he saattavat suunnitella aggressiivista hermojen karsintaohjelmaa. Aggressiivinen? Olen nähnyt tämän ohjelman toiminnassa! Tämän sanominen aggressiiviseksi on kuin sanoisi, että valkohai tykkää nakertaa asioita! Minua häiritsee, miten paljon tiedät tästä ohjelmasta.
~h
Ezek... **az én jegyzeteim** voltak. Azt hiszem, tudom, mi folyik itt. A robotok többsége ezen a bolygón közelít az idegrendszeri metszési korhoz. Ahelyett, hogy a robotokat Dr. Bowman mentális tervezésével selejteznék, talán egy agresszív neurális metszési programot terveznek. Agresszív? Láttam ezt a programot működés közben! Azt mondani, hogy ez a dolog agresszív, olyan, mintha azt mondanánk, hogy egy nagy fehér cápa szeret rágcsálni dolgokat! Egy kicsit zavar, hogy mennyit tudsz erről a programról.
Note that DeepL preserves the bold formatting here only for German, Danish, Finnish and Hungarian, but not for French or Spanish. Your mileage may vary on this.
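The way the second run takes such a file apart can be sketched as follows. This is an illustration under assumptions, not the script's own code: `split_translations` and `LANGUAGE_LETTERS` are names invented here (the table mirrors the COUNTRYLETTERS constant of the script), and the sample text is shortened to two lines per language. The text is split at the separation character, and from the second block on, the first line of each block is read as the optional language letter.

```ruby
# Hypothetical letter table, mirroring COUNTRYLETTERS in iolate.rb.
LANGUAGE_LETTERS = {
  "d" => "Dansk", "f" => "French", "g" => "German", "h" => "Hungarian",
  "r" => "Russian", "s" => "Spanish", "u" => "Finnish"
}.freeze

# Split the transfer text into [language_name, lines] pairs.
# The first block has no separator line and therefore no language name.
def split_translations(text, sep = "~")
  text.split(sep).each_with_index.map do |block, index|
    lines = block.split("\n")
    language = ""
    if index > 0
      # the rest of the separator line holds the optional language letter
      letter = lines.shift.to_s.strip[0].to_s.downcase
      language = LANGUAGE_LETTERS.fetch(letter, "")
    end
    [language, lines.reject(&:empty?)]
  end
end

# Shortened sample: two text boxes per language.
SAMPLE_TRANSFER =
  "Das waren... **Meine** Notizen.\nIch glaube, ich weiß, was hier los ist.\n" \
  "~f\nC'était... mes notes.\nJe pense que je sais ce qui se passe.\n" \
  "~d\nDet var... **MINE** noter.\nJeg tror, jeg ved, hvad der foregår.\n"

split_translations(SAMPLE_TRANSFER).each do |language, lines|
  name = language.empty? ? "(unnamed)" : language
  puts "#{name}: #{lines.length} lines"
end
```

Every block has to contain as many lines as there are text boxes in the page. If a ~ occurs in the text itself, the split goes wrong, which is what the -c option is for.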
The second run of iolate.rb should then give you translated files for all the languages, which here are
cpage-0-.dkw
cpage-1-French.dkw
cpage-2-Dansk.dkw
cpage-3-Spanish.dkw
cpage-4-Finnish.dkw
cpage-5-Hungarian.dkw
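The naming scheme behind these files is the number of the translation and the language name, inserted before the extension. A minimal sketch, assuming the default input name cpage.dkw; `output_name` is a name made up for this illustration, and `String#delete_suffix` needs Ruby 2.5 or later:

```ruby
# Sketch of the naming scheme "<base>-<number>-<language><extension>".
# output_name is a hypothetical helper, not a method of iolate.rb.
def output_name(dokuwiki_file, index, language)
  ext = File.extname(dokuwiki_file)        # ".dkw"
  base = dokuwiki_file.delete_suffix(ext)  # "cpage"
  "#{base}-#{index}-#{language}#{ext}"
end

# The first translation has no separator line and thus no language name.
languages = ["", "French", "Dansk", "Spanish", "Finnish", "Hungarian"]
languages.each_with_index do |language, index|
  puts output_name("cpage.dkw", index, language)
end
```

The first name ends in a bare hyphen because the first block of transfertext.utf8 carries no language letter.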
You still have to translate what is outside the cotan tags manually.