Apache OpenOffice (AOO) Bugzilla – Issue 66816
GAWK Support
Last modified: 2013-08-07 15:12:27 UTC
GAWK
----

There are many excellent programs that perform specific tasks extremely well. OOo would not be able to surpass their functionality (and does not need to), so I recommend integrating (facilitating) their use in OOo. Indeed, OOo should build on existing free software, too. I already reported in issue http://www.openoffice.org/issues/show_bug.cgi?id=66589 the potential of the free statistical package R. I will elaborate here on another excellent program, gawk.

CLARIFICATION: when I say tight integration, I do NOT mean using the code from that program inside OOo. Instead, OOo should be able to communicate directly with that program (e.g. through pipelines), and the user should be able to use that program within OOo without having to establish the connection himself.

WHY GAWK?

Gawk is free software available for both UNIX and MS Windows (project GnuWin32 on Sourceforge.net). It allows one to automate complex text-manipulation tasks beyond the possibilities of ordinary scripts (like VB or JavaScript). UNIX users would be particularly happy to use it.

There are 2 potential uses for gawk in OOo:

1.) The "Cells/Rows" architecture in Calc is obviously inviting for gawk use. I have often needed to perform complex string manipulations, and Calc unfortunately offers very limited possibilities here. In addition, some bugs in the find function (see my issue http://qa.openoffice.org/issues/show_bug.cgi?id=66590 ) complicate this further. Complex computations built inside a Calc worksheet cannot be automated (reused in a different worksheet), while the classical macros are really limited in scope. This is where gawk makes the difference.

How to implement this?
- Calc should open a bidirectional pipeline to gawk (gawk supports this)
- the gawk script should be written and saved as a macro/script (therefore one additional level of automation)
- IF no FieldSeparator (FS) or RecordSeparator (RS) is specified in the BEGIN section of the gawk script, Calc should set some default values, which should also be used to join the Cells and Rows of the worksheet when pipelining the data stream into gawk
- these same values (FS & RS) should be used to split the data back into cells when importing the processed data back into Calc (through the bidirectional pipeline).

2.) The second use of gawk is obviously in Writer. The advantages of gawk are again versatility and suitability for complex tasks and automation, but also its speed. The implementation should be similar to that described previously, with the exception that RS should delimit paragraphs while FS should be left at its default (= space).

An advanced feature would be to implement 2 modes for Writer to parse the text:
- as plain text (no formatting, just splitting into paragraphs)
- as xml-tagged text for more advanced processing (including text styles/formatting, but not as comprehensive as in the saved file)
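The Calc round trip described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposal for how OOo should implement it. The default separators (tab for FS, newline for RS) are my assumption, since the proposal leaves the defaults open, and the names rows_to_stream, stream_to_rows and run_gawk are hypothetical. For simplicity run_gawk does a one-shot run of an external gawk process rather than keeping a persistent bidirectional pipe open.

```python
import subprocess

# Assumed defaults: the proposal only says Calc "should set some default values".
DEFAULT_FS = "\t"  # field separator between cells
DEFAULT_RS = "\n"  # record separator between rows


def rows_to_stream(rows, fs=DEFAULT_FS, rs=DEFAULT_RS):
    """Join cells with FS and rows with RS, as Calc would before writing to the pipe."""
    for row in rows:
        for cell in row:
            if fs in cell or rs in cell:
                # A real implementation would need quoting or other separators here.
                raise ValueError(f"cell {cell!r} contains a separator")
    return "".join(fs.join(row) + rs for row in rows)


def stream_to_rows(text, fs=DEFAULT_FS, rs=DEFAULT_RS):
    """Split the processed stream back into rows and cells using the same separators."""
    records = text.split(rs)
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after the final RS
    return [record.split(fs) for record in records]


def run_gawk(rows, program, fs=DEFAULT_FS):
    """Hypothetical helper: send the rows through an external gawk and read the result back.

    Assumes a gawk executable is on the PATH; RS is left at gawk's default (newline).
    """
    result = subprocess.run(
        ["gawk", "-v", f"FS={fs}", "-v", f"OFS={fs}", program],
        input=rows_to_stream(rows, fs),
        capture_output=True,
        text=True,
        check=True,
    )
    return stream_to_rows(result.stdout, fs)
```

With gawk installed, run_gawk([["nch", "12"]], '{ $2 = $2 * 2; print }') would return the processed rows as a list of lists, ready to be written back into cells. Cells that themselves contain a separator are rejected in this sketch; choosing defaults that cannot occur in cell content is exactly the part Calc would have to solve.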
As this is an enhancement, reassigned to requirements.
EXAMPLE OF GAWK USE

Here is a real-life example showing the usefulness of gawk/awk: I worked recently on a patient DB and wanted to create some dummy variables for the hospital unit (patient category).

GAWK SCRIPT ($1 contains the input - the hospital unit)

{
    $1 = tolower($1)
    $2 = 0   # neurosurgery vs non-neurosurgery
    $3 = 0   # neurology vs non-neurology
    $4 = 0   # general surgery vs non-surgery
    $5 = 0   # internal medicine vs non-im
    $6 = 1   # ERROR var, stays 1 if the abbreviation is unknown

    # NEUROSURGERY
    if ($1 ~ /nch/)        { $2 = 1; $6 = 0 }
    # NEUROLOGY
    if ($1 ~ /^n$|^ne/)    { $3 = 1; $6 = 0 }
    # GENERAL SURGERY
    if ($1 ~ /^ch/)        { $4 = 1; $6 = 0 }
    # INTERNAL MEDICINE
    if ($1 ~ /mi|end|nut/) { $5 = 1; $6 = 0 }

    print $0 >> "out-file"
}

### END SCRIPT

- this simple script does exactly what I wanted in a very simple fashion, AND
- it took me less than 5 minutes to write it!!!

Execution is almost instantaneous even on big files (~1 MB text file).

Unfortunately, I didn't manage to get this same thing done using only Calc's functionality. (One reason is the problem with the find() function described previously; this severe limitation of Calc hampers any serious work with strings.)