[SCM] gawk branch, master, updated. gawk-4.1.0-4868-gdaa6608c
From: Arnold Robbins
Subject: [SCM] gawk branch, master, updated. gawk-4.1.0-4868-gdaa6608c
Date: Mon, 15 Aug 2022 11:21:27 -0400 (EDT)
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "gawk".
The branch, master has been updated
via daa6608c9e5ec8385ac2e7a24388950ad7c6e40e (commit)
via 36c498a173c5c61a0b78a426ecb6c7e834bdb681 (commit)
via 44bd38538e37f219c23b223cf297f5d03b7407f1 (commit)
from a1412f562aa8e4ab66c04507f4e171fcf22e1389 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://git.sv.gnu.org/cgit/gawk.git/commit/?id=daa6608c9e5ec8385ac2e7a24388950ad7c6e40e
commit daa6608c9e5ec8385ac2e7a24388950ad7c6e40e
Author: Arnold D. Robbins <arnold@skeeve.com>
Date: Mon Aug 15 18:21:01 2022 +0300
Add new pm-gawk.texi manual. Update gawk manual for it.
diff --git a/NEWS b/NEWS
index 62c743e6..6b14fd8a 100644
--- a/NEWS
+++ b/NEWS
@@ -49,8 +49,9 @@ copy of the manual.
10. Gawk now supports Terence Kelly's "persistent malloc" (pma),
allowing gawk to preserve its variables, arrays and user-defined
functions between runs. THIS IS AN EXPERIMENTAL FEATURE!
+
For more information, see the manual. A new pm-gawk.1 man page
-is included.
+is included, as is a separate user manual that focuses on the feature.
11. Support for OS/2 has been removed. It was not being actively
maintained.
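The "persistent malloc" (pma) feature announced in the NEWS hunk above works by keeping the interpreter's heap in a file-backed memory mapping, so that script state written in one run is still there in the next. As a rough illustration of that underlying idea only (not gawk's actual implementation; the function and file names here are made up), a Python sketch of a counter that survives across invocations:

```python
import mmap
import os
import struct

def bump_counter(path):
    """Increment a counter kept in a file-backed memory mapping.

    Because the mapping is shared with the file, the updated value is
    written back to disk and is still there on the next run -- the
    essence of a persistent heap.  Returns the new counter value.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        if os.fstat(fd).st_size < 8:
            os.ftruncate(fd, 8)          # ensure backing store exists (zeroed)
        with mmap.mmap(fd, 8) as m:      # shared, read/write mapping
            (v,) = struct.unpack_from("<q", m, 0)
            v += 1
            struct.pack_into("<q", m, 0, v)
            m.flush()                    # push the update to the file
        return v
    finally:
        os.close(fd)
```

In pm-gawk itself the analogous effect is reached by naming a heap backing file via the GAWK_PERSIST_FILE environment variable; the new manual documents the actual interface.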
diff --git a/doc/ChangeLog b/doc/ChangeLog
index a32dbcba..ffa30eff 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -2,6 +2,15 @@
* gawktexi.in (Persistent Memory): Typo fix.
+ Semi-related:
+
+ * gawktexi.in (Persistent Memory): Add references to and info about
+ the new "Persistent Memory gawk User Manual" from Terence Kelly.
+ (Distribution contents): Updated and corrected.
+ * pm-gawk.texi: New file.
+ * Makefile.am: Add appropriate stuff to build and install
+ the various versions of the pm-gawk.texi file.
+
2022-08-14 Arnold D. Robbins <arnold@skeeve.com>
* gawktexi.in (Persistent Memory): Small addition.
diff --git a/doc/Makefile.am b/doc/Makefile.am
index 39d9cdb7..4c196bcd 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -25,7 +25,7 @@
## process this file with automake to produce Makefile.in
-info_TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi
+info_TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi pm-gawk.texi
man_MANS = gawk.1 gawkbug.1 pm-gawk.1
@@ -54,7 +54,7 @@ EXTRA_DIST = ChangeLog ChangeLog.0 ChangeLog.1 \
bc_notes
# Get rid of generated files when cleaning
-CLEANFILES = *.ps *.html *.dvi *~ awkcard.nc awkcard.tr gawk.pdf gawkinet.pdf gawkworkflow.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf
+CLEANFILES = *.ps *.html *.dvi *~ awkcard.nc awkcard.tr gawk.pdf gawkinet.pdf gawkworkflow.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf pm-gawk.pdf
MAKEINFO = @MAKEINFO@ --no-split --force
@@ -80,9 +80,9 @@ AWKCARD = awkcard.ps
gawk.texi: $(srcdir)/gawktexi.in $(srcdir)/sidebar.awk
awk -f $(srcdir)/sidebar.awk < $(srcdir)/gawktexi.in > gawk.texi
-postscript: gawk.ps gawkinet.ps gawkworkflow.ps gawk.1.ps gawkbug.1.ps pm-gawk.1.ps $(AWKCARD)
+postscript: gawk.ps gawkinet.ps gawkworkflow.ps pm-gawk.ps gawk.1.ps gawkbug.1.ps pm-gawk.1.ps $(AWKCARD)
-pdf-local: postscript gawk.pdf gawkinet.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf
+pdf-local: postscript gawk.pdf gawkinet.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf pm-gawk.pdf
gawk.ps: gawk.dvi
TEXINPUTS=$(srcdir): dvips -o gawk.ps gawk.dvi
@@ -93,6 +93,9 @@ gawkinet.ps: gawkinet.dvi
gawkworkflow.ps: gawkworkflow.dvi
TEXINPUTS=$(srcdir): dvips -o gawkworkflow.ps gawkworkflow.dvi
+pm-gawk.ps: pm-gawk.dvi
+ TEXINPUTS=$(srcdir): dvips -o pm-gawk.ps pm-gawk.dvi
+
gawk.1.ps: gawk.1
-groff -man $(srcdir)/gawk.1 > gawk.1.ps
diff --git a/doc/Makefile.in b/doc/Makefile.in
index 1753ce45..b6dd6a1d 100644
--- a/doc/Makefile.in
+++ b/doc/Makefile.in
@@ -177,14 +177,14 @@ am__v_texidevnull_ = $(am__v_texidevnull_@AM_DEFAULT_V@)
am__v_texidevnull_0 = > /dev/null
am__v_texidevnull_1 =
INFO_DEPS = $(srcdir)/gawk.info $(srcdir)/gawkinet.info \
- $(srcdir)/gawkworkflow.info
+ $(srcdir)/gawkworkflow.info $(srcdir)/pm-gawk.info
TEXINFO_TEX = $(top_srcdir)/build-aux/texinfo.tex
am__TEXINFO_TEX_DIR = $(top_srcdir)/build-aux
-DVIS = gawk.dvi gawkinet.dvi gawkworkflow.dvi
-PDFS = gawk.pdf gawkinet.pdf gawkworkflow.pdf
-PSS = gawk.ps gawkinet.ps gawkworkflow.ps
-HTMLS = gawk.html gawkinet.html gawkworkflow.html
-TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi
+DVIS = gawk.dvi gawkinet.dvi gawkworkflow.dvi pm-gawk.dvi
+PDFS = gawk.pdf gawkinet.pdf gawkworkflow.pdf pm-gawk.pdf
+PSS = gawk.ps gawkinet.ps gawkworkflow.ps pm-gawk.ps
+HTMLS = gawk.html gawkinet.html gawkworkflow.html pm-gawk.html
+TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi pm-gawk.texi
TEXI2DVI = texi2dvi
TEXI2PDF = $(TEXI2DVI) --pdf --batch
MAKEINFOHTML = $(MAKEINFO) --html
@@ -365,7 +365,7 @@ target_alias = @target_alias@
top_build_prefix = @top_build_prefix@
top_builddir = @top_builddir@
top_srcdir = @top_srcdir@
-info_TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi
+info_TEXINFOS = gawk.texi gawkinet.texi gawkworkflow.texi pm-gawk.texi
man_MANS = gawk.1 gawkbug.1 pm-gawk.1
EXTRA_DIST = ChangeLog ChangeLog.0 ChangeLog.1 \
README.card ad.block setter.outline \
@@ -393,7 +393,7 @@ EXTRA_DIST = ChangeLog ChangeLog.0 ChangeLog.1 \
# Get rid of generated files when cleaning
-CLEANFILES = *.ps *.html *.dvi *~ awkcard.nc awkcard.tr gawk.pdf gawkinet.pdf gawkworkflow.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf
+CLEANFILES = *.ps *.html *.dvi *~ awkcard.nc awkcard.tr gawk.pdf gawkinet.pdf gawkworkflow.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf pm-gawk.pdf
TROFF = groff -t -Tps -U
SEDME = sed -e "s/^level0 restore/level0 restore flashme 100 72 moveto
(Copyright `date '+%m-%d-%y %T'`, FSF, Inc. (all)) show/" \
-e "s/^\/level0 save def/\/level0 save def 30 -48 translate/"
@@ -495,6 +495,10 @@ $(srcdir)/gawkworkflow.info: gawkworkflow.texi
gawkworkflow.dvi: gawkworkflow.texi
gawkworkflow.pdf: gawkworkflow.texi
gawkworkflow.html: gawkworkflow.texi
+$(srcdir)/pm-gawk.info: pm-gawk.texi
+pm-gawk.dvi: pm-gawk.texi
+pm-gawk.pdf: pm-gawk.texi
+pm-gawk.html: pm-gawk.texi
.dvi.ps:
$(AM_V_DVIPS)TEXINPUTS="$(am__TEXINFO_TEX_DIR)$(PATH_SEPARATOR)$$TEXINPUTS" \
$(DVIPS) $(AM_V_texinfo) -o $@ $<
@@ -577,15 +581,17 @@ dist-info: $(INFO_DEPS)
mostlyclean-aminfo:
-rm -rf gawk.t2d gawk.t2p gawkinet.t2d gawkinet.t2p gawkworkflow.t2d \
- gawkworkflow.t2p
+ gawkworkflow.t2p pm-gawk.t2d pm-gawk.t2p
clean-aminfo:
-test -z "gawk.dvi gawk.pdf gawk.ps gawk.html gawkinet.dvi gawkinet.pdf \
gawkinet.ps gawkinet.html gawkworkflow.dvi gawkworkflow.pdf \
- gawkworkflow.ps gawkworkflow.html" \
+ gawkworkflow.ps gawkworkflow.html pm-gawk.dvi pm-gawk.pdf \
+ pm-gawk.ps pm-gawk.html" \
|| rm -rf gawk.dvi gawk.pdf gawk.ps gawk.html gawkinet.dvi gawkinet.pdf \
gawkinet.ps gawkinet.html gawkworkflow.dvi gawkworkflow.pdf \
- gawkworkflow.ps gawkworkflow.html
+ gawkworkflow.ps gawkworkflow.html pm-gawk.dvi pm-gawk.pdf \
+ pm-gawk.ps pm-gawk.html
maintainer-clean-aminfo:
@list='$(INFO_DEPS)'; for i in $$list; do \
@@ -906,9 +912,9 @@ uninstall-man: uninstall-man1
gawk.texi: $(srcdir)/gawktexi.in $(srcdir)/sidebar.awk
awk -f $(srcdir)/sidebar.awk < $(srcdir)/gawktexi.in > gawk.texi
-postscript: gawk.ps gawkinet.ps gawkworkflow.ps gawk.1.ps gawkbug.1.ps pm-gawk.1.ps $(AWKCARD)
+postscript: gawk.ps gawkinet.ps gawkworkflow.ps pm-gawk.ps gawk.1.ps gawkbug.1.ps pm-gawk.1.ps $(AWKCARD)
-pdf-local: postscript gawk.pdf gawkinet.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf
+pdf-local: postscript gawk.pdf gawkinet.pdf awkcard.pdf gawk.1.pdf gawkbug.1.pdf pm-gawk.1.pdf pm-gawk.pdf
gawk.ps: gawk.dvi
TEXINPUTS=$(srcdir): dvips -o gawk.ps gawk.dvi
@@ -919,6 +925,9 @@ gawkinet.ps: gawkinet.dvi
gawkworkflow.ps: gawkworkflow.dvi
TEXINPUTS=$(srcdir): dvips -o gawkworkflow.ps gawkworkflow.dvi
+pm-gawk.ps: pm-gawk.dvi
+ TEXINPUTS=$(srcdir): dvips -o pm-gawk.ps pm-gawk.dvi
+
gawk.1.ps: gawk.1
-groff -man $(srcdir)/gawk.1 > gawk.1.ps
diff --git a/doc/gawk.info b/doc/gawk.info
index 00ab27eb..6c1f3404 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -22234,9 +22234,13 @@ the different verbosity levels are.
'gawk' does not currently detect such a situation and may not do so
in the future either.
- Here are articles and web links that provide more information about
-persistent memory and why it's useful in a scripting language like
-'gawk'.
+ Terence Kelly has provided a separate 'Persistent-Memory 'gawk' User
+Manual' document, which is included in the 'gawk' distribution. It is
+worth reading. *Note General Introduction: (pm-gawk)Top.
+
+ Here are additional articles and web links that provide more
+information about persistent memory and why it's useful in a scripting
+language like 'gawk'.
<https://web.eecs.umich.edu/~tpkelly/pma/>
This is the canonical source for Terence Kelly's Persistent Memory
@@ -31165,6 +31169,19 @@ Various '.c', '.y', and '.h' files
'doc/gawkworkflow.info'
The generated Info file for 'Participating in 'gawk' Development'.
+'doc/pm-gawk.texi'
+ The Texinfo source file for *note General Introduction:
+ (pm-gawk)Top. It should be processed with TeX (via 'texi2dvi' or
+ 'texi2pdf') to produce a printed document and with 'makeinfo' to
+ produce an Info or HTML file.
+
+'doc/pm-gawk.info'
+ The generated Info file for 'Persistent-Memory 'gawk' User Manual'.
+
+'doc/pm-gawk.1'
+ The 'troff' source for a manual page describing the persistent
+ memory features presented in *note Persistent Memory::.
+
'doc/igawk.1'
The 'troff' source for a manual page describing the 'igawk' program
presented in *note Igawk Program::. (Since 'gawk' can do its own
@@ -31186,16 +31203,12 @@ Various '.c', '.y', and '.h' files
'Makefile.in'
'aclocal.m4'
-'bisonfix.awk'
-'config.guess'
+'build-aux/*'
'configh.in'
'configure.ac'
'configure'
'custom.h'
-'depcomp'
-'install-sh'
'missing_d/*'
-'mkinstalldirs'
'm4/*'
These files and subdirectories are used when configuring and
compiling 'gawk' for various Unix systems. Most of them are
@@ -36119,7 +36132,7 @@ Index
(line 44)
* AWKLIBPATH environment variable <6>: POSIX/GNU. (line 108)
* AWKLIBPATH environment variable <7>: Distribution contents.
- (line 183)
+ (line 192)
* AWKLIBPATH environment variable <8>: Shell Startup Files. (line 6)
* AWKPATH environment variable: AWKPATH Variable. (line 6)
* AWKPATH environment variable <1>: Include Files. (line 8)
@@ -36130,7 +36143,7 @@ Index
* AWKPATH environment variable <6>: POSIX/GNU. (line 105)
* AWKPATH environment variable <7>: Feature History. (line 11)
* AWKPATH environment variable <8>: Distribution contents.
- (line 183)
+ (line 192)
* AWKPATH environment variable <9>: Shell Startup Files. (line 6)
* AWKPATH environment variable <10>: PC Using. (line 12)
* AWKPATH environment variable <11>: VMS Running. (line 57)
@@ -37038,9 +37051,9 @@ Index
* environment variables, AWKLIBPATH <6>: POSIX/GNU. (line 108)
* environment variables, AWKPATH <7>: Feature History. (line 11)
* environment variables, AWKPATH <8>: Distribution contents.
- (line 183)
+ (line 192)
* environment variables, AWKLIBPATH <7>: Distribution contents.
- (line 183)
+ (line 192)
* environment variables, AWKPATH <9>: Shell Startup Files. (line 6)
* environment variables, AWKLIBPATH <8>: Shell Startup Files. (line 6)
* environment variables, Path: PC Binary Installation.
@@ -37775,7 +37788,7 @@ Index
* Kahrs, Jürgen <1>: Contributors. (line 71)
* Kasal, Stepan: Acknowledgments. (line 60)
* Kelly, Terence: Persistent Memory. (line 68)
-* Kelly, Terence <1>: Persistent Memory. (line 106)
+* Kelly, Terence <1>: Persistent Memory. (line 110)
* Kelly, Terence <2>: Feature History. (line 508)
* Kenobi, Obi-Wan: Undocumented. (line 6)
* Kernighan, Brian: History. (line 17)
@@ -39566,244 +39579,244 @@ Ref: Two-way I/O-Footnote-2884578
Node: TCP/IP Networking884660
Node: Profiling887736
Node: Persistent Memory897042
-Ref: Persistent Memory-Footnote-1904750
-Node: Extension Philosophy904877
-Node: Advanced Features Summary906364
-Node: Internationalization908536
-Node: I18N and L10N910210
-Node: Explaining gettext910897
-Ref: Explaining gettext-Footnote-1916789
-Ref: Explaining gettext-Footnote-2916974
-Node: Programmer i18n917139
-Ref: Programmer i18n-Footnote-1922088
-Node: Translator i18n922137
-Node: String Extraction922931
-Ref: String Extraction-Footnote-1924063
-Node: Printf Ordering924149
-Ref: Printf Ordering-Footnote-1926935
-Node: I18N Portability926999
-Ref: I18N Portability-Footnote-1929455
-Node: I18N Example929518
-Ref: I18N Example-Footnote-1932793
-Ref: I18N Example-Footnote-2932866
-Node: Gawk I18N932975
-Node: I18N Summary933597
-Node: Debugger934938
-Node: Debugging935938
-Node: Debugging Concepts936379
-Node: Debugging Terms938188
-Node: Awk Debugging940763
-Ref: Awk Debugging-Footnote-1941708
-Node: Sample Debugging Session941840
-Node: Debugger Invocation942374
-Node: Finding The Bug943760
-Node: List of Debugger Commands950234
-Node: Breakpoint Control951567
-Node: Debugger Execution Control955261
-Node: Viewing And Changing Data958623
-Node: Execution Stack962164
-Node: Debugger Info963801
-Node: Miscellaneous Debugger Commands967872
-Node: Readline Support972934
-Node: Limitations973830
-Node: Debugging Summary976384
-Node: Namespaces977663
-Node: Global Namespace978774
-Node: Qualified Names980172
-Node: Default Namespace981171
-Node: Changing The Namespace981912
-Node: Naming Rules983526
-Node: Internal Name Management985374
-Node: Namespace Example986416
-Node: Namespace And Features988978
-Node: Namespace Summary990413
-Node: Arbitrary Precision Arithmetic991890
-Node: Computer Arithmetic993377
-Ref: table-numeric-ranges997143
-Ref: table-floating-point-ranges997637
-Ref: Computer Arithmetic-Footnote-1998296
-Node: Math Definitions998353
-Ref: table-ieee-formats1001329
-Node: MPFR features1001897
-Node: MPFR On Parole1002342
-Ref: MPFR On Parole-Footnote-11003171
-Node: MPFR Intro1003326
-Node: FP Math Caution1004977
-Ref: FP Math Caution-Footnote-11006049
-Node: Inexactness of computations1006418
-Node: Inexact representation1007449
-Node: Comparing FP Values1008809
-Node: Errors accumulate1010050
-Node: Strange values1011506
-Ref: Strange values-Footnote-11014094
-Node: Getting Accuracy1014199
-Node: Try To Round1016909
-Node: Setting precision1017808
-Ref: table-predefined-precision-strings1018505
-Node: Setting the rounding mode1020336
-Ref: table-gawk-rounding-modes1020710
-Ref: Setting the rounding mode-Footnote-11024642
-Node: Arbitrary Precision Integers1024821
-Ref: Arbitrary Precision Integers-Footnote-11027996
-Node: Checking for MPFR1028145
-Node: POSIX Floating Point Problems1029619
-Ref: POSIX Floating Point Problems-Footnote-11034272
-Node: Floating point summary1034310
-Node: Dynamic Extensions1036500
-Node: Extension Intro1038053
-Node: Plugin License1039319
-Node: Extension Mechanism Outline1040116
-Ref: figure-load-extension1040555
-Ref: figure-register-new-function1042121
-Ref: figure-call-new-function1043214
-Node: Extension API Description1045277
-Node: Extension API Functions Introduction1046990
-Ref: table-api-std-headers1048826
-Node: General Data Types1053076
-Ref: General Data Types-Footnote-11061782
-Node: Memory Allocation Functions1062081
-Ref: Memory Allocation Functions-Footnote-11066582
-Node: Constructor Functions1066681
-Node: API Ownership of MPFR and GMP Values1070334
-Node: Registration Functions1071867
-Node: Extension Functions1072567
-Node: Exit Callback Functions1077889
-Node: Extension Version String1079139
-Node: Input Parsers1079802
-Node: Output Wrappers1092523
-Node: Two-way processors1097035
-Node: Printing Messages1099300
-Ref: Printing Messages-Footnote-11100471
-Node: Updating ERRNO1100624
-Node: Requesting Values1101363
-Ref: table-value-types-returned1102100
-Node: Accessing Parameters1103209
-Node: Symbol Table Access1104446
-Node: Symbol table by name1104958
-Ref: Symbol table by name-Footnote-11107983
-Node: Symbol table by cookie1108111
-Ref: Symbol table by cookie-Footnote-11112296
-Node: Cached values1112360
-Ref: Cached values-Footnote-11115896
-Node: Array Manipulation1116049
-Ref: Array Manipulation-Footnote-11117140
-Node: Array Data Types1117177
-Ref: Array Data Types-Footnote-11119835
-Node: Array Functions1119927
-Node: Flattening Arrays1124712
-Node: Creating Arrays1131688
-Node: Redirection API1136455
-Node: Extension API Variables1139288
-Node: Extension Versioning1139999
-Ref: gawk-api-version1140428
-Node: Extension GMP/MPFR Versioning1142160
-Node: Extension API Informational Variables1143788
-Node: Extension API Boilerplate1144861
-Node: Changes from API V11148835
-Node: Finding Extensions1150407
-Node: Extension Example1150966
-Node: Internal File Description1151764
-Node: Internal File Ops1155844
-Ref: Internal File Ops-Footnote-11167194
-Node: Using Internal File Ops1167334
-Ref: Using Internal File Ops-Footnote-11169717
-Node: Extension Samples1169991
-Node: Extension Sample File Functions1171520
-Node: Extension Sample Fnmatch1179169
-Node: Extension Sample Fork1180656
-Node: Extension Sample Inplace1181874
-Node: Extension Sample Ord1185500
-Node: Extension Sample Readdir1186336
-Ref: table-readdir-file-types1187225
-Node: Extension Sample Revout1188293
-Node: Extension Sample Rev2way1188882
-Node: Extension Sample Read write array1189622
-Node: Extension Sample Readfile1192787
-Node: Extension Sample Time1193882
-Node: Extension Sample API Tests1195634
-Node: gawkextlib1196126
-Node: Extension summary1199044
-Node: Extension Exercises1202746
-Node: Language History1203988
-Node: V7/SVR3.11205644
-Node: SVR41207796
-Node: POSIX1209230
-Node: BTL1210611
-Node: POSIX/GNU1211340
-Node: Feature History1217246
-Node: Common Extensions1234985
-Node: Ranges and Locales1236268
-Ref: Ranges and Locales-Footnote-11240884
-Ref: Ranges and Locales-Footnote-21240911
-Ref: Ranges and Locales-Footnote-31241146
-Node: Contributors1241369
-Node: History summary1247366
-Node: Installation1248746
-Node: Gawk Distribution1249690
-Node: Getting1250174
-Node: Extracting1251137
-Node: Distribution contents1252775
-Node: Unix Installation1259836
-Node: Quick Installation1260640
-Node: Compiling with MPFR1263060
-Node: Shell Startup Files1263750
-Node: Additional Configuration Options1264839
-Node: Configuration Philosophy1267154
-Node: Compiling from Git1269550
-Node: Building the Documentation1270105
-Node: Non-Unix Installation1271489
-Node: PC Installation1271949
-Node: PC Binary Installation1272790
-Node: PC Compiling1273663
-Node: PC Using1274769
-Node: Cygwin1278265
-Node: MSYS1279489
-Node: VMS Installation1280091
-Node: VMS Compilation1280810
-Ref: VMS Compilation-Footnote-11282039
-Node: VMS Dynamic Extensions1282097
-Node: VMS Installation Details1283782
-Node: VMS Running1286044
-Node: VMS GNV1290323
-Node: Bugs1291037
-Node: Bug definition1291949
-Node: Bug address1294885
-Node: Usenet1298073
-Node: Performance bugs1299262
-Node: Asking for help1302183
-Node: Maintainers1304150
-Node: Other Versions1305157
-Node: Installation summary1313427
-Node: Notes1314784
-Node: Compatibility Mode1315578
-Node: Additions1316360
-Node: Accessing The Source1317285
-Node: Adding Code1318722
-Node: New Ports1325537
-Node: Derived Files1329912
-Ref: Derived Files-Footnote-11335572
-Ref: Derived Files-Footnote-21335607
-Ref: Derived Files-Footnote-31336205
-Node: Future Extensions1336319
-Node: Implementation Limitations1336977
-Node: Extension Design1338187
-Node: Old Extension Problems1339331
-Ref: Old Extension Problems-Footnote-11340849
-Node: Extension New Mechanism Goals1340906
-Ref: Extension New Mechanism Goals-Footnote-11344270
-Node: Extension Other Design Decisions1344459
-Node: Extension Future Growth1346572
-Node: Notes summary1347178
-Node: Basic Concepts1348336
-Node: Basic High Level1349017
-Ref: figure-general-flow1349299
-Ref: figure-process-flow1349985
-Ref: Basic High Level-Footnote-11353287
-Node: Basic Data Typing1353472
-Node: Glossary1356800
-Node: Copying1388687
-Node: GNU Free Documentation License1426230
-Node: Index1451350
+Ref: Persistent Memory-Footnote-1904963
+Node: Extension Philosophy905090
+Node: Advanced Features Summary906577
+Node: Internationalization908749
+Node: I18N and L10N910423
+Node: Explaining gettext911110
+Ref: Explaining gettext-Footnote-1917002
+Ref: Explaining gettext-Footnote-2917187
+Node: Programmer i18n917352
+Ref: Programmer i18n-Footnote-1922301
+Node: Translator i18n922350
+Node: String Extraction923144
+Ref: String Extraction-Footnote-1924276
+Node: Printf Ordering924362
+Ref: Printf Ordering-Footnote-1927148
+Node: I18N Portability927212
+Ref: I18N Portability-Footnote-1929668
+Node: I18N Example929731
+Ref: I18N Example-Footnote-1933006
+Ref: I18N Example-Footnote-2933079
+Node: Gawk I18N933188
+Node: I18N Summary933810
+Node: Debugger935151
+Node: Debugging936151
+Node: Debugging Concepts936592
+Node: Debugging Terms938401
+Node: Awk Debugging940976
+Ref: Awk Debugging-Footnote-1941921
+Node: Sample Debugging Session942053
+Node: Debugger Invocation942587
+Node: Finding The Bug943973
+Node: List of Debugger Commands950447
+Node: Breakpoint Control951780
+Node: Debugger Execution Control955474
+Node: Viewing And Changing Data958836
+Node: Execution Stack962377
+Node: Debugger Info964014
+Node: Miscellaneous Debugger Commands968085
+Node: Readline Support973147
+Node: Limitations974043
+Node: Debugging Summary976597
+Node: Namespaces977876
+Node: Global Namespace978987
+Node: Qualified Names980385
+Node: Default Namespace981384
+Node: Changing The Namespace982125
+Node: Naming Rules983739
+Node: Internal Name Management985587
+Node: Namespace Example986629
+Node: Namespace And Features989191
+Node: Namespace Summary990626
+Node: Arbitrary Precision Arithmetic992103
+Node: Computer Arithmetic993590
+Ref: table-numeric-ranges997356
+Ref: table-floating-point-ranges997850
+Ref: Computer Arithmetic-Footnote-1998509
+Node: Math Definitions998566
+Ref: table-ieee-formats1001542
+Node: MPFR features1002110
+Node: MPFR On Parole1002555
+Ref: MPFR On Parole-Footnote-11003384
+Node: MPFR Intro1003539
+Node: FP Math Caution1005190
+Ref: FP Math Caution-Footnote-11006262
+Node: Inexactness of computations1006631
+Node: Inexact representation1007662
+Node: Comparing FP Values1009022
+Node: Errors accumulate1010263
+Node: Strange values1011719
+Ref: Strange values-Footnote-11014307
+Node: Getting Accuracy1014412
+Node: Try To Round1017122
+Node: Setting precision1018021
+Ref: table-predefined-precision-strings1018718
+Node: Setting the rounding mode1020549
+Ref: table-gawk-rounding-modes1020923
+Ref: Setting the rounding mode-Footnote-11024855
+Node: Arbitrary Precision Integers1025034
+Ref: Arbitrary Precision Integers-Footnote-11028209
+Node: Checking for MPFR1028358
+Node: POSIX Floating Point Problems1029832
+Ref: POSIX Floating Point Problems-Footnote-11034485
+Node: Floating point summary1034523
+Node: Dynamic Extensions1036713
+Node: Extension Intro1038266
+Node: Plugin License1039532
+Node: Extension Mechanism Outline1040329
+Ref: figure-load-extension1040768
+Ref: figure-register-new-function1042334
+Ref: figure-call-new-function1043427
+Node: Extension API Description1045490
+Node: Extension API Functions Introduction1047203
+Ref: table-api-std-headers1049039
+Node: General Data Types1053289
+Ref: General Data Types-Footnote-11061995
+Node: Memory Allocation Functions1062294
+Ref: Memory Allocation Functions-Footnote-11066795
+Node: Constructor Functions1066894
+Node: API Ownership of MPFR and GMP Values1070547
+Node: Registration Functions1072080
+Node: Extension Functions1072780
+Node: Exit Callback Functions1078102
+Node: Extension Version String1079352
+Node: Input Parsers1080015
+Node: Output Wrappers1092736
+Node: Two-way processors1097248
+Node: Printing Messages1099513
+Ref: Printing Messages-Footnote-11100684
+Node: Updating ERRNO1100837
+Node: Requesting Values1101576
+Ref: table-value-types-returned1102313
+Node: Accessing Parameters1103422
+Node: Symbol Table Access1104659
+Node: Symbol table by name1105171
+Ref: Symbol table by name-Footnote-11108196
+Node: Symbol table by cookie1108324
+Ref: Symbol table by cookie-Footnote-11112509
+Node: Cached values1112573
+Ref: Cached values-Footnote-11116109
+Node: Array Manipulation1116262
+Ref: Array Manipulation-Footnote-11117353
+Node: Array Data Types1117390
+Ref: Array Data Types-Footnote-11120048
+Node: Array Functions1120140
+Node: Flattening Arrays1124925
+Node: Creating Arrays1131901
+Node: Redirection API1136668
+Node: Extension API Variables1139501
+Node: Extension Versioning1140212
+Ref: gawk-api-version1140641
+Node: Extension GMP/MPFR Versioning1142373
+Node: Extension API Informational Variables1144001
+Node: Extension API Boilerplate1145074
+Node: Changes from API V11149048
+Node: Finding Extensions1150620
+Node: Extension Example1151179
+Node: Internal File Description1151977
+Node: Internal File Ops1156057
+Ref: Internal File Ops-Footnote-11167407
+Node: Using Internal File Ops1167547
+Ref: Using Internal File Ops-Footnote-11169930
+Node: Extension Samples1170204
+Node: Extension Sample File Functions1171733
+Node: Extension Sample Fnmatch1179382
+Node: Extension Sample Fork1180869
+Node: Extension Sample Inplace1182087
+Node: Extension Sample Ord1185713
+Node: Extension Sample Readdir1186549
+Ref: table-readdir-file-types1187438
+Node: Extension Sample Revout1188506
+Node: Extension Sample Rev2way1189095
+Node: Extension Sample Read write array1189835
+Node: Extension Sample Readfile1193000
+Node: Extension Sample Time1194095
+Node: Extension Sample API Tests1195847
+Node: gawkextlib1196339
+Node: Extension summary1199257
+Node: Extension Exercises1202959
+Node: Language History1204201
+Node: V7/SVR3.11205857
+Node: SVR41208009
+Node: POSIX1209443
+Node: BTL1210824
+Node: POSIX/GNU1211553
+Node: Feature History1217459
+Node: Common Extensions1235198
+Node: Ranges and Locales1236481
+Ref: Ranges and Locales-Footnote-11241097
+Ref: Ranges and Locales-Footnote-21241124
+Ref: Ranges and Locales-Footnote-31241359
+Node: Contributors1241582
+Node: History summary1247579
+Node: Installation1248959
+Node: Gawk Distribution1249903
+Node: Getting1250387
+Node: Extracting1251350
+Node: Distribution contents1252988
+Node: Unix Installation1260494
+Node: Quick Installation1261298
+Node: Compiling with MPFR1263718
+Node: Shell Startup Files1264408
+Node: Additional Configuration Options1265497
+Node: Configuration Philosophy1267812
+Node: Compiling from Git1270208
+Node: Building the Documentation1270763
+Node: Non-Unix Installation1272147
+Node: PC Installation1272607
+Node: PC Binary Installation1273448
+Node: PC Compiling1274321
+Node: PC Using1275427
+Node: Cygwin1278923
+Node: MSYS1280147
+Node: VMS Installation1280749
+Node: VMS Compilation1281468
+Ref: VMS Compilation-Footnote-11282697
+Node: VMS Dynamic Extensions1282755
+Node: VMS Installation Details1284440
+Node: VMS Running1286702
+Node: VMS GNV1290981
+Node: Bugs1291695
+Node: Bug definition1292607
+Node: Bug address1295543
+Node: Usenet1298731
+Node: Performance bugs1299920
+Node: Asking for help1302841
+Node: Maintainers1304808
+Node: Other Versions1305815
+Node: Installation summary1314085
+Node: Notes1315442
+Node: Compatibility Mode1316236
+Node: Additions1317018
+Node: Accessing The Source1317943
+Node: Adding Code1319380
+Node: New Ports1326195
+Node: Derived Files1330570
+Ref: Derived Files-Footnote-11336230
+Ref: Derived Files-Footnote-21336265
+Ref: Derived Files-Footnote-31336863
+Node: Future Extensions1336977
+Node: Implementation Limitations1337635
+Node: Extension Design1338845
+Node: Old Extension Problems1339989
+Ref: Old Extension Problems-Footnote-11341507
+Node: Extension New Mechanism Goals1341564
+Ref: Extension New Mechanism Goals-Footnote-11344928
+Node: Extension Other Design Decisions1345117
+Node: Extension Future Growth1347230
+Node: Notes summary1347836
+Node: Basic Concepts1348994
+Node: Basic High Level1349675
+Ref: figure-general-flow1349957
+Ref: figure-process-flow1350643
+Ref: Basic High Level-Footnote-11353945
+Node: Basic Data Typing1354130
+Node: Glossary1357458
+Node: Copying1389345
+Node: GNU Free Documentation License1426888
+Node: Index1452008
End Tag Table
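A note on the large block of rewritten "Node:"/"Ref:" lines above: an Info file ends with a tag table that maps each node name to its byte offset in the file, so inserting text early in the file (as this commit does in the Persistent Memory section) shifts every later offset and makeinfo regenerates the whole table. In the real file, the node name and offset are separated by a DEL character (0x7f) that plain-text mail renders invisibly, which is why they appear fused together above. A small illustrative parser (the function name is made up):

```python
def parse_tag_table(lines):
    """Extract {node_name: byte_offset} from Info tag-table lines.

    makeinfo writes entries as 'Node: NAME\x7fOFFSET' (or 'Ref: ...'),
    where \x7f is the DEL separator that mail archives swallow.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        for prefix in ("Node: ", "Ref: "):
            if line.startswith(prefix):
                name, sep, offset = line[len(prefix):].rpartition("\x7f")
                if sep:  # skip lines with no separator (e.g. 'End Tag Table')
                    entries[name] = int(offset)
    return entries
```

Adding N bytes near the top of the file bumps every subsequent offset by N, which is exactly the wholesale `-Node:`/`+Node:` churn this diff shows.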
diff --git a/doc/gawk.texi b/doc/gawk.texi
index 82062c45..f4dee7d0 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -65,6 +65,7 @@
@set GAWKINETTITLE TCP/IP Internetworking with @command{gawk}
@set GAWKWORKFLOWTITLE Participating in @command{gawk} Development
+@set PMGAWKTITLE Persistent-Memory @command{gawk} User Manual
@ifset FOR_PRINT
@set TITLE Effective awk Programming
@end ifset
@@ -31166,7 +31167,14 @@ backing file will lead to strange results and/or core dumps.
may not do so in the future either.
@end quotation
-Here are articles and web links that provide more information about
+Terence Kelly has provided a separate @cite{@value{PMGAWKTITLE}}
+document, which is included in the @command{gawk}
+distribution. It is worth reading.
+@ifinfo
+@xref{Top, , General Introduction, pm-gawk, @value{PMGAWKTITLE}}.
+@end ifinfo
+
+Here are additional articles and web links that provide more information about
persistent memory and why it's useful in a scripting language like
@command{gawk}.
@@ -42756,6 +42764,27 @@ with @command{makeinfo} to produce an Info or HTML file.
The generated Info file for
@cite{@value{GAWKWORKFLOWTITLE}}.
+@item doc/pm-gawk.texi
+The Texinfo source file for
+@ifinfo
+@ref{Top, , General Introduction, pm-gawk, @value{PMGAWKTITLE}}.
+@end ifinfo
+@ifnotinfo
+@cite{@value{PMGAWKTITLE}}.
+@end ifnotinfo
+It should be processed with @TeX{}
+(via @command{texi2dvi} or @command{texi2pdf})
+to produce a printed document and
+with @command{makeinfo} to produce an Info or HTML file.
+
+@item doc/pm-gawk.info
+The generated Info file for
+@cite{@value{PMGAWKTITLE}}.
+
+@item doc/pm-gawk.1
+The @command{troff} source for a manual page describing the
+persistent memory features presented in @ref{Persistent Memory}.
+
@item doc/igawk.1
The @command{troff} source for a manual page describing the @command{igawk}
program presented in
@@ -42779,16 +42808,12 @@ the @file{Makefile.in} files used by Autoconf and
@item Makefile.in
@itemx aclocal.m4
-@itemx bisonfix.awk
-@itemx config.guess
+@itemx build-aux/*
@itemx configh.in
@itemx configure.ac
@itemx configure
@itemx custom.h
-@itemx depcomp
-@itemx install-sh
@itemx missing_d/*
-@itemx mkinstalldirs
@itemx m4/*
These files and subdirectories are used when configuring and compiling
@command{gawk} for various Unix systems. Most of them are explained
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index 6afa6649..c7214f59 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -60,6 +60,7 @@
@set GAWKINETTITLE TCP/IP Internetworking with @command{gawk}
@set GAWKWORKFLOWTITLE Participating in @command{gawk} Development
+@set PMGAWKTITLE Persistent-Memory @command{gawk} User Manual
@ifset FOR_PRINT
@set TITLE Effective awk Programming
@end ifset
@@ -30048,7 +30049,14 @@ backing file will lead to strange results and/or core dumps.
may not do so in the future either.
@end quotation
-Here are articles and web links that provide more information about
+Terence Kelly has provided a separate @cite{@value{PMGAWKTITLE}}
+document, which is included in the @command{gawk}
+distribution. It is worth reading.
+@ifinfo
+@xref{Top, , General Introduction, pm-gawk, @value{PMGAWKTITLE}}.
+@end ifinfo
+
+Here are additional articles and web links that provide more information about
persistent memory and why it's useful in a scripting language like
@command{gawk}.
@@ -41599,6 +41607,27 @@ with @command{makeinfo} to produce an Info or HTML file.
The generated Info file for
@cite{@value{GAWKWORKFLOWTITLE}}.
+@item doc/pm-gawk.texi
+The Texinfo source file for
+@ifinfo
+@ref{Top, , General Introduction, pm-gawk, @value{PMGAWKTITLE}}.
+@end ifinfo
+@ifnotinfo
+@cite{@value{PMGAWKTITLE}}.
+@end ifnotinfo
+It should be processed with @TeX{}
+(via @command{texi2dvi} or @command{texi2pdf})
+to produce a printed document and
+with @command{makeinfo} to produce an Info or HTML file.
+
+@item doc/pm-gawk.info
+The generated Info file for
+@cite{@value{PMGAWKTITLE}}.
+
+@item doc/pm-gawk.1
+The @command{troff} source for a manual page describing the
+persistent memory features presented in @ref{Persistent Memory}.
+
@item doc/igawk.1
The @command{troff} source for a manual page describing the @command{igawk}
program presented in
@@ -41622,16 +41651,12 @@ the @file{Makefile.in} files used by Autoconf and
@item Makefile.in
@itemx aclocal.m4
-@itemx bisonfix.awk
-@itemx config.guess
+@itemx build-aux/*
@itemx configh.in
@itemx configure.ac
@itemx configure
@itemx custom.h
-@itemx depcomp
-@itemx install-sh
@itemx missing_d/*
-@itemx mkinstalldirs
@itemx m4/*
These files and subdirectories are used when configuring and compiling
@command{gawk} for various Unix systems. Most of them are explained
diff --git a/doc/pm-gawk.info b/doc/pm-gawk.info
new file mode 100644
index 00000000..58440ec7
--- /dev/null
+++ b/doc/pm-gawk.info
@@ -0,0 +1,1142 @@
+This is pm-gawk.info, produced by makeinfo version 6.8 from
+pm-gawk.texi.
+
+Copyright (C) 2022 Terence Kelly
+<tpkelly@eecs.umich.edu>
+<tpkelly@cs.princeton.edu>
+<tpkelly@acm.org>
+<http://web.eecs.umich.edu/~tpkelly/pma/>
+<https://dl.acm.org/profile/81100523747>
+
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3 or
+any later version published by the Free Software Foundation; with the
+Invariant Sections being "Introduction" and "History", no Front-Cover
+Texts, and no Back-Cover Texts. A copy of the license is available at
+<https://www.gnu.org/licenses/fdl-1.3.html>
+INFO-DIR-SECTION Text creation and manipulation
+START-INFO-DIR-ENTRY
+* pm-gawk: (pm-gawk). Persistent memory version of gawk.
+END-INFO-DIR-ENTRY
+
+
+File: pm-gawk.info, Node: Top, Next: Introduction, Up: (dir)
+
+General Introduction
+********************
+
+'gawk' 5.2 introduces a _persistent memory_ feature that can "remember"
+script-defined variables and functions across executions; pass variables
+between unrelated scripts without serializing/parsing text files; and
+handle data sets larger than available memory plus swap. This
+supplementary manual provides an in-depth look at persistent-memory
+'gawk'.
+
+Copyright (C) 2022 Terence Kelly
+<tpkelly@eecs.umich.edu>
+<tpkelly@cs.princeton.edu>
+<tpkelly@acm.org>
+<http://web.eecs.umich.edu/~tpkelly/pma/>
+<https://dl.acm.org/profile/81100523747>
+
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3 or
+any later version published by the Free Software Foundation; with the
+Invariant Sections being "Introduction" and "History", no Front-Cover
+Texts, and no Back-Cover Texts. A copy of the license is available at
+<https://www.gnu.org/licenses/fdl-1.3.html>
+
+* Menu:
+
+* Introduction::
+* Quick Start::
+* Examples::
+* Performance::
+* Data Integrity::
+* Acknowledgments::
+* Installation::
+* Debugging::
+* History::
+
+
+File: pm-gawk.info, Node: Introduction, Next: Quick Start, Prev: Top, Up: Top
+
+1 Introduction
+**************
+
+
+GNU AWK ('gawk') 5.2, expected in August 2022, introduces a new
+_persistent memory_ feature that makes AWK scripting easier and
+sometimes improves performance. The new feature, called "pm-'gawk',"
+can "remember" script-defined variables and functions across executions
+and can pass variables and functions between unrelated scripts without
+serializing/parsing text files--all with near-zero fuss. pm-'gawk' does
+_not_ require non-volatile memory hardware nor any other exotic
+infrastructure; pm-'gawk' runs on the ordinary conventional computers
+and operating systems that most of us have been using for decades.
+
+
+The main 'gawk' documentation(1) covers the basics of the new
+persistence feature. This supplementary manual provides additional
+detail, tutorial examples, and a peek under the hood of pm-'gawk'. If
+you're familiar with 'gawk' and Unix-like environments, dive straight
+in:
+
+ * *note Quick Start:: hits the ground running with a few keystrokes
+ * *note Examples:: shows how pm-'gawk' streamlines typical AWK
+ scripting
+ * *note Performance:: covers asymptotic efficiency, OS tuning, and
+ more
+ * *note Data Integrity:: explains how to protect data from mishaps
+ * *note Acknowledgments:: thanks those who made pm-'gawk' happen
+ * *note Installation:: shows where to obtain pm-'gawk'
+ * *note Debugging:: explains how to handle suspected bugs
+ * *note History:: traces pm-'gawk''s persistence technology
+
+
+You can find the latest version of this manual, and also the "director's
+cut," at the web site for the persistent memory allocator used in
+pm-'gawk':
+ <http://web.eecs.umich.edu/~tpkelly/pma/>
+
+
+Two publications describe the persistent memory allocator and early
+experiences with a pm-'gawk' prototype based on a fork of the official
+'gawk' sources:
+<https://queue.acm.org/detail.cfm?id=3534855>
+<http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf>
+
+
+Feel free to send me questions, suggestions, and experiences:
+
+ <tpkelly@eecs.umich.edu> (preferred)
+ <tpkelly@cs.princeton.edu>
+ <tpkelly@acm.org>
+
+ ---------- Footnotes ----------
+
+ (1) <https://www.gnu.org/software/gawk/manual/> and 'man gawk'
+and 'info gawk'
+
+
+File: pm-gawk.info, Node: Quick Start, Next: Examples, Prev: Introduction, Up: Top
+
+2 Quick Start
+*************
+
+Here's pm-'gawk' in action at the 'bash' shell prompt ('$'):
+ $ truncate -s 4096000 heap.pma
+ $ export GAWK_PERSIST_FILE=heap.pma
+ $ gawk 'BEGIN{myvar = 47}'
+ $ gawk 'BEGIN{myvar += 7; print myvar}'
+ 54
+First, 'truncate' creates an empty (all-zero-bytes) "heap file" where
+pm-'gawk' will store script variables; its size is a multiple of the
+system page size (4 KiB). Next, 'export' sets an environment variable
+that enables pm-'gawk' to find the heap file; if 'gawk' does _not_ see
+this envar, persistence is not activated. The third command runs a
+one-line AWK script that initializes variable 'myvar', which will reside
+in the heap file and thereby outlive the interpreter process that
+initialized it. Finally, the fourth command invokes pm-'gawk' on a
+_different_ one-line script that increments and prints 'myvar'. The
+output shows that pm-'gawk' has indeed "remembered" 'myvar' across
+executions of unrelated scripts. (If the 'gawk' executable in your
+search '$PATH' lacks the persistence feature, the output in the above
+example will be '7' instead of '54'. *Note Installation::.) To disable
+persistence until you want it again, prevent 'gawk' from finding the
+heap file via 'unset GAWK_PERSIST_FILE'. To permanently "forget" script
+variables, delete the heap file.
+
+
+
+ Toggling persistence by 'export'-ing and 'unset'-ing "ambient" envars
+requires care: Forgetting to 'unset' when you no longer want persistence
+can cause confusing bugs. Fortunately, 'bash' allows you to pass envars
+more deliberately, on a per-command basis:
+ $ rm heap.pma # start fresh
+ $ unset GAWK_PERSIST_FILE # eliminate ambient envar
+ $ truncate -s 4096000 heap.pma # create new heap file
+
+ $ GAWK_PERSIST_FILE=heap.pma gawk 'BEGIN{myvar = 47}'
+ $ gawk 'BEGIN{myvar += 7; print myvar}'
+ 7
+ $ GAWK_PERSIST_FILE=heap.pma gawk 'BEGIN{myvar += 7; print myvar}'
+ 54
+The first 'gawk' invocation sees the special envar prepended on the
+command line, so it activates pm-'gawk'. The second 'gawk' invocation,
+however, does _not_ see the envar and therefore does not access the
+script variable stored in the heap file. The third 'gawk' invocation
+sees the special envar and therefore uses the script variable from the
+heap file.
+
+ While sometimes less error prone than ambient envars, per-command
+envar passing as shown above is verbose and shouty. A shell alias saves
+keystrokes and reduces visual clutter:
+ $ alias pm='GAWK_PERSIST_FILE=heap.pma'
+ $ pm gawk 'BEGIN{print ++myvar}'
+ 55
+ $ pm gawk 'BEGIN{print ++myvar}'
+ 56
+
+
+File: pm-gawk.info, Node: Examples, Next: Performance, Prev: Quick Start, Up: Top
+
+3 Examples
+**********
+
+Our first example uses pm-'gawk' to streamline analysis of a prose
+corpus, Mark Twain's 'Tom Sawyer' and 'Huckleberry Finn' from
+<https://gutenberg.org/files/74/74-0.txt> and
+<https://gutenberg.org/files/76/76-0.txt>. We first convert
+non-alphabetic characters to newlines (so each line has at most one
+word) and convert to lowercase:
+ $ tr -c a-zA-Z '\n' < 74-0.txt | tr A-Z a-z > sawyer.txt
+ $ tr -c a-zA-Z '\n' < 76-0.txt | tr A-Z a-z > finn.txt
+
+ It's easy to count word frequencies with AWK's associative arrays.
+pm-'gawk' makes these arrays persistent, so we need not re-ingest the
+entire corpus every time we ask a new question ("read once, analyze
+happily ever after"):
+ $ truncate -s 100M twain.pma
+ $ export GAWK_PERSIST_FILE=twain.pma
+ $ gawk '{ts[$1]++}' sawyer.txt # ingest
+ $ gawk 'BEGIN{print ts["work"], ts["play"]}' # query
+ 92 11
+ $ gawk 'BEGIN{print ts["necktie"], ts["knife"]}' # query
+ 2 27
+The 'truncate' command above creates a heap file large enough to store
+all of the data it must eventually contain, with plenty of room to spare
+(as we'll see in *note Sparse Heap Files::, this isn't wasteful). The
+'export' command ensures that subsequent 'gawk' invocations activate
+pm-'gawk'. The first pm-'gawk' command stores 'Tom Sawyer''s word
+frequencies in associative array 'ts[]'. Because this array is
+persistent, subsequent pm-'gawk' commands can access it without having
+to parse the input file again.
+
+ Expanding our analysis to encompass a second book is easy. Let's
+populate a new associative array 'hf[]' with the word frequencies in
+'Huckleberry Finn':
+ $ gawk '{hf[$1]++}' finn.txt
+Now we can freely intermix accesses to both books' data conveniently and
+efficiently, without the overhead and coding fuss of repeated input
+parsing:
+ $ gawk 'BEGIN{print ts["river"], hf["river"]}'
+ 26 142
+
+ By making AWK more interactive, pm-'gawk' invites casual
+conversations with data. If we're curious what words in 'Finn' are
+absent from 'Sawyer', answers (including "flapdoodle," "yellocution,"
+and "sockdolager") are easy to find:
+ $ gawk 'BEGIN{for(w in hf) if (!(w in ts)) print w}'
+
+ Rumors of Twain's death may be exaggerated. If he publishes new
+books in the future, it will be easy to incorporate them into our
+analysis incrementally. The performance benefits of incremental
+processing for common AWK chores such as log file analysis are discussed
+in <https://queue.acm.org/detail.cfm?id=3534855> and the companion paper
+cited therein, and below in *note Performance::.
+
+ Exercise: The "Markov" AWK script on page 79 of Kernighan & Pike's
+'The Practice of Programming' generates random text reminiscent of a
+given corpus using a simple statistical modeling technique. This script
+consists of a "learning" or "training" phase followed by an
+output-generation phase. Use pm-'gawk' to de-couple the two phases and
+to allow the statistical model to incrementally ingest additions to the
+input corpus.
+
+ Our second example considers another domain that plays to AWK's
+strengths, data analysis. For simplicity we'll create two small input
+files of numeric data.
+ $ printf '1\n2\n3\n4\n5\n' > A.dat
+ $ printf '5\n6\n7\n8\n9\n' > B.dat
+A conventional _non_-persistent AWK script can compute basic summary
+statistics:
+ $ cat summary_conventional.awk
+ 1 == NR { min = max = $1 }
+ min > $1 { min = $1 }
+ max < $1 { max = $1 }
+ { sum += $1 }
+ END { print "min: " min " max: " max " mean: " sum/NR }
+
+ $ gawk -f summary_conventional.awk A.dat B.dat
+ min: 1 max: 9 mean: 5
+
+ To use pm-'gawk' for the same purpose, we first create a heap file
+for our AWK script variables and tell pm-'gawk' where to find it via the
+usual environment variable:
+ $ truncate -s 10M stats.pma
+ $ export GAWK_PERSIST_FILE=stats.pma
+pm-'gawk' requires changing the above script to ensure that 'min' and
+'max' are initialized exactly once, when the heap file is first used,
+and _not_ every time the script runs. Furthermore, whereas
+script-defined variables such as 'min' retain their values across
+pm-'gawk' executions, built-in AWK variables such as 'NR' are re-set to
+zero every time pm-'gawk' runs, so we can't use them in the same way.
+Here's a modified script for pm-'gawk':
+ $ cat summary_persistent.awk
+ ! init { min = max = $1; init = 1 }
+ min > $1 { min = $1 }
+ max < $1 { max = $1 }
+ { sum += $1; ++n }
+ END { print "min: " min " max: " max " mean: " sum/n }
+Note the different pattern on the first line and the introduction of 'n'
+to supplant 'NR'. When used with pm-'gawk', this new initialization
+logic supports the same kind of cumulative processing that we saw in the
+text-analysis scenario. For example, we can ingest our input files
+separately:
+ $ gawk -f summary_persistent.awk A.dat
+ min: 1 max: 5 mean: 3
+
+ $ gawk -f summary_persistent.awk B.dat
+ min: 1 max: 9 mean: 5
+As expected, after the second pm-'gawk' invocation consumes the second
+input file, the output matches that of the non-persistent script that
+read both files at once.
+
+ Exercise: Amend the AWK scripts above to compute the median and
+mode(s) using both conventional 'gawk' and pm-'gawk'. (The median is
+the number in the middle of a sorted list; if the length of the list is
+even, average the two numbers at the middle. The modes are the values
+that occur most frequently.)
+
+ Our third and final set of examples shows that pm-'gawk' allows us to
+bundle both script-defined data and also user-defined _functions_ in a
+persistent heap that may be passed freely between unrelated AWK scripts.
+
+ The following shell transcript repeatedly invokes pm-'gawk' to create
+and then employ a user-defined function. These separate invocations
+involve several different AWK scripts that communicate via the heap
+file. Each invocation can add user-defined functions and add or remove
+data from the heap that subsequent invocations will access.
+ $ truncate -s 10M funcs.pma
+ $ export GAWK_PERSIST_FILE=funcs.pma
+ $ gawk 'function count(A,t) {for(i in A)t++; return ""==t?0:t}'
+ $ gawk 'BEGIN { a["x"] = 4; a["y"] = 5; a["z"] = 6 }'
+ $ gawk 'BEGIN { print count(a) }'
+ 3
+ $ gawk 'BEGIN { delete a["x"] }'
+ $ gawk 'BEGIN { print count(a) }'
+ 2
+ $ gawk 'BEGIN { delete a }'
+ $ gawk 'BEGIN { print count(a) }'
+ 0
+ $ gawk 'BEGIN { for (i=0; i<47; i++) a[i]=i }'
+ $ gawk 'BEGIN { print count(a) }'
+ 47
+The first pm-'gawk' command creates user-defined function 'count()',
+which returns the number of entries in a given associative array; note
+that variable 't' is local to 'count()', not global. The next pm-'gawk'
+command populates a persistent associative array with three entries; not
+surprisingly, the 'count()' call in the following pm-'gawk' command
+finds these three entries. The next two pm-'gawk' commands respectively
+delete an array entry and print the reduced count, 2. The two commands
+after that delete the entire array and print a count of zero. Finally,
+the last two pm-'gawk' commands populate the array with 47 entries and
+count them.
+
+ The following shell script invokes pm-'gawk' repeatedly to create a
+collection of user-defined functions that perform basic operations on
+quadratic polynomials: evaluation at a given point, computing the
+discriminant, and using the quadratic formula to find the roots. It
+then factorizes x^2 + x - 12 into (x - 3)(x + 4).
+ #!/bin/sh
+ rm -f poly.pma
+ truncate -s 10M poly.pma
+ export GAWK_PERSIST_FILE=poly.pma
+ gawk 'function q(x) { return a*x^2 + b*x + c }'
+ gawk 'function p(x) { return "q(" x ") = " q(x) }'
+ gawk 'BEGIN { print p(2) }' # evaluate & print
+ gawk 'BEGIN{ a = 1; b = 1; c = -12 }' # new coefficients
+ gawk 'BEGIN { print p(2) }' # eval/print again
+ gawk 'function d(s) { return s * sqrt(b^2 - 4*a*c)}'
+ gawk 'BEGIN{ print "discriminant (must be >=0): " d(1)}'
+ gawk 'function r(s) { return (-b + d(s))/(2*a)}'
+ gawk 'BEGIN{ print "root: " r( 1) " " p(r( 1)) }'
+ gawk 'BEGIN{ print "root: " r(-1) " " p(r(-1)) }'
+ gawk 'function abs(n) { return n >= 0 ? n : -n }'
+ gawk 'function sgn(x) { return x >= 0 ? "- " : "+ " } '
+ gawk 'function f(s) { return "(x " sgn(r(s)) abs(r(s))}'
+ gawk 'BEGIN{ print "factor: " f( 1) ")" }'
+ gawk 'BEGIN{ print "factor: " f(-1) ")" }'
+ rm -f poly.pma
+
+
+File: pm-gawk.info, Node: Performance, Next: Data Integrity, Prev: Examples, Up: Top
+
+4 Performance
+*************
+
+This chapter explains several performance advantages that result from
+the implementation of persistent memory in pm-'gawk', shows how tuning
+the underlying operating system sometimes improves performance, and
+presents experimental performance measurements. To make the discussion
+concrete, we use examples from a "GNU/Linux" system--GNU utilities atop
+the Linux OS--but the principles apply to other modern operating
+systems.
+
+* Menu:
+
+* Constant-Time Array Access::
+* Virtual Memory and Big Data::
+* Sparse Heap Files::
+* Persistence versus Durability::
+* Experiments::
+* Results::
+
+
+File: pm-gawk.info, Node: Constant-Time Array Access, Next: Virtual Memory and Big Data, Up: Performance
+
+4.1 Constant-Time Array Access
+==============================
+
+pm-'gawk' preserves the efficiency of data access when data structures
+are created by one process and later re-used by a different process.
+
+ Consider the associative arrays used to analyze Mark Twain's books in
+*note Examples::. We created arrays 'ts[]' and 'hf[]' by reading files
+'sawyer.txt' and 'finn.txt'. If N denotes the total volume of data in
+these files, building the associative arrays typically requires time
+proportional to N, or "O(N) expected time" in the lingo of asymptotic
+analysis. If W is the number of unique words in the input files, the
+size of the associative arrays will be proportional to W, or O(W).
+Accessing individual array elements requires only _constant_ or O(1)
+expected time, not O(N) or O(W) time, because 'gawk' implements arrays
+as hash tables.
+
+ The performance advantage of pm-'gawk' arises when different
+processes create and access associative arrays. Accessing an element of
+a persistent array created by a previous pm-'gawk' process, as we did
+earlier in 'BEGIN{print ts["river"], hf["river"]}', still requires only
+O(1) time, which is asymptotically far superior to the alternatives.
+Naïvely reconstructing arrays by re-ingesting all raw inputs in every
+'gawk' process that accesses the arrays would of course require O(N)
+time--a profligate cost if the text corpus is large. Dumping arrays to
+files and re-loading them as needed would reduce the preparation time
+for access to O(W). That can be a substantial improvement in practice; N
+is roughly 19 times larger than W in our Twain corpus. Nonetheless O(W)
+remains far slower than pm-'gawk''s O(1). As we'll see in *note
+Results::, the difference is not merely theoretical.
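
A minimal sketch of the dump/reload alternative discussed above, using
plain (non-persistent) awk; the file names 'corpus.txt' and 'freq.dump'
are illustrative, not taken from the manual's examples:

```shell
# Ingest once: an O(N) pass over the corpus builds the frequency table
# and dumps it as "word count" lines.
printf 'a b a c a b\n' | tr ' ' '\n' > corpus.txt
awk '{f[$1]++} END {for (w in f) print w, f[w]}' corpus.txt > freq.dump

# Every later query must reload the dump: O(W) work before the O(1) lookup.
awk '{f[$1] = $2} END {print f["a"]}' freq.dump   # prints 3
```

The reload pass is precisely what pm-'gawk' eliminates: its persistent
heap makes the array available again without any per-query scan of a
dump file.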
+
+ The persistent memory implementation beneath pm-'gawk' enables it to
+avoid work proportional to N or W when accessing an element of a
+persistent associative array. Under the hood, pm-'gawk' stores
+script-defined AWK variables such as associative arrays in a persistent
+heap laid out in a memory-mapped file (the heap file). When an AWK
+script accesses an element of an associative array, pm-'gawk' performs a
+lookup on the corresponding hash table, which in turn accesses memory on
+the persistent heap. Modern operating systems implement memory-mapped
+files in such a way that these memory accesses trigger the bare minimum
+of data movement required: Only those parts of the heap file containing
+needed data are "paged in" to the memory of the pm-'gawk' process. In
+the worst case, the heap file is not in the file system's in-memory
+cache, so the required pages must be faulted into memory from storage.
+Our asymptotic analysis of efficiency applies regardless of whether the
+heap file is cached or not. The entire heap file is _not_ accessed
+merely to access an element of a persistent associative array.
+
+ Persistent memory thus enables pm-'gawk' to offer the flexibility of
+de-coupling data ingestion from analytic queries without the fuss and
+overhead of serializing and loading data structures and without
+sacrificing constant-time access to the associative arrays that make AWK
+scripting convenient and productive.
+
+
+File: pm-gawk.info, Node: Virtual Memory and Big Data, Next: Sparse Heap Files, Prev: Constant-Time Array Access, Up: Performance
+
+4.2 Virtual Memory and Big Data
+===============================
+
+Small data sets seldom spoil the delights of AWK by causing performance
+troubles, with or without persistence. As the size of the 'gawk'
+interpreter's internal data structures approaches the capacity of
+physical memory, however, acceptable performance requires understanding
+modern operating systems and sometimes tuning them. Fortunately
+pm-'gawk' offers new degrees of control for performance-conscious users
+tackling large data sets. A terse mnemonic captures the basic
+principle: Precluding paging promotes peak performance and prevents
+perplexity.
+
+ Modern operating systems feature "virtual memory" that strives to
+appear both larger than installed DRAM (which is small) and faster than
+installed storage devices (which are slow). As a program's memory
+footprint approaches the capacity of DRAM, the virtual memory system
+transparently "pages" (moves) the program's data between DRAM and a
+"swap area" on a storage device. Paging can degrade performance mildly
+or severely, depending on the program's memory access patterns. Random
+accesses to large data structures can trigger excessive paging and
+dramatic slowdown. Unfortunately, the hash tables beneath AWK's
+signature associative arrays inherently require random memory accesses,
+so large associative arrays can be problematic.
+
+ Persistence changes the rules in our favor: The OS pages data to
+pm-'gawk''s _heap file_ instead of the swap area. This won't help
+performance much if the heap file resides in a conventional
+storage-backed file system. On Unix-like systems, however, we may place
+the heap file in a DRAM-backed file system such as '/dev/shm/', which
+entirely prevents paging to slow storage devices. Temporarily placing
+the heap file in such a file system is a reasonable expedient, with two
+caveats: First, keep in mind that DRAM-backed file systems perish when
+the machine reboots or crashes, so you must copy the heap file to a
+conventional storage-backed file system when your computation is done.
+Second, pm-'gawk''s memory footprint can't exceed available DRAM if you
+place the heap file in a DRAM-backed file system.
+
+ Tuning OS paging parameters may improve performance if you decide to
+run pm-'gawk' with a heap file in a conventional storage-backed file
+system. Some OSes have unhelpful default habits regarding modified
+("dirty") memory backed by files. For example, the OS may write dirty
+memory pages to the heap file periodically and/or when the OS believes
+that "too much" memory is dirty. Such "eager" writeback can degrade
+performance noticeably and brings no benefit to pm-'gawk'. Fortunately
+some OSes allow paging defaults to be overridden so that writeback is
+"lazy" rather than eager. For Linux see the discussion of the 'dirty_*'
+parameters at
+<https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html>.
+Changing these parameters can prevent wasteful eager paging:(1)
+ $ echo 100 | sudo tee /proc/sys/vm/dirty_background_ratio
+ $ echo 100 | sudo tee /proc/sys/vm/dirty_ratio
+ $ echo 300000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
+ $ echo 50000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
+Tuning paging parameters can help non-persistent 'gawk' as well as
+pm-'gawk'. [Disclaimer: OS tuning is an occult art, and your mileage
+may vary.]
+
+ ---------- Footnotes ----------
+
+ (1) The 'tee' rigmarole is explained at
+<https://askubuntu.com/questions/1098059/which-is-the-right-way-to-drop-caches-in-lubuntu>.
+
+
+File: pm-gawk.info, Node: Sparse Heap Files, Next: Persistence versus Durability, Prev: Virtual Memory and Big Data, Up: Performance
+
+4.3 Sparse Heap Files
+=====================
+
+To be frugal with storage resources, pm-'gawk''s heap file should be
+created as a "sparse file": a file whose logical size is larger than its
+storage resource footprint. Modern file systems support sparse files,
+which are easy to create using the 'truncate' tool shown in our
+examples.
+
+ Let's first create a conventional _non_-sparse file using 'echo':
+ $ echo hi > dense
+ $ ls -l dense
+ -rw-rw-r--. 1 me me 3 Aug 5 23:08 dense
+ $ du -h dense
+ 4.0K dense
+The 'ls' utility reports that file 'dense' is three bytes long (two for
+the letters in "hi" plus one for the newline). The 'du' utility reports
+that this file consumes 4 KiB of storage--one block of disk, as small as
+a non-sparse file's storage footprint can be. Now let's use 'truncate'
+to create a logically enormous sparse file and check its physical size:
+ $ truncate -s 1T sparse
+ $ ls -l sparse
+ -rw-rw-r--. 1 me me 1099511627776 Aug 5 22:33 sparse
+ $ du -h sparse
+ 0 sparse
+Whereas 'ls' reports the logical file size that we expect (one TiB or 2
+raised to the power 40 bytes), 'du' reveals that the file occupies no
+storage whatsoever. The file system will allocate physical storage
+resources beneath this file as data is written to it; reading unwritten
+regions of the file yields zeros.
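
Both properties are easy to check directly: the logical size is
recorded without writing any data blocks, and unwritten regions read
back as zero bytes. The file name 'demo.pma' below is illustrative:

```shell
# Create a 1 MiB sparse file; no data blocks are allocated yet.
truncate -s 1M demo.pma
ls -l demo.pma                      # logical size: 1048576 bytes
du -h demo.pma                      # physical footprint: typically 0
# Unwritten regions read back as zero bytes:
head -c 16 demo.pma | od -An -tx1   # sixteen 00 bytes
```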
+
+ The "pay as you go" storage cost of sparse files offers both
+convenience and control for pm-'gawk' users. If your file system
+supports sparse files, go ahead and create lavishly capacious heap files
+for pm-'gawk'. Their logical size costs nothing and persistent memory
+allocation within pm-'gawk' won't fail until physical storage resources
+beneath the file system are exhausted. But if instead you want to
+_prevent_ a heap file from consuming too much storage, simply set its
+initial size to whatever bound you wish to enforce; it won't eat more
+disk than that. Copying sparse files with GNU 'cp' creates sparse
+copies by default.
+
+ File-system encryption can preclude sparse files: If the cleartext of
+a byte offset range within a file is all zero bytes, the corresponding
+ciphertext probably shouldn't be all zeros! Encrypting at the storage
+layer instead of the file system layer may offer acceptable security
+while still permitting file systems to implement sparse files.
+
+ Sometimes you might prefer a dense heap file backed by pre-allocated
+storage resources, for example to increase the likelihood that
+pm-'gawk''s internal memory allocation will succeed until the persistent
+heap occupies the entire heap file. The 'fallocate' utility will do the
+trick:
+ $ fallocate -l 1M mibi
+ $ ls -l mibi
+ -rw-rw-r--. 1 me me 1048576 Aug 5 23:18 mibi
+ $ du -h mibi
+ 1.0M mibi
+We get the MiB we asked for, both logically and physically.
+
+
+File: pm-gawk.info, Node: Persistence versus Durability, Next: Experiments, Prev: Sparse Heap Files, Up: Performance
+
+4.4 Persistence versus Durability
+=================================
+
+Arguably the most important general guideline for good performance in
+computer systems is, "pay only for what you need."(1) To apply this
+maxim to pm-'gawk' we must distinguish two concepts that are frequently
+conflated: persistence and durability.(2) (A third logically distinct
+concept is the subject of *note Data Integrity::.)
+
+ "Persistent" data outlive the processes that access them, but don't
+necessarily last forever. For example, as explained in 'man
+mq_overview', message queues are persistent because they exist until the
+system shuts down. "Durable" data reside on a physical medium that
+retains its contents even without continuously supplied power. For
+example, hard disk drives and solid state drives store durable data.
+Confusion arises because persistence and durability are often
+correlated: Data in ordinary file systems backed by HDDs or SSDs are
+typically both persistent and durable. Familiarity with 'fsync()' and
+'msync()' might lead us to believe that durability is a subset of
+persistence, but in fact the two characteristics are orthogonal: Data in
+the swap area are durable but not persistent; data in DRAM-backed file
+systems such as '/dev/shm/' are persistent but not durable.
+
+ Durability often costs more than persistence, so
+performance-conscious pm-'gawk' users pay the added premium for
+durability only when persistence alone is not sufficient. Two ways to
+avoid unwanted durability overheads were discussed in *note Virtual
+Memory and Big Data::: Place pm-'gawk''s heap file in a DRAM-backed file
+system, or disable eager writeback to the heap file. Expedients such as
+these enable you to remove durability overheads from the critical path
+of multi-stage data analyses even when you want heap files to eventually
+be durable: Allow pm-'gawk' to run at full speed with persistence alone;
+force the heap file to durability (using the 'cp' and 'sync' utilities
+as necessary) after output has been emitted to the next stage of the
+analysis and the pm-'gawk' process using the heap has terminated.
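
A hedged sketch of that final durability step, run after the pm-'gawk'
process using the heap has exited; the paths are illustrative:

```shell
# Copy the heap file out of the DRAM-backed file system, then force
# the copy to durable storage (GNU coreutils 'sync' accepts a file
# argument since version 8.24).
cp /dev/shm/heap.pma /var/data/heap.pma
sync /var/data/heap.pma
```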
+
+ Experimenting with synthetic data builds intuition for how
+persistence and durability affect performance. You can write a little
+AWK or C program to generate a stream of random text, or just cobble
+together a quick and dirty generator on the command line:
+ $ openssl rand --base64 1000000 | tr -c a-zA-Z '\n' > random.dat
+Varying the size of random inputs can, for example, find where
+performance "falls off the cliff" as pm-'gawk''s memory footprint
+exceeds the capacity of DRAM and paging begins.
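
Such a generator can also be sketched in awk itself; 'seed' and
'nwords' below are illustrative parameters, and the output is
deterministic for a fixed seed under a given awk implementation
(different implementations use different random sequences):

```shell
# Emit nwords pseudo-random lowercase words, one per line.
awk -v seed=42 -v nwords=100000 'BEGIN {
    srand(seed)
    for (w = 1; w <= nwords; w++) {
        len = 3 + int(6 * rand())       # word length between 3 and 8
        word = ""
        for (i = 1; i <= len; i++)      # append a random lowercase letter
            word = word sprintf("%c", 97 + int(26 * rand()))
        print word
    }
}' > random.dat
```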
+
+ Experiments require careful methodology, especially when the heap
+file is in a storage-backed file system. Overlooking the file system's
+DRAM cache can easily misguide interpretation of results and foil
+repeatability. Fortunately Linux allows us to invalidate the file
+system cache and thus mimic a "cold start" condition resembling the
+immediate aftermath of a machine reboot. Accesses to ordinary files on
+durable storage will then be served from the storage devices, not from
+cache. Read about 'sync' and '/proc/sys/vm/drop_caches' at
+<https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html>.
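
The cold-start recipe from that kernel documentation is a two-step
sequence; treat this as a configuration fragment (it requires root)
rather than something to run casually:

```shell
sync                                         # flush dirty data to storage first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # 3 = page cache plus dentries and inodes
```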
+
+ ---------- Footnotes ----------
+
+ (1) Remarkably, this guideline is widely ignored in surprising ways.
+Certain well-known textbook algorithms continue to grind away
+fruitlessly long after having computed all of their output.
+<https://queue.acm.org/detail.cfm?id=3424304>
+
+ (2) In recent years the term "persistent memory" has sometimes been
+used to denote novel byte-addressable non-volatile memory hardware--an
+unfortunate practice that contradicts sensible long-standing convention
+and causes needless confusion. NVM provides durability. Persistent
+memory is a software abstraction that doesn't require NVM.
+<https://queue.acm.org/detail.cfm?id=3358957>
+
+
+File: pm-gawk.info, Node: Experiments, Next: Results, Prev: Persistence versus Durability, Up: Performance
+
+4.5 Experiments
+===============
+
+The C-shell ('csh') script listed below illustrates concepts and
+implements tips presented in this chapter. It produced the results
+discussed in *note Results:: in roughly 20 minutes on an aging laptop.
+You should be able to cut and paste the code listing below into a file
+and get it running without excessive fuss; write to me if you have
+difficulty.
+
+ The script measures the performance of four different ways to support
+word frequency queries over a text corpus: The naive approach of reading
+the corpus into an associative array for every query; manually dumping a
+text representation of the word-frequency table and manually loading it
+prior to a query; using 'gawk''s 'rwarray' extension to dump and load an
+associative array; and using pm-'gawk' to maintain a persistent
+associative array.
+
+ Comments at the top explain prerequisites. Lines 8-10 set input
+parameters: the directory where tests are run and where files including
+the heap file are held, the off-the-shelf timer used to measure run
+times and other performance characteristics such as peak memory usage,
+and the size of the input. The default input size results in pm-'gawk'
+memory footprints under 3 GiB, which is large enough for interesting
+results and small enough to fit in DRAM and avoid paging on today's
+computers. Lines 13-14 define a homebrew timer.
+
+ Two sections of the script are relevant if the default run directory
+is changed from '/dev/shm/' to a directory in a conventional
+storage-backed file system: Lines 15-17 define the mechanism for
+clearing file data cached in DRAM; lines 23-30 set Linux kernel
+parameters to discourage eager paging.
+
+ Lines 37-70 spit out, compile, and run a little C program to generate
+a random text corpus. This program is fast, flexible, and
+deterministic, generating the same random output given the same
+parameters.
+
+ Lines 71-100 run the four different AWK approaches on the same random
+input, reporting separately the time to build and to query the
+associative array containing word frequencies.
+
+
+#!/bin/csh -f                                                              # 1
+# Set PMG envar to path of pm-gawk executable and AWKLIBPATH               # 2
+# to find rwarray.so                                                       # 3
+# Requires "sudo" to work; consider this for /etc/sudoers file:            # 4
+# Defaults:youruserid !authenticate                                        # 5
+echo 'begin: ' `date` `date +%s`                                           # 6
+unsetenv GAWK_PERSIST_FILE # disable persistence until wanted              # 7
+set dir = '/dev/shm' # where heap file et al. will live                    # 8
+set tmr = '/usr/bin/time' # can also use shell built-in "time"             # 9
+set isz = 1073741824 # input size; 1 GiB                                   # 10
+# set isz = 100000000 # small input for quick testing                      # 11
+cd $dir # tick/tock/tyme below are homebrew timer, good within ~2ms        # 12
+alias tick 'set t1 = `date +%s.%N`' ; alias tock 'set t2 = `date +%s.%N`'  # 13
+alias tyme '$PMG -v t1=$t1 -v t2=$t2 "BEGIN{print t2-t1}"'                 # 14
+alias tsync 'tick ; sync ; tock ; echo "sync time: " `tyme`'               # 15
+alias drop_caches 'echo 3 | sudo tee /proc/sys/vm/drop_caches'             # 16
+alias snd 'tsync; drop_caches'                                             # 17
+echo "pm-gawk: $PMG" ; echo 'std gawk: ' `which gawk`                      # 18
+echo "run dir: $dir" ; echo 'pwd: ' `pwd`                                  # 19
+echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}'                  # 20
+echo 'timer: ' $tmr ; echo 'AWKLIBPATH: ' $AWKLIBPATH                      # 21
+echo 'OS params:' ; set vm = '/proc/sys/vm/dirty'                          # 22
+sudo sh -c "echo 100 > ${vm}_background_ratio" # restore these             # 23
+sudo sh -c "echo 100 > ${vm}_ratio" # paging params                        # 24
+sudo sh -c "echo 1080000 > ${vm}_expire_centisecs" # to defaults           # 25
+sudo sh -c "echo 1080000 > ${vm}_writeback_centisecs" # if necessary       # 26
+foreach d ( ${vm}_background_ratio ${vm}_ratio \                           # 27
+            ${vm}_expire_centisecs ${vm}_writeback_centisecs )             # 28
+    printf " %-38s %7d\n" $d `cat $d`                                      # 29
+end                                                                        # 30
+tick ; tock ; echo 'timr ovrhd: ' `tyme` 's (around 2ms for TK)'           # 31
+tick ; $PMG 'BEGIN{print "pm-gawk? yes"}'                                  # 32
+tock ; echo 'pmg ovrhd: ' `tyme` 's (around 4-5 ms for TK)'                # 33
+set inp = 'input.dat'                                                      # 34
+echo 'input size ' $isz                                                    # 35
+echo "input file: $inp"                                                    # 36
+set rg = rgen # spit out and compile C program to generate random inputs   # 37
+rm -f $inp $rg.c $rg                                                       # 38
+cat <<EOF > $rg.c                                                          # 39
+// generate N random words, one per line, no blank lines                   # 40
+// charset is e.g. 'abcdefg@' where '@' becomes newline                    # 41
+#include <stdio.h>                                                         # 42
+#include <stdlib.h>                                                        # 43
+#include <string.h>                                                        # 44
+#define RCH c = a[rand() % L];                                             # 45
+#define PICK do { RCH } while (0)                                          # 46
+#define PICKCH do { RCH } while (c == '@')                                 # 47
+#define FP(...) fprintf(stderr, __VA_ARGS__)                               # 48
+int main(int argc, char *argv[]) {                                         # 49
+    if (4 != argc) {                                                       # 50
+        FP("usage: %s charset N seed\n",                                   # 51
+           argv[0]); return 1; }                                           # 52
+    char c, *a = argv[1]; size_t L = strlen(a);                            # 53
+    long int N = atol(argv[2]);                                            # 54
+    srand( atol(argv[3]));                                                 # 55
+    if (2 > N) { FP("N == %ld < 2\n", N); return 2; }                      # 56
+    PICKCH;                                                                # 57
+    for (;;) {                                                             # 58
+        if (2 == N) { PICKCH; putchar(c); putchar('\n'); break; }          # 59
+        if ('@' == c) { putchar('\n'); PICKCH; }                           # 60
+        else { putchar( c ); PICK; }                                       # 61
+        if (0 >= --N) break;                                               # 62
+    }                                                                      # 63
+}                                                                          # 64
+EOF                                                                        # 65
+gcc -std=c11 -Wall -Wextra -O3 -o $rg $rg.c                                # 66
+set t = '@@@@@@@' ; set c = "abcdefghijklmnopqrstuvwxyz$t$t$t$t$t$t"       # 67
+tick ; ./$rg "$c" $isz 47 > $inp ; tock ; echo 'gen time: ' `tyme`         # 68
+echo "input file: $inp"                                                    # 69
+echo 'input wc: ' `wc < $inp` ; echo 'input uniq: ' `sort -u $inp | wc`    # 70
+snd ############################################################################  # 71
+tick ; $tmr $PMG '{n[$1]++}END{print "output: " n["foo"]}' $inp            # 72
+tock ; echo 'T naive O(N): ' `tyme` ; echo ''                              # 73
+rm -f rwa                                                                  # 74
+snd ############################################################################  # 75
+echo ''                                                                    # 76
+tick ; $tmr $PMG -l rwarray '{n[$1]++}END{print "writea",writea("rwa",n)}' $inp  # 77
+tock ; echo 'T rwarray build O(N): ' `tyme` ; echo ''                      # 78
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 79
+tick ; $tmr $PMG -l rwarray 'BEGIN{print "reada",reada("rwa",n); \         # 80
+       print "output: " n["foo"]}'                                         # 81
+tock ; echo 'T rwarray query O(W): ' `tyme` ; echo ''                      # 82
+rm -f ft                                                                   # 83
+snd ############################################################################  # 84
+tick ; $tmr $PMG '{n[$1]++}END{for(w in n)print n[w], w}' $inp > ft        # 85
+tock ; echo 'T freqtbl build O(N): ' `tyme` ; echo ''                      # 86
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 87
+tick ; $tmr $PMG '{n[$2] = $1}END{print "output: " n["foo"]}' ft           # 88
+tock ; echo 'T freqtbl query O(W): ' `tyme` ; echo ''                      # 89
+rm -f heap.pma                                                             # 90
+snd ############################################################################  # 91
+truncate -s 3G heap.pma # enlarge if needed                                # 92
+setenv GAWK_PERSIST_FILE heap.pma                                          # 93
+tick ; $tmr $PMG '{n[$1]++}' $inp                                          # 94
+tock ; echo 'T pm-gawk build O(N): ' `tyme` ; echo ''                      # 95
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 96
+tick ; $tmr $PMG 'BEGIN{print "output: " n["foo"]}'                        # 97
+tock ; echo 'T pm-gawk query O(1): ' `tyme` ; echo ''                      # 98
+unsetenv GAWK_PERSIST_FILE                                                 # 99
+snd ############################################################################  # 100
+echo 'Note: all output lines above should be identical' ; echo ''          # 101
+echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}'                  # 102
+echo '' ; echo 'storage footprints:'                                       # 103
+foreach f ( rwa ft heap.pma ) # compression is very slow, so we comment it out  # 104
+    echo " $f " `du -BK $dir/$f` # `xz --best < $dir/$f | wc -c` 'bytes xz'  # 105
+end                                                                        # 106
+echo '' ; echo 'end: ' `date` `date +%s` ; echo ''                         # 107
# 107
+
+
+File: pm-gawk.info, Node: Results, Prev: Experiments, Up: Performance
+
+4.6 Results
+===========
+
+Running the script of *note Experiments:: with default parameters on an
+aging laptop yielded the results summarized in the table below. More
+extensive experiments, not reported here, yield qualitatively similar
+results. Keep in mind that performance measurements are often sensitive
+to seemingly irrelevant factors. For example, the program that runs
+first may have the advantage of a cooler CPU; later contestants may
+start with a hot CPU and consequent clock throttling by a modern
+processor's thermal regulation apparatus. Very generally, performance
+measurement is a notoriously tricky business. For scripting, whose main
+motive is convenience rather than speed, the proper role for performance
+measurements is to qualitatively test hypotheses such as those that
+follow from asymptotic analyses and to provide a rough idea of when
+various approaches are practical.
+
+
+                              run time      peak memory     intermediate
+     AWK script                 (sec)     footprint (K)      storage (K)
+
+     naive         O(N)       242.132        2,081,360            n/a
+     rwarray build O(N)       250.288        2,846,868        156,832
+     rwarray query O(W)        11.653        2,081,444
+     freqtbl build O(N)       288.408        2,400,120         69,112
+     freqtbl query O(W)        11.624        2,336,616
+     pm-gawk build O(N)       251.946        2,079,520      2,076,608
+     pm-gawk query O(1)         0.026            3,252
+
+
+ The results are consistent with the asymptotic analysis of *note
+Constant-Time Array Access::. All four approaches require roughly four
+minutes to read the synthetic input data. The naïve approach must do
+this every time it performs a query, but the other three build an
+associative array to support queries and separately serve such queries.
+The 'freqtbl' and 'rwarray' approaches build an associative array of
+word frequencies, serialize it to an intermediate file, and then read
+the entire intermediate file prior to serving queries; the former does
+this manually and the latter uses a 'gawk' extension. Both can serve
+queries in roughly ten seconds, not four minutes. As we'd expect from
+the asymptotic analysis, performing work proportional to the number of
+words is preferable to work proportional to the size of the raw input
+corpus: O(W) time is faster than O(N). And as we'd expect, pm-'gawk''s
+constant-time queries are faster still, by roughly two orders of
+magnitude. For the computations considered here, pm-'gawk' makes the
+difference between blink-of-an-eye interactive queries and response
+times long enough for the user's mind to wander.
+
+ Whereas 'freqtbl' and 'rwarray' reconstruct an associative array
+prior to accessing an individual element, pm-'gawk' stores a ready-made
+associative array in persistent memory. That's why its intermediate
+file (the heap file) is much larger than the other two intermediate
+files, why the heap file is nearly as large as pm-'gawk''s peak memory
+footprint while building the persistent array, and why its memory
+footprint is very small while serving a query that accesses a single
+array element. The upside of the large heap file is O(1) access instead
+of O(W)--a classic time-space tradeoff. If storage is a scarce
+resource, all three intermediate files can be compressed, 'freqtbl' by a
+factor of roughly 2.7, 'rwarray' by roughly 5.6x, and pm-'gawk' by
+roughly 11x using 'xz'. Compression is CPU-intensive and slow, another
+time-space tradeoff.
+
+
+File: pm-gawk.info, Node: Data Integrity, Next: Acknowledgments, Prev: Performance, Up: Top
+
+5 Data Integrity
+****************
+
+Mishaps including power outages, OS kernel panics, scripting bugs, and
+command-line typos can harm your data, but precautions can mitigate
+these risks. In scripting scenarios it usually suffices to create safe
+backups of important files at appropriate times. As simple as this
+sounds, care is needed to achieve genuine protection and to reduce the
+costs of backups. Here's a prudent yet frugal way to back up a heap
+file between uses:
+ $ backup_base=heap_bk_`date +%s`
+ $ cp --reflink=always heap.pma $backup_base.pma
+ $ chmod a-w $backup_base.pma
+ $ sync
+ $ touch $backup_base.done
+ $ chmod a-w $backup_base.done
+ $ sync
+ $ ls -l heap*
+ -rw-rw-r--. 1 me me 4096000 Aug 6 15:53 heap.pma
+ -r--r--r--. 1 me me 0 Aug 6 16:16 heap_bk_1659827771.done
+ -r--r--r--. 1 me me 4096000 Aug 6 16:16 heap_bk_1659827771.pma
+Timestamps in backup filenames make it easy to find the most recent copy
+if the heap file is damaged, even if last-mod metadata are inadvertently
+altered.
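The recovery side can be sketched in a few lines of shell (a hedged
illustration, not part of pm-'gawk': the 'heap_bk_*.done'/'.pma' naming
follows the listing above, and 'heap.pma' is the live heap file;
lexicographic 'sort' orders the epoch-second timestamps correctly since
they all have the same number of digits):

```shell
# Restore the newest backup that completed successfully: a .done
# marker attests that the matching .pma copy reached durable storage.
latest=$(ls heap_bk_*.done 2>/dev/null | sort | tail -n 1)
if [ -n "$latest" ]; then
    cp "${latest%.done}.pma" heap.pma      # reinstate the verified copy
    echo "restored ${latest%.done}.pma"
else
    echo "no verified backup found"
fi
```

A backup whose '.pma' file exists but whose '.done' marker is missing is
deliberately ignored: it may have been interrupted mid-copy.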
+
+ The 'cp' command's '--reflink' option reduces both the storage
+footprint of the copy and the time required to make it. Just as sparse
+files provide "pay as you go" storage footprints, reflink copying offers
+"pay as you _change_" storage costs.(1) A reflink copy shares storage
+with the original file. The file system ensures that subsequent changes
+to either file don't affect the other. Reflink copying is not available
+on all file systems; XFS, BtrFS, and OCFS2 currently support it.(2)
+Fortunately you can install, say, an XFS file system _inside an ordinary
+file_ on some other file system, such as 'ext4'.(3)
+
+ After creating a backup copy of the heap file we use 'sync' to force
+it down to durable media. Otherwise the copy may reside only in
+volatile DRAM memory--the file system's cache--where an OS crash or
+power failure could corrupt it.(4) After 'sync'-ing the backup we
+create and 'sync' a "success indicator" file with extension '.done' to
+address a nasty corner case: Power may fail _while_ a backup is being
+copied from the primary heap file, leaving either file, or both, corrupt
+on storage--a particularly worrisome possibility for jobs that run
+unattended. Upon reboot, each '.done' file attests that the
+corresponding backup succeeded, making it easy to identify the most
+recent successful backup.
+
+ Finally, if you're serious about tolerating failures you must "train
+as you would fight" by testing your hardware/software stack against
+realistic failures. For realistic power-failure testing, see
+<https://queue.acm.org/detail.cfm?id=3400902>.
+
+ ---------- Footnotes ----------
+
+ (1) The system call that implements reflink copying is described in
+'man ioctl_ficlone'.
+
+ (2) The '--reflink' option creates copies as sparse as the original.
+If reflink copying is not available, '--sparse=always' should be used.
+
+ (3) See
+<https://www.usenix.org/system/files/login/articles/login_winter19_08_kelly.pdf>
+
+ (4) On some OSes 'sync' provides very weak guarantees, but on Linux
+'sync' returns only after all file system data are flushed down to
+durable storage. If your 'sync' is unreliable, write a little C program
+that calls 'fsync()' to flush a file. To be safe, also call 'fsync()'
+on every enclosing directory on the file's 'realpath()' up to the root.
+
+
+File: pm-gawk.info, Node: Acknowledgments, Next: Installation, Prev: Data Integrity, Up: Top
+
+6 Acknowledgments
+*****************
+
+Haris Volos, Zi Fan Tan, and Jianan Li developed a persistent 'gawk'
+prototype based on a fork of the 'gawk' source. Advice from 'gawk'
+maintainer Arnold Robbins to me, which I forwarded to them, proved very
+helpful. Robbins moreover implemented, documented, and tested pm-'gawk'
+for the official version of 'gawk'; along the way he suggested numerous
+improvements for the 'pma' memory allocator beneath pm-'gawk'. Corinna
+Vinschen suggested other improvements to 'pma' and tested pm-'gawk' on
+Cygwin. Nelson H. F. Beebe provided access to Solaris machines for
+testing. Robbins, Volos, Li, Tan, Jon Bentley, and Hans Boehm reviewed
+drafts of this user manual and provided useful feedback. Bentley
+suggested the min/max/mean example in *note Examples::, and also the
+exercise of making Kernighan & Pike's "Markov" script persistent. Volos
+provided and tested the advice on tuning OS parameters in *note Virtual
+Memory and Big Data::. Stan Park answered several questions about
+virtual memory, file systems, and utilities.
+
+
+File: pm-gawk.info, Node: Installation, Next: Debugging, Prev: Acknowledgments, Up: Top
+
+Appendix A Installation
+***********************
+
+'gawk' 5.2 featuring persistent memory is expected to be released in
+August 2022; look for it at <http://ftp.gnu.org/gnu/gawk/>. If 5.2 is
+not released yet, the master git branch is available at
+<http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-master.tar.gz>.
+Unpack the tarball, run './configure', 'make', and 'make check', then
+try some of the examples presented earlier. In the normal course of
+events, 5.2 and later 'gawk' releases featuring pm-'gawk' will appear in
+the software package management systems of major GNU/Linux distros.
+Eventually pm-'gawk' will be available in the default 'gawk' on such
+systems.
+
+
+File: pm-gawk.info, Node: Debugging, Next: History, Prev: Installation, Up: Top
+
+Appendix B Debugging
+********************
+
+For bugs unrelated to persistence, see the 'gawk' documentation, e.g.,
+'GAWK: Effective AWK Programming', available at
+<https://www.gnu.org/software/gawk/manual/>.
+
+ If pm-'gawk' doesn't behave as you expect, first consider whether
+you're using the heap file that you intend; using the wrong heap file is
+a common mistake. Other fertile sources of bugs for newcomers are the
+fact that a 'BEGIN' block is executed every time pm-'gawk' runs, which
+isn't always what you really want, and the fact that built-in AWK
+variables such as 'NR' are always re-set to zero every time the
+interpreter runs. See the discussion of initialization surrounding the
+min/max/mean script in *note Examples::.
+
+ If you suspect a persistence-related bug in pm-'gawk', you can set an
+environment variable that will cause its persistent heap module, 'pma',
+to emit more verbose error messages; for details see the main 'gawk'
+documentation.
+
+ Programmers: You can re-compile 'gawk' with assertions enabled, which
+will trigger extensive integrity checks within 'pma'. Ensure that
+'pma.c' is compiled _without_ the '-DNDEBUG' flag when 'make' builds
+'gawk'. Run the resulting executable on small inputs, because the
+integrity checks can be very slow. If assertions fail, that likely
+indicates bugs somewhere in pm-'gawk'. Report such bugs to me (Terence
+Kelly) and also by following the procedures in the main 'gawk'
+documentation. Specify what version of 'gawk' you're using, and try to
+provide a small and simple script that reliably reproduces the bug.
+
+
+File: pm-gawk.info, Node: History, Prev: Debugging, Up: Top
+
+Appendix C History
+******************
+
+The pm-'gawk' persistence feature is based on a new persistent memory
+allocator, 'pma', whose design is described in
+<https://queue.acm.org/detail.cfm?id=3534855>. It is instructive to
+trace the evolutionary paths that led to 'pma' and pm-'gawk'.
+
+ I wrote many AWK scripts during my dissertation research on Web
+caching twenty years ago, most of which processed log files from Web
+servers and Web caches. Persistent 'gawk' would have made these scripts
+smaller, faster, and easier to write, but at the time I was unable even
+to imagine that pm-'gawk' was possible. So I wrote a lot of bothersome,
+inefficient code that manually dumped and re-loaded AWK script variables
+to and from text files. A decade would pass before my colleagues and I
+began to connect the dots that make persistent scripting possible, and a
+further decade would pass before pm-'gawk' came together.
+
+ Circa 2011 while working at HP Labs I developed a fault-tolerant
+distributed computing platform called "Ken," which contained a
+persistent memory allocator that resembles a simplified 'pma': It
+presented a 'malloc()'-like C interface and it allocated memory from a
+file-backed memory mapping. Experience with Ken convinced me that the
+software abstraction of persistent memory offers important attractions
+compared with the alternatives for managing persistent data (e.g.,
+relational databases and key-value stores). Unfortunately, Ken's
+allocator is so deeply intertwined with the rest of Ken that it's
+essentially inseparable; to enjoy the benefits of Ken's persistent
+memory, one must "buy in" to a larger and more complicated value
+proposition. Whatever its other virtues might be, Ken isn't ideal for
+showcasing the benefits of persistent memory in isolation.
+
+ Another entangled aspect of Ken was a crash-tolerance mechanism that,
+in retrospect, can be viewed as a user-space implementation of
+failure-atomic 'msync()'. The first post-Ken disentanglement effort
+isolated the crash-tolerance mechanism and implemented it in the Linux
+kernel, calling the result "failure-atomic 'msync()'" (FAMS). FAMS
+strengthens the semantics of ordinary standard 'msync()' by guaranteeing
+that the durable state of a memory-mapped file always reflects the most
+recent successful 'msync()' call, even in the presence of failures such
+as power outages and OS or application crashes. The original Linux
+kernel FAMS prototype is described in a paper by Park et al. in EuroSys
+2013. My colleagues and I subsequently implemented FAMS in several
+different ways including in file systems (FAST 2015) and user-space
+libraries. My most recent FAMS implementation, which leverages the
+reflink copying feature described elsewhere in this manual, is now the
+foundation of a new crash-tolerance feature in the venerable and
+ubiquitous GNU 'dbm' ('gdbm') database
+(<https://queue.acm.org/detail.cfm?id=3487353>).
+
+ In recent years my attention has returned to the advantages of
+persistent memory programming, lately a hot topic thanks to the
+commercial availability of byte-addressable non-volatile memory hardware
+(which, confusingly, is nowadays marketed as "persistent memory"). The
+software abstraction of persistent memory and the corresponding
+programming style, however, are perfectly compatible with _conventional_
+computers--machines with neither non-volatile memory nor any other
+special hardware or software. I wrote a few papers making this point,
+for example <https://queue.acm.org/detail.cfm?id=3358957>.
+
+ In early 2022 I wrote a new stand-alone persistent memory allocator,
+'pma', to make persistent memory programming easy on conventional
+hardware. The 'pma' interface is compatible with 'malloc()' and, unlike
+Ken's allocator, 'pma' is not coupled to a particular crash-tolerance
+mechanism. Using 'pma' is easy and, at least to some, enjoyable.
+
+ Ken had been integrated into prototype forks of both the V8
+JavaScript interpreter and a Scheme interpreter, so it was natural to
+consider whether 'pma' might similarly enhance an interpreted scripting
+language. GNU AWK was a natural choice because the source code is
+orderly and because 'gawk' has a single primary maintainer with an open
+mind regarding new features.
+
+ Jianan Li, Zi Fan Tan, Haris Volos, and I began considering
+persistence for 'gawk' in late 2021. While I was writing 'pma', they
+prototyped pm-'gawk' in a fork of the 'gawk' source. Experience with
+the prototype confirmed the expected convenience and efficiency benefits
+of pm-'gawk', and by spring 2022 Arnold Robbins was implementing
+persistence in the official version of 'gawk'. The persistence feature
+in official 'gawk' differs slightly from the prototype: The former uses
+an environment variable to pass the heap file name to the interpreter
+whereas the latter uses a mandatory command-line option. In many
+respects, however, the two implementations are similar. A description
+of the prototype, including performance measurements, is available at
+<http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf>.
+
+
+
+ I enjoy several aspects of pm-'gawk'. It's unobtrusive; as you gain
+familiarity and experience, it fades into the background of your
+scripting. It's simple in both concept and implementation, and more
+importantly it simplifies your scripts; much of its value is measured
+not in the code it enables you to write but rather in the code it lets
+you discard. It's all that I needed for my dissertation research twenty
+years ago, and more. Anecdotally, it appears to inspire creativity in
+early adopters, who have devised uses that pm-'gawk''s designers never
+anticipated. I'm curious to see what new purposes you find for it.
+
+
+
+Tag Table:
+Node: Top806
+Node: Introduction2008
+Ref: Introduction-Footnote-14325
+Node: Quick Start4416
+Node: Examples7203
+Node: Performance16098
+Node: Constant-Time Array Access16804
+Node: Virtual Memory and Big Data20094
+Ref: Virtual Memory and Big Data-Footnote-123645
+Node: Sparse Heap Files23781
+Node: Persistence versus Durability26791
+Ref: Persistence versus Durability-Footnote-130188
+Ref: Persistence versus Durability-Footnote-230429
+Node: Experiments30818
+Node: Results42633
+Node: Data Integrity46205
+Ref: Data Integrity-Footnote-149011
+Ref: Data Integrity-Footnote-249104
+Ref: Data Integrity-Footnote-349248
+Ref: Data Integrity-Footnote-449341
+Node: Acknowledgments49696
+Node: Installation50863
+Node: Debugging51635
+Node: History53305
+
+End Tag Table
+
+
+Local Variables:
+coding: utf-8
+End:
diff --git a/doc/pm-gawk.texi b/doc/pm-gawk.texi
new file mode 100644
index 00000000..b432f227
--- /dev/null
+++ b/doc/pm-gawk.texi
@@ -0,0 +1,1501 @@
+\input texinfo
+
+@c TODO: Checklist for release:
+@c revise all U P D A T E items as appropriate
+@c check all to-do notes
+@c remove most comments
+@c spell check (last 2am 15 Aug 2022)
+
+@c verbatim limits: 47 rows x 75 cols, smallformat 58 x 90
+
+@macro gwk {}
+@command{gawk}
+@end macro
+
+@macro pmg {}
+pm-@gwk{}
+@end macro
+
+@set TYTL Persistent-Memory @gwk{} User Manual
+
+@setfilename pm-gawk.info
+@settitle @value{TYTL}
+
+@dircategory Text creation and manipulation
+@direntry
+* pm-gawk: (pm-gawk). Persistent memory version of gawk.
+@end direntry
+
+@fonttextsize 11
+
+@c it seems to do no harm and possibly some good if color
+@c distinguishes internal links from URLs to outside web
+@tex
+\gdef\linkcolor{0.12 0.09 .5} % TK's attempt at subdued blue
+%\gdef\linkcolor{0.5 0.09 0.12} % Dark Red
+\gdef\urlcolor{0.5 0.09 0.12} % Dark Red
+\global\urefurlonlylinktrue
+@end tex
+
+@setchapternewpage off
+
+@copying
+@noindent
+@c UPDATE copyright info below
+Copyright @copyright{} 2022 Terence Kelly @*
+@ifnottex
+@noindent
+@email{tpkelly@@eecs.umich.edu} @*
+@email{tpkelly@@cs.princeton.edu} @*
+@email{tpkelly@@acm.org} @*
+@url{http://web.eecs.umich.edu/~tpkelly/pma/} @*
+@url{https://dl.acm.org/profile/81100523747}
+@end ifnottex
+
+@noindent
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3
+or any later version published by the Free Software Foundation;
+with the Invariant Sections being ``Introduction'' and ``History'',
+no Front-Cover Texts, and no Back-Cover Texts.
+A copy of the license is available at @*
+@url{https://www.gnu.org/licenses/fdl-1.3.html}
+@end copying
+
+@titlepage
+@title @value{TYTL}
+@c UPDATE date below
+@subtitle 15 August 2022
+@subtitle @gwk{} version 5.2
+@subtitle @pmg{} version 2022.08Aug.03.1659520468 (Avon 7)
+@author Terence Kelly
+@author @email{tpkelly@@eecs.umich.edu}
+@author @email{tpkelly@@cs.princeton.edu}
+@author @email{tpkelly@@acm.org}
+@author @url{http://web.eecs.umich.edu/~tpkelly/pma/}
+@author @url{https://dl.acm.org/profile/81100523747}
+@vskip 0pt plus 1filll
+@insertcopying
+@end titlepage
+
+@headings off
+
+@c @contents @c no need for this in a short document
+
+@node Top
+@ifnottex
+@ifnotxml
+@ifnotdocbook
+@top General Introduction
+@gwk{} 5.2 introduces a @emph{persistent memory} feature that can
+``remember'' script-defined variables and functions across executions;
+pass variables between unrelated scripts without serializing/parsing
+text files; and handle data sets larger than available memory plus
+swap. This supplementary manual provides an in-depth look at
+persistent-memory @gwk{}.
+
+@insertcopying
+@end ifnotdocbook
+@end ifnotxml
+@end ifnottex
+
+@menu
+* Introduction::
+* Quick Start::
+* Examples::
+* Performance::
+* Data Integrity::
+* Acknowledgments::
+* Installation::
+* Debugging::
+* History::
+@end menu
+
+@c ==================================================================
+@node Introduction
+@chapter Introduction
+
+@sp 1
+
+@c UPDATE below after official release
+GNU AWK (@gwk{}) 5.2, expected in August 2022, introduces a new
+@emph{persistent memory} feature that makes AWK scripting easier and
+sometimes improves performance. The new feature, called ``@pmg{},''
+can ``remember'' script-defined variables and functions across
+executions and can pass variables and functions between unrelated
+scripts without serializing/parsing text files---all with near-zero
+fuss. @pmg{} does @emph{not} require non-volatile memory hardware nor
+any other exotic infrastructure; @pmg{} runs on the ordinary
+conventional computers and operating systems that most of us have been
+using for decades.
+
+@sp 1
+
+@c TODO: ADR: hyperlinks to info page below
+
+@noindent
+The main @gwk{}
+documentation@footnote{@url{https://www.gnu.org/software/gawk/manual/}
+@w{ } and @w{ } @code{man gawk} @w{ } and @w{ } @code{info gawk}}
+covers the basics of the new persistence feature. This supplementary
+manual provides additional detail, tutorial examples, and a peek under
+the hood of @pmg{}. If you're familiar with @gwk{} and Unix-like
+environments, dive straight in: @*
+
+@itemize @c @w{}
+@item @ref{Quick Start} hits the ground running with a few keystrokes
+@item @ref{Examples} shows how @pmg{} streamlines typical AWK scripting
+@item @ref{Performance} covers asymptotic efficiency, OS tuning, and more
+@item @ref{Data Integrity} explains how to protect data from mishaps
+@item @ref{Acknowledgments} thanks those who made @pmg{} happen
+@item @ref{Installation} shows where to obtain @pmg{}
+@item @ref{Debugging} explains how to handle suspected bugs
+@item @ref{History} traces @pmg{}'s persistence technology
+@end itemize
+
+@c UPDATE: revise above when content finalized
+
+@sp 1
+
+@noindent
+You can find the latest version of this manual, and also the
+``director's cut,'' at the web site for the persistent memory
+allocator used in @pmg{}: @*
+@center @url{http://web.eecs.umich.edu/~tpkelly/pma/}
+
+@sp 1
+
+@noindent
+Two publications describe the persistent memory allocator and early
+experiences with a @pmg{} prototype based on a fork of the official
+@gwk{} sources: @*
+@url{https://queue.acm.org/detail.cfm?id=3534855} @*
+@url{http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf}
+
+
+@sp 1
+
+@noindent
+Feel free to send me questions, suggestions, and experiences: @*
+
+@noindent
+@w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @email{tpkelly@@eecs.umich.edu} @w{ } (preferred) @*
+@w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @email{tpkelly@@cs.princeton.edu} @*
+@w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @w{ } @email{tpkelly@@acm.org}
+
+@page
+@c ==================================================================
+@node Quick Start
+@chapter Quick Start
+
+@c example heaps are larger than strictly necessary so that readers
+@c who use them more extensively are less likely to exhaust memory
+
+Here's @pmg{} in action at the @command{bash} shell prompt (@samp{$}):
+@verbatim
+ $ truncate -s 4096000 heap.pma
+ $ export GAWK_PERSIST_FILE=heap.pma
+ $ gawk 'BEGIN{myvar = 47}'
+ $ gawk 'BEGIN{myvar += 7; print myvar}'
+ 54
+@end verbatim
+@noindent
+First, @command{truncate} creates an empty (all-zero-bytes) @dfn{heap
+file} where @pmg{} will store script variables; its size is a multiple
+of the system page size (4@tie{}KiB). Next, @command{export} sets an
+environment variable that enables @pmg{} to find the heap file; if
+@gwk{} does @emph{not} see this envar, persistence is not activated.
+The third command runs a one-line AWK script that initializes variable
+@code{myvar}, which will reside in the heap file and thereby outlive
+the interpreter process that initialized it. Finally, the fourth
+command invokes @pmg{} on a @emph{different} one-line script that
+increments and prints @code{myvar}. The output shows that @pmg{} has
+indeed ``remembered'' @code{myvar} across executions of unrelated
+scripts. (If the @gwk{} executable in your search @env{$PATH} lacks
+the persistence feature, the output in the above example will be
+@samp{7} instead of @samp{54}. @xref{Installation}.) To disable
+persistence until you want it again, prevent @gwk{} from finding the
+heap file via @command{unset GAWK_PERSIST_FILE}. To permanently
+``forget'' script variables, delete the heap file.
+
+@sp 2
+
+Toggling persistence by @command{export}-ing and @command{unset}-ing
+``ambient'' envars requires care: Forgetting to @command{unset} when
+you no longer want persistence can cause confusing bugs. Fortunately,
+@command{bash} allows you to pass envars more deliberately, on a
+per-command basis:
+@verbatim
+ $ rm heap.pma # start fresh
+ $ unset GAWK_PERSIST_FILE # eliminate ambient envar
+ $ truncate -s 4096000 heap.pma # create new heap file
+
+ $ GAWK_PERSIST_FILE=heap.pma gawk 'BEGIN{myvar = 47}'
+ $ gawk 'BEGIN{myvar += 7; print myvar}'
+ 7
+ $ GAWK_PERSIST_FILE=heap.pma gawk 'BEGIN{myvar += 7; print myvar}'
+ 54
+@end verbatim
+@noindent
+The first @gwk{} invocation sees the special envar prepended on the
+command line, so it activates @pmg{}. The second @gwk{} invocation,
+however, does @emph{not} see the envar and therefore does not access
+the script variable stored in the heap file. The third @gwk{}
+invocation sees the special envar and therefore uses the script
+variable from the heap file.
+
+While sometimes less error prone than ambient envars, per-command
+envar passing as shown above is verbose and shouty. A shell alias
+saves keystrokes and reduces visual clutter:
+@verbatim
+ $ alias pm='GAWK_PERSIST_FILE=heap.pma'
+ $ pm gawk 'BEGIN{print ++myvar}'
+ 55
+ $ pm gawk 'BEGIN{print ++myvar}'
+ 56
+@end verbatim
+
+@page
+@c ==================================================================
+@node Examples
+@chapter Examples
+
+Our first example uses @pmg{} to streamline analysis of a prose
+corpus, Mark Twain's @cite{Tom Sawyer} and @cite{Huckleberry Finn}
+from
+@c
+@url{https://gutenberg.org/files/74/74-0.txt}
+@c
+and
+@c
+@url{https://gutenberg.org/files/76/76-0.txt}.
+@c
+We first convert non-alphabetic characters to newlines (so each line
+has at most one word) and convert to lowercase:
+@verbatim
+ $ tr -c a-zA-Z '\n' < 74-0.txt | tr A-Z a-z > sawyer.txt
+ $ tr -c a-zA-Z '\n' < 76-0.txt | tr A-Z a-z > finn.txt
+@end verbatim
+
+It's easy to count word frequencies with AWK's associative arrays.
+@pmg{} makes these arrays persistent, so we need not re-ingest the
+entire corpus every time we ask a new question (``read once, analyze
+happily ever after''):
+@verbatim
+ $ truncate -s 100M twain.pma
+ $ export GAWK_PERSIST_FILE=twain.pma
+ $ gawk '{ts[$1]++}' sawyer.txt # ingest
+ $ gawk 'BEGIN{print ts["work"], ts["play"]}' # query
+ 92 11
+ $ gawk 'BEGIN{print ts["necktie"], ts["knife"]}' # query
+ 2 27
+@end verbatim
+@noindent
+The @command{truncate} command above creates a heap file large enough
+to store all of the data it must eventually contain, with plenty of
+room to spare (as we'll see in @ref{Sparse Heap Files}, this isn't
+wasteful). The @command{export} command ensures that subsequent
+@gwk{} invocations activate @pmg{}. The first @pmg{} command stores
+@cite{Tom Sawyer}'s word frequencies in associative array @code{ts[]}.
+Because this array is persistent, subsequent @pmg{} commands can
+access it without having to parse the input file again.
+
+Expanding our analysis to encompass a second book is easy. Let's
+populate a new associative array @code{hf[]} with the word frequencies
+in @cite{Huckleberry Finn}:
+@verbatim
+ $ gawk '{hf[$1]++}' finn.txt
+@end verbatim
+@noindent
+Now we can freely intermix accesses to both books' data conveniently
+and efficiently, without the overhead and coding fuss of repeated
+input parsing:
+@verbatim
+ $ gawk 'BEGIN{print ts["river"], hf["river"]}'
+ 26 142
+@end verbatim
+
+By making AWK more interactive, @pmg{} invites casual conversations
+with data. If we're curious what words in @cite{Finn} are absent from
+@cite{Sawyer}, answers (including ``flapdoodle,'' ``yellocution,'' and
+``sockdolager'') are easy to find:
+@verbatim
+ $ gawk 'BEGIN{for(w in hf) if (!(w in ts)) print w}'
+@end verbatim
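For contrast, the same set-difference question without persistence must re-read both corpora on every query. A minimal sketch using conventional awk's two-file idiom, shown here on tiny stand-in word lists (the file names and words below are illustrative, not drawn from the Twain corpus):

```shell
# Hypothetical miniature word lists, one word per line as in the corpus files
printf 'river\nraft\n'  > finn.words
printf 'river\nfence\n' > sawyer.words

# NR==FNR holds only while the first file is read: record its words;
# afterwards, print words of the second file that were never recorded.
awk 'NR == FNR { ts[$1] = 1; next } !($1 in ts)' sawyer.words finn.words
# prints: raft
```

Every such run repeats the ingestion of both files; with pm-gawk, the persistent arrays built once above answer the same question without re-reading anything.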
+@c also: doxolojer meedyevil ridicklous dingnation gumption cavortings
+@c phrenology [words about slavery] shakespeare camelopard ope
+@c mesmerism sapheads disremember consekens prevarication
+@c missionaryin cannibal nebokoodneezer sentimentering palavering
+
+Rumors of Twain's death may be exaggerated. If he publishes new books
+in the future, it will be easy to incorporate them into our analysis
+incrementally. The performance benefits of incremental processing for
+common AWK chores such as log file analysis are discussed in
+@url{https://queue.acm.org/detail.cfm?id=3534855} and the companion
+paper cited therein, and below in @ref{Performance}.
+@c @url{http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf}.
+
+Exercise: The ``Markov'' AWK script on page 79 of Kernighan & Pike's
+@cite{The Practice of Programming} generates random text reminiscent
+of a given corpus using a simple statistical modeling technique. This
+script consists of a ``learning'' or ``training'' phase followed by an
+output-generation phase. Use @pmg{} to de-couple the two phases and
+to allow the statistical model to incrementally ingest additions to
+the input corpus.
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+
+Our second example considers another domain that plays to AWK's
+strengths: data analysis. For simplicity we'll create two small input
+files of numeric data.
+@verbatim
+ $ printf '1\n2\n3\n4\n5\n' > A.dat
+ $ printf '5\n6\n7\n8\n9\n' > B.dat
+@end verbatim
+@noindent
+A conventional @emph{non}-persistent AWK script can compute basic
+summary statistics:
+@verbatim
+ $ cat summary_conventional.awk
+ 1 == NR { min = max = $1 }
+ min > $1 { min = $1 }
+ max < $1 { max = $1 }
+ { sum += $1 }
+ END { print "min: " min " max: " max " mean: " sum/NR }
+
+ $ gawk -f summary_conventional.awk A.dat B.dat
+ min: 1 max: 9 mean: 5
+@end verbatim
+
+To use @pmg{} for the same purpose, we first create a heap file for
+our AWK script variables and tell @pmg{} where to find it via the
+usual environment variable:
+@verbatim
+ $ truncate -s 10M stats.pma
+ $ export GAWK_PERSIST_FILE=stats.pma
+@end verbatim
+@noindent
+@pmg{} requires changing the above script to ensure that @code{min}
+and @code{max} are initialized exactly once, when the heap file is
+first used, and @emph{not} every time the script runs. Furthermore,
+whereas script-defined variables such as @code{min} retain their
+values across @pmg{} executions, built-in AWK variables such as
+@code{NR} are reset to zero every time @pmg{} runs, so we can't use
+them in the same way. Here's a modified script for @pmg{}:
+@verbatim
+ $ cat summary_persistent.awk
+ ! init { min = max = $1; init = 1 }
+ min > $1 { min = $1 }
+ max < $1 { max = $1 }
+ { sum += $1; ++n }
+ END { print "min: " min " max: " max " mean: " sum/n }
+@end verbatim
+@noindent
+Note the different pattern on the first line and the introduction of
+@code{n} to supplant @code{NR}. When used with @pmg{}, this new
+initialization logic supports the same kind of cumulative processing
+that we saw in the text-analysis scenario. For example, we can ingest
+our input files separately:
+@verbatim
+ $ gawk -f summary_persistent.awk A.dat
+ min: 1 max: 5 mean: 3
+
+ $ gawk -f summary_persistent.awk B.dat
+ min: 1 max: 9 mean: 5
+@end verbatim
+@noindent
+As expected, after the second @pmg{} invocation consumes the
+second input file, the output matches that of the non-persistent
+script that read both files at once.
+
+Exercise: Amend the AWK scripts above to compute the median and
+mode(s) using both conventional @gwk{} and @pmg{}. (The median is the
+number in the middle of a sorted list; if the length of the list is
+even, average the two numbers at the middle. The modes are the values
+that occur most frequently.)
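One possible sketch of the conventional (non-persistent) half of this exercise, delegating ordering to the @command{sort} utility so that only portable awk features are needed; the persistent variant follows the same once-only initialization pattern shown above. (The sample input and variable names are our own.)

```shell
# Median and mode(s) of a numeric stream: sort -n handles ordering,
# awk then picks the middle element(s) and the most frequent value(s).
printf '1\n2\n3\n4\n5\n5\n' | sort -n | awk '
  { v[NR] = $1; freq[$1]++ }
  END {
    # middle element, or average of the two middle elements
    median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR/2] + v[NR/2 + 1]) / 2
    best = 0
    for (x in freq) if (freq[x] > best) best = freq[x]
    out = ""
    for (x in freq) if (freq[x] == best) out = out (out ? " " : "") x
    print "median: " median " mode(s): " out
  }'
# prints: median: 3.5 mode(s): 5
```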
+
+@c heaps not portable across machines, use only with same gawk executable (?)
+@c refer to gawk docs for portability constraints on heaps
+@c can use only one heap at a time
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+
+Our third and final set of examples shows that @pmg{} allows us to
+bundle both script-defined data and user-defined @emph{functions}
+in a persistent heap that may be passed freely between unrelated AWK
+scripts.
+
+The following shell transcript repeatedly invokes @pmg{} to create and
+then employ a user-defined function. These separate invocations
+involve several different AWK scripts that communicate via the heap
+file. Each invocation can add user-defined functions and add or
+remove data from the heap that subsequent invocations will access.
+@smallformat
+@verbatim
+ $ truncate -s 10M funcs.pma
+ $ export GAWK_PERSIST_FILE=funcs.pma
+ $ gawk 'function count(A,t) {for(i in A)t++; return ""==t?0:t}'
+ $ gawk 'BEGIN { a["x"] = 4; a["y"] = 5; a["z"] = 6 }'
+ $ gawk 'BEGIN { print count(a) }'
+ 3
+ $ gawk 'BEGIN { delete a["x"] }'
+ $ gawk 'BEGIN { print count(a) }'
+ 2
+ $ gawk 'BEGIN { delete a }'
+ $ gawk 'BEGIN { print count(a) }'
+ 0
+ $ gawk 'BEGIN { for (i=0; i<47; i++) a[i]=i }'
+ $ gawk 'BEGIN { print count(a) }'
+ 47
+@end verbatim
+@end smallformat
+@noindent
+The first @pmg{} command creates user-defined function @code{count()},
+which returns the number of entries in a given associative array; note
+that variable @code{t} is local to @code{count()}, not global. The
+next @pmg{} command populates a persistent associative array with
+three entries; not surprisingly, the @code{count()} call in the
+following @pmg{} command finds these three entries. The next two
+@pmg{} commands respectively delete an array entry and print the
+reduced count, 2. The two commands after that delete the entire array
+and print a count of zero. Finally, the last two @pmg{} commands
+populate the array with 47 entries and count them.
+
+@c I could be persuaded to leave the polynomial example as an
+@c exercise, offering to send my answer to readers upon request.
+
+The following shell script invokes @pmg{} repeatedly to create a
+collection of user-defined functions that perform basic operations on
+quadratic polynomials: evaluation at a given point, computing the
+discriminant, and using the quadratic formula to find the roots. It
+then factorizes @math{x^2 + x - 12} into @math{(x - 3)(x + 4)}.
+@smallformat
+@verbatim
+ #!/bin/sh
+ rm -f poly.pma
+ truncate -s 10M poly.pma
+ export GAWK_PERSIST_FILE=poly.pma
+ gawk 'function q(x) { return a*x^2 + b*x + c }'
+ gawk 'function p(x) { return "q(" x ") = " q(x) }'
+ gawk 'BEGIN { print p(2) }' # evaluate & print
+ gawk 'BEGIN{ a = 1; b = 1; c = -12 }' # new coefficients
+ gawk 'BEGIN { print p(2) }' # eval/print again
+ gawk 'function d(s) { return s * sqrt(b^2 - 4*a*c)}'
+ gawk 'BEGIN{ print "discriminant (must be >=0): " d(1)}'
+ gawk 'function r(s) { return (-b + d(s))/(2*a)}'
+ gawk 'BEGIN{ print "root: " r( 1) " " p(r( 1)) }'
+ gawk 'BEGIN{ print "root: " r(-1) " " p(r(-1)) }'
+ gawk 'function abs(n) { return n >= 0 ? n : -n }'
+ gawk 'function sgn(x) { return x >= 0 ? "- " : "+ " } '
+ gawk 'function f(s) { return "(x " sgn(r(s)) abs(r(s))}'
+ gawk 'BEGIN{ print "factor: " f( 1) ")" }'
+ gawk 'BEGIN{ print "factor: " f(-1) ")" }'
+ rm -f poly.pma
+@end verbatim
+@end smallformat
+@noindent
+
+@page
+@c ==================================================================
+@node Performance
+@chapter Performance
+
+This chapter explains several performance advantages that result from
+the implementation of persistent memory in @pmg{}, shows how tuning
+the underlying operating system sometimes improves performance, and
+presents experimental performance measurements. To make the
+discussion concrete, we use examples from a ``GNU/Linux'' system---GNU
+utilities atop the Linux OS---but the principles apply to other modern
+operating systems.
+
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Constant-Time Array Access
+@section Constant-Time Array Access
+
+@pmg{} preserves the efficiency of data access when data structures
+are created by one process and later re-used by a different process.
+
+Consider the associative arrays used to analyze Mark Twain's books in
+@ref{Examples}. We created arrays @code{ts[]} and @code{hf[]} by
+reading files @file{sawyer.txt} and @file{finn.txt}. If @i{N} denotes
+the total volume of data in these files, building the associative
+arrays typically requires time proportional to @i{N}, or ``@i{O(N)}
+expected time'' in the lingo of asymptotic analysis. If @i{W} is the
+number of unique words in the input files, the size of the associative
+arrays will be proportional to @i{W}, or @i{O(W)}. Accessing
+individual array elements requires only @emph{constant} or @i{O(1)}
+expected time, not @i{O(N)} or @i{O(W)} time, because @gwk{}
+implements arrays as hash tables.
+
+@c how much larger is N than W for the Twain texts?
+@c % wc -w sawyer.txt finn.txt
+@c 77523 sawyer.txt
+@c 120864 finn.txt
+@c 198387 total
+@c % cat sawyer.txt finn.txt | sort | uniq | wc -w
+@c 10447
+@c
+@c #words is 19x larger than #uniquewords
+@c
+@c Note that the total number of English words in existence is fixed,
+@c so as the size of a corpus increases without bound, the ratio of
+@c vocabulary size to corpus size tends toward zero.
+
+The performance advantage of @pmg{} arises when different processes
+create and access associative arrays. Accessing an element of a
+persistent array created by a previous @pmg{} process, as we did
+earlier in
+@c
+@verb{|BEGIN{print ts["river"], hf["river"]}|},
+@c
+still requires only @i{O(1)} time, which is asymptotically far
+superior to the alternatives. Na@"{@dotless{i}}vely reconstructing
+arrays by re-ingesting all raw inputs in every @gwk{} process that
+accesses the arrays would of course require @i{O(N)} time---a
+profligate cost if the text corpus is large. Dumping arrays to files
+and re-loading them as needed would reduce the preparation time for
+access to @i{O(W)}. That can be a substantial improvement in
+practice; @i{N} is roughly 19 times larger than @i{W} in our Twain
+corpus. Nonetheless @i{O(W)} remains far slower than @pmg{}'s
+@i{O(1)}. As we'll see in @ref{Results}, the difference is not merely
+theoretical.
+
+The persistent memory implementation beneath @pmg{} enables it to
+avoid work proportional to @i{N} or @i{W} when accessing an element of
+a persistent associative array. Under the hood, @pmg{} stores
+script-defined AWK variables such as associative arrays in a
+persistent heap laid out in a memory-mapped file (the heap file).
+When an AWK script accesses an element of an associative array, @pmg{}
+performs a lookup on the corresponding hash table, which in turn
+accesses memory on the persistent heap. Modern operating systems
+implement memory-mapped files in such a way that these memory accesses
+trigger the bare minimum of data movement required: Only those parts
+of the heap file containing needed data are ``paged in'' to the memory
+of the @pmg{} process. In the worst case, the heap file is not in the
+file system's in-memory cache, so the required pages must be faulted
+into memory from storage. Our asymptotic analysis of efficiency
+applies regardless of whether the heap file is cached or not. The
+entire heap file is @emph{not} accessed merely to access an element of
+a persistent associative array.
+
+Persistent memory thus enables @pmg{} to offer the flexibility of
+de-coupling data ingestion from analytic queries without the fuss and
+overhead of serializing and loading data structures and without
+sacrificing constant-time access to the associative arrays that make
+AWK scripting convenient and productive.
+
+@c Further details on @pmg{}'s persistent heap are available in
+@c @url{https://queue.acm.org/detail.cfm?id=3534855}
+@c
+@c and [excessively long NVMW URL below]
+@c
+@c @url{http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf}.
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Virtual Memory and Big Data
+@section Virtual Memory and Big Data
+
+Small data sets seldom spoil the delights of AWK by causing
+performance troubles, with or without persistence. As the size of the
+@gwk{} interpreter's internal data structures approaches the capacity
+of physical memory, however, acceptable performance requires
+understanding modern operating systems and sometimes tuning them.
+Fortunately @pmg{} offers new degrees of control for
+performance-conscious users tackling large data sets. A terse
+mnemonic captures the basic principle: Precluding paging promotes peak
+performance and prevents perplexity.
+
+Modern operating systems feature @dfn{virtual memory} that strives to
+appear both larger than installed DRAM (which is small) and faster
+than installed storage devices (which are slow). As a program's
+memory footprint approaches the capacity of DRAM, the virtual memory
+system transparently @dfn{pages} (moves) the program's data between
+DRAM and a @dfn{swap area} on a storage device. Paging can degrade
+performance mildly or severely, depending on the program's memory
+access patterns. Random accesses to large data structures can trigger
+excessive paging and dramatic slowdown. Unfortunately, the hash
+tables beneath AWK's signature associative arrays inherently require
+random memory accesses, so large associative arrays can be
+problematic.
+
+Persistence changes the rules in our favor: The OS pages data to
+@pmg{}'s @emph{heap file} instead of the swap area. This won't help
+performance much if the heap file resides in a conventional
+storage-backed file system. On Unix-like systems, however, we may
+place the heap file in a DRAM-backed file system such as
+@file{/dev/shm/}, which entirely prevents paging to slow storage
+devices. Temporarily placing the heap file in such a file system is a
+reasonable expedient, with two caveats: First, keep in mind that
+DRAM-backed file systems perish when the machine reboots or crashes,
+so you must copy the heap file to a conventional storage-backed file
+system when your computation is done. Second, @pmg{}'s memory
+footprint can't exceed available DRAM if you place the heap file in a
+DRAM-backed file system.
+
+Tuning OS paging parameters may improve performance if you decide to
+run @pmg{} with a heap file in a conventional storage-backed file
+system. Some OSes have unhelpful default habits regarding modified
+(``dirty'') memory backed by files. For example, the OS may write
+dirty memory pages to the heap file periodically and/or when the OS
+believes that ``too much'' memory is dirty. Such ``eager'' writeback
+can degrade performance noticeably and brings no benefit to @pmg{}.
+Fortunately, some OSes allow paging defaults to be overridden so that
+writeback is ``lazy'' rather than eager. For Linux see the discussion
+of the @code{dirty_*} parameters at
+@url{https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html}.
+Changing these parameters can prevent wasteful eager
+paging:@footnote{The @command{tee} rigmarole is explained at
+@url{https://askubuntu.com/questions/1098059/which-is-the-right-way-to-drop-caches-in-lubuntu}.}
+@verbatim
+ $ echo 100 | sudo tee /proc/sys/vm/dirty_background_ratio
+ $ echo 100 | sudo tee /proc/sys/vm/dirty_ratio
+ $ echo 300000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
+ $ echo 50000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
+@end verbatim
+@noindent
+Tuning paging parameters can help non-persistent @gwk{} as well as
+@pmg{}. [Disclaimer: OS tuning is an occult art, and your mileage may
+vary.]
+
+@c sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+@c
+@c sudo doesn't convey root privileges to the redirection '>' when calling from cmd line
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Sparse Heap Files
+@section Sparse Heap Files
+
+To be frugal with storage resources, @pmg{}'s heap file should be
+created as a @dfn{sparse file}: a file whose logical size is larger
+than its storage resource footprint. Modern file systems support
+sparse files, which are easy to create using the @command{truncate}
+tool shown in our examples.
+
+Let's first create a conventional @emph{non}-sparse file using
+@command{echo}:
+@verbatim
+ $ echo hi > dense
+ $ ls -l dense
+ -rw-rw-r--. 1 me me 3 Aug 5 23:08 dense
+ $ du -h dense
+ 4.0K dense
+@end verbatim
+@noindent
+The @command{ls} utility reports that file @file{dense} is three bytes
+long (two for the letters in ``hi'' plus one for the newline). The
+@command{du} utility reports that this file consumes 4@tie{}KiB of
+storage---one block of disk, as small as a non-sparse file's storage
+footprint can be. Now let's use @command{truncate} to create a
+logically enormous sparse file and check its physical size:
+@verbatim
+ $ truncate -s 1T sparse
+ $ ls -l sparse
+ -rw-rw-r--. 1 me me 1099511627776 Aug 5 22:33 sparse
+ $ du -h sparse
+ 0 sparse
+@end verbatim
+@noindent
+Whereas @command{ls} reports the logical file size that we expect (one
+TiB, or @math{2^{40}} bytes), @command{du} reveals that the
+file occupies no storage whatsoever. The file system will allocate
+physical storage resources beneath this file as data is written to it;
+reading unwritten regions of the file yields zeros.
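You can confirm the all-zeros behavior directly (assuming GNU coreutils; the file name is arbitrary):

```shell
truncate -s 1M demo.sparse           # logically 1 MiB, physically nothing yet
head -c 8 demo.sparse | od -An -tx1  # read eight unwritten bytes
# prints a row of zero bytes: 00 00 00 00 00 00 00 00
rm demo.sparse
```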
+
+The ``pay as you go'' storage cost of sparse files offers both
+convenience and control for @pmg{} users. If your file system
+supports sparse files, go ahead and create lavishly capacious heap
+files for @pmg{}. Their logical size costs nothing and persistent
+memory allocation within @pmg{} won't fail until physical storage
+resources beneath the file system are exhausted. But if instead you
+want to @emph{prevent} a heap file from consuming too much storage,
+simply set its initial size to whatever bound you wish to enforce; it
+won't eat more disk than that. Copying sparse files with GNU
+@command{cp} creates sparse copies by default.
+
+File-system encryption can preclude sparse files: If the cleartext of
+a byte offset range within a file is all zero bytes, the corresponding
+ciphertext probably shouldn't be all zeros! Encrypting at the storage
+layer instead of the file system layer may offer acceptable security
+while still permitting file systems to implement sparse files.
+
+Sometimes you might prefer a dense heap file backed by pre-allocated
+storage resources, for example to increase the likelihood that
+@pmg{}'s internal memory allocation will succeed until the persistent
+heap occupies the entire heap file. The @command{fallocate} utility
+will do the trick:
+@verbatim
+ $ fallocate -l 1M mibi
+ $ ls -l mibi
+ -rw-rw-r--. 1 me me 1048576 Aug 5 23:18 mibi
+ $ du -h mibi
+ 1.0M mibi
+@end verbatim
+@noindent
+We get the MiB we asked for, both logically and physically.
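The reverse conversion also exists: on Linux, util-linux's @command{fallocate -d} deallocates (``digs holes'' in) any all-zero blocks of an existing file, turning a dense file back into a sparse one without changing its logical size. A sketch:

```shell
fallocate -l 1M mibi   # dense: 1 MiB logical and physical
fallocate -d mibi      # dig holes wherever blocks are all zero
ls -l mibi             # logical size is still 1048576
du -h mibi             # physical footprint drops back toward zero
rm mibi
```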
+
+@c UPDATE: search for username in "ls" examples
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Persistence versus Durability
+@section Persistence versus Durability
+
+Arguably the most important general guideline for good performance in
+computer systems is, ``pay only for what you
+need.''@footnote{Remarkably, this guideline is widely ignored in
+surprising ways. Certain well-known textbook algorithms continue to
+grind away fruitlessly long after having computed all of their
+output. @*
+@c
+@url{https://queue.acm.org/detail.cfm?id=3424304}}
+@c
+To apply this maxim to @pmg{} we must distinguish two concepts that
+are frequently conflated: persistence and durability.@footnote{In
+recent years the term ``persistent memory'' has sometimes been used to
+denote novel byte-addressable non-volatile memory hardware---an
+unfortunate practice that contradicts sensible long-standing
+convention and causes needless confusion. NVM provides durability.
+Persistent memory is a software abstraction that doesn't require NVM.
+@url{https://queue.acm.org/detail.cfm?id=3358957}} (A third logically
+distinct concept is the subject of @ref{Data Integrity}.)
+
+@dfn{Persistent} data outlive the processes that access them, but
+don't necessarily last forever. For example, as explained in
+@command{man mq_overview}, message queues are persistent because they
+exist until the system shuts down. @dfn{Durable} data reside on a
+physical medium that retains its contents even without continuously
+supplied power. For example, hard disk drives and solid state drives
+store durable data. Confusion arises because persistence and
+durability are often correlated: Data in ordinary file systems backed
+by HDDs or SSDs are typically both persistent and durable.
+Familiarity with @code{fsync()} and @code{msync()} might lead us to
+believe that durability is a subset of persistence, but in fact the
+two characteristics are orthogonal: Data in the swap area are durable
+but not persistent; data in DRAM-backed file systems such as
+@file{/dev/shm/} are persistent but not durable.
+
+Durability often costs more than persistence, so performance-conscious
+@pmg{} users pay the added premium for durability only when
+persistence alone is not sufficient. Two ways to avoid unwanted
+durability overheads were discussed in @ref{Virtual Memory and Big
+Data}: Place @pmg{}'s heap file in a DRAM-backed file system, or
+disable eager writeback to the heap file. Expedients such as these
+enable you to remove durability overheads from the critical path of
+multi-stage data analyses even when you want heap files to eventually
+be durable: Allow @pmg{} to run at full speed with persistence alone;
+force the heap file to durability (using the @command{cp} and
+@command{sync} utilities as necessary) after output has been emitted
+to the next stage of the analysis and the @pmg{} process using the
+heap has terminated.
+
+Experimenting with synthetic data builds intuition for how persistence
+and durability affect performance. You can write a little AWK or C
+program to generate a stream of random text, or just cobble together a
+quick and dirty generator on the command line:
+@verbatim
+ $ openssl rand --base64 1000000 | tr -c a-zA-Z '\n' > random.dat
+@end verbatim
+@noindent
+Varying the size of random inputs can, for example, find where
+performance ``falls off the cliff'' as @pmg{}'s memory footprint
+exceeds the capacity of DRAM and paging begins.
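One way to script such an experiment, sketched here with plain @command{awk} and @file{/dev/urandom} so it runs anywhere; substitute @pmg{} plus a heap file, and much larger sizes, to locate the cliff on your machine. (The sizes and file name below are illustrative.)

```shell
# Time array construction for increasing input sizes.
for n in 100000 400000; do
  LC_ALL=C head -c $n /dev/urandom | tr -c 'a-zA-Z' '\n' > random.dat
  t0=$(date +%s%N)               # GNU date, nanosecond resolution
  awk '{ c[$1]++ }' random.dat   # build the word-frequency array
  t1=$(date +%s%N)
  echo "$n bytes ingested in $(( (t1 - t0) / 1000000 )) ms"
done
```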
+
+@c TODO:
+@c virtual *machines* / cloud machines can make performance hard to measure repeatably
+@c here we assume good old fashioned OS install directly on "bare metal"
+
+Experiments require careful methodology, especially when the heap file
+is in a storage-backed file system. Overlooking the file system's
+DRAM cache can easily misguide interpretation of results and foil
+repeatability. Fortunately Linux allows us to invalidate the file
+system cache and thus mimic a ``cold start'' condition resembling the
+immediate aftermath of a machine reboot. Accesses to ordinary files
+on durable storage will then be served from the storage devices, not
+from cache. Read about @command{sync} and
+@file{/proc/sys/vm/drop_caches} at @*
+@c
+@url{https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html}.
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Experiments
+@section Experiments
+
+The C-shell (@command{csh}) script listed below illustrates concepts and
+implements tips presented in this chapter. It produced the results
+discussed in @ref{Results} in roughly 20 minutes on an aging laptop.
+You should be able to cut and paste the code listing below into a file
+and get it running without excessive fuss; write to me if you have
+difficulty.
+
+The script measures the performance of four different ways to support
+word frequency queries over a text corpus: The naive approach of
+reading the corpus into an associative array for every query; manually
+dumping a text representation of the word-frequency table and manually
+loading it prior to a query; using @gwk{}'s @code{rwarray} extension
+to dump and load an associative array; and using @pmg{} to maintain a
+persistent associative array.
+
+Comments at the top explain prerequisites. Lines 8--10 set input
+parameters: the directory where tests are run and where files
+including the heap file are held, the off-the-shelf timer used to
+measure run times and other performance characteristics such as peak
+memory usage, and the size of the input. The default input size
+results in @pmg{} memory footprints under 3 GiB, which is large enough
+for interesting results and small enough to fit in DRAM and avoid
+paging on today's computers. Lines 13--14 define a homebrew timer.
+
+Two sections of the script are relevant if the default run directory
+is changed from @file{/dev/shm/} to a directory in a conventional
+storage-backed file system: Lines 15--17 define the mechanism for
+clearing file data cached in DRAM; lines 23--30 set Linux kernel
+parameters to discourage eager paging.
+
+Lines 37--70 spit out, compile, and run a little C program to generate
+a random text corpus. This program is fast, flexible, and
+deterministic, generating the same random output given the same
+parameters.
+
+Lines 71--100 run the four different AWK approaches on the same random
+input, reporting separately the time to build and to query the
+associative array containing word frequencies.
+
+@sp 1
+
+@smallformat
+@verbatim
+#!/bin/csh -f    # 1
+# Set PMG envar to path of pm-gawk executable and AWKLIBPATH    # 2
+# to find rwarray.so    # 3
+# Requires "sudo" to work; consider this for /etc/sudoers file:    # 4
+# Defaults:youruserid !authenticate    # 5
+echo 'begin: ' `date` `date +%s`    # 6
+unsetenv GAWK_PERSIST_FILE # disable persistence until wanted    # 7
+set dir = '/dev/shm' # where heap file et al. will live    # 8
+set tmr = '/usr/bin/time' # can also use shell built-in "time"    # 9
+set isz = 1073741824 # input size; 1 GiB    # 10
+# set isz = 100000000 # small input for quick testing    # 11
+cd $dir # tick/tock/tyme below are homebrew timer, good within ~2ms    # 12
+alias tick 'set t1 = `date +%s.%N`' ; alias tock 'set t2 = `date +%s.%N`'    # 13
+alias tyme '$PMG -v t1=$t1 -v t2=$t2 "BEGIN{print t2-t1}"'    # 14
+alias tsync 'tick ; sync ; tock ; echo "sync time: " `tyme`'    # 15
+alias drop_caches 'echo 3 | sudo tee /proc/sys/vm/drop_caches'    # 16
+alias snd 'tsync; drop_caches'    # 17
+echo "pm-gawk: $PMG" ; echo 'std gawk: ' `which gawk`    # 18
+echo "run dir: $dir" ; echo 'pwd: ' `pwd`    # 19
+echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}'    # 20
+echo 'timer: ' $tmr ; echo 'AWKLIBPATH: ' $AWKLIBPATH    # 21
+@end verbatim
+@page @c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@verbatim
+echo 'OS params:' ; set vm = '/proc/sys/vm/dirty'    # 22
+sudo sh -c "echo 100 > ${vm}_background_ratio" # restore these    # 23
+sudo sh -c "echo 100 > ${vm}_ratio" # paging params    # 24
+sudo sh -c "echo 1080000 > ${vm}_expire_centisecs" # to defaults    # 25
+sudo sh -c "echo 1080000 > ${vm}_writeback_centisecs" # if necessary    # 26
+foreach d ( ${vm}_background_ratio ${vm}_ratio \    # 27
+ ${vm}_expire_centisecs ${vm}_writeback_centisecs )    # 28
+ printf " %-38s %7d\n" $d `cat $d`    # 29
+end    # 30
+tick ; tock ; echo 'timr ovrhd: ' `tyme` 's (around 2ms for TK)'    # 31
+tick ; $PMG 'BEGIN{print "pm-gawk? yes"}'    # 32
+tock ; echo 'pmg ovrhd: ' `tyme` 's (around 4-5 ms for TK)'    # 33
+set inp = 'input.dat'    # 34
+echo 'input size ' $isz    # 35
+echo "input file: $inp"    # 36
+set rg = rgen # spit out and compile C program to generate random inputs    # 37
+rm -f $inp $rg.c $rg    # 38
+cat <<EOF > $rg.c    # 39
+// generate N random words, one per line, no blank lines    # 40
+// charset is e.g. 'abcdefg@' where '@' becomes newline    # 41
+#include <stdio.h>    # 42
+#include <stdlib.h>    # 43
+#include <string.h>    # 44
+#define RCH c = a[rand() % L];    # 45
+#define PICK do { RCH } while (0)    # 46
+#define PICKCH do { RCH } while (c == '@')    # 47
+#define FP(...) fprintf(stderr, __VA_ARGS__)    # 48
+int main(int argc, char *argv[]) {    # 49
+ if (4 != argc) {    # 50
+ FP("usage: %s charset N seed\n",    # 51
+ argv[0]); return 1; }    # 52
+ char c, *a = argv[1]; size_t L = strlen(a);    # 53
+ long int N = atol(argv[2]);    # 54
+ srand( atol(argv[3]));    # 55
+ if (2 > N) { FP("N == %ld < 2\n", N); return 2; }    # 56
+ PICKCH;    # 57
+ for (;;) {    # 58
+ if (2 == N) { PICKCH; putchar(c); putchar('\n'); break; }    # 59
+ if ('@' == c) { putchar('\n'); PICKCH; }    # 60
+ else { putchar( c ); PICK; }    # 61
+ if (0 >= --N) break;    # 62
+ }    # 63
+}    # 64
+EOF    # 65
+gcc -std=c11 -Wall -Wextra -O3 -o $rg $rg.c    # 66
+set t = '@@@@@@@' ; set c = "abcdefghijklmnopqrstuvwxyz$t$t$t$t$t$t"    # 67
+tick ; ./$rg "$c" $isz 47 > $inp ; tock ; echo 'gen time: ' `tyme`    # 68
+echo "input file: $inp"    # 69
+echo 'input wc: ' `wc < $inp` ; echo 'input uniq: ' `sort -u $inp | wc`    # 70
+@end verbatim
+@page @c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@verbatim
+snd ############################################################################  # 71
+tick ; $tmr $PMG '{n[$1]++}END{print "output: " n["foo"]}' $inp  # 72
+tock ; echo 'T naive O(N): ' `tyme` ; echo ''  # 73
+rm -f rwa  # 74
+snd ############################################################################  # 75
+echo ''  # 76
+tick ; $tmr $PMG -l rwarray '{n[$1]++}END{print "writea",writea("rwa",n)}' $inp  # 77
+tock ; echo 'T rwarray build O(N): ' `tyme` ; echo ''  # 78
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 79
+tick ; $tmr $PMG -l rwarray 'BEGIN{print "reada",reada("rwa",n); \  # 80
+ print "output: " n["foo"]}'  # 81
+tock ; echo 'T rwarray query O(W): ' `tyme` ; echo ''  # 82
+rm -f ft  # 83
+snd ############################################################################  # 84
+tick ; $tmr $PMG '{n[$1]++}END{for(w in n)print n[w], w}' $inp > ft  # 85
+tock ; echo 'T freqtbl build O(N): ' `tyme` ; echo ''  # 86
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 87
+tick ; $tmr $PMG '{n[$2] = $1}END{print "output: " n["foo"]}' ft  # 88
+tock ; echo 'T freqtbl query O(W): ' `tyme` ; echo ''  # 89
+rm -f heap.pma  # 90
+snd ############################################################################  # 91
+truncate -s 3G heap.pma # enlarge if needed  # 92
+setenv GAWK_PERSIST_FILE heap.pma  # 93
+tick ; $tmr $PMG '{n[$1]++}' $inp  # 94
+tock ; echo 'T pm-gawk build O(N): ' `tyme` ; echo ''  # 95
+snd # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 96
+tick ; $tmr $PMG 'BEGIN{print "output: " n["foo"]}'  # 97
+tock ; echo 'T pm-gawk query O(1): ' `tyme` ; echo ''  # 98
+unsetenv GAWK_PERSIST_FILE  # 99
+snd ############################################################################  # 100
+echo 'Note: all output lines above should be identical' ; echo ''  # 101
+echo 'dir content:' ; ls -l $dir |& $PMG '{print " " $0}'  # 102
+echo '' ; echo 'storage footprints:'  # 103
+foreach f ( rwa ft heap.pma ) # compression is very slow, so we comment it out  # 104
+ echo " $f " `du -BK $dir/$f` # `xz --best < $dir/$f | wc -c` 'bytes xz'  # 105
+end  # 106
+echo '' ; echo 'end: ' `date` `date +%s` ; echo ''  # 107
+@end verbatim
+@end smallformat
+
+@page
+@c = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
+@node Results
+@section Results
+
+Running the script of @ref{Experiments} with default parameters on an
+aging laptop yielded the results summarized in the table below. More
+extensive experiments, not reported here, yield qualitatively similar
+results. Keep in mind that performance measurements are often
+sensitive to seemingly irrelevant factors. For example, the program
+that runs first may have the advantage of a cooler CPU; later
+contestants may start with a hot CPU and consequent clock throttling
+by a modern processor's thermal regulation apparatus. More generally,
+performance measurement is a notoriously tricky business. For
+scripting, whose main motive is convenience rather than speed, the
+proper role for performance measurements is to qualitatively test
+hypotheses such as those that follow from asymptotic analyses and to
+provide a rough idea of when various approaches are practical.
+
+@sp 1
+
+@verbatim
+ run time peak memory intermediate
+ AWK script (sec) footprint (K) storage (K)
+
+ naive O(N) 242.132 2,081,360 n/a
+ rwarray build O(N) 250.288 2,846,868 156,832
+ rwarray query O(W) 11.653 2,081,444
+ freqtbl build O(N) 288.408 2,400,120 69,112
+ freqtbl query O(W) 11.624 2,336,616
+ pm-gawk build O(N) 251.946 2,079,520 2,076,608
+ pm-gawk query O(1) 0.026 3,252
+@end verbatim
+
+@sp 1
+
+The results are consistent with the asymptotic analysis of
+@ref{Constant-Time Array Access}. All four approaches require roughly
+four minutes to read the synthetic input data. The
+na@"{@dotless{i}}ve approach must do this every time it performs a
+query, but the other three build an associative array to support
+queries and separately serve such queries. The @code{freqtbl} and
+@code{rwarray} approaches build an associative array of word
+frequencies, serialize it to an intermediate file, and then read the
+entire intermediate file prior to serving queries; the former does
+this manually and the latter uses a @gwk{} extension. Both can serve
+queries in roughly ten seconds, not four minutes. As we'd expect from
+the asymptotic analysis, performing work proportional to the number of
+words is preferable to work proportional to the size of the raw input
+corpus: @i{O(W)} time is faster than @i{O(N)}. And as we'd expect,
+@pmg{}'s constant-time queries are faster still, by roughly two orders
+of magnitude. For the computations considered here, @pmg{} makes the
+difference between blink-of-an-eye interactive queries and response
+times long enough for the user's mind to wander.
+
+Whereas @code{freqtbl} and @code{rwarray} reconstruct an associative
+array prior to accessing an individual element, @pmg{} stores a
+ready-made associative array in persistent memory. That's why its
+intermediate file (the heap file) is much larger than the other two
+intermediate files, why the heap file is nearly as large as @pmg{}'s
+peak memory footprint while building the persistent array, and why its
+memory footprint is very small while serving a query that accesses a
+single array element. The upside of the large heap file is @i{O(1)}
+access instead of @i{O(W)}---a classic time-space tradeoff. If
+storage is a scarce resource, all three intermediate files can be
+compressed with @command{xz}: @code{freqtbl} by a factor of roughly
+2.7, @code{rwarray} by roughly 5.6, and @pmg{} by roughly 11.
+Compression is CPU-intensive and slow, another time-space tradeoff.
+
+@page
+@c ==================================================================
+@node Data Integrity
+@chapter Data Integrity
+
+Mishaps including power outages, OS kernel panics, scripting bugs, and
+command-line typos can harm your data, but precautions can mitigate
+these risks. In scripting scenarios it usually suffices to create
+safe backups of important files at appropriate times. As simple as
+this sounds, care is needed to achieve genuine protection and to
+reduce the costs of backups. Here's a prudent yet frugal way to back
+up a heap file between uses:
+@verbatim
+ $ backup_base=heap_bk_`date +%s`
+ $ cp --reflink=always heap.pma $backup_base.pma
+ $ chmod a-w $backup_base.pma
+ $ sync
+ $ touch $backup_base.done
+ $ chmod a-w $backup_base.done
+ $ sync
+ $ ls -l heap*
+ -rw-rw-r--. 1 me me 4096000 Aug 6 15:53 heap.pma
+ -r--r--r--. 1 me me 0 Aug 6 16:16 heap_bk_1659827771.done
+ -r--r--r--. 1 me me 4096000 Aug 6 16:16 heap_bk_1659827771.pma
+@end verbatim
+@noindent
+Timestamps in backup filenames make it easy to find the most recent
+copy if the heap file is damaged, even if last-mod metadata are
+inadvertently altered.
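The @file{.done} markers pay off at restore time: recovery should consider
only backups whose marker exists. The following POSIX-shell sketch of that
selection logic is not part of the recipe above; it uses hypothetical
fixture files in a scratch directory, and relies on epoch timestamps of
equal width sorting correctly under the shell's lexicographic glob order.

```shell
# demo in a scratch directory: two backups, only the first one verified
dir=$(mktemp -d) && cd "$dir"
touch heap_bk_1659827771.pma heap_bk_1659827771.done  # complete backup
touch heap_bk_1659900000.pma        # no .done marker: the copy may have
                                    # been interrupted, so never restore it
latest=''
for marker in heap_bk_*.done; do    # globs expand in sorted order,
    base=${marker%.done}            # so the last hit is the newest
    [ -f "$base.pma" ] && latest="$base.pma"
done
echo "restore from: ${latest:-none}"
```

Only the marked, fully synced backup is selected; the newer but unverified
heap file is ignored.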
+
+@c TODO: sync individual files above instead of globally (?)
+@c First carefully check what sync does in both cases
+@c using strace, verify that "sync [file]" is correct.
+@c Also check whether non-GNU/Linux offers fine-grained
+@c sync command. Cygwin? Solaris?
+
+The @command{cp} command's @command{--reflink} option reduces both the
+storage footprint of the copy and the time required to make it. Just
+as sparse files provide ``pay as you go'' storage footprints, reflink
+copying offers ``pay as you @emph{change}'' storage
+costs.@footnote{The system call that implements reflink copying is
+described in @command{man ioctl_ficlone}.} A reflink copy shares
+storage with the original file. The file system ensures that
+subsequent changes to either file don't affect the other. Reflink
+copying is not available on all file systems; XFS, BtrFS, and OCFS2
+currently support it.@footnote{The @command{--reflink} option creates
+copies as sparse as the original. If reflink copying is not
+available, @command{--sparse=always} should be used.} Fortunately you
+can install, say, an XFS file system @emph{inside an ordinary file} on
+some other file system, such as @code{ext4}.@footnote{See
+@url{https://www.usenix.org/system/files/login/articles/login_winter19_08_kelly.pdf}}
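Whether reflink copying is available can be probed at run time: request it
explicitly and fall back if the file system refuses. A sketch assuming GNU
coreutils @command{cp}; the @command{--sparse=always} fallback is the one
mentioned in the footnote, and the file names here are illustrative.

```shell
# try a reflink copy first; fall back to a sparse copy on file systems
# (e.g. ext4) that do not support cloning
src=$(mktemp) && echo 'heap payload' > "$src"
if cp --reflink=always "$src" "$src.bk" 2>/dev/null; then
    method=reflink                  # XFS, Btrfs, OCFS2, ...
else
    cp --sparse=always "$src" "$src.bk"
    method=sparse                   # portable fallback
fi
cmp -s "$src" "$src.bk" && echo "copy ok via $method"
rm -f "$src" "$src.bk"
```

Either branch yields a byte-identical copy; only the storage-sharing
behavior differs.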
+
+@c The @command{filefrag} utility reveals how the storage allocated to
+@c the two files changes if they diverge.
+
+After creating a backup copy of the heap file we use @command{sync} to
+force it down to durable media. Otherwise the copy may reside only in
+volatile DRAM memory---the file system's cache---where an OS crash or
+power failure could corrupt it.@footnote{On some OSes @command{sync}
+provides very weak guarantees, but on Linux @command{sync} returns
+only after all file system data are flushed down to durable storage.
+If your @command{sync} is unreliable, write a little C program that
+calls @code{fsync()} to flush a file. To be safe, also call
+@code{fsync()} on every enclosing directory on the file's
+@code{realpath()} up to the root.} After @command{sync}-ing the
+backup we create and @command{sync} a ``success indicator'' file with
+extension @file{.done} to address a nasty corner case: Power may fail
+@emph{while} a backup is being copied from the primary heap file,
+leaving either file, or both, corrupt on storage---a particularly
+worrisome possibility for jobs that run unattended. Upon reboot, each
+@file{.done} file attests that the corresponding backup succeeded,
+making it easy to identify the most recent successful backup.
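A system-wide @command{sync} can be slow on a busy machine because it
flushes every dirty page, not just the backup. As a hedged alternative,
GNU coreutils @command{sync} (version 8.24 and newer) also accepts file
operands and then flushes only those files; the sketch below falls back to
a global @command{sync} elsewhere. Note that it does not flush the
enclosing directories, so the footnote's advice about @code{fsync()}-ing
parent directories still applies to newly created names.

```shell
# flush the named files if this sync(1) supports file operands
# (GNU coreutils >= 8.24); otherwise fall back to a global sync
flush() {
    sync "$@" 2>/dev/null || sync
}
f=$(mktemp) && echo 'backup bytes' > "$f"
flush "$f" && echo "flushed $f"
rm -f "$f"
```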
+
+@c TODO: ".done" -> ".ready" so ls alphabetizes nicely (?)
+
+Finally, if you're serious about tolerating failures you must ``train
+as you would fight'' by testing your hardware/software stack against
+realistic failures. For realistic power-failure testing, see
+@c
+@url{https://cacm.acm.org/magazines/2020/9/246938-is-persistent-memory-persistent/fulltext}
+@c
+@c and
+@url{https://queue.acm.org/detail.cfm?id=3400902}.
+
+@page
+@c ==================================================================
+@node Acknowledgments
+@chapter Acknowledgments
+
+@c UPDATE: make sure nobody is overlooked
+
+Haris Volos, Zi Fan Tan, and Jianan Li developed a persistent @gwk{}
+prototype based on a fork of the @gwk{} source. Advice from @gwk{}
+maintainer Arnold Robbins to me, which I forwarded to them, proved
+very helpful. Robbins moreover implemented, documented, and tested
+@pmg{} for the official version of @gwk{}; along the way he suggested
+numerous improvements for the @code{pma} memory allocator beneath
+@pmg{}. Corinna Vinschen suggested other improvements to @code{pma}
+and tested @pmg{} on Cygwin. Nelson H.@: F.@: Beebe provided access
+to Solaris machines for testing. Robbins, Volos, Li, Tan, Jon
+Bentley, and Hans Boehm reviewed drafts of this user manual and
+provided useful feedback. Bentley suggested the min/max/mean example
+in @ref{Examples}, and also the exercise of making Kernighan & Pike's
+``Markov'' script persistent. Volos provided and tested the advice on
+tuning OS parameters in @ref{Virtual Memory and Big Data}. Stan Park
+answered several questions about virtual memory, file systems, and
+utilities.
+
+@c ==================================================================
+@c ==================================================================
+@c ==================================================================
+
+@node Installation
+@appendix Installation
+
+@c UPDATE below or remove this section if it's obsolete
+
+@gwk{} 5.2 featuring persistent memory is expected to be released in
+August 2022; look for it at @url{http://ftp.gnu.org/gnu/gawk/}. If
+5.2 is not released yet, the master git branch is available at
+@c
+@url{http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-master.tar.gz}.
+@c
+Unpack the tarball, run @command{./configure}, @command{make}, and
+@command{make check}, then try some of the examples presented earlier.
+In the normal course of events, 5.2 and later @gwk{} releases
+featuring @pmg{} will appear in the software package management
+systems of major GNU/Linux distros. Eventually @pmg{} will be
+available in the default @gwk{} on such systems.
+
+@c official gawk:
+@c http://ftp.gnu.org/gnu/gawk/ [where to look for 5.2 after release]
+@c https://www.skeeve.com/gawk/gawk-5.1.62.tar.gz [doesn't support persistent functions]
+@c http://git.savannah.gnu.org/cgit/gawk.git/snapshot/gawk-master.tar.gz [if 5.2 isn't released yet]
+@c http://git.savannah.gnu.org/cgit/gawk.git [ongoing development]
+
+@c ==================================================================
+@node Debugging
+@appendix Debugging
+
+@c TODO: ADR: @cite -> @ref to info file below
+
+For bugs unrelated to persistence, see the @gwk{} documentation,
+e.g., @cite{GAWK: Effective AWK Programming},
+available at @url{https://www.gnu.org/software/gawk/manual/}.
+
+If @pmg{} doesn't behave as you expect, first consider whether you're
+using the heap file that you intend; using the wrong heap file is a
+common mistake. Other fertile sources of bugs for newcomers are the
+fact that a @code{BEGIN} block is executed every time @pmg{} runs,
+which isn't always what you really want, and the fact that built-in
+AWK variables such as @code{NR} are always re-set to zero every time
+the interpreter runs. See the discussion of initialization
+surrounding the min/max/mean script in @ref{Examples}.
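The usual defense is to guard @code{BEGIN}-block setup with a flag that is
itself persistent, so initialization happens only on the first run. A
sketch of the idiom with illustrative variable names: it is shown here
with ordinary @command{awk}, where it behaves like a single first run, but
under @pmg{} with @env{GAWK_PERSIST_FILE} set, @code{seen_before} survives
between invocations and the guard prevents later runs from clobbering
state.

```shell
# guarded initialization: with a persistent heap, "seen_before" is 1 on
# every run after the first, so "total = 0" cannot clobber earlier state
printf '3\n4\n' | awk '
    BEGIN { if (!seen_before) { total = 0; seen_before = 1 } }
    { total += $1 }
    END { print "total:", total }'
# prints "total: 7"
```

Writing @code{total = 0} unconditionally in @code{BEGIN} would discard the
accumulated persistent total on every subsequent run.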
+
+If you suspect a persistence-related bug in @pmg{}, you can set
+an environment variable that will cause its persistent heap module,
+@code{pma}, to emit more verbose error messages; for details see the
+main @gwk{} documentation.
+@c or the @code{pma} documentation at
+@c @url{http://web.eecs.umich.edu/~tpkelly/pma/}.
+
+Programmers: You can re-compile @gwk{} with assertions enabled, which
+will trigger extensive integrity checks within @code{pma}. Ensure
+that @file{pma.c} is compiled @emph{without} the @code{-DNDEBUG} flag
+when @command{make} builds @gwk{}. Run the resulting executable on small
+inputs, because the integrity checks can be very slow. If assertions
+fail, that likely indicates bugs somewhere in @pmg{}. Report such
+bugs to me (Terence Kelly) and also by following the procedures in the
+main @gwk{} documentation. Specify what version of @gwk{} you're
+using, and try to provide a small and simple script that reliably
+reproduces the bug.
+
+@page
+@c ==================================================================
+@node History
+@appendix History
+
+The @pmg{} persistence feature is based on a new persistent memory
+allocator, @code{pma}, whose design is described in
+@url{https://queue.acm.org/detail.cfm?id=3534855}. It is instructive
+to trace the evolutionary paths that led to @code{pma} and @pmg{}.
+
+I wrote many AWK scripts during my dissertation research on Web
+caching twenty years ago, most of which processed log files from Web
+servers and Web caches. Persistent @gwk{} would have made these
+scripts smaller, faster, and easier to write, but at the time I was
+unable even to imagine that @pmg{} was possible. So I wrote a lot of
+bothersome, inefficient code that manually dumped and re-loaded AWK
+script variables to and from text files. A decade would pass before
+my colleagues and I began to connect the dots that make persistent
+scripting possible, and a further decade would pass before @pmg{} came
+together.
+
+Circa 2011 while working at HP Labs I developed a fault-tolerant
+distributed computing platform called ``Ken,'' which contained a
+persistent memory allocator that resembles a simplified @code{pma}: It
+presented a @code{malloc()}-like C interface and it allocated memory
+from a file-backed memory mapping. Experience with Ken convinced me
+that the software abstraction of persistent memory offers important
+attractions compared with the alternatives for managing persistent
+data (e.g., relational databases and key-value stores).
+Unfortunately, Ken's allocator is so deeply intertwined with the rest
+of Ken that it's essentially inseparable; to enjoy the benefits of
+Ken's persistent memory, one must ``buy in'' to a larger and more
+complicated value proposition. Whatever its other virtues might be,
+Ken isn't ideal for showcasing the benefits of persistent memory in
+isolation.
+
+Another entangled aspect of Ken was a crash-tolerance mechanism that,
+in retrospect, can be viewed as a user-space implementation of
+failure-atomic @code{msync()}. The first post-Ken disentanglement
+effort isolated the crash-tolerance mechanism and implemented it in
+the Linux kernel, calling the result ``failure-atomic @code{msync()}''
+(FAMS). FAMS strengthens the semantics of ordinary standard
+@code{msync()} by guaranteeing that the durable state of a
+memory-mapped file always reflects the most recent successful
+@code{msync()} call, even in the presence of failures such as power
+outages and OS or application crashes. The original Linux kernel FAMS
+prototype is described in a paper by Park et al. in EuroSys 2013. My
+colleagues and I subsequently implemented FAMS in several different
+ways including in file systems (FAST 2015) and user-space libraries.
+My most recent FAMS implementation, which leverages the reflink
+copying feature described elsewhere in this manual, is now the
+foundation of a new crash-tolerance feature in the venerable and
+ubiquitous GNU @command{dbm} (@command{gdbm}) database
+(@url{https://queue.acm.org/detail.cfm?id=3487353}).
+
+In recent years my attention has returned to the advantages of
+persistent memory programming, lately a hot topic thanks to the
+commercial availability of byte-addressable non-volatile memory
+hardware (which, confusingly, is nowadays marketed as ``persistent
+memory''). The software abstraction of persistent memory and the
+corresponding programming style, however, are perfectly compatible
+with @emph{conventional} computers---machines with neither
+non-volatile memory nor any other special hardware or software. I
+wrote a few papers making this point, for example
+@url{https://queue.acm.org/detail.cfm?id=3358957}.
+
+In early 2022 I wrote a new stand-alone persistent memory allocator,
+@code{pma}, to make persistent memory programming easy on conventional
+hardware. The @code{pma} interface is compatible with @code{malloc()}
+and, unlike Ken's allocator, @code{pma} is not coupled to a particular
+crash-tolerance mechanism. Using @code{pma} is easy and, at least to
+some, enjoyable.
+
+Ken had been integrated into prototype forks of both the V8 JavaScript
+interpreter and a Scheme interpreter, so it was natural to consider
+whether @code{pma} might similarly enhance an interpreted scripting
+language. GNU AWK was a natural choice because the source code is
+orderly and because @gwk{} has a single primary maintainer with an
+open mind regarding new features.
+
+Jianan Li, Zi Fan Tan, Haris Volos, and I began considering
+persistence for @gwk{} in late 2021. While I was writing @code{pma},
+they prototyped @pmg{} in a fork of the @gwk{} source. Experience
+with the prototype confirmed the expected convenience and efficiency
+benefits of @pmg{}, and by spring 2022 Arnold Robbins was implementing
+persistence in the official version of @gwk{}. The persistence
+feature in official @gwk{} differs slightly from the prototype: The
+former uses an environment variable to pass the heap file name to the
+interpreter whereas the latter uses a mandatory command-line option.
+In many respects, however, the two implementations are similar. A
+description of the prototype, including performance measurements, is
+available at
+@url{http://nvmw.ucsd.edu/nvmw2022-program/nvmw2022-data/nvmw2022-paper35-final_version_your_extended_abstract.pdf}.
+
+@c lessons learned [these are smallish ideas]
+@c compatibility with malloc
+@c make programmer do *nothing*
+@c components (pma) are easier to sell than monoliths (Ken)
+@c open source offers more impact than research
+@c work with colleagues who Think Different from one another
+
+@sp 2
+
+I enjoy several aspects of @pmg{}. It's unobtrusive; as you gain
+familiarity and experience, it fades into the background of your
+scripting. It's simple in both concept and implementation, and more
+importantly it simplifies your scripts; much of its value is measured
+not in the code it enables you to write but rather in the code it lets
+you discard. It's all that I needed for my dissertation research
+twenty years ago, and more. Anecdotally, it appears to inspire
+creativity in early adopters, who have devised uses that @pmg{}'s
+designers never anticipated. I'm curious to see what new purposes
+you find for it.
+
+@c TODO: future enhancements:
+@c pma version string consisting entirely of emojis
+@c benefits: compact, recognizable, lively
+@c problem: representing dates/numbers
+@c solution(?): emojis can probably encode Mayan dates (need to investigate)
+@c problem: do Bayonne, Camden, ..., Newark, etc. have their own emojis?
+@c solution(?): there are many different kinds of p**p emojis
+
+@c ==================================================================
+@c ==================================================================
+@c ==================================================================
+
+@bye
+
+@c ==================================================================
+@c ==================================================================
+@c ==================================================================
+
+@c Arnold Robbins recommendations, distilled from 547 line diff
+@c
+@c @emph -> @dfn for definitions of terms [done]
+@c Use the @dfn command to identify the introductory or
+@c defining use of a technical term. Use the command
+@c only in passages whose purpose is to introduce a
+@c term which will be used again or which the reader
+@c ought to know. Mere passing mention of a term for
+@c the first time does not deserve @dfn.
+@c -> @cite for book titles [done]
+@c Use the @cite command for the name of a book that
+@c lacks a companion Info file. The command produces
+@c italics in the printed manual, and quotation marks
+@c in the Info file. If a book is written in Texinfo,
+@c it is better to use a cross-reference command since
+@c a reader can easily follow such a reference in
+@c Info. See Section 5.4 [@xref], page 44.
+@c
+@c @code -> @command [merely to encourage carpal tunnel syndrome?]
+@c Use the @command command to indicate command names,
+@c such as ls or cc
+@c -> @samp for, um, things like "$" prompt of bash and "unset envar"
+@c Use the @samp command to indicate text that is a
+@c literal example or ``sample'' of a sequence of
+@c characters in a file, string, pattern, etc. ....
+@c Basically, @samp is a catchall for whatever is not
+@c covered by @code, @kbd, @key, @command, etc.
+@c -> @file for /proc/meminfo and pma.c
+@c Use the @file command to indicate text that is the
+@c name of a file, buffer, or directory, or is the name
+@c of a node in Info. You can also use the command for
+@c file name suffixes. Do not use @file for symbols in
+@c a programming language; use @code.
+@c -> @env for envars
+@c Use the @env command to indicate environment
+@c variables, as used by many operating systems,
+@c including GNU. Do not use it for metasyntactic
+@c variables; use @var for those (see the previous
+@c section).
+@c
+@c stuff that is done or otherwise addressed:
+@c
+@c [done, with tweaks]
+@c > @tex
+@c > \gdef\linkcolor{0.5 0.09 0.12} % Dark Red
+@c > \gdef\urlcolor{0.5 0.09 0.12} % Also
+@c > \global\urefurlonlylinktrue
+@c > @end tex
+@c
+@c < easier, improves performance, and enables @code{gawk} to handle larger
+@c ---
+@c > improves performance for certain kinds of I/O-bound workloads, [no, I/O is orthogonal to asymptotic efficiency]
+@c > and enables @command{gawk} to handle larger
+@c
+@c @verbatim -> @kbd and @print [bad idea, extremely error prone]
+@c
+@c "Linux" -> "GNU/Linux" [in some places but not all; done with nuance]
+@c
+@c re-set -> reset [no, i like the former]
+@c
+@c > @c ADR: Define median and mode. Not everyone remembers what they are. [done]
+@c
+@c remove "like the gawk of yesteryear" [done]
+@c
+@c 550c561 [it's not clear what the runtime overhead would be; it might well be negligible]
+@c < ordinary file} on some other file system; for details see my article
+@c ---
+@c > ordinary file} on some other file system (with some runtime overhead); for details see my article
+@c
+@c This expedient makes reflink copying available on popular file systems
+@c such as the @code{ext} family.
+@c @c ADR: You should also mention @samp{cp --sparse} here. [added TODO comment]
+@c
+@c 572,574c584,588 [done]
+@c < Arnold Robbins got @value{ACRO} started by encouraging and supporting
+@c < Haris Volos, Zi Fan Tan, and Jianan Li as they developed a prototype
+@c < based on a fork of the @code{gawk} source. Robbins moreover
+@c ---
+@c > Haris Volos, Zi Fan Tan, and Jianan Li developed a prototype
+@c > based on a fork of the @command{gawk} source. Advice from Arnold Robbins
+@c > (the @command{gawk} maintainer)
+@c > to me, which I forwarded to them, proved very helpful.
+@c > Robbins moreover
+@c
+@c 579c593 [done]
+@c < improvements to @code{pma}. Nelson Beebe provided access to Solaris
+@c ---
+@c > improvements to @code{pma}. Nelson H.@: F.@: Beebe provided access to Solaris
+@c
+@c 582c596 [done]
+@c < Jon Bentley suggested the min/max/mean example in this manual and also
+@c ---
+@c > Jon Bentley suggested the min/max/mean example in @ref{Examples}, and also
+@c
+@c 585,586c599,600 [done]
+@c < provided and tested the advice on tuning OS paging parameters in this
+@c < manual.
+@c ---
+@c > provided and tested the advice on tuning OS paging parameters in
+@c > @ref{System Configuration}.
+@c
+@c 650,651c664,665 [done]
+@c < runs, which isn't always what the programmer really wants, and the
+@c < fact that built-in AWK variables such as @code{NR} are always re-set [I like ``re-set'']
+@c ---
+@c > runs, which isn't always what you really want, and the
+@c > fact that built-in AWK variables such as @code{NR} are always reset
+@c
+@c 678c692,694 [done]
+@c < The new @value{ACRO} persistence feature is based on a new persistent
+@c ---
+@c > @c ADR: Remove "new"; it won't be new for very long, and this
+@c > @c document will be around for a long time.
+@c > The @value{ACRO} persistence feature is based on a persistent
+@c
+@c @code{gdbm} -> GDBM [done, my way]
+@c
+@c 744,747c760,763 [done]
+@c < persistence for @code{gawk} in late 2021. While I was writing
+@c < @code{pma} in early 2022 the others, with advice and encouragement
+@c < from @code{gawk} maintainer Arnold Robbins, prototyped @value{ACRO} in
+@c < a fork of the @code{gawk} source. Experience with the prototype
+@c ---
+@c > persistence for @command{gawk} in late 2021. While I was writing
+@c > @code{pma} in early 2022 the others,
+@c > prototyped @value{ACRO} in
+@c > a fork of the @command{gawk} source. Experience with the prototype
+@c 749,751c765,767
+@c < @value{ACRO}, and by spring 2022 Robbins was implementing persistence
+@c < in the official version of @code{gawk}. The persistence feature in
+@c < official @code{gawk} differs slightly from the prototype: The former
+@c ---
+@c > @value{ACRO}, and by spring 2022 Arnold Robbins was implementing persistence
+@c > in the official version of @command{gawk}. The persistence feature in
+@c > official @command{gawk} differs slightly from the prototype: The former
+@c
+@c 753c769 [done]
+@c < interpreter whereas the latter uses a command-line option. In many
+@c ---
+@c > interpreter whereas the latter uses a mandatory command-line option. In many
+@c
+
http://git.sv.gnu.org/cgit/gawk.git/commit/?id=36c498a173c5c61a0b78a426ecb6c7e834bdb681
commit 36c498a173c5c61a0b78a426ecb6c7e834bdb681
Author: Arnold D. Robbins <arnold@skeeve.com>
Date: Mon Aug 15 18:18:23 2022 +0300
Regenerated doc/gawkinet.info.
diff --git a/doc/gawkinet.info b/doc/gawkinet.info
index 9ec2bfd6..a53a98d8 100644
--- a/doc/gawkinet.info
+++ b/doc/gawkinet.info
@@ -1,4 +1,4 @@
-This is gawkinet.info, produced by makeinfo version 6.7 from
+This is gawkinet.info, produced by makeinfo version 6.8 from
gawkinet.texi.
This is Edition 1.6 of 'TCP/IP Internetworking with 'gawk'', for the
@@ -570,6 +570,7 @@ meaning. If this table is too complicated, focus on the three lines
printed in *bold*. All the examples in *note Networking With 'gawk':
Using Networking, use only the patterns printed in bold letters.
+
PROTOCOL LOCAL HOST NAME REMOTE RESULTING CONNECTION-LEVEL
PORT PORT BEHAVIOR
------------------------------------------------------------------------------
@@ -2433,6 +2434,7 @@ File: gawkinet.info, Node: STATIST, Next: MAZE, Prev: WEBGRAB, Up: Some Appl
-0.3 `------------'------------'------------'------------'
-10 5 0 5 10"