ScrollScraper Technical FAQ for software engineers and others with technical interests


Q: I hate that ScrollScraper was down so long! Please explain the bible.ort.org image naming convention, in case you get hit by a bus or something. :-)
A: In the original version of ScrollScraper, the software traversed the bible.ort.org website and figured out on the fly which verses it needed to include. It therefore wasn't too important to understand precisely the ORT convention for naming the underlying GIF images which contain the Torah readings. Each of these GIF files contains three lines of Hebrew text.

In light of the demise of bible.ort.org in late 2022, it became much more important to understand the ORT convention. What follows is a reverse-engineered interpretation of the filenames, which I believe to be correct: ccvvqxyz.gif where:
  • cc = the chapter
  • vv = the verse. Specifically, it's the LAST verse for which some content is included within this GIF file
  • q = one of 'C', 'F', or 'D'. In a 'C' file, the text of the 'ccvv' verse mentioned above continues in the following file. An 'F' file ends cleanly and does not continue. The rare 'D' files contain the middle of a long verse which began in a previous image and ends in a later image.
  • x = how many verses start on the first of the three lines in this GIF?
  • y = how many verses start on the second of the three lines in this GIF?
  • z = how many verses start on the third of the three lines in this GIF?
Examples:
  • t1/1721F010.gif - Genesis 17. Verse 21 is entirely contained within this file, but the file begins with a section of verse 20. Verse 21 begins somewhere on the second of the three lines of this image.
  • t2/0203C201.gif - Exodus 2. Verse 3 is the last verse which starts in this file. Verses 1 and 2 both begin on the first of the three lines. Verse 3 begins on the last of the three lines and, since this is a 'C' file, continues in the following file.
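
The convention above can be turned into a small parser. What follows is only a minimal Python sketch of the ccvvqxyz.gif pattern as described here, not code taken from ScrollScraper itself. The names parse_ort_name and OrtImageName are hypothetical, and the optional "tN/" prefix is treated simply as a book directory digit (the examples above show t1 for Genesis and t2 for Exodus).

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# ccvvqxyz.gif, optionally preceded by a book directory such as "t2/".
# This pattern is an interpretation of the FAQ text, not an official spec.
_ORT_NAME = re.compile(
    r"^(?:t(?P<book>\d)/)?"
    r"(?P<cc>\d{2})(?P<vv>\d{2})(?P<q>[CFD])"
    r"(?P<x>\d)(?P<y>\d)(?P<z>\d)\.gif$"
)


@dataclass(frozen=True)
class OrtImageName:
    """Hypothetical container for the fields encoded in one filename."""
    book: Optional[int]                     # digit from "tN/", if present
    chapter: int                            # cc
    last_verse: int                         # vv: last verse with content here
    kind: str                               # q: 'C', 'F' or 'D'
    starts_per_line: Tuple[int, int, int]   # (x, y, z)


def parse_ort_name(path: str) -> OrtImageName:
    """Split an ORT image path into the fields described above."""
    m = _ORT_NAME.match(path)
    if m is None:
        raise ValueError(f"not an ORT image name: {path!r}")
    book = m.group("book")
    return OrtImageName(
        book=int(book) if book is not None else None,
        chapter=int(m.group("cc")),
        last_verse=int(m.group("vv")),
        kind=m.group("q"),
        starts_per_line=(int(m.group("x")), int(m.group("y")), int(m.group("z"))),
    )


# The two examples from the list above.
print(parse_ort_name("t1/1721F010.gif"))
print(parse_ort_name("t2/0203C201.gif"))
```

Summing starts_per_line gives the number of verses that begin in the image: 1 for t1/1721F010.gif and 3 for t2/0203C201.gif, which agrees with the two examples.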


Q: Explain the Torah image map
A: From its inception c. 2005 through 2022, ScrollScraper worked by retrieving successive pages from the bible.ort.org website, and then assembling a Torah reading from the associated GIFs. It also optionally examined those GIF images to figure out which sections were dark-blue and which were light-blue, and thereby estimate which sections to hide by shading at the beginning and end of a reading.

Following the demise of bible.ort.org in 2022, ScrollScraper's software had to be modified to operate very differently in order to bring it back online. All of the ORT GIF images are now on the ScrollScraper website, and it makes sense to treat this set of Torah images as a unified whole rather than just examining a small set of images for each ScrollScraper user. Given the amount of effort needed to address the above-mentioned shading problem in this new environment, it made sense to create a dataset and data structures which could not only address the shading problem, but also take advantage of this global knowledge to deliver the most-requested ScrollScraper feature: TrueType fonts on the right side of the page.

The bible.ort.org images are great and must have taken a tremendous amount of work to create, but unfortunately each line of text is only 30 pixels high and 445 pixels wide. So they look quite grainy when you zoom in or print at high resolution.

We've computed a global map for the entire Torah, which knows the start-and-end coordinates of each Torah verse, and also knows about white space between verses, and even within a verse.

Given that information and the lengths of those segments, and which segments belong to which verse, it's not difficult to interpolate how to partition the (TrueType) Hebrew text of each verse among those segments. Then if you place each Hebrew fragment in the same position as its corresponding GIF fragment, you've solved the ScrollScraper TrueType problem. Now the output is as clear as the hardcopy Tikkun sitting on your bookshelf.

For example, consider one image's worth of data (there are 6938 such images comprising the complete Torah), from Exodus:

% grep t2/1601C101 final_outputs/map.csv
t2/1601C101.gif,0,0,137,light,15,26,138,291,NONE,15,26,292,444,dark,15,27
t2/1601C101.gif,1,0,444,dark,15,27
t2/1601C101.gif,2,0,80,dark,15,27,81,105,NONE,0,0,106,444,light,16,1

That's not very human-readable, so let's examine it in a tabular format, with one group of rows for each of the three lines in the image. Note that the coordinate system runs from right (0) to left (444) because we're dealing with Hebrew:

Start-x  End-x  Color  Chapter  Verse
0        137    light  15       26
138      291    NONE
292      444    dark   15       27

0        444    dark   15       27

0        80     dark   15       27
81       105    NONE
106      444    light  16       1
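
Reading the sample rows above, each map.csv line appears to consist of the image name, the line index within the image (0 to 2), and then repeating groups of five fields: start-x, end-x, color, chapter, verse. The following Python sketch parses a row under that assumption. The layout is inferred only from the three rows shown here, and the names Segment and parse_map_row are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One horizontal span on one line of an image (hypothetical type)."""
    start_x: int
    end_x: int
    color: str      # "light", "dark", or "NONE" for white space
    chapter: int
    verse: int

    @property
    def width(self) -> int:
        # Coordinates are inclusive, hence the +1.
        return self.end_x - self.start_x + 1


def parse_map_row(row: str) -> Tuple[str, int, List[Segment]]:
    """Parse one row, assuming: name, line index, then groups of five fields."""
    fields = row.strip().split(",")
    filename, line_no, rest = fields[0], int(fields[1]), fields[2:]
    if len(rest) % 5 != 0:
        raise ValueError(f"unexpected field count in row: {row!r}")
    segments = []
    for i in range(0, len(rest), 5):
        start, end, color, chapter, verse = rest[i:i + 5]
        segments.append(Segment(int(start), int(end), color, int(chapter), int(verse)))
    return filename, line_no, segments


rows = [
    "t2/1601C101.gif,0,0,137,light,15,26,138,291,NONE,15,26,292,444,dark,15,27",
    "t2/1601C101.gif,1,0,444,dark,15,27",
    "t2/1601C101.gif,2,0,80,dark,15,27,81,105,NONE,0,0,106,444,light,16,1",
]

# Collect the widths of every segment belonging to Exodus 15:27.
widths = [
    seg.width
    for row in rows
    for seg in parse_map_row(row)[2]
    if seg.color != "NONE" and (seg.chapter, seg.verse) == (15, 27)
]
print(widths, sum(widths))
```

Note that in the sample the chapter and verse fields of the NONE segments differ (15,26 in one row and 0,0 in another), so the sketch skips NONE segments rather than interpreting those fields.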

We can also view that as a graphic, adjacent to the original ORT gif, as:


From that table, we see that the three segments of verse 27 have lengths 444-292+1=153, 444-0+1=445, and 80-0+1=81, for a total length of 679. If we partition the verse proportionally (and use proportional fonts in the calculation) we wind up with the three segments:

  • וַיָּבֹ֣אוּ אֵילִ֔מָה וְשָׁ֗ם
  • שְׁתֵּ֥ים עֶשְׂרֵ֛ה עֵינֹ֥ת מַ֖יִם וְשִׁבְעִ֣ים תְּמָרִ֑ים וַיַּחֲנוּ־שָׁ֖ם
  • עַל־הַמָּֽיִם
In this case, and in most cases, the automated partitioning divides the text exactly as the scribe of the ORT scroll did. Sometimes the partitioning doesn't exactly match, but since our map is complete, any divergence is localized to a single verse.
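
The proportional partitioning described above can be sketched roughly as follows. This is an illustration only and not ScrollScraper's actual routine: instead of measuring real proportional-font glyph widths, it assumes each word's width is its number of base characters (vowel points and cantillation marks, Unicode category Mn, are ignored) plus one for the space, and it places each break at the word boundary closest to the proportional target. The names approx_width and partition_verse are hypothetical.

```python
import unicodedata
from typing import List


def approx_width(word: str) -> int:
    """Crude stand-in for a glyph-width measurement: count base characters,
    skipping nonspacing marks (Hebrew points and cantillation are 'Mn')."""
    return sum(1 for ch in word if unicodedata.category(ch) != "Mn")


def partition_verse(text: str, seg_lengths: List[int]) -> List[str]:
    """Split text into len(seg_lengths) fragments, breaking only between
    words, so fragment widths are roughly proportional to seg_lengths."""
    words = text.split()
    # Cumulative estimated width after each word (+1 for the space).
    cum = []
    running = 0
    for w in words:
        running += approx_width(w) + 1
        cum.append(running)
    total_w = running
    total_l = sum(seg_lengths)

    parts = []
    start = 0      # index of the first word not yet assigned
    covered = 0    # pixel length of the segments handled so far
    for length in seg_lengths[:-1]:
        covered += length
        target = total_w * covered / total_l
        # Pick the break (number of words consumed) closest to the target.
        best = min(
            range(start, len(words) + 1),
            key=lambda i: abs((cum[i - 1] if i else 0) - target),
        )
        parts.append(" ".join(words[start:best]))
        start = best
    parts.append(" ".join(words[start:]))
    return parts


verse = ("וַיָּבֹ֣אוּ אֵילִ֔מָה וְשָׁ֗ם שְׁתֵּ֥ים עֶשְׂרֵ֛ה עֵינֹ֥ת מַ֖יִם "
         "וְשִׁבְעִ֣ים תְּמָרִ֑ים וַיַּחֲנוּ־שָׁ֖ם עַל־הַמָּֽיִם")
for fragment in partition_verse(verse, [153, 445, 81]):
    print(fragment)
```

A real implementation would presumably substitute measured glyph widths for approx_width; the break-selection logic would stay the same.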

Here's how that verse looks with TrueType fonts. Try zooming in with your web browser, or printing a copy, and compare the grainy left side with the clear right side.



Q: What's the most amazing technical factoid about ScrollScraper?
A: IMHO the most amazing thing is that the "Torah image map" and all of the other resources which are pre-computed prior to running ScrollScraper are derived from only the ORT GIF images, their filenames, and the reverse-engineered file naming convention described above.

There's a special case in the code for the smaller TrueType fonts required for the Shirat Hayam (Song of the Sea) and Deuteronomy 32. There are a handful of hand-curated tweaks for a few verses in Shirat Hayam, which provide adjustments to the aforementioned Torah image map. But that's it.



Q: How can I fiddle with ScrollScraper on my own computer, and make code changes and technical suggestions?
A: ScrollScraper is now a Dockerized application, so all you need is a Docker environment installed on your computer, such as Docker Desktop. Once you've installed Docker and downloaded or git-pulled the ScrollScraper source code repo, you can run docker build -t scrollscraper . to build the ScrollScraper Docker image. The first build will take about half an hour; subsequent rebuilds will be much faster. Then you can docker run that image and docker exec into the resulting Docker container to start experimenting.

Once you've exec'd into the container, you can run: cd /var/opt/scrollscraper; make test-scrollscraper.html





Contact: Jonathan Epstein (jonathanepstein9@gmail.com).   Comments, requests and bug reports welcome.
Last modified 28 July 2023