Transcription Rules for JewishGen Databases

  1. Templates
    1. Surnames
    2. Town Names
    3. Dates
    4. Sparse Columns
    5. Source Indicator
  2. Data Entry
    1. Missing Data
    2. Illegible Data
    3. Ditto Fields
    4. Conjectural Information
  3. Grouped Records
  4. Transliteration
  5. The "Other Surnames" and "Other Towns" Columns
  6. Other guides to data transcription

Here are the guidelines for the data to be submitted to a JewishGen searchable database, as per the standard Contributing Databases to JewishGen procedures.

Data sent to the JewishGen database managers should be in a database or spreadsheet format.  dBase (DBF) format is preferred, but just about any standard database or spreadsheet format (rows of records and columns of fields) is acceptable: Microsoft Access, Microsoft Excel (any version), Lotus 1-2-3, Borland/Corel Paradox, or Quattro Pro spreadsheets, etc.

Word processor files are more difficult to work with than spreadsheet or database files, but may be acceptable if the data is in a regular format (one record per line, with each field separated by commas, or tabs, or otherwise delimited).

In all cases, please be sure that each field in your database is clearly labelled, and that a full database description is provided, using the guidelines.


I.   Templates

The manager of each transcription project should create a data entry template to contain the transcribed data.  The template design and data entry instructions should be reviewed by JewishGen before proceeding with data entry.  The template may evolve over time, as you gain experience with transcribing the original records.

Templates for certain standard types of records may be found at http://www.jewishgen.org/databases/templates.

  1. Surnames:

    1. Surnames should be in ALL CAPITAL LETTERS and be in separate fields.  Each "Surname" field should contain only a surname.  Place given names and other items in their own separate fields.  Having the surname in its own field will allow the surnames to be searchable, via a variety of methods.
    2. All other proper nouns (given names, town names, etc.) should be in Mixed Case (i.e. Initial Capital Letters, with all subsequent letters in lower case).
    3. If other surnames appear in other fields (e.g. maiden names, alternate surnames, etc.) — and if you want these surnames to be searchable as surnames, then copy those surnames to an additional field, called "Other Surnames".  This "Other Surnames" field will not appear in the displayed search results, but is used only for database indexing. (See Section V, below).
    4. If there are optional surnames (multiple surnames, such as "SMITH or JONES") in a surname field, then code it as "SMITH / JONES".  (The same applies to the "Other Surnames" field, which can hold multiple surnames, separated by spaces and '/' delimiters).
    5. If a person has no surname in this record (e.g. records that use only patronymics), then indicate this with a dash ("-") character, rather than leaving a surname field empty or using any other indicator.
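
The surname rules above can be sketched as a small clean-up helper. This is only an illustration: the function name `normalize_surname_field` is hypothetical, and treating the word "or" as a separator is an assumption that would mis-split a surname containing "or" as a separate word.

```python
import re

def normalize_surname_field(raw: str) -> str:
    """Hypothetical helper: format a surname cell per the rules above."""
    text = raw.strip()
    if not text:
        return "-"  # no surname in this record
    # Accept "or" or "/" between optional surnames (assumption), emit " / ".
    parts = re.split(r"\s*/\s*|\s+or\s+", text, flags=re.IGNORECASE)
    cleaned = [p.strip().upper() for p in parts if p.strip()]
    return " / ".join(cleaned) or "-"

print(normalize_surname_field("Smith or Jones"))
print(normalize_surname_field(""))
```

A check like this is best run on a copy of the data, so that the transcriber can review any cell it changes.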

  2. Town Names:

    1. Place town names in separate fields.  Each "Town" field should contain only a town name.  Place any qualifying county / district / province / state / country names each in their own separate fields.  Having the town name in its own field will allow the town names to be searchable, via a variety of methods.
    2. Transcribe town names exactly as they are written in the original source document.  For example, in a German-language record, the capital city of Poland would be written as "Warschau".  Do not transform it to "Warszawa" or "Warsaw" — preserve the original.
      If you wish, you may also include the modern native town name, as per the JGFF model — either in an additional separate column, or together in the same column as a conjecture in square brackets, separated by spaces and '/' delimiters, e.g. "Warschau / [Warszawa]".  Our search engines will then be able to pick up both names.
    3. If town names also appear in other fields — and if you want these town names to be searchable as town names, then copy the town names to an additional field, called "Other Towns".  This "Other Towns" field will not appear in the displayed search results, but is used only for database indexing. (See Section V, below).
    4. If there is more than one town for a field (multiple towns, such as 'places of former residence' = "Vilnius and Keidanai"), then encode this field as "Vilnius / Keidanai".  (The same applies to the "Other Towns" field, which can hold multiple town names, separated by spaces and '/' delimiters).
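
To illustrate why the " / " delimiter and the square brackets matter, here is a minimal sketch of how a town field could be split into individual searchable names. The function name `searchable_town_names` is hypothetical and this is not JewishGen's actual indexing code.

```python
def searchable_town_names(field: str) -> list[str]:
    """Hypothetical helper: list each town name found in a Town cell."""
    names = []
    for part in field.split("/"):
        # Square brackets mark a conjecture; the name inside is still searchable.
        name = part.strip().strip("[]").strip()
        if name and name != "-":
            names.append(name)
    return names

print(searchable_town_names("Warschau / [Warszawa]"))
print(searchable_town_names("Vilnius / Keidanai"))
```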

  3. Dates:

    1. If supplying the data in dBase format, all fields must be CHARACTER fields; do not use DATE fields.
    2. If supplying the data in Microsoft Excel format, all fields must be TEXT fields; do not use DATE fields, as Excel cannot handle historical dates (its date type does not cover dates before 1900).
    3. Make sure that all years contain four digits, to avoid ambiguity.
    4. Make sure that the day and month fields are distinguishable.  Europeans and Americans interpret dates differently.  If possible, use "DD-MMM-YYYY" format, using the three-letter English month abbreviation, for example "21-APR-1892".
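
As a sketch of the recommended "DD-MMM-YYYY" text format, the helper below builds the date string from separate day, month and year numbers. The name `format_date` is hypothetical. The month abbreviations are listed explicitly so that the result does not depend on the computer's language settings, and no date type is used, in keeping with the rules above.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def format_date(day: int, month: int, year: int) -> str:
    """Hypothetical helper: return a text date such as '21-APR-1892'."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    return f"{day:02d}-{MONTHS[month - 1]}-{year}"

print(format_date(21, 4, 1892))
```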

  4. Sparse columns: Columns which rarely contain data should be avoided, because they can take up considerable horizontal space when displaying search results.  The more columns a spreadsheet has, the more difficult it is to display the data meaningfully.  Try to have as few columns as is reasonably possible.  Consider combining several sparse columns into a single more generic "Comments" or "Notes" column.

  5. Source Indicator: Every record (i.e. each row) should have some type of source information — column(s) containing an identifier by which a researcher using the database can independently find this record in the original source: A page number, a record number, a line number, etc., or any necessary combination thereof.


II.   Data Entry

All data should be transcribed as faithfully as possible to the original source document, with as little interpretation as possible.  Interpretation is the job of the researcher using the resulting database, not the job of the transcriber.  The data transcriber should write only what is in the original source document.

If the transcriber or editor of the database wishes to add conjectures, interpretation, or editorial comments, these all should be made within square brackets ("[]"), to indicate that these comments are not part of the original source. (See Section II.4, below).

  1. Missing Data: If a data item is missing in the original source, indicate this with a dash or hyphen ("-") character, rather than leaving a blank field or using any other indicator.

  2. Illegible Data: If a data item is illegible or questionable in the original record, transcribe as much as you can, and use the following indicators:

    1. Questionable entries should be followed by a question mark ("?").
    2. If a data item is totally illegible, just place a single question mark in the cell.
    3. Use an ellipsis ("...") to indicate illegible parts of a name.
      For example, write "SM...TH", if you can't determine what the letters are between the "SM" and "TH".
    4. If you can't decide which of two possibilities a partially legible name represents, write both interpretations, separated by a slash and spaces.
      For example, write "STEIN? / STERN?"   or   "PERL? / BERL?".
      Our search engines will then be able to pick up both names.
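
The indicators above are regular enough to be interpreted by a program. As a minimal sketch (not the actual JewishGen search code), the hypothetical function `candidate_pattern` turns a transcribed entry into a pattern that accepts every reading the transcriber left open. It assumes that "..." stands for one or more unknown letters.

```python
import re

def candidate_pattern(entry: str) -> re.Pattern:
    """Hypothetical helper: build a pattern for the names an entry may represent."""
    alternatives = []
    for part in entry.split("/"):
        name = part.strip().rstrip("?").strip()
        if not name:
            continue  # a lone "?" means the item is totally illegible
        # "..." marks illegible letters (assumed: at least one letter).
        pieces = [re.escape(piece) for piece in name.split("...")]
        alternatives.append(".+".join(pieces))
    if not alternatives:
        return re.compile(".*")  # nothing legible, so anything may match
    return re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)

print(bool(candidate_pattern("SM...TH").fullmatch("SMITH")))
print(bool(candidate_pattern("STEIN? / STERN?").fullmatch("STONE")))
```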

  3. Ditto fields: Data which is the same as the previous row must be filled in; you cannot leave any cell blank, because when the data is sorted by a different criterion, the context is lost.  For example:

    Incorrect:

      Year  #  Surname   Given Name
      1847  1  SCHWARTZ  Moshe
            2  KOHEN     Ryfka
            3  LEVIN     Shmuel

    Correct:

      Year  #  Surname   Given Name
      1847  1  SCHWARTZ  Moshe
      1847  2  KOHEN     Ryfka
      1847  3  LEVIN     Shmuel
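
If a spreadsheet was already typed with blank "ditto" cells, they can be filled in mechanically before submission. The sketch below assumes the rows are held as dictionaries keyed by column name; the function name `fill_down` is hypothetical. Apply it only to columns where a blank really means "same as above", never to columns where a blank means the data is missing.

```python
def fill_down(rows, columns):
    """Hypothetical helper: copy the previous row's value into blank cells."""
    filled = []
    previous = {}
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for col in columns:
            value = (new_row.get(col) or "").strip()
            if value:
                previous[col] = value
            elif col in previous:
                new_row[col] = previous[col]
        filled.append(new_row)
    return filled

rows = [
    {"Year": "1847", "#": "1", "Surname": "SCHWARTZ", "Given Name": "Moshe"},
    {"Year": "", "#": "2", "Surname": "KOHEN", "Given Name": "Ryfka"},
    {"Year": "", "#": "3", "Surname": "LEVIN", "Given Name": "Shmuel"},
]
for row in fill_down(rows, ["Year"]):
    print(row["Year"], row["#"], row["Surname"], row["Given Name"])
```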

  4. Conjectural Information: Should always be indicated within square brackets ("[ ]").
    Conjectural information is information which is not in the original source document, but has been conjectured by a database transcriber or editor.

    For example, a conjectural surname for a record which has no surname, but for which the surname has been deduced from other sources, should appear as "[EPSZTEIN]".  Other uses of conjectural data are the expansion of abbreviations, and corrections to items misspelled in the original source (also see Section I.2.2 above).  Any other editorial comments and explanations should also appear within square brackets, to indicate that those items are not in the original record.

  5. Prohibited Characters: Avoid the use of the double-quote character (").
    The inclusion of double-quote characters causes problems with our internal data conversion routines (the procedures which convert data from Excel to dBase format).  Use single quote characters (') instead.

  6. Maximum Field Size: The maximum size of any field is 254 characters.
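
The last two rules lend themselves to an automatic check before a file is submitted. Here is a minimal sketch; the names `clean_field` and `MAX_FIELD_SIZE` are hypothetical, and raising an error for an over-long field (rather than truncating it) is a design assumption, chosen so that no transcribed text is silently lost.

```python
MAX_FIELD_SIZE = 254  # maximum field size stated above

def clean_field(value: str) -> str:
    """Hypothetical helper: swap double quotes and enforce the size limit."""
    cleaned = value.replace('"', "'")
    if len(cleaned) > MAX_FIELD_SIZE:
        raise ValueError(
            f"field has {len(cleaned)} characters; the limit is {MAX_FIELD_SIZE}"
        )
    return cleaned

print(clean_field('Known as "Moshe the tailor"'))
```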


III.   Grouped Records

Some sources, such as Census Records, Czarist Revision Lists, etc., group people together into households or families.  When transcribing data like this, each person in the data should still have their own record (their own row in the spreadsheet), but we can also group the family/household together in the database's results display, if a "Glue" field is used in the spreadsheet to group rows together.

For example, here's an input spreadsheet containing two family groups:

Family #  Surname    Forename    Patronymic  Age  Relation  Birthplace  Gubernia  District  Town     Address           Fond #
4118      LEWIN      Haim        Mowscha     40   head      Jekapils    Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4118      LEWIN      Rocha       Shmuel      38   wife      Jekapils    Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4118      GLEBERMAN  Pesia       Haim        21   daughter  Ludza       Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4119      DORFMANN   Simon       Itzik       28   head      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      DORFMANN   Esther      Abram       25   wife      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      DORFMANN   Gita        Mowscha     50   mother    Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      KAGANSKI   Hana        Mowscha     60   aunt      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      LEWIN      Malka Sura  Rachmiel    30   cousin    Ludza       Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156

Could be displayed as:

Town      Surname, Forename  Patronymic  Age  Relation  Birthplace  Address
District                                                            Fond #
Gubernia

Rezekne   LEWIN, Haim        Mowscha     40   head      Jekapils    Soldatskaya 12-3
Dvinsk    LEWIN, Rocha       Shmuel      38   wife      Jekapils    2706-1-156
Vitebsk   GLEBERMAN, Pesia   Haim        21   daughter  Ludza

Rezekne   DORFMANN, Simon    Itzik       28   head      Rezekne     Ludzenskaya 45-6
Dvinsk    DORFMANN, Esther   Abram       25   wife      Rezekne     2706-1-156
Vitebsk   DORFMANN, Gita     Mowscha     50   mother    Rezekne
          KAGANSKI, Hana     Mowscha     60   aunt      Rezekne
          LEWIN, Malka Sura  Rachmiel    30   cousin    Ludza

Here we are using the "Family #" column as the "glue" field, to glue all members of the household together, for a more attractive and meaningful display of the data.

Note how the common fields (data common to every member of the household/family) are "banded" together, in the row-spanning fields on the left and right.  This redundant data is displayed only once per family/household, in a vertically "stacked" fashion, saving considerable display space.

The "glue" field is also needed to ensure that the entire family group is presented together, when only one member of the family matches the search criteria.  The entire family group (i.e. all rows with the same "glue" field) is displayed even if only one member of the family matches the search criteria.

For example, the above display would result from a search for the surname "LEWIN".  Even when only one member of a family has the surname "LEWIN", the entire family group is displayed, because the "glue" field keeps the entire family together.

The simplest use of a "glue" field is in a marriage record — to tie the bride and groom together.  If the groom and bride are each entered in their own row in the spreadsheet, the use of a "glue" field will ensure that both rows are displayed when a user searches for either one of the parties' surnames.

Also note that the "glue" field is not necessarily a displayed field. (In the example above, the "Family #" is not displayed in the search results screen).  The "glue" field can be a hidden column, which is not displayed in the search results — this column is used only for the internal purpose of creating the database indexes.
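
The behaviour described in this section can be sketched in a few lines. This is a simplified illustration rather than the code JewishGen actually runs: the function name `search_families` is hypothetical, the rows are shortened versions of the example above, and the search is an exact surname match.

```python
from collections import defaultdict

def search_families(rows, surname, glue_field="Family #"):
    """Hypothetical helper: return each glued group with a matching member."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[glue_field]].append(row)  # rows sharing a glue value stay together
    wanted = surname.upper()
    return [
        members
        for members in groups.values()
        if any(member["Surname"].upper() == wanted for member in members)
    ]

rows = [
    {"Family #": "4118", "Surname": "LEWIN", "Forename": "Haim"},
    {"Family #": "4118", "Surname": "LEWIN", "Forename": "Rocha"},
    {"Family #": "4118", "Surname": "GLEBERMAN", "Forename": "Pesia"},
    {"Family #": "4119", "Surname": "DORFMANN", "Forename": "Simon"},
    {"Family #": "4119", "Surname": "KAGANSKI", "Forename": "Hana"},
    {"Family #": "4119", "Surname": "LEWIN", "Forename": "Malka Sura"},
]

# A search for one member's surname returns the whole household.
for family in search_families(rows, "KAGANSKI"):
    print([f'{m["Surname"]}, {m["Forename"]}' for m in family])
```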


IV.   Transliteration

JewishGen has established no universal transliteration standards for data written in non-Latin alphabets (e.g. the Hebrew and Cyrillic alphabets), since each database is different, and there are so many languages, alphabets, dialects, and regional variants across the wide scope of Jewish genealogical data.  Each database is free to use its own transliteration method, as long as it is reasonable.  The introductory remarks for each database should indicate or explain which transliteration method has been used for that database.

Here are some general ideas and guidelines:


V.   The "Other Surnames" and "Other Towns" Columns:

As mentioned above in sections I.1.3 and I.2.3, certain datasets might want to make use of the special hidden columns called "Other Surnames" and/or "Other Towns".  These columns are needed when there are surnames or town names embedded within the text of other columns, and you wish those items to be fully searchable.

The sole purpose of the hidden "Other Surnames" and "Other Towns" columns is database indexing: they tell the database search engine that a particular word is a surname or a town name, so that it can be located when doing a Soundex search.  These columns will not be displayed in the search results.

When a surname or town name is buried within a larger text field (such as a "Comments" field), the database search engine doesn't know that that particular word is a surname or town name.  Copying these words into an "Other Surnames" or "Other Towns" column makes this association explicit.  While a search for "BERNSTEIN" using a global text search would find a record with the word "BERNSTEIN" anywhere within any column, a Soundex search would not.  So if a Soundex search for the surname "BURNSTINE" is done, it wouldn't find "BERNSTEIN" within the context of the "Comments" field.  To enable its Soundex searchability, the word "BERNSTEIN" needs to be copied into an "Other Surnames" column.
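
The BERNSTEIN / BURNSTINE example can be made concrete with a short sketch. The text above does not say which Soundex variant the search engine uses, so the code below uses the classic American Soundex purely as a stand-in; the real search may behave differently. The names `soundex` and `soundex_search` are hypothetical, and the search looks only at the "Surname" and "Other Surnames" columns, as an assumption about which columns are indexed.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Classic American Soundex (a stand-in for the real search code)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

def soundex_search(rows, query):
    """Match the query against the indexed surname columns only."""
    target = soundex(query)
    hits = []
    for row in rows:
        indexed = row.get("Surname", "") + " / " + row.get("Other Surnames", "")
        names = [n.strip() for n in indexed.split("/")]
        if any(n and n != "-" and soundex(n) == target for n in names):
            hits.append(row)
    return hits

without_index = [{"Surname": "KAGAN", "Comments": "Mother nee BERNSTEIN"}]
with_index = [{"Surname": "KAGAN", "Comments": "Mother nee BERNSTEIN",
               "Other Surnames": "BERNSTEIN"}]
print(len(soundex_search(without_index, "BURNSTINE")))
print(len(soundex_search(with_index, "BURNSTINE")))
```

The first search finds nothing because the surname exists only inside the "Comments" text; the second finds the record once the surname has been copied into "Other Surnames".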

The database creator/editor should copy all surnames and town names contained within the text of these other fields into a separate "Other Surnames" or "Other Towns" column.  This action allows those words to be identifiable and fully searchable as surnames and/or town names, respectively.

[Note that if a particular town name is already in another indexed Town column, then you don't really need to copy it into the "Other Towns" column — although it doesn't hurt, it's simply redundant.  For example, if you have a "Town of Birth" column which contains "Minsk", and you also have a "Comments" column that contains the words "Father is a resident of Minsk", then in this instance you really don't need to copy "Minsk" into the "Other Towns" column, because this record already contains "Minsk" in the searchable "Town of Birth" column — a search for "Minsk" would already find this record.  However, it does no harm to have "Minsk" in the "Other Towns" column in this instance.]

There should never be anything in the "Other Surnames" or "Other Towns" columns which doesn't also appear somewhere else in the row.  The "Other Surnames" and "Other Towns" columns are hidden columns, which are not displayed in the search results; they are used only for the purpose of creating database indexes.
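
This last rule is easy to check mechanically. The sketch below reports any entry in a hidden index column that cannot be found elsewhere in the same row. The function name `unmatched_index_entries` is hypothetical, and the comparison is a simple case-insensitive substring test, which is an assumption and may be too lenient for very short names.

```python
def unmatched_index_entries(row, index_column):
    """Hypothetical helper: list index entries that appear nowhere else in the row."""
    entries = [e.strip() for e in row.get(index_column, "").split("/") if e.strip()]
    other_text = " ".join(
        str(value) for column, value in row.items() if column != index_column
    ).upper()
    return [entry for entry in entries if entry.upper() not in other_text]

row = {
    "Surname": "KAGAN",
    "Comments": "Mother's maiden name Bernstein",
    "Other Surnames": "BERNSTEIN / LEVIN",
}
print(unmatched_index_entries(row, "Other Surnames"))
```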


VI.   Other guides to data transcription:


Warren Blatt, Last Revised Jan 13, 2006.