Transcription Rules for JewishGen Databases

  1. Templates
    1. Surnames and Given Names
    2. Town Names
    3. Dates
    4. Sparse Columns
    5. Source Indicator
  2. Data Entry
    1. Missing Data
    2. Illegible Data
    3. Ditto Fields
    4. Conjectural Information
    5. Prohibited Characters
    6. Maximum Field Size
  3. Grouped Records
  4. Transliteration
  5. The "Other Surnames", "Other Givennames" and "Other Towns" Columns
  6. Other guides to data transcription

Here are the guidelines for the data to be submitted to a JewishGen searchable database, as per the standard Contributing Databases to JewishGen procedures.

Data sent to the JewishGen database managers should be in a database or spreadsheet format.  Any standard database or spreadsheet format (rows of records and columns of fields) is acceptable, such as: Microsoft Excel (any version), Microsoft Access, dBase DBF, Lotus 1-2-3, Corel Paradox, or Quattro Pro spreadsheets, etc.

Word processor files are more difficult to work with than spreadsheet or database files, but may be acceptable if the data is in a regular format (one record per line, with each field separated by commas, or tabs, or otherwise delimited).

In all cases, please be sure that each field in your database is clearly labelled, and that a full database description is provided, using the guidelines.


I.   Templates

The manager of each transcription project should create a data entry template to contain the transcribed data.  The template design and data entry instructions should be reviewed by JewishGen before proceeding with data entry.  The template may evolve over time, as you gain experience with transcribing the original records.

Templates for certain standard types of records may be found at http://www.jewishgen.org/databases/templates.

  1. Surnames and Given Names:

    1. Surnames should be in ALL CAPITAL LETTERS and be in separate fields.  Each "Surname" field should contain only a surname.  Place given names and other items in their own separate fields.  Having the surname in its own field will allow the surnames to be searchable, via a variety of methods.
    2. All other proper nouns (given names, town names, etc.) should be in Mixed Case (i.e. Initial Capital Letters, with all subsequent letters in lower case).
    3. If surnames appear in other fields (e.g.: maiden names, alternate surnames, etc.) — and if you want these surnames to be searchable as surnames — then copy those surnames to an additional field, called "Other Surnames".  This "Other Surnames" field will not appear in the displayed search results, but is used only for database indexing. (See Section V, below).
    4. If there are optional surnames (multiple names, such as "SMITH or JONES") in a surname field, then code it as "SMITH / JONES".  (The same applies to the "Other Surnames" field, which can hold multiple names, separated by spaces and '/' delimiters).
    5. If a person has no surname or given name in this record (e.g. records that use only patronymics), then indicate this with a dash ("-") character, rather than leaving a name field empty or using any other indicator.
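
The name rules above can be sketched in a few lines of Python. This is only an illustration, not a JewishGen tool: the function names `format_surname_field` and `format_given_name_field` are hypothetical, and `str.title()` is a rough stand-in for Mixed Case (it also capitalizes the letter after an apostrophe or hyphen, so check the results by eye).

```python
def format_surname_field(names):
    """Build the text for a surname cell from a list of surnames.

    An empty list means the record has no surname.
    """
    cleaned = [n.strip().upper() for n in names if n and n.strip()]
    if not cleaned:
        return "-"               # no surname in this record
    return " / ".join(cleaned)   # optional surnames: "SMITH / JONES"


def format_given_name_field(name):
    """Mixed Case for a given name; a dash when the record has none."""
    name = (name or "").strip()
    return name.title() if name else "-"


print(format_surname_field(["Smith", "Jones"]))  # SMITH / JONES
print(format_surname_field([]))                  # -
print(format_given_name_field("MOSHE"))          # Moshe
```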

  2. Town Names:

    1. Place town names in separate fields.  Each “Town” field should contain only a town name.  Place any qualifying county / district / province / state / country names each in their own separate fields.  Having the town name in its own field will allow the town names to be searchable, via a variety of methods.
    2. Transcribe town names exactly as they are written in the original source document.  For example, in a German-language record, the capital city of Poland would be written as “Warschau”.  Do not transform it to “Warszawa” or “Warsaw” — preserve the original.
      1. If you wish, you may also include the modern native town name, as per the JGFF model — either in an additional separate column, or together in the same column as a conjecture in square brackets, separated by spaces and '/' delimiters, e.g.: “Warschau / [Warszawa]”.  Our search engines will then be able to pick up both names.
    3. If town names also appear in other fields — and if you want these town names to be searchable as town names, then copy the town names to an additional field, called “Other Towns”.  This “Other Towns” field will not appear in the displayed search results, but is used only for database indexing. (See Section V, below).
    4. If there is more than one town for a field (multiple towns, such as 'places of former residence' = “Vilnius and Keidanai”), then encode this field as “Vilnius / Keidanai”.  (The same applies to the “Other Towns” field, which can hold multiple town names, each separated by spaces and '/' delimiters).

  3. Dates:

    1. If supplying the data in dBase format, all fields must be CHARACTER fields; do not use DATE fields.
    2. If supplying the data in Microsoft Excel format, all fields must be TEXT fields; do not use DATE fields, as Excel's date type cannot represent dates before 1900.
    3. Make sure that all years contain four digits, to avoid ambiguity.
    4. Make sure that the day and month fields are distinguishable.  Europeans and Americans interpret dates differently.  If possible, use "DD-MMM-YYYY" format, using the three-letter English month abbreviation, for example "21-APR-1892".
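
As an illustration of the "DD-MMM-YYYY" format, the sketch below builds the date as plain text from separate day, month and year numbers. The function name `format_date` and the month table are assumptions made for this example; the month abbreviations are hard-coded so that the output does not depend on the computer's language settings.

```python
MONTH_ABBREVIATIONS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                       "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def format_date(day, month, year):
    """Return a text date such as '21-APR-1892' (never a spreadsheet DATE value)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if not 1 <= day <= 31:
        raise ValueError(f"day out of range: {day}")
    if not 1000 <= year <= 9999:
        raise ValueError(f"year must have four digits: {year}")
    return f"{day:02d}-{MONTH_ABBREVIATIONS[month - 1]}-{year}"


print(format_date(21, 4, 1892))  # 21-APR-1892
```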

  4. Sparse columns: Columns which rarely contain data should be avoided, because they can take up considerable horizontal space when displaying search results.  The more columns a spreadsheet has, the more difficult it is to display the data meaningfully.  Try to have as few columns as is reasonably possible.  Consider combining several sparse columns into a single more generic "Comments" or "Notes" column.

  5. Source Indicator: Every record (i.e. each row) should have some type of source information — column(s) containing an identifier by which a researcher using the database can independently find this record in the original source: A page number, a record number, a line number, etc., or any necessary combination thereof.


II.   Data Entry

All data should be transcribed as faithfully as possible to the original source document, with as little interpretation as possible.  Interpretation is the job of the researcher using the resulting database, not the job of the transcriber.  The data transcriber should write only what is in the original source document.

If the transcriber or editor of the database wishes to add conjectures, interpretation, or editorial comments, these all should be made within square brackets (“[ ]”), to indicate that these comments are not part of the original source. (See Section II.4, below).

  1. Missing Data: If a data item is missing in the original source, indicate this with a dash or hyphen (“-”) character, rather than leaving a blank field or using any other indicator.

  2. Illegible Data: If a data item is illegible or questionable in the original record, transcribe as much as you can, and use the following indicators:

    1. Questionable entries should be followed by a question mark (“?”).
    2. If a data item is totally illegible, just place a single question mark (“?”) in the cell.
    3. Use an ellipsis (“...”) to indicate illegible parts of a name.
      For example, write “SM...TH”, if you can't determine what the letters are between the “SM” and “TH”.
    4. If you can't decide which of two possibilities a partially legible name represents, write both interpretations, separated by a slash and spaces.
      For example, write “STEIN? / STERN?”   or   “PERL? / BERL?”.
      Our search engines will then be able to pick up both names.
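
To show how these indicators can be read back, here is a small sketch that splits a name cell into the individual names a search could try. How JewishGen's search engines actually parse a cell is not described in this document; the function name `name_candidates` and its behaviour are assumptions made only for illustration.

```python
def name_candidates(cell):
    """Split a cell such as 'STEIN? / STERN?' into its candidate names.

    Trailing question marks (questionable readings) are dropped, and a
    cell holding only '?' (totally illegible) yields no candidates.
    An ellipsis inside a name is kept as written.
    """
    candidates = []
    for part in cell.split("/"):
        name = part.strip().rstrip("?").strip()
        if name:
            candidates.append(name)
    return candidates


print(name_candidates("STEIN? / STERN?"))  # ['STEIN', 'STERN']
print(name_candidates("SM...TH"))          # ['SM...TH']
print(name_candidates("?"))                # []
```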

  3. Ditto fields: Data which is the same as the previous row must be filled in.  You cannot leave any cell blank, or use a ditto mark (") or other indicator, because the context is lost when the data is sorted by a different criterion.  For example:
    Incorrect:

    Year  #  Surname   Given Name
    1847  1  SCHWARTZ  Moshe
          2  KOHEN     Ryfka
          3  LEVIN     Shmuel

    Correct:

    Year  #  Surname   Given Name
    1847  1  SCHWARTZ  Moshe
    1847  2  KOHEN     Ryfka
    1847  3  LEVIN     Shmuel
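
If a draft spreadsheet already contains blank or ditto cells of this kind, the repeated values can be filled in mechanically before submission. The sketch below is one possible way to do it, assuming each row is a Python dictionary; the function name `fill_down` is hypothetical. Apply it only to columns where a blank really meant "same as above", since data that is genuinely missing from the source should be entered as a dash instead.

```python
DITTO_MARKS = {"", '"'}   # a blank cell or a ditto mark


def fill_down(rows, columns):
    """Return copies of rows with blank/ditto cells replaced by the value above."""
    filled = []
    last_seen = {}
    for row in rows:
        new_row = dict(row)
        for col in columns:
            value = str(new_row.get(col, "")).strip()
            if value in DITTO_MARKS and col in last_seen:
                new_row[col] = last_seen[col]   # repeat the previous value
            elif value not in DITTO_MARKS:
                last_seen[col] = value          # remember for the rows below
        filled.append(new_row)
    return filled


rows = [
    {"Year": "1847", "#": "1", "Surname": "SCHWARTZ", "Given Name": "Moshe"},
    {"Year": "",     "#": "2", "Surname": "KOHEN",    "Given Name": "Ryfka"},
    {"Year": '"',    "#": "3", "Surname": "LEVIN",    "Given Name": "Shmuel"},
]
for row in fill_down(rows, ["Year"]):
    print(row["Year"], row["#"], row["Surname"], row["Given Name"])
```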

  4. Conjectural Information: Should always be indicated within square brackets (“[ ]”).
    Conjectural information is information which is not in the original source document, but has been conjectured by a database transcriber or editor.

    For example, a conjectured surname of “EPSZTEIN” for a record which has no surname — but for which this surname has been deduced from other sources — should appear within square brackets, as “[EPSZTEIN]”.

    Other uses of conjectural data are the expansion of abbreviations, corrections to items obviously misspelled in the original source, or the addition of modern town names (see Section I.2.2.1, above).  Any other editorial comments and explanations should also appear within square brackets, to indicate that those items are not in the original record.

  5. Prohibited Characters: Avoid the use of the double-quote character ("), and line breaks.

    1. Double-Quote character: The inclusion of double-quote characters causes problems with our internal data conversion routines (the procedures which convert data from Excel to dBase format).  Use single quote characters (') instead.
    2. Line-Break character: Do not use line breaks in your data entry.  Line breaks are normally added in Excel by holding the ALT key and pressing the ENTER or RETURN key, which results in the contents of a cell spreading over more than one line.

  6. Maximum Field Size: The maximum size of any field is 254 characters.
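
The last two rules lend themselves to an automatic check before a file is submitted. The following is a minimal sketch, not an official JewishGen validator; the function name `check_cell` and the wording of the messages are invented for this example.

```python
MAX_FIELD_SIZE = 254   # maximum number of characters in any field


def check_cell(value):
    """Return a list of problems found in one cell (empty list if none)."""
    problems = []
    if '"' in value:
        problems.append("double-quote character: use a single quote instead")
    if "\n" in value or "\r" in value:
        problems.append("line break inside the cell")
    if len(value) > MAX_FIELD_SIZE:
        problems.append(f"{len(value)} characters: maximum is {MAX_FIELD_SIZE}")
    return problems


print(check_cell("Soldatskaya 12-3"))   # []
print(check_cell('Moshe "Moe"'))        # one problem: double-quote character
```

Running such a check over every cell of the spreadsheet gives a list of cells to correct by hand.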


III.   Grouped Records

Some sources, such as Census Records, Czarist Revision Lists, etc., group people together into households or families.  When transcribing data like this, each person in the data should still have their own record (i.e. their own row in the spreadsheet), but we can also group the family/household together in the database's results display, if a "Glue" field is used in the spreadsheet to group rows together.

For example, here's an input spreadsheet containing two family groups:

Family #  Surname    Forename    Patronymic  Age  Relation  Birthplace  Gubernia  District  Town     Address           Fond #
4118      LEWIN      Haim        Mowscha     40   head      Jekapils    Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4118      LEWIN      Rocha       Shmuel      38   wife      Jekapils    Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4118      GLEBERMAN  Pesia       Haim        21   daughter  Ludza       Vitebsk   Dvinsk    Rezekne  Soldatskaya 12-3  2706-1-156
4119      DORFMANN   Simon       Itzik       28   head      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      DORFMANN   Esther      Abram       25   wife      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      DORFMANN   Gita        Mowscha     50   mother    Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      KAGANSKI   Hana        Mowscha     60   aunt      Rezekne     Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156
4119      LEWIN      Malka Sura  Rachmiel    30   cousin    Ludza       Vitebsk   Dvinsk    Rezekne  Ludzenskaya 45-6  2706-1-156

which could be displayed as:

Town      Surname, Forename  Patronymic  Age  Relation  Birthplace  Address
District                                                            Fond #
Gubernia

Rezekne   LEWIN, Haim        Mowscha     40   head      Jekapils    Soldatskaya 12-3
Dvinsk    LEWIN, Rocha       Shmuel      38   wife      Jekapils    2706-1-156
Vitebsk   GLEBERMAN, Pesia   Haim        21   daughter  Ludza

Rezekne   DORFMANN, Simon    Itzik       28   head      Rezekne     Ludzenskaya 45-6
Dvinsk    DORFMANN, Esther   Abram       25   wife      Rezekne     2706-1-156
Vitebsk   DORFMANN, Gita     Mowscha     50   mother    Rezekne
          KAGANSKI, Hana     Mowscha     60   aunt      Rezekne
          LEWIN, Malka Sura  Rachmiel    30   cousin    Ludza

Here we are using the "Family #" column as the "glue" field, to glue all members of the household together, for a more attractive and meaningful display of the data.

Note how the common fields (data which is common to every member of the household/family) are "banded" together, in the row-spanning fields on the left and right.  This redundant data is displayed only once per family/household, in a vertically "stacked" fashion, thus saving considerable display space.

The "glue" field is also needed to ensure that the entire family group (i.e. all rows with the same "glue" field value) is displayed together, even when only one member of the family matches the search criteria.

For example, the above display would result from a search for the surname "LEWIN".  Even when only one member of a family has the surname "LEWIN", the entire family group is displayed, because the "glue" field keeps the entire family together.

The simplest use of a "glue" field is in a marriage record — to tie the bride and groom together.  If the groom and bride are each entered in their own row in the spreadsheet, the use of a "glue" field will ensure that both rows are displayed when a user searches for either one of the parties' surnames.

Also note that the "glue" field is not necessarily a displayed field. (In the example above, the "Family #" is not displayed in the search results screen).  The "glue" field can be a hidden column, which is not displayed in the search results — this column is used only for the internal purpose of creating the database indexes.
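
The effect of a "glue" field on searching can be sketched as follows. This is an illustration only, not the actual JewishGen search engine: the function name `search_with_glue` is hypothetical, the rows are a shortened copy of the example above, and an exact surname comparison stands in for Soundex matching.

```python
from collections import defaultdict

ROWS = [
    {"Family #": "4118", "Surname": "LEWIN",     "Forename": "Haim"},
    {"Family #": "4118", "Surname": "LEWIN",     "Forename": "Rocha"},
    {"Family #": "4118", "Surname": "GLEBERMAN", "Forename": "Pesia"},
    {"Family #": "4119", "Surname": "DORFMANN",  "Forename": "Simon"},
    {"Family #": "4119", "Surname": "KAGANSKI",  "Forename": "Hana"},
    {"Family #": "4119", "Surname": "LEWIN",     "Forename": "Malka Sura"},
]


def search_with_glue(rows, glue_field, surname):
    """Return every group of rows in which at least one member matches."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[glue_field]].append(row)   # rows sharing a glue value
    return [members for members in groups.values()
            if any(m["Surname"] == surname for m in members)]


# A search for KAGANSKI returns the whole 4119 household, not just one row.
for family in search_with_glue(ROWS, "Family #", "KAGANSKI"):
    for member in family:
        print(f"{member['Surname']}, {member['Forename']}")
```

The glue column is used only to form the groups; nothing in the sketch requires it to be shown in the output.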


IV.   Transliteration

JewishGen has established no universal transliteration standards for data written in non-Latin alphabets (e.g. Hebrew, Cyrillic), since each database is different, and there are so many languages, alphabets, dialects, and regional variants across the wide scope of Jewish genealogical data.  Each database is free to use its own transliteration methods, as long as they are reasonable.  The introductory remarks for each database should indicate or explain which transliteration method has been used for that database.

Here are some general ideas and guidelines:

  • Reflect the original: The transliteration should reflect the original document, to the degree possible.  Names should not be 'standardized'; they should be entered exactly as written on the original document.  For example: 'Movsha', 'Moishe', etc., should not become 'Moshe'; and should certainly never be 'translated' or 'transformed' to 'Moses'.

  • Pronunciation should reflect local use, e.g. distinctions between Litvak and Galitzianer pronunciations can be retained.

  • Soundex: Since Daitch-Mokotoff Soundex searching will find most evident name variations, we needn't worry excessively about standard transliteration of Cyrillic-to-English, Yiddish-to-English, or Hebrew-to-English. 

  • Transliteration Guides: For Yiddish (Hebrew letters), you can use the YIVO Romanization Standard, but the Library of Congress Standard and others are acceptable as well.
    For Russian (Cyrillic letters) into English (Latin letters), you can use the tables in the Wikipedia article "Romanization of Russian", especially the table BGN/PCGN Romanization of Russian.

  • Cyrillic: Transliteration from Cyrillic to Latin characters should reflect the local language, if that local language uses the Latin alphabet.  For example, civil records in the Kingdom of Poland (Congress Poland) after 1868 were written in Cyrillic, and should be transliterated into Polish spelling rather than English spelling (as JRI-Poland does).  Where the local language does not use the Latin alphabet (e.g. Belarus, Ukraine), Cyrillic should be transliterated into English phonetics.

  • If your original source data is in Russian (Cyrillic alphabet), you may do your data entry directly in Cyrillic, if you are more comfortable in that language, and have the appropriate keyboard.  We have Excel macros that can transliterate data in Cyrillic into the Latin alphabet.

  • Retain the Original: If possible, data in Latin characters should be transcribed in the original language (i.e., leave occupations written in German in German), rather than translated; and then provide a separate table of translations.  It is always best to keep the transcript as close to the original as possible, without any interpretation — and let the end-users of the database do that interpretation.


V.   The “Other Surnames”, “Other Givennames” and “Other Towns” Columns:

As mentioned above in sections I.1.3 and I.2.3, certain datasets might want to make use of the special hidden columns called “Other Surnames”, “Other Givennames” and/or “Other Towns”.  These columns are needed when there are surnames, givennames or town names embedded within the text of other columns, and you wish those items to be fully searchable.

  • Example #1: If you have a column entitled "Survived by" which contains "his daughter Mollie SMITH, and his brother Robert BERNSTEIN", and you want the surnames SMITH and BERNSTEIN to be searchable as surnames, then you will need to copy those surnames into a separate column, called "Other Surnames".  In this case, the "Other Surnames" column should contain "SMITH / BERNSTEIN". Similarly copy those givennames into a separate column, called "Other Givennames".  In this case, the "Other Givennames" column should contain "Mollie / Robert".

  • Example #2: If you have a "Comments" column which contains miscellaneous information, such as "Father was born in Minsk, is currently residing in Pinsk, and working in Linsk", and you want the town names Minsk, Pinsk and Linsk to be searchable as town names, then you will need to copy those town names into a separate column, called "Other Towns".  In this case, the "Other Towns" column should contain "Minsk / Pinsk / Linsk".
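
Because surnames are entered in ALL CAPITAL LETTERS, a first draft of the "Other Surnames" column can be produced by picking the all-capital words out of a text column. The sketch below is one possible approach under that assumption; the function name `extract_capitalized_surnames` is hypothetical, the pattern only recognizes the unaccented letters A to Z, and any capitalized abbreviation in the text would be picked up too, so the result should always be reviewed by hand.

```python
import re


def extract_capitalized_surnames(text):
    """Return the ALL-CAPS words found in text, in order, without repeats."""
    found = []
    for word in re.findall(r"\b[A-Z]{2,}\b", text):
        if word not in found:
            found.append(word)
    return found


survived_by = "his daughter Mollie SMITH, and his brother Robert BERNSTEIN"
print(" / ".join(extract_capitalized_surnames(survived_by)))  # SMITH / BERNSTEIN
```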

The sole purpose of the hidden "Other Surnames", "Other Givennames" and "Other Towns" columns is for database indexing only — so that the database search engine will know that a particular word is a surname, givenname or town name, and thus can locate it when doing a Soundex or Phonetic search.  These hidden columns will not be displayed in the search results.

When a surname, givenname or town name is buried within a larger text field (such as a "Comments" field), the database search engine doesn't know that that particular word is a surname, givenname or town name.  Copying these words into an "Other Surnames", "Other Givennames" or "Other Towns" column makes this association explicit.  While a search for "BERNSTEIN" using a global text search would find a record with the word "BERNSTEIN" anywhere within any column, a Soundex or Phonetic search would not.  So if a Soundex or Phonetic search is made for the surname "BURNSTINE", it wouldn't find "BERNSTEIN" within the context of the "Comments" field.  To enable its Soundex searchability, the word "BERNSTEIN" needs to be copied into an "Other Surnames" column.

The database creator/editor should copy all surnames, givennames and town names contained within the text of these other fields into a separate "Other Surnames", "Other Givennames" or "Other Towns" column.  This action allows those words to be identifiable and fully searchable as surnames, givennames and/or town names, respectively.

[Note that if a particular town name is already in another indexed Town column, then you don't really need to copy it into the "Other Towns" column — although it doesn't hurt, it's simply redundant.  For example, if you have a "Town of Birth" column which contains "Minsk", and you also have a "Comments" column that contains the words "Father is a resident of Minsk", then in this instance you really don't need to copy "Minsk" into the "Other Towns" column, because this record already contains "Minsk" in the searchable "Town of Birth" column — a search for "Minsk" would already find this record.  However, it does no harm to have "Minsk" in the "Other Towns" column in this instance.]

There should never be anything in the "Other Surnames", "Other Givennames" or "Other Towns" columns which doesn't also appear somewhere else in the row.  These are hidden columns, which are not displayed in the search results; they are used only for the purpose of creating database indexes.
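
The rule that every hidden-column entry must also appear elsewhere in the row can be checked mechanically. The following is a rough sketch, assuming each row is a Python dictionary keyed by column name; the function name `unmatched_index_terms` is hypothetical, and the comparison is a simple case-insensitive substring test.

```python
def unmatched_index_terms(row, index_column):
    """Return entries of a hidden index column that appear nowhere else in the row."""
    raw = row.get(index_column, "")
    terms = [t.strip() for t in raw.split("/")]
    terms = [t for t in terms if t and t != "-"]
    # Text of all the other columns, lower-cased for comparison.
    other_text = " ".join(
        str(value) for column, value in row.items() if column != index_column
    ).lower()
    return [t for t in terms if t.lower() not in other_text]


row = {
    "Comments": "Father was born in Minsk, is currently residing in Pinsk, "
                "and working in Linsk",
    "Other Towns": "Minsk / Pinsk / Dvinsk",
}
print(unmatched_index_terms(row, "Other Towns"))  # ['Dvinsk']
```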


VI.   Other guides to data transcription:

Excel templates for some common types of records (e.g.: Czarist vital records and revision lists, cemetery records) have been created, so there is no need to re-invent them.  There are instructions and examples included with each template.  The "JewishGen Database Templates" can be found at http://www.jewishgen.org/databases/templates.


Warren Blatt, Last Revised Jun 18, 2012.
Edmond J. Safra Plaza | 36 Battery Place | New York, NY 10280
646.437.4326 | info@jewishgen.org | © 2014, JewishGen. All rights reserved.