JGlobe:
Question Answering for Jewish Genealogy Research
Led by: Prof. Zhitomirsky-Geffet, Dr. Elmalech and Dr. Suissa
This project explored two types of question-answering (QA) tasks: factual (i.e., extractive) question answering and numerical aggregative question answering. In both tasks, domain-adapted deep neural network (DNN) pipelines and models outperform state-of-the-art generic pipelines and models. Domain adaptation includes domain-specific data generation (i.e., a data-centric approach), domain-specific training and inference pipelines, and domain-specific algorithms (including fine-tuned models). The results of each experiment show that the genealogy domain is distinctive in its complexity, characteristics, and requirements; thus, dedicated training methods, data modeling, and fine-tuned DNN models are needed.
The project may also have a substantial societal impact, as genealogical centers, museums, and other organizations aim to let users and experts without mathematical or programming training investigate their large genealogical databases. For example, imagine walking into a genealogical center and researching your own dynasty's migration paths by asking questions like "How many people from my family were born in England but died in another country?" A Holocaust researcher might ask, "What is the average marriage age of women in Germany between 1919-1939?" and then compare the answer with the answer to "What is the average marriage age of women in Germany between 1945-1965?" Or imagine finding your own family tree in one of these datasets and talking to your great-grandmother, asking her questions about your family history and heritage. To answer such a variety of questions, these systems should incorporate factual and numerical aggregation QA models that operate on the data stored in GEDCOM files. Furthermore, the practical and scientific implications of these studies for the genealogical domain include research on communities, migration, plagues, and marriage cultures all over the globe.
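To make the idea of a numerical aggregation question concrete, the following is a minimal sketch of the computation that a question such as "What is the average marriage age of women in Germany between 1919-1939?" ultimately reduces to. It is not the project's DNN pipeline: the Person record, its field names, and the average_marriage_age helper are hypothetical, the records are made up, and real GEDCOM data is nested and far less regular.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class Person:
    # Hypothetical flattened record; not the project's actual data model.
    sex: str                          # "F" or "M"
    birth_year: Optional[int]
    marriage_year: Optional[int]
    marriage_country: Optional[str]


def average_marriage_age(people: Iterable[Person], sex: str, country: str,
                         start_year: int, end_year: int) -> Optional[float]:
    """Mean age at marriage for one sex and country, inclusive year range."""
    ages = [
        p.marriage_year - p.birth_year
        for p in people
        if p.sex == sex
        and p.marriage_country == country
        and p.birth_year is not None
        and p.marriage_year is not None
        and start_year <= p.marriage_year <= end_year
    ]
    # Return None instead of raising when no record matches the filters.
    return mean(ages) if ages else None


# Made-up records, for illustration only.
people = [
    Person("F", 1900, 1922, "Germany"),
    Person("F", 1905, 1931, "Germany"),
    Person("M", 1898, 1925, "Germany"),   # skipped: wrong sex
    Person("F", 1925, 1950, "Germany"),   # falls in the second period
    Person("F", None, 1930, "Germany"),   # skipped: unknown birth year
]

before = average_marriage_age(people, "F", "Germany", 1919, 1939)
after = average_marriage_age(people, "F", "Germany", 1945, 1965)
```

Comparing before and after mirrors the researcher's two-question comparison. The hard part that the QA models address is mapping the natural-language question to such filters and to the right aggregation, which this sketch takes as given.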
When using the developed methodologies and models on the Douglas E. Goldman Jewish Genealogy Center dataset in the Anu Museum, we can find exciting trends just by asking the right questions and summarizing the results. The dataset contains 1,847,224 Jewish individuals in 617,669 families, where the oldest person with a valid birth date was born in 1501 and the youngest in 2019 (the year in which the snapshot of this dataset was taken). For example, Table 1 presents the most common occupation of women in the top countries from 1800-1925 (i.e., the countries with the most people in the dataset who were born between 1800 and 1925):
Table 1: The most common occupation of women in the top countries (1800-1925).
Table 2 shows the most common occupation of men in the top countries.
Table 2: The most common occupation of men in the top countries (1800-1925).
As can be observed from Tables 1 and 2, most Jewish women in most countries in the examined period were “housewives”, whereas men's occupations were more diverse, with merchant being the most common.
Figure 1 presents the distribution of people around the globe.
Figure 1: Number of people per country (1800-1925).
When splitting between European and non-European individuals (i.e., not born in Europe), we can find interesting trends. As shown in Figure 3, the life expectancy of people from the top European countries in the dataset drops over time and is at its lowest point for people born in 1900-1925 (and in the Netherlands even sooner). This could be the result of the two world wars that took place between 1914 and 1945 (i.e., people born between 1900 and 1925 were of military age). Moreover, there is significant variance in life expectancy between countries. For example, between 1825 and 1849, people who lived in Denmark reached an average age of 79.875 years, while people who lived in Andorra lived for 58.846 years on average.
Figure 3: Life expectancy in the top European countries (1800-1925).
On the other hand, as can be observed from Figure 4, while life expectancy also decreases over time in non-European countries, there is no dramatic drop like the one in Europe, except in Turkey, Argentina, and Canada.
Figure 4: Life expectancy in the top non-European countries (1800-1925).
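The per-country, per-period averages discussed above can be thought of as a simple group-by over individual records. The sketch below is only an illustration under stated assumptions: the source does not say how life expectancy was computed, so the code assumes it is the mean lifespan of individuals with known birth and death years, grouped by country and by 25-year birth period starting in 1800. The birth_period and mean_lifespan_by_group helpers and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def birth_period(year: int, start: int = 1800, width: int = 25) -> str:
    """Label the 25-year bin containing a birth year, e.g. 1830 -> '1825-1849'."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"


def mean_lifespan_by_group(records):
    """Map (country, birth period) to mean lifespan in years.

    Each record is a (country, birth_year, death_year) tuple. Records with a
    missing year or a death before birth are skipped (assumed invalid).
    """
    groups = defaultdict(list)
    for country, birth_year, death_year in records:
        if birth_year is None or death_year is None or death_year < birth_year:
            continue
        groups[(country, birth_period(birth_year))].append(death_year - birth_year)
    return {key: mean(spans) for key, spans in groups.items()}


# Made-up records, for illustration only.
records = [
    ("Denmark", 1830, 1910),
    ("Denmark", 1840, 1918),
    ("Poland", 1901, 1942),
    ("Poland", None, 1942),   # skipped: unknown birth year
]

result = mean_lifespan_by_group(records)
```

The bin boundaries (1800-1824, 1825-1849, and so on) are a choice made for this sketch; a different binning would shift which individuals fall into each period.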
Moreover, as shown in Figures 5 and 6, contrary to the intuition that people tend to have more spouses over time, the Jewish population in our database shows a trend toward fewer spouses over time in most countries in the examined period, both in Europe and outside it.
Figure 5: Average number of spouses in life in European countries (1800-1925).
Figure 6: Average number of spouses in life in non-European countries (1800-1925).
When comparing births, Figure 7 clearly shows a "baby boom" in Poland from 1850 onward compared to other European countries.
Figure 7: Average number of births in European countries (1800-1925).
The same phenomenon can also be observed in the United States (and somewhat in Israel) compared to other non-European countries in Figure 8. This phenomenon could result from the massive immigration of Jewish people to these countries during these years.
Figure 8: Average number of births in non-European countries (1800-1925).
Sadly, the number of deaths also increases over time, as shown in Figures 9 and 10. This could be due to natural population growth (i.e., the "baby boom" and immigration in the United States and Poland) or to unnatural events such as the two world wars, which could explain the dramatic increase in the number of deaths among people born in 1875-1899 in Poland and Russia, who were of military age during the two world wars.
Figure 9: Average number of deaths in European countries (1800-1925).
Figure 10: Average number of deaths in non-European countries (1800-1925).
Furthermore, Figures 11 and 12 present a global decreasing trend in the number of children in Jewish families. While in the 1800s there was high variance between countries (both European and non-European), at the beginning of the 1900s the variance in the number of children decreased dramatically. For example, between 1825 and 1850, the average number of children in a Jewish family living in the Czech Republic was 3.848, while in the same period, the average number of children in a Jewish family living in Andorra was 8.24.
Figure 11: Average number of children in a family in European countries (1800-1925).
Figure 12: Average number of children in a family in non-European countries (1800-1925).
All the phenomena and trends discovered by using the developed QA system should be further investigated and explained by sociologists and historians. Finally, in addition to the question-answering tasks, the developed end-to-end methodologies can also be applied to other downstream genealogical NLP (Natural Language Processing) tasks, including entity extraction, summarization, and classification.
Publications:
1. Suissa, O., Elmalech, A. and Zhitomirsky-Geffet, M. (2023). Around the GLOBE: Numerical Aggregation Question-Answering on Heterogeneous Genealogical Knowledge Graphs with Deep Neural Networks. ACM Journal of Computing and Cultural Heritage. https://doi.acm.org?doi=3586081
2. Suissa, O., Zhitomirsky-Geffet, M. and Elmalech, A. (2023). Question Answering with Deep Neural Networks for Semi-Structured Heterogeneous Genealogical Knowledge Graphs, Semantic Web Journal, 14(2), 209-237. DOI:10.3233/SW-222925
3. Suissa, O., Elmalech, A. and Zhitomirsky-Geffet, M. (2021). Text Analysis Using Deep Neural Networks in Digital Humanities and Information Science, Journal of the Association for Information Science and Technology, 73(2), 268-287. https://doi.org/10.1002/asi.24544
4. Zhitomirsky-Geffet, M. and Suissa, O. (October, 2023). AI-Based Research Tool for Large Genealogical Corpora: The Case of Jewish Communities Worldwide. In ASIS&T conference, October 31st, 2023, London, UK.