Dataset Card for Prozhito-DB

Dataset Summary

The Prozhito-DB dataset contains text and metadata from 2,748 transcribed Russian and Ukrainian diaries. Each of the 620,892 diary entries contains the transcribed text, date, and other information about the author. The earliest entry is from 1682, and the latest is from 2020. The data was gathered from the Prozhito website API in September 2022.

About Prozhito (Original in Russian)

The “Prozhito” project was started in 2015 as a platform for collecting and publishing texts of personal diaries in Russian and Ukrainian. At first, it existed as a volunteer initiative. In 2018, the project participants registered the non-profit foundation “Prozhito,” and in 2019, the Center for the Study of Ego-Documents at the European University in St. Petersburg was founded.

The Center develops the corpus of personal diaries “Prozhito” (http://prozhito.org/), an electronic library of dated personal records that allows users to work not only with specific diaries but also with the entire set of texts of the era: get samples by date, gender, age, place of keeping a diary, etc. The corpus includes texts in Russian and Ukrainian. More than 6,000 authors have been annotated, diaries of 2,000 of them have been uploaded, and about 500 diaries are the first publications of the project. The total volume of the corpus is more than half a million daily records from the 18th to 20th centuries.

Our corpus is a citizen science project. Over 1,050 people have participated in our work of searching, copying, transcribing, and publishing handwritten diaries since “Prozhito” began. Twice a year, student interns from several Russian universities join the project volunteers.

In addition, we are working on creating a digital archive of the EUSP “Prozhito” center, which accepts digital copies of sources and electronic texts for storage. The contract regulates access to the text, conditions, and terms of its possible publication. Our task is to create a digital "People's Archive" that is not limited by the volume of a long archival shelf - we are ready to work with documents from family archives without regard to the age, gender, or social trajectory of their authors.

The center works in partnership with research centers, libraries, museums, and archives, and is also part of the EDAC (European Diary Archives and Collections) community.

Languages

id	Language	entry count
1	Russian	610526
2	Ukrainian	6881
3	English	3430
5	Kazakh	55

Genders

id	Gender	diary authors	entries	people mentioned
1	Male	1958	506079	7478
0	Female	727	114813	2187
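
The id columns in the two tables above are the integer codes stored in each entry's language and gender fields. As a minimal sketch, assuming entries are handled as plain dictionaries, the codes can be decoded like this (LANGUAGES, GENDERS, and label_entry are illustrative names, not part of the dataset or its scripts):

```python
# Integer codes copied from the Languages and Genders tables above.
LANGUAGES = {1: "Russian", 2: "Ukrainian", 3: "English", 5: "Kazakh"}
GENDERS = {0: "Female", 1: "Male"}


def label_entry(entry: dict) -> dict:
    """Return a copy of an entry with readable language and gender labels.

    Codes that are missing from the tables (including None) map to "Unknown".
    """
    labeled = dict(entry)
    labeled["language_label"] = LANGUAGES.get(entry.get("language"), "Unknown")
    labeled["gender_label"] = GENDERS.get(entry.get("gender"), "Unknown")
    return labeled


example = {"id": 197227, "language": 1, "gender": 0}
labeled = label_entry(example)
print(labeled["language_label"], labeled["gender_label"])
```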

Dataset Structure

The dataset contains individual diary entries. For example:

{'id': 197227,
'date': '1909-05-03',
'julian_date': '1909-04-20',
'is_julian': True,
'text': '<p>Что буду я писать?.. Пасха — собрались дети... Потом заболели Юра и Коля. — Что у них было — скарлатина ли, по Филатову,<com id="14814529261481"/> или иное что, только с Юрой прошли тяжелые две недели, и я приходила в отчаяние. — Теперь, быть может, опасность миновала, — а я упала духом, я опять не гожусь для жизни — я опять раздражаюсь, плачу и отчаиваюсь. — Это одно и то же — всё старое, больное — крест моей жизни. — Слишком тяжело, Господи...</p>',
'person': 135,
'first': 'Софья',
'last': 'Дрыжакова',
'patronymic': 'Васильевна',
'gender': 0,
'dob': '1872-01-01',
'dod': '1943-01-01',
'diary': 142,
'diary_first': '1900-06-22',
'diary_last': '1943-09-14',
'diary_total': 660,
'language': 1,
'tags': [],
'mentioned_people': [],
'comments': [{'id': '32209',
  'number': 14814529261481,
  'type': '1',
  'text': '8 Н. Ф. Филатов (1847—1902) — врач-педиатр, автор трудов по диагностике детских болезней.',
  'user': '14',
  'createdDate': '1481452926'}]
}
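
In the example above, date and julian_date differ by 13 days, which is the gap between the two calendars in the early 1900s. The sketch below shows one way to check that relationship with the standard century rule for the Julian to Gregorian offset. It is an illustration only and not how Prozhito or clean.py computes dates; julian_offset_days and julian_to_gregorian are hypothetical helper names.

```python
from datetime import date, timedelta


def julian_offset_days(julian: date) -> int:
    """Days to add to a Julian calendar date to reach the Gregorian date.

    The Julian date is passed as a datetime.date used only as a
    (year, month, day) label, so a Julian 29 February in years such as
    1700, 1800, or 1900 cannot be represented here.
    """
    # January and February count with the previous year, because the gap
    # only grows after the extra Julian leap day at the end of February.
    year = julian.year if julian.month > 2 else julian.year - 1
    return year // 100 - year // 400 - 2


def julian_to_gregorian(julian: date) -> date:
    """Shift a Julian (year, month, day) label to the Gregorian calendar."""
    return julian + timedelta(days=julian_offset_days(julian))


entry = {"date": "1909-05-03", "julian_date": "1909-04-20", "is_julian": True}
julian = date.fromisoformat(entry["julian_date"])
gregorian = date.fromisoformat(entry["date"])

print((gregorian - julian).days)                 # 13
print(julian_to_gregorian(julian) == gregorian)  # True
```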

Data Fields

id: int A unique id for the entry
date: date | None The date when the diary entry was written. Prozhito normalized all entries to have a date using the Gregorian calendar.
is_julian: bool = None Before 1918, the Julian calendar was often used. This field notes if the diary recorded dates using the Julian calendar.
julian_date: date | None If the entry has a Julian date, it is recorded here.
text: str = None The text of the diary entry. Note that the text contains HTML tags for paragraphs (<p>), persons (<person>), and other information. These tags can be removed with textpipe, BeautifulSoup, and other libraries if needed.
person: int The unique id for the diary's author. Normalized person records can be found in the clean_people.jsonl file. Person records are also available from: https://prozhito.org/person/
first: str = None The entry author's first name
last: str = None The entry author's family name
patronymic: str = None The entry author's patronymic (father's name)
gender: int = None The entry author's recorded gender. Female is recorded as 0. Male as 1.
dob: date | None The entry author's date of birth

In the original data from Prozhito, there are two fields, birthDay and birthDay2. The first is an unstructured date string such as 12 января 1915. The second is a machine-readable ISO date string. Most birthDay fields have information, while birthDay2 is usually '0000-00-00' (no date information). To capture the information in the unstructured strings, the clean.py script converts birthDay into a valid ISO date. When possible, partial information was retained. For example, Октябрь 1935 becomes 1935-10-01. When a specific month or day is missing, I chose to enter 01. This choice kept as much information as possible to facilitate computation. Incomplete dates were dropped. Full details of the processing can be found in string_to_date() (clean.py line 62).
dod: date | None The entry author's date of death

dod was converted using the same process as dob detailed above.
diary: int The unique id for the diary containing this diary entry. Normalized diary records can be found in the clean_diaries.jsonl file.
diary_first: date | None Date of the first entry in the entry's diary.
diary_last: date | None Date of the last entry in the entry's diary.
diary_total: int = None The total number of entries in the diary.
language: int = None The primary language that the diary is written in.
tags: list Tags, typically place names, associated with the entry by the Prozhito annotators. These tags can be used to filter the entries.
mentioned_people: list An annotation by a Prozhito member, with information on persons mentioned in the entry. For example: {'note': '692310', 'mentioned_people': '66278', 'id': '66278', 'firstName': 'Михаил', 'lastName': 'Цетлин', 'thirdName': 'Осипович', 'nickname': 'Амари, Цейтлин, Амар', 'sex': '2', 'birthDay': '1882.07.10', 'deathDay': '1945.10.10', 'comment': 'русский поэт, беллетрист, редактор, меценат;', 'parent_id': None} The id refers to a person record in the clean_people.jsonl file or the website.
comments: list Notes added by the annotators to provide context or relevant information about the entry and its contents. For example: {'id': '170888', 'number': 15662420521425, 'type': '1', 'text': 'Рабат – стоянка караванов.', 'user': '14', 'createdDate': '1566242052'}
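
As noted for the text field, entry text carries inline markup such as <p>, <person>, and <com id="..."/>. The sketch below is one dependency-free way to strip it using Python's standard html.parser module, offered as an alternative to the libraries mentioned above. The strip_markup function is a hypothetical helper and is not part of clean.py.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content and turn paragraph ends into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Keep paragraph boundaries visible in the plain-text result.
        if tag == "p":
            self.parts.append("\n")


def strip_markup(text):
    """Return entry text without tags; None or empty input gives ''."""
    if not text:
        return ""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<p>по Филатову,<com id="14814529261481"/> или иное что</p>'
print(strip_markup(sample))
```

Comment anchors like <com id="..."/> are dropped here. If the link between an anchor and its entry in comments matters for your analysis, keep the raw text alongside the stripped version.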

Loading the dataset

Currently, Hugging Face Datasets requires that splits be named test, train, and validation. The dataset's languages are therefore assigned to those split names as detailed below:

from datasets import load_dataset
dataset = load_dataset('ajanco/prozhito-db')

# Russian entries
russian = dataset["test"]

# Ukrainian entries 
ukrainian = dataset["train"]

# English entries 
english = dataset["validation"]

Example Usage

# Print all entries in Ukrainian (the "train" split, per the mapping above)
for ukr_entry in dataset["train"]:
  print(ukr_entry["text"])

spaCy

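import spacy
# Requires the Ukrainian model: python -m spacy download uk_core_news_sm
nlp = spacy.load("uk_core_news_sm")
# Ukrainian entries are in the "train" split; skip entries with no text
texts = (entry["text"] for entry in dataset["train"] if entry["text"])
docs = nlp.pipe(texts)
for doc in docs:
  ...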

Pandas example

# Create a pandas dataframe with the Russian entries
import pandas as pd
df = pd.DataFrame([entry for entry in dataset["test"]])

Polars example

# Create a polars dataframe with English entries
import polars as pl
df = pl.DataFrame([entry for entry in dataset["validation"]])

Dataset Creation

Curation Rationale

The Prozhito project and its website provide a unique and significant resource for historians and the public. This dataset was created to facilitate research of the Prozhito collection using computational and quantitative methods. Working with the collection as a dataset facilitates the exploration and analysis of materials at scale. Given this collection's size and temporal range, this is a particularly important capability with great potential for research in many fields.

Source Data

Initial Data Collection and Normalization

The data was collected from the Prozhito website's backend API. These are the same endpoints that serve the website.

Person records were gathered from: https://prozhito.org/api/persons/lang/{person_id}/1 for person_id in range(1,10000)
Diary records from: https://prozhito.org/api/notes/search?search_type=diaries&diaries=[{diary_id}] for diary_id in range(1,10000)
Entry records from: https://prozhito.org/api/notes/{entry_id} for entry_id in range(1,750000)
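
The id ranges above amount to an exhaustive scan: request every id in a range, keep the valid responses, and note the ids that returned nothing. The sketch below illustrates that idea under stated assumptions and is not the script that was actually used. scan_ids is a hypothetical name, and fetch stands in for whatever function performs the HTTP request and returns parsed JSON, or None when an id has no record. A dictionary replaces the real API so the example runs offline.

```python
from typing import Callable, Optional


def scan_ids(fetch: Callable[[int], Optional[dict]], start: int, stop: int):
    """Request every id in range(start, stop) and separate hits from misses.

    Returns (records, invalid_ids, max_valid_id). max_valid_id is None
    when no id in the range produced a record.
    """
    records = {}
    invalid_ids = []
    for record_id in range(start, stop):
        record = fetch(record_id)
        if record is None:
            invalid_ids.append(record_id)
        else:
            records[record_id] = record
    max_valid_id = max(records) if records else None
    return records, invalid_ids, max_valid_id


# Stand-in for the real API: only ids 1, 2, and 5 have a record.
fake_api = {1: {"id": 1}, 2: {"id": 2}, 5: {"id": 5}}
records, invalid_ids, max_valid_id = scan_ids(fake_api.get, 1, 10)
print(max_valid_id, invalid_ids)
```

A long unbroken run of invalid ids after max_valid_id is what supports the claim that the chosen range covered the collection.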

I first identified the range of valid values for the three entities of interest: persons, diaries, and entries. For example, the highest person id was in the upper 9000s. A script sent requests to all values between 1 and 10000. I recorded the valid JSON responses for all ids with a record. All invalid ids were recorded as well, making it possible to identify the maximum valid id value for the collection. Identifying all valid and invalid values within a range gives relative confidence that all records were collected. IDs outside the range are possible but unlikely.
The JSON response for each person, diary, and entry record was saved to disk.
The raw collected JSON files are available here:
- Entries (1.2G): https://upenn.box.com/shared/static/t0ewh4mvy2me7lyxfkwc960unpnlmw5c.zip
- People (18M): https://upenn.box.com/shared/static/pga2k9xxh97qywejgb0d3ystpqwgz1ny.zip
- Diaries (80M): https://upenn.box.com/shared/static/f9ecnwn3ef9pqx2rateso32ptqxkf39a.zip
Given that a Hugging Face dataset only allows for one data type, I created a single Entry model that combines each entry with relevant data from its Person and Diary records.
The original Prozhito data has a system for handling date ambiguity. Records without a date appear as '0000-00-00'. If a year is known, but not the month or day, it would appear as '1894-00-00'. This format is not compatible with ISO standards. To facilitate time deltas and other computational processes with dates, '0000-00-00' was normalized to None. Nine entries with incomplete dates were removed from the dataset (672154, 658453, 687085, 687084, 687080, 658451, 687083, 687058, 658490).
The clean.py script used to normalize the data is included in this repository. The create.py script was used to push the dataset to the Hugging Face Hub.
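
The handling of entry dates described above can be sketched as follows. This is an illustration of the stated rules, not the code in clean.py: normalize_entry_date is a hypothetical name, and raising ValueError for partially known dates is an assumption made here so that a caller can decide to drop the record.

```python
from datetime import date
from typing import Optional


def normalize_entry_date(raw: Optional[str]) -> Optional[date]:
    """Convert a Prozhito-style 'YYYY-MM-DD' string to a date.

    '0000-00-00' (no date information) and None become None.
    Partially known dates such as '1894-00-00' raise ValueError.
    """
    if raw is None or raw == "0000-00-00":
        return None
    year, month, day = (int(part) for part in raw.split("-"))
    if month == 0 or day == 0:
        raise ValueError(f"incomplete date: {raw}")
    return date(year, month, day)


print(normalize_entry_date("1909-05-03"))  # 1909-05-03
print(normalize_entry_date("0000-00-00"))  # None
```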

Personal and Sensitive Information

All of this dataset's personal and biographical information is publicly available without restriction on the prozhito.org website. The dataset reflects the policies and decisions made by Prozhito to publish or not publish a diary. No review of materials took place during the creation of this dataset. It was only normalized to facilitate computational analysis. Details on Prozhito's work with volunteers as well as scanning and transcription processes are available. If you discover information in the dataset that might cause harm or violates copyright, please reach out to the point of contact listed above.

Considerations for Using the Data

Discussion of Biases

The chronological distribution of the entries is highly uneven, reflecting historical conditions, diary writing practices, texts' availability, and the Prozhito community's work. There is a significant number of texts, 17% of the dataset, written during World War II (Great Patriotic War). Texts written before 1900 comprise a relatively small portion of the dataset (11%).
Gender attributions for diaries are added by the Prozhito contributors using a binary system for male and female. These attributions may or may not reflect an author's gender identities as reflected in the text or elsewhere. Additionally, the system for noting gender in the dataset cannot account for changing gender identities or situational presentations of self.
Language attribution reflects the choices made by the Prozhito contributors using a single language code. This system does not capture bi- or multi-lingual diaries or occasional borrowings or citations outside the primary attributed language. Language detection is possible for individual entries using langdetect.
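
To inspect the chronological imbalance described above on your own subset, the share of entries in a date range can be computed directly from the date field. This is a minimal sketch: share_in_range is a hypothetical helper, the sample entries are made up for illustration, and the date boundaries are an assumption, since the card does not say which boundaries were used for its wartime figure.

```python
from datetime import date


def share_in_range(entries, start: date, end: date) -> float:
    """Fraction of dated entries whose Gregorian date is in [start, end].

    Entries without a date are ignored. Values may be ISO strings or
    date/datetime objects; only the leading YYYY-MM-DD part is used.
    """
    dated = [
        date.fromisoformat(str(entry["date"])[:10])
        for entry in entries
        if entry.get("date")
    ]
    if not dated:
        return 0.0
    hits = sum(start <= day <= end for day in dated)
    return hits / len(dated)


# Made-up sample; in practice pass a dataset split or a list of entries.
sample = [
    {"date": "1909-05-03"},
    {"date": "1942-01-15"},
    {"date": "1943-09-14"},
    {"date": None},
]
# Assumed boundaries for the Great Patriotic War period.
share = share_in_range(sample, date(1941, 6, 22), date(1945, 5, 9))
print(round(share, 2))  # 0.67
```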