(en) GallicaSPARQLBot

From Wikipast
Revision as of 20 May 2019 at 20:40 by MasterBot (Wikipastbot update)

The GallicaSPARQLBot is a Python bot that automatically creates or completes Wikipast pages from the database of the Bibliothèque nationale de France (BnF).

Technical description

Data formatting

First, the bot retrieves, through a SPARQL query, all the authors listed in the BnF database.

Then, the following data is retrieved for each author:

  • The date and place of birth of the author
  • The date and place of death of the author
  • The list of the principal works (according to the BnF) of the author
  • A profile image of the author (if it exists)

Finally, the following data is retrieved for each main work of the author in question:

  • The creation date of the work
  • The type of work (Graphic Art, Book, ...)
  • An illustration of the work (if it exists)
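The retrieval step above can be sketched as follows. This is a minimal illustration, not the bot's actual code: the endpoint URL, the query and its vocabulary (foaf:Person, foaf:name) and the helper names run_query and parse_bindings are all assumptions. Only the result layout (results, then bindings, then one value per variable) comes from the standard SPARQL JSON results format. Optional fields such as an image are simply absent from a row when the database has no value for them.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the page only says that a SPARQL query is sent to the BnF.
ENDPOINT = "https://data.bnf.fr/sparql"

# Illustrative query; the classes and properties used here are assumptions.
AUTHOR_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?author ?name WHERE {
  ?author a foaf:Person ;
          foaf:name ?name .
} LIMIT 100
"""


def run_query(query, endpoint=ENDPOINT):
    """Send a SPARQL query over HTTP GET and return the decoded JSON result."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def parse_bindings(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows
```

A caller would typically chain the two helpers, for example parse_bindings(run_query(AUTHOR_QUERY)), and use row.get("image") to handle fields that may be missing.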

From this data, the bot creates the Wikipast page linked to each author by adding a link to the profile picture (if it exists in the BnF), their birth and death dates and their main works (only those containing a date). Similarly, a Wikipast page is created for each of these works with an illustration (if it exists in the BnF). These additions are each accompanied by links to the BnF website and are standardized as follows:

For the date of birth, with or without a place:

[[(en)_DATE|DATE]] / [[(en)_LOCATION|LOCATION]]. [[(en)_Birth|Birth]] of [[(en)_AUTHOR|AUTHOR]]. [BNF PAGE OF THE AUTHOR]
[[(en)_DATE|DATE]]. [[(en)_Birth|Birth]] of [[(en)_AUTHOR|AUTHOR]]. [BNF PAGE OF THE AUTHOR]

For the date of death, with or without a place:

[[(en)_DATE|DATE]] / [[(en)_LOCATION|LOCATION]]. [[(en)_Deaths|Deaths]] of [[(en)_AUTHOR|AUTHOR]]. [BNF PAGE OF THE AUTHOR]
[[(en)_DATE|DATE]]. [[(en)_Deaths|Deaths]] of [[(en)_AUTHOR|AUTHOR]]. [BNF PAGE OF THE AUTHOR]

For a work, with or without a type:

[[(en)_DATE|DATE]]. [[(en)_Creation|Creation]] by [[(en)_AUTHOR|AUTHOR]] of [[(en)_WORK|WORK]] ([[(en)_TYPE|TYPE]]). [BNF PAGE OF THE WORK]
[[(en)_DATE|DATE]]. [[(en)_Creation|Creation]] by [[(en)_AUTHOR|AUTHOR]] of [[(en)_WORK|WORK]]. [BNF PAGE OF THE WORK]
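The templates above could be produced by a few small formatting helpers. The sketch below is one possible way to do it; the function names (link, life_event, creation_event) are hypothetical, and the BnF link is passed in as an already formatted string because the page does not show its exact syntax.

```python
def link(label):
    """Build a Wikipast link in the '(en)_' namespace used by the templates."""
    return "[[(en)_{0}|{0}]]".format(label)


def life_event(kind, author, date, bnf_link, place=None):
    """Format a birth or death line; kind is 'Birth' or 'Deaths'."""
    when = link(date)
    if place:
        # The place is appended only when the database provides one.
        when += " / " + link(place)
    return "{}. {} of {}. {}".format(when, link(kind), link(author), bnf_link)


def creation_event(work, author, date, bnf_link, work_type=None):
    """Format the creation line of a work, with its type when known."""
    text = "{}. {} by {} of {}".format(
        link(date), link("Creation"), link(author), link(work)
    )
    if work_type:
        text += " ({})".format(link(work_type))
    return text + ". " + bnf_link
```

For example, life_event("Birth", "AUTHOR", "DATE", "[BNF]") yields the second birth template, and passing place="LOCATION" yields the first.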

Managing pages

Each page processed by the bot is annotated with its Wikidata type and its BnF identifier (BnF ID). This means that even if the database contains no content line for it (birth, death or creation), the page is still created with this information. The Wikidata type lets other bots easily find, through a simple search on Wikipast, all human beings (identifier Q5) and all works (identifier Q386724). This feature is used, for example, by the Wikidataficator.
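As a rough sketch of this tagging and of the search it enables, assuming the layout shown in the result examples on this page ("Wikidata: (…)" followed by "BnF ID: …"). The helper names and the in-memory dictionary of pages are hypothetical; a real bot would query Wikipast instead.

```python
# Q5 is the Wikidata item for humans; Q386724 is the type shown for works
# in the result example of this page.
WIKIDATA_TYPES = {"author": "Q5", "work": "Q386724"}


def identity_block(kind, bnf_id):
    """Return the two identification lines added to every page."""
    return "Wikidata: ({})\n\nBnF ID: {}".format(WIKIDATA_TYPES[kind], bnf_id)


def pages_of_type(pages, kind):
    """Return the titles of the pages tagged with the given Wikidata type.

    pages is a hypothetical {title: page_text} mapping.
    """
    marker = "Wikidata: ({})".format(WIKIDATA_TYPES[kind])
    return sorted(title for title, text in pages.items() if marker in text)
```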

Homonymy is handled with a "first come, first served" policy. If the bot wants to insert an author whose name does not yet exist on Wikipast, the page is created normally. If a page with the author's name already exists, a page is still created, but its title is the author's name followed by a unique identifier (UUID) (example: Johann Strauss (3bd88e1), father of Johann Strauss).

To avoid collisions between the names of several works, as well as between the name of a work and that of an author, works are named as follows: Name of the work (name of the author) (example: The Bear in a boat (fable V) (Plover)).
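The naming rules above can be sketched as follows. This is an illustration under assumptions: the set of existing titles is a hypothetical in-memory stand-in for a lookup on Wikipast, and the 7-character suffix length is inferred from the example 3bd88e1, not from the bot's code.

```python
import uuid


def author_title(name, existing_titles):
    """First come, first served: only later homonyms get a suffix."""
    if name not in existing_titles:
        return name
    # Short hexadecimal fragment of a random UUID (length is an assumption).
    suffix = uuid.uuid4().hex[:7]
    return "{} ({})".format(name, suffix)


def work_title(work, author):
    """Works always carry the author's name, so they never need a suffix."""
    return "{} ({})".format(work, author)
```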


Performance discussion

Quantity produced

The bot created around 500,000 author pages and 15,000 work pages in just under 4 days, which is a very reasonable amount of time. Unfortunately, most of the authors listed in the BnF database have no information other than their name, so a large share of the created pages are empty. They remain usable thanks to the Wikidata tag and the BnF identifier.

Difficulties encountered

Since the number of requests per second was very high (up to 200 parallel threads, with an average of about 30), some HTTP requests failed, which left elements missing from the final result. Only 1,500 of the one million links processed were not executed; they were restarted manually after the initial run completed. No failures remained after this intervention.
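The pattern described here, running many requests in parallel, collecting the ones that fail and running them again afterwards, can be sketched with the standard library. This is not the bot's implementation: the page says the failed links were restarted manually, whereas this sketch automates the retry, and the names run_batch, run_with_retries and worker are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(tasks, worker, max_workers=30):
    """Run worker(task) in parallel; return (results, failed_tasks)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {task: pool.submit(worker, task) for task in tasks}
        for task, future in futures.items():
            try:
                results[task] = future.result()
            except Exception:
                # Keep the task so it can be restarted later.
                failed.append(task)
    return results, failed


def run_with_retries(tasks, worker, rounds=3):
    """Re-run failed tasks for a limited number of rounds."""
    results, pending = {}, list(tasks)
    for _ in range(rounds):
        if not pending:
            break
        done, pending = run_batch(pending, worker)
        results.update(done)
    return results, pending
```

Tasks must be hashable (for example URLs or identifiers). Whatever is still in the second returned value after the last round needs manual attention.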

The case of already existing authors was handled only after the initial execution, to avoid creating unnecessary pages if the bot had to be restarted (because of an error, for example).

Some works had a title so long that it exceeded the character limit for a page title on Wikipast (256 bytes). This problem was resolved after a few cases occurred.
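The page does not say how the problem was resolved. One possible approach, shown here purely as an assumption, is to truncate the title to the byte limit without cutting a multi-byte UTF-8 character in half. The function name is hypothetical and the default limit follows the figure quoted above.

```python
def truncate_title(title, max_bytes=256):
    """Shorten a title so that its UTF-8 encoding fits in max_bytes."""
    encoded = title.encode("utf-8")
    if len(encoded) <= max_bytes:
        return title
    # errors="ignore" drops a trailing character that the slice cut in half.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

The limit is counted in bytes, not characters, so a title with many accented letters reaches it sooner than a plain ASCII one.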

Ordering the events was also non-trivial. A simple sorting algorithm is not enough for negative dates (see Cicero) or for dates whose order of magnitude changes (see Simeon the New Theologian). The bot handles both cases.
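To illustrate why a plain text sort fails and how a numeric key fixes it, here is a minimal sketch. It assumes dates written as YEAR, YEAR.MM or YEAR.MM.DD with an optional leading minus sign for years before the common era. The format and the function name date_key are assumptions, not taken from the bot's code.

```python
def date_key(date):
    """Turn a date string into a (year, month, day) tuple of integers."""
    negative = date.startswith("-")
    parts = [int(part) for part in date.lstrip("-").split(".")]
    # Missing month or day sorts before any known month or day.
    parts += [0] * (3 - len(parts))
    year, month, day = parts
    if negative:
        year = -year
    return (year, month, day)


# Illustrative values only.
dates = ["1022.03.12", "949", "-43.12.07", "-106.01.03"]
ordered = sorted(dates, key=date_key)
```

Compared as text, "1022.03.12" sorts before "949" because "1" comes before "9", and "-106" sorts before "-143" although year -143 is earlier. Comparing integers avoids both problems.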


Bot scheduling

The bot is supposed to run only once. The GallicaSPARQLBot policy is to never overwrite or modify an existing page.


Example of result

Author

From: Seneca

Wikidata: (Q5)

BnF ID: 11887555p


Work

From: History of painting in Italy (Stendhal)

Wikidata: (Q386724)

BnF ID: 14415306q

[Illustration: bpt6k6215656z.thumbnail.jpg]