Populating the database

From Simulace.info
Revision as of 22:20, 19 January 2025 by Tros01

Introductory paragraph

Problem definition

Why?

Let's say I have a database of prepared activities and educational content for a Boy Scouts group. Its purpose is to serve as a source of inspiration and, eventually, as a collection of high-quality content suitable for kids at our weekly meetings and other occasions. It should be easy to retrieve exactly the desired results based on properties defined on each activity (e.g. time needed for preparation, minimum number of players, etc.).
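As a minimal sketch of what "getting only the desired results based on defined properties" could mean, the following Python snippet filters activities by preparation time and minimum number of players. It is an illustration only: the class `Activity`, the function `find_activities`, the field names and the three sample records are all hypothetical, and the real database is Neo4j, not an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One activity record (hypothetical field names)."""
    name: str
    prep_minutes: int  # time needed for preparation
    min_players: int   # minimum number of players


def find_activities(activities, max_prep_minutes, players_available):
    """Return activities that can be prepared in time and played by the group at hand."""
    return [
        a for a in activities
        if a.prep_minutes <= max_prep_minutes and a.min_players <= players_available
    ]


# Made-up sample records, for illustration only.
SAMPLE = [
    Activity("Knot relay", prep_minutes=10, min_players=4),
    Activity("Night trail", prep_minutes=60, min_players=6),
    Activity("Map quiz", prep_minutes=5, min_players=2),
]

if __name__ == "__main__":
    for activity in find_activities(SAMPLE, max_prep_minutes=15, players_available=5):
        print(activity.name)
```

With a nearly empty database, a filter like this returns nothing for most inputs, which is the usability problem the simulation is meant to examine.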

However, it is currently more of a proof of concept than an operational tool, so there are not yet any real records to speak of. This very early stage of development is exactly why it is reasonable to think about how the database will be populated with data, as this will be crucial for its future usability and success. The simulation should therefore focus on modeling this data population process.

What?

To be precise, the simulation is concerned with user behavior and the resulting volume of stored records.

It is important to note that the only user interface for reading is the default web GUI of the database itself (Neo4j), so users will have to use the Cypher query language. After going to that much effort to access the records, they had better find a decent number of useful ones in the database. Otherwise their whole effort would be pointless and they would be discouraged from using the database again.

Creating records, on the other hand, will be as easy and accessible as possible: the user writes into a Google document in a specified folder and with a specified structure, which then allows it to be loaded into the database programmatically. This ease of creation is key to encouraging consistent contributions.
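To illustrate why a fixed document structure makes programmatic loading straightforward, here is a small Python sketch that turns the text of such a document into a record. The "Key: value" layout, the field names and the function `parse_activity_document` are assumptions made for this example; the page does not specify the actual structure, and fetching the text from Google Docs and writing to Neo4j are left out.

```python
def parse_activity_document(text):
    """Parse lines of the assumed form 'Key: value' into a dict.

    Keys are normalised to lower case with underscores. Blank lines and
    lines without a colon are skipped.
    """
    record = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record


if __name__ == "__main__":
    # Hypothetical document content, for illustration only.
    example = "Name: Knot relay\nPrep time: 10\nMin players: 4"
    print(parse_activity_document(example))
```

The stricter the agreed structure, the less cleanup a loader like this needs, at the cost of slightly more discipline from contributors.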

This raises the question of whether this "natural" way of populating the database is enough, or rather how long it will take before the database can provide satisfactory results in the majority of cases and can therefore be called a useful tool. If the time required is too long, users could lose interest, even in contributing new content. Therefore, the simulation will model the rate of content creation and its subsequent impact on database searchability and usability.

Additionally, a "batch population approach" will be evaluated to see whether it is worth implementing. This will involve creating a model of user behavior and content creation, simulating different scenarios, and analyzing the results.
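The comparison between the two approaches can be sketched with a toy stochastic model. The Python snippet below is only an illustration, not the simulation described on this page: it assumes that batch population means preloading `initial_records` before users arrive, that each user independently adds at most one record per day with probability `p_create`, and that "useful" means reaching a fixed `critical_mass`. All names and numbers are hypothetical.

```python
import random


def days_to_critical_mass(n_users, p_create, critical_mass,
                          initial_records=0, seed=0, max_days=10_000):
    """Return the number of days until the record count reaches critical_mass.

    Returns None if the threshold is not reached within max_days.
    """
    rng = random.Random(seed)  # seeded so that scenarios are comparable
    records = initial_records
    for day in range(max_days + 1):
        if records >= critical_mass:
            return day
        # Each user independently contributes at most one record today.
        records += sum(1 for _ in range(n_users) if rng.random() < p_create)
    return None


if __name__ == "__main__":
    natural = days_to_critical_mass(n_users=10, p_create=0.05, critical_mass=100)
    batch = days_to_critical_mass(n_users=10, p_create=0.05, critical_mass=100,
                                  initial_records=50)
    print("natural:", natural, "days; batch:", batch, "days")
```

Because both calls use the same seed and contributions do not depend on the current record count, the batch scenario can never take longer than the natural one in this toy model. A fuller model would let search success feed back into user motivation, which is where the two scenarios could diverge more strongly.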

Objectives

This simulation aims to predict the growth of the database, determine the feasibility of the current data population strategy, and evaluate the alternative.

  1. Predict the time needed to reach a critical mass of data sufficient for effective use.
  2. Evaluate the effectiveness of the current data population method relative to the alternative.

Method

Agent-based model

The core method used for this simulation is agent-based modeling (ABM). This approach was selected because it aligns well with the nature of the problem, which involves multiple individual users (agents) interacting with a shared environment (the database). Specifically, it enabled me to simulate:

  • Individual User Behaviors: Each agent acts independently, deciding whether to search the database, create a new record, or improve existing ones, based on their local information and probabilities.
  • Heterogeneity (Limited): While the agents are initially identical, they change during the simulation based on their interactions with the database.
  • Emergent System Behavior: The overall growth and usability of the database are not programmed directly, but emerge from the accumulation of individual actions.
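The three properties listed above can be sketched in a few lines. The following Python snippet is a hedged illustration, not the Julia model built for this project: the `trust` attribute, the cooldown rule, the success formula `records / (records + saturation)` and every parameter value are assumptions invented for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    trust: float = 0.5  # willingness to search; all agents start identical
    cooldown: int = 0   # steps the agent sits out after a failed search


@dataclass
class Database:
    records: int = 0
    improvements: int = 0
    hits: int = 0
    misses: int = 0


def step(agents, db, rng, p_create=0.1, p_improve=0.05,
         saturation=50, cooldown_steps=3):
    """Advance the model by one time step; every agent acts independently."""
    for agent in agents:
        if agent.cooldown > 0:
            agent.cooldown -= 1
            continue
        roll = rng.random()
        if roll < p_create:
            db.records += 1
        elif roll < p_create + p_improve:
            if db.records > 0:  # nothing to improve in an empty database
                db.improvements += 1
        elif rng.random() < agent.trust:
            # Search: the chance of success grows with the number of records.
            p_success = db.records / (db.records + saturation)
            if rng.random() < p_success:
                db.hits += 1
                agent.trust = min(1.0, agent.trust + 0.05)
            else:
                db.misses += 1
                agent.trust = max(0.0, agent.trust - 0.05)
                agent.cooldown = cooldown_steps


def run(n_agents=10, n_steps=200, seed=0):
    """Run the model and return the agents, the database and the record history."""
    rng = random.Random(seed)
    agents = [Agent() for _ in range(n_agents)]
    db = Database()
    history = []
    for _ in range(n_steps):
        step(agents, db, rng)
        history.append(db.records)
    return agents, db, history


if __name__ == "__main__":
    agents, db, history = run()
    print("records:", db.records, "hits:", db.hits, "misses:", db.misses)
```

Nothing in `step` prescribes a growth curve for the database. The curve in `history` results from the accumulated individual choices, and the agents' `trust` values drift apart over time even though they start out equal.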

Alternatively, system dynamics or simple mathematical equations could be used, but neither would fit this use case better. While system dynamics could model the overall growth of the database as a system, it would not provide the same level of detail regarding individual user actions. System dynamics works well at the level of aggregates and rates of change but is poorly suited to capturing how each agent's behaviour individually impacts the system.

Similarly, equations could describe a simplified model of database growth, but it would be extremely difficult to derive a model that includes agents' success/fail feedback, record improvement, and cooldown, so this approach was not considered applicable to this problem.
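For comparison, a simplified growth equation of the kind mentioned here might look as follows. This is an illustrative assumption, not a formula from the project: it supposes N users who each add a record with a fixed probability p per time step, starting from R_0 records, with R^* denoting the critical mass.

```latex
\mathbb{E}[R(t)] = R_0 + N p\, t,
\qquad
t^{*} = \frac{R^{*} - R_0}{N p}
```

Here t^* is the expected time to reach the critical mass. The formula treats p as constant, so it has no place for success/fail feedback, record improvement or cooldown, all of which change how likely each agent is to act at a given step.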

Julia

Because the ABM approach was chosen, NetLogo was considered as the environment for implementing the simulation, as was the Julia language, which is also often used for simulations. In the end, Julia was chosen because it is more general-purpose and therefore, once learned, useful in more cases.

discussion of the possible solutions, selection of the method(s) and tool(s), justification of the selection (in other words, why you want to solve it the way you do, what the other alternatives are, and why the one you chose is the most suitable for this task)

Model

Entities

Variables

Flow

Detailed description of the model, including parameters, value ranges, diagrams, model limitations, etc. The description must be detailed enough that the experiment can be repeated from it (and, of course, even without the corresponding model files being available).

Results

Single run

Multi run

listing of the results, their analysis, interpretation, and evaluation.

Discussion

how well you managed to solve the defined problem


Code

Tros01 (talk) 22:19, 19 January 2025 (CET)