Populating the database
Introductory paragraph
Contents
Problem definition
Why?
Let's say that I have a database containing prepared activities and educational content for a boy scout group. Its purpose is to serve as a source of inspiration and, eventually, as a collection of high-quality content suitable for kids during our weekly meetings or other occasions. It should be easy to retrieve exactly the desired results from it, based on properties defined for each activity (e.g. time needed for preparation, minimum number of players, etc.).
However, it is currently more of a proof of concept than an operational tool, so there are no real records to speak of yet. This very early stage of development is exactly why I think it is reasonable to consider how the database will be populated with data, as this will be crucial for its future usability and success. The simulation therefore focuses on modeling this data population process.
What?
To be precise, the simulation is concerned with user behavior and the resulting volume of stored records.
It is important to note that the only user interface for reading is the default web GUI of the database itself (Neo4j), so users have to use the Cypher query language. After going to such lengths to access the records, they should find a decent number of useful ones in the database. Otherwise their whole effort would be pointless and they would be discouraged from using the database again.
Creating records, on the other hand, will be as easy and accessible as possible: the user writes a Google document in a specified folder and with a specified structure, which then allows it to be loaded programmatically into the database. This ease of creation is key to encouraging consistent contributions.
This raises the question of whether this "natural" way of populating the database is enough, or rather how long it will take before the database can provide satisfactory results in the majority of cases and can thus be called a useful tool. If that takes too long, users could lose interest, even in contributing new content. The simulation therefore models the rate of content creation and its subsequent impact on the searchability and usability of the database.
Additionally, the "batch population approach" (uploading an initial batch of records to the database) will be evaluated to see whether it is worth implementing. This involves creating a model of user behavior and content creation, simulating different scenarios, and analyzing the results.
Objectives
This simulation aims to predict the growth of the database, determine the feasibility of the current data population strategy, and evaluate the alternative.
- Predict the time needed to reach a critical mass of data sufficient for effective use.
- Evaluate the effectiveness of the current data population method relative to the alternative.
Method
Agent-based model
The core method used for this simulation is agent-based modeling (ABM). This approach was selected because it aligns well with the nature of the problem, which involves multiple individual users (agents) interacting with a shared environment (the database). Specifically, it enabled me to simulate:
- Individual User Behaviors: Each agent acts independently, deciding whether to search the database, create a new record, or improve existing ones, based on their local information and probabilities.
- Heterogeneity (Limited): While the agents are initially identical, they diverge during the simulation based on their interactions with the database.
- Emergent System Behavior: The overall growth and usability of the database is not programmed directly, but emerges from the accumulation of individual actions.
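The individual decision described above can be sketched as follows. This is a minimal illustration in Python, not the Julia implementation of the project. It assumes that each agent prepares exactly one activity per week, and the function names `weekly_action` and `improves_after_use` are hypothetical. How a search succeeds and how the search probability is updated is deliberately left out.

```python
import random

# Values taken from the model's variables.
QUALITY_IMPROVEMENT_PROB = 0.7
RECORD_CREATION_PROB = 0.3


def weekly_action(probability_to_search, rng):
    """Pick the action of one agent for one week (assumed: one activity per week)."""
    if rng.random() < probability_to_search:
        return "search"   # the agent tries to reuse a record from the database
    if rng.random() < RECORD_CREATION_PROB:
        return "create"   # the agent prepares an activity and stores it as a new record
    return "none"         # the agent prepares an activity but does not store it


def improves_after_use(rng):
    """Decide whether an agent improves a record after having used it."""
    return rng.random() < QUALITY_IMPROVEMENT_PROB


# Share of each action for a hypothetical search probability of 0.5.
rng = random.Random(0)
actions = [weekly_action(0.5, rng) for _ in range(10_000)]
shares = {name: actions.count(name) / len(actions) for name in ("search", "create", "none")}
print(shares)
```

With a search probability of 0.5, roughly half of the weeks result in a search and about 15 % (0.5 × 0.3) in a new record. The growth of the database then follows from many such individual draws.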
Alternatively, system dynamics or simple mathematical equations could be used, but neither would fit this use case better. While system dynamics could model the overall growth of the database as a system, it would not provide the same level of detail regarding individual user actions. System dynamics works well at the level of aggregates and rates of change, but it is poorly suited to capturing how each agent's individual behavior impacts the system.
Similarly, equations could describe a simplified model of database growth, but it would be extremely difficult to derive a model that includes the agents' success/failure feedback, record improvement and cooldown. This approach was therefore not considered applicable to this problem.
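To illustrate what such a simplified equation could look like, here is a sketch under strong assumptions that are not part of the model: every agent prepares one activity per week, and the search probability is a constant p_s instead of changing with each agent's experience. N_0 is the size of the initial batch, A the number of agents, p_c the record creation probability and t the number of elapsed weeks.

```latex
\mathbb{E}[N(t)] = N_0 + A \, t \, (1 - p_s) \, p_c
```

For A = 3, p_c = 0.3 and a hypothetical p_s = 0.5, this gives an expected growth of 3 × 0.5 × 0.3 = 0.45 records per week. The formula ignores everything that makes p_s change over time, which is exactly the feedback the agent-based model is meant to capture.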
Julia
Because the ABM approach was chosen, NetLogo was considered as the environment for implementing the simulation, as was the Julia language, which is also often used for simulations. In the end, Julia was chosen because it is a more general-purpose language, so learning it pays off in more use cases.
Discussion of possible solutions, selection of the method(s) and tool(s), and justification of the selection (in other words, why you want to solve it the way you do, what the alternatives are, and why the one you chose is the most suitable for this task).
Model
Entities
- Database (db)
A dictionary storing Record objects, categorized by integers representing different activity categories.
- The keys of the dictionary are integers representing category id.
- Each category is associated with a vector (array) of Record objects.
- Initially, each category has an empty vector, which can be populated by new records.
- Record
A struct containing:
- id (Integer): A unique identifier for the record, auto-incrementing with each creation.
- quality (Float64): Represents the perceived quality or usability of the record.
- Agent
A struct representing a user of the database.
- probabilityToSearch (Float64): The agent's current estimate of the probability of finding a useful record, based on the agent's past success when searching the db. This value drives the agent's decision whether to search the db or not.
- used (Dict{Int, Vector{Int}}): A dictionary recording which records the agent has used and in which weeks, so that a record is not used again by the same agent during a given cooldown period. Keys are record ids and values are vectors of the week numbers in which the record was used.
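The entities above can be sketched as plain data structures. The following is a Python illustration only (the project itself uses Julia structs). The helper `on_cooldown`, its `cooldown_weeks` parameter, the starting id of 1 and the search probability of 0.5 used in the example are assumptions made for the sketch, since the text does not fix them.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List

_next_id = count(1)  # auto-incrementing record ids (starting at 1 is an assumption)


@dataclass
class Record:
    quality: float
    id: int = field(default_factory=lambda: next(_next_id))


@dataclass
class Agent:
    probability_to_search: float
    # record id -> weeks in which this agent used the record
    used: Dict[int, List[int]] = field(default_factory=dict)


def new_db(categories):
    """Map every category id to an initially empty list of records."""
    return {category: [] for category in categories}


def on_cooldown(agent, record_id, week, cooldown_weeks):
    """True if the agent used the record within the last `cooldown_weeks` weeks."""
    return any(0 <= week - w < cooldown_weeks for w in agent.used.get(record_id, []))


# Example: one record in category 1, used by one agent in week 3.
db = new_db(range(1, 21))
record = Record(quality=20.0)
db[1].append(record)
agent = Agent(probability_to_search=0.5)
agent.used.setdefault(record.id, []).append(3)
print(on_cooldown(agent, record.id, week=4, cooldown_weeks=4))
print(on_cooldown(agent, record.id, week=7, cooldown_weeks=4))
```

Keeping the usage history on the agent rather than on the record means the cooldown is per agent: a record that one agent has just used remains available to all the others.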
Variables
- NUM_AGENTS
Integer representing the number of agents (users) participating in the simulation. Value is fixed at 3.
- BATCH_UPLOAD_SIZES
Vector of integers representing different sizes of initial record batches uploaded to the database. The values used in simulation are [0, 10].
- initialQualityDist
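A lognormal distribution, LogNormal(3, 2), used to generate the initial quality of new records. In Julia's Distributions.jl package, the two parameters are the mean (μ = 3) and the standard deviation (σ = 2) of the logarithm of the quality, not of the quality itself. The median quality of a new record is therefore exp(3) ≈ 20.1, and the distribution has a heavy right tail (its mean is exp(3 + 2²/2) = exp(5) ≈ 148.4), so most records are created with a moderate quality while a few get a much higher one.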
- qualityImprovementProb
Float64 representing the probability that an agent will improve a record's quality after using it. Value is fixed at 0.7.
- recordCreationProb
Float64 representing the probability that an agent creates a new record when they are preparing an activity and decide not to search the database. Value is fixed at 0.3.
- categories
Vector of integers representing the category ids of the db records. The vector is generated from the range 1 to 20, [1:20;].
- targetSuccessfulness
Float64 representing the minimum success rate that must be reached for the database to be considered "useful" in the long term. Value is 0.75.
- nSuccessfullFromEpisode
Integer representing how many successful searches must occur per episode of the simulation for the database to be considered "useful". Value is 4.
- episodeLength
Integer representing the length of one "episode" of the simulation, i.e. a sequence of weeks over which the success rate of the db is tested against the required conditions. Value is 6, meaning 6 weeks.
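The variables above can be collected in one place, and the effect of the lognormal parameters checked empirically. The sketch below is in Python rather than the project's Julia. Python's `random.lognormvariate(mu, sigma)` takes the same log-scale parameters as `LogNormal(3, 2)`. The constant names, the helper `sample_initial_quality`, the seed and the sample size are choices made only for this illustration.

```python
import math
import random
import statistics

# Model variables with the values given above.
NUM_AGENTS = 3
BATCH_UPLOAD_SIZES = [0, 10]
QUALITY_MU, QUALITY_SIGMA = 3.0, 2.0   # log-scale parameters of LogNormal(3, 2)
QUALITY_IMPROVEMENT_PROB = 0.7
RECORD_CREATION_PROB = 0.3
CATEGORIES = list(range(1, 21))
TARGET_SUCCESSFULNESS = 0.75
N_SUCCESSFUL_FROM_EPISODE = 4
EPISODE_LENGTH = 6


def sample_initial_quality(rng):
    """Draw the initial quality of a newly created record."""
    return rng.lognormvariate(QUALITY_MU, QUALITY_SIGMA)


# Compare the sample median with the theoretical median exp(mu).
rng = random.Random(42)
samples = [sample_initial_quality(rng) for _ in range(100_000)]
sample_median = statistics.median(samples)
print(f"theoretical median: {math.exp(QUALITY_MU):.2f}")
print(f"sample median:      {sample_median:.2f}")
```

The sample median should land close to exp(3) ≈ 20.1, while the sample mean is much larger and unstable between runs because of the heavy right tail.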
Flow
Detailed description of the model, including parameters, value ranges, diagrams, model limitations, etc. The description must be detailed enough for the experiment to be repeated (including, of course, without the corresponding model files being available).
Results
Single run
Multi run
Listing of the results, their analysis, interpretation, and evaluation.
Discussion
How well you managed to solve the defined problem.