Populating the database
Introductory paragraph
Problem definition
Why?
Let's say I have a database containing prepared activities and educational content for a boy scout group. Its purpose is to serve as a source of inspiration and, eventually, as a collection of high-quality content appropriate for kids during our weekly meetings and other occasions. It should be easy to retrieve precisely the desired results based on the properties defined on each activity (e.g. time needed for preparation, minimal number of players, etc.).
In reality, however, it is currently more of a proof of concept than an operational tool, so there are not yet any real records to speak of. This very early stage of development is exactly why I think it is reasonable to consider how the database will be populated with data, as this will be crucial for its future usability and success. The simulation therefore focuses on modeling this data population process.
What?
To be precise, the simulation is concerned with user behavior and the resulting volume of stored records.
It is important to note that the only user interface for reading is the default web GUI of the database itself (Neo4j), so users have to write Cypher queries. After going through such an involved way of accessing the records, there had better be a decent number of useful ones in the database; otherwise the whole effort would be pointless and users would be discouraged from using the database again.
Creating records, on the other hand, will be as easy and accessible as possible: the user writes a Google document with a specified structure into a specified folder, which can then be loaded programmatically into the database. This ease of creation is key to encouraging consistent contributions.
And here comes the question of whether this "natural" way of populating the data is enough, or rather how long it will take for the database to provide satisfactory results in the majority of cases and thus be considered a useful tool. If the required time were too long, users could lose interest in contributing new content at all. The simulation will therefore model the rate of content creation and its impact on database searchability and usability.
Additionally, a "batch population" approach will be evaluated to see whether it is worth implementing. This involves creating a model of user behavior and content creation, simulating different scenarios, and analyzing the results.
Objectives
This simulation aims to predict the growth of the database, determine the feasibility of the current data population strategy, and evaluate an alternative.
- Predict the time needed to reach a critical mass of data sufficient for effective use.
- Evaluate the effectiveness of the current data population method relative to an alternative.
Method
Agent-based model
The core method used for this simulation is agent-based modeling (ABM). This approach was selected because it aligns well with the nature of the problem, which involves multiple individual users (agents) interacting with a shared environment (the database). Specifically, it enabled me to simulate:
- Individual User Behaviors: Each agent acts independently, deciding whether to search the database, create a new record, or improve existing ones, based on their local information and probabilities.
- Heterogeneity (Limited): While the agents are initially identical, they change during the simulation based on their interactions with the database.
- Emergent System Behavior: The overall growth and usability of the database is not programmed directly, but emerges from the accumulation of individual actions.
Alternatively, system dynamics or plain mathematical equations could be used, but neither would fit this use case better. While system dynamics could model the overall growth of the database as a system, it would not provide the same level of detail about individual user actions. System dynamics works well with aggregate levels and rates of change, but is not great at capturing how each agent's behavior individually impacts the system.
Similarly, equations could model a simplified database growth curve, but it would be extremely difficult to derive a model that includes the agents' success/failure feedback, record improvement, and the cooldown mechanism, so this approach was not considered applicable either.
Julia
Because the ABM approach was chosen, NetLogo was considered as the implementation environment, as was the Julia language, which is also often used for simulations. In the end Julia was chosen for its more general usability, and therefore more use cases once learned.
The simulation is implemented in Julia using the Distributions, StatsBase, and Logging packages. Logging is enabled to output data about records, agents and overall progress of simulation to a log file called simulation.log.
The Plots package is used for visualization; the figures are stored as *.png files.
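As a sketch of the logging setup, the standard-library Logging package can route all log macros to simulation.log like this (an assumed minimal setup; the exact configuration in the real code may differ):

```julia
# Minimal sketch: send all @info/@warn/... output to simulation.log
# using the standard-library Logging package.
using Logging

io = open("simulation.log", "w")
global_logger(SimpleLogger(io))   # replace the default stderr logger

@info "weekly tick finished" week = 1 created = 5
flush(io)
```

The same pattern works for records, agents, and overall progress; for the plots, `Plots.savefig` writes the current figure to a *.png file.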
Model
Entities
- Database (db)
A dictionary storing Record objects, categorized by integers representing different activity categories.
- The keys of the dictionary are integers representing category id.
- Each category is associated with a vector (array) of Record objects.
- Initially, each category has an empty vector, which can be populated by new records.
- Record
A struct containing:
- id (Integer): A unique identifier for the record, auto-incrementing with each creation.
- quality (Float64): Represents the perceived quality or usability of the record.
(Originally the quality was meant to resemble real-life improvements more closely by being limited to the range 1-10, since real records cannot be improved infinitely either; in the end a simpler implementation was kept, where every "picked to use" event has a chance to improve the record by a random value with no upper limit.)
- Agent
A struct representing a user of the database.
- probabilityToSearch (Float64): The agent's current probability of deciding to search the database, derived from the agent's past success when searching the db.
- used (Dict{Int, Vector{Int}}): A dictionary tracking which record IDs the agent has used and in which weeks, so the same record is not reused by the agent during a given cooldown period. Keys are record ids; values are vectors of the week numbers in which the record was used.
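The entities above can be sketched in Julia as follows (field names follow the description; the exact definitions in the real code may differ):

```julia
# Sketch of the model entities described above.

mutable struct Record
    id::Int            # unique identifier, auto-incremented at creation
    quality::Float64   # perceived quality, can improve over time
end

mutable struct Agent
    probabilityToSearch::Float64   # updated from past search success
    used::Dict{Int, Vector{Int}}   # record id => weeks the record was used
end

# The database: category id => vector of records, initially empty.
db = Dict(cat => Record[] for cat in 1:20)

push!(db[5], Record(1, 3.2))                 # a record lands in category 5
agent = Agent(0.5, Dict{Int, Vector{Int}}()) # a fresh agent
```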
Variables
- NUM_AGENTS
- Integer representing the number of agents (users) participating in the simulation. Value is fixed at 3.
- BATCH_UPLOAD_SIZES
- Vector of integers representing different sizes of initial record batches uploaded to the database. The values used in the simulation are [0, 10, 30, 50] (see Results).
- initialQualityDist
- A lognormal distribution (LogNormal(3, 2)) used to generate the initial quality of new records. Note that the parameters are the mean (3) and standard deviation (2) of the underlying normal distribution on the log scale, so the median initial quality is e^3 ≈ 20, with most records at moderate quality and occasional much higher or lower ones.
- qualityImprovementProb
- Float64 representing the probability that an agent will improve a record's quality after using it. Value is fixed at 0.7 for the sake of simplicity and simulation feasibility.
- recordCreationProb
- Float64 representing the probability that an agent creates a new record when they are preparing an activity and decide not to search the database. Value is fixed at 0.3 for the sake of simplicity and simulation feasibility.
- categories
- Vector of integers representing the category ids of the db records. The vector is generated as the range from 1 to 20, [1:20;].
- targetSuccessfulness
- Float64 value representing the minimum success rate that needs to be reached for the database to be considered "useful" in the long term. Value is 0.75.
- nSuccessfullFromEpisode
- An integer representing how many weeks within an episode must exceed the target success rate for the episode to be considered "useful". Value is 4.
- episodeLength
- Integer representing the length of one "episode" within the simulation, i.e. the sequence of weeks over which the success rate of the db is tested against the required conditions. Value is set to 6, meaning 6 weeks.
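Collected together, the parameters above can be written as a configuration block roughly like this (a sketch; the exact constant declarations in the real code may differ):

```julia
using Distributions   # provides LogNormal

const NUM_AGENTS              = 3
const BATCH_UPLOAD_SIZES      = [0, 10, 30, 50]  # batch sizes compared in Results
const qualityImprovementProb  = 0.7
const recordCreationProb      = 0.3
const categories              = [1:20;]          # category ids 1..20
const targetSuccessfulness    = 0.75
const nSuccessfullFromEpisode = 4
const episodeLength           = 6                # weeks

# Parameters are the log-scale mean and standard deviation.
const initialQualityDist = LogNormal(3, 2)
```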
Flow
1. Initialization:
- The database is created with empty vectors of records for each defined category (integer from 1 to 20).
- A defined number of agents are created with default probabilityToSearch.
- Optionally, a number of initial Record objects are created (based on the BATCH_UPLOAD_SIZES parameter) and assigned to random categories within the database.
2. Weekly Cycle (One Tick):
- Each week (tick), each agent attempts to prepare an activity.
- Each agent selects a random category.
- Then each agent decides whether to search the database based on its current probabilityToSearch. (It was originally designed so that each agent's probabilityToSearch could be calculated dynamically and independently, but, for the same reason that recordCreationProb is global to all agents, this variable was also left the same for all agents.)
- Search Behavior
- If the agent decides to search, it looks up all records that match its selected category.
- If records are found, the agent samples, without replacement, a predefined number of records for evaluation. Each record's probability of being sampled is proportional to its quality: the higher the quality, the higher the probability of being chosen. (Weights are calculated as [rec.quality / sum(r.quality for r in records) for rec in records].)
- The agent then tries to pick one of those sampled records to use.
- Before the agent uses any of the found records, it must be checked that the record was not used within a defined "cooldown period", which is set to 54 weeks. Each agent keeps track of the records it used and the weeks in which they were used.
- If the record the agent tries to pick is on cooldown, the agent proceeds to the next sampled record until it finds one that is off cooldown or runs out of samples (the latter typically happens at the beginning of the simulation, when there are not yet enough records in the db).
- If the record is used, there is a probability of the agent improving the quality of that used record.
- Record Creation Behavior
- If the agent decides not to search or if the search is unsuccessful, it has a chance of creating a new record (based on parameter recordCreationProb).
- If a new record is created, it is assigned a randomly generated initial quality drawn from the lognormal distribution, and pushed into the db under the selected category.
- The total number of records created so far during the simulation is also tracked, and each record is assigned its order number as its id at creation. This makes it possible to keep track of which records a given agent has used before.
3. Evaluation
- The simulation also tracks total search attempts and successes. A search counts as a success only when a record is actually used for an activity, not merely when records are found or searched for.
- At the end of each week, a weekly success rate is calculated by dividing the number of successful record usages by the number of attempts performed.
- The simulation continues until the database has reached "sufficient" usability.
- A database is considered "sufficiently useful" when, within an episode of the defined length (6 weeks), the weekly success rate exceeds the defined threshold (75%) in at least the required minimum number of weeks (4 out of 6).
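The search step and the stopping criterion above can be condensed into a stdlib-only sketch like the following. The real code uses StatsBase.sample for the weighted draw; helper names such as `try_use_record!` and `is_useful` are illustrative, not taken from the code.

```julia
using Random

mutable struct Record
    id::Int
    quality::Float64
end

mutable struct Agent
    probabilityToSearch::Float64
    used::Dict{Int, Vector{Int}}   # record id => weeks it was used
end

# Draw one index with probability proportional to record quality.
function weighted_pick(records, rng)
    weights = [rec.quality / sum(r.quality for r in records) for rec in records]
    r, acc = rand(rng), 0.0
    for (i, w) in enumerate(weights)
        acc += w
        r <= acc && return i
    end
    return length(records)   # guard against floating-point round-off
end

# A record is on cooldown if the agent used it within the last 54 weeks.
on_cooldown(agent, rec, week; cooldown = 54) =
    any(w -> week - w < cooldown, get(agent.used, rec.id, Int[]))

# Try the sampled records one by one until an off-cooldown one is found.
function try_use_record!(agent, sampled, week, rng; improveProb = 0.7)
    pool = copy(sampled)
    while !isempty(pool)
        i = weighted_pick(pool, rng)
        rec = pool[i]
        if on_cooldown(agent, rec, week)
            deleteat!(pool, i)                              # try another sample
        else
            push!(get!(agent.used, rec.id, Int[]), week)    # remember the usage
            rand(rng) < improveProb && (rec.quality += rand(rng))
            return rec                                      # successful use
        end
    end
    return nothing                                          # search failed
end

# Stopping criterion: at least 4 of the last 6 weekly success rates above 75 %.
is_useful(rates; episode = 6, needed = 4, target = 0.75) =
    length(rates) >= episode &&
    count(r -> r > target, rates[end-episode+1:end]) >= needed
```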
Limitations
- Simplified Categories: The categories used in the model are represented by integers and lack any explicit meaning. It was intended to use strings as category representations, but the added complexity has no effect on the results, so integers are used to keep the code simpler.
- Agent Uniformity: Every agent has an identical recordCreationProb, which is not realistic and does not model different levels of contribution to the db.
- Fixed Parameters: Many model parameters, such as qualityImprovementProb, recordCreationProb, and NUM_AGENTS, are fixed values that do not change during the simulation, which makes it somewhat static.
- Single Activity Per Week: Each agent prepares only one activity per week, which may be an oversimplification of real user behavior.
- Limited Complexity: The simulation lacks any deeper notion of how records relate to each other, or properties more diverse than a single "quality" value.
Results
In the last section of the appended code there is an attempt to compute results by adding a new function that calls the main simulation a given number of times and records the results, so the variability caused by randomness can be observed: specifically, how the required number of weeks and the number of created records vary across repeated runs.
Unfortunately, some unexpected and probably undesired patterns occurred in the obtained "multi run" results, most likely due to a mistake on my side. Further elaboration follows.
There is also a bug/feature in the function that determines when the search process is successful enough. Because a certain amount of time is required to assess the stability of that success, there is a minimal fixed timeframe at the beginning of the simulation. As a result, the measured number of weeks required for the system to be recognized as useful cannot be lower than this timeframe, which is why the smallest number of weeks in the results is 21.
Also, the counter of created records does not count records created via batch upload, which needs to be kept in mind when reading the results.
The GitHub repository with the code also contains plots showing how the success rate evolves over time.
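The multi-run helper described above can be sketched as follows; `run_simulation` is a placeholder for the main simulation entry point, assumed here to return the pair (weeks until usefulness, records created) for a given initial batch size:

```julia
# Hypothetical sketch of the "multi run" helper; `run_simulation` stands in
# for the real simulation function and is passed in as an argument.
function multi_run(run_simulation, n_runs::Int, batch_size::Int)
    weeks   = Int[]
    records = Int[]
    for _ in 1:n_runs
        w, r = run_simulation(batch_size)
        push!(weeks, w)
        push!(records, r)
    end
    return weeks, records
end
```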
Single run
This means running the multi-run function for a single iteration. It was run manually three times.
- batch 0
- weeks until "usefulness": 82, 51, 69
- records created in db: 44, 40, 55
- batch 10
- weeks: 21, 33, 21
- records: 21, 26, 19
- batch 30
- weeks: 24, 21, 22
- records: 26, 17, 18
- batch 50
- weeks: 21, 21, 21
- records: 30, 18, 27
It appears that an initial batch upload really does speed up the time required for the database to be considered useful, although the implementation cannot terminate earlier than the small fixed timeframe mentioned above. It is also obvious that the results are heavily influenced by the randomness in the model.
It seems that the first small batch upload has the largest influence on the resulting values, while larger batches bring only relatively small further improvements. This suggests using a batch upload in real life, but makes it redundant to optimize the upload process for larger amounts: they are not needed, and a small number of records (10) is enough to significantly boost the speed of population.
Multi run
- batch 0
- weeks: [55, 72, 70, 78, 94, 114, 126, 120, 135, 140, 152, 156, 180, 161, 167]
- records: [32, 129, 108, 195, 262, 316, 320, 308, 366, 373, 432, 442, 496, 486, 487]
- batch 10
- weeks: [35, 99, 99, 189, 249, 291, 298, 413, 413, 426, 452, 433, 471, 520, 564]
- records: [35, 48, 43, 90, 93, 112, 116, 148, 133, 141, 143, 155, 163, 158, 188]
- batch 30
- weeks: [21, 67, 79, 105, 99, 97, 115, 132, 137, 149, 160, 158, 170, 178, 181]
- records: [13, 170, 200, 227, 272, 254, 336, 316, 406, 423, 442, 415, 464, 496, 521]
- batch 50
- weeks: [33, 21, 82, 83, 91, 113, 123, 149, 145, 144, 152, 166, 164, 175, 192]
- records: [58, 42, 187, 247, 271, 306, 331, 365, 397, 413, 468, 502, 495, 507, 540]
Well, these are included just for completeness; there is not much value in such results except to show that something does not work as expected.
Discussion
In this simulation I managed to create a heavily simplified model of the database population process in our boy scout group.
The results appear to be approximately as expected, but they lack the reliability needed to serve as a basis for a decision. At most, I can take away a hint about the amount of records to batch upload: a relatively small amount makes a major difference.
To further improve the model, the following should be implemented:
- A chance for agents to "not like a record", which would sometimes stop the agent from picking it. The agent could try another record if more were available; if not, the search would fail. This would be a small improvement to the model's realism.
- Enhancing the record quality feature so it cannot be increased infinitely.
- Monitoring of record quality values, to give a better view of the state of the db.
Code
This is my code on GitHub, because I could not upload a .jl file and did not want to make a zip: [1]
And the home repo: [2]