Metro Management System
SUSTech 2024 Spring's Projects of Course CS307 - Principles of Database System
Led by Yuxin MA
Shenzhen-Metro
├── Project1 # part 1 of project (database design and data import)
│ ├── ShenzhenMetroDatabaseDesign
│ │ ├── src
│ │ │ ├── main
│ │ │ │ ├── java # import script in java
│ │ │ │ ├── resources # data in json
│ │ │ │ └── sql # ddl
│ └── ShenzhenMetroDatabaseDesignPython
│ ├── import_script.py # import script in python
│ └── resources # data in json
├── Project2 # part 2 of project (building an api)
│ ├── DataImport # updated data import
│ │ ├── src
│ │ │ ├── main
│ │ │ │ ├── java # updated import script in java
│ │ │ │ ├── resources # updated data
│ │ │ │ └── sql # updated ddl
│ └── ShenzhenMetro # spring boot project
│ ├── src
│ │ ├── main
│ │ │ ├── java # backend logic
│ │ │ │ └── com/sustech/cs307/project2/shenzhenmetro
│ │ │ │ ├── ShenzhenMetroApplication.java # main application driver
│ │ │ │ ├── controller # api route mapping
│ │ │ │ ├── dto # dto between client and server
│ │ │ │ ├── object # orm between database tables and application code
│ │ │ │ ├── repository # interfaces for data access
│ │ │ │ └── service # service layer containing business logic
│ │ │ └── resources # frontend logic
│ │ │ ├── application.properties # configuration file for spring boot application
│ │ │ ├── static
│ │ │ │ ├── assets
│ │ │ │ │ ├── css
│ │ │ │ │ ├── img
│ │ │ │ │ ├── js
│ │ │ │ │ └── vendor
│ │ │ │ │ ├── aos
│ │ │ │ │ ├── bootstrap
│ │ │ │ │ ├── bootstrap-icons
│ │ │ │ │ ├── glightbox
│ │ │ │ │ └── swiper
│ │ │ │ └── index.html # main html
│ │ │ └── templates
│ │ │ ├── buses
│ │ │ │ ├── create_bus.html
│ │ │ │ ├── index.html
│ │ │ │ └── update_bus.html
│ │ │ ├── landmarks
│ │ │ │ ├── create_landmark.html
│ │ │ │ ├── index.html
│ │ │ │ └── update_landmark.html
│ │ │ ├── lineDetails
│ │ │ │ ├── create_line_detail.html
│ │ │ │ ├── index.html
│ │ │ │ ├── navigate_routes.html
│ │ │ │ └── search_line_detail.html
│ │ │ ├── lines
│ │ │ │ ├── create_line.html
│ │ │ │ ├── index.html
│ │ │ │ └── update_line.html
│ │ │ ├── ongoingRides
│ │ │ │ └── index.html
│ │ │ ├── rides
│ │ │ │ ├── create_ride.html
│ │ │ │ ├── filter_ride.html
│ │ │ │ ├── index.html
│ │ │ │ └── update_ride.html
│ │ │ ├── stations
│ │ │ │ ├── create_station.html
│ │ │ │ ├── index.html
│ │ │ │ └── update_station.html
│ │ │ └── users
│ │ │ ├── card.html
│ │ │ └── passenger.html
└── README.md
The first part of the project is mainly about designing a database schema that satisfies the principles of relational databases based on the background of the provided data. Once the design phase is complete, we wrote scripts to import those large datasets. To ensure accuracy of the imported data, we had to perform some query statements and checked the query results on the defense day. Besides that, we also ran some experiments with the data to gain some wonderful insights, as shown later.
[Read the detailed requirements]
We believe that no database design is perfect. In fact, there are flaws with the design proposed by us in the ER diagram above. After looking at the datasets and the background of each dataset, we hope you can try to spot design flaws on your own! :)
Note: For figure interpretation,
ID | OS | Chip | Memory | SSD | Tools |
---|---|---|---|---|---|
1 | macOS Sonoma 14.4.1 | Apple M3 Pro | 18GB | 1TB | IDEA 2024.1 (CE), PyCharm 2023.3.4 (CE), Datagrip 2024.1 |
2 | Windows 11 Home 23H2 | 12th Gen Intel(R) Core(TM) i9-12900H | 32GB | 1TB | IDEA 2024.1 (CE), Datagrip 2024.1 |
3 | Ubuntu 22.04.4 (VM) | Apple M1 Pro | 16GB | 512GB | IDEA 2024.1 (CE), Datagrip 2024.1 |
Experiment 1: Different Import Methods
Method 1 (original script):
This method utilizes the java.sql
library. Firstly, we established a connection to our PostgreSQL server. Then we read all data from JSON files. Next, we iterated through each datum and created PreparedStatement
for each insert statement. Lastly, we called the executeUpdate()
method to execute each statement individually.
Method 2 (optimized script):
This method also utilizes the java.sql
library and employs the same data reading algorithm as Method 1. The difference is now we make use of the executeBatch()
method. So we iterated through the whole data, created each insert statement with PreparedStatement
, and added each PreparedStatement
to a batch for a batch execution.
Method 3 (running a .sql file):
We used a Java program to generate SQL insert statements and wrote them into a .sql
file by employing the same data reading algorithm mentioned above. Then we run the file in DataGrip.
Since we are using the same data reading algorithm across all three methods, we will use an average runtime for our subsequent tests. We initially gathered three different runtimes — 504 ms, 546 ms, and 552 ms — and calculated an average runtime of 534 ms.
Testing Environment | Method | Average Read Time (ms) | Write Time (ms) | Total Time (ms) | Throughput (statements/s) |
---|---|---|---|---|---|
1 | 1 | 534 | 206396 | 206930 | 8636.91 |
1 | 2 | 534 | 2114 | 2648 | 97632.92 |
1 | 3 | 534 | 13581 | 14115 | 15197.41 |
Table 2 illustrates varying performance metrics across different methods, with Method 2 showing the highest throughput and Method 1 having the slowest total time. This makes Method 2 the standard testing method in the upcoming experiments.
Experiment 2: Data Import with Different Data Volumes
Managing and importing data of varying volumes is a crucial aspect of ensuring the performance, scalability, and reliability of a database system.
Before the import process, we noticed that the ‘ride’ data was notably larger in volume compared to the others. Based on this idea, we decided to test data import with different volumes on the ride.json
only. Since the weight of data is not consistent, importing less volume for the other tables might result in a serious issue due to the connectivity between the tables.
Initially, we started by importing the full data (100% volume) for all the other tables except the rides_by_id_num
and rides_by_card_num
tables to ensure the consistency of our design. To manage this effectively, we adopted a phased import strategy for the ride
data, beginning with a 20% subset of the data, which equated to 20,000 records. This initial phase allowed us to assess the impact on system performance and make necessary adjustments to the import process without compromising the database's stability. After successful validation and performance tuning, we proceeded with importing 50% of the data, and finally, the remaining portion to complete the 100% data import. Note that we used Method 2 for all imports as it is the fastest.
Testing Environment | Method | Volume | Read Time (ms) | Write Time (ms) | Total Time (ms) | Statement Count | Throughput (statements/s) |
---|---|---|---|---|---|---|---|
1 | 2 | 20% | 507 | 808 | 1315 | 80165 | 99214.11 |
1 | 2 | 50% | 521 | 1338 | 1859 | 130849 | 97794.47 |
1 | 2 | 100% | 534 | 2114 | 2648 | 203502 | 96263.95 |
From Table 3, as the volume of data increases (from 20% to 50% and to 100%), both read and write times tend to increase, resulting in longer total times for the operations. However, these numbers do not give any insightful meaning as we had different import volumes. If we look at the number of throughputs instead, the throughput (statements/s) gradually decreases with increasing data volume, indicating that the system becomes less efficient at processing statements as the workload increases.
Experiment 3: Data Import on Different Operating Systems
In testing the process of importing data on different OSs, we employed the fastest import method, which is Method 2, with 100% import volume.
Note that when running the Java import script on Linux through a virtual machine, it is essential to consider the unfair disadvantage on performance and resource allocation, as virtualization can introduce overhead and affect the efficiency of the system.
Based on the analysis above, it is clear that Environment 1 (macOS) shows the best performance with the shortest total time of 2648 ms and the highest throughput at 97,632.92 statements/s. In contrast, Environment 2 (Windows) is the slowest with a total time of 4117 ms and the lowest throughput at 60,669.02 statements/s. It is worth-noting that Linux Ubuntu, while running as a VM, still outperformed bare-metal Windows.
Experiment 4: Data Import with Various Programming Languages
In this experiment, we used Method 2 mentioned above for Java code since it is the fastest. Meanwhile, for Python, we wrote an import method that made use of Psycopg2 to communicate with the PostgreSQL database server.
Testing Environment | Programming Language | Read Time (ms) | Write Time (ms) | Total Time (ms) | Throughput (statements/s) |
---|---|---|---|---|---|
1 | Java | 534 | 2114 | 2648 | 96263.95 |
1 | Python | 390 | 7237 | 7627 | 28119.66 |
The data from Table 4 show that Java outperformed Python in both speed and throughput, indicating its superior efficiency for read-write operations.
Experiment 5: Data Import on Different Databases
We developed three different import methods for both PostgreSQL and MySQL, but we only ran an experiment on Method 2 as it is a developer’s choice for importing data. Both PostgreSQL and MySQL have a similar database implementation design (DDL wise) and a similar import code design.
Testing Environment | Method | Database | Read Time (ms) | Write Time (ms) | Total Time (ms) | Throughput (statements/s) |
---|---|---|---|---|---|---|
1 | 2 | PostgreSQL | 534 | 2114 | 2648 | 96263.95 |
1 | 2 | MySQL | 534 | 42315 | 42849 | 4809.22 |
Despite both being SQL-based databases, PostgreSQL is approximately 16 times faster when it comes to write time. This might be due to differences in database engine architecture and driver implementation (postgresql-42.2.5.jar
for PostgreSQL and mysql-connector-j-8.3.0.jar
for MySQL).
Part 2 of the project focuses on providing the basic functionality of accessing the database system by building a backend library which exposes a set of application programming interfaces (APIs). Note that there is also an additional dataset needed to be imported, hence requires an updated database implementation (can be found in ./Project2/DataImport
).
[Read the detailed requirements]
./Project2/DataImport/src/main/sql/ddl.sql
./Project2/DataImport/src/main/java/ImportScript.java
./Project2/ShenzhenMetro
) with Maven (IntelliJ IDEA IDE is recommended)