Adler Neves 3bf311d2b8 .fix: aggregate score		2024-02-17 21:02:34 -03:00
sb3-api	fix: recover labels	2024-02-16 01:50:22 -03:00
.gitignore	fix: remove zip	2024-02-12 14:26:02 -03:00
ClusterMetrics.cs	initial commit	2024-02-12 14:22:50 -03:00
ClusterRelabeler.cs	fix: recover labels	2024-02-16 01:50:22 -03:00
KMeans.cs	fix: recover labels	2024-02-16 01:50:22 -03:00
LICENSE	fix: 2-digit tests	2024-02-12 15:40:30 -03:00
Makefile	fix: 2-digit tests	2024-02-12 15:40:30 -03:00
Program.cs	.fix: aggregate score	2024-02-17 21:02:34 -03:00
README.md	fix: recover labels	2024-02-16 01:50:22 -03:00
SWITRS.cs	fix: recover labels	2024-02-16 01:50:22 -03:00
Scalers.cs	initial commit	2024-02-12 14:22:50 -03:00
TrainSplit.cs	fix: recover labels	2024-02-16 01:50:22 -03:00
Utils.cs	fix: recover labels	2024-02-16 01:50:22 -03:00
cskmeans.csproj	fix: recover labels	2024-02-16 01:50:22 -03:00
cskmeans.sln	initial commit	2024-02-12 14:22:50 -03:00
model-as-a-db-row.txt	fix: recover labels	2024-02-16 01:50:22 -03:00
query.sql	fix: recover labels	2024-02-16 01:50:22 -03:00
switrs.sqlite.txt	initial commit	2024-02-12 14:22:50 -03:00

README.md

C# KMeans + SpringBoot3 API

This started as a self-challenge over the 2024 Carnival holiday weekend and was later extended into an optional exercise for an MBA in AI for Businesses. Its maturity level is proof of concept.

This repository reimplements KMeans in plain C# as an externally schedulable job¹, plus a minimal-effort wrapper that exposes it as a Spring Boot 3 API². It also provides a basic form of outlier detection.

¹ Deployable via crontab, your internal corporate solution, AWS Batch, Azure Batch, GCP Preemptible VMs, Oracle Burstable Instances, or anything else that runs a command line at a scheduled time. May require tweaking to fit your needs.
² Deployable in your corporate Tomcat, WildFly, Docker, Kubernetes, Cloudflare Workers, AWS Lambda, GCP Cloud Functions, or anything else you use to serve a Java Web API. May require tweaking to fit your needs.

Example lifecycle

Train:

make

It will produce an output like this:

dotnet run --configuration Release
Started         2/16/2024 1:13:00 AM (0.0136183s)
Loaded          2/16/2024 1:13:10 AM (9.8201451s, DatasetLines=2671097)
Shuffled        2/16/2024 1:13:10 AM (0.1334194s)
Scaled          2/16/2024 1:13:10 AM (0.0696958s)
> TrainCentroids=[0.32,0.49,0.52,0.5,0.56]
SplitTTV        2/16/2024 1:13:13 AM (2.7777124s, Train=2569595, Test=48222, Validation=53280)
10x Fits        2/16/2024 1:18:16 AM (302.8325479s)
> WSSs=[0.23768,0.19774,0.17236,0.15302,0.13696,0.11922,0.11058,0.102,0.09685,0.09028]
10x Tests       2/16/2024 1:18:16 AM (0.1831497s)
> SILs=[0.2277,0.22823,0.21612,0.23022,0.23678,0.24369,0.24482,0.24524,0.2472,0.24839]
> AGGs=[-0.00998,0.03049,0.04376,0.0772,0.09982,0.12447,0.13424,0.14324,0.15035,0.15811]
10x Silhouettes 2/16/2024 1:39:22 AM (1266.4811917s)
> OutlierScore(o=1.6, Min=0, Avg=0.4713820048875287, Max=0.8736922864125183)
Scalers=[
    [32.5, 42.03755], [-124.49859, -114.1], [0, 6], [1, 366], [0, 86340]
]
Centroids=[
    [0.16, 0.62, 0.77, 0.76, 0.72],
    [0.57, 0.28, 0.83, 0.76, 0.59],
    [0.57, 0.28, 0.83, 0.25, 0.57],
    [0.18, 0.61, 0.82, 0.6, 0.25],
    [0.17, 0.62, 0.27, 0.25, 0.29],
    [0.16, 0.63, 0.24, 0.79, 0.52],
    [0.16, 0.62, 0.76, 0.23, 0.65],
    [0.17, 0.62, 0.17, 0.3, 0.73],
    [0.57, 0.28, 0.27, 0.22, 0.58],
    [0.56, 0.29, 0.26, 0.74, 0.72],
    [0.56, 0.29, 0.28, 0.69, 0.27]
]
DescaledCentroids=[
    [34.026008, -118.0514642, 4.62, 278.4, 62164.799999999996],
    [37.936403500000004, -121.5869848, 4.9799999999999995, 278.4, 50940.6],
    [37.936403500000004, -121.5869848, 4.9799999999999995, 92.25, 49213.799999999996],
    [34.216759, -118.1554501, 4.92, 220, 21585],
    [34.1213835, -118.0514642, 1.62, 92.25, 25038.6],
    [34.026008, -117.9474783, 1.44, 289.35, 44896.8],
    [34.026008, -118.0514642, 4.5600000000000005, 84.95, 56121],
    [34.1213835, -118.0514642, 1.02, 110.5, 63028.2],
    [37.936403500000004, -121.5869848, 1.62, 81.3, 50077.2],
    [37.841028, -121.4829989, 1.56, 271.1, 62164.799999999996],
    [37.841028, -121.4829989, 1.6800000000000002, 252.85, 23311.800000000003]
]
Prevalence=[1,1,1,1,1,1,1,1,1,1,1]
CompensatedPrevalence=[2,0,1,1,1,3,4,1,1,1,1]
bestK=11 clusters
WSS=0.09427714334438726
Sil=0.24316338789249595
Agg=0.14888624454810867
Validation        2/16/2024 1:42:01 AM (158.4397816s)
All Done!         2/16/2024 1:42:01 AM (0.000619s; Total=1740.7518809s)
memusg: peak=3274104
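
Two relationships help when reading this log, and the numbers above are consistent with both: each `DescaledCentroids` row is the min-max inverse of the matching `Centroids` row under the `Scalers` ranges, and each `AGGs` entry equals the matching `SILs` entry minus the `WSSs` entry. The Java sketch below only illustrates that reading. The class and method names are hypothetical, and the assumption that the ten fits cover k = 2 through 11 is inferred from `bestK=11`, not taken from the repository's code.

```java
// Hypothetical helper for reading the training log; not part of the repository.
class LogReading {
    // Inverse of min-max scaling: maps a value in [0, 1] back to [min, max].
    static double descale(double scaled, double min, double max) {
        return min + scaled * (max - min);
    }

    // Assumption drawn from the log values: aggregate = silhouette - WSS.
    static double aggregate(double silhouette, double wss) {
        return silhouette - wss;
    }

    // Returns the k with the highest score, assuming scores[0] belongs to firstK.
    static int bestK(double[] scores, int firstK) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return firstK + best;
    }

    public static void main(String[] args) {
        // Values copied from the "WSSs" and "SILs" lines of the log.
        double[] wss = {0.23768, 0.19774, 0.17236, 0.15302, 0.13696,
                        0.11922, 0.11058, 0.102, 0.09685, 0.09028};
        double[] sil = {0.2277, 0.22823, 0.21612, 0.23022, 0.23678,
                        0.24369, 0.24482, 0.24524, 0.2472, 0.24839};
        double[] agg = new double[wss.length];
        for (int i = 0; i < wss.length; i++) {
            agg[i] = aggregate(sil[i], wss[i]);
        }
        System.out.println("bestK=" + bestK(agg, 2));

        // First centroid, latitude component: 0.16 scaled within [32.5, 42.03755].
        System.out.println("latitude=" + descale(0.16, 32.5, 42.03755));
    }
}
```

Under these assumptions the highest aggregate score is the last one, which corresponds to k = 11, and the descaled latitude agrees with the first `DescaledCentroids` row up to floating-point rounding.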

Run server:
```
make runapi
```
Open it in a browser: http://localhost:8080

Submit the request:

Field	Value
latitude	34.16449
longitude	-118.15798
date	2009-01-14
time	14:15:00

curl -s 'http://localhost:8080/model?latitude=34.16449&longitude=-118.15798&date=2009-01-14&time=14%3A15%3A00' | jq

See the response:

{
  "cluster": {
    "id": 6,
    "truePrevalence": {
      "id": 1,
      "label": "1-possible"
    },
    "compensatedPrevalence": {
      "id": 4,
      "label": "4-dead"
    }
  },
  "outlierScore": 0.5144809550353892
}
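
This response can be cross-checked by hand against the training log. The Java sketch below is an illustration under stated assumptions, not the API's actual code. It encodes the request as five features (latitude, longitude, day of week, day of year, seconds since midnight), min-max scales them with the `Scalers` ranges, and picks the nearest of the rounded `Centroids` by squared Euclidean distance. The class and method names are hypothetical, and the Sunday = 0 day-of-week convention is a guess that mirrors C#'s `DayOfWeek` enum.

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Hypothetical illustration of the cluster lookup; not the repository's API code.
class NearestCentroid {
    // {min, max} per feature, copied from the "Scalers" section of the training log.
    static final double[][] SCALERS = {
        {32.5, 42.03755}, {-124.49859, -114.1}, {0, 6}, {1, 366}, {0, 86340}
    };

    // Rounded centroids copied from the "Centroids" section of the training log.
    static final double[][] CENTROIDS = {
        {0.16, 0.62, 0.77, 0.76, 0.72},
        {0.57, 0.28, 0.83, 0.76, 0.59},
        {0.57, 0.28, 0.83, 0.25, 0.57},
        {0.18, 0.61, 0.82, 0.6, 0.25},
        {0.17, 0.62, 0.27, 0.25, 0.29},
        {0.16, 0.63, 0.24, 0.79, 0.52},
        {0.16, 0.62, 0.76, 0.23, 0.65},
        {0.17, 0.62, 0.17, 0.3, 0.73},
        {0.57, 0.28, 0.27, 0.22, 0.58},
        {0.56, 0.29, 0.26, 0.74, 0.72},
        {0.56, 0.29, 0.28, 0.69, 0.27}
    };

    // Encodes a request as min-max scaled features (Sunday = 0 is an assumption).
    static double[] encode(double lat, double lon, LocalDate date, LocalTime time) {
        double[] raw = {
            lat,
            lon,
            date.getDayOfWeek().getValue() % 7, // ISO Monday=1..Sunday=7 -> Sunday=0
            date.getDayOfYear(),
            time.toSecondOfDay()
        };
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            scaled[i] = (raw[i] - SCALERS[i][0]) / (SCALERS[i][1] - SCALERS[i][0]);
        }
        return scaled;
    }

    // Returns the index of the centroid with the smallest squared Euclidean distance.
    static int nearest(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < CENTROIDS.length; c++) {
            double dist = 0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - CENTROIDS[c][i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = encode(34.16449, -118.15798,
                LocalDate.parse("2009-01-14"), LocalTime.parse("14:15:00"));
        System.out.println("cluster=" + nearest(x));
    }
}
```

With the request above, this sketch selects index 6. Position 6 (counting from zero) of `Prevalence` and `CompensatedPrevalence` in the training log holds 1 and 4, which agrees with the `id` fields in the response. The `outlierScore` is not reproduced here because its formula is not described in this README.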

Performance

The dataset contains 2,671,097 rows by 4 columns, each value stored as a double (8 bytes), which is at least 85,475,104 bytes (81.5 MiB).

The single-threaded³ C# code performance was evaluated on these systems:

Hardware	7900X	MacMini M2	Dell 3511 i5	i7-4790
Form factor	Desktop	Desktop	Laptop	Desktop
Processor	Ryzen 9 7900X	Apple M2	Intel i5-1135G7	Intel i7-4790
Cache L3	64MB	8MB	8MB	8MB
Max Frequency	4.70 GHz	3.48 GHz	2.40 GHz	3.60 GHz
Max Turbo	5.70 GHz	3.48 GHz	4.20 GHz	4.00 GHz
Storage Type	SSD	SSD	SSD	HDD
RAM	4×32GB @ DDR5-4000	16GB @ 6400MT/s	2×8GB @ DDR4-2666	2×8 GB @ DDR3-1366
Kernel	Linux 6.7.4-zen1	Darwin 23.0.0	Linux 6.7.0-zen3	Linux 6.6.8-arch-1

Since the dataset is larger than the L3 cache of every one of these systems, we should see some memory buses saturated.

³ There are parallelization paths, marked by the “10x” prefix in the stage names, but I believe that in a corporate environment many jobs would be running in parallel, so the predictability of a stable resource allocation matters more.

RAM resource

Memory was measured by watching the numbers on each system's resource monitor: htop on Linux and Activity Monitor on macOS.

System	7900X	MacMini M2	Dell 3511 i5	i7-4790
RAM usage	3.2 GB	1.1 GB⁴	3.2 GB	3.2 GB

⁴ Apple tries to cheat by compressing processes' memory, but this backfires when the data has to be decompressed before it can be used.

Processor resource

These timings, in seconds, are measured by the program itself.

Stage	7900X	MacMini M2	Dell 3511 i5	i7-4790
Started	0.0145117	0.030513	0.019999	0.0237424
Loaded	3.5465618	6.516409	7.7220067	8.1680365
Shuffled	0.0226329	0.02862	0.0641004	0.048993
Scaled	0.0583177	0.070994	0.0663303	0.0903944
SplitTTV	0.7055008	1.027031	1.2220999	1.3453164
10x Fits	194.6585428	237.454974	309.4348727	405.7263568
10x Tests	0.1762073	0.203529	0.275389	0.3948622
10x Silhouettes	1395.8565106	1499.822873	2118.9745899	2955.7305509
Validation	141.7715882	151.862509	217.2271935	299.9685991
All Done!	0.0005635	0.001758	0.0006435	0.0006692
Total	1736.8109373	1897.01921	2655.0072249	3671.4975209

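Therefore, the timings suggest that L3 cache size and memory bandwidth are more important than CPU “speed”: the Mac Mini M2 has the lowest peak clock of the four systems, yet it finishes well ahead of both Intel machines.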

Innovations

None. This is inherently no innovation, as:

Math-wise:
- KMeans is an old algorithm, known since at least 1956;
- WSS (Within-Cluster Sum of Squares) is just a fancy name for a specific kind of variance, a concept that has existed since at least 1923;
- Silhouette, the newest of it all, was proposed in 1987.
Programming-wise:
- C# is an old programming language, available since 2002;
- Spring Boot is an old web framework, available since 2014;
- Java is an old programming language, available since 1996.

The “youngest” item was 10 years old at the time this line was written. That's no innovation.

If you “innovate” using these technologies in your business, you are just removing a century's worth of technical debt from your outworn processes.

License

The implementation is licensed under MIT-0 (MIT No Attribution), which in practice works much like Public Domain: no attribution is required.
