Better data viz for throughput #85
Yes. The bigger explanation of this issue: as people move around the map, the simulation generates events every time somebody crosses a road or intersection.

My first attempt at doing this was to store the raw events -- all of them. But of course storing all the raw events took way too much space; the prebaked data file was huge. So next I switched to the per-hour buckets. The partially-filled current bucket already looks strange in live plots, but it's a way bigger problem for comparing before/after counts. If you have measured 50 cars at 3:30am, but you're comparing to the entire 3-4am prebaked bucket -- which has, say, 200 -- then it looks like that road has less traffic than usual. By 3:59am, the comparison becomes valid, but then the problem starts over again for the 4-5am bucket.

One of my attempts to compromise between meaningful raw data and smaller hour-bucketed data was using https://crates.io/crates/lttb to downsample the raw data. https://github.com/dabreegster/abstreet/tree/lttb was the attempt. This didn't work -- if you downsample 100,000 points covering 24 hours, you get a nice line plot. But if you downsample 1,000 points covering just one hour, that shape doesn't at all match up with the first 1/24th of the full line.

There might be a totally different approach to measuring, storing, and comparing throughputs. I don't understand the field of downsampling at all. End braindump.
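To make the partial-bucket bias concrete, here's a tiny sketch of it (hypothetical names and numbers, not code from the repo). The second function is the fraction-of-the-hour scaling idea -- essentially the "linear interpolation approach" tried in a later comment -- and it only helps to the extent that traffic really is uniform within the hour:

```rust
/// Naive comparison: an in-progress bucket vs. a full prebaked bucket.
/// At 3:30am the live bucket only holds half an hour of events, so the
/// ratio is biased low until the hour completes.
fn naive_ratio(live_count_so_far: usize, prebaked_full_hour: usize) -> f64 {
    live_count_so_far as f64 / prebaked_full_hour as f64
}

/// One possible correction: scale the baseline bucket by the fraction of
/// the hour that has elapsed, assuming events are uniform within the hour.
fn fraction_scaled_ratio(
    live_count_so_far: usize,
    prebaked_full_hour: usize,
    minutes_into_hour: f64,
) -> f64 {
    let expected_so_far = prebaked_full_hour as f64 * (minutes_into_hour / 60.0);
    live_count_so_far as f64 / expected_so_far
}

fn main() {
    // 50 cars measured by 3:30am vs. a 200-car prebaked 3-4am bucket.
    println!("{:.2}", naive_ratio(50, 200)); // 0.25 -- looks like way less traffic
    println!("{:.2}", fraction_scaled_ratio(50, 200, 30.0)); // 0.50 -- a fairer comparison
}
```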
Okay, I'm taking a look at the lttb branch, but I'm hitting an error. Is this just me or do you see it as well?
The binary map format has changed; I'll rebase the branch against master |
Hmm, it seems I'm still running into issues. Right now, I'm seeing errors when fetching the data using the updater. Output below:
I think you need to rebase against master again; the URLs changed earlier today.
IIUC, the main problem is that you weren't able to calculate accurate event counts without storing all the events. I think we can solve this with dynamic programming: we store the running total of events from the start of the simulation at every time increment. So to get the count of events between 3:00 and 4:00, we subtract the total at 3:00 from the total at 4:00. We don't have to store all the events, just the total event count at every minute. So the memory required scales linearly with the length of the simulation, which is actually constant: 24 hours -> 1440 minutes, or 1440 numbers per mode. Do you think this could work? If yes, I'll try implementing it.
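A minimal sketch of that idea in Rust (hypothetical names, one counter per road/intersection per mode -- not the actual types from sim/src/analytics.rs):

```rust
/// Running totals sampled once per simulated minute: totals[m] is the
/// number of events from t=0 through minute m, inclusive.
struct CumulativeCounts {
    totals: Vec<u32>,
}

impl CumulativeCounts {
    /// Called once per simulated minute with the current running total.
    fn push_minute(&mut self, running_total: u32) {
        self.totals.push(running_total);
    }

    /// Events in the half-open window [from_min, to_min): one subtraction,
    /// no raw events needed. 3:00-4:00 is query(180, 240).
    fn query(&self, from_min: usize, to_min: usize) -> u32 {
        let before = if from_min == 0 { 0 } else { self.totals[from_min - 1] };
        self.totals[to_min - 1] - before
    }
}

fn main() {
    let mut counts = CumulativeCounts { totals: Vec::new() };
    let mut total = 0;
    for minute in 0..1440 {
        // Fake data: one event every other minute.
        total += if minute % 2 == 0 { 1 } else { 0 };
        counts.push_minute(total);
    }
    assert_eq!(counts.query(180, 240), 30); // 3:00-4:00 saw 30 events
}
```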
Hmm, I hadn't considered this approach before -- it's interesting. The throughput is tracked per road segment and intersection, so the total storage would depend a bit on map size. Our smaller maps have around 1,000 roads+intersections, and the larger ones around 10,000. So with 4 modes and minute resolution, that'd be about 1440 * 4 * 10,000 = 57,600,000 counts. I bet there are lots of optimizations possible:
Looking at how the current hour-granularity is stored:
This is pretty silly. So anyway, I would love some implementation help here! I think switching to this count-at-every-time approach makes perfect sense, and we can play around with the granularity (1 min, 5 mins, 10 mins).
Let me know if you get stuck, and thanks for looking into this!
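To make the scale concrete, a back-of-the-envelope check on the numbers above, using one obvious layout -- a single flat array indexed by (object, mode, minute). All names and sizes here are illustrative, not a commitment to any particular design:

```rust
const OBJECTS: usize = 10_000; // roads + intersections on a large map
const MODES: usize = 4;
const MINUTES: usize = 1440;

/// Flat-array cell for a given (object, mode, minute).
fn index(object: usize, mode: usize, minute: usize) -> usize {
    (object * MODES + mode) * MINUTES + minute
}

fn main() {
    let cells = OBJECTS * MODES * MINUTES;
    assert_eq!(cells, 57_600_000);
    // With u16 counts (65,535 events per object/mode/minute is plenty):
    println!("{} MB", cells * std::mem::size_of::<u16>() / 1_000_000); // 115 MB
    // The same layout at 5-minute granularity:
    println!("{} MB", cells / 5 * 2 / 1_000_000); // 23 MB
    let _cell = index(42, 1, 210); // e.g. mode 1 on object 42 at 3:30am
}
```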
Great! I'll get started on implementing it.
HN is the new Stack Overflow: https://news.ycombinator.com/item?id=26401935
Inspired by yesterday, I tried the linear interpolation approach every hour. It didn't seem to help. Much of the time, the before count for a road is very small -- less than 5 -- so any percentage of that is also tiny. Should also revisit the color scheme. Going to add tooltips with exact counts, to help debug. To preserve some of the code:
…ter counts of the data being displayed. #85 (The before counts are still bucketed on the hour mark)
recorded analytics and code that later summed things up, making the relative throughput layer more confusing than it is already. #85
If the problem is just that the counts are too low, then you can probably fix it by calculating (a + 1) / (a + b + 2) instead of merely a/b (this changes the scale from 0-to-infinity to 0-to-1). If you need a mathematical justification: (a + 1) / (a + b + 2) is the posterior mean you get when you fit a Beta distribution to this kind of count data. This quantity is also symmetric (if you swap 'after' and 'before', you'll just get 1 minus this value) and bounded (so no near-infinities), which I think will help.
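For reference, the suggested formula in runnable form, with its claimed properties checked (the function name is mine):

```rust
/// (a + 1) / (a + b + 2): the smoothed ratio suggested above.
fn smoothed(after: u32, before: u32) -> f64 {
    (after as f64 + 1.0) / (after as f64 + before as f64 + 2.0)
}

fn main() {
    // Bounded: the "0 before, 2 after" road becomes 0.75, not an infinite increase.
    println!("{:.2}", smoothed(2, 0));
    // Symmetric: swapping before and after gives 1 minus the value.
    assert!((smoothed(2, 0) + smoothed(0, 2) - 1.0).abs() < 1e-9);
    // Equal counts land exactly on the neutral midpoint.
    assert_eq!(smoothed(7, 7), 0.5);
}
```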
@dabreegster I'd like to try to help with this. Have you considered using more than one metric to visualize throughput? We could look at how a tool like Sysdig (a distributed-system monitoring tool) aggregates data (https://docs.sysdig.com/en/docs/sysdig-monitor/metrics/data-aggregation/): when downsampling the data to a 1-hour resolution (or lower), we could record 4 metrics:
When visualizing the data, I'd start by looking at the average and maximum throughput. The average should remain meaningful for hours that are not yet complete and give you an idea of the overall traffic. The maximum would let you see traffic spikes that you wouldn't otherwise notice when looking at the average. Backtracking a bit, we probably want to make sure we understand what actionable information we want to surface through this visualization.
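A sketch of what per-bucket aggregates in that style might look like (hypothetical struct; the elided list of 4 metrics above presumably maps to fields like these):

```rust
/// Aggregates for one hour bucket, fed per-minute throughput measurements.
#[derive(Default)]
struct BucketStats {
    count: u32, // sub-intervals recorded so far
    sum: f64,
    min: f64,
    max: f64,
}

impl BucketStats {
    /// Fold one sub-interval measurement (e.g. per-minute throughput) in.
    fn record(&mut self, value: f64) {
        if self.count == 0 {
            self.min = value;
            self.max = value;
        } else {
            self.min = self.min.min(value);
            self.max = self.max.max(value);
        }
        self.sum += value;
        self.count += 1;
    }

    /// The average stays meaningful even for a partially-filled bucket.
    fn average(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
}

fn main() {
    let mut hour = BucketStats::default();
    for per_minute in [3.0, 8.0, 5.0] {
        hour.record(per_minute);
    }
    println!("avg {:.1}, max {}", hour.average(), hour.max); // avg 5.3, max 8
}
```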
The bigger picture: you make some edits to roads (or traffic signal timing), run a simulation, and want to understand which areas are seeing more or less traffic relative to a baseline of no edits. That could clue you in to unexpected side effects of your change (you make one road car-free and expect vehicle traffic to divert to a parallel main road, but instead people cut through a neighbourhood). This is helpful to watch in real-time as the simulation runs, since some of the big changes might only show up at certain times of day.

So, the goal is for something like #85 (comment) to meaningfully summarize this information for someone. Currently there's only one built-in map with prebaked results from running a full baseline simulation.

There are maybe simpler and more effective approaches to this problem than a per-road color scale. And the current scale is possibly too misleading -- if you hover over a dark red road, you see "0 before, 2 after", which is a huge relative increase, but meaningless on an absolute scale.
That's clever; I hadn't thought of that! Trying something like Sysdig's approach makes sense to me. I'd be very surprised if this type of problem isn't well-solved in other contexts, and internet traffic is hardly a big domain leap. :)
Throughput is the number of people crossing a road or intersection over time. Check out `TimeSeriesCount` in `sim/src/analytics.rs`. The raw data is too big to store for most scenarios, so there's very basic compression into per-hour x mode counts. But the way that's later plotted in the live line plots (info panels for lanes and intersections) looks strange, because as the live sim progresses, the bucket for that hour fills up until it reaches the typical value at the end. It'd be better to store the more accurate shape of the throughput-over-time graph and make comparisons with that. I've tried using the `lttb` crate to do this, but the compressed shape only resembles the baseline by the end of the day.

That explanation was probably kind of nonsense, sorry. Basically, if you're interested in data viz, jump on this bug and I'll explain more clearly. :)
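For anyone orienting, the per-hour x mode compression described above might be shaped roughly like this. This is a guess at the idea for illustration -- made-up mode names, not the real `TimeSeriesCount`:

```rust
use std::collections::BTreeMap;

#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Mode {
    Walk,
    Bike,
    Car,
    Bus,
}

/// Per-object throughput, compressed to (hour, mode) buckets.
#[derive(Default)]
struct HourlyThroughput {
    // (hour 0..24, mode) -> people crossing this road or intersection
    counts: BTreeMap<(u8, Mode), u32>,
}

impl HourlyThroughput {
    /// Instead of storing the raw event, just bump the hour bucket.
    fn record(&mut self, hour: u8, mode: Mode) {
        *self.counts.entry((hour, mode)).or_insert(0) += 1;
    }
}

fn main() {
    let mut road = HourlyThroughput::default();
    road.record(3, Mode::Car); // a crossing during the 3-4am hour
    road.record(3, Mode::Car);
    assert_eq!(road.counts[&(3, Mode::Car)], 2);
}
```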