Refine dataset preview page #209
Comments
After group discussion, we are going to use a tabbed interface for the dataset details page. This should provide the screen real estate needed to add more visual analysis. We want to expand the page to provide some basic details about each of the dataset's features, as well as tools that let users explore relationships in the data. Rough draft image is here: pennai_datasets_rough1.pdf First-pass descriptions of the new tabs are:
The first UI task is to:
|
this issue may show how to display mpld3 figures in javascript: mpld3/mpld3#128 Look for |
set of d3 examples Made progress on the stacked bar charts in terms of raw functionality (about 80% complete; a few issues remain with the chart legend and the styling). Working with d3 is okay, but many of the tutorials, examples and resources date from different points in the library's development, and some key features explained in them have since changed; for the most part it's not too bad. Nevertheless, both Bill and Heather were right that my initial estimate for finishing this component was off; this might take a couple more days than I anticipated. I need to keep learning d3 to properly leverage its features, since I know the things I am trying to do are readily supported. However, I do have a major concern about generating some of these visualizations in real time in the browser/client: depending on typical dataset size (number of columns and rows), this may or may not be a problem. For reference the |
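As a rough illustration of the client-side cost mentioned above, the counts behind a stacked bar chart can be gathered in a single pass over the rows, so the work grows linearly with the number of rows for each feature. This is a minimal sketch, not the code used on the page; the helper name `countByFeatureAndClass` and the sample column names are hypothetical.

```javascript
// Hypothetical helper: count rows per (feature value, target class) pair.
// Returns Map(featureValue -> Map(classLabel -> count)).
function countByFeatureAndClass(rows, featureKey, targetKey) {
  const counts = new Map();
  for (const row of rows) {
    const feature = String(row[featureKey]);
    const cls = String(row[targetKey]);
    if (!counts.has(feature)) counts.set(feature, new Map());
    const perClass = counts.get(feature);
    perClass.set(cls, (perClass.get(cls) || 0) + 1);
  }
  return counts;
}

// Illustrative data only; column names are made up for this sketch.
const sampleRows = [
  { gender: "F", target: "yes" },
  { gender: "F", target: "no" },
  { gender: "M", target: "yes" },
  { gender: "F", target: "yes" },
];
const stacked = countByFeatureAndClass(sampleRows, "gender", "target");
```

Each inner map then supplies the segment heights for one stacked bar.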
Another possible concern is JavaScript's number precision when doing calculations. I noticed a discrepancy with the boxplots for the Here is the openml page for the banana dataset with boxplots - https://www.openml.org/d/1460 I'm using the calculations outlined here to create the local boxplot - https://www.d3-graph-gallery.com/graph/boxplot_basic.html The results are different: I downloaded the csv from openml and it generates the same, slightly different boxplot as the old banana dataset; here is the one that I am generating locally - The only experience I have with boxplots is from an intro statistics course I took in college, so I don't know what's wrong. The formulas appear to be correct, so my guess is either an issue inherent to JavaScript or the d3 library functions used to generate the statistics. It could also be another mistake. One last thing: on openml there are some datasets where the visualizations are weird - https://www.openml.org/d/1492 & https://www.openml.org/d/1504 |
The updated dataset preview page is just about done in terms of functionality. Three types of charts are generated: boxplots, a regular bar chart and stacked bar charts, all with basic styling. Both bar charts are mildly interactive in that one can hover over the bars and see the value. Barring any issues/bugs (from my understanding the charts are correct, and I fixed the legend labels) or significant design changes, the only thing left to do is change/add styling. I am testing this with the |
Further UI refinements & other tasks
Update - here is a screenshot of the aforementioned UI changes. The exact positioning & sizing still need to be fine-tuned, but I believe the general layout & format are close to what has been requested
Update - Double checking statistics for boxplot for banana test dataset (https://www.openml.org/d/1460) AT1
Whiskers
AT2
Whiskers
The formulas being used are
This is using d3 to sort the data and generate the quantiles/median; comparing these values against the same dataset loaded into python with pandas produces the same results with the above formulas. The only remaining discrepancy is that the min/max values used for the whisker lines of the boxplot for the banana dataset on openml still differ from those in the boxplot I am producing:
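The formulas themselves did not survive in this thread, so the following is only a sketch under the assumption that they are the usual 1.5 × IQR rule, with quartiles taken by linear interpolation (the R-7 method, which to my understanding is what both d3 and the pandas default use). The function names are hypothetical and d3 is deliberately not used, so the arithmetic is visible.

```javascript
// Linear-interpolation quantile (R-7) over an ascending-sorted array.
function quantileSorted(sorted, p) {
  const n = sorted.length;
  if (n === 0) return undefined;
  const i = (n - 1) * p;              // fractional index
  const i0 = Math.floor(i);
  const i1 = Math.min(i0 + 1, n - 1); // guard the upper edge
  return sorted[i0] + (sorted[i1] - sorted[i0]) * (i - i0);
}

// Assumed formulas: fences at q1 - 1.5 * IQR and q3 + 1.5 * IQR.
function boxplotStats(values) {
  const sorted = values.slice().sort((a, b) => a - b); // numeric sort on a copy
  const q1 = quantileSorted(sorted, 0.25);
  const median = quantileSorted(sorted, 0.5);
  const q3 = quantileSorted(sorted, 0.75);
  const iqr = q3 - q1;
  return {
    q1, median, q3, iqr,
    lowerFence: q1 - 1.5 * iqr,
    upperFence: q3 + 1.5 * iqr,
  };
}

const stats = boxplotStats([7, 1, 3, 5, 9]);
// For this input: q1 = 3, median = 5, q3 = 7, fences at -3 and 13.
```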
|
…ed phrase target to dependent column and arrange first and center content
I spent some time today trying to figure out what's going on with the whisker lines in the boxplots I am generating in the webpage, and compared them against plots/values generated with python pandas & matplotlib.

Firstly, I did not realize that there are different kinds of boxplots that put the whisker lines at different values (I'm going off the wiki page for boxplots - https://en.wikipedia.org/wiki/Box_plot). I was under the impression that the whisker lines were always calculated according to the formula in the comment above. There are other formats where the whisker lines are simply the min/max values of the dataset; when the formula above is used to get the whisker lines, the resulting chart is apparently called a Tukey boxplot - this is the plot I am trying to make. With that said, I believe I was not calculating the min/max values correctly before, which was, at least in part, causing the discrepancy.

Most of my effort today was spent looking at pandas and matplotlib to compare the statistics generated by those libraries with the values I am generating in the webpage/javascript; other than the min & max values for the whisker lines, all the other values appear to be correct. After looking at the matplotlib documentation for configuring the whisker lines (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html), I changed how the min/max values are set. According to the docs, the lower whisker goes no lower than the smallest value in the data, and likewise the upper whisker goes no higher than the largest. Then I used the banana dataset to make a boxplot with python, and it matches the boxplots I am generating on the webpage. I believe both of these boxplots are essentially equivalent. However, I do not know what type of boxplot is being used on the openml website, as that one is different.

In short, the boxplot whisker lines are now only as small as the min value in the data or as large as the max. Before, if the calculation went out of bounds, so to speak, the whisker lines would be grossly off; they now at least match the boxplots generated with pandas/matplotlib. I also tried a few of the other datasets: tokyo1, iris and german. Going forward I am assuming that using the min/max values of the data as lower/upper bounds for the whisker lines is a suitable method.

I also spent time trying to change the styling. I am having a bit of trouble centering the column content of the grid with the existing semantic ui element, so I'll have to try something else to replicate the design of the openml chart. For example, there's a flag to center/align all the column content, but I'm not sure how to specify different options for different columns (this is without css, just vanilla semantic ui - https://react.semantic-ui.com/collections/grid/). I'll have to use a different semantic ui element or try some specific css styling. In any case, here is a screenshot with the charts shifted over. I also changed how the bar chart colors are chosen: before it was just two hardcoded colors, and now they are chosen with this function on a continuous scale https://github.com/d3/d3-scale-chromatic#cyclical. If there are other color choices/schemes that are preferable, please let me know. |
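To make the two whisker conventions discussed above concrete, here is a small sketch comparing them. The first clamps the 1.5 × IQR fences to the data's min/max, as described in the comment. The second places each whisker at the most extreme data point that still lies inside the fences, which is how the matplotlib documentation describes its default. The function names are hypothetical, and whether openml uses either convention is not known from this thread.

```javascript
// Min and max in one pass (avoids spreading large arrays into Math.min/max).
function extent(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return [min, max];
}

// q1 and q3 are assumed to have been computed already.
function whiskers(values, q1, q3) {
  const iqr = q3 - q1;
  const lowerFence = q1 - 1.5 * iqr;
  const upperFence = q3 + 1.5 * iqr;
  const [min, max] = extent(values);

  // Variant 1: fences clamped to the data range (as described in the comment).
  const clamped = {
    low: Math.max(min, lowerFence),
    high: Math.min(max, upperFence),
  };

  // Variant 2: most extreme data points that fall inside the fences.
  const inside = values.filter((v) => v >= lowerFence && v <= upperFence);
  const [low, high] = extent(inside);
  const nearest = { low, high };

  return { clamped, nearest };
}

// With an outlier at 30 the two variants disagree on the upper whisker.
const w = whiskers([1, 3, 5, 7, 30], 3, 7);
// w.clamped -> { low: 1, high: 13 }, w.nearest -> { low: 1, high: 7 }
```

The two variants agree whenever no data point lies beyond a fence, which may be why some datasets match and others do not.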
Also found some posts about the question of whether plotly.js sends data to any third-party servers. It appears to be safe to use. This library should also make it significantly easier to make some of the charts, especially the more advanced ones. |
Hi @joshc0044, nice research. It will be very helpful if plotly is something we can use. Just a note to clarify how the colors should work. In the bar chart for the target feature, each bar/class should be a different color. In the stacked bar charts generated for categorical and ordinal features, the coloring for each of the classes should match the color of the corresponding class in the target chart. Please use https://www.openml.org/ for reference, here are some specific examples: (https://www.openml.org/d/4541 (see the gender, race and age features), https://www.openml.org/d/50) |
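One way to get the consistent coloring described above is to build a single class-to-color lookup from the target column and reuse it for every chart. The sketch below assumes a fixed categorical palette; the palette values and the name `buildClassColors` are illustrative only and are not taken from the project.

```javascript
// Illustrative categorical palette (hypothetical choice of colors).
const PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"];

// Assign each distinct class label a color in order of first appearance.
// Colors repeat if there are more classes than palette entries.
function buildClassColors(classLabels, palette = PALETTE) {
  const colors = new Map();
  for (const label of classLabels) {
    const key = String(label);
    if (!colors.has(key)) {
      colors.set(key, palette[colors.size % palette.length]);
    }
  }
  return colors;
}

// Built once from the target column, then shared by the target bar chart
// and by every stacked bar chart.
const classColors = buildClassColors(["no", "yes", "no", "yes"]);
```

Looking up `classColors.get(label)` in both the target chart and the stacked charts keeps a class the same color everywhere.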
Thank you for the feedback; I changed the colors of the stacked bar charts to use the same unique class-to-color mapping. This is from a subset of the diabetes openml example (https://www.openml.org/d/4541); as the entire dataset was too large to upload in pennAI, I copied the first two thousand rows and the last two thousand rows just to get a small test example. I also changed the styling of the chart x-axis labels by rotating them 45 degrees. Using this diabetes dataset highlighted some issues - in openml they appear to have a cutoff when the resulting chart has too much stuff (for keys Rotating the labels for the boxplot also resolved some issues where those values were too long and became jumbled together. Using the diabetes test dataset also highlighted an issue with some of the column keys; if the column keys contained a And for completeness here is the chart that does not appear in openml - they display the text for reference in the chart above there are 357 unique values for diag_1 (from the subset I have of the entire dataset; all 700 or so would be even worse) I also attempted to upload the tic-tac-toe dataset from openml and ran into a problem; it causes this error
Every column is nominal/categorical, am I doing something wrong or are there other datasets I can try to test? Update/edit (here is the file I used which is a subset of the entire diabetes dataset): |
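Two of the checks raised in this comment can be sketched in a few lines: flagging column names that contain a `.`, and skipping a chart when a feature has too many unique values to draw legibly. The cutoff of 50 is a made-up placeholder (the value openml uses is not stated in this thread), and both function names are hypothetical.

```javascript
// Hypothetical cutoff; the actual threshold is not given in the thread.
const MAX_CATEGORIES = 50;

// Return the column names that contain a "." character.
function columnsWithDot(columnNames) {
  return columnNames.filter((name) => name.includes("."));
}

// Decide whether a categorical feature has few enough unique values to chart.
function shouldDrawChart(values, maxCategories = MAX_CATEGORIES) {
  return new Set(values).size <= maxCategories;
}

const flagged = columnsWithDot(["age", "diag.1", "race"]);
const drawSmall = shouldDrawChart(["a", "b", "a"], 2);
const drawLarge = shouldDrawChart(["a", "b", "c"], 2);
```

A feature such as diag_1, with 357 unique values in the subset, would fail the second check under any reasonable cutoff and could fall back to a text message instead of a chart.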
… each dependent column value, move chart legend, add pop up tooltip for X-axis and add check for `.` in dataset column names
…s are non-numeric Addresses the tic-tac-toe dataset error during upload found while testing #209
After the feedback in today's meeting that the median lines were off, I looked at the boxplot piece; here is the code generating the median values from d3
At first I was just using the value of |
made a basic unit test to check statistics used for boxplots - for testing using values for
|
those should be equivalent, i'm puzzled as to why they wouldn't be. did you confirm that |
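One possible cause worth ruling out, offered as an assumption rather than a confirmed diagnosis: a quantile routine that expects already-sorted input will return a wrong median when handed raw data (to my understanding, older d3-array releases document `d3.quantile` as taking a sorted array). JavaScript's default `sort()` also compares values as strings, which mis-orders numbers. The sketch below uses a hypothetical stand-in function rather than d3 itself.

```javascript
// Stand-in for a quantile function that assumes ascending-sorted input (R-7).
function quantileR7(sorted, p) {
  const i = (sorted.length - 1) * p;
  const i0 = Math.floor(i);
  const i1 = Math.min(i0 + 1, sorted.length - 1);
  return sorted[i0] + (sorted[i1] - sorted[i0]) * (i - i0);
}

const raw = [9, 1, 7, 3, 5];
const sortedCopy = raw.slice().sort((a, b) => a - b); // numeric comparator

const medianUnsorted = quantileR7(raw, 0.5);      // reads raw[2], i.e. 7
const medianSorted = quantileR7(sortedCopy, 0.5); // true median, 5

// Pitfall: without a comparator, numbers are ordered as strings.
const lexicographic = [10, 9, 1].sort(); // [1, 10, 9]
```

If the median and the 0.5 quantile disagree on the page, comparing both against a numerically sorted copy of the column is a quick check.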
Restarting this effort with #309 |