This project is an interactive dashboard visualizing the Smithsonian Institution’s Smithsonian Open Access dataset, which contained 11.9 million records at the time. The dashboard gives insight into the composition of the Smithsonian’s collections, including what the items are and when and where they come from. Specifically, the visualization looks at the unit (such as the National Museum of American History or the Human Studies Film Archives), the country, and the age or time period items come from (particularly within the last couple of centuries). Interaction enables filtering to a specific unit, so trends can be compared between units as well as against the whole.
Since the Open Access dataset contains almost 12 million records, the data is, in its own way, opaque. When I started working on this, there was no published documentation on the API (thankfully, Matt Miller had a blog post that helped me get started, and shout-out to Dr. Decker, who spotted it). So, I had to fall back on the tried-and-true approach: randomly sample, compute statistics, and log anomalies.
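To make the sample-summarize-log loop concrete, here is a minimal sketch of how it could look in Python. It is not the code I actually ran: it assumes the records arrive as one JSON object per line, and the field name "unit" and the helper names reservoir_sample and summarize are hypothetical, chosen only for illustration. Reservoir sampling is used because it draws a uniform random sample in a single pass without knowing the number of lines in advance.

```python
import json
import random
from collections import Counter


def reservoir_sample(lines, k, seed=0):
    """Return up to k items chosen uniformly at random from an iterable
    of unknown length, reading it only once (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


def summarize(records, field):
    """Count the values of `field` and collect records where it is missing."""
    counts = Counter()
    anomalies = []
    for rec in records:
        value = rec.get(field)
        if value is None or value == "":
            anomalies.append(rec)
        else:
            counts[value] += 1
    return counts, anomalies


# Tiny in-memory stand-in for the real file (made-up records).
lines = [
    '{"id": 1, "unit": "NMAH", "country": "United States"}',
    '{"id": 2, "unit": "HSFA", "country": "Mexico"}',
    '{"id": 3, "unit": "NMAH", "country": "usa"}',
    '{"id": 4, "country": "France"}',
    '{"id": 5, "unit": "NMNH", "country": "Peru"}',
]

sample = reservoir_sample(lines, k=3)
counts, anomalies = summarize([json.loads(line) for line in sample], "unit")
print(counts.most_common())
print(f"{len(anomalies)} sampled record(s) missing 'unit'")
```

With a real file, the same function can take the open file object directly (for example, reservoir_sample(open(path), 1000)), so the full dataset never has to fit in memory.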
Additionally, at 26 GB uncompressed, the dataset was too large for me to load at once, much less search through interactively. With the laptop I had at the time, a simple keyword search using grep took about 5 minutes to complete. To make the larger project feasible, I sampled the data to learn its basic structure and then processed it with Python and Jupyter Lab. After logging summaries and anomalies, I used string processing to clean up typos, inconsistencies, and similar issues. Finally, I created a JSON file containing the aggregated data.
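As a rough illustration of the clean-then-aggregate step, the sketch below normalizes a country string and rolls records up into nested counts that can be written out as JSON. Everything specific here is an assumption for the example: the field names "unit" and "country", the alias table, and the helper names clean_country and aggregate are hypothetical, and the real cleanup rules were driven by the anomalies found in the data.

```python
import json
import re
from collections import Counter

# Hypothetical alias table; real entries would come from the anomaly log.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
}


def clean_country(raw):
    """Collapse whitespace and map known spelling variants to one name."""
    text = re.sub(r"\s+", " ", raw or "").strip()
    if not text:
        return None
    # Lookup key: lowercase ASCII letters and spaces only, so "U.S.A." -> "usa".
    key = re.sub(r"[^a-z ]", "", text.lower())
    # Fall back to crude title-casing when no alias matches.
    return COUNTRY_ALIASES.get(key, text.title())


def aggregate(records):
    """Count records per unit and per cleaned country."""
    totals = {}
    for rec in records:
        unit = rec.get("unit") or "Unknown"
        country = clean_country(rec.get("country")) or "Unknown"
        totals.setdefault(unit, Counter())[country] += 1
    # Convert Counters to plain dicts so the result serializes cleanly.
    return {unit: dict(counter) for unit, counter in totals.items()}


# Made-up records standing in for the processed sample.
records = [
    {"unit": "NMAH", "country": "U.S.A."},
    {"unit": "NMAH", "country": " usa"},
    {"unit": "NMAH", "country": "mexico"},
    {"unit": "HSFA", "country": None},
]

summary = aggregate(records)
print(json.dumps(summary, indent=2, sort_keys=True))
```

Keeping the output as a small nested object of counts, rather than the raw records, is what lets a browser-side visualization load it quickly.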
At a technical level, the data was processed using command-line tools (awk), Python, Jupyter Lab, and regular expressions. The visualization was created with D3 and a fork of Semantic UI.