The Daily Parker

Politics, Weather, Photography, and the Dog

Archive import finished; final archive import starting today

On Friday, I used Arithmetic™ to predict that the 162-million-row weather data transfer from Weather Now v3 to v5 would end around 7pm last night. Let's check the logs:

2022-04-04 18:48:30.7196|INFO|Clearing v3 archival records for ZYTX
2022-04-04 18:49:27.7471|INFO|Moved 157,408,921 weather archives from v3 to v5
2022-04-04 18:49:27.7471|INFO|Finished importing; duration 4.04:14:55.0952715 

Nice prediction. (It logged 157 million rows because I made a performance tweak and re-started the app after 5 million.)

As I've mentioned, those 162 million rows only go back to September 2009. But v3 launched in January 2007. And it turns out I have a second archive, also containing about 25 GB of data, going back to August 2006. I had to think back to decisions I never documented to piece together why.

In August 2006, I had a single big machine that served as my domain controller, Web server, and Exchange endpoint, and a second big database server. In this photo from 31 July 2006, the Exchange/Web/DC server is the white one on the right ("DOPPELKUH") and the SQL server ("BULLE") is the big black one on the left:

Now, BULLE was a huge machine for the time, with about 200 GB of disk space and (I think) 16 GB of memory. That 200 GB expanse tempted me to turn off a data-purge feature from more parsimonious days, and apparently I deployed that change around 9am on 11 August 2006. (Sadly, the purge feature worked as designed, and I have no archival data before then.)

In October 2006, I finally bought a server rack and moved the database to an even bigger set of disks:

Everything ticked along until around the time I deployed the Weather Now v3.5 refresh and discovered that the un-purged data file had grown to 25 GB. So on 3 September 2009, I simply created a new database and changed the data connection strings to point to it. That new database kept growing until I switched the archival data store to the Cloud in 2013.

And now, I get to take advantage of the triviality of that change by making an equally trivial change to the import controller's data connection string. The 2006 data file has 118,774,028 rows covering 3,988 stations from 11 August 2006 to 3 September 2009. At 435 rows per second, the final archival import should finish around...let's see...1 pm on Friday. At that point, every single byte of data Weather Now has collected in the past 15 years will be available through the App for you to see.

More from the archives:

Quick update

The Apollo Chorus performed last night at the Big Foot Arts Festival in Walworth, Wis., so I haven't done a lot of useful things today. I did take a peek at the other weather archive I have lying around, and discovered (a) it has the same schema as the one I'm currently importing into Weather Now 5, and (b) it only goes back to August 2006.

Somewhere I have older archives that I need to find... But if not, NOAA might have some.

The update rolls on...

As of 17:16 CDT, the massive Weather Now v3 to v5 import had 115,441,906 records left to transfer. At 14:28 CDT yesterday, it was at 157,409,431, giving us a rate of ( 41,967,525 / 96,480 seconds = ) 435 records per second. A little more math gives us another 265,392 seconds to go, or 3 days, 1 hour, 43 minutes left.

So, OK then, what's the over-under on this thing finishing before 7pm Monday?

It's just finished station KCKV (Outlaw Field, Clarksville, Tenn.), with another 2,770 stations left to go. Because it's going in alphabetical order, this means it's finished all of the Pacific Islands (A stations), Northern Europe (B and E), Canada (C), Africa (D, F, G, H). There are no I or J stations (at least not on Earth). K is by far the largest swath as it encompasses all of the continental US, which has more airports than any other land mass.

Once it finishes the continental US, it'll have only 38 million left to do! Whee!

How long will this take?

It takes a while to transfer 162.4 million rows of data from a local SQL database to a remote Azure Tables collection. So far, after 4 hours and 20 minutes, I've transferred just over 4 million rows. That works out to about 260 rows per second, or 932,000 per hour. So, yes, the entire transfer will take 174 hours.

Good thing it can run in the background. Also, because it cycles through three distinct phases (disk-intensive data read, processor-intensive data transformation, and network-intensive data write), it doesn't really take a lot of effort for my computer to handle it. In fact, network writes take 75% of the cycle time for each batch of reports, because the Azure Tables API takes batches of 100 rows at a time.

Now, you might wonder why I don't just push the import code up to Azure and speed up the network writes a bit. Well, yes, I thought of that, but decided against the effort and cost. To do that, I would have to upload a SQL backup file to a SQL VM big enough to take a SQL Server instance. Any VM big enough to do that would cost about 67¢ per hour. So even if I cut the total time in half, it would still cost me $60 or so to do the transfer. That's an entire bottle of Bourbon's worth just to speed up something for a hobby project by a couple of days.

Speaking of cost, how much will all this data add to my Azure bill? Well, I estimate the entire archive of 2009-2022 data will come to about 50 gigabytes. The 2003-2009 data will probably add another 30. Azure Tables cost 6¢ per gigabyte per month for the geographically-redundant storage I use. I will leave the rest of the calculation as an exercise for the reader.

Update: I just made a minor change to the import code, and got a bit of a performance bump. We're now up to 381 rows per second, 46% faster than before, which means the upload should finish in only 114 hours or 4.7 days. All right, let's see if we're done early Monday morning after all! (It's already almost done with Canada, too!)

Strange data patterns

The data transfer from Weather Now v3 to v5 continues in the background. Before running it, I did a simple SQL query to find out how many readings each station reported between September 2009 and March 2013. The results surprised me a bit:

The v3 database recorded 162.4 million readings from 4,071 stations. Fully 75 of them only have one report, and digging in I can see that a lot of those don't have any data. Another 185 have fewer than 100, and a total of 573 have fewer than 10,000.

At the other end, Anderson AFB on Guam has 123,394 reports, Spangdahlem AB in Germany has 123,297, and Leeuwarden, Netherlands, has 119,533. In fact, seven Dutch weather stations account for 761,000 reports of the 162 million in the archive. I don't know why, but I'll find out at some point. (It looks like some of them have multiple weather recording devices with color designations. I'll do some more digging.)

How many should they have? Well, the archive contains 1,285 days of records. That's about 31,000 hourly reports or 93,000 20-minute updates—exactly where the chart plateaus. Chicago O'Hare, which reports hourly plus when the weather shifts significantly had 37,069 reports. Half Moon Bay, Calif., which just ticks along on autopilot without a human weather observer to trigger special reports, had 90,958. So the numbers check out pretty well. (The most prolific US-based station, whose 91,083 reports made the 10th most prolific in the world, was Union County Airport in Marysville, Ohio.)

Finally, I know that what the App has a lot of data sloppiness right now. After I transfer over these archives, I'll work on importing the FAA Airports database, which will fix the names and locations of most of the US stations.

Chugging along with data consolidation

Sunday night I finished moving all the Weather Now v4 data to v5. The v4 archives went back to March 2013, but the UI made that difficult to discover. I've also started moving v3 data, which would bring the archives back to September 2009. I think once I get that done then moving the v2 data (back to early 2003) will be as simple as connecting the 2009 import to the 2003 database. Then, someday, I'll import data from other sources, like NCEI (formerly NCDC) and the Met*, to really flesh out the archives.

One of the coolest parts of this is that you can get to every single archival report through a simple URL. For example, to see the weather in Chicago five years ago, simply go to https://wx-now.com/History/KORD/2017/03/30. From there, you can drill into each individual report (like the one from 6pm) or use the navigation buttons at the bottom to browse the data.

Meanwhile, work continues apace on importing geographic data. And I have discovered a couple of UI bugs, including a memory leak that caused the app to crash twice since launch. Oops.

* The Met has really cool archives, some of which go back to the 1850s.

Weather Now 5 is Live!

I've just switched the DNS entries for wx-now.com over to the v5 App, and I've turned off the v4 App and worker role. It'll take some time to transfer over the 360 GB of archival data, and to upload the 9 million rows of Gazetteer data, however. I've set up a virtual machine in my Azure subscription specifically to do that.

This has been quite a lift. Check out the About... page for the whole history of the application. And watch this space over the next few months for more information about how the app works, and what development choices I made (and why).

Just for posterity, here's what the v4 Current Weather page looked like:

Good night, v4. You had a good 8-year run. And good night, Katie Zoellner's lovely design, which debuted 15 years ago.

Sabotaging open-source code

Via Bruce Schneier, a developer who maintains one of the most important NPM packages in the world got pissed off at Russia recently, without perhaps thinking through the long-term consequences:

A developer has been caught adding malicious code to a popular open-source package that wiped files on computers located in Russia and Belarus as part of a protest that has enraged many users and raised concerns about the safety of free and open source software.

The application, node-ipc, adds remote interprocess communication and neural networking capabilities to other open source code libraries. As a dependency, node-ipc is automatically downloaded and incorporated into other libraries, including ones like Vue.js CLI, which has more than 1 million weekly downloads.

“At this point, a very clear abuse and a critical supply chain security incident will occur for any system on which this npm package will be called upon, if that matches a geolocation of either Russia or Belarus,” wrote Liran Tal, a researcher at Snyk, a security company that tracked the changes and published its findings on Wednesday.

“Snyk stands with Ukraine, and we’ve proactively acted to support the Ukrainian people during the ongoing crisis with donations and free service to developers worldwide, as well as taking action to cease business in Russia and Belarus,” Tal wrote. “That said, intentional abuse such as this undermines the global open source community and requires us to flag impacted versions of node-ipc as security vulnerabilities.”

Yeah, kids, don't do this. The good guys have to stay the good guys because it's hard to go back from being a bad guy.

Huge step for Weather Now v5 today

This weekend, I built the Production assets for Weather Now v5, which means that the production app exists. I haven't switched over the domain name yet, for reasons I will explain. But I've created the Production Deploy pipeline in Azure DevOps and it has pushed all of the bits up to the Production workloads.

Everything works, but a couple of features don't work perfectly. Specifically, the Search feature will happily find everything in the database, but right now, the database only has about 31,000 places. Also, I haven't moved any of the archival data over from v4, so the Production app only has data back to yesterday. (The Dev/Test app has data back to last May, so for about the next week it'll have more utility than Production.)

I'm going to kick the tires on the Production app for a week or so before turning off v4. I expect my Azure bill will be about $200 higher than usual this month, but my June bill should be about what my January bill is. Uploading all the archival data and the Gazetteer will cost many, many database units, and I'll keep v4 running for probably half of April, for example.

The Dev/Test version and the Production version have the same bits as of this post. Going forward, Dev/Test will get all the new features probably a week or two ahead of Prod, just like in the real world.

Plus, over the next few months I'll post explanations of how and why I did everything in v5.

I'm pretty excited. Everything from here out is incremental, so every deployment from here out should be very boring.

Small victories

I just finished upgrading an old, old, old Windows service to .NET 6 and a completely different back end. It took 6.4 hours, soup to nuts, and now the .NET 6 service is happily communicating with Azure and the old .NET Framework 4.6 service is off.

Meanwhile, the Post published a map (using a pretty lazy algorithm) describing county-by-county what sunrise times will look like in January 2024 if daylight saving time becomes permanent. I'd have actually used a curve tool but, hey, the jagged edges look much more "data-driven." (They used the center point of each county.)

Now it's 22:45 (daylight saving time), and I need to empty Cassie and go to bed. But I'm pretty jazzed by how I spent a rainy afternoon on PTO. It was definitely more rewarding than tramping out in the rain to a couple of breweries for the Brews & Choos project, which had been Plan A.