As a pythonista who has to work with Excel regularly, I’m excited about this!
I’m excited too! Their implementation seems pretty slick and will open up Python to a whole new audience. I work as a data engineer/analyst, and the downside of doing the more advanced analysis that Python offers is that it’s so hard to share the results and keep them up to date. Plus, now we can do proper regex in Excel!
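To give a flavour of what “proper regex” buys you: here’s a minimal sketch, with made-up phone data, of the kind of pattern extraction that used to take fragile nested formulas in Excel but is one line of Python’s standard `re` module.

```python
import re

# Illustrative data only: pull area codes out of messy phone strings,
# the sort of cleanup that is painful with Excel formulas alone.
phones = ["(415) 555-0134", "415-555-0199", "call 415.555.0110"]

# Three digits, optionally parenthesised, followed by a separator.
pattern = re.compile(r"\(?(\d{3})\)?[ .-]")

area_codes = [m.group(1) for p in phones if (m := pattern.search(p))]
print(area_codes)  # ['415', '415', '415']
```

Inside a worksheet the input would of course come from a sheet range rather than a hard-coded list, but the pattern-matching itself is identical.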
I’ve seen some belly-aching online about the code executing in the cloud instead of locally. I get the drawbacks, but anyone who has had to deal with managing different Python versions knows why they went the cloud route. Plus, Excel can already be resource intensive with heavy lookup usage and such. I can’t imagine the compute some Python calcs will require.
This seems pretty cool!
Also, this makes me think that a rumour that the GNU Project is replacing Lisp with Python would make for a good April Fools’ joke this year.
This is a nice feature.
Funnily, my first thought was it would open Excel to a new audience! Both are true, of course.
Thanks for the interesting link @Bmosbacker.
As a director of a data science degree and an information systems professional, I’ve always seen Excel as a way of siloing data so it is inaccessible to a company. In other words, using data badly. Python doesn’t help with this!
It won’t make us change and start teaching our students Excel on undergraduate or postgraduate degrees (they have no Excel classes at UG level for good reason: it’s not a data science tool but an outdated Business Intelligence tool). It’s a bit like OpenAI and M$ wanting us to use Bing instead of Google: it’s not going to happen. It’s a last-ditch attempt by Microsoft to win professionals back to another tool they’ve lost, where better alternatives exist.
Long live Jupyter Notebooks/Google Colab/DataSpell - that’s the future of data. Only economists, secretaries, and home users need Excel. Large organizations using it is a recipe for disaster, as the data is inaccessible. SharePoint and OneDrive try to solve this, but they are not good data sources like a well-built relational or NoSQL database.
Files for data is an outdated way of working. Something more progressive like Watson Studio is the way forward, even for non-coders.
A big red flag is that the Python processing is done in the cloud with Excel - according to M$'s information. This helps with the difficult task of finding the right version of Python but also destroys privacy. My sensitive research data is never going through anything but my own CPU or, if necessary, a private cloud cluster if it’s big data.
The classic XKCD:
(Best viewed on the XKCD site, of course, so one gets the benefit of the tool tip text.)
I think that’s an esoteric argument which does not fit with the real world.
There is a reason Microsoft Office is used so widely. Even if there are better solutions in specific situations, an information systems professional who is not well-versed in Excel is at a severe disadvantage, if for no other reason than that it is impossible to work in any business or IT environment and not encounter Excel files.
You are welcome. I know nothing about Python, but I knew there would be those on the forum who do and would be interested.
In data science, Anaconda is the tool of choice (though it is a single download, it is really a suite of tools). Data scientists never use Excel, and they don’t need it. They are mathematicians using their knowledge to build algorithms, and they need much more powerful tools.
My post’s point is that Microsoft is trying to capture an uncapturable audience.
I see this as competing with the likes of Airtable: a spreadsheet with database-like features. With Python/Pandas built into Excel, one could query tables of data like one can in Airtable. Actually, Pandas can probably do a lot more than Airtable will likely ever support.
Personally, this is intriguing. I am no data scientist and don’t need their sorts of tools. But I do occasionally want to query data stored in a tabular form. I have often struggled to picture the form of the data in a Pandas DataFrame. Sure, Jupyter Notebook can provide visual representations of the data, but those are rendered from the source. In a spreadsheet the table is the source data, and that fits my mental model better. So with Python/Pandas available to manipulate and query that spreadsheet, it provides the best of both worlds. And don’t get me started on the fact that Jupyter Notebook needs a local server running in the background…
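The Airtable-style querying I have in mind looks something like this: a hedged sketch using plain Pandas on a hypothetical orders table (in Excel the DataFrame would come from a worksheet range rather than being built in code).

```python
import pandas as pd

# Hypothetical table standing in for a sheet range.
orders = pd.DataFrame({
    "region": ["West", "East", "West", "South"],
    "amount": [120, 340, 80, 210],
})

# Filtering and summarising, each in one readable line.
west = orders.query("region == 'West' and amount > 100")
totals = orders.groupby("region")["amount"].sum()
print(west)
print(totals)
```

That `query` string is the part that feels spreadsheet-ish: you filter rows by a readable condition instead of writing array formulas, and the grouped totals go well beyond what Airtable’s built-in rollups offer.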
So, no, this is not aimed at data scientists, but at those of us just short of data scientists who find their tools a little too intimidating/unwieldy. From that perspective, I think it’s right on point. This one feature could be the one thing that gets me to resubscribe to MS Office. Except…
The one concern I have is the Cloud. I don’t use Airtable because my data cannot be in the cloud for confidentiality reasons. Excel has always been a local application with data kept locally (so long as OneDrive is not enabled or your files are outside of OneDrive). If this does all calculations in the cloud, then it is off-limits for me also. My hopes for a local Airtable clone are not realized yet again.
I totally agree on the value of Anaconda, Jupyter notebooks, etc. Those are the “professional” tools.
My point is that I believe fluency in Excel should also be a core competency for data scientists. Surely data scientists will need to interact with the “non-data-professional” public and they will frequently be called upon to both receive and transmit data in Excel format.
It’s as inescapable as Word. You may prefer some other text format but there is no way to function in a business or academic world and not know how to use Word.
The output from a data scientist is usually a visualization or a report, not a spreadsheet. Non-data professionals don’t need access to raw data, just the results and insights. Excel is a tool for administration, not data science.
The only thing we teach them is how to get data out of Excel files, it is never the way to deliver results.
I find your statement confusing. In the first part you claim that working with Excel files silos the data and makes it inaccessible, but at the end you complain that the cloud solution is a red flag because you don’t want the data leaving your computer. Doesn’t that mean your data is in a silo?
Excel is not going anywhere. And yes, Data Scientists do use it, begrudgingly, not as a primary tool, but because the enterprise often requires sharing results in Excel. You can run, but you can’t hide.
The play here is to expand the capabilities of Excel users. Python is not something your average Excel user will manage, nor is it something they’ll care to learn.
Enter Co-Pilots and OpenAI-type conversational code-creation tools, and the next thing you know, novices can accomplish some data engineering and data scientist-level tasks.
It’s in the cloud, so Microsoft will slurp up all that juicy user data.
No, I connect to a data source like a database and store the data there, not on my computer. Then, others can also access the same data and work at the same time and we all have up-to-date data. I process the data on my computer. Doing this with private data on Microsoft’s server isn’t even an option for me as I deal with a lot of sensitive data that we have to keep in-house on secure servers, it’s part of the agreements we have with companies to use their data. In other cases with non-sensitive data I use a private cloud.
Perhaps that is true in some industries/use cases. But there are also fields where that philosophy of collaboration would earn tremendous pushback.
For example most academic researchers not only want to archive raw data long-term but are also required to understand, maintain, and vouch for the accuracy of the data in case questions arise in the future regarding the validity of their publications.
It may well be preferable to use Excel as a means of archival storage in such a situation. If I have data now that I want to be easily accessible to academic scholars 10, 50, or more years into the future, I would choose an Excel or CSV file. These are so ubiquitous and cross-platform that at present I would bet on these as most likely to be easily importable into computers of the future. It is a lot harder to guess what database format will remain universally accessible decades into the future.
I believe the ability to read/write/manipulate Excel or CSV files is as much a core competency for a data scientist as is knowledge of any other database or data visualization software. It’s the ultimate common denominator for raw data; and it’s essential to many use cases for that raw data to be preserved and accessible to all.
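As a concrete illustration of why CSV is such a safe archival bet: a round trip needs nothing beyond the standard library, because the format is just plain text. A minimal sketch with made-up rows:

```python
import csv
import io

# Illustrative data: CSV is plain text, so writing and reading it back
# requires only the standard library -- a fair proxy for "still readable
# decades from now".
rows = [["id", "value"], ["1", "3.14"], ["2", "2.72"]]

buf = io.StringIO()          # in-memory stand-in for a file on disk
csv.writer(buf).writerows(rows)

buf.seek(0)
recovered = list(csv.reader(buf))
print(recovered == rows)  # True
```

Any language, on any platform, can do the equivalent, which is exactly the “ultimate common denominator” property being argued for here.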
Most of the datasets we use are so large that Excel just crashes if you try to open them! So I don’t think it is a good way to store data at all. All students need to know is how to export to CSV when data arrives as Excel files, and I can’t even remember the last time I received a dataset as an Excel file; they are always CSVs.
Accuracy is the main reason not to give members of a research team separate copies of the data. Otherwise, every time we correct an error in the data it has to be re-sent to all the researchers. If there is one place that we all update when we spot data corruption, then the data stays accurate for everyone.
I agree that CSVs are great for long-term storage. Databases are designed to import and export them, and they are the standard way to work. Proprietary formats like Excel are not.
If you need to use it to drive an application or machine learning model, which is what 90% of our research involves, then storing it on a computer as a CSV makes no sense.
For business intelligence or analytics, there is a place for Excel as they typically have smaller sets of data and completely different objectives, but in data science Excel is not powerful enough for even simple tasks like data cleaning.
I have taught at three different universities on data science programs and none have any Excel classes for good reason. The only people who need Excel are business studies and economics students.
Assuming all you say is correct - there are tons of business and economics use cases for Excel. Isn’t it a great thing for Microsoft to add Python for those users?
And aren’t there use cases as well for business and economics where a complex Excel project might benefit from a data scientist’s help?
I think your insights are very specific to data science as a discipline and don’t take into account the many different people who need to interact with and get insights from data. I went through a data science training program in college, and my mentor who ran the program definitely had a similar anti-Excel bias. It wasn’t until I got a job as an analyst that I realized that all business users really want is data in Excel, not a dashboard or some summary of their data, even when they admit that the dashboard gives them exactly what they want to see. Most people are not data scientists, and most companies don’t even employ them, choosing instead to hire more Business Intelligence-type roles.
In my current role as a data engineer, I use mostly SQL and Python but every once in a while, I need to use Excel and would say that’s often the most value-add work I do for the company. I wish I had taken the CS for business class in college that taught Excel in-depth. It would have been so much more valuable than some of the math/stats courses I took. When asked for advice from students in the data science program at my alma mater, I always tell them to take the Excel class.
I appreciate your perspective and respect that you run a program training students. I would not be where I am today if not for one of those programs. My experience has been that the world is a lot bigger and more diverse than those programs lead kids to believe.