Python scripting for feature classes (loop, sort, merge, and replace)
This is what I have to accomplish using a python script for ArcGIS 10.1:
- Loop through all feature classes and first add a field, then write the name of the feature class into the added field.
- Merge all the feature classes from 1. into one feature class called OSRS_ORN_NER.
- Sort this feature class by HWY_NUM and replace OSRS_ORN_NER by the sorted one.
This is the code I have for that so far:
I think the first part of code is right, but I don't know if I did the second right or if I'm even on the right track for the last part. Also, how exactly would I run a python script in ArcGIS? Any help would be great!
    import arcpy, os

    arcpy.env.workspace = r'W:S&Ps&p techsEmilyTownshipsDissolvedFinalDissolved.gdb'

    # Looping through dissolved feature classes, adding 'Name' field and writing
    # feature class name in the added field.
    for fc in arcpy.ListFeatureClasses():
        arcpy.AddField_management(fc, "Name", "TEXT", field_length = 50)
        with arcpy.da.UpdateCursor(fc, "Name") as cursor:
            for row in cursor:
                row = fc
                cursor.updateRow(row)

    # Merging the multiple feature classes into one named OSRS_ORN_NER
    list = []
    for r in row:
        list.append(r)
    arcpy.Merge_management(list, "W:S&Ps&p techsEmilyTownshipsDissolvedFinalDissolved.gdbOSRS_ORN_NER")

    # Sorting by HWY_NUM_PR and replacing OSRS_ORN_NER by the sorted feature class
    Sort_management("OSRS_ORN_NER", "OSRS_ORN_NER_new", [["HWY_NUM_PR", "ASCENDING"]])
You first need to assign the result of ListFeatureClasses() to a variable so you can reuse it later for the merge:
    import arcpy, os

    arcpy.env.workspace = r'W:S&Ps&p techsEmilyTownshipsDissolvedFinalDissolved.gdb'

    # Keep the list of feature classes so it can be reused for the merge.
    fcs = arcpy.ListFeatureClasses()

    for fc in fcs:
        # Add the 'Name' field, then write the feature class name into it.
        arcpy.AddField_management(fc, "Name", "TEXT", field_length = 50)
        with arcpy.da.UpdateCursor(fc, "Name") as cursor:
            for row in cursor:
                row[0] = fc  # row is a list of field values; "Name" is at index 0
                cursor.updateRow(row)

    mergeOutput = r"W:S&Ps&p techsEmilyTownshipsDissolvedFinalDissolved.gdbOSRS_ORN_NER"
    sortOutput = r"W:S&Ps&p techsEmilyTownshipsDissolvedFinalDissolved.gdbOSRS_ORN_NER_new"

    # Merge all feature classes into one, then sort the result by HWY_NUM_PR.
    arcpy.Merge_management(fcs, mergeOutput)
    arcpy.Sort_management(mergeOutput, sortOutput, [["HWY_NUM_PR", "ASCENDING"]])
It looks quite OK, but you should initialize your list before you loop over the feature classes, and correctly indent the line that appends the feature class names to your list:
    for fc in arcpy.ListFeatureClasses():
        list.append(fc)
    …
    arcpy.Merge_management(list, outputname)
    arcpy.Sort_management(outputname, finalname, [["HWY_NUM_PR", "ASCENDING"]])
How can I tell where my python script is hanging?
So I'm debugging my python program and have encountered a bug that makes the program hang, as if in an infinite loop. Now, I had a problem with an infinite loop before, but when it hung up I could kill the program and python spat out a helpful exception that told me where the program terminated when I sent it the kill command. Now, however, when the program hangs up and I ctrl-c it, it does not abort but continues running. Is there any tool I can use to locate the hang up? I'm new to profiling but from what I know a profiler can only provide you with information about a program that has successfully completed. Or can you use a profiler to debug such hang ups?
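One option worth trying is the standard library's faulthandler module, which can dump the traceback of every thread after a timeout while the program is still stuck. The following is a minimal sketch, assuming the hang lasts longer than the chosen timeout; busy_wait and the 0.2 second timeout are hypothetical stand-ins for the real hanging code and a realistic delay:

```python
import faulthandler
import tempfile
import time


def busy_wait(seconds):
    """Hypothetical stand-in for the code that hangs."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(0.05)


# faulthandler writes to a real file descriptor, so a temporary file is used
# here; in a real script the default (sys.stderr) is usually what you want.
with tempfile.TemporaryFile(mode="w+") as log:
    # If the program is still running after 0.2 s, dump every thread's stack.
    faulthandler.dump_traceback_later(0.2, file=log)
    busy_wait(0.6)
    faulthandler.cancel_dump_traceback_later()

    log.seek(0)
    dump = log.read()

print(dump)
```

The dump names the file, line, and function each thread was executing when the timeout fired, which is usually enough to locate the loop or blocking call.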
What do we call web scraping?
Web scraping is an automated process of gathering public data. Web scrapers automatically extract large amounts of public data from target websites in seconds.
This Python web scraping tutorial will work for all operating systems. There will be slight differences when installing either Python or development environments but not in anything else.
- Building a web scraper: Python prepwork
- Getting to the libraries
- WebDrivers and browsers
- Finding a cozy place for our Python web scraper
- Importing and using libraries
- Picking a URL
- Defining objects and building lists
- Extracting data with our Python web scraper
- Exporting the data
- More lists. More!
- Web scraping with Python best practices
What sections of Python should I learn for ArcGIS?
I want to learn Python for ArcGIS. I have bought "Automate the boring stuff with python." It's a fun book, yet it's big! What sections/code should I focus on when learning Python for ArcGIS? For example, should I focus on "while" loops etc.? Thanks!
You should work through the entire book. It doesn't take long at all. It's where I started and doing the whole thing will give you a good foundation before getting into python for GIS.
Is this a book you'd recommend getting? I've worked with R before, know how to do some basic query actions :p and that's it.
Yeah, in the classes I took we literally went through the entire Toolbox list, focusing more on the obscure processing features that really should be run in Python anyway.
Also brushing up on programming theory and issues is great to do. I had no clue about the issues in programming (like debugging all the different error types) until I took a comp sci course.
A lot of the GIS functions (like running geoprocessing tools) you'll need with arcpy are pretty Esri-specific, so when learning non-GIS Python, it's mostly about getting a basic understanding of the structure of a script (setting variables, syntax, importing libraries, etc.) and learning to repeat things, like looping through a folder of files.
Break away from arcpy ASAP. There is a huge world of tools and options out there.
I think "the basics" are chapters 1-10. But I would keep going after that. The entire book is good.
Data types matter, and methods to check and summarize them.
Zip to iterate over multiple items at the same time
List comprehension especially to build custom strings
Then you'll find that the ArcGIS documentation for arcpy is actually useful now
Then google your problem with site:stackexchange.com for examples, explanations and sidenotes around your problem, and learn.
Reuse your code. Save a version before you make significant changes.
Once all that's going ok, you can look at conda environments or virtual environments to introduce very interesting analysis, statistical learning, graphics, data transforming libraries.
Including jupyter notebook, which allows for easy code testing and self documentation, even for beginners.
At any point along the way, when you have the time and interest, read and work through a book as far as you can. I recommend Learning Python from O'Reilly.
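As a small illustration of the zip and list comprehension tips above, here is a minimal sketch; the feature class names and counts are made-up sample data, not anything from arcpy:

```python
# Hypothetical feature class names and record counts, for illustration only.
names = ["roads", "rivers", "parcels"]
counts = [120, 45, 980]

# zip() pairs up items from several sequences so one loop can walk them together.
for name, count in zip(names, counts):
    print(f"{name}: {count} features")

# A list comprehension builds a new list (here, custom strings) in one expression.
labels = [f"{name.upper()}_{count}" for name, count in zip(names, counts)]
print(labels)
```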
Python is an open-source, object-oriented programming language. Its many uses include application development, test automation, support for multiple programming paradigms, a comprehensive standard library, portability across all the major operating systems and platforms, database access, simple and readable code, simplifying complex software development, test-driven development, machine learning and data analytics, and pattern recognition. It is also supported by many tools and frameworks.
10 Important Uses of Python
Python's advantages make it user-friendly. Its main uses are listed below:
Python can be used to develop many kinds of applications: web applications, GUI-based applications, software development tools, scientific and numeric applications, network programs, games and 3D applications, and other business applications. It supports interactive interfaces and makes application development easy.
2. Multiple Programming paradigms
Python is also chosen because it supports several programming paradigms, including object-oriented and structured programming, and it has features that support functional programming concepts. It has a dynamic type system and automatic memory management. These language features and paradigms let you develop small as well as large applications, including complex software.
3. Robust Standard Library
Python has a large and robust standard library, which is one reason developers choose it over other languages. The library offers a wide range of modules that add functionality without your having to write the code yourself. The Python standard library documentation describes each module and is a useful reference when developing web applications, implementing web services, performing string operations, or working with interface protocols.
4. Compatible with Major Platforms and Systems
Python is compatible with the major platforms and operating systems, which is a main reason it is used for application development. Because Python is an interpreted high-level language, the same code can run on multiple platforms wherever a Python interpreter is available. New and modified code can be executed without recompiling, so the effect of a change can be checked immediately. This saves developers time.
5. Access of Database
Python also makes database access easy. It provides interfaces to databases such as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL, and object databases such as Durus and ZODB are available as well. Python defines a standard database API (DB-API), and drivers that implement it are freely available for download.
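To make the standard database API concrete, here is a minimal sketch using sqlite3 from the standard library. SQLite is not one of the databases named above; it is used here only because it needs no server, and the table and rows are hypothetical sample data. Drivers for other databases follow the same connect/cursor/execute pattern, though their placeholder syntax can differ.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The table and its rows are hypothetical sample data.
cur.execute("CREATE TABLE roads (name TEXT, hwy_num INTEGER)")
cur.executemany(
    "INSERT INTO roads (name, hwy_num) VALUES (?, ?)",
    [("Main St", 17), ("Lake Rd", 11), ("Mill Ln", 69)],
)
conn.commit()

# Parameterized query; "?" is the placeholder style used by sqlite3.
cur.execute("SELECT name FROM roads WHERE hwy_num > ? ORDER BY hwy_num", (11,))
rows = cur.fetchall()
print(rows)
conn.close()
```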
6. Code Readability
Python code is easy to read and maintain, and easy to reuse wherever it is required. Its simple syntax lets you express different concepts without writing additional code. Good-quality, readable source code simplifies the maintenance that software applications require. Unlike many other programming languages, Python emphasizes code readability. This helps when building custom applications, and clean code makes it possible to maintain and update software without extra effort.
7. Simplify Complex Software Development
As a general-purpose programming language, Python simplifies complex software development. It is used to develop complex applications, such as scientific and numeric applications, for both desktop and web. Its features for data analysis and visualization help you create custom solutions without extra effort and time, and present data effectively.
8. Many Open Source Frameworks and Tools
Python is open source and freely available, which helps reduce the cost of software development significantly. Many open-source Python frameworks, libraries, and development tools are available at no extra cost. Web frameworks such as Django, Flask, and Pyramid simplify and speed up web application development, and Python GUI frameworks are available for developing GUI-based applications.
9. Adopt Test Driven Development
Python makes both coding and testing easier by supporting the test-driven development (TDD) approach. Test cases can be written before any code is developed. Once development starts, the existing test cases can be run against the code as it is written and report the results. They can also be used to check that the source code meets the requirements.
10. Other applications for which python is used
Python is also used for robotics, web scraping, scripting, artificial intelligence, data analysis, machine learning, face detection, color detection, 3D CAD applications, console-based applications, audio and video applications, enterprise applications, image applications, and more.
In this Uses of Python article, we have seen that Python is one of the major languages used to develop both desktop and web applications. Python has features that take care of common programming tasks, and it is simple to learn and easy to use. Python is sometimes measured as slower than other widely used programming languages such as Java. Python applications can be sped up by keeping the code well maintained and by using a custom runtime.
Python supports modules and packages, which encourages program modularity and code reuse. It increases productivity, which makes it the first choice of many developers. It has a gentle learning curve and supports both functional and procedural programming. It is open source and can be freely distributed. In the end, a programming language is mainly selected based on the requirements and on its compatibility with the target platforms and databases.
5. Multiple replace operations: replace multiple patterns with the same string
Replace any of foo, bar or baz with foobar
A good Linux replacement tool is rpl, which was originally written for the Debian project. It can be installed with apt-get install rpl on any Debian-derived distro and may be packaged for others; otherwise you can download the tar.gz file from SourceForge.
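The command that originally accompanied this heading is not shown here. As a stand-in in Python (not the Linux tool this section goes on to discuss), the same multi-pattern replacement can be sketched with a single regular expression alternation; replace_all is a hypothetical helper name:

```python
import re

# One alternation pattern matches any of the three words.
PATTERN = re.compile(r"foo|bar|baz")


def replace_all(text):
    """Replace every occurrence of foo, bar or baz with foobar."""
    return PATTERN.sub("foobar", text)


print(replace_all("foo, bar and baz walk into a qux"))
```

Because re.sub makes a single left-to-right pass, the foo and bar inside each inserted foobar are not replaced again.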
Note that if the string contains spaces it should be enclosed in quotation marks. By default rpl is case-sensitive and matches partial words, but you can change these defaults with the options -i (ignore case) and -w (whole words). You can also specify multiple files:
You can also restrict the search to given extensions (-x) or search recursively (-R) in a directory:
You can also search/replace in interactive mode with -p (prompt) option:
The output shows the number of files and strings replaced and the type of search (case-sensitive or insensitive, whole or partial words). It can be silenced with the -q (quiet mode) option, or made more verbose with the -v (verbose mode) option, which lists the files and directories processed and the line numbers that contain matches.
Other options worth remembering are -e (honor escapes), which lets you use escape sequences so you can also search for tabs (\t), new lines (\n), etc. You can use -f to force permissions (of course, only when the user has write permissions) and -d to preserve the modification times.
Finally, if you are unsure what exactly will happen, use the -s (simulate mode).
A Full Program: Asynchronous Requests
You've made it this far, and now it's time for the fun and painless part. In this section, you'll build a web-scraping URL collector, areq.py, using aiohttp, a blazingly fast async HTTP client/server framework. (We just need the client part.) Such a tool could be used to map connections between a cluster of sites, with the links forming a directed graph.
Note: You may be wondering why Python's requests package isn't compatible with async IO. requests is built on top of urllib3, which in turn uses Python's http and socket modules.
By default, socket operations are blocking. This means that Python won't like await requests.get(url) because .get() is not awaitable. In contrast, almost everything in aiohttp is an awaitable coroutine, such as session.request() and response.text(). It's a great package otherwise, but you're doing yourself a disservice by using requests in asynchronous code.
The high-level program structure will look like this:
- Read a sequence of URLs from a local file, urls.txt.
- Send GET requests for the URLs and decode the resulting content. If this fails, stop there for a URL.
- Search for the URLs within href tags in the HTML of the responses.
- Write the results to foundurls.txt.
- Do all of the above as asynchronously and concurrently as possible. (Use aiohttp for the requests, and aiofiles for the file-appends. These are two primary examples of IO that are well-suited for the async IO model.)
Here are the contents of urls.txt. It's not huge, and contains mostly highly trafficked sites:
The second URL in the list should return a 404 response, which you'll need to handle gracefully. If you're running an expanded version of this program, you'll probably need to deal with much hairier problems than this, such as server disconnections and endless redirects.
The requests themselves should be made using a single session, to take advantage of reuse of the session's internal connection pool.
Let's take a look at the full program. We'll walk through things step-by-step after:
This script is longer than our initial toy programs, so let's break it down.
The constant HREF_RE is a regular expression to extract what we're ultimately searching for, href tags within HTML:
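A pattern along the following lines would do the job described. This is a minimal sketch: the exact regular expression and the HTML snippet are assumptions for illustration, and a non-greedy capture between double quotes is only one of several reasonable choices:

```python
import re

# Capture whatever sits between the quotes of an href="..." attribute.
HREF_RE = re.compile(r'href="(.*?)"')

# Hypothetical HTML snippet for illustration.
html = '<a href="https://example.com/a">A</a> <a href="/b">B</a>'
links = HREF_RE.findall(html)
print(links)
```

Relative links such as /b still need to be joined with the page URL before they can be requested, which is the "absolute path" step mentioned later.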
The coroutine fetch_html() is a wrapper around a GET request to make the request and decode the resulting page HTML. It makes the request, awaits the response, and raises right away in the case of a non-200 status:
If the status is okay, fetch_html() returns the page HTML (a str). Notably, there is no exception handling done in this function. The logic is to propagate that exception to the caller and let it be handled there:
We await session.request() and resp.text() because they're awaitable coroutines. The request/response cycle would otherwise be the long-tailed, time-hogging portion of the application, but with async IO, fetch_html() lets the event loop work on other readily available jobs such as parsing and writing URLs that have already been fetched.
Next in the chain of coroutines comes parse(), which waits on fetch_html() for a given URL, and then extracts all of the href tags from that page's HTML, making sure that each is valid and formatting it as an absolute path.
Admittedly, the second portion of parse() is blocking, but it consists of a quick regex match and ensuring that the links discovered are made into absolute paths.
In this specific case, this synchronous code should be quick and inconspicuous. But just remember that any line within a given coroutine will block other coroutines unless that line uses yield, await, or return. If the parsing were a more intensive process, you might want to consider running this portion in a separate thread or process with loop.run_in_executor().
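As a rough sketch of that last suggestion, the blocking parsing step can be handed to an executor so the event loop stays free. The function names, the pattern, and the HTML string below are assumptions for illustration, not the tutorial's own code:

```python
import asyncio
import re

HREF_RE = re.compile(r'href="(.*?)"')  # assumed pattern, for illustration


def parse_links(html):
    """Ordinary blocking function: pretend this is expensive parsing."""
    return set(HREF_RE.findall(html))


async def parse_in_executor(html):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor; pass a
    # concurrent.futures.ProcessPoolExecutor instead for CPU-bound work.
    return await loop.run_in_executor(None, parse_links, html)


found = asyncio.run(parse_in_executor('<a href="/x">x</a><a href="/y">y</a>'))
print(sorted(found))
```

While parse_links runs in the executor, other coroutines scheduled on the same loop can keep making progress.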
Next, the coroutine write() takes a file object and a single URL, and waits on parse() to return a set of the parsed URLs, writing each to the file asynchronously along with its source URL through use of aiofiles, a package for async file IO.
Lastly, bulk_crawl_and_write() serves as the main entry point into the script's chain of coroutines. It uses a single session, and a task is created for each URL that is ultimately read from urls.txt.
Here are a few additional points that deserve mention:
The default ClientSession has a connector with a maximum of 100 open connections. To change that, pass an instance of aiohttp.TCPConnector to ClientSession. You can also specify limits on a per-host basis.
You can specify max timeouts for both the session as a whole and for individual requests.
This script also uses async with, which works with an asynchronous context manager. I haven't devoted a whole section to this concept because the transition from synchronous to asynchronous context managers is fairly straightforward. The latter has to define .__aenter__() and .__aexit__() rather than .__enter__() and .__exit__(). As you might expect, async with can only be used inside a coroutine function declared with async def.
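To make the two special methods concrete, here is a minimal sketch of an asynchronous context manager. FakeSession is a hypothetical stand-in invented for this example and is not aiohttp's ClientSession:

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; not aiohttp's class."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        await asyncio.sleep(0)  # e.g. open connections
        self.events.append("enter")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.sleep(0)  # e.g. close connections
        self.events.append("exit")
        return False  # do not swallow exceptions


async def main():
    session = FakeSession()
    async with session as s:
        s.events.append("body")
    return session.events


events = asyncio.run(main())
print(events)
```

Both methods are coroutines, so setup and teardown can themselves await without blocking the event loop.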
If you'd like to explore a bit more, the companion files for this tutorial up at GitHub have comments and docstrings attached as well.
Here's the execution in all of its glory, as areq.py gets, parses, and saves results for 9 URLs in under a second:
That's not too shabby! As a sanity check, you can check the line-count on the output. In my case, it's 626, though keep in mind this may fluctuate:
Next Steps: If you'd like to up the ante, make this webcrawler recursive. You can use aio-redis to keep track of which URLs have been crawled within the tree to avoid requesting them twice, and connect links with Python's networkx library.
Remember to be nice. Sending 1000 concurrent requests to a small, unsuspecting website is bad, bad, bad. There are ways to limit how many concurrent requests you're making in one batch, such as using the semaphore objects of asyncio or using a pattern like this one. If you don't heed this warning, you may get a massive batch of TimeoutError exceptions and only end up hurting your own program.
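One way to apply the semaphore idea is sketched below. It is a minimal example under stated assumptions: fetch only sleeps instead of making a real request, and the limit of 3 and the example.com URLs are arbitrary placeholders:

```python
import asyncio


async def fetch(url, sem, stats):
    """Hypothetical stand-in for a real request; it only sleeps."""
    async with sem:  # at most `limit` coroutines get past this point at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
    return url


async def crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, sem, stats) for u in urls))
    return results, stats["peak"]


urls = [f"https://example.com/{i}" for i in range(10)]
results, peak = asyncio.run(crawl(urls))
print(len(results), "requests done, peak concurrency:", peak)
```

All ten coroutines are created up front, but the semaphore ensures that no more than three are inside the "request" at the same time.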
Thanks for sharing your code!
I won't cover all your questions but I will try my best.
(warning, long post incoming)
Is my implementation correct? (The tests say so)
As far as I tried to break it I'd say yes it's correct. But see below for more thorough testing methods.
Can it be sped up?
First thing I did was to slightly change your test file (I called it test_heap.py) to seed the random list generation. I also changed the random.sample call to be more flexible with the sample_size parameter.
So the population from random.sample is always greater than my sample_size. Maybe there is a better way?
I also set the sample size to be 50000 to have a decent size for the next step.
Next step was profiling the code with python -m cProfile -s cumtime test_heap.py. If you are not familiar with the profiler, see the doc. I ran the command a few times to get a feel for the variation in timing, which gives me a baseline for optimization. The original value was:
Now we have a target to beat and some information on what takes time. I did not paste the entire list of function calls; it's pretty long, but you get the idea.
A lot of time is spent in _siftdown and a lot less in _siftup, and a few functions are called many times, so let's see if we can fix that.
(I should have started with _siftdown, which was the big fish here, but for some reason I started with _siftup, forgive me)
Speeding up _siftup
I changed the way parent_index is calculated to match what the heapq module source does (see here), but I couldn't see a difference in timing from this change alone.
Then I removed the call to _get_parent and made the appropriate change (kind of inlining it, because function calls are not cheap in Python) and the new time is:
Function calls went down, obviously, but time only dropped by around 70-80 milliseconds. Not a great victory (a bit less than a 3% speedup). And readability was not improved, so it's up to you whether it is worth it.
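For readers who want to see the shape of that change, here is a rough sketch of an inlined sift-up for a min-heap stored in a plain list. It is not the code under review; sift_up and push are hypothetical names, and the parent index uses the (pos - 1) >> 1 form found in the heapq source:

```python
def sift_up(heap, index):
    """Move heap[index] toward the root until its parent is not larger."""
    item = heap[index]
    while index > 0:
        parent_index = (index - 1) >> 1  # inlined instead of calling a helper
        parent = heap[parent_index]
        if parent <= item:
            break
        heap[index] = parent  # pull the parent down one level
        index = parent_index
    heap[index] = item


def push(heap, value):
    heap.append(value)
    sift_up(heap, len(heap) - 1)


heap = []
for value in [5, 3, 8, 1, 9, 2]:
    push(heap, value)
print(heap)
```

The loop avoids one method call per level and writes the moved item only once at the end, which is where the modest saving comes from.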
Speeding up _siftdown
The first change was to improve readability.
I transformed the ternary assignment
I find it a lot more readable, but it's probably a matter of taste. And to my surprise, when I profiled the code again, the result was:
(I ran it 10 times and I always gained around 80-100 milliseconds.) I don't really understand why. Could anybody explain it to me?
Like in _siftup, I inlined the 2 calls to the helper functions _get_left_child and _get_right_child, and that paid off!
That's a 30% speedup from the baseline.
(What follows is a further optimization that I try to explain, but I lost the code I wrote for it. I'll try to write it down again later. It might give you an idea of the gain.)
Then using the heapq trick of specializing the comparison for max and min (using a _siftdown_max and _siftup_max version, replacing comparer by >, and doing the same for min) brings us to:
I did not get further with optimizations, but _siftdown is still a big fish, so maybe there is room for more? And maybe pop and push could be reworked a bit, but I don't know how.
Comparing my code to the one in the heapq module, it seems that they do not provide a heapq class, but just provide a set of operations that work on lists? Is this better?
Many implementations I saw iterate over the elements using a while loop in the siftdown method to see if it reaches the end. I instead call siftdown again on the chosen child. Is this approach better or worse?
Seeing as function calls are expensive, looping instead of recursing might be faster. But I find it better expressed as a recursion.
Is my code clean and readable?
For the most part yes! Nice code: you've got docstrings for your public methods, you respect PEP8, it's all good. Maybe you could add documentation for the private methods as well? Especially for hard stuff like _siftdown and _siftup.
The ternary I changed in _siftdown I personally consider really hard to read.
comparer seems like a French name, why not compare? Either I missed something or you mixed languages, and you shouldn't.
Do my test suffice (for say an interview)?
I'd say no. Use a module to do unit testing. I personally like pytest.
You prefix the name of your testing file with test_, and your test functions are prefixed with test_ as well. Then you just run pytest on the command line and it discovers the tests automatically, runs them and gives you a report. I highly recommend you try it.
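A minimal sketch of what such a test file could look like is below. The standard library's heapq is used as a stand-in for the heap class under review, and the test names and values are made up for illustration; with pytest installed, running pytest in the same directory would pick both functions up:

```python
# test_heap.py -- run with: pytest
import heapq  # stand-in for the heap class under review
import random


def test_pop_returns_smallest_first():
    heap = []
    for value in [5, 3, 8, 1]:
        heapq.heappush(heap, value)
    assert heapq.heappop(heap) == 1


def test_popping_everything_yields_sorted_order():
    rng = random.Random(42)  # seeded so failures are reproducible
    values = [rng.randint(0, 1000) for _ in range(200)]
    heap = []
    for value in values:
        heapq.heappush(heap, value)
    popped = [heapq.heappop(heap) for _ in range(len(values))]
    assert popped == sorted(values)
```

Plain assert statements are enough; pytest rewrites them to show the compared values when a test fails.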
Another great tool you could have used is hypothesis which does property-based testing. It works well with pytest.
It pretty much gives you the same kind of testing you did in your automatic_test, but adds a bunch of cool features and is shorter to write.
Raymond Hettinger did a really cool talk about tools to use when testing on a short time budget. He mentions both pytest and hypothesis, go check it out :)
Is the usage of subclasses MinHeap and MaxHeap & their comparer method that distincts them, a good approach to provide both type of heaps?
I believe it is! But speed-wise, you should instead redeclare siftdown and siftup in the subclasses and replace each instance of compare(a,b) by a < b or a > b in the code.
One last remark: the Wikipedia article says:
sift-up: move a node up in the tree, as long as needed used to restore heap condition after insertion. Called "sift" because node moves up the tree until it reaches the correct level, as in a sieve.
sift-down: move a node down in the tree, similar to sift-up used to restore heap condition after deletion or replacement.
And I think you used the names in this sense, but the heapq module implementation seems to have them backward.
They use siftup in pop and siftdown in push, while Wikipedia tells us to do the inverse. Can somebody explain, please?
Double Metaphone Algorithm
The principle of the algorithm goes back to the last century, actually to the year 1918 (when the first computer was years away).
As a side note (should you ever participate in a millionaire quiz show), the first computer was 23 years away:
The Z3 was a German electromechanical computer designed by Konrad Zuse. It was the world’s first working programmable, fully automatic digital computer. The Z3 was built with 2,600 relays, implementing a 22-bit word length that operated at a clock frequency of about 4–5 Hz. Program code was stored on punched film. Initial values were entered manually (Wikipedia)
So back to 1918: in that year Robert C. Russell of the US Census Bureau invented the Soundex algorithm, which indexes English names in such a way that multiple spellings of the same name can be found with only a cursory glance.
Immigrants to the United States had a native language that was not based on Roman characters. To write their names, the names of their relatives, or the cities they arrived from, the immigrants had to make their best guess of how to express their symbolic language in English. The United States government realized the need to be able to categorize the names of private citizens in a manner that allowed for multiple spellings of the same name (e.g. Smith and Smythe) to be grouped. (read the full story here)
The Soundex algorithm is based on a phonetic categorization of the letters of the alphabet. In his patent, Russell describes a strategy of assigning a numeric value to each category. For example, Johnson was mapped to J525, Miller to M460, etc.
The Soundex algorithm was refined over time for efficiency and accuracy and was eventually replaced by other algorithms.
For the most part, they have all been replaced by the powerful indexing system called Double Metaphone. The algorithm is available as open source and its last version was released around 2009.
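To show how such a category mapping produces codes like J525 and M460, here is a minimal sketch of the commonly described American Soundex rules. It is an illustrative reimplementation under that assumption, not Russell's patented procedure verbatim and not the library used later in this article:

```python
def soundex(name: str) -> str:
    """Return a four-character Soundex code (letter plus three digits)."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")        # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing repeats
    return (result + "000")[:4]         # pad or truncate to four characters


for name in ("Johnson", "Miller", "Smith", "Smythe"):
    print(name, soundex(name))
```

Smith and Smythe end up with the same code, which is exactly the grouping of alternative spellings described above.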
Luckily there is a Python library available, which we use in our program. We write some small wrapper methods around the algorithm and implement a compare method.
The doublemetaphone method returns a tuple of two keys, each a phonetic translation of the word passed in. Our compare method shows the ranking capability of the algorithm, which is quite limited.
Let's run some verification checks to assess the effectiveness of the algorithm by introducing test_class.py, which is based on the Python pytest framework.
The pytest framework makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries. (Link)
Its usage is straightforward, and you can see the test class implementation below.
The test results are shown below. We used two names (A and B) and checked the effectiveness of the algorithm against modified versions of them (A1/A2 and B1/B2/B3).
- A1 and B1 passed the Strong match check. So missing spaces and replacing ü/ä with u/a do not seem to affect the double metaphone key generation.
- B2 passes the Normal match. Spelling mistakes are covered by the algorithm as well.
- A2 and B3 fail. A2 uses an abbreviation of a name part, which the algorithm cannot cope with. We expected this behavior and decided to introduce the name expander algorithm (see above). B3 failed due to a missing "-". This was unexpected, but we cover this behavior with a second name cleanser step.
List Comprehensions (optional)
List comprehensions are a more advanced feature which is nice for some cases but is not needed for the exercises and is not something you need to learn at first (i.e. you can skip this section). A list comprehension is a compact way to write an expression that expands to a whole list. Suppose we have a list nums [1, 2, 3, 4], here is the list comprehension to compute a list of their squares [1, 4, 9, 16]:
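A minimal version of that comprehension, written out from the description above:

```python
nums = [1, 2, 3, 4]

# Evaluate n * n once for each element of nums and collect the results.
squares = [n * n for n in nums]
print(squares)  # [1, 4, 9, 16]
```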
The syntax is [ expr for var in list ] -- the for var in list looks like a regular for-loop, but without the colon (:). The expr to its left is evaluated once for each element to give the values for the new list. Here is an example with strings, where each string is changed to upper case with '. ' appended:
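A short sketch of the string case follows. The word list and the suffix value are arbitrary choices made for illustration:

```python
strs = ["hello", "and", "goodbye"]
suffix = "!!!"  # arbitrary suffix chosen for this sketch

# s.upper() + suffix is evaluated once per string in strs.
shouting = [s.upper() + suffix for s in strs]
print(shouting)
```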
You can add an if test to the right of the for-loop to narrow the result. The if test is evaluated for each element, including only the elements where the test is true.
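Two small sketches of the if test, using made-up lists: the first keeps only some of the numbers, the second filters and transforms strings in one step.

```python
nums = [2, 8, 1, 6]
# Keep only the elements for which the test n <= 2 is true.
small = [n for n in nums if n <= 2]
print(small)

fruits = ["apple", "cherry", "banana", "lemon"]
# Keep the fruits that contain an "a", changed to upper case.
afruits = [s.upper() for s in fruits if "a" in s]
print(afruits)
```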